## LDA gensim and sklearn implementations

Here is a quick comparison between the two implementations, to check that it is fine to use either one (a matter of preference).

For a more in-depth comparison, have a look at this [issue](https://github.com/RaRe-Technologies/gensim/issues/457) and the code cited there.

Let's start by reading the docs, selecting a random sample of 2500 documents, and splitting the dataset into train/test sets.

In [2]:
from __future__ import print_function
import random
from nlp_utils import SimpleTokenizer, read_docs

# logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.DEBUG)
# logging.root.level = logging.DEBUG
TEXT_DATA_DIR = '/home/ubuntu/working/text_classification/20_newsgroup/'
NB_TOPICS = 10

docs, doc_classes  = read_docs(TEXT_DATA_DIR)
# picking documents at random
random.seed(1981)
rand_docs = random.sample(docs,2500)

# Apply a simple tokenizer based on gensim's simple_preprocess
rand_docs = [SimpleTokenizer(doc) for doc in rand_docs]

#train,test
train_docs, test_docs =  rand_docs[:2000], rand_docs[2000:]





Let's prepare the data before building the models

In [3]:
import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.matutils import Sparse2Corpus

# PREPARE DATA
# gensim corpus can be prepared using just gensim...
# id2word = gensim.corpora.Dictionary(train_docs)
# id2word.filter_extremes(no_below=20, no_above=0.5)
# gensim_tr_corpus = [id2word.doc2bow(doc) for doc in train_docs]
# gensim_te_corpus = [id2word.doc2bow(doc) for doc in test_docs]

# or using sklearn vectorizer and the very convenient Sparse2Corpus
vectorizer = CountVectorizer(min_df=20, max_df=0.5,
    preprocessor = lambda x: x, tokenizer=lambda x: x)
sklearn_tr_corpus = vectorizer.fit_transform(train_docs)
sklearn_te_corpus = vectorizer.transform(test_docs)

# invert the vectorizer's word -> index mapping to get gensim's index -> word
id2word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
gensim_tr_corpus = Sparse2Corpus(sklearn_tr_corpus, documents_columns=False)
gensim_te_corpus = Sparse2Corpus(sklearn_te_corpus, documents_columns=False)
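
To make the effect of `min_df` and `max_df` concrete, here is a minimal pure-Python sketch of document-frequency filtering and of the `(token_id, count)` bag-of-words format that gensim consumes. The toy documents, the thresholds, and the helper names `build_vocabulary` and `to_bow` are illustrative assumptions; they are not part of `nlp_utils`, gensim, or sklearn.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Map each kept token to an integer id.

    A token is kept if it occurs in at least `min_df` documents (absolute count)
    and in at most a `max_df` fraction of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    kept = sorted(token for token, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocabulary):
    """Turn a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(vocabulary[t] for t in doc if t in vocabulary)
    return sorted(counts.items())

# hypothetical toy corpus: already tokenized documents
toy_docs = [
    ["space", "nasa", "launch", "the"],
    ["space", "orbit", "the", "the"],
    ["game", "team", "the"],
    ["game", "space", "the"],
]

vocabulary = build_vocabulary(toy_docs, min_df=2, max_df=0.75)
id2word_toy = {idx: token for token, idx in vocabulary.items()}
bow_corpus = [to_bow(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
print(bow_corpus)
```

In this toy corpus "the" is dropped because it appears in every document, and the words that appear in a single document are dropped for being too rare, so only "game" and "space" survive.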

Let's now define the model parameters that will be passed to both the gensim and the sklearn LDA implementations. Information on these parameters can be found [here](https://radimrehurek.com/gensim/models/ldamulticore.html) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

In [4]:
# MODEL PARAMETERS
decay = 0.5
offset = 1.
max_iterations = 10
batch_size = 200
max_e_steps = 100
eval_every = 1
mode = "online"
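
In online variational Bayes for LDA, `decay` and `offset` play the roles of $\kappa$ and $\tau_0$ in the step size $\rho_t = (\tau_0 + t)^{-\kappa}$ that weights each mini-batch update. The sketch below evaluates that schedule for the values chosen above. The function name `learning_rate` is an illustrative assumption, and each library may count the update index $t$ slightly differently.

```python
def learning_rate(t, decay=0.5, offset=1.0):
    """Step size rho_t = (offset + t) ** (-decay) for the t-th mini-batch update."""
    return (offset + t) ** (-decay)

# step sizes for the first few updates: they shrink as training progresses
rates = [learning_rate(t) for t in range(5)]
for t, rho in enumerate(rates):
    print("update {}: rho = {:.3f}".format(t, rho))
```

A larger `offset` dampens the first updates, while a larger `decay` makes the model forget old mini-batches more slowly.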

And build and run the models!

In [5]:
import time
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import LdaMulticore

#SKLEARN
lda_sklearn = LatentDirichletAllocation(
    n_components=NB_TOPICS,
    batch_size=batch_size,
    learning_decay=decay,
    learning_offset=offset,
    n_jobs=-1,
    random_state=0,
    max_iter=max_iterations,
    learning_method=mode,
    max_doc_update_iter=max_e_steps,
    evaluate_every=eval_every)

start = time.time()
lda_sklearn.fit(sklearn_tr_corpus)
sk_time = time.time() - start

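gamma = lda_sklearn.transform(sklearn_te_corpus)
# perplexity() computes the document-topic distribution itself;
# passing it as a second argument is deprecated
sklearn_perplexity = lda_sklearn.perplexity(sklearn_te_corpus)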

# GENSIM
start = time.time()
lda_gensim_mc = LdaMulticore(
    gensim_tr_corpus,
    id2word=id2word,
    decay=decay,
    offset=offset,
    num_topics=NB_TOPICS,
    passes=max_iterations,
    batch=False, #for online training
    chunksize=batch_size,
    iterations=max_e_steps,
    eval_every=eval_every)
gn_time = time.time() - start

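# log_perplexity returns a per-word likelihood bound; exponentiate its
# negative to get a value comparable with sklearn's perplexity
log_prep_gensim_mc   = lda_gensim_mc.log_perplexity(gensim_te_corpus)
preplexity_gensim_mc = np.exp(-1.*log_prep_gensim_mc)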
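
Perplexity is the exponential of the negative per-word log-likelihood on held-out data, which is why the gensim value above is exponentiated. As a minimal illustration of that relation (not of either library's variational bound), the sketch below computes the perplexity of a toy unigram model. The function `perplexity` and the toy data are assumptions made for this example.

```python
import math

def perplexity(tokens, word_probs):
    """exp of the negative average log-probability per held-out token."""
    log_likelihood = sum(math.log(word_probs[t]) for t in tokens)
    return math.exp(-log_likelihood / len(tokens))

# two hypothetical unigram models over a four-word vocabulary
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

held_out = ["a", "a", "b", "a", "c", "a"]

ppl_uniform = perplexity(held_out, uniform)
ppl_skewed = perplexity(held_out, skewed)
print(ppl_uniform, ppl_skewed)
```

A uniform model over four words has a perplexity of 4, and a model that fits the held-out tokens better scores lower. This is why lower perplexity is read as a better fit.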



Let's have a look at the results

In [6]:
print("gensim run time and perplexity: {}, {}".format(gn_time, preplexity_gensim_mc))
print("sklearn run time and perplexity: {}, {}".format(sk_time,sklearn_perplexity))

gensim run time and perplexity: 62.3879930973, 1878.04572157
sklearn run time and perplexity: 17.2567749023, 1737.68489529


These results were obtained on an AWS p2 instance.

I have run the script a few times with different seeds. The perplexity values are normally quite similar, but `sklearn` is $\sim$3 times faster.
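
The figures printed for this single run can be turned into ratios directly. The short calculation below does so; the variable names are illustrative and the numbers are copied from the output above.

```python
# run time (seconds) and perplexity reported for this run
gensim_time, gensim_perplexity = 62.3879930973, 1878.04572157
sklearn_time, sklearn_perplexity_run = 17.2567749023, 1737.68489529

# how many times faster sklearn was, and the relative perplexity gap
speedup = gensim_time / sklearn_time
rel_gap = (gensim_perplexity - sklearn_perplexity_run) / sklearn_perplexity_run

print("sklearn speed-up: {:.1f}x".format(speedup))
print("gensim perplexity is {:.1%} higher".format(rel_gap))
```

For this run, sklearn is roughly 3.6 times faster and the gensim perplexity is roughly 8% higher.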

In [7]:
import pandas as pd

topic_words = dict()
gensim_topics = lda_gensim_mc.show_topics(formatted=False)
def sklearn_show_topics(model, feature_names, n_top_words):
    sk_topics = []
    for topic_idx, topic in enumerate(model.components_):
        tot_score = np.sum(topic)
        top_words = [(feature_names[i],topic[i]/tot_score)
            for i in topic.argsort()[:-n_top_words - 1:-1]]
        sk_topics.append([topic_idx,top_words])
    return sk_topics
feature_names = vectorizer.get_feature_names()
sklearn_topics = sklearn_show_topics(lda_sklearn, feature_names,10)
topic_words['gensim']  = gensim_topics
topic_words['sklearn'] = sklearn_topics

# or in data frame format
topic_words_df = dict()
for model, result in topic_words.items():
    df = pd.DataFrame()
    # one column per topic, holding that topic's top words
    for i, topic in enumerate(result):
        df["topic_"+str(i)] = [word[0] for word in topic[1]]
    topic_words_df[model] = df

print('Sklearn \n')
print(topic_words_df['sklearn'])
print('\n')
print('Gensim \n')
print(topic_words_df['gensim'])

Sklearn 

      topic_0    topic_1  topic_2      topic_3    topic_4  topic_5  topic_6  \
0  government       said     like   discussion    science      new      use   
1      people       time  article      general  objective      gov     need   
2     turkish       know      mac       number       uiuc   ground    space   
3        jews      years     know    reference      moral     nasa      com   
4      israel  insurance   people         copy      point     book  windows   
5   president       went      com      article       mean    price      bit   
6      health      think      key         news      frank  subject  problem   
7    armenian        way    going  information    article      man     mail   
8         war       came      lot    situation     theory  appears     work   
9         fbi        old     sure        islam   morality      old     file   

   topic_7  topic_8  topic_9  
0   people  article     year  
1      god      com     game  
2    think     good      co