# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach: learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we can tell they're there from reading the documents.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Who in industry might want to know that?
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 0: Warm-Up
* Part 1: Describe how an LDA Model works
* Part 2: Estimate an LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

# Part 0: Warm-Up
How do we do a grid search? 

In [31]:
import warnings
warnings.filterwarnings("ignore")



In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Load training data
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'))

# Load testing data
newsgroups_test = fetch_20newsgroups(subset='test', 
                                     remove=('headers', 'footers', 'quotes'))

print(f'Training Samples: {len(newsgroups_train.data)}')
print(f'Testing Samples: {len(newsgroups_test.data)}')

Training Samples: 11314
Testing Samples: 7532


In [4]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
newsgroups_train['data'][1000]

"Anybody seen mouse cursor distortion running the Diamond 1024x768x256 driver?\nSorry, don't know the version of the driver (no indication in the menus) but it's a recently\ndelivered Gateway system.  Am going to try the latest drivers from Diamond BBS but wondered\nif anyone else had seen this.\n\npost or email"

### GridSearch on Just Classifier
* Fit the vectorizer and transform the text BEFORE it goes into the grid search

In [7]:
# Instantiate vectorizer
vect = TfidfVectorizer()

# Transform the training data
X_train = vect.fit_transform(newsgroups_train['data'])
print(X_train.shape)

(11314, 101631)


In [8]:
params_1 = {
    'min_samples_leaf': [1, 2, 5, 10]
}

# Instantiate classifier
clf = RandomForestClassifier()

# GridSearch
gs1 = GridSearchCV(clf, params_1, cv=5, n_jobs=-1, verbose=1)
gs1.fit(X_train, newsgroups_train['target'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:  1.3min remaining:    8.6s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  1.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [9]:
gs1.best_score_

0.6574148851336595

In [10]:
gs1.best_params_

{'min_samples_leaf': 2}

In [11]:
test_sample = vect.transform(["The new york yankees are the best team in the region."])
test_sample.shape

(1, 101631)

In [12]:
gs1.predict(test_sample)[0]

9

In [13]:
newsgroups_train['target_names'][9]

'rec.sport.baseball'

### GridSearch with BOTH the Vectorizer & Classifier

In [16]:
from sklearn.pipeline import Pipeline

# 1. Create a pipeline with a vectorizer and a classifier
# 2. Use Grid Search to optimize the entire pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])

params_2 = {
    'vect__stop_words': (None, 'english'),
    'vect__min_df': (2, 5),
    'clf__max_depth': (10, None)
}

gs2 = GridSearchCV(pipe, params_2, cv=5, n_jobs=-1, verbose=1)
gs2.fit(newsgroups_train['data'], newsgroups_train['target'])

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        no

In [17]:
gs2.best_score_

0.6607746264533867

In [18]:
gs2.best_params_

{'clf__max_depth': None, 'vect__min_df': 2, 'vect__stop_words': 'english'}

In [19]:
pred = gs2.predict(["The new york yankees are the best team in the region."])
pred

array([9])

In [20]:
newsgroups_train['target_names'][pred[0]]

'rec.sport.baseball'

Advantages of using grid search with the pipeline:
* Allows us to make predictions on raw text, increasing reproducibility. :)
* Allows us to tune the parameters of the vectorizer alongside the classifier. :D
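
The fit counts that `GridSearchCV` reports follow directly from the size of the grid: every combination of one value per parameter is a candidate, and each candidate is fit once per fold. A minimal sketch of that arithmetic, using only the standard library and the same `params_2` grid (it enumerates candidates and does not train anything):

```python
from itertools import product

# Same grid as params_2; the prefix before the double underscore
# is the name of the pipeline step the parameter belongs to.
params_2 = {
    'vect__stop_words': (None, 'english'),
    'vect__min_df': (2, 5),
    'clf__max_depth': (10, None),
}
cv = 5

# Every combination of one value per parameter is one candidate
names = list(params_2)
candidates = [dict(zip(names, values)) for values in product(*params_2.values())]

n_fits = len(candidates) * cv
print(f'{len(candidates)} candidates x {cv} folds = {n_fits} fits')
```

Adding one more two-valued parameter would double the number of fits, which is worth keeping in mind before widening a grid.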

# Part 1: Describe how an LDA Model works

[Your Guide to Latent Dirichlet Allocation](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d)

[LDA Topic Modeling](https://lettier.com/projects/lda-topic-modeling/)

[Topic Modeling with Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
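
As a rough companion to the readings above, the generative story that LDA assumes can be sketched in a few lines: each topic is a distribution over words, each document has its own mixture of topics, and every word is produced by first picking a topic and then picking a word from it. The vocabulary, the number of topics, and the prior value `0.5` below are made up for illustration, and `sample_dirichlet` and `generate_document` are hypothetical helpers. Fitting an LDA model runs this story in reverse, inferring the distributions from the observed words.

```python
import random

random.seed(0)

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical toy setup: 2 topics over a 6-word vocabulary
vocab = ['ball', 'team', 'game', 'chip', 'disk', 'driver']
n_topics = 2

# Each topic is a probability distribution over the vocabulary
topic_word = [sample_dirichlet([0.5] * len(vocab)) for _ in range(n_topics)]

def generate_document(n_words, alpha=0.5):
    # 1. Draw the document's mixture over topics
    doc_topics = sample_dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position
        z = random.choices(range(n_topics), weights=doc_topics)[0]
        # 3. Pick a word from that topic's word distribution
        words.append(random.choices(vocab, weights=topic_word[z])[0])
    return doc_topics, words

doc_topics, words = generate_document(8)
print([round(p, 2) for p in doc_topics])
print(words)
```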

In [21]:
# Download spacy model
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [22]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [23]:
df = pd.DataFrame({
    'content': newsgroups_train['data'],
    'target': newsgroups_train['target'],
    'target_names': [newsgroups_train['target_names'][i] for i in newsgroups_train['target']]
})
print(df.shape)

(11314, 3)




In [24]:
pd.set_option('display.max_colwidth', 0)
df.sample(3)



Unnamed: 0,content,target,target_names
10478,"\nAs usual, you are missing the whole point, Russell, because you are not\nwilling to even consider questionning your basic article of faith, which\nis that science is merely a matter of methodology and that the highest\npurpose of science is to avoid making mistakes. \n\nThis is like saying that the most important aspect of business management\nis accurate bookkeeping. \n\nIf science were no more than methodology and not making mistakes, it\nwould be a poor thing indeed. What was the methodology of Darwin? What\nwas the methodology of Einstein? What was, for that matter, the\nmethodology of Jenner and Pasteur? \n\n\n\n\nFirst of all, I think you are arguing against a straw man, because I\ndon't think that anyone here is arguing that quackery, pseudo-science,\nhomeopathy, chiropracty, and traditional Chinese medicine should be\naccepted as science. I, in particular, think the basic ideas of\nhomeopathy and chiropracty seem extremely flaky. \n\nWhat some of us do believe, however, is that some of these things\n(including some of the flaky ideas) are deserving of serious scientific\nattention. \n\nIf in fact it were true, as you have stated above, that those who do not\nuse the currently fashionable methodology can have no idea what is\neffective and what is not, then science today would not exist. For all\nof current science is based on the past work of scientists whose\nmethodology, by current standards, was seriously flawed. \n\nIt is certainly true that as methodology improves, we need to re-examine\nthose results derived in the past using less perfect methodologies. It is\nalso true that the results obtained by people today who still rely on \nthose early methodologies needs to be re-examined in a more rigorous \nfashion by those qualified to do so credibly. 
\n\nBut to say that nobody who fails to do elaborate double-blind studies is\ncapable of knowing their ass from a hole in the ground and to say that no\nideas that come from outside the scientific establishment could possibly\nbe worthy of serious investigation ... this truly marks one's attitude as\ndoctrinaire, cultist. This attitude is not compatible with a belief in\nreason. \n\n--\nIn the arguments between behaviorists and cognitivists, psychology seems \nless like a science than a collection of competing religious sects.",13,sci.med
8200,"Just a few cheap shots a Christianity:\n\nRiddle: What is the shortest street in Jerusalem?\nAnswer: The Street of the Righteous Poles.\n\nLimrick:\n\nThere was an archeologist Thostle\nWho found an amazing fossil\nBy the way it was bent\nAnd the knot it the end\n'twas the penis of Paul the Apostle.\n\nJingle:\nChristianity hits the spot\nTwelve Apostles thats a lot\nJesus Christ and a Virgin too\nChristianity's the faith for you\n(with apologies to Pepsi Cola and its famous jingle)\n\nRiddle:\nHow many Christians does it take to save a light bulb.\nAnswer: None, only Jesus can save.\n\nAphorism:\nJesus Saves\nMoses Invests\n\nProof that Jesus was Jewish:\n1. He lived at home till he was 33\n2. He went into his fathers business\n3. He thought he mother was a virgin\n4. His mother thought he was God.\n\nQED.\n\nSo long you all\n\nBob Kolker\n""I would rather spend eternity in Hell with interesting people \nthan eternity in Heaven with Christians""\n\n",19,talk.religion.misc
1682,"Hooray ! I always suspected that I was human too :-) It is the desire to be like\nChrist that often causes christians to be very critical of themselves and other\nchristians. We are supposed to grow, mature, endeavour to be Christ-like but we\nare far far far from perfect. Build up the body of Christ, don't tear it down,\nand that includes yourself. Jesus loves me just the way I am today, tomorrow and\nalways (thank God ! :-).",15,soc.religion.christian


In [27]:
# how to deal with white space

'  the apple has fallen from the tree '.strip()



'the apple has fallen from the tree'

In [28]:
# how to deal with white space, part deux

' '.join('   the apple has fallen   from the tre   '.split())



'the apple has fallen from the tre'

In [32]:
# 1. Collapse newlines, tabs, and repeated whitespace into single spaces
df['clean_text'] = df['content'].apply(lambda x: re.sub(r'\s+', ' ', x))
# 2. Trim leading and trailing whitespace
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join(x.split()))
# 3. Remove 'From: user@host' email headers
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub(r'From: \S+@\S+', '', x))
# 4. Replace every non-letter character (digits and punctuation included) with a space
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
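
One side effect of the order above: replacing every non-letter with a space after whitespace has already been collapsed can reintroduce runs of spaces. A small sketch on a made-up string (the sample text and variable names are illustrative only) shows the effect, and shows that collapsing whitespace as the last step avoids it:

```python
import re

raw = "Call me at 555-1234,\n\nor e-mail   me!"

# Order used above: collapse whitespace first, then blank out non-letters
collapsed = re.sub(r'\s+', ' ', raw)
original_order = re.sub(r'[^a-zA-Z]', ' ', collapsed)

# Alternative: blank out non-letters first, collapse whitespace last
letters_only = re.sub(r'[^a-zA-Z]', ' ', raw)
collapse_last = ' '.join(letters_only.split())

print(repr(original_order))   # still contains runs of spaces
print(repr(collapse_last))
```

Whether the extra spaces matter depends on the tokenizer: a tokenizer that emits whitespace tokens will pass them along unless they are filtered later.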

In [33]:
df.sample(3)

Unnamed: 0,content,target,target_names,clean_text
2456,"There is another useful method based on Least Sqyares Estimation of the sphere equation parameters.\n\nThe points (x,y,z) on a spherical surface with radius R and center (a,b,c) can be written as \n\n (x-a)^2 + (y-b)^2 + (z-c)^2 = R^2\n\nThis equation can be rewritten into the following form: \n\n 2ax + 2by + 2cz + R^2 - a^2 - b^2 -c^2 = x^2 + y^2 + z^2\n\nApproximate the left hand part by F(x,y,z) = p1.x + p2.x + p3.z + p4.1\n\nFor all datapoints, i.c. 4, determine the 4 parameters p1..p4 which minimise the average error |F(x,y,z) - x^2 - y^2 - z^2|^2.\n\nIn 'Numerical Recipes in C' can be found algorithms to solve these parameters.\n\nThe best fitting sphere will have \n- center (a,b,c) = (p1/2, p2/2, p3/2)\n- radius R = sqrt(p4 + a.a + b.b + c.c).\n\nSo, at last, will this solve you sphere estination problem, at least for the most situations I think ?.",1,comp.graphics,There is another useful method based on Least Sqyares Estimation of the sphere equation parameters The points x y z on a spherical surface with radius R and center a b c can be written as x a y b z c R This equation can be rewritten into the following form ax by cz R a b c x y z Approximate the left hand part by F x y z p x p x p z p For all datapoints i c determine the parameters p p which minimise the average error F x y z x y z In Numerical Recipes in C can be found algorithms to solve these parameters The best fitting sphere will have center a b c p p p radius R sqrt p a a b b c c So at last will this solve you sphere estination problem at least for the most situations I think
10430,"##I strongly suggest that you look up a book called THE BIBLE, THE QURAN, AND\n##SCIENCE by Maurice Baucaille, a French surgeon. It is not comprehensive,\n##but, it is well researched. I imagine your library has it or can get it\n##for you through interlibrary loan.\n##\n\n I shall try to get hold of it (when I have time to read of course :-)\n\n##In short, Dr Baucaille began investigating the Bible because of pre-\n##ceived scientific inaccuracies and inconsistencies. He assumed that\n##some of the problems may have been caused by poor translations in by-\n##gone days. So, he read what he could find in Hebrew, Greek, Aramaic.\n##What he found was that the problems didn't go away, they got worse.\n##Then, he decided to see if other religions had the same problems.\n##So, he picked up the Holy Qur'an (in French) and found similar prob-\n##lems, but not as many. SO, he applied the same logoic as he had\n##with the Bible: he learned to read it in Arabic. The problems he\n##had found with the French version went away in Arabic. He was unable\n##to find a wealth of scientific statements in the Holy Qur'an, but,\n##what he did find made sense with modern understanding. So, he\n##investigated the Traditions (the hadith) to see what they had to\n##say about science. they were filled with science problems; after\n##all, they were contemporary narratives from a time which had, by\n##pour standards, a primitive world view. His conclusion was that,\n##while he was impressed that what little the Holy Qur'an had to\n##say about science was accurate, he was far more impressed that the\n##Holy Qur'an did not contain the same rampant errors evidenced in\n##the Traditions. How would a man of 7th Century Arabia have known\n##what *not to include* in the Holy Qur'an (assuming he had authored\n##it)? \n##\n\n So in short the writer (or writers) of Quran decided to stay away from\nscience. (if you do not open your mouth, then you don't put you foot into\nyour mouth either). 
\n\n But then if you say Quran does not talk much about science, then one can\nnot make claims (like Bobby does) that you have great science in Quran.\n\n Basically I want to say that *none* of the religious texts are supposed to\nbe scientific treatises. So I am just requesting the theists to stop making\nsuch wild claims.\n\n--- Vinayak\n-------------------------------------------------------\n vinayak dutt\n e-mail: vdp@mayo.edu\n\n standard disclaimers apply",0,alt.atheism,I strongly suggest that you look up a book called THE BIBLE THE QURAN AND SCIENCE by Maurice Baucaille a French surgeon It is not comprehensive but it is well researched I imagine your library has it or can get it for you through interlibrary loan I shall try to get hold of it when I have time to read of course In short Dr Baucaille began investigating the Bible because of pre ceived scientific inaccuracies and inconsistencies He assumed that some of the problems may have been caused by poor translations in by gone days So he read what he could find in Hebrew Greek Aramaic What he found was that the problems didn t go away they got worse Then he decided to see if other religions had the same problems So he picked up the Holy Qur an in French and found similar prob lems but not as many SO he applied the same logoic as he had with the Bible he learned to read it in Arabic The problems he had found with the French version went away in Arabic He was unable to find a wealth of scientific statements in the Holy Qur an but what he did find made sense with modern understanding So he investigated the Traditions the hadith to see what they had to say about science they were filled with science problems after all they were contemporary narratives from a time which had by pour standards a primitive world view His conclusion was that while he was impressed that what little the Holy Qur an had to say about science was accurate he was far more impressed that the Holy Qur an did not contain the same rampant 
errors evidenced in the Traditions How would a man of th Century Arabia have known what not to include in the Holy Qur an assuming he had authored it So in short the writer or writers of Quran decided to stay away from science if you do not open your mouth then you don t put you foot into your mouth either But then if you say Quran does not talk much about science then one can not make claims like Bobby does that you have great science in Quran Basically I want to say that none of the religious texts are supposed to be scientific treatises So I am just requesting the theists to stop making such wild claims Vinayak vinayak dutt e mail vdp mayo edu standard disclaimers apply
2336,"#Yet, when a law was proposed for Virginia that extended this \n#philosophy to cigarette smokers (so that people who smoked away\n#from the work couldn't be discriminated against by employers),\n#the liberal Gov. Wilder vetoed it. Which shows that liberals don't\n#give a damn about ""best person for the job,"" it's just a power\n#play.\n\nOf course Clayton ignores the fact that employers pay health\ninsurance, and insurance for smokers is more expensive than for\nnon-smokers. \n",18,talk.politics.misc,Yet when a law was proposed for Virginia that extended this philosophy to cigarette smokers so that people who smoked away from the work couldn t be discriminated against by employers the liberal Gov Wilder vetoed it Which shows that liberals don t give a damn about best person for the job it s just a power play Of course Clayton ignores the fact that employers pay health insurance and insurance for smokers is more expensive than for non smokers


In [26]:
nlp = spacy.load("en_core_web_lg")



In [34]:
# Leverage tqdm for progress_apply
from tqdm import tqdm
tqdm.pandas()

# If you're on macOS, Linux, or python session executed from Windows Subsystem for Linux (WSL)
# conda activate U4-S1-NLP
# pip install pandarallel
#
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)
#
# df['lemmas'] = df['content'].parallel_apply(get_lemmas)
#
# Ref: https://github.com/nalepae/pandarallel

In [37]:
# Create 'lemmas' column
def get_lemmas(x):
    """Return the lemmas of all tokens that are neither stop words nor punctuation."""
    return [token.lemma_ for token in nlp(x)
            if not token.is_stop and not token.is_punct]

df['lemmas'] = df['clean_text'].progress_apply(get_lemmas)

100%|██████████| 11314/11314 [06:07<00:00, 30.78it/s]


In [38]:
df.head()

Unnamed: 0,content,target,target_names,clean_text,lemmas
0,"I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.",7,rec.autos,I was wondering if anyone out there could enlighten me on this car I saw the other day It was a door sports car looked to be from the late s early s It was called a Bricklin The doors were really small In addition the front bumper was separate from the rest of the body This is all I know If anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail,"[wonder, enlighten, car, see, day, , , door, sport, car, , look, late, , s, , early, , s, , call, Bricklin, , door, small, , addition, , bumper, separate, rest, body, , know, , tellme, model, , engine, spec, , year, production, , car, , history, , info, funky, look, car, , e, mail]"
1,"A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.",4,comp.sys.mac.hardware,A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll Please send a brief message detailing your experiences with the procedure Top speed attained CPU rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with and m floppies are especially requested I will be summarizing in the next two days so please add to the network knowledge base if you have done the clock upgrade and haven t answered this poll Thanks,"[fair, number, brave, soul, upgrade, SI, clock, oscillator, share, experience, poll, , send, brief, message, detail, experience, procedure, , speed, attain, , cpu, rate, speed, , add, card, adapter, , heat, sink, , hour, usage, day, , floppy, disk, functionality, , , m, floppy, especially, request, , summarize, day, , add, network, knowledge, base, clock, upgrade, haven, t, answer, poll, , thank]"
2,"well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i'm in the market for a\nnew machine a bit sooner than i intended to be...\n\ni'm looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i'd heard the 185c was supposed to make an\nappearence ""this summer"" but haven't heard anymore on it - and since i\ndon't have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo's just went through recently?\n\n* what's the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don't really have\na feel for how much ""better"" the display is (yea, it looks great in the\nstore, but is that all ""wow"" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i've only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i'll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... 
:( )\n--\nTom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering",4,comp.sys.mac.hardware,well folks my mac plus finally gave up the ghost this weekend after starting life as a k way back in sooo i m in the market for a new machine a bit sooner than i intended to be i m looking into picking up a powerbook or maybe and have a bunch of questions that hopefully somebody can answer does anybody know any dirt on when the next round of powerbook introductions are expected i d heard the c was supposed to make an appearence this summer but haven t heard anymore on it and since i don t have access to macleak i was wondering if anybody out there had more info has anybody heard rumors about price drops to the powerbook line like the ones the duo s just went through recently what s the impression of the display on the i could probably swing a if i got the Mb disk rather than the but i don t really have a feel for how much better the display is yea it looks great in the store but is that all wow or is it really that good could i solicit some opinions of people who use the and day to day on if its worth taking the disk size and money hit to get the active display i realize this is a real subjective question but i ve only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful how well does hellcats perform thanks a bunch in advance for any info if you could email i ll post a summary news reading time is at a premium with finals just around the corner Tom Willis twillis ecn purdue edu Purdue Electrical Engineering,"[folk, , mac, plus, finally, give, ghost, weekend, start, life, , k, way, , sooo, , m, market, new, machine, bit, sooner, intend, , m, look, pick, powerbook, , maybe, , bunch, question, , hopefully, , somebody, answer, , anybody, know, dirt, round, powerbook, introduction, expect, , d, hear, , c, suppose, appearence, , summer, , haven, t, hear, anymore, , don, 
t, access, macleak, , wonder, anybody, info, , anybody, hear, rumor, price, drop, powerbook, line, like, one, duo, s, go, recently, , s, impression, display, , probably, swing, , get, , Mb, disk, , don, t, feel, , ...]"
3,\nDo you have Weitek's address/phone number? I'd like to get some information\nabout this chip.\n,1,comp.graphics,Do you have Weitek s address phone number I d like to get some information about this chip,"[Weitek, s, address, phone, number, , d, like, information, chip]"
4,"From article <C5owCB.n3p@world.std.com>, by tombaker@world.std.com (Tom A Baker):\n\n\nMy understanding is that the 'expected errors' are basically\nknown bugs in the warning system software - things are checked\nthat don't have the right values in yet because they aren't\nset till after launch, and suchlike. Rather than fix the code\nand possibly introduce new bugs, they just tell the crew\n'ok, if you see a warning no. 213 before liftoff, ignore it'.",14,sci.space,From article C owCB n p world std com by tombaker world std com Tom A Baker My understanding is that the expected errors are basically known bugs in the warning system software things are checked that don t have the right values in yet because they aren t set till after launch and suchlike Rather than fix the code and possibly introduce new bugs they just tell the crew ok if you see a warning no before liftoff ignore it,"[article, , C, owcb, n, p, world, std, com, , tombaker, world, std, com, , Tom, Baker, , understanding, , expect, error, , basically, know, bug, warning, system, software, , thing, check, don, t, right, value, aren, t, set, till, launch, , suchlike, , fix, code, possibly, introduce, new, bug, , tell, crew, , ok, , warning, , liftoff, , ignore, ]"


### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

In [39]:
# Create Dictionary
id2word = corpora.Dictionary(df['lemmas'] )

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['lemmas']]
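
To make the two inputs concrete, here is a plain-Python imitation of what a dictionary and a bag-of-words corpus contain. It is only a sketch: `build_vocab` and `to_bow` are hypothetical helpers written for illustration, not Gensim functions, though the result has the same shape as the output of `doc2bow`, a list of `(token_id, count)` pairs per document.

```python
from collections import Counter

docs = [
    ['car', 'engine', 'car', 'door'],
    ['disk', 'drive', 'engine'],
]

def build_vocab(documents):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Count tokens and return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_vocab(docs)
toy_corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)    # {'car': 0, 'engine': 1, 'door': 2, 'disk': 3, 'drive': 4}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1), (4, 1)]]
```

Word order is discarded: only which tokens occur, and how often, reaches the topic model.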

In [40]:
# How many words do we have?
len(id2word.keys())

77754

In [41]:
# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=5, no_above=0.75)

In [42]:
# How many words do we have?
len(id2word.keys())

14778
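
Filtering changes more than the vocabulary size. Gensim's `filter_extremes` compacts the dictionary afterwards, so surviving tokens can receive new ids, and a corpus built before filtering then refers to ids that no longer mean the same thing. The sketch below imitates this with plain Python on toy data. `filter_vocab` is a hypothetical stand-in, not the Gensim implementation, and its thresholds only mirror the meaning of `no_below` and `no_above`.

```python
from collections import Counter

docs = [
    ['the', 'car', 'engine'],
    ['the', 'car', 'door'],
    ['the', 'disk', 'engine'],
    ['the', 'disk', 'car'],
]

# Ids assigned in first-seen order: the=0, car=1, engine=2, door=3, disk=4
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def filter_vocab(documents, token2id, no_below, no_above):
    """Keep tokens found in >= no_below docs and <= no_above (fraction) of docs, then renumber from 0."""
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    max_docs = no_above * len(documents)
    kept = [tok for tok in token2id if no_below <= doc_freq[tok] <= max_docs]
    return {tok: new_id for new_id, tok in enumerate(kept)}

# Token ids for the third document, built BEFORE filtering
stale_ids = [token2id[tok] for tok in docs[2]]

filtered = filter_vocab(docs, token2id, no_below=2, no_above=0.75)

# Rebuilding AFTER filtering uses the new ids and drops removed tokens
fresh_ids = [filtered[tok] for tok in docs[2] if tok in filtered]

print('stale ids:', stale_ids, '| vocabulary size now:', len(filtered))
print('fresh ids:', fresh_ids)
```

Here the stale id 4 is out of range for a 3-word vocabulary. A mismatch of this kind can surface as an `IndexError` during training, so the corpus should be rebuilt whenever the dictionary is filtered.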

In [43]:
id2word[300]

'sheet'

In [51]:
df['lemmas'][5]

['course',
 ' ',
 'term',
 'rigidly',
 'define',
 'bill',
 ' ',
 'doubt',
 'use',
 'term',
 ' ',
 'quote',
 'allegedly',
 ' ',
 ' ',
 'read',
 'article',
 'present',
 'argument',
 'weapon',
 'mass',
 'destruction',
 ' ',
 'commonly',
 'understand',
 ' ',
 'switch',
 'topic',
 ' ',
 'point',
 'evidently',
 'weapon',
 'allow',
 ' ',
 'later',
 'analysis',
 ' ',
 'give',
 'understanding',
 ' ',
 'consider',
 'class']
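
The `' '` entries in this list are whitespace tokens, which the stop-word and punctuation checks do not remove. One option, sketched here on a shortened copy of the list, is to drop whitespace-only strings after lemmatizing. spaCy tokens also expose an `is_space` attribute, so an equivalent check could be added inside `get_lemmas` instead.

```python
# A shortened copy of the lemma list shown above
lemmas = ['course', ' ', 'term', 'rigidly', 'define', 'bill', ' ', 'doubt']

# str.strip() returns '' for whitespace-only strings, and '' is falsy
cleaned = [tok for tok in lemmas if tok.strip()]
print(cleaned)
```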

In [45]:
corpus[5]

[(0, 11),
 (117, 1),
 (177, 1),
 (193, 1),
 (221, 1),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 1),
 (233, 1),
 (234, 1),
 (235, 1),
 (236, 1),
 (237, 1),
 (238, 1),
 (239, 1),
 (240, 1),
 (241, 1),
 (242, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 2),
 (247, 1),
 (248, 1),
 (249, 2)]

In [46]:
id2word[252]

'rm'

In [47]:
id2word[276]

'controller'

In [52]:
# Human readable format of corpus (term-frequency)
[(id2word[word_id], word_count) for word_id, word_count in corpus[5]]

[('  ', 11),
 ('helpful', 1),
 ('address', 1),
 ('ignore', 1),
 ('course', 1),
 ('evidently', 1),
 ('later', 1),
 ('mass', 1),
 ('point', 1),
 ('present', 1),
 ('quote', 1),
 ('read', 1),
 ('switch', 1),
 ('term', 1),
 ('topic', 1),
 ('understand', 1),
 ('weapon', 1),
 ('News', 1),
 ('Sean', 1),
 ('September', 1),
 ('Sharon', 1),
 ('accidentally', 1),
 ('bounce', 1),
 ('couldn', 1),
 ('delete', 1),
 ('directly', 1),
 ('file', 2),
 ('glad', 1),
 ('instead', 1),
 ('prob', 2)]

# Part 2: Estimate a LDA Model with Gensim

### Train an LDA model

In [54]:
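%%time
# The dictionary was filtered after `corpus` was built, so the old token ids
# no longer match `id2word`; training on that stale corpus fails with
# "IndexError: index 14778 is out of bounds for axis 1 with size 14778".
# Rebuild the corpus against the filtered dictionary before training.
corpus = [id2word.doc2bow(text) for text in df['lemmas']]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            chunksize=100,
                                            passes=10,
                                            per_word_topics=True)
# https://radimrehurek.com/gensim/models/ldamodel.html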

In [None]:
# lda_model.save('lda_model.model')

In [None]:
%%time
lda_multicore = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=20, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)

# https://radimrehurek.com/gensim/models/ldamulticore.html

In [None]:
lda_multicore.save('lda_multicore.model')

In [None]:
from gensim import models
lda_multicore =  models.LdaModel.load('lda_multicore.model')

### View the topics in LDA model

In [None]:
newsgroups_train.target_names

In [None]:
pprint(lda_multicore.print_topics())
doc_lda = lda_multicore[corpus]

In [None]:
doc_lda

In [None]:
# Topic distribution for every document in the corpus
distro = [lda_multicore[d] for d in corpus]

### What is topic Perplexity?
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given number of topics $k$, you estimate the LDA model. Then you compare the theoretical word distributions represented by the topics with the actual distribution of words in your documents. Lower perplexity means the model is less "surprised" by the documents.
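
To make the definition concrete, here is a minimal sketch of perplexity for any model that assigns a probability to each observed word: take the average log2-probability per word and raise 2 to its negative. The `perplexity` function and the toy probabilities are hypothetical and stand in for what a fitted topic model would assign.

```python
import math


def perplexity(word_probs):
    """Perplexity of a sample, given the probability the model assigned to each word."""
    # Average log2-likelihood per word (a "per-word bound")
    per_word_bound = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-per_word_bound)


# A model that gives each observed word probability 1/2 ...
confident = perplexity([0.5, 0.5, 0.5, 0.5])
# ... versus one that gives each observed word probability 1/8
unsure = perplexity([0.125, 0.125, 0.125, 0.125])

print(confident, unsure)
```

A perplexity of 8 can be read as the model being as uncertain as if it were choosing uniformly among 8 words at each position, so the first model fits this sample better.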

### What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These scores help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.
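
One simple way to turn "high-scoring words that support each other" into a number is to average the normalized pointwise mutual information (NPMI) over all pairs of a topic's top words, counting how often the words appear in the same document. The sketch below does exactly that on toy data. It is not an implementation of the `c_v` measure used in this notebook, which is more involved; `npmi_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of top_words, using document co-occurrence."""
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)   # the pair co-occurs everywhere: limiting value
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


toy_docs = [
    ["game", "team", "ball"],
    ["game", "ball", "score"],
    ["team", "ball", "game"],
    ["file", "disk", "driver"],
    ["disk", "driver", "windows"],
    ["game", "disk"],
]

sports = npmi_coherence(["game", "team", "ball"], toy_docs)
mixed = npmi_coherence(["game", "disk", "score"], toy_docs)
print(round(sports, 3), round(mixed, 3))
```

The topic whose words tend to show up together scores higher than the topic that mixes unrelated words, which is the behavior a coherence measure is meant to capture.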

In [None]:
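# log_perplexity returns the per-word likelihood bound, not the perplexity itself
per_word_bound = lda_multicore.log_perplexity(corpus)
print('\nPer-word likelihood bound: ', per_word_bound)  # higher (closer to 0) is better
print('Perplexity: ', 2 ** (-per_word_bound))  # a measure of how good the model is. lower is better.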

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_multicore, 
                                     texts=df['lemmas'], 
                                     dictionary=id2word, 
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
pyLDAvis.enable_notebook()
# Note: in pyLDAvis 3.x this helper module is named pyLDAvis.gensim_models
vis = pyLDAvis.gensim.prepare(lda_multicore, corpus, id2word)
pyLDAvis.display(vis)

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
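    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Upper bound on the number of topics (exclusive, as in range)
    start : Smallest number of topics to try
    step : Increment between successive numbers of topics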

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=dictionary,  # use the argument, not the global
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        texts=df['lemmas'], 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [None]:
coherence_values = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

In [None]:
limit, start, step = 40, 2, 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a tuple, not a string
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
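
Instead of reading the best setting off the plot, you can pick it programmatically. A minimal sketch, reusing the coherence values and the `range(2, 40, 6)` grid from the cells above; the variable names are hypothetical, chosen so they do not overwrite anything in the notebook.

```python
# Grid of topic counts and the coherence values recorded above
candidate_topic_counts = list(range(2, 40, 6))
coherence_scores = [0.5054, 0.5332, 0.5452, 0.564, 0.5678, 0.5518, 0.519]

# Index of the highest coherence score
best_index = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
best_num_topics = candidate_topic_counts[best_index]

print(best_index, best_num_topics)  # → 4 26
```

The highest score is only a starting point. When scores are close, a smaller number of topics that you can still interpret is often the more useful choice.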

In [None]:
# Select the model and print the topics
#optimal_model = model_list[4]
optimal_model =  models.LdaModel.load('optimal_model.model')
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))