                                Topic Modeling with the LDA approach
First, we need a corpus, i.e. a set of documents, from which we will build the dictionary:

In [1]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

Instead of such short documents, we could also use a larger dataset such as the newsgroups data or Wikipedia data; it would be in exactly the same format as this.

In [2]:
doc_complete = [doc1, doc2, doc3, doc4, doc5]  #corpus or compiled documents

In [6]:
# The first step is cleaning the documents: removing stopwords and punctuation, then stemming or lemmatizing
import string
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus, e.g. nltk.download('stopwords')
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()  # alternative to stemming; not used by clean() below
sbEng = SnowballStemmer('english')

If you want to split a document into sentences, do not remove punctuation first: run a sentence tokenizer to separate the sentences, then clean each sentence. Here we only need a list of word tokens, so we can remove punctuation straight away.
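
The sentence-first order described above could be sketched roughly as follows. This is only an illustration: `split_sentences` is a hypothetical regex stand-in for a real sentence tokenizer (such as the one in NLTK), and stopword removal and stemming are left out so that the sketch needs nothing beyond the standard library.

```python
import re
import string

def split_sentences(text):
    # Hypothetical stand-in for a sentence tokenizer: split on whitespace
    # that follows '.', '!' or '?'. Real tokenizers also handle abbreviations.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def clean_sentence(sentence):
    # Punctuation is stripped only after the sentence boundaries are known.
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]

doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
sentences = [clean_sentence(s) for s in split_sentences(doc)]
print(sentences)
```

Stripping punctuation before splitting would remove the full stops that mark the sentence boundaries, which is why the order matters.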

In [14]:
def clean(doc):
    """Lowercase a document, strip punctuation, drop stopwords and stem; returns a space-joined string."""
    # strip leading/trailing punctuation from each space-separated token
    doc = ' '.join([item.strip(string.punctuation) for item in doc.lower().split(' ')])
    # remove stopwords
    doc = ' '.join([item for item in doc.split(' ') if item not in stop])
    # stemming (lemmatizing is an alternative)
    doc = ' '.join([sbEng.stem(item) for item in doc.split(' ')])
    return doc

In [27]:
doc_clean=[clean(doc).split() for doc in doc_complete]

In [28]:
doc_clean #documents cleaned and converted into list of words/tokens

[['sugar', 'bad', 'consum', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spend',
  'lot',
  'time',
  'drive',
  'sister',
  'around',
  'danc',
  'practic'],
 ['doctor',
  'suggest',
  'drive',
  'may',
  'caus',
  'increas',
  'stress',
  'blood',
  'pressur'],
 ['sometim',
  'feel',
  'pressur',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seem',
  'drive',
  'sister',
  'better'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyl']]

In [32]:
# Prepare the dictionary from the documents
import gensim
from gensim import corpora
dictionary = corpora.Dictionary(doc_clean)  # each unique word is assigned an integer id
# Build the document-term matrix: a bag-of-words representation of each document,
# i.e. a list of (word id, frequency of that word in the document) tuples
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [33]:
doc_term_matrix[0]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]

In [34]:
dictionary[5]

'sugar'

In [35]:
doc_clean[0]

['sugar', 'bad', 'consum', 'sister', 'like', 'sugar', 'father']

So the word 'sugar' has id 5 and occurs twice in the first document, hence the tuple (5, 2).
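
To make the bag-of-words idea concrete, here is a small standard-library sketch of what a dictionary plus a `doc2bow`-style step does conceptually. The helper names `build_vocab` and `to_bow` are my own, and the id ordering (alphabetical within each document, documents in order) is an assumption chosen so that the ids line up with the output above; gensim's actual id assignment is an implementation detail.

```python
from collections import Counter

def build_vocab(docs):
    # Assign the next free integer id to each token not seen before.
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    # Count tokens, then emit sorted (id, count) pairs for known tokens only.
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [
    ['sugar', 'bad', 'consum', 'sister', 'like', 'sugar', 'father'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyl'],
]
token2id = build_vocab(docs)
print(token2id['sugar'], to_bow(docs[0], token2id))
```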
The next step is to create an LDA model object and train it on the document-term matrix. The gensim module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents.

In [48]:
# Create a shorthand for gensim's LDA model class
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)


In [49]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.029*"father" + 0.029*"sister" + 0.029*"drive"'), (1, '0.073*"sister" + 0.073*"father" + 0.072*"drive"'), (2, '0.100*"sugar" + 0.040*"stress" + 0.040*"doctor"')]
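
The topics come back as formatted strings, which are awkward to compare. Below is a small sketch that parses the strings printed above into (word, weight) pairs and checks which top words two topics share. The regex and the helper name `parse_topic` are my own; gensim's `show_topic` method should give you such pairs directly, so treat this only as a way to work with already printed output.

```python
import re

# weight*"word" terms, as they appear in the printed topic strings
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')

def parse_topic(topic_string):
    # Turn '0.100*"sugar" + 0.040*"stress"' into [('sugar', 0.1), ('stress', 0.04)]
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

topics = [
    (0, '0.029*"father" + 0.029*"sister" + 0.029*"drive"'),
    (1, '0.073*"sister" + 0.073*"father" + 0.072*"drive"'),
    (2, '0.100*"sugar" + 0.040*"stress" + 0.040*"doctor"'),
]
parsed = {topic_id: parse_topic(s) for topic_id, s in topics}
top_words = {topic_id: {w for w, _ in pairs} for topic_id, pairs in parsed.items()}
shared = top_words[0] & top_words[1]
print(parsed[2])
print(sorted(shared))
```

Here topics 0 and 1 share all three of their top words, which suggests that on such a tiny corpus three topics may already be more than the data supports.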


In [41]:
#Latent semantic indexing model
from gensim.models import LsiModel

model = LsiModel(doc_term_matrix, id2word=dictionary)

In [42]:
model

<gensim.models.lsimodel.LsiModel at 0x7fae799ae550>

In [43]:
model.print_topics(num_topics=3, num_words=3)

[(0, '0.401*"father" + 0.401*"sister" + 0.377*"drive"'),
 (1, '-0.561*"sugar" + 0.260*"pressur" + 0.225*"drive"'),
 (2, '0.354*"sugar" + 0.275*"doctor" + 0.275*"blood"')]

In [51]:
hert = ldamodel[doc_term_matrix[4]]  # apply the trained model to a document (here doc5, which is part of the training corpus)

In [54]:
hert #probability of each topic for this document

[(0, 0.04854002), (1, 0.0479168), (2, 0.9035432)]

Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. In my experience, topic coherence score, in particular, has been more helpful.

In [57]:
from gensim.models import CoherenceModel
# Compute Perplexity
print('\nPerplexity: ', ldamodel.log_perplexity(doc_term_matrix))  # log_perplexity returns the per-word likelihood bound, not the perplexity itself; perplexity is 2**(-bound), and lower perplexity is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=doc_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -4.230102421239365

Coherence Score:  0.32324888110161115
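
Note that the number printed as "Perplexity" above is negative. As far as I understand gensim's `log_perplexity`, it returns a per-word likelihood bound on a log scale, and the perplexity estimate is obtained as 2 to the power of minus that bound. A minimal sketch of the conversion, using the value printed above:

```python
# Per-word likelihood bound as printed above (output of log_perplexity)
bound = -4.230102421239365

# Perplexity estimate: 2 ** (-bound); a lower perplexity is better,
# which corresponds to a bound closer to zero.
perplexity = 2 ** (-bound)
print(round(perplexity, 2))
```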


In [61]:
import pyLDAvis
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)
vis

So how do we interpret pyLDAvis's output?

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

We have successfully built a good looking topic model.

Given our prior knowledge of the number of natural topics in the document, finding the best model was fairly straightforward.

In [84]:
# apply the trained model to a new, unseen document
new_document='stress daughter to school and his blood pressure is high'

In [85]:
cleandoc=clean(new_document)

In [86]:
cleandoc=cleandoc.split()

In [87]:
matrix=dictionary.doc2bow(cleandoc)

In [88]:
matrix

[(13, 1), (18, 1), (19, 1), (25, 1)]
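
The bag-of-words vector above has only four entries because `doc2bow` by default ignores tokens that are not in the dictionary. The sketch below shows which tokens of the new document are known to the vocabulary built from `doc_clean`. The list `new_tokens` is what I assume `clean()` returns for this sentence; it is not copied from a notebook output.

```python
# Cleaned training documents, copied from the doc_clean output above
doc_clean = [
    ['sugar', 'bad', 'consum', 'sister', 'like', 'sugar', 'father'],
    ['father', 'spend', 'lot', 'time', 'drive', 'sister', 'around', 'danc', 'practic'],
    ['doctor', 'suggest', 'drive', 'may', 'caus', 'increas', 'stress', 'blood', 'pressur'],
    ['sometim', 'feel', 'pressur', 'perform', 'well', 'school', 'father', 'never',
     'seem', 'drive', 'sister', 'better'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyl'],
]
vocabulary = {token for doc in doc_clean for token in doc}

# Assumed result of clean(new_document).split()
new_tokens = ['stress', 'daughter', 'school', 'blood', 'pressur', 'high']

known = [t for t in new_tokens if t in vocabulary]
unknown = [t for t in new_tokens if t not in vocabulary]
print(known)
print(unknown)
```

Words the model has never seen contribute nothing to the inferred topic distribution, so a new document made up mostly of unseen words will get an uninformative result.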

In [89]:
ldamodel[matrix]

[(0, 0.06938401), (1, 0.31519938), (2, 0.61541665)]
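
The result is a list of (topic id, probability) pairs. If you only want the single most likely topic for the document, one simple way is to take the pair with the highest probability; the variable names here are just for illustration.

```python
# Topic distribution for the new document, copied from the output above
doc_topics = [(0, 0.06938401), (1, 0.31519938), (2, 0.61541665)]

# Pick the (topic id, probability) pair with the largest probability
dominant_topic, probability = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, probability)

# The probabilities should sum to roughly 1
total = sum(p for _, p in doc_topics)
```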

In [90]:
ldamodel.print_topics(num_topics=3,num_words=3)

[(0, '0.029*"father" + 0.029*"sister" + 0.029*"drive"'),
 (1, '0.073*"sister" + 0.073*"father" + 0.072*"drive"'),
 (2, '0.100*"sugar" + 0.040*"stress" + 0.040*"doctor"')]

                        Training our LDA model on a larger dataset: the newsgroups dataset

In [93]:
import pandas as pd
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']


      content                                              target  target_names
0     From: lerxst@wam.umd.edu (where's my thing)\nS...    7       rec.autos
1     From: guykuo@carson.u.washington.edu (Guy Kuo)...    4       comp.sys.mac.hardware
10    From: irwin@cmptrc.lonestar.org (Irwin Arnstei...    8       rec.motorcycles
100   From: tchen@magnus.acs.ohio-state.edu (Tsung-K...    6       misc.forsale
1000  From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\n...    2       comp.os.ms-windows.misc


In [95]:
df.content[2]

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdue University Engineering Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985.  sooo, i\'m in the market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected?  i\'d heard the 185c was supposed to make an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s just went through recently?\n\n* what\'s the impression of the display on the 180?  i could probably swin

In [98]:
# we will use the content of each post as one document
documents = [df.content[i] for i in range(len(df))]  # corpus ready

In [107]:
documents_clean=[clean(document).split() for document in documents ] #cleaned corpus and converted to list of words

In [112]:
new_dictionary=corpora.Dictionary(documents_clean) #dictionary prepared 

In [113]:
doc_matrix=[new_dictionary.doc2bow(doc) for doc in documents_clean] #matrix generated

In [114]:
doc_matrix[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 2),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 5),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 2),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 1),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 1),
 (52, 1),
 (53, 1),
 (54, 2),
 (55, 1),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1)]

In [115]:
newldamodel = Lda(doc_matrix, num_topics=10, id2word = new_dictionary, passes=50)
#new ldamodel trained on big data

In [118]:
print(newldamodel.print_topics(num_topics=10,num_words=5))#new lda model trained

[(0, '0.009*"israel" + 0.008*"isra" + 0.006*"db" + 0.004*"arab" + 0.003*"p"'), (1, '0.007*"the" + 0.006*"peopl" + 0.006*"would" + 0.005*"in" + 0.005*"one"'), (2, '0.062*"max>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'ax>\'" + 0.002*"dog" + 0.002*"ra" + 0.001*"rock" + 0.001*"counterst"'), (3, '0.011*">" + 0.010*"organ" + 0.010*"subject" + 0.010*"writes:" + 0.009*"lin"'), (4, '0.041*"x" + 0.029*"1" + 0.020*"0" + 0.016*"2" + 0.009*"3"'), (5, '0.011*"organ" + 0.011*"subject" + 0.010*"lin" + 0.008*"nntp-posting-host" + 0.005*"univers"'), (6, '0.011*"use" + 0.010*"subject" + 0.009*"organ" + 0.008*"lin" + 0.007*"i"'), (7, '0.008*"space" + 0.008*"armenian" + 0.006*"the" + 0.005*"turkish" + 0.004*"of"'), (8, '0.006*"use" + 0.004*"list" + 0.004*"inform" + 0.004*"mail" + 0.003*"research"'), (9, '0.008*"god" + 0.007*"the" + 0.007*"one" + 0.006*"would" + 0.006*"i"')]
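
These topics are noisy: they contain stopwords ("the", "of", "i"), header words ("subject", "organ", "lin"), single digits and punctuation (">", "writes:"). One likely cause is that `clean()` splits on single spaces only, so tokens joined by newlines slip past the stopword filter. Below is a rough sketch of a stricter tokenizer. The function name `tokenize`, the tiny stopword set and the `min_len` parameter are my own illustrative choices (in the notebook you would pass the NLTK `stop` set), and stemming is left out.

```python
import string

# Tiny stand-in stopword list for illustration only
ILLUSTRATIVE_STOPWORDS = {"the", "of", "in", "i", "a", "to", "is"}

def tokenize(text, stopwords=ILLUSTRATIVE_STOPWORDS, min_len=2):
    tokens = []
    for raw in text.lower().split():  # splits on spaces, tabs and newlines
        token = raw.strip(string.punctuation)
        # drop very short tokens and anything that is not purely alphabetic
        if len(token) < min_len or not token.isalpha():
            continue
        if token in stopwords:
            continue
        tokens.append(token)
    return tokens

sample = "Lines: 36\n\nthe mac plus\n> writes: x 1 0"
print(tokenize(sample))
```

The `isalpha()` check is deliberately blunt: it also drops hyphenated tokens and contractions, which may or may not be what you want. Header lines such as "Subject:" and "Organization:" would still need to be removed separately before tokenizing.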
