## Topic Modeling and Latent Dirichlet Allocation (LDA)

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
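
To make the two Dirichlet-distributed models concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The toy vocabulary, the prior values, and the helper names `sample_dirichlet` and `generate_document` are assumptions made for illustration; this is not how gensim trains a model, only how LDA imagines a document being produced.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Pick topic proportions for the document, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # topics-per-document distribution
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(2018)
vocab = ['rain', 'bushfir', 'court', 'charg', 'market', 'share']  # toy vocabulary (assumption)
num_topics = 3

# words-per-topic distributions: one Dirichlet draw over the vocabulary per topic
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, words = generate_document(topic_word, vocab, [0.5] * num_topics, 5, rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic proportions per document and the word distribution per topic.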

##### The Data

The data set we’ll use is a list of over one million news headlines published over a period of 15 years.

In [2]:
import pandas as pd

data = pd.read_csv(r'C:\Users\athiq.ahmed\Desktop\Other\Python code\ML\Text data analysis\Datasets\abcnews-date-text.csv')

In [3]:
print(data.head())

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [10]:
data_text = data[['headline_text']].copy()  # copy so the new column is not added to a view of `data`
data_text['index'] = data_text.index
documents = data_text
print(documents[:5])

                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


In [11]:
len(documents)

1103665

##### Data preprocessing

    We will perform the following steps:
    1. Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
    2. Words that have 3 or fewer characters are removed.
    3. All stopwords are removed.
    4. Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are 
       changed into present.
    5. Words are stemmed — words are reduced to their root form.

In [None]:
# !pip install gensim

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
np.random.seed(2018)

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\athiq.ahmed\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

##### Lemmatize example

In [18]:
print(WordNetLemmatizer().lemmatize('went',pos='v'))

go


##### Stemmer Example

In [None]:
stemmer = SnowballStemmer('english')

original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data={'original_word':original_words, 'stemmed':singles})

##### Write functions to perform the lemmatizing and stemming preprocessing steps on the data set.

In [21]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text,pos='v'))

In [26]:
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [27]:
preprocess('aba decides against community broadcasting licence')

['decid', 'communiti', 'broadcast', 'licenc']

##### Select a document to preview after preprocessing.

In [35]:
doc_sample = documents[documents['index'] == 4310].values[0][0]
doc_sample

'rain helps dampen bushfires'

In [46]:
print('original document: ')
words =[]
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\ntokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['rain', 'helps', 'dampen', 'bushfires']

tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


##### Preprocess the headline text, saving the results as ‘processed_docs’

In [48]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

##### Bag of Words on the Data set

Create a dictionary from ‘processed_docs’ that maps each unique word in the training set to an integer id.

In [52]:
dictionary = gensim.corpora.Dictionary(processed_docs)
print(dictionary)

Dictionary(62245 unique tokens: ['broadcast', 'communiti', 'decid', 'licenc', 'awar']...)


In [55]:
count = 0
for k,v in dictionary.items():
    print(k,v)
    count+=1
    if count>10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


##### Gensim filter_extremes

Filter out tokens that appear in:

- fewer than 15 documents (absolute number), or
- more than 50% of documents (a fraction of total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent tokens.

In [56]:
dictionary.filter_extremes(no_below=15,no_above=0.5,keep_n=100000)
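
As a rough illustration of the filtering rule described above, the sketch below applies the same three thresholds to a tiny hand-made corpus. The function `filter_extremes_sketch` and the toy documents are hypothetical and written only for this example; gensim's own implementation works on its internal document-frequency table and may differ in details such as tie-breaking.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return tokens whose document frequency lies within the given bounds."""
    num_docs = len(docs)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * num_docs)  # turn the fraction into an absolute count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda t: doc_freq[t], reverse=True)  # most frequent first
    return kept[:keep_n]

toy_docs = [
    ['rain', 'help', 'bushfir', 'polic'],
    ['polic', 'charg', 'court'],
    ['rain', 'polic', 'court'],
    ['polic', 'rain', 'market'],
]

# 'polic' is in all 4 documents (too common); 'help', 'charg', etc. are in only 1 (too rare)
kept_tokens = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(kept_tokens)  # ['rain', 'court']
```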

##### Gensim doc2bow

For each document, we create a list of (word id, count) pairs reporting which words appear and how many times each one appears. Save this to ‘bow_corpus’, then check the document we selected earlier.

In [58]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

##### Preview Bag Of Words for our sample preprocessed document.

In [59]:
bow_corpus[4310]

[(76, 1), (112, 1), (483, 1), (4014, 1)]

In [64]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

word 76 ("bushfir") appears 1 time.
word 112 ("help") appears 1 time.
word 483 ("rain") appears 1 time.
word 4014 ("dampen") appears 1 time.
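
The bag-of-words conversion can be sketched in a few lines of plain Python. The helpers `build_token2id` and `doc2bow_sketch` are hypothetical names for this illustration, and the ids here are assigned in order of first appearance, which is an assumption of the sketch and not necessarily how gensim numbers its tokens. The point to notice is that words missing from the dictionary are silently dropped, which also happens to unseen words in new documents.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_token2id = build_token2id([
    ['rain', 'help', 'dampen', 'bushfir'],
    ['rain', 'flood', 'rain'],
])

# 'pentagon' is not in the toy dictionary, so it does not appear in the result
toy_bow = doc2bow_sketch(['rain', 'rain', 'flood', 'pentagon'], toy_token2id)
print(toy_bow)  # [(0, 2), (4, 1)]
```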


##### TF-IDF

Create a TF-IDF model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call it ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.

In [99]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5892908644709983),
 (1, 0.38929657403503015),
 (2, 0.4964985198530063),
 (3, 0.5046520328695662)]
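
To see where scores like these come from, here is a small sketch of one common TF-IDF weighting: raw term count times log2(N / document frequency), followed by scaling each document to unit length. We assume this matches the settings `models.TfidfModel` uses when called with no extra arguments; the function `tfidf_sketch` and the toy corpus are made up for this example.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    # document frequency of each token id
    doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
    result = []
    for bow in bow_corpus:
        weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # tokens that occur in every document get weight 0 and are dropped
        result.append([(tid, w / norm) for tid, w in weights if norm > 0 and w > 0])
    return result

toy_bow_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(1, 1), (2, 1)],
    [(0, 1)],
]
toy_tfidf = tfidf_sketch(toy_bow_corpus)
for doc in toy_tfidf:
    print(doc)
```

In the first toy document, token 1 occurs in fewer documents than token 0, so it receives the larger weight even though both occur once.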


##### Running LDA using Bag of Words

Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.

In [101]:
lda_model = gensim.models.LdaMulticore(bow_corpus,num_topics=10,id2word=dictionary,passes=2,workers=2)

In [102]:
for idx,topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))

Topic: 0 
Words: 0.029*"elect" + 0.018*"death" + 0.017*"hospit" + 0.017*"say" + 0.016*"tasmanian" + 0.015*"labor" + 0.013*"deal" + 0.013*"china" + 0.011*"polit" + 0.011*"talk"
Topic: 1 
Words: 0.019*"nation" + 0.018*"coast" + 0.016*"help" + 0.016*"countri" + 0.015*"state" + 0.015*"chang" + 0.014*"health" + 0.013*"hour" + 0.013*"indigen" + 0.012*"water"
Topic: 2 
Words: 0.019*"canberra" + 0.018*"market" + 0.014*"rise" + 0.014*"west" + 0.014*"australian" + 0.013*"turnbul" + 0.013*"price" + 0.013*"share" + 0.011*"victoria" + 0.011*"bank"
Topic: 3 
Words: 0.063*"polic" + 0.023*"crash" + 0.019*"interview" + 0.018*"miss" + 0.018*"shoot" + 0.016*"arrest" + 0.015*"investig" + 0.013*"driver" + 0.012*"search" + 0.011*"offic"
Topic: 4 
Words: 0.029*"charg" + 0.027*"court" + 0.021*"murder" + 0.018*"woman" + 0.018*"face" + 0.016*"die" + 0.016*"alleg" + 0.015*"brisban" + 0.015*"live" + 0.015*"jail"
Topic: 5 
Words: 0.035*"australia" + 0.022*"melbourn" + 0.021*"world" + 0.017*"open" + 0.014*"final" + 0.013*"donald" + 0.011*"sydney" + 0.010*"leagu" + 0.010*"take" + 0.010*"win"

##### Running LDA using TF-IDF

In [103]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,num_topics=10,id2word=dictionary,passes=2,workers=4)

In [104]:
for idx,topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))

Topic: 0 
Words: 0.010*"sport" + 0.008*"plead" + 0.008*"abbott" + 0.008*"michael" + 0.007*"monday" + 0.007*"dairi" + 0.006*"toni" + 0.006*"wrap" + 0.005*"origin" + 0.005*"live"
Topic: 1 
Words: 0.020*"countri" + 0.019*"hour" + 0.008*"violenc" + 0.007*"grandstand" + 0.007*"korea" + 0.006*"strike" + 0.006*"asylum" + 0.005*"domest" + 0.005*"north" + 0.005*"islam"
Topic: 2 
Words: 0.012*"interview" + 0.010*"donald" + 0.007*"leagu" + 0.007*"final" + 0.007*"rugbi" + 0.006*"peter" + 0.006*"wednesday" + 0.006*"thursday" + 0.006*"august" + 0.006*"syria"
Topic: 3 
Words: 0.007*"climat" + 0.007*"friday" + 0.007*"mother" + 0.006*"tuesday" + 0.006*"festiv" + 0.006*"care" + 0.006*"histori" + 0.005*"quiz" + 0.005*"thousand" + 0.005*"music"
Topic: 4 
Words: 0.008*"christma" + 0.006*"farm" + 0.006*"energi" + 0.006*"stori" + 0.005*"decemb" + 0.005*"plan" + 0.004*"blog" + 0.004*"centr" + 0.004*"town" + 0.004*"council"
Topic: 5 
Words: 0.010*"australia" + 0.009*"podcast" + 0.009*"market" + 0.008*"weather"

##### Performance evaluation by classifying sample document using LDA Bag of Words model

In [105]:
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [107]:
for index,score in sorted(lda_model[bow_corpus[4310]], key=lambda tup:-1*tup[1]):
    print('\nScore: {} \t \nTopic: {}'.format(score,lda_model.print_topic(index,10)))


Score: 0.41999879479408264 	 
Topic: 0.026*"south" + 0.025*"kill" + 0.015*"island" + 0.013*"fall" + 0.011*"attack" + 0.009*"forc" + 0.009*"shark" + 0.009*"east" + 0.008*"northern" + 0.007*"great"

Score: 0.2200000286102295 	 
Topic: 0.019*"nation" + 0.018*"coast" + 0.016*"help" + 0.016*"countri" + 0.015*"state" + 0.015*"chang" + 0.014*"health" + 0.013*"hour" + 0.013*"indigen" + 0.012*"water"

Score: 0.2200000286102295 	 
Topic: 0.035*"australia" + 0.022*"melbourn" + 0.021*"world" + 0.017*"open" + 0.014*"final" + 0.013*"donald" + 0.011*"sydney" + 0.010*"leagu" + 0.010*"take" + 0.010*"win"

Score: 0.020001189783215523 	 
Topic: 0.019*"council" + 0.015*"power" + 0.013*"farmer" + 0.012*"busi" + 0.011*"guilti" + 0.010*"region" + 0.010*"feder" + 0.009*"research" + 0.009*"industri" + 0.009*"energi"

Score: 0.019999999552965164 	 
Topic: 0.029*"elect" + 0.018*"death" + 0.017*"hospit" + 0.017*"say" + 0.016*"tasmanian" + 0.015*"labor" + 0.013*"deal" + 0.013*"china" + 0.011*"polit" + 0.011*"talk"
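
The round-looking scores above (about 0.42, 0.22 and 0.02) appear consistent with a simple reading: if each of the document's 4 words is attributed almost entirely to one topic, a topic's score is roughly (alpha + words assigned to it) / (number of words + num_topics * alpha). The sketch below works through that arithmetic. It assumes a symmetric prior of alpha = 1 / num_topics, and the topic id 7 used for the top topic is a placeholder, since the output does not show that topic's index.

```python
def approx_topic_scores(words_per_topic, num_topics, alpha=None):
    """Approximate document-topic scores from hard word-to-topic counts."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumed symmetric prior
    n_words = sum(words_per_topic.values())
    denom = n_words + num_topics * alpha
    return {k: (alpha + words_per_topic.get(k, 0)) / denom for k in range(num_topics)}

# 4 words: two attributed to one topic (id 7 is a placeholder), one each to topics 1 and 5
approx_scores = approx_topic_scores({7: 2, 1: 1, 5: 1}, num_topics=10)
for topic_id, score in sorted(approx_scores.items(), key=lambda kv: -kv[1])[:5]:
    print(topic_id, round(score, 4))
```

Under the same assumption, a topic that receives none of the words still gets 0.1 / 5 = 0.02, which is why the remaining topics all share nearly the same small score.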

##### Performance evaluation by classifying sample document using LDA TF-IDF model

In [109]:
for index,score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup:-1*tup[1]):
    print('\nScore: {} \t \nTopic: {}'.format(score,lda_model_tfidf.print_topic(index,10)))


Score: 0.4107843041419983 	 
Topic: 0.022*"rural" + 0.016*"news" + 0.009*"nation" + 0.008*"ash" + 0.007*"celebr" + 0.007*"victorian" + 0.006*"busi" + 0.006*"liber" + 0.005*"explain" + 0.004*"kid"

Score: 0.27544841170310974 	 
Topic: 0.010*"sport" + 0.008*"plead" + 0.008*"abbott" + 0.008*"michael" + 0.007*"monday" + 0.007*"dairi" + 0.006*"toni" + 0.006*"wrap" + 0.005*"origin" + 0.005*"live"

Score: 0.17374882102012634 	 
Topic: 0.008*"christma" + 0.006*"farm" + 0.006*"energi" + 0.006*"stori" + 0.005*"decemb" + 0.005*"plan" + 0.004*"blog" + 0.004*"centr" + 0.004*"town" + 0.004*"council"

Score: 0.020004134625196457 	 
Topic: 0.012*"drum" + 0.007*"rise" + 0.007*"mental" + 0.006*"queensland" + 0.006*"rat" + 0.006*"june" + 0.006*"rate" + 0.005*"drive" + 0.005*"polic" + 0.005*"spring"

Score: 0.02000388875603676 	 
Topic: 0.010*"australia" + 0.009*"podcast" + 0.009*"market" + 0.008*"weather" + 0.008*"share" + 0.007*"australian" + 0.006*"south" + 0.005*"world" + 0.005*"novemb" + 0.005*"test

##### Testing model on unseen document

In [110]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.27935636043548584	 Topic: 0.029*"elect" + 0.018*"death" + 0.017*"hospit" + 0.017*"say" + 0.016*"tasmanian"
Score: 0.2539708912372589	 Topic: 0.019*"nation" + 0.018*"coast" + 0.016*"help" + 0.016*"countri" + 0.015*"state"
Score: 0.1833333522081375	 Topic: 0.019*"canberra" + 0.018*"market" + 0.014*"rise" + 0.014*"west" + 0.014*"australian"
Score: 0.18333333730697632	 Topic: 0.063*"polic" + 0.023*"crash" + 0.019*"interview" + 0.018*"miss" + 0.018*"shoot"
Score: 0.016672732308506966	 Topic: 0.036*"trump" + 0.031*"australian" + 0.019*"queensland" + 0.014*"leav" + 0.014*"australia"
Score: 0.01666666753590107	 Topic: 0.029*"charg" + 0.027*"court" + 0.021*"murder" + 0.018*"woman" + 0.018*"face"
Score: 0.01666666753590107	 Topic: 0.035*"australia" + 0.022*"melbourn" + 0.021*"world" + 0.017*"open" + 0.014*"final"
Score: 0.01666666753590107	 Topic: 0.026*"south" + 0.025*"kill" + 0.015*"island" + 0.013*"fall" + 0.011*"attack"
Score: 0.01666666753590107	 Topic: 0.019*"council" + 0.015*"power" + 0.013*"farmer" + 0.012*"busi" + 0.011*"guilti"

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24