Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
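
To make the two Dirichlet-distributed pieces concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The vocabulary, the number of topics, and the concentration values are made-up illustration values, not anything taken from the dataset used below, and this is not how gensim trains a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Topics-per-document distribution for this document
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(2018)
vocab = ["rain", "bushfir", "elect", "vote", "market", "share"]  # toy vocabulary
num_topics = 3
# Words-per-topic distributions, each drawn from a Dirichlet over the vocabulary
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word, [0.5] * num_topics, vocab, 8, rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the per-topic word distributions and the per-document topic mixtures.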

In [1]:
import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [3]:
# Copy the column so that adding 'index' does not trigger SettingWithCopyWarning
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text
documents.head()

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


In [4]:
documents.shape

(1226258, 2)

### Data Pre-processing
We will perform the following steps:
- Tokenization: split the text into words, lowercase them, and remove punctuation.
- Words with 3 or fewer characters are removed.
- All stopwords are removed.
- Words are lemmatized: words in third person are changed to first person, and verbs in past and future tenses are changed to present.
- Words are stemmed: words are reduced to their root form.
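
The steps above can be sketched without any third-party library. The snippet below is only an illustration: `TOY_STOPWORDS` is a hypothetical, tiny stopword list, `toy_stem` is a crude suffix stripper standing in for a real stemmer (it is not the Porter algorithm), and lemmatization is left out entirely.

```python
import re

# Hypothetical, tiny stopword list for illustration; real lists are much larger.
TOY_STOPWORDS = {"the", "and", "for", "with", "from", "that", "this", "must", "against"}

def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_preprocess(text, min_len=4):
    """Lowercase, tokenize on letters, drop short tokens and stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

print(toy_preprocess("Rain helps dampen bushfires"))
# prints ['rain', 'help', 'dampen', 'bushfir']
```

The toy stemmer gets this headline right, but it will diverge from a real stemmer on many other words, which is why the notebook relies on nltk and gensim for the actual pipeline.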

### Loading gensim and nltk libraries

In [13]:
from nltk.stem.porter import PorterStemmer
# stem() is an instance method, so instantiate the stemmer and pass it a word
PorterStemmer().stem('bushfires')

In [14]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import numpy as np
import nltk

np.random.seed(2018)  # seed NumPy's random number generator
nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [35]:
def lemmatize_stemming(text):
    st = PorterStemmer()
    return st.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [37]:
doc_sample = documents[documents['index'] == 4309].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['rain', 'helps', 'dampen', 'bushfires']


 tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [38]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0               [decid, commun, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Bag of Words on the Data set

Create a dictionary from ‘processed_docs’ that maps each unique word in the training set to an integer id and tracks how often it occurs.
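
As a rough picture of what such a dictionary holds, the sketch below builds a token-to-id mapping and per-token document frequencies in plain Python. The helper name `build_dictionary` is hypothetical, and assigning ids in sorted order within each document is an assumption made for this illustration, not a statement about gensim's internals.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign integer ids to tokens and count how many documents contain each token."""
    token2id = {}
    doc_freq = Counter()
    for doc in docs:
        # Sorting the unique tokens of a document gives a deterministic id order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
            doc_freq[token] += 1
    return token2id, doc_freq

docs = [
    ["decid", "commun", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["staff", "aust", "strike", "rise"],
    ["strike", "affect", "australian", "travel"],
]
token2id, doc_freq = build_dictionary(docs)
print(token2id["broadcast"], token2id["commun"], token2id["decid"], token2id["licenc"])
print(doc_freq["strike"])
```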

In [39]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x18fd1f1abe0>

In [40]:
count = 0
for k, v in dictionary.items():  # items() yields (token id, token) pairs
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 commun
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


## Gensim filter_extremes

Filter out tokens that appear in
- fewer than 15 documents (absolute number) or
- more than 50% of documents (a fraction of the total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent tokens.
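
The filtering rule can be sketched on a plain dictionary of document frequencies. The function `toy_filter_extremes` and the counts below are made up for illustration; the sketch only mirrors the three parameters described above and is not gensim's implementation.

```python
def toy_filter_extremes(doc_freq, num_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the tokens kept after document-frequency filtering, most frequent first."""
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Rank by document frequency (ties broken alphabetically), then truncate
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

# Made-up document frequencies for a corpus of 100 documents
doc_freq = {"polic": 60, "rain": 20, "bushfir": 15, "olsson": 2, "news": 90}
print(toy_filter_extremes(doc_freq, num_docs=100))
# prints ['rain', 'bushfir']
```

Here "polic" and "news" are dropped for appearing in more than half of the documents, and "olsson" is dropped for appearing in fewer than 15.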

In [41]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

## Gensim doc2bow

For each document we create a list of (token id, count) pairs reporting which dictionary words appear in it and how many times. Save this to ‘bow_corpus’, then check the document we selected earlier.
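
A bag-of-words vector of this kind can be sketched in a few lines of plain Python. The helper `toy_doc2bow` is hypothetical; the token ids are the ones that appear for our sample document in this notebook, and "unknownword" is added only to show that tokens missing from the dictionary are skipped.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"bushfir": 76, "help": 112, "rain": 484, "dampen": 4051}
print(toy_doc2bow(["rain", "help", "dampen", "bushfir", "unknownword"], token2id))
# prints [(76, 1), (112, 1), (484, 1), (4051, 1)]
```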

In [45]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4309]

[(76, 1), (112, 1), (484, 1), (4051, 1)]

In [47]:
bow_corpus[:10]

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1)],
 [(14, 1), (15, 1), (16, 1), (17, 1)],
 [(18, 1), (19, 1), (20, 1), (21, 1)],
 [(22, 1), (23, 1), (24, 1), (25, 1), (26, 1)],
 [(27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)],
 [(11, 1), (33, 1), (34, 1), (35, 1), (36, 1)],
 [(37, 1), (38, 1), (39, 1)]]

In [48]:
bow_corpus[-1]

[(360, 1), (1134, 1), (1188, 1), (3238, 1), (12066, 1)]

Preview Bag Of Words for our sample preprocessed document.

In [46]:
bow_doc_4310 = bow_corpus[4309]
for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], token_count))

Word 76 ("bushfir") appears 1 time.
Word 112 ("help") appears 1 time.
Word 484 ("rain") appears 1 time.
Word 4051 ("dampen") appears 1 time.


## TF-IDF

Create a tf-idf model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.
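
To show where such scores come from, here is a small worked sketch. It assumes the weighting we believe TfidfModel uses by default (raw term count times a base-2 log inverse document frequency, followed by normalization to unit length); other settings would give different numbers. The helper `toy_tfidf` and the document frequencies are made up for illustration.

```python
import math

def toy_tfidf(bow, doc_freq, num_docs):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize."""
    weighted = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

# Made-up document frequencies for a corpus of 16 documents
doc_freq = {0: 2, 1: 8, 2: 4}
vec = toy_tfidf([(0, 1), (1, 1), (2, 2)], doc_freq, num_docs=16)
for token_id, weight in vec:
    print(token_id, round(weight, 4))
```

The raw weights here are 3, 1 and 4 (so the norm is the square root of 26), which means the rarest token and the repeated token dominate the normalized vector. The squared scores of a normalized document sum to 1.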

In [49]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5852942020878993),
 (1, 0.38405854933668493),
 (2, 0.5017732999224691),
 (3, 0.5080878695349914)]


## Running LDA using Bag of Words
Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.

In [51]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

For each topic, we will explore the words occurring in that topic and their relative weights.


In [52]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.027*"donald" + 0.025*"kill" + 0.023*"market" + 0.022*"attack" + 0.018*"border" + 0.014*"morrison" + 0.013*"dead" + 0.013*"interview" + 0.010*"find" + 0.009*"share"
Topic: 1 
Words: 0.032*"sydney" + 0.032*"polic" + 0.029*"year" + 0.026*"news" + 0.022*"record" + 0.017*"rise" + 0.017*"hous" + 0.017*"investig" + 0.016*"death" + 0.013*"shoot"
Topic: 2 
Words: 0.033*"china" + 0.019*"miss" + 0.017*"work" + 0.015*"australia" + 0.015*"countri" + 0.014*"victim" + 0.013*"releas" + 0.012*"season" + 0.012*"search" + 0.012*"chines"
Topic: 3 
Words: 0.019*"women" + 0.018*"feder" + 0.016*"peopl" + 0.016*"farmer" + 0.015*"health" + 0.014*"win" + 0.014*"labor" + 0.013*"take" + 0.011*"servic" + 0.011*"beat"
Topic: 4 
Words: 0.023*"bushfir" + 0.023*"school" + 0.021*"canberra" + 0.017*"busi" + 0.014*"student" + 0.011*"darwin" + 0.011*"adelaid" + 0.009*"vaccin" + 0.009*"port" + 0.009*"black"
Topic: 5 
Words: 0.046*"trump" + 0.034*"elect" + 0.024*"live" + 0.015*"island" + 0.013*"say" + 0.0
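
The topic strings printed above are convenient to read but awkward to process. As a small sketch, the hypothetical helper `parse_topic` below turns a string in that `weight*"word" + ...` format into (word, weight) pairs using only a regular expression.

```python
import re

def parse_topic(topic_str):
    """Parse a string like '0.027*"donald" + 0.025*"kill"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.023*"bushfir" + 0.023*"school" + 0.021*"canberra"'
print(parse_topic(topic))
# prints [('bushfir', 0.023), ('school', 0.023), ('canberra', 0.021)]
```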

## Running LDA using TF-IDF

In [53]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWord: {}'.format(idx, topic))

Topic: 0 
Word: 0.019*"coronaviru" + 0.013*"covid" + 0.012*"countri" + 0.010*"hour" + 0.009*"market" + 0.009*"queensland" + 0.008*"australia" + 0.008*"weather" + 0.007*"china" + 0.007*"australian"
Topic: 1 
Word: 0.035*"trump" + 0.018*"interview" + 0.010*"search" + 0.010*"monday" + 0.010*"wednesday" + 0.009*"scott" + 0.008*"miss" + 0.007*"daniel" + 0.007*"extend" + 0.006*"flight"
Topic: 2 
Word: 0.027*"news" + 0.009*"sport" + 0.009*"financ" + 0.009*"busi" + 0.009*"david" + 0.008*"nation" + 0.007*"rural" + 0.007*"energi" + 0.007*"grandstand" + 0.007*"gener"
Topic: 3 
Word: 0.017*"polic" + 0.017*"charg" + 0.015*"murder" + 0.012*"crash" + 0.012*"woman" + 0.011*"court" + 0.011*"death" + 0.010*"kill" + 0.010*"jail" + 0.010*"shoot"
Topic: 4 
Word: 0.011*"donald" + 0.010*"final" + 0.009*"world" + 0.009*"australia" + 0.009*"royal" + 0.007*"leagu" + 0.007*"andrew" + 0.007*"live" + 0.007*"beat" + 0.006*"cricket"
Topic: 5 
Word: 0.013*"elect" + 0.009*"tuesday" + 0.007*"lockdown" + 0.007*"liber" +

## Classification of the topics
First, recall the preprocessed sample document that we will classify.

In [55]:

processed_docs[4309]

['rain', 'help', 'dampen', 'bushfir']

### Performance evaluation by classifying sample document using LDA Bag of Words model

In [56]:
for index, score in sorted(lda_model[bow_corpus[4309]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.5549039244651794	 
Topic: 0.053*"coronaviru" + 0.043*"australia" + 0.035*"queensland" + 0.028*"victoria" + 0.022*"covid" + 0.019*"south" + 0.017*"coast" + 0.016*"tasmania" + 0.015*"home" + 0.013*"final"

Score: 0.2850033640861511	 
Topic: 0.023*"bushfir" + 0.023*"school" + 0.021*"canberra" + 0.017*"busi" + 0.014*"student" + 0.011*"darwin" + 0.011*"adelaid" + 0.009*"vaccin" + 0.009*"port" + 0.009*"black"

Score: 0.020016008988022804	 
Topic: 0.019*"women" + 0.018*"feder" + 0.016*"peopl" + 0.016*"farmer" + 0.015*"health" + 0.014*"win" + 0.014*"labor" + 0.013*"take" + 0.011*"servic" + 0.011*"beat"

Score: 0.020013820379972458	 
Topic: 0.028*"govern" + 0.018*"crash" + 0.017*"chang" + 0.014*"restrict" + 0.014*"plan" + 0.014*"hospit" + 0.012*"fund" + 0.012*"council" + 0.011*"region" + 0.011*"nation"

Score: 0.020011354237794876	 
Topic: 0.033*"china" + 0.019*"miss" + 0.017*"work" + 0.015*"australia" + 0.015*"countri" + 0.014*"victim" + 0.013*"releas" + 0.012*"season" + 0.012*"searc

### Performance evaluation by classifying sample document using LDA TF-IDF model

In [57]:



for index, score in sorted(lda_model_tfidf[bow_corpus[4309]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.4920821189880371	 
Topic: 0.019*"coronaviru" + 0.013*"covid" + 0.012*"countri" + 0.010*"hour" + 0.009*"market" + 0.009*"queensland" + 0.008*"australia" + 0.008*"weather" + 0.007*"china" + 0.007*"australian"

Score: 0.34785884618759155	 
Topic: 0.014*"drum" + 0.013*"bushfir" + 0.012*"victorian" + 0.012*"coronaviru" + 0.010*"victoria" + 0.007*"violenc" + 0.007*"home" + 0.007*"coast" + 0.006*"burn" + 0.006*"domest"

Score: 0.020010540261864662	 
Topic: 0.009*"health" + 0.009*"rural" + 0.008*"morrison" + 0.007*"care" + 0.007*"friday" + 0.006*"coronaviru" + 0.006*"indigen" + 0.006*"age" + 0.006*"mental" + 0.006*"fund"

Score: 0.020007790997624397	 
Topic: 0.009*"restrict" + 0.008*"zealand" + 0.008*"coronaviru" + 0.006*"octob" + 0.006*"fiji" + 0.006*"biden" + 0.006*"teacher" + 0.006*"australia" + 0.005*"asylum" + 0.005*"action"

Score: 0.02000746876001358	 
Topic: 0.008*"govern" + 0.007*"updat" + 0.007*"tasmania" + 0.007*"juli" + 0.006*"council" + 0.005*"plan" + 0.005*"farm" + 0.00

In [None]:
# In each output above, the test document has the highest probability of belonging to the topic listed first
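
The ranking used in the two cells above boils down to sorting (topic id, score) pairs by score. A minimal sketch follows; the topic ids are hypothetical placeholders and the scores are rounded from the Bag of Words output above.

```python
def rank_topics(topic_scores):
    """Sort (topic_id, score) pairs from most to least probable."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical topic ids; scores rounded from the Bag of Words output above
scores = [(4, 0.285), (7, 0.555), (3, 0.020)]
ranked = rank_topics(scores)
best_topic, best_score = ranked[0]
print(best_topic, best_score)
# prints 7 0.555
```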

In [58]:
# Testing model on unseen document
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.49834156036376953	 Topic: 0.033*"china" + 0.019*"miss" + 0.017*"work" + 0.015*"australia" + 0.015*"countri"
Score: 0.20165833830833435	 Topic: 0.050*"australian" + 0.015*"coronaviru" + 0.013*"fight" + 0.013*"royal" + 0.012*"scott"
Score: 0.18330468237400055	 Topic: 0.023*"bushfir" + 0.023*"school" + 0.021*"canberra" + 0.017*"busi" + 0.014*"student"
Score: 0.016671162098646164	 Topic: 0.028*"govern" + 0.018*"crash" + 0.017*"chang" + 0.014*"restrict" + 0.014*"plan"
Score: 0.016670942306518555	 Topic: 0.053*"coronaviru" + 0.043*"australia" + 0.035*"queensland" + 0.028*"victoria" + 0.022*"covid"
Score: 0.016670912504196167	 Topic: 0.019*"women" + 0.018*"feder" + 0.016*"peopl" + 0.016*"farmer" + 0.015*"health"
Score: 0.016670826822519302	 Topic: 0.032*"sydney" + 0.032*"polic" + 0.029*"year" + 0.026*"news" + 0.022*"record"
Score: 0.016670534387230873	 Topic: 0.046*"trump" + 0.034*"elect" + 0.024*"live" + 0.015*"island" + 0.013*"say"
Score: 0.016670512035489082	 Topic: 0.027*"donald"

[link](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)