This notebook follows the tutorial from this [post](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) by Susan Li.

In [2]:
import pandas as pd
import numpy as np
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

np.random.seed(2018)

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brianesamson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
data = pd.read_csv("../raw/abcnews-date-text.csv", error_bad_lines=False)
data_text = data[['headline_text']]
data_text['index'] = data_text.index
documents = data_text

documents

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4
5,ambitious olsson wins triple jump,5
6,antic delighted with record breaking barca,6
7,aussie qualifier stosur wastes four memphis match,7
8,aust addresses un security council over iraq,8
9,australia is locked into war timetable opp,9


**Lemmatize**

In [4]:
WordNetLemmatizer().lemmatize('went', pos='v')

'go'

**Stemming**

In [5]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
stemmed = pd.DataFrame(data = {'original word': original_words, 'stemmed': singles})
stemmed

Unnamed: 0,original word,stemmed
0,caresses,caress
1,flies,fli
2,dies,die
3,mules,mule
4,denied,deni
5,died,die
6,agreed,agre
7,owned,own
8,humbled,humbl
9,sized,size


In [8]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

**Preprocessing**

- Tokenization
- Lemmatization
- Stemming
- Remove stopwords
- Remove words with <= 3 characters

In [12]:
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('Original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
   
print(words)
print('\n\nTokenized and lemmatized document: ')
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [14]:
processed_docs = documents['headline_text'].map(preprocess)

In [15]:
processed_docs

0                      [decid, communiti, broadcast, licenc]
1                                         [wit, awar, defam]
2                     [call, infrastructur, protect, summit]
3                                [staff, aust, strike, rise]
4                       [strike, affect, australian, travel]
5                         [ambiti, olsson, win, tripl, jump]
6                     [antic, delight, record, break, barca]
7              [aussi, qualifi, stosur, wast, memphi, match]
8                      [aust, address, secur, council, iraq]
9                                   [australia, lock, timet]
10                     [australia, contribut, million, iraq]
11                 [barca, record, robson, celebr, birthday]
12                                   [bathhous, plan, ahead]
13                     [hop, launceston, cycl, championship]
14                       [plan, boost, paroo, water, suppli]
15                       [blizzard, buri, unit, state, bill]
16                 [brig

**Create Bag of Words on the dataset**

Get all the unique words and assign an ID to them.

In [16]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [17]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


**Filter words** | [Source](https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html)

Filter out tokens that appear in

- less than `no_below` documents (absolute number) or
- more than `no_above` documents (fraction of total corpus size, not absolute number).
- after (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if None).

In [18]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

**Convert corpus to used the BoW IDs and count the word frequency per document**

In [20]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(76, 1), (112, 1), (483, 1), (4014, 1)]

In [34]:
bow_doc_4310 = bow_corpus[32]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 34 ("council") appears 1 time.
Word 129 ("welcom") appears 1 time.
Word 130 ("breakthrough") appears 1 time.
Word 131 ("insur") appears 1 time.


TF-IDF
===

First, calculate the inverse document counts for all terms in the training corpus.

In [35]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

Then, transform the count representations into the Tfidf space.

In [36]:
corpus_tfidf = tfidf[bow_corpus]

In [40]:
from pprint import pprint
    
pprint(corpus_tfidf[0])

[(0, 0.58929086447099832),
 (1, 0.38929657403503015),
 (2, 0.4964985198530063),
 (3, 0.50465203286956617)]


LDA using Bag of Words
===

In [44]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [47]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.028*"queensland" + 0.021*"hous" + 0.020*"school" + 0.015*"chang" + 0.014*"child" + 0.012*"abus" + 0.011*"driver" + 0.010*"concern" + 0.009*"worker" + 0.009*"liber"
Topic: 1 
Words: 0.019*"nation" + 0.016*"health" + 0.013*"indigen" + 0.012*"servic" + 0.012*"gold" + 0.011*"communiti" + 0.011*"feder" + 0.011*"help" + 0.011*"busi" + 0.010*"coast"
Topic: 2 
Words: 0.016*"china" + 0.016*"rural" + 0.016*"elect" + 0.015*"govern" + 0.013*"fight" + 0.010*"royal" + 0.010*"prison" + 0.009*"say" + 0.009*"announc" + 0.009*"drum"
Topic: 3 
Words: 0.018*"women" + 0.016*"world" + 0.010*"tasmanian" + 0.010*"record" + 0.010*"life" + 0.010*"fear" + 0.010*"million" + 0.010*"street" + 0.010*"violenc" + 0.009*"year"
Topic: 4 
Words: 0.038*"trump" + 0.019*"attack" + 0.013*"say" + 0.013*"kill" + 0.013*"train" + 0.012*"near" + 0.012*"guilti" + 0.011*"polit" + 0.011*"farm" + 0.011*"time"
Topic: 5 
Words: 0.022*"call" + 0.020*"countri" + 0.016*"turnbul" + 0.016*"water" + 0.014*"test" + 0.014*"s

LDA using TF-IDF
===

In [45]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [46]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.009*"leagu" + 0.008*"final" + 0.008*"australia" + 0.007*"world" + 0.007*"christma" + 0.006*"sexual" + 0.005*"cricket" + 0.005*"test" + 0.005*"david" + 0.005*"jam"
Topic: 1 Word: 0.017*"countri" + 0.015*"hour" + 0.008*"health" + 0.007*"fund" + 0.006*"grandstand" + 0.005*"sport" + 0.005*"abbott" + 0.005*"plead" + 0.005*"plan" + 0.005*"council"
Topic: 2 Word: 0.022*"trump" + 0.011*"donald" + 0.007*"michael" + 0.007*"septemb" + 0.007*"friday" + 0.006*"andrew" + 0.006*"june" + 0.005*"capit" + 0.004*"presid" + 0.004*"bash"
Topic: 3 Word: 0.011*"drum" + 0.010*"weather" + 0.007*"search" + 0.007*"octob" + 0.006*"monday" + 0.006*"malcolm" + 0.006*"stori" + 0.006*"miss" + 0.006*"polic" + 0.005*"rate"
Topic: 4 Word: 0.013*"interview" + 0.010*"turnbul" + 0.008*"marriag" + 0.007*"ash" + 0.006*"asylum" + 0.006*"tuesday" + 0.006*"seeker" + 0.005*"outback" + 0.005*"detent" + 0.004*"footag"
Topic: 5 Word: 0.009*"royal" + 0.007*"commiss" + 0.007*"govern" + 0.005*"social" + 0.005*"univers

Sample classification of documents to topics
===

**Using the LDA Bag of Words model**

In [48]:
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [49]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6199929118156433	 
Topic: 0.019*"nation" + 0.016*"health" + 0.013*"indigen" + 0.012*"servic" + 0.012*"gold" + 0.011*"communiti" + 0.011*"feder" + 0.011*"help" + 0.011*"busi" + 0.010*"coast"

Score: 0.21999980509281158	 
Topic: 0.018*"women" + 0.016*"world" + 0.010*"tasmanian" + 0.010*"record" + 0.010*"life" + 0.010*"fear" + 0.010*"million" + 0.010*"street" + 0.010*"violenc" + 0.009*"year"

Score: 0.020005155354738235	 
Topic: 0.038*"polic" + 0.027*"sydney" + 0.020*"south" + 0.019*"adelaid" + 0.018*"melbourn" + 0.017*"crash" + 0.017*"perth" + 0.017*"death" + 0.015*"die" + 0.014*"woman"

Score: 0.020001305267214775	 
Topic: 0.022*"call" + 0.020*"countri" + 0.016*"turnbul" + 0.016*"water" + 0.014*"test" + 0.014*"student" + 0.011*"john" + 0.011*"parti" + 0.010*"meet" + 0.009*"energi"

Score: 0.020000819116830826	 
Topic: 0.028*"queensland" + 0.021*"hous" + 0.020*"school" + 0.015*"chang" + 0.014*"child" + 0.012*"abus" + 0.011*"driver" + 0.010*"concern" + 0.009*"worker" + 0.009*"li

**Using the LDA TFIDF model**

In [50]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8199766278266907	 
Topic: 0.026*"year" + 0.019*"market" + 0.017*"famili" + 0.015*"rise" + 0.014*"high" + 0.014*"price" + 0.014*"share" + 0.013*"australian" + 0.012*"jail" + 0.011*"sentenc"

Score: 0.020014287903904915	 
Topic: 0.049*"australia" + 0.018*"home" + 0.015*"final" + 0.015*"interview" + 0.014*"open" + 0.014*"tasmania" + 0.013*"win" + 0.011*"island" + 0.011*"leagu" + 0.010*"return"

Score: 0.02000255137681961	 
Topic: 0.019*"nation" + 0.016*"health" + 0.013*"indigen" + 0.012*"servic" + 0.012*"gold" + 0.011*"communiti" + 0.011*"feder" + 0.011*"help" + 0.011*"busi" + 0.010*"coast"

Score: 0.020001647993922234	 
Topic: 0.018*"women" + 0.016*"world" + 0.010*"tasmanian" + 0.010*"record" + 0.010*"life" + 0.010*"fear" + 0.010*"million" + 0.010*"street" + 0.010*"violenc" + 0.009*"year"

Score: 0.02000107429921627	 
Topic: 0.022*"call" + 0.020*"countri" + 0.016*"turnbul" + 0.016*"water" + 0.014*"test" + 0.014*"student" + 0.011*"john" + 0.011*"parti" + 0.010*"meet" + 0.009*"en

**On an unseen document**

In [51]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.2895960509777069	 Topic: 0.038*"trump" + 0.019*"attack" + 0.013*"say" + 0.013*"kill" + 0.013*"train"
Score: 0.21366867423057556	 Topic: 0.026*"year" + 0.019*"market" + 0.017*"famili" + 0.015*"rise" + 0.014*"high"
Score: 0.21339593827724457	 Topic: 0.022*"call" + 0.020*"countri" + 0.016*"turnbul" + 0.016*"water" + 0.014*"test"
Score: 0.18333330750465393	 Topic: 0.038*"polic" + 0.027*"sydney" + 0.020*"south" + 0.019*"adelaid" + 0.018*"melbourn"
Score: 0.016670215874910355	 Topic: 0.016*"china" + 0.016*"rural" + 0.016*"elect" + 0.015*"govern" + 0.013*"fight"
Score: 0.016668127849698067	 Topic: 0.028*"queensland" + 0.021*"hous" + 0.020*"school" + 0.015*"chang" + 0.014*"child"
Score: 0.016667678952217102	 Topic: 0.049*"australia" + 0.018*"home" + 0.015*"final" + 0.015*"interview" + 0.014*"open"
Score: 0.01666666753590107	 Topic: 0.019*"nation" + 0.016*"health" + 0.013*"indigen" + 0.012*"servic" + 0.012*"gold"
Score: 0.01666666753590107	 Topic: 0.018*"women" + 0.016*"world" + 0.010*