<a href="https://colab.research.google.com/github/diem-ai/topic-modeling/blob/master/Topic_Modeling_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Introduction

Every document we read can be thought of as consisting of many topics stacked on top of one another. In this notebook, we unpack these topics using NLP techniques:
- Latent Dirichlet Allocation (LDA) and topic modeling
- The data was collected from https://www.reuters.com/breakingviews with a scraping script.
- The goal is to break text documents down into topics, where each topic is described by its words.
- What is a latent feature? Mathematically, we want to find “topics”: collections of words that tend to appear in similar documents.
  More generally, a latent feature is one that is not observed directly but is inferred from a collection of observed features in a dataset.
- Several libraries implement LDA, such as scikit-learn and gensim. I chose gensim for this project.

#### Project tasks:
- Cleaning the dataset & Lemmatization
- Create a dictionary from the processed data
- Create a corpus and an LDA model with bag of words
- Create a corpus and an LDA model with TF-IDF
- Calculate the perplexity and topic coherence of the two models
- Visualize topics with the help of pyLDAvis


#### Google Colab Setup

In [22]:
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [23]:
!pip install PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)





In [0]:
# Download accessory_functions.py from Google Drive into the Colab runtime
#https://drive.google.com/open?id=1S7URZIBq4zMh5QWv0qXPHv4ixhgHWN_y

my_module = drive.CreateFile({'id':'1S7URZIBq4zMh5QWv0qXPHv4ixhgHWN_y'})
my_module.GetContentFile('accessory_functions.py')

In [25]:
!pip install unidecode



In [26]:
!pip install pyLDAvis



<p>Data Path & Model parameters</p>

In [0]:
datapath = '/content/drive/My Drive/data/'
n_topics = 50
iterations = 50

<p> Import Libraries </p>

In [0]:
import numpy as np
import string
import pandas as pd
import unidecode

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from accessory_functions import read_pickle_file


"""


import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
"""

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline
# Make all my plots 538 Style
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore')


<p> Loading data</p>

In [0]:
processed_docs = read_pickle_file(datapath + 'processed_docs.pkl')
bow = read_pickle_file(datapath + 'bow.pkl')
tfidf = read_pickle_file(datapath + 'tfidf.pkl')
dictionary = read_pickle_file(datapath + 'dictionary.pkl')

In [30]:
print(processed_docs[1])



['trade', 'flow', 'traditional', 'economic', 'measure', 'reveal', 'true', 'cost', 'tariff', 'washington', 'friday', 'hike', 'duty', 'billion', 'chinese', 'good', 'side', 'intimate', 'survive', 'blow', 'may', 'slightly', 'dent', 'gdp', 'shift', 'supply', 'chain', 'though', 'show', 'extent', 'long', 'term', 'loss']


In [31]:
doc = bow[1]

print(doc)

# Map each token id back to its word (token_id avoids shadowing the built-in id)
[(dictionary[token_id], count) for token_id, count in doc]

[(4, 1), (10, 1), (33, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1)]


[('billion', 1),
 ('friday', 1),
 ('trade', 1),
 ('blow', 1),
 ('chain', 1),
 ('chinese', 1),
 ('cost', 1),
 ('dent', 1),
 ('duty', 1),
 ('economic', 1),
 ('extent', 1),
 ('flow', 1),
 ('gdp', 1),
 ('good', 1),
 ('hike', 1),
 ('intimate', 1),
 ('long', 1),
 ('loss', 1),
 ('may', 1),
 ('measure', 1),
 ('reveal', 1),
 ('shift', 1),
 ('show', 1),
 ('side', 1),
 ('slightly', 1),
 ('supply', 1),
 ('survive', 1),
 ('tariff', 1),
 ('term', 1),
 ('though', 1),
 ('traditional', 1),
 ('true', 1),
 ('washington', 1)]
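The pairs above are `(token_id, count)` tuples. As a rough illustration of how such a representation is built, here is a minimal pure-Python sketch. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helper names, the documents are made up, and the ids it assigns will differ from those stored in `dictionary.pkl`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy documents (made up for illustration, not taken from the Reuters data).
docs = [
    ["trade", "tariff", "china", "trade"],
    ["bank", "bond", "china"],
]
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus[0])
```

Tokens that are not in the vocabulary are silently dropped, which matters later when scoring unseen text.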

In [32]:
doc = tfidf[1]

print(doc)

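# Map each token id back to its word; the second value is a TF-IDF weight, not a frequency
[(dictionary[token_id], weight) for token_id, weight in doc]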

[(4, 0.05775843375562911), (10, 0.12310661992926017), (33, 0.10439212162929286), (38, 0.1961303786255056), (39, 0.17310433437273764), (40, 0.10148900465133513), (41, 0.1199009983425769), (42, 0.23725627027918395), (43, 0.21614153934185534), (44, 0.1071217734433688), (45, 0.28633219743654287), (46, 0.1961303786255056), (47, 0.1961303786255056), (48, 0.1091949636509986), (49, 0.1923966831387611), (50, 0.28633219743654287), (51, 0.11962217757715381), (52, 0.1524456522665324), (53, 0.06964385147961394), (54, 0.17420790459371643), (55, 0.19813974994500413), (56, 0.1765129312952131), (57, 0.12161692894792232), (58, 0.1829493640280444), (59, 0.21010281715361415), (60, 0.18157469132600734), (61, 0.21010281715361415), (62, 0.16023546606560787), (63, 0.13550592942070178), (64, 0.11798968502713748), (65, 0.19813974994500413), (66, 0.20487183805983358), (67, 0.14595088124716785)]


[('billion', 0.05775843375562911),
 ('friday', 0.12310661992926017),
 ('trade', 0.10439212162929286),
 ('blow', 0.1961303786255056),
 ('chain', 0.17310433437273764),
 ('chinese', 0.10148900465133513),
 ('cost', 0.1199009983425769),
 ('dent', 0.23725627027918395),
 ('duty', 0.21614153934185534),
 ('economic', 0.1071217734433688),
 ('extent', 0.28633219743654287),
 ('flow', 0.1961303786255056),
 ('gdp', 0.1961303786255056),
 ('good', 0.1091949636509986),
 ('hike', 0.1923966831387611),
 ('intimate', 0.28633219743654287),
 ('long', 0.11962217757715381),
 ('loss', 0.1524456522665324),
 ('may', 0.06964385147961394),
 ('measure', 0.17420790459371643),
 ('reveal', 0.19813974994500413),
 ('shift', 0.1765129312952131),
 ('show', 0.12161692894792232),
 ('side', 0.1829493640280444),
 ('slightly', 0.21010281715361415),
 ('supply', 0.18157469132600734),
 ('survive', 0.21010281715361415),
 ('tariff', 0.16023546606560787),
 ('term', 0.13550592942070178),
 ('though', 0.11798968502713748),
 ('traditiona
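How `tfidf.pkl` was produced is not shown in this notebook. Assuming it came from gensim's `TfidfModel` with default settings, each weight is the term count multiplied by `log2(N / df)`, and each document vector is then scaled to unit length. The sketch below reproduces that arithmetic on toy data; `tfidf_weights` is a hypothetical helper, not part of gensim.

```python
import math

def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(bow_corpus)
    # Document frequency: number of documents that contain each token id.
    df = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        raw = [(token_id, count * math.log2(n_docs / df[token_id]))
               for token_id, count in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        if norm == 0:
            weighted_corpus.append([])
            continue
        # Tokens that occur in every document get weight 0 and are dropped.
        weighted_corpus.append([(t, w / norm) for t, w in raw if w > 0.0])
    return weighted_corpus

# Toy bag-of-words corpus (made up for illustration).
toy_bow = [
    [(0, 2), (1, 1), (2, 1)],
    [(2, 1), (3, 1), (4, 1)],
]
for doc in tfidf_weights(toy_bow):
    print(doc)
```

Because of the normalization, the squared weights of each document sum to 1, so rare words receive the largest share of a document's weight.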

<p>Build LDA Model with Bag-of-Words and Calculate Perplexity</p>

In [0]:
lda_bow = gensim.models.LdaModel(bow
                                 , num_topics=n_topics
                                 , id2word=dictionary
                                 , iterations=iterations)

<p>Print 10 of the topics</p>

In [34]:
topics = lda_bow.print_topics(10)

for idx, topic in topics:
  print("topic: {}\n {}".format(idx, topic))
  
#[print("topic: {}\n {}".format(idx, topic)) for idx, topic in topics]

topic: 1
 0.009*"billion" + 0.008*"inc" + 0.007*"job" + 0.007*"trln" + 0.006*"election" + 0.006*"make" + 0.006*"prove" + 0.006*"news" + 0.005*"presidential" + 0.005*"interview"
topic: 26
 0.015*"billion" + 0.007*"investor" + 0.007*"tv" + 0.007*"company" + 0.006*"vote" + 0.006*"year" + 0.006*"though" + 0.005*"shareholder" + 0.005*"could" + 0.005*"new"
topic: 20
 0.010*"billion" + 0.009*"president" + 0.008*"state" + 0.007*"trump" + 0.007*"still" + 0.006*"like" + 0.006*"deal" + 0.006*"china" + 0.005*"investment" + 0.005*"trade"
topic: 29
 0.014*"bank" + 0.008*"bond" + 0.008*"could" + 0.008*"president" + 0.007*"largely" + 0.007*"euro" + 0.006*"european" + 0.006*"billion" + 0.006*"criminal" + 0.006*"price"
topic: 39
 0.010*"company" + 0.010*"billion" + 0.007*"apple" + 0.006*"year" + 0.006*"cash" + 0.006*"capitol" + 0.005*"one" + 0.005*"executive" + 0.005*"mid" + 0.005*"go"
topic: 3
 0.008*"market" + 0.007*"elect" + 0.006*"create" + 0.006*"new" + 0.006*"chamber" + 0.006*"chinese" + 0.006*"es

In [35]:
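# Compute the per-word likelihood bound.
# log_perplexity returns this bound, not the perplexity itself: perplexity = 2 ** (-bound),
# so a bound closer to zero (a lower perplexity) indicates a better fit.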
print('\nPerplexity: ', lda_bow.log_perplexity(bow)) 

# Compute Coherence Score
coherence_lda_bow = CoherenceModel(model=lda_bow
                                     , corpus=bow
                                     , texts = processed_docs
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_lda = coherence_lda_bow.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -9.695302707640202

Coherence Score:  0.3247636975491756


<p>Visualize the topics-keywords</p>

In [36]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_bow, bow, dictionary)
vis

#### LDA model with TF-IDF


In [0]:
lda_tfidf = gensim.models.LdaModel(tfidf
                                       , num_topics=n_topics
                                       , id2word=dictionary
                                       , iterations=iterations)

<p>Print 5 of the topics</p>

In [38]:
#pprint(lda_model.print_topics())
for idx, topic in lda_tfidf.print_topics(5):
    print('\nTopic: {}\nWords: {}'.format(idx, topic))



Topic: 2
Words: 0.009*"supreme" + 0.008*"court" + 0.007*"sexual" + 0.006*"federal" + 0.005*"allegation" + 0.005*"dec" + 0.005*"oversee" + 0.004*"assault" + 0.004*"governor" + 0.004*"antonio"

Topic: 14
Words: 0.007*"lawsuit" + 0.006*"court" + 0.005*"jia" + 0.005*"supreme" + 0.004*"security" + 0.004*"authority" + 0.004*"violate" + 0.004*"cry" + 0.004*"female" + 0.003*"season"

Topic: 45
Words: 0.004*"jeff" + 0.004*"fraud" + 0.004*"proceed" + 0.004*"money" + 0.004*"approach" + 0.003*"online" + 0.003*"question" + 0.003*"package" + 0.003*"answer" + 0.003*"information"

Topic: 21
Words: 0.006*"ethic" + 0.005*"accept" + 0.004*"automaker" + 0.004*"accuse" + 0.004*"indian" + 0.004*"truth" + 0.004*"boy" + 0.003*"senator" + 0.003*"ellison" + 0.003*"deal"

Topic: 23
Words: 0.006*"presidential" + 0.005*"goldman" + 0.005*"russian" + 0.004*"bond" + 0.004*"lloyd" + 0.004*"sachs" + 0.004*"nbc" + 0.004*"election" + 0.004*"interference" + 0.004*"morgan"


#### Compute Perplexity & Coherence Score

In [39]:
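# Compute the per-word likelihood bound (perplexity = 2 ** (-bound); closer to zero is better).
# This bound is normalized by the total TF-IDF weight rather than by word counts,
# so it is not directly comparable with the bag-of-words value.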
print('\nPerplexity: ', lda_tfidf.log_perplexity(tfidf)) 

# Compute Coherence Score
coherence_model_tfidf = CoherenceModel(model=lda_tfidf
                                     , corpus=tfidf
                                     , texts = processed_docs
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_tfidf = coherence_model_tfidf.get_coherence()
print('\nCoherence Score: ', coherence_tfidf)



Perplexity:  -17.246296709852626

Coherence Score:  0.5328383044511587


#### Visualize the topics

In [40]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_tfidf, tfidf, dictionary)
vis

<p>Test model with unseen data</p>

In [41]:
text = 'Uber Technologies lackluster stock-market debut is a warning for other tech unicorns'

# Note: text.split() skips the cleaning and lemmatization applied to the training documents,
# so tokens such as 'Uber' or 'stock-market' may not match any dictionary entry.
unseen_doc = dictionary.doc2bow(text.split())

# Each entry of vector is a (topic id, probability) pair.
vector = lda_bow[unseen_doc]

# Print the top words of every topic assigned to the document.
for idx, score in vector:
  print("topic: {} words: {}".format(idx, lda_bow.print_topic(idx)))



topic: 10 words: 0.015*"president" + 0.014*"fargo" + 0.011*"kavanaugh" + 0.011*"donald" + 0.010*"trump" + 0.010*"brett" + 0.008*"investment" + 0.008*"fake" + 0.008*"murder" + 0.007*"executive"
topic: 24 words: 0.017*"senate" + 0.009*"year" + 0.008*"leader" + 0.008*"could" + 0.007*"government" + 0.007*"bill" + 0.007*"party" + 0.006*"capitalist" + 0.006*"may" + 0.005*"european"
topic: 27 words: 0.009*"bank" + 0.008*"financial" + 0.007*"china" + 0.007*"could" + 0.007*"year" + 0.006*"may" + 0.006*"billion" + 0.005*"come" + 0.005*"well" + 0.005*"make"
topic: 44 words: 0.008*"president" + 0.008*"presidential" + 0.007*"counsel" + 0.007*"special" + 0.006*"china" + 0.006*"mueller" + 0.006*"market" + 0.006*"group" + 0.006*"make" + 0.006*"election"
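`text.split()` keeps capitalization and punctuation, while the training tokens shown earlier are all lowercase. The sketch below normalizes unseen text before the dictionary lookup. It uses a simple regex tokenizer as a stand-in for the notebook's actual cleaning and lemmatization pipeline, which is not shown here; `simple_tokens` and the tiny `known_vocab` set are hypothetical.

```python
import re

def simple_tokens(text):
    """Lowercase the text and keep alphabetic runs of at least two letters."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= 2]

text = "Uber Technologies lackluster stock-market debut is a warning for other tech unicorns"

# Hypothetical vocabulary standing in for the gensim dictionary.
known_vocab = {"uber", "market", "stock", "debut", "tech", "warning"}

# Tokens that would survive the lookup with and without normalization.
naive = [tok for tok in text.split() if tok in known_vocab]
cleaned = [tok for tok in simple_tokens(text) if tok in known_vocab]

print("naive split matches:", naive)
print("normalized matches: ", cleaned)
```

With the real model, the normalized tokens would be passed to `dictionary.doc2bow(...)` in place of `text.split()`.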


In [42]:
text = 'Uber Technologies lackluster stock-market debut is a warning for other tech unicorns'

# Note: lda_tfidf was trained on TF-IDF weights, but this vector holds raw counts.
unseen_doc = dictionary.doc2bow(text.split())

# Each entry of vector is a (topic id, probability) pair.
vector = lda_tfidf[unseen_doc]

# Print the top words of every topic assigned to the document.
for idx, score in vector:
  print("topic: {} words: {}".format(idx, lda_tfidf.print_topic(idx)))


topic: 9 words: 0.006*"abortion" + 0.005*"agency" + 0.005*"northern" + 0.004*"ireland" + 0.004*"strict" + 0.004*"de" + 0.003*"tank" + 0.003*"facto" + 0.003*"necessary" + 0.003*"make"
topic: 15 words: 0.009*"tower" + 0.005*"kalanick" + 0.005*"travis" + 0.004*"toll" + 0.004*"mexican" + 0.004*"serve" + 0.004*"shift" + 0.003*"korea" + 0.003*"reduce" + 0.003*"possibly"
topic: 42 words: 0.007*"le" + 0.005*"pyongyang" + 0.005*"la" + 0.004*"hour" + 0.004*"operation" + 0.004*"store" + 0.004*"hinge" + 0.004*"propaganda" + 0.004*"alibaba" + 0.003*"queue"
topic: 46 words: 0.006*"merkel" + 0.006*"attorney" + 0.006*"department" + 0.005*"germany" + 0.004*"mueller" + 0.004*"robert" + 0.004*"counsel" + 0.004*"confused" + 0.004*"justice" + 0.004*"angela"
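Indexing an LDA model with a bag-of-words vector yields `(topic_id, probability)` pairs, so the topics can be ranked by how strongly each one is assigned to the document. The sketch below uses made-up probabilities purely to show the sorting step; the real values would come from `lda_bow[unseen_doc]` or `lda_tfidf[unseen_doc]`.

```python
# Made-up (topic_id, probability) pairs for illustration only.
vector = [(9, 0.12), (15, 0.41), (42, 0.08), (46, 0.27)]

# Sort by probability, highest first.
ranked = sorted(vector, key=lambda pair: pair[1], reverse=True)

for topic_id, prob in ranked:
    print("topic: {} score: {:.2f}".format(topic_id, prob))

# The first entry is the topic most strongly associated with the document.
best_topic, best_prob = ranked[0]
```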


<p>Given the same text, the two models return different topics.</p>