<a href="https://colab.research.google.com/github/diem-ai/topic-modeling/blob/master/Topic_Modeling_LSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Introduction

<p>In another notebook, Topic Modeling-LDA.ipynb, we examine the idea of latent spaces and how
Latent Dirichlet Allocation (LDA) can be used to create a topic space. LDA is not the only method for creating latent spaces. In this notebook, we use Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing, LSI) to accomplish the same task. LSA applies a truncated singular value decomposition (SVD) to the document-term matrix, decomposing it into smaller matrices that relate terms and documents to latent topics.</p>
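
The models built later in this notebook use gensim's `LsiModel`, which is based on a singular value decomposition of the document-term matrix. As a rough illustration of what that factorization does, the sketch below applies a truncated SVD to a tiny hand-made term-document matrix with NumPy. The matrix values, the `truncated_svd` helper and the choice of `k=2` are illustrative assumptions, not data or code from this project, and gensim uses its own SVD implementation designed for large sparse corpora.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return the rank-k factors (U_k, S_k, Vt_k) of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


# Toy term-document matrix: rows are terms, columns are documents.
# The counts are made up for illustration.
terms = ["trade", "tariff", "bank", "loan"]
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 2.0],
    [0.0, 0.0, 2.0, 4.0],
])

u_k, s_k, vt_k = truncated_svd(term_doc, k=2)

# Columns of u_k are topics expressed over terms,
# rows of vt_k are the same topics expressed over documents.
approx = u_k @ np.diag(s_k) @ vt_k  # best rank-2 approximation of term_doc
```

Keeping only the two largest singular values groups "trade"/"tariff" into one latent dimension and "bank"/"loan" into another, which is the intuition behind treating each retained dimension as a topic.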

#### Project tasks:
- Clean the dataset and lemmatize (done in notebook model_preparation.ipynb)
- Create a dictionary from the processed data (done in notebook model_preparation.ipynb)
- Create a corpus and an LSA model with bag of words
- Create a corpus and an LSA model with TF-IDF
- Calculate the perplexity and topic coherence of the two models
- Visualize topics with the help of pyLDAvis


#### Google Colab Setup

In [31]:
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [32]:
!pip install PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)





In [0]:
#import accessory_functions.py from Colab
#https://drive.google.com/open?id=1S7URZIBq4zMh5QWv0qXPHv4ixhgHWN_y

my_module = drive.CreateFile({'id':'1S7URZIBq4zMh5QWv0qXPHv4ixhgHWN_y'})
my_module.GetContentFile('accessory_functions.py')

In [0]:
import numpy as np
import string
import pandas as pd
import unidecode

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
import matplotlib.pyplot as plt

from accessory_functions import read_pickle_file


%matplotlib inline
# Make all my plots 538 Style
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore')


<p>Data Path & Model parameters</p>

In [0]:
datapath = '/content/drive/My Drive/data/'
n_topics = 50
iterations = 50

<p> Loading data</p>

In [0]:
processed_docs = read_pickle_file(datapath + 'processed_docs.pkl')
bow = read_pickle_file(datapath + 'bow.pkl')
tfidf = read_pickle_file(datapath + 'tfidf.pkl')
dictionary = read_pickle_file(datapath + 'dictionary.pkl')

In [37]:
print(processed_docs[1])



['trade', 'flow', 'traditional', 'economic', 'measure', 'reveal', 'true', 'cost', 'tariff', 'washington', 'friday', 'hike', 'duty', 'billion', 'chinese', 'good', 'side', 'intimate', 'survive', 'blow', 'may', 'slightly', 'dent', 'gdp', 'shift', 'supply', 'chain', 'though', 'show', 'extent', 'long', 'term', 'loss']


In [38]:
doc = bow[1]

print(doc)

[(dictionary[token_id], count) for token_id, count in doc]

[(4, 1), (10, 1), (33, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1)]


[('billion', 1),
 ('friday', 1),
 ('trade', 1),
 ('blow', 1),
 ('chain', 1),
 ('chinese', 1),
 ('cost', 1),
 ('dent', 1),
 ('duty', 1),
 ('economic', 1),
 ('extent', 1),
 ('flow', 1),
 ('gdp', 1),
 ('good', 1),
 ('hike', 1),
 ('intimate', 1),
 ('long', 1),
 ('loss', 1),
 ('may', 1),
 ('measure', 1),
 ('reveal', 1),
 ('shift', 1),
 ('show', 1),
 ('side', 1),
 ('slightly', 1),
 ('supply', 1),
 ('survive', 1),
 ('tariff', 1),
 ('term', 1),
 ('though', 1),
 ('traditional', 1),
 ('true', 1),
 ('washington', 1)]
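
The `(id, count)` pairs above can be reproduced in miniature. The sketch below mimics the kind of result a call such as `dictionary.doc2bow(tokens)` returns, using a hypothetical `token2id` mapping and a plain `Counter`. The `to_bow` and `decode_bow` helpers are simplified stand-ins written for illustration, not gensim's implementation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Count tokens found in the vocabulary; return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def decode_bow(bow_doc, token2id):
    """Map (token_id, count) pairs back to (token, count) pairs."""
    id2token = {token_id: token for token, token_id in token2id.items()}
    return [(id2token[token_id], count) for token_id, count in bow_doc]


# Hypothetical vocabulary; the real one lives in dictionary.pkl.
token2id = {"cost": 0, "tariff": 1, "trade": 2}
tokens = ["trade", "tariff", "trade", "washington"]

bow_doc = to_bow(tokens, token2id)       # "washington" is unknown and dropped
decoded = decode_bow(bow_doc, token2id)
```

Tokens that are missing from the vocabulary simply do not appear in the result, which matters later when unseen text is converted with the same dictionary.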

In [39]:
doc = tfidf[1]

print(doc)

[(dictionary[token_id], weight) for token_id, weight in doc]

[(4, 0.05775843375562911), (10, 0.12310661992926017), (33, 0.10439212162929286), (38, 0.1961303786255056), (39, 0.17310433437273764), (40, 0.10148900465133513), (41, 0.1199009983425769), (42, 0.23725627027918395), (43, 0.21614153934185534), (44, 0.1071217734433688), (45, 0.28633219743654287), (46, 0.1961303786255056), (47, 0.1961303786255056), (48, 0.1091949636509986), (49, 0.1923966831387611), (50, 0.28633219743654287), (51, 0.11962217757715381), (52, 0.1524456522665324), (53, 0.06964385147961394), (54, 0.17420790459371643), (55, 0.19813974994500413), (56, 0.1765129312952131), (57, 0.12161692894792232), (58, 0.1829493640280444), (59, 0.21010281715361415), (60, 0.18157469132600734), (61, 0.21010281715361415), (62, 0.16023546606560787), (63, 0.13550592942070178), (64, 0.11798968502713748), (65, 0.19813974994500413), (66, 0.20487183805983358), (67, 0.14595088124716785)]


[('billion', 0.05775843375562911),
 ('friday', 0.12310661992926017),
 ('trade', 0.10439212162929286),
 ('blow', 0.1961303786255056),
 ('chain', 0.17310433437273764),
 ('chinese', 0.10148900465133513),
 ('cost', 0.1199009983425769),
 ('dent', 0.23725627027918395),
 ('duty', 0.21614153934185534),
 ('economic', 0.1071217734433688),
 ('extent', 0.28633219743654287),
 ('flow', 0.1961303786255056),
 ('gdp', 0.1961303786255056),
 ('good', 0.1091949636509986),
 ('hike', 0.1923966831387611),
 ('intimate', 0.28633219743654287),
 ('long', 0.11962217757715381),
 ('loss', 0.1524456522665324),
 ('may', 0.06964385147961394),
 ('measure', 0.17420790459371643),
 ('reveal', 0.19813974994500413),
 ('shift', 0.1765129312952131),
 ('show', 0.12161692894792232),
 ('side', 0.1829493640280444),
 ('slightly', 0.21010281715361415),
 ('supply', 0.18157469132600734),
 ('survive', 0.21010281715361415),
 ('tariff', 0.16023546606560787),
 ('term', 0.13550592942070178),
 ('though', 0.11798968502713748),
 ('traditiona
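
The weights in this output are derived from term counts and document frequencies. As a hedged sketch, assuming gensim's default `TfidfModel` settings (raw term frequency, `log2(N / df)` as the inverse document frequency, then L2 normalization of each document), the arithmetic looks roughly like this. Whether tfidf.pkl was built with those defaults is not shown in this notebook, and the `tfidf_weights` helper and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(bow_doc, doc_freq, n_docs):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize."""
    raw = [
        (token_id, count * math.log2(n_docs / doc_freq[token_id]))
        for token_id, count in bow_doc
    ]
    # Terms that occur in every document get weight 0 and are dropped.
    raw = [(token_id, weight) for token_id, weight in raw if weight > 0.0]
    norm = math.sqrt(sum(weight * weight for _, weight in raw))
    return [(token_id, weight / norm) for token_id, weight in raw] if norm else []


# Toy bag-of-words corpus, made up for illustration.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(0, 1), (1, 1), (2, 1)],
    [(3, 1)],
]
# Document frequency: in how many documents each token id occurs.
doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)

weights = tfidf_weights(corpus[1], doc_freq, len(corpus))
```

In the toy corpus, token 0 appears in three of four documents and ends up with a much smaller weight than token 2, which is rarer and occurs twice in the document.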

<p>LSA with Bag-of-Words</p>

In [0]:
lsi_bow = gensim.models.LsiModel(
    bow,
    num_topics=n_topics,
    id2word=dictionary,
)

<p>Print the top 5 topics, with the 10 highest-weighted words for each</p>

In [41]:
topics = lsi_bow.print_topics(num_topics=5, num_words=10)

for idx, topic in topics:
  print("topic: {}\n {}".format(idx, topic))

topic: 0
 0.297*"billion" + 0.213*"president" + 0.211*"trump" + 0.187*"may" + 0.174*"year" + 0.163*"donald" + 0.156*"new" + 0.155*"could" + 0.155*"company" + 0.146*"bank"
topic: 1
 -0.516*"trump" + -0.478*"president" + -0.408*"donald" + 0.280*"billion" + -0.149*"say" + 0.123*"company" + 0.102*"year" + 0.099*"investor" + -0.096*"house" + 0.082*"bank"
topic: 2
 0.516*"billion" + -0.401*"bank" + -0.264*"may" + -0.211*"european" + 0.173*"company" + -0.144*"union" + -0.144*"britain" + -0.142*"minister" + 0.135*"trump" + -0.122*"prime"
topic: 3
 0.539*"bank" + -0.435*"may" + 0.190*"year" + -0.171*"minister" + 0.166*"chief" + 0.155*"executive" + -0.152*"prime" + -0.151*"union" + -0.150*"european" + -0.143*"britain"
topic: 4
 0.516*"china" + 0.250*"market" + -0.229*"chief" + -0.216*"executive" + 0.210*"could" + -0.209*"billion" + 0.180*"chinese" + 0.165*"beijing" + -0.164*"may" + 0.162*"state"


#### LSA model with TF-IDF


In [0]:
lsi_tfidf = gensim.models.LsiModel(
    tfidf,
    num_topics=n_topics,
    id2word=dictionary,
)

<p>Print top 5 topics</p>

In [43]:
# Show the top 5 topics with their 10 highest-weighted words
for idx, topic in lsi_tfidf.print_topics(num_topics=5, num_words=10):
    print('\nTopic: {}\nWords: {}'.format(idx, topic))



Topic: 0
Words: 0.148*"trump" + 0.143*"president" + 0.126*"donald" + 0.124*"billion" + 0.112*"may" + 0.111*"bank" + 0.108*"say" + 0.106*"could" + 0.102*"year" + 0.102*"china"

Topic: 1
Words: 0.347*"trump" + 0.311*"donald" + 0.308*"president" + 0.173*"say" + 0.149*"verified" + 0.145*"twitter" + 0.143*"house" + 0.141*"statement" + 0.121*"post" + 0.120*"account"

Topic: 2
Words: -0.377*"verified" + -0.342*"twitter" + -0.339*"statement" + -0.318*"account" + -0.295*"post" + -0.263*"follow" + -0.181*"personal" + 0.110*"election" + 0.088*"minister" + 0.081*"house"

Topic: 3
Words: -0.234*"european" + -0.219*"minister" + -0.217*"union" + -0.211*"britain" + -0.210*"brexit" + -0.198*"prime" + -0.191*"theresa" + -0.152*"verified" + -0.140*"may" + -0.135*"statement"

Topic: 4
Words: -0.246*"china" + 0.214*"fox" + 0.186*"murdoch" + -0.157*"trade" + 0.153*"disney" + 0.149*"twenty" + 0.144*"sky" + 0.143*"century" + 0.137*"rupert" + -0.135*"chinese"


<p>Test the models with unseen data</p>

In [44]:
def sortbyvalue(item):
  """Sort key: return the value of a (key, value) tuple."""
  return item[1]



text = 'Uber Technologies lackluster stock-market debut is a warning for other tech unicorns'

unseen_doc = dictionary.doc2bow(text.split())

vector = lsi_bow[unseen_doc]
# sort vector descending by score
vector.sort(key=sortbyvalue, reverse=True)

#print top 5 topics
for idx, score in vector[:5]:
  print("topic:{}  score:{} \n {}".format(idx, score, lsi_bow.print_topic(topicno=idx, topn=5)))



topic:7  score:0.08736280557799948 
 -0.402*"year" + 0.385*"market" + -0.287*"state" + -0.265*"china" + 0.260*"investor"
topic:23  score:0.07350410000420445 
 0.264*"would" + -0.210*"market" + 0.187*"trade" + 0.185*"economic" + 0.168*"percent"
topic:4  score:0.06564559022171641 
 0.516*"china" + 0.250*"market" + -0.229*"chief" + -0.216*"executive" + 0.210*"could"
topic:21  score:0.05115576717249284 
 0.430*"would" + -0.261*"percent" + 0.253*"investor" + 0.235*"like" + -0.167*"price"
topic:2  score:0.04746070033605729 
 0.516*"billion" + -0.401*"bank" + -0.264*"may" + -0.211*"european" + 0.173*"company"
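
One thing worth checking in the cell above is that `text.split()` keeps capitalization and punctuation, so tokens such as `Uber` or `stock-market` can only match dictionary entries spelled exactly the same way. Since the processed documents shown earlier are lowercase, a light normalization step before `doc2bow` may recover more matches. The sketch below is an assumption about what such a step could look like. The `normalize` helper and the toy vocabulary are hypothetical, and the actual pipeline in model_preparation.ipynb (which also lemmatizes) is not reproduced here.

```python
import re


def normalize(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


text = ("Uber Technologies lackluster stock-market debut is a warning "
        "for other tech unicorns")

# Hypothetical vocabulary standing in for the gensim dictionary.
vocabulary = {"market", "stock", "debut", "warning"}

raw_hits = [t for t in text.split() if t in vocabulary]
clean_hits = [t for t in normalize(text) if t in vocabulary]
```

With this toy vocabulary, the raw split matches only `debut` and `warning`, while the normalized tokens also match `stock` and `market`. Fewer matched tokens means a sparser query vector and, in turn, less informative topic scores.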


In [45]:
text = 'Uber Technologies lackluster stock-market debut is a warning for other tech unicorns'

unseen_doc = dictionary.doc2bow(text.split())

vector = lsi_tfidf[unseen_doc]

vector.sort(key = sortbyvalue)

for idx, score in vector[:5]:
  print("topic:{}  score:{} \n {}".format(idx, score, lsi_tfidf.print_topic(topicno=idx, topn=5)))


topic:49  score:-0.10628477110527208 
 0.142*"tuesday" + 0.137*"would" + -0.136*"group" + -0.129*"financial" + -0.122*"democratic"
topic:36  score:-0.08223417129635605 
 -0.142*"financial" + 0.133*"white" + -0.129*"crisis" + 0.128*"market" + -0.125*"deal"
topic:29  score:-0.07213120942662607 
 0.171*"financial" + 0.143*"government" + -0.136*"tax" + -0.123*"presidential" + 0.108*"year"
topic:34  score:-0.06332052801079278 
 0.139*"financial" + 0.131*"state" + 0.131*"make" + -0.131*"federal" + 0.129*"tax"
topic:1  score:-0.06103160388384444 
 0.347*"trump" + 0.311*"donald" + 0.308*"president" + 0.173*"say" + 0.149*"verified"


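<p>Given the same text, the two models return different results. The two cells are not directly comparable, however: the bag-of-words cell sorts the scores in descending order (<code>reverse=True</code>) and lists the five highest-scoring topics, whereas the TF-IDF cell sorts in ascending order and therefore lists the five lowest (most negative) scores. In addition, the TF-IDF model is queried with raw bag-of-words counts rather than TF-IDF weights.</p>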
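
LSA scores can be negative, and the sign of an SVD dimension is arbitrary, so one option for comparing the two models on equal terms is to rank topics by the absolute value of their score. The sketch below shows that ranking on a made-up vector of `(topic_id, score)` pairs. The `strongest_topics` helper is hypothetical and is offered as one possible convention, not as the method used in this notebook.

```python
def strongest_topics(vector, n=5):
    """Return the n (topic_id, score) pairs with the largest absolute score."""
    return sorted(vector, key=lambda item: abs(item[1]), reverse=True)[:n]


# Made-up scores in the same (topic_id, score) shape that lsi_bow[...] returns.
vector = [(0, 0.02), (1, -0.06), (2, 0.05), (3, -0.11), (4, 0.09)]

top3 = strongest_topics(vector, n=3)
```

Using `sorted` rather than `list.sort` leaves the original vector untouched, so the same query result can be ranked in more than one way.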