<a href="https://colab.research.google.com/github/diem-ai/topic-modeling/blob/master/Breakingnews_Topic_Modeling_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

#### Every document we read can be thought of as a mixture of many topics stacked upon one another. In this notebook, we unpack these topics using NLP techniques:
- Latent Dirichlet Allocation (LDA) and topic modeling
- The data was collected from https://www.reuters.com/breakingviews by a scraping script
- The goal is to break text documents down into topics, where each topic is described by its words.
- What is a latent feature? It is a hidden variable that is not observed directly but is inferred from the data. Here, we want to find “topics”: collections of words that tend to appear in similar documents.
- There are several libraries for LDA, such as scikit-learn and gensim. I chose gensim for this project.
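
To make the idea of "documents as mixtures of topics" concrete, here is a minimal sketch of the generative story that LDA assumes. The vocabulary, the two hand-made topics and the Dirichlet parameter are hypothetical values chosen for illustration; they do not come from the Reuters data, and this is not how gensim fits a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, doc_length, alpha, rng):
    """Generate a toy document by following LDA's generative assumptions."""
    # 1. Each document gets its own mixture over topics.
    topic_mix = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(doc_length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word_probs[topic])[0])
    return topic_mix, words

# Hypothetical vocabulary and topics (illustration only).
vocab = ["bank", "lender", "loan", "trump", "president", "election"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "finance" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "politics" topic
]

rng = random.Random(0)
topic_mix, words = generate_document(topic_word_probs, vocab, 8, [0.5, 0.5], rng)
print(topic_mix)
print(words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the word distribution of each topic.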

#### Project tasks:
- Clean the dataset and lemmatize
- Create a dictionary from the processed data
- Create a corpus and an LDA model with bag of words
- Create a corpus and an LDA model with TF-IDF
- Calculate the perplexity and topic coherence of the two models
- Visualize topics with the help of pyLDAvis


####  Import libraries

In [0]:
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
!pip install PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)





In [0]:
!pip install unidecode



In [0]:
!pip install pyLDAvis



In [0]:
# re is used by the preprocessing helpers below, so import it explicitly
import re
import string

import numpy as np
import pandas as pd
import unidecode

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  
import matplotlib.pyplot as plt
%matplotlib inline
# Make all my plots 538 Style
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:

data = pd.read_csv('/content/drive/My Drive/data/breakingnews.csv')

data.head(2)

Unnamed: 0,title,headline_text
0,Uber’s losses are nothing like young Amazon’s,Uber Technologies is no Amazon.com. Some of th...
1,Hadas: CEOs would benefit from more humanities,"“The Defects of an University Education, and i..."


In [0]:
# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

#####  Preprocessing Data & Lemmatization

In [0]:
stemmer = SnowballStemmer('english')

def get_wordnet_pos(treebank_tag):
    """Convert the part-of-speech naming scheme
       from the nltk default to that which is
       recognized by the WordNet lemmatizer"""

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

      
# remove tokens that contain digits (e.g. "q3", "2019")
alphanum_re = re.compile(r"\w*\d\w*")
alphanum_lambda = lambda x: alphanum_re.sub('', x)

# replace every non-letter character with a space
re_alpha = re.compile(r'[^A-Za-z]')
alphaonly = lambda x: re_alpha.sub(' ', x)

# remove punctuation
punc_re = re.compile('[%s]' % re.escape(string.punctuation))
punc_lambda = lambda x: punc_re.sub(' ', x)

# remove typographic single quotes
single_quote1 = re.compile("’")
nosinglequote1 = lambda x: single_quote1.sub('', x)

single_quote2 = re.compile('‘')
nosinglequote2 = lambda x: single_quote2.sub('', x)

# remove straight double quotes
double_quote = re.compile('"')
nodoublequote = lambda x: double_quote.sub('', x)

# remove stop words
sw = stopwords.words('english')
sw_lambda = lambda x: list(filter(lambda y: y not in sw, x))

pos_lambda = lambda x: [(y[0], get_wordnet_pos(y[1])) for y in x]

lemmatizer = WordNetLemmatizer()
lem_lambda = lambda x: [lemmatizer.lemmatize(*y) for y in x]



def preprocess_raw_data(data):
    """
    Clean, tokenize and lemmatize raw text.

    data: pandas Series of raw text documents
    returns: pandas Series of lists of lemmatized tokens
    """
    # remove email addresses
    email_re = re.compile(r'\S*@\S*\s?')
    noemail = lambda x: email_re.sub(' ', x)
    data = data.map(noemail)

    # collapse newlines and other runs of whitespace into a single space
    newline_re = re.compile(r'\s+')
    nonewline = lambda x: newline_re.sub(' ', x)
    data = data.map(nonewline)
    # remove distracting single quotes
    sg_quote_re = re.compile(r"'")
    no_sg_quote = lambda x: sg_quote_re.sub(' ', x)
    data = data.map(no_sg_quote)

    # tokenize and lowercase each document before removing stop words
    data = data.map(simple_preprocess)

    # remove stop words (a set makes the membership test fast)
    sw = set(stopwords.words('english'))
    sw_lambda = lambda x: [y for y in x if y not in sw]
    data = data.map(sw_lambda)

    # part of speech tagging--must convert to format used by lemmatizer
    data = data.map(nltk.pos_tag)
    data = data.map(pos_lambda)
    # lemmatization
    data = data.map(lem_lambda)
    
    return data
 
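def get_score(lda_model, doc2vec):
    """
    Print the topics assigned to one document, highest score first.

    lda_model: trained LDA model
    doc2vec: one document in the representation the model was trained on
             (bag of words or TF-IDF)
    """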
    for index, score in sorted(lda_model[doc2vec], key=lambda tup: -1*tup[1]):
        print("\nScore: {}\nTopic: {} \nWord: {}".format(score, index, lda_model.print_topic(index, 10)))


In [0]:
processed_docs = preprocess_raw_data(documents['headline_text'])
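
The cleaning steps above can be tried on a single string without pandas, NLTK or gensim. The sketch below is a simplified stand-in: the stop-word list is a tiny hypothetical one, the tokenizer is a plain regular expression rather than `simple_preprocess`, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "to", "at", "of", "and", "in", "are"}

EMAIL_RE = re.compile(r"\S*@\S*\s?")
WHITESPACE_RE = re.compile(r"\s+")
TOKEN_RE = re.compile(r"[a-z]+")

def clean_and_tokenize(text):
    """Remove emails, normalize whitespace and quotes, tokenize, drop stop words."""
    text = EMAIL_RE.sub(" ", text)
    text = WHITESPACE_RE.sub(" ", text)
    text = text.replace("'", " ")
    tokens = TOKEN_RE.findall(text.lower())
    # Like simple_preprocess, discard one-letter tokens.
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]

sample = "Write to the editor at news@example.com.\nUber's losses are nothing like Amazon's."
print(clean_and_tokenize(sample))
# → ['write', 'editor', 'uber', 'losses', 'nothing', 'like', 'amazon']
```

The real pipeline additionally tags each token with its part of speech and lemmatizes it, which would turn `losses` into `loss`.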

#### Create the Dictionary and Corpus

In [0]:
# Create a dictionary (token -> integer id) from the tokenized documents
dictionary = corpora.Dictionary(processed_docs)
# Filter out rare and very common tokens:
# keep tokens that appear in at least 15 documents,
# keep tokens that appear in no more than 50% of the documents,
# then keep only the 10000 most frequent of the remaining tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=10000)
# Bag-of-words corpus: each document becomes a list of (word_id, word_frequency) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# View the first document in the corpus
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)]]


In [0]:
# Show the word behind each id in the first document, together with its frequency
[[(dictionary[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]

[[('accord', 1),
  ('amazon', 1),
  ('cash', 1),
  ('commerce', 1),
  ('company', 1),
  ('firm', 1),
  ('future', 1),
  ('giant', 1),
  ('idea', 1),
  ('initial', 1),
  ('investor', 1),
  ('look', 1),
  ('loss', 1),
  ('news', 1),
  ('number', 1),
  ('offering', 1),
  ('profit', 1),
  ('public', 1),
  ('push', 1),
  ('report', 1),
  ('ride', 1),
  ('technology', 1),
  ('though', 1)]]
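
To show what the dictionary filtering and the `(word_id, word_frequency)` pairs mean, here is a small pure-Python sketch of the same idea. `build_vocabulary` and `doc_to_bow` are hypothetical helpers written for this illustration; they only approximate what `filter_extremes` and `doc2bow` do (for example, there is no `keep_n` step, and gensim may assign ids differently).

```python
from collections import Counter

def build_vocabulary(docs, no_below=1, no_above=1.0):
    """Map tokens to ids, keeping tokens whose document frequency is in range."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def doc_to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

docs = [
    ["bank", "loss", "loss", "profit"],
    ["bank", "trump", "election"],
    ["trump", "election", "profit"],
    ["bank", "loss", "rare"],
]

# "bank" appears in 3 of 4 documents (too common), "rare" in only 1 (too rare).
token2id = build_vocabulary(docs, no_below=2, no_above=0.5)
print(token2id)                     # → {'election': 0, 'loss': 1, 'profit': 2, 'trump': 3}
print(doc_to_bow(docs[0], token2id))  # → [(1, 2), (2, 1)]
```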

#### LDA Topic Modeling (Bag of words)
- Build an LDA model on the bag-of-words corpus with 20 topics
- Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.
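
gensim prints each topic as a string of the form `0.024*"may" + 0.019*"year" + ...`, where the number in front of a keyword is its weight in the topic. The sketch below parses such a string into `(word, weight)` pairs; `parse_topic` is a hypothetical helper for illustration. To my knowledge, gensim's `show_topic` method returns such pairs directly, which is usually preferable to parsing strings.

```python
import re

# Matches one term such as 0.024*"may" and captures the weight and the word.
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_string)]

topic = '0.024*"may" + 0.019*"year" + 0.017*"make" + 0.016*"european"'
pairs = parse_topic(topic)
print(pairs)
# → [('may', 0.024), ('year', 0.019), ('make', 0.017), ('european', 0.016)]
```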

In [0]:
lda_model = gensim.models.LdaMulticore(corpus
                                       , num_topics=20
                                       , id2word=dictionary
                                       , iterations=50
                                       , passes=2
                                       , workers=4)


##### View the topics

In [0]:
#pprint(lda_model.print_topics())
for idx, topic in lda_model.print_topics():
    print('\nTopic: {}\nWords: {}'.format(idx+1, topic))



Topic: 1
Words: 0.024*"may" + 0.019*"year" + 0.017*"make" + 0.016*"european" + 0.013*"look" + 0.012*"could" + 0.011*"brexit" + 0.010*"first" + 0.010*"britain" + 0.009*"bank"

Topic: 2
Words: 0.023*"china" + 0.020*"may" + 0.018*"make" + 0.016*"economic" + 0.016*"leader" + 0.014*"market" + 0.013*"investor" + 0.012*"financial" + 0.010*"minister" + 0.010*"people"

Topic: 3
Words: 0.020*"bank" + 0.014*"billion" + 0.014*"take" + 0.013*"lender" + 0.012*"executive" + 0.012*"chief" + 0.012*"face" + 0.011*"thursday" + 0.010*"yet" + 0.010*"year"

Topic: 4
Words: 0.020*"bank" + 0.017*"big" + 0.014*"may" + 0.013*"year" + 0.012*"european" + 0.012*"give" + 0.012*"leave" + 0.011*"could" + 0.010*"like" + 0.010*"new"

Topic: 5
Words: 0.016*"like" + 0.014*"year" + 0.012*"billion" + 0.012*"new" + 0.012*"rival" + 0.011*"uk" + 0.011*"may" + 0.010*"data" + 0.010*"could" + 0.010*"make"

Topic: 6
Words: 0.019*"china" + 0.017*"trump" + 0.016*"president" + 0.015*"billion" + 0.014*"donald" + 0.012*"may" + 0.012*

#### Compute Model Perplexity and Coherence Score

In [0]:
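# Compute the log perplexity (per-word likelihood bound).
# log_perplexity() returns a bound on a log scale, so values closer to zero are better;
# the perplexity itself is 2 ** (-bound), and there lower is better.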
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) 

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model
                                     , corpus=corpus
                                     , texts = list(processed_docs)
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -6.766496981976656

Coherence Score:  0.2867435510539194
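
The value printed as "Perplexity" is negative because, as far as I understand gensim's implementation, `log_perplexity` returns a per-word likelihood bound on a log scale rather than the perplexity itself. The sketch below converts such a bound into a perplexity with `2 ** (-bound)`; `bound_to_perplexity` is a hypothetical helper, and the input is the number printed above.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity."""
    return 2 ** (-per_word_bound)

bow_bound = -6.766496981976656  # value printed above for the bag-of-words model
bow_perplexity = bound_to_perplexity(bow_bound)
print(bow_perplexity)
```

On this scale a lower perplexity is better, which corresponds to a bound closer to zero.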


#### Visualize the topics-keywords

In [0]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis

#### LDA model with TF-IDF


In [0]:
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

In [0]:
# Show the TF-IDF weight of each word in the first document
[[(dictionary[word_id], weight) for word_id, weight in cp] for cp in corpus_tfidf[:1]]

[[('accord', 0.2484508994549492),
  ('amazon', 0.24541834943804625),
  ('cash', 0.20694200089786202),
  ('commerce', 0.23711871294667589),
  ('company', 0.1139896737199559),
  ('firm', 0.17147981605705243),
  ('future', 0.22753748577366648),
  ('giant', 0.15999767785734764),
  ('idea', 0.22121118791006092),
  ('initial', 0.21546972918998153),
  ('investor', 0.12077694980995311),
  ('look', 0.1697038396371476),
  ('loss', 0.2425269003806089),
  ('news', 0.21732523850103416),
  ('number', 0.24541834943804625),
  ('offering', 0.22121118791006092),
  ('profit', 0.21546972918998153),
  ('public', 0.18335484397173907),
  ('push', 0.21546972918998153),
  ('report', 0.1697038396371476),
  ('ride', 0.26633176311932893),
  ('technology', 0.21021406930731967),
  ('though', 0.1867153891875265)]]
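
The weights above replace raw counts with TF-IDF scores. A minimal sketch of the weighting, assuming gensim's default settings as I understand them (term frequency times `log2(N / document_frequency)`, followed by scaling each document vector to unit length): `tfidf_vector` and the toy document frequencies are hypothetical and only illustrate the arithmetic.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (word_id, count) pairs by TF-IDF and L2-normalize the result."""
    weights = [(wid, tf * math.log2(num_docs / doc_freq[wid])) for wid, tf in bow]
    # Words that occur in every document get weight 0 and are dropped.
    weights = [(wid, w) for wid, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(wid, w / norm) for wid, w in weights] if norm else []

# Toy corpus statistics: 4 documents, word 0 is rare, word 2 is in every document.
doc_freq = {0: 1, 1: 2, 2: 4}
bow = [(0, 1), (1, 2), (2, 1)]

vector = tfidf_vector(bow, doc_freq, num_docs=4)
print(vector)
```

Word 0 occurs once but is rare, word 1 occurs twice but is more common, so both end up with the same weight, while word 2 carries no information and disappears.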

In [0]:
lda_tfidf = gensim.models.LdaMulticore(corpus_tfidf
                                       , num_topics=20
                                       , id2word=dictionary
                                       , iterations=50)

#### View the topics

In [0]:
#pprint(lda_tfidf.print_topics())
for idx, topic in lda_tfidf.print_topics():
    print('\nTopic: {}\nWords: {}'.format(idx+1, topic))



Topic: 1
Words: 0.009*"president" + 0.009*"leader" + 0.008*"though" + 0.008*"european" + 0.007*"trump" + 0.007*"way" + 0.007*"make" + 0.007*"house" + 0.007*"use" + 0.007*"say"

Topic: 2
Words: 0.009*"may" + 0.008*"investor" + 0.007*"former" + 0.007*"deal" + 0.007*"trump" + 0.007*"might" + 0.007*"business" + 0.007*"group" + 0.007*"take" + 0.006*"well"

Topic: 3
Words: 0.018*"mueller" + 0.014*"counsel" + 0.014*"investigation" + 0.014*"robert" + 0.014*"special" + 0.011*"leave" + 0.011*"general" + 0.011*"russia" + 0.009*"business" + 0.008*"one"

Topic: 4
Words: 0.008*"world" + 0.008*"president" + 0.008*"billion" + 0.008*"china" + 0.007*"less" + 0.007*"donald" + 0.007*"trump" + 0.006*"company" + 0.006*"trade" + 0.006*"say"

Topic: 5
Words: 0.010*"last" + 0.008*"make" + 0.008*"bank" + 0.008*"fund" + 0.008*"billion" + 0.008*"one" + 0.008*"investor" + 0.007*"big" + 0.007*"follow" + 0.007*"finance"

Topic: 6
Words: 0.011*"state" + 0.009*"trade" + 0.008*"company" + 0.008*"business" + 0.007*"bil

#### Compute Perplexity & Coherence Score

In [0]:
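# Compute the log perplexity (per-word likelihood bound).
# log_perplexity() returns a bound on a log scale, so values closer to zero are better.
# Note: this bound is computed on TF-IDF weights rather than word counts, so it may
# not be directly comparable with the bag-of-words model's value.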
print('\nPerplexity: ', lda_tfidf.log_perplexity(corpus_tfidf)) 

# Compute Coherence Score
coherence_model_tfidf = CoherenceModel(model=lda_tfidf
                                     , corpus=corpus_tfidf
                                     , texts = list(processed_docs)
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_tfidf = coherence_model_tfidf.get_coherence()
print('\nCoherence Score: ', coherence_tfidf)



Perplexity:  -9.961465919389221

Coherence Score:  0.2510351150262894
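
A coherence score measures how often the top words of a topic occur together in the documents. The `c_v` measure used above is fairly involved, so the sketch below implements the simpler UMass coherence instead, purely to illustrate the idea; it is a different measure and will not reproduce the numbers above. `umass_coherence` and the toy documents are hypothetical, and the function assumes every top word appears in at least one document.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_count(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_count(top_words[l]))
    return score

docs = [
    ["trump", "president", "house"],
    ["trump", "president"],
    ["bank", "lender"],
    ["bank", "loan", "lender"],
]

coherent = umass_coherence(["trump", "president"], docs)  # words that co-occur
mixed = umass_coherence(["trump", "bank"], docs)          # words that never co-occur
print(coherent, mixed)
```

The topic whose words appear in the same documents gets the higher score, which is the behaviour any coherence measure is meant to capture.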


#### Visualize the topics

In [0]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_tfidf, corpus_tfidf, dictionary)
vis

In [0]:
# Test: inspect one document and the topics the bag-of-words model assigns to it
print(corpus[10])
for word_id, freq in corpus[10]:
    print(dictionary[word_id], freq)

get_score(lda_model, corpus[10])

[(19, 1), (54, 1), (72, 1), (85, 1), (94, 1), (98, 1), (104, 1), (119, 1), (120, 1), (121, 1), (122, 1), (123, 1), (124, 1), (125, 1), (126, 1), (127, 1)]
report 1
even 1
measure 1
hand 1
donald 1
president 1
trump 1
democratic 1
election 1
executive 1
general 1
hold 1
house 1
lead 1
mueller 1
wednesday 1

Score: 0.48596107959747314
Topic: 8 
Word: 0.021*"president" + 0.019*"trump" + 0.016*"state" + 0.013*"donald" + 0.013*"china" + 0.012*"chinese" + 0.011*"bank" + 0.011*"could" + 0.010*"administration" + 0.010*"way"

Score: 0.4610978066921234
Topic: 7 
Word: 0.042*"president" + 0.042*"trump" + 0.033*"donald" + 0.024*"say" + 0.023*"house" + 0.014*"tax" + 0.014*"robert" + 0.013*"special" + 0.013*"mueller" + 0.013*"white"


In [0]:

# Same document, scored by the TF-IDF model
print(corpus_tfidf[10])
for word_id, weight in corpus_tfidf[10]:
    print(dictionary[word_id], weight)

get_score(lda_tfidf, corpus_tfidf[10])
    

[(19, 0.21754700548297537), (54, 0.20320728850840858), (72, 0.3414164208393571), (85, 0.30735863779709416), (94, 0.1669388834804864), (98, 0.14342461694658137), (104, 0.1574221684933286), (119, 0.3109004548139884), (120, 0.22964256904552222), (121, 0.18713312501072776), (122, 0.28888920365703014), (123, 0.2810457188350571), (124, 0.2487391659201765), (125, 0.273905622465155), (126, 0.29458289407567856), (127, 0.23789245875527396)]
report 0.21754700548297537
even 0.20320728850840858
measure 0.3414164208393571
hand 0.30735863779709416
donald 0.1669388834804864
president 0.14342461694658137
trump 0.1574221684933286
democratic 0.3109004548139884
election 0.22964256904552222
executive 0.18713312501072776
general 0.28888920365703014
hold 0.2810457188350571
house 0.2487391659201765
lead 0.273905622465155
mueller 0.29458289407567856
wednesday 0.23789245875527396

Score: 0.8057277798652649
Topic: 7 
Word: 0.010*"trump" + 0.009*"president" + 0.009*"monday" + 0.008*"donald" + 0.008*"hand" + 0.008