### Introduction

#### Every document we read can be thought of as consisting of many topics stacked upon one another. Today, we’re going to unpack these topics using NLP techniques:
- Latent Dirichlet Allocation (LDA) and Topic Modeling
- Data is collected from https://www.reuters.com/breakingviews by a scraping script
- The goal is to break text documents down into topics by word. 
- What is a latent feature? Mathematically, we want to find “topics”: collections of words that tend to appear in the same documents.
  More generally, a latent feature is a collection of features in a dataset that is not observed directly.
- There are several libraries for LDA, such as scikit-learn and gensim. I chose gensim for this project.
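
As a hedged aside, the usual textbook formulation of LDA (not something specific to this project) treats each word in a document as drawn from a mixture of topics. The symbols are notation I introduce here only for illustration: $K$ is the number of topics, $\theta_{d,k}$ is the share of topic $k$ in document $d$, $\phi_{k,w}$ is the probability of word $w$ under topic $k$, and $\alpha$, $\eta$ are the Dirichlet priors.

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \theta_d \sim \mathrm{Dirichlet}(\alpha),
\qquad \phi_k \sim \mathrm{Dirichlet}(\eta)
```

The "latent features" are the $\theta$ and $\phi$ values: neither is observed, and both are inferred from word counts alone.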

#### Project tasks:
- Cleaning the dataset & Lemmatization
- Create a dictionary from the processed data
- Create a corpus and an LDA model with bag of words
- Create a corpus and an LDA model with TF-IDF
- Calculate the perplexity and topic coherence of the two models
- Visualize topics with the help of pyLDAvis


####  Import libraries

In [85]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

from nltk.stem import WordNetLemmatizer, SnowballStemmer
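from nltk.stem.porter import *
import re  # imported explicitly so the cleaning code does not rely on the star import above
import numpy as np
import nltk
nltk.download('wordnet')
# stopwords.words and nltk.pos_tag need their own NLTK resources (e.g. 'stopwords' and a POS tagger model)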
import string
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
import pandas as pd
import unidecode

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (on pyLDAvis >= 3.3 the module is named pyLDAvis.gensim_models)
import matplotlib.pyplot as plt
%matplotlib inline



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\btdiem\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [86]:
# error_bad_lines skips malformed rows; pandas >= 1.3 uses on_bad_lines='skip' instead
data = pd.read_csv('../data/breakingnews.csv', error_bad_lines=False)

In [87]:
data.head(2)

Unnamed: 0.1,Unnamed: 0,headline
0,0,Joseph Stiglitz does what he does very well. T...
1,1,For all the successes of Japanese Prime Minist...


In [88]:
# Copy the column so that adding 'index' does not trigger a SettingWithCopyWarning
data_text = data[['headline']].copy()
data_text['index'] = data_text.index
documents = data_text

#####  Preprocessing Data & Lemmatization

In [89]:
stemmer = SnowballStemmer('english')

def get_wordnet_pos(treebank_tag):
    """Convert the part-of-speech naming scheme
       from the nltk default to that which is
       recognized by the WordNet lemmatizer"""

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

      
# remove tokens that contain digits
alphanum_re = re.compile(r"\w*\d\w*")
alphanum_lambda = lambda x: alphanum_re.sub('', x)

# replace every non-letter character with a space
re_alpha = re.compile('[^A-Za-z]', re.UNICODE)
alphaonly = lambda x: re_alpha.sub(' ', x)

# replace punctuation with spaces
punc_re = re.compile('[%s]' % re.escape(string.punctuation))
punc_lambda = lambda x: punc_re.sub(' ', x)

# strip curly single quotes and straight double quotes
single_quote1 = re.compile("’")
nosinglequote1 = lambda x: single_quote1.sub('', x)

single_quote2 = re.compile('‘')
nosinglequote2 = lambda x: single_quote2.sub('', x)

double_quote = re.compile('"')
nodoublequote = lambda x: double_quote.sub('', x)

# remove stop words
sw = stopwords.words('english')
sw_lambda = lambda x: list(filter(lambda y: y not in sw, x))

pos_lambda = lambda x: [(y[0], get_wordnet_pos(y[1])) for y in x]

lemmatizer = WordNetLemmatizer()
lem_lambda = lambda x: [lemmatizer.lemmatize(*y) for y in x]



def preprocess_raw_data(data):
    """
    Clean, tokenize and lemmatize raw text.

    data: pandas Series of raw text documents
    returns: pandas Series of lemmatized token lists
    """
    # remove email addresses
    email_re = re.compile(r'\S*@\S*\s?')
    data = data.map(lambda x: email_re.sub(' ', x))

    # collapse newlines and repeated whitespace into a single space
    newline_re = re.compile(r'\s+')
    data = data.map(lambda x: newline_re.sub(' ', x))

    # remove distracting single quotes
    sg_quote_re = re.compile(r"'")
    data = data.map(lambda x: sg_quote_re.sub(' ', x))

    # tokenize and lowercase before removing stop words
    data = data.map(simple_preprocess)

    # remove stop words (a set makes the membership test fast)
    sw = set(stopwords.words('english'))
    data = data.map(lambda tokens: [t for t in tokens if t not in sw])

    # part-of-speech tagging, converted to the format used by the lemmatizer
    data = data.map(nltk.pos_tag)
    data = data.map(pos_lambda)
    # lemmatization
    data = data.map(lem_lambda)

    return data
 
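def get_score(lda_model, doc2vec):
    """
    Print the topics assigned to one document, highest score first.

    lda_model: trained LDA model
    doc2vec: the document as a list of (token_id, weight) pairs
    """
    for index, score in sorted(lda_model[doc2vec], key=lambda tup: -1*tup[1]):
        print("\nScore: {}\nTopic: {} \nWord: {}".format(score, index, lda_model.print_topic(index, 10)))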




In [90]:
processed_docs = preprocess_raw_data(documents['headline'])
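
To make the cleaning steps easier to follow, here is a minimal sketch that applies the same regular expressions to a single string using only the standard library. It is not the notebook's pipeline: the `clean_and_tokenize` helper, the tiny `STOPWORDS` set and the sample sentence are assumptions for illustration, the `re.findall` call is only a rough stand-in for gensim's `simple_preprocess`, and POS tagging and lemmatization are left out because they need NLTK data.

```python
import re

# Same patterns as in preprocess_raw_data, written as raw strings.
EMAIL_RE = re.compile(r'\S*@\S*\s?')
WHITESPACE_RE = re.compile(r'\s+')
QUOTE_RE = re.compile(r"'")

# Hypothetical stop-word subset; the notebook uses NLTK's full English list.
STOPWORDS = {'the', 'a', 'to', 'of', 'and', 'is', 'for', 'at'}


def clean_and_tokenize(text):
    """Strip e-mails and quotes, collapse whitespace, lowercase, drop stop words."""
    text = EMAIL_RE.sub(' ', text)
    text = WHITESPACE_RE.sub(' ', text)
    text = QUOTE_RE.sub(' ', text)
    # Rough stand-in for simple_preprocess: lowercase alphabetic tokens only
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "Contact editor@example.com for the\nlatest news of the day"
tokens = clean_and_tokenize(sample)
print(tokens)
```

The order matters: the e-mail pattern has to run before tokenization, otherwise the address would be split into ordinary-looking words.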

#### Create the Dictionary and Corpus

In [91]:
# Build a dictionary (token -> integer id) from the processed documents
dictionary = corpora.Dictionary(processed_docs)
# Filter out rare and very common tokens:
# keep tokens that appear in at least 15 documents
# and in no more than 50% of the documents,
# then keep only the 10000 most frequent of those
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=10000)
# Bag-of-words corpus: each document becomes a list of (word_id, word_frequency) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# View the first document in the corpus
print(corpus[:1])

[[(0, 1), (1, 1)]]


In [92]:
# Map the ids back to words to see each word and its frequency
[[(dictionary[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]

[[('may', 1), ('provide', 1)]]
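
The dictionary filtering and the bag-of-words step can be sketched in plain Python. This is an illustration under stated assumptions, not gensim's implementation: the toy `docs`, the scaled-down thresholds and the local `doc2bow` helper are hypothetical, and gensim assigns token ids differently.

```python
from collections import Counter

docs = [
    ['bank', 'profit', 'year'],
    ['bank', 'loan', 'year', 'year'],
    ['oil', 'price', 'year'],
    ['bank', 'oil', 'price'],
]

# Document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

# Hypothetical thresholds, scaled down from no_below=15, no_above=0.5
no_below, no_above = 2, 0.8
kept = sorted(
    token for token, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)
token2id = {token: i for i, token in enumerate(kept)}


def doc2bow(doc):
    """Return sorted (token_id, count) pairs for the tokens that survived filtering."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow = [doc2bow(doc) for doc in docs]
print(token2id)
print(bow)
```

Tokens dropped by the filter simply vanish from every document, which is why the first document above shrinks to two `(id, count)` pairs.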

#### LDA Topic Modeling (Bag of words)
- The LDA model is built on the bag-of-words corpus with 5 topics.
- Each topic is a combination of keywords, and each keyword contributes a certain weight to the topic.

In [93]:
lda_model = gensim.models.LdaMulticore(corpus
                                       , num_topics=5
                                       , id2word=dictionary
                                       , iterations=50
                                       , passes=2
                                       , workers=4)


##### View the topic

In [94]:
# Topic ids are zero-based, matching the ids reported by get_score
for idx, topic in lda_model.print_topics():
    print('\nTopic: {}\nWords: {}'.format(idx, topic))



Topic: 0
Words: 0.647*"year" + 0.086*"may" + 0.052*"billion" + 0.051*"news" + 0.051*"provide" + 0.051*"big" + 0.021*"world" + 0.021*"com" + 0.020*"day"

Topic: 1
Words: 0.245*"world" + 0.196*"news" + 0.147*"day" + 0.147*"com" + 0.099*"provide" + 0.057*"billion" + 0.057*"big" + 0.050*"may" + 0.002*"year"

Topic: 2
Words: 0.404*"may" + 0.401*"provide" + 0.043*"year" + 0.042*"big" + 0.042*"news" + 0.018*"com" + 0.017*"world" + 0.017*"billion" + 0.017*"day"

Topic: 3
Words: 0.317*"billion" + 0.315*"may" + 0.170*"com" + 0.169*"year" + 0.006*"big" + 0.006*"news" + 0.006*"provide" + 0.006*"world" + 0.006*"day"

Topic: 4
Words: 0.349*"big" + 0.183*"news" + 0.182*"year" + 0.182*"billion" + 0.048*"may" + 0.036*"provide" + 0.007*"com" + 0.007*"world" + 0.007*"day"
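
Each printed topic is a weighted sum of keywords. A small sketch, using only the standard library, shows how such a string could be turned back into `(word, weight)` pairs; `parse_topic` and `TERM_RE` are hypothetical helpers, and the input is a shortened copy of Topic 0 above.

```python
import re

topic = '0.647*"year" + 0.086*"may" + 0.052*"billion" + 0.051*"news"'

# Each term has the form weight*"word"
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_str):
    """Turn a printed topic into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_str)]


pairs = parse_topic(topic)
print(pairs)
```

With a trained model at hand, `lda_model.show_topic(topic_id)` should return the pairs directly, so parsing is only needed when all that is left is the printed text.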


#### Compute Model Perplexity and Coherence Score

In [95]:
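# Compute Perplexity
# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# Perplexity is 2^(-bound): lower perplexity (a bound closer to zero) is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))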

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model
                                     , corpus=corpus
                                     , texts = list(processed_docs)
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -2.5590968742763436

Coherence Score:  0.4224021185538137
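
The number printed as "Perplexity" is the per-word likelihood bound returned by `log_perplexity`. As a minimal sketch, it can be converted to an actual perplexity with `2 ** (-bound)`; the variable names are mine, and the value is copied from the output above.

```python
log_bound = -2.5590968742763436  # value printed by log_perplexity above

# gensim reports perplexity as 2 ** (-bound); lower perplexity is better,
# which corresponds to a bound closer to zero.
perplexity = 2 ** (-log_bound)
print(round(perplexity, 2))
```

Because the bound here is computed on the training corpus, it is best read as a way to compare models rather than as a measure of how well a model generalizes.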


#### Visualize the topics-keywords

In [96]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis



#### LDA model with TF-IDF


In [97]:
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

In [98]:
# Map the ids back to words to see each word and its TF-IDF weight
[[(dictionary[word_id], weight) for word_id, weight in cp] for cp in corpus_tfidf[:1]]

[[('may', 0.6309418522375851), ('provide', 0.7758301225751714)]]
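
To show where weights like these come from, here is a plain-Python sketch of TF-IDF. It assumes what I understand to be gensim's default scheme (raw counts times a base-2 log inverse document frequency, then scaling each document to unit L2 length); the toy `corpus` and the `tfidf` helper are hypothetical and are not the notebook's objects.

```python
import math
from collections import Counter

# Toy bag-of-words corpus: lists of (token_id, count) pairs
corpus = [
    [(0, 1), (1, 1)],
    [(1, 2), (2, 1)],
    [(0, 1), (2, 1), (3, 3)],
    [(1, 1)],
]

n_docs = len(corpus)
doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)
# Inverse document frequency with log base 2 (assumed gensim default)
idf = {token_id: math.log2(n_docs / df) for token_id, df in doc_freq.items()}


def tfidf(doc):
    """Weight counts by idf, drop zero weights, then scale to unit L2 length."""
    weights = [(token_id, count * idf[token_id]) for token_id, count in doc]
    weights = [(t, w) for t, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []


corpus_tfidf = [tfidf(doc) for doc in corpus]
print(corpus_tfidf[0])
```

The normalization explains why the two weights in the notebook's first document, squared and summed, come out at 1.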

In [99]:
lda_tfidf = gensim.models.LdaMulticore(corpus_tfidf
                                       , num_topics=5
                                       , id2word=dictionary
                                       , iterations=50)

#### View the topic

In [100]:
# Topics are labeled from 1 here, while get_score reports zero-based topic ids
for idx, topic in lda_tfidf.print_topics():
    print('\nTopic: {}\nWords: {}'.format(idx+1, topic))



Topic: 1
Words: 0.484*"may" + 0.200*"billion" + 0.170*"provide" + 0.067*"big" + 0.017*"year" + 0.015*"world" + 0.015*"com" + 0.015*"news" + 0.015*"day"

Topic: 2
Words: 0.521*"year" + 0.200*"news" + 0.200*"big" + 0.015*"may" + 0.014*"provide" + 0.013*"billion" + 0.012*"world" + 0.012*"com" + 0.012*"day"

Topic: 3
Words: 0.317*"may" + 0.247*"billion" + 0.177*"provide" + 0.177*"big" + 0.018*"year" + 0.016*"world" + 0.016*"com" + 0.016*"news" + 0.016*"day"

Topic: 4
Words: 0.193*"billion" + 0.190*"com" + 0.146*"year" + 0.142*"may" + 0.093*"world" + 0.089*"big" + 0.069*"day" + 0.047*"news" + 0.031*"provide"

Topic: 5
Words: 0.243*"world" + 0.214*"news" + 0.158*"day" + 0.130*"com" + 0.092*"provide" + 0.062*"may" + 0.046*"big" + 0.046*"billion" + 0.008*"year"


#### Compute Perplexity & Coherence Score

In [101]:
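# Compute Perplexity
# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
# Perplexity is 2^(-bound): lower perplexity (a bound closer to zero) is better.
print('\nPerplexity: ', lda_tfidf.log_perplexity(corpus_tfidf))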

# Compute Coherence Score
coherence_model_tfidf = CoherenceModel(model=lda_tfidf
                                     , corpus=corpus_tfidf
                                     , texts = list(processed_docs)
                                     , dictionary=dictionary 
                                     ,coherence='c_v')

coherence_tfidf = coherence_model_tfidf.get_coherence()
print('\nCoherence Score: ', coherence_tfidf)



Perplexity:  -3.186376803651969

Coherence Score:  0.4224021185538137


Recorded results for other topic counts:
- Topics = 15: Perplexity: -9.27488119057329, Coherence Score: 0.24405747657214635
- Topics = 10: Perplexity: -9.00745369196948, Coherence Score: 0.2058747841509882
- Topics = 20: Perplexity: -9.656468835436096, Coherence Score: 0.3316541111751868

#### Visualize the topics

In [102]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_tfidf, corpus_tfidf, dictionary)
vis





In [104]:
# Test the bag-of-words model on a single document
print(corpus[10])
for word_id, freq in corpus[10]:
    print(dictionary[word_id], freq)

get_score(lda_model, corpus[10])

[(1, 1), (2, 1), (4, 2), (6, 2), (8, 1)]
provide 1
day 1
world 2
com 2
big 1

Score: 0.89858078956604
Topic: 1 
Word: 0.245*"world" + 0.196*"news" + 0.147*"day" + 0.147*"com" + 0.099*"provide" + 0.057*"billion" + 0.057*"big" + 0.050*"may" + 0.002*"year"

Score: 0.025722915306687355
Topic: 4 
Word: 0.349*"big" + 0.183*"news" + 0.182*"year" + 0.182*"billion" + 0.048*"may" + 0.036*"provide" + 0.007*"com" + 0.007*"world" + 0.007*"day"

Score: 0.025433992967009544
Topic: 2 
Word: 0.404*"may" + 0.401*"provide" + 0.043*"year" + 0.042*"big" + 0.042*"news" + 0.018*"com" + 0.017*"world" + 0.017*"billion" + 0.017*"day"

Score: 0.025217458605766296
Topic: 3 
Word: 0.317*"billion" + 0.315*"may" + 0.170*"com" + 0.169*"year" + 0.006*"big" + 0.006*"news" + 0.006*"provide" + 0.006*"world" + 0.006*"day"

Score: 0.025044839829206467
Topic: 0 
Word: 0.647*"year" + 0.086*"may" + 0.052*"billion" + 0.051*"news" + 0.051*"provide" + 0.051*"big" + 0.021*"world" + 0.021*"com" + 0.020*"day"


In [105]:

# Test the TF-IDF model on the same document
print(corpus_tfidf[10])
for word_id, weight in corpus_tfidf[10]:
    print(dictionary[word_id], weight)

get_score(lda_tfidf, corpus_tfidf[10])
    

[(1, 0.30151134457776363), (2, 0.30151134457776363), (4, 0.6030226891555273), (6, 0.6030226891555273), (8, 0.30151134457776363)]
provide 0.30151134457776363
day 0.30151134457776363
world 0.6030226891555273
com 0.6030226891555273
big 0.30151134457776363

Score: 0.7362167835235596
Topic: 4 
Word: 0.243*"world" + 0.214*"news" + 0.158*"day" + 0.130*"com" + 0.092*"provide" + 0.062*"may" + 0.046*"big" + 0.046*"billion" + 0.008*"year"

Score: 0.06635069847106934
Topic: 3 
Word: 0.193*"billion" + 0.190*"com" + 0.146*"year" + 0.142*"may" + 0.093*"world" + 0.089*"big" + 0.069*"day" + 0.047*"news" + 0.031*"provide"

Score: 0.06631165742874146
Topic: 2 
Word: 0.317*"may" + 0.247*"billion" + 0.177*"provide" + 0.177*"big" + 0.018*"year" + 0.016*"world" + 0.016*"com" + 0.016*"news" + 0.016*"day"

Score: 0.06595547497272491
Topic: 1 
Word: 0.521*"year" + 0.200*"news" + 0.200*"big" + 0.015*"may" + 0.014*"provide" + 0.013*"billion" + 0.012*"world" + 0.012*"com" + 0.012*"day"

Score: 0.06516535580158234
