# Topic Modeling and Latent Dirichlet Allocation (LDA)
## ADA Project Milestone 2

This notebook applies the following tutorial to the Quotebank 2016 data set:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24


In [35]:
import pandas as pd
import numpy as np
import gensim
import nltk
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from gensim import corpora, models
from pprint import pprint


## Part 1: Get & Parse the Data

In [36]:
# TESTING DATA

# Read the first 10000 quotations and store them in a DataFrame
with pd.read_json('Quotebank/quotes-2016.json.bz2', lines=True, compression='bz2', chunksize=10000) as df_reader:
    i = 0
    for chunk in df_reader:
        print(f"Chunk: {i}")
        test_df = chunk
        i += 1
        break
        
print(f'Processing chunk with {len(test_df)} rows')
print(test_df)
data_quotes = test_df[['quotation']]
print(data_quotes)



Chunk: 0
Processing chunk with 10000 rows
                quoteID                                          quotation  \
0     2016-12-26-000040  [ ] and Chris [ Jones ] were in there a lot an...   
1     2016-07-31-000006  [ And ] I don't know if we have enough time to...   
2     2016-09-06-000292  ... I feel like I was champion long before I l...   
3     2016-07-11-000226  [ I ] mmigration has been and continues to be ...   
4     2016-05-26-000371  [ It is ] the process of understanding what ki...   
...                 ...                                                ...   
9995  2016-08-06-031503  It's a great way to take students out into the...   
9996  2016-09-01-065324  It's a happy accident, to me, that nothing has...   
9997  2016-07-27-070749  It's a joyous occasion and I'd like to think a...   
9998  2016-08-15-052447  It's a little bit better than yesterday in som...   
9999  2016-05-04-057877  It's a little bit bittersweet for me because T...   

              speaker

In [37]:
# TRAINING DATA

data = pd.read_csv('NYTimes/NYTimes_topic_modeling_training.zip', compression='zip')
# Copy the column so that adding 'index' does not trigger a SettingWithCopyWarning
data_text = data[['quotation']].copy()
data_text['index'] = data_text.index
documents = data_text
print(documents.head())

                                           quotation  index
0  The public is certainly siding with McGregor b...      0
1  I cannot fathom why the Administration would p...      1
2  We will continue developing ballistic missile-...      2
3  Hey Lee / Your short game's good / but your lo...      3
4  There are many, many people who invest and par...      4


## Part 2: Tokenization & Lemmatization

In [38]:
np.random.seed(2018)
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/simonspangenberg/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Examples

In [39]:
# Lemmatization: like stemming, lemmatization reduces a word to a base form. Its output, the
# 'lemma', is a valid dictionary word with the same meaning, whereas a stem may not be a real word.
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [40]:
# Stemming: stemming extracts the base form of a word by stripping its affixes,
# much like cutting the branches of a tree down to its stem.
# For example, the stem of the words eating and eats is eat.
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data = {'original word': original_words, 'stemmed': singles})

  original word stemmed
0      caresses  caress
1         flies     fli
2          dies     die
3         mules    mule
4        denied    deni
5          died     die
6        agreed    agre
7         owned     own
8       humbled   humbl
9         sized    size


In [57]:
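# Functions we will use
lemmatizer = WordNetLemmatizer()
# Extra stop words on top of gensim's default list (built once, not per token)
my_stop_words = STOPWORDS.union({'know', 'think', 'want', 'go', 'get', 'need', 'thing', 'like', 'feel'})

def lemmatize_stemming(text):
    # Lemmatize as a verb, then reduce the lemma to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Stop words and short tokens are removed before lemmatization and stemming,
    # so inflected forms such as 'going' still end up as 'go' in the output
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in my_stop_words and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result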

In [58]:
sample = documents.loc[4310, 'quotation']

print('original quotation: ')
words = sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(sample))


original quotation: 
['Tomorrow,', 'whatever', 'the', 'conditions', 'are,', 'I', 'think', 'we', 'should', 'be', 'confident', 'that', 'we', 'can', 'score', 'points', 'and,', 'hopefully,', 'as', 'last', 'year,', 'we', 'can', 'have', 'both', 'cars', 'in', 'the', 'points', 'and', 'help', 'the', 'team', 'in', 'that', 'respect.']


 tokenized and lemmatized document: 
['tomorrow', 'condit', 'confid', 'score', 'point', 'hope', 'year', 'car', 'point', 'help', 'team', 'respect']


In [59]:
# Process all the quotations
processed_docs = documents['quotation'].map(preprocess)
print(processed_docs.head(30))

0     [public, certain, side, mcgregor, valu, wisegu...
1     [fathom, administr, pursu, cours, signal, chan...
2     [continu, develop, ballist, missil, defens, te...
3               [short, game, good, long, game, better]
4     [peopl, invest, particip, level, time, effort,...
5                                              [cyborg]
6     [transform, live, present, grow, challeng, hum...
7                  [presid, trump, hasn, decis, moment]
8     [insati, greed, check, wealth, aggreg, hand, r...
9     [talk, time, end, sexism, form, discrimin, bel...
10               [group, look, work, civil, right, len]
11    [constant, learn, connect, compani, come, see,...
12    [stakehold, talk, industri, size, biggest, sma...
13        [grow, strong, women, return, destroy, world]
14    [think, look, come, putter, laugh, go, famous,...
15    [role, cleaner, effici, fossil, fuel, nuclear,...
16         [shine, strong, spotlight, financ, campaign]
17    [year, inconsist, analyz, cost, benefit, r

## Part 3: Create a Dictionary of Words & Filter

In [60]:
# Create a dictionary from 'processed_docs' that maps each unique token in the training set to an integer id
dictionary = gensim.corpora.Dictionary(processed_docs)
# Print the first 11 (id, token) pairs, then the vocabulary size
for count, (k, v) in enumerate(dictionary.items()):
    if count > 10:
        break
    print(k, v)
print(len(dictionary))

0 certain
1 fight
2 mcgregor
3 public
4 side
5 valu
6 wiseguy
7 administr
8 align
9 american
10 autocrat
9428


In [61]:
# Filter out tokens that appear in fewer than 15 documents or in more than 50% of the documents.
# Then keep only the 100000 most frequent of the remaining tokens.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(113, 1),
 (170, 1),
 (199, 1),
 (247, 1),
 (309, 1),
 (374, 1),
 (377, 1),
 (481, 2),
 (656, 1),
 (800, 1),
 (1209, 1)]

In [62]:
bow_doc_100 = bow_corpus[100]
for token_id, token_count in bow_doc_100:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], token_count))

Word 37 ("effort") appears 1 time.
Word 379 ("combat") appears 1 time.
Word 380 ("rais") appears 1 time.
Word 381 ("revenu") appears 1 time.
Word 382 ("way") appears 2 time.
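
The filtering and bag-of-words steps in this part can be illustrated with a minimal pure-Python sketch. This is not gensim's implementation: the helper names `build_vocab` and `to_bow`, the toy documents, and the thresholds are all assumptions chosen for illustration, and the `keep_n` cap is left out.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents, then map each to an integer id."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, dropping out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical, already preprocessed toy documents
toy_docs = [
    ["game", "team", "play", "game"],
    ["team", "score", "point"],
    ["climat", "chang", "world"],
    ["climat", "polici", "chang", "team"],
]
vocab = build_vocab(toy_docs, no_below=2, no_above=0.5)
print(vocab)
print(to_bow(toy_docs[3], vocab))
```

In this toy corpus, "team" occurs in three of the four documents and is dropped as too common, while tokens seen in a single document are dropped as too rare. A document whose tokens were all filtered out ends up with an empty bag of words.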


## Part 4: Create a TF-IDF Model

TF-IDF stands for term frequency-inverse document frequency. It measures how relevant a word is to a document within a corpus: the weight increases with the number of times the word appears in the document but is offset by the number of documents in the corpus (data set) that contain the word.
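
As a rough illustration of this weighting, the sketch below computes TF-IDF by hand in pure Python. It assumes the raw term count as term frequency, log2(N / document frequency) as the inverse document frequency, and L2 normalisation per document, which to my understanding matches the defaults of gensim's `TfidfModel`; the function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document (assumed scheme:
    raw count * log2(N / df), followed by L2 normalisation)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # number of documents containing each token
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        if norm == 0:
            weighted.append({})  # every token occurs in every document
        else:
            weighted.append({t: w / norm for t, w in raw.items() if w > 0})
    return weighted

# Hypothetical, already preprocessed toy documents
toy_docs = [
    ["game", "team", "game", "play"],
    ["team", "play", "score"],
    ["team", "climat"],
    ["climat", "chang"],
]
weights = tfidf_weights(toy_docs)
for token, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{token}: {weight:.3f}")
```

In the first toy document, "game" gets the largest weight because it is repeated and occurs in no other document, while "team" gets the smallest because it occurs in three of the four documents.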

In [63]:
tfidf = models.TfidfModel(bow_corpus)

In [64]:
corpus_tfidf = tfidf[bow_corpus]

In [65]:
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.29397709448002163),
 (1, 0.3039120552343348),
 (2, 0.26511944366641427),
 (3, 0.803808648364201),
 (4, 0.32375649492983805)]


## Part 5: LDA

Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’.

Topic modeling is a statistical technique for discovering the abstract ‘topics’ that occur in a collection of documents. The idea is to perform unsupervised classification on the documents, which finds natural groups of documents by topic. Topic modeling can help answer questions such as:

    What is the topic/main idea of the document?
    Given a document, can we find another document with a similar topic?
    How do topics change over time?

Latent Dirichlet allocation (LDA) is one of the most popular methods for topic modeling. Each document consists of various words, and each topic is associated with some words. The aim of LDA is to find the topics a document belongs to on the basis of the words it contains. It assumes that documents with similar topics use a similar group of words. LDA represents each document as a probability distribution over latent topics, and each topic as a probability distribution over words.
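
The generative story that LDA assumes can be sketched in a few lines of standard-library Python. This is only an illustration of the model's assumptions, not of how gensim fits it: the two topics, their word probabilities, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document: draw a topic mixture for the document,
    then draw a topic and a word from that topic for every position."""
    theta = sample_dirichlet([alpha] * len(topic_word), rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(2018)
vocab = ["game", "team", "play", "climat", "chang", "world"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "climate" topic
]
theta, words = generate_document(topic_word, vocab, alpha=0.5, length=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Training runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document mixtures that most plausibly produced them.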

### Using Bag of Words

In [66]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=4, id2word=dictionary, passes=2, workers=2)

In [67]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.014*"go" + 0.013*"right" + 0.012*"climat" + 0.012*"play" + 0.011*"game" + 0.010*"look" + 0.010*"chang" + 0.009*"state" + 0.009*"see" + 0.009*"year"
Topic: 1 
Words: 0.029*"go" + 0.027*"peopl" + 0.017*"work" + 0.013*"trump" + 0.013*"come" + 0.009*"happen" + 0.009*"thing" + 0.009*"talk" + 0.008*"presid" + 0.008*"team"
Topic: 2 
Words: 0.016*"time" + 0.014*"chang" + 0.009*"take" + 0.008*"climat" + 0.008*"littl" + 0.008*"world" + 0.008*"get" + 0.007*"busi" + 0.007*"kind" + 0.007*"work"
Topic: 3 
Words: 0.016*"peopl" + 0.015*"good" + 0.013*"year" + 0.010*"play" + 0.008*"compani" + 0.007*"great" + 0.007*"game" + 0.007*"start" + 0.007*"better" + 0.007*"make"


### Using TF-IDF

In [68]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dictionary, passes=2, workers=4)

In [69]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.014*"go" + 0.010*"come" + 0.008*"peopl" + 0.007*"good" + 0.007*"work" + 0.006*"time" + 0.006*"care" + 0.006*"say" + 0.006*"chang" + 0.006*"trump"
Topic: 1 Word: 0.008*"year" + 0.008*"go" + 0.007*"peopl" + 0.007*"problem" + 0.007*"better" + 0.006*"player" + 0.006*"game" + 0.006*"team" + 0.006*"take" + 0.006*"focus"
Topic: 2 Word: 0.009*"look" + 0.008*"time" + 0.007*"peopl" + 0.007*"happen" + 0.007*"go" + 0.006*"get" + 0.006*"talk" + 0.005*"year" + 0.005*"chang" + 0.005*"point"
Topic: 3 Word: 0.009*"peopl" + 0.008*"go" + 0.008*"play" + 0.007*"right" + 0.005*"thing" + 0.005*"world" + 0.005*"deal" + 0.005*"import" + 0.005*"water" + 0.005*"work"


## Part 6: Performance Evaluation

In [70]:
# Should be classified in a sports topic
print(processed_docs[:3])

0    [public, certain, side, mcgregor, valu, wisegu...
1    [fathom, administr, pursu, cours, signal, chan...
2    [continu, develop, ballist, missil, defens, te...
Name: quotation, dtype: object


### On LDA BOW Model

In [71]:
for index, score in sorted(lda_model[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8894662261009216	 
Topic: 0.016*"peopl" + 0.015*"good" + 0.013*"year" + 0.010*"play" + 0.008*"compani" + 0.007*"great" + 0.007*"game" + 0.007*"start" + 0.007*"better" + 0.007*"make"

Score: 0.0373750664293766	 
Topic: 0.016*"time" + 0.014*"chang" + 0.009*"take" + 0.008*"climat" + 0.008*"littl" + 0.008*"world" + 0.008*"get" + 0.007*"busi" + 0.007*"kind" + 0.007*"work"

Score: 0.036617159843444824	 
Topic: 0.014*"go" + 0.013*"right" + 0.012*"climat" + 0.012*"play" + 0.011*"game" + 0.010*"look" + 0.010*"chang" + 0.009*"state" + 0.009*"see" + 0.009*"year"

Score: 0.036541566252708435	 
Topic: 0.029*"go" + 0.027*"peopl" + 0.017*"work" + 0.013*"trump" + 0.013*"come" + 0.009*"happen" + 0.009*"thing" + 0.009*"talk" + 0.008*"presid" + 0.008*"team"


### On LDA TF-IDF Model

In [72]:
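# Note: this model was trained on TF-IDF weights but, as in the tutorial, is queried here with raw
# bag-of-words counts; tfidf[bow_corpus[0]] would match the representation used for training.
for index, score in sorted(lda_model_tfidf[bow_corpus[0]], key=lambda tup: -1*tup[1]):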
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8871537446975708	 
Topic: 0.009*"look" + 0.008*"time" + 0.007*"peopl" + 0.007*"happen" + 0.007*"go" + 0.006*"get" + 0.006*"talk" + 0.005*"year" + 0.005*"chang" + 0.005*"point"

Score: 0.03781189024448395	 
Topic: 0.008*"year" + 0.008*"go" + 0.007*"peopl" + 0.007*"problem" + 0.007*"better" + 0.006*"player" + 0.006*"game" + 0.006*"team" + 0.006*"take" + 0.006*"focus"

Score: 0.037610866129398346	 
Topic: 0.014*"go" + 0.010*"come" + 0.008*"peopl" + 0.007*"good" + 0.007*"work" + 0.006*"time" + 0.006*"care" + 0.006*"say" + 0.006*"chang" + 0.006*"trump"

Score: 0.03742348402738571	 
Topic: 0.009*"peopl" + 0.008*"go" + 0.008*"play" + 0.007*"right" + 0.005*"thing" + 0.005*"world" + 0.005*"deal" + 0.005*"import" + 0.005*"water" + 0.005*"work"


## Part 7: Testing with Quotebank Data Set

In [87]:
quotebank_example = data_quotes['quotation'][7000]
print(quotebank_example)

A White House National Space Council can be useful if the president wants one and is willing to back it up when other White House offices, like the Office of Management and Budget, balk at its recommendations. If not, then it is a waste of resources,


In [88]:
bow_vector = dictionary.doc2bow(preprocess(quotebank_example))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5477136373519897	 Topic: 0.029*"go" + 0.027*"peopl" + 0.017*"work" + 0.013*"trump" + 0.013*"come"
Score: 0.42258450388908386	 Topic: 0.016*"peopl" + 0.015*"good" + 0.013*"year" + 0.010*"play" + 0.008*"compani"
Score: 0.014941971749067307	 Topic: 0.014*"go" + 0.013*"right" + 0.012*"climat" + 0.012*"play" + 0.011*"game"
Score: 0.014759933575987816	 Topic: 0.016*"time" + 0.014*"chang" + 0.009*"take" + 0.008*"climat" + 0.008*"littl"
