### Big thanks to Susan Li for the awesome tutorial on LDA
[https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24]

This notebook was inspired by the Medium article above and was written by Advith Chegu for HackDown 2020. Additional sources:

[https://medium.com/@osas.usen/topic-extraction-from-tweets-using-lda-a997e4eb0985]

[https://www.kaggle.com/therohk/million-headlines/data]

In [2]:
import pandas as pd

# Load the news dataset (one JSON object per line) and keep only the headlines.
data = pd.read_json('News_Category_Dataset_v2.json', lines=True)
# Take an explicit copy so that adding a column does not raise SettingWithCopyWarning.
data_text = data[['headline']].copy()
data_text['index'] = data_text.index
documents = data_text


In [3]:
print(len(documents))
print(documents[:5])

200853
                                            headline  index
0  There Were 2 Mass Shootings In Texas Last Week...      0
1  Will Smith Joins Diplo And Nicky Jam For The 2...      1
2    Hugh Grant Marries For The First Time At Age 57      2
3  Jim Carrey Blasts 'Castrato' Adam Schiff And D...      3
4  Julianna Margulies Uses Donald Trump Poop Bags...      4


In [5]:
import gensim
import nltk
import numpy as np
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

np.random.seed(2018)      # seed NumPy's global random generator
nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/advithchegu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
stemmer = SnowballStemmer("english")
stop_words = ['http']

In [7]:
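lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its Snowball stem.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer, then lemmatize and stem.
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3 and token not in stop_words and '@' not in token:
            result.append(lemmatize_stemming(token))
    return result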

In [8]:
doc_sample = documents.loc[4310, 'headline']
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['5', 'Numbers', 'To', 'Have', 'Handy', 'When', 'Men', 'Ask', 'Why', 'There', 'Is', 'An', 'International', "Women's", 'Day']


 tokenized and lemmatized document: 
['number', 'handi', 'intern', 'women']


In [9]:
processed_docs = documents['headline'].map(preprocess)
processed_docs[:10]

0                            [mass, shoot, texa, week]
1     [smith, join, diplo, nicki, world, offici, song]
2                           [hugh, grant, marri, time]
3    [carrey, blast, castrato, adam, schiff, democr...
4    [julianna, marguli, use, donald, trump, poop, ...
5    [morgan, freeman, devast, sexual, harass, clai...
6     [donald, trump, lovin, mcdonald, jingl, tonight]
7                         [watch, amazon, prime, week]
8    [mike, myer, reveal, like, fourth, austin, pow...
9                                  [watch, hulu, week]
Name: headline, dtype: object

In [10]:
dictionary = gensim.corpora.Dictionary(processed_docs)
# Peek at the first 11 (id, token) pairs of the vocabulary.
for count, (k, v) in enumerate(dictionary.items()):
    if count > 10:
        break
    print(k, v)

0 mass
1 shoot
2 texa
3 week
4 diplo
5 join
6 nicki
7 offici
8 smith
9 song
10 world


In [11]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
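
The call above prunes the vocabulary by document frequency: tokens that appear in fewer than `no_below` documents or in more than the fraction `no_above` of all documents are dropped, and at most the `keep_n` most frequent survivors are kept. The following is a minimal pure-Python sketch of that rule on a toy corpus. The helper `filter_extremes_sketch` and the toy headlines are hypothetical and only illustrate the idea; this is not gensim's implementation.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency pruning (illustrative only)."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Convert the no_above fraction into an absolute document count.
    max_docs = int(no_above * len(docs))
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Hypothetical toy corpus of already-preprocessed headlines.
toy_docs = [
    ["trump", "week", "photo"],
    ["week", "photo", "travel"],
    ["week", "women", "photo"],
    ["week", "game"],
]
surviving = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(surviving)  # "week" is too common, the single-document tokens are too rare
```

With the notebook's settings (`no_below=15`, `no_above=0.5`), a token has to appear in at least 15 headlines but in no more than half of them to stay in the dictionary.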

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(109, 1), (2324, 1), (2719, 1), (4380, 1)]

In [13]:
bow_doc_4310 = bow_corpus[4310]
for word_id, word_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], word_count))

Word 109 ("women") appears 1 time.
Word 2324 ("number") appears 1 time.
Word 2719 ("intern") appears 1 time.
Word 4380 ("handi") appears 1 time.
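
Each bag-of-words vector is a list of `(token_id, count)` pairs, sorted by id, in which tokens missing from the dictionary are simply ignored. A minimal sketch of that conversion is shown here; the helper `doc2bow_sketch` and the small `token2id` mapping are hypothetical and only reuse the four ids printed above.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary holding only the four ids shown in the output above.
token2id = {"women": 109, "number": 2324, "intern": 2719, "handi": 4380}

# "unknownword" is not in the vocabulary, so it does not show up in the result.
bow = doc2bow_sketch(["number", "handi", "intern", "women", "unknownword"], token2id)
print(bow)
```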


In [14]:
from pprint import pprint
from gensim import models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
# Show the TF-IDF weights of the first document only.
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.6086420674271222),
 (1, 0.44082971680772476),
 (2, 0.534274478977891),
 (3, 0.38700746200837344)]
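
The four weights above have a squared sum of about 1, which is what a length-normalized TF-IDF vector looks like. The sketch below assumes the weighting is the raw term count times `log2(N / df)`, followed by L2 normalization per document; that is my reading of gensim's defaults and should be treated as an assumption. The helper `tfidf_sketch` and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(bow_docs):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalize (illustrative only)."""
    n_docs = len(bow_docs)
    # Document frequency of every token id.
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    result = []
    for doc in bow_docs:
        weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens present in every document get weight 0 and are dropped.
        result.append([(tid, w / norm) for tid, w in weights if w > 0])
    return result

# Hypothetical bag-of-words vectors: token 0 is common, token 1 is rarer.
toy_bows = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (1, 1)], [(3, 1)]]
tfidf_docs = tfidf_sketch(toy_bows)
for tfidf_doc in tfidf_docs:
    print(tfidf_doc)
```

In the first toy document the rarer token 1 receives a larger weight than the common token 0, the same pattern as in the real output, where "mass" outweighs the very frequent "week".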


In [18]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=40, id2word=dictionary, passes=2, workers=2)

In [26]:
print(lda_model)

LdaModel(num_terms=7260, num_topics=40, decay=0.5, chunksize=2000)


In [19]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.048*"network" + 0.045*"test" + 0.040*"green" + 0.039*"updat" + 0.031*"beach" + 0.025*"compani" + 0.025*"sale" + 0.024*"california" + 0.023*"pregnant" + 0.022*"photo"
Topic: 1 
Words: 0.067*"learn" + 0.057*"worst" + 0.039*"cover" + 0.037*"social" + 0.029*"soul" + 0.029*"media" + 0.027*"appl" + 0.023*"near" + 0.021*"biggest" + 0.020*"photo"
Topic: 2 
Words: 0.189*"week" + 0.116*"parent" + 0.076*"women" + 0.071*"lose" + 0.043*"risk" + 0.034*"death" + 0.026*"advic" + 0.024*"young" + 0.022*"best" + 0.021*"pictur"
Topic: 3 
Words: 0.080*"better" + 0.078*"weight" + 0.068*"plan" + 0.050*"loss" + 0.029*"parti" + 0.028*"husband" + 0.026*"restaur" + 0.025*"win" + 0.025*"import" + 0.024*"cool"
Topic: 4 
Words: 0.052*"photo" + 0.040*"inspir" + 0.038*"day" + 0.034*"home" + 0.032*"long" + 0.028*"winter" + 0.027*"onlin" + 0.026*"video" + 0.024*"buy" + 0.023*"expert"
Topic: 5 
Words: 0.072*"wear" + 0.060*"spring" + 0.037*"relationship" + 0.037*"huffpost" + 0.031*"cloth" + 0.030*"chal

In [20]:
processed_docs[4310]

['number', 'handi', 'intern', 'women']

In [21]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.20511244237422943	 
Topic: 0.048*"network" + 0.045*"test" + 0.040*"green" + 0.039*"updat" + 0.031*"beach" + 0.025*"compani" + 0.025*"sale" + 0.024*"california" + 0.023*"pregnant" + 0.022*"photo"

Score: 0.20506323873996735	 
Topic: 0.151*"live" + 0.059*"nation" + 0.046*"propos" + 0.044*"countri" + 0.028*"matter" + 0.026*"habit" + 0.023*"blood" + 0.022*"boost" + 0.022*"chicken" + 0.020*"number"

Score: 0.20502015948295593	 
Topic: 0.189*"week" + 0.116*"parent" + 0.076*"women" + 0.071*"lose" + 0.043*"risk" + 0.034*"death" + 0.026*"advic" + 0.024*"young" + 0.022*"best" + 0.021*"pictur"

Score: 0.20474053919315338	 
Topic: 0.079*"celebr" + 0.065*"summer" + 0.056*"mother" + 0.054*"photo" + 0.049*"cancer" + 0.046*"healthi" + 0.039*"friend" + 0.029*"father" + 0.022*"avoid" + 0.020*"simpl"


In [22]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.20511186122894287	 
Topic: 0.048*"network" + 0.045*"test" + 0.040*"green" + 0.039*"updat" + 0.031*"beach" + 0.025*"compani" + 0.025*"sale" + 0.024*"california" + 0.023*"pregnant" + 0.022*"photo"

Score: 0.20506268739700317	 
Topic: 0.151*"live" + 0.059*"nation" + 0.046*"propos" + 0.044*"countri" + 0.028*"matter" + 0.026*"habit" + 0.023*"blood" + 0.022*"boost" + 0.022*"chicken" + 0.020*"number"

Score: 0.20501965284347534	 
Topic: 0.189*"week" + 0.116*"parent" + 0.076*"women" + 0.071*"lose" + 0.043*"risk" + 0.034*"death" + 0.026*"advic" + 0.024*"young" + 0.022*"best" + 0.021*"pictur"

Score: 0.2047426849603653	 
Topic: 0.079*"celebr" + 0.065*"summer" + 0.056*"mother" + 0.054*"photo" + 0.049*"cancer" + 0.046*"healthi" + 0.039*"friend" + 0.029*"father" + 0.022*"avoid" + 0.020*"simpl"


In [31]:
unseen_document = 'That football game was interesting'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.2676692306995392	 Topic: 0.130*"way" + 0.050*"game" + 0.049*"kate" + 0.045*"night" + 0.031*"video"
Score: 0.25623199343681335	 Topic: 0.072*"guid" + 0.071*"design" + 0.051*"date" + 0.048*"daughter" + 0.038*"video"
Score: 0.24483320116996765	 Topic: 0.106*"love" + 0.065*"photo" + 0.065*"travel" + 0.062*"babi" + 0.033*"happi"
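
Only three of the 40 topics are listed because `lda_model[bow_vector]` returns just the topics whose probability clears the model's `minimum_probability` threshold (0.01 by default, as far as I can tell from gensim's documentation). To assign a single label to a document, the highest-scoring pair can be picked. A minimal sketch follows; the helper `top_topics` is hypothetical, the topic ids are made up for illustration, and the scores are rounded from the output above.

```python
def top_topics(topic_scores, k=1):
    """Return the k (topic_id, score) pairs with the highest score."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical (topic_id, probability) pairs; the ids are invented for this example.
scores = [(7, 0.2448), (12, 0.2677), (31, 0.2562)]

best = top_topics(scores, k=1)
print(best)
```

With real data, `scores` would be replaced by `lda_model[bow_vector]`. Note that the three scores here are close to one another, so the top topic is only a weak label for this sentence.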
