This notebook follows the tutorial from this [post](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) by Susan Li.

In [1]:
import pandas as pd
import numpy as np
import gensim
import string
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

np.random.seed(2018)

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brianesamson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import json

# Use a context manager so the file handle is closed after loading.
with open("../raw/raw_textonly_waze.json") as json_file:
    json_data = json.load(json_file)
documents = pd.DataFrame(json_data)
documents

| | body | id |
|---|---|---|
| 0 | Seriously Waze whomever is in charge of UX nee... | 7nraji |
| 1 | I agree. And what's with the blue 'Go' buttons... | ds4cv5q |
| 2 | Yes! When selecting a destination from seach, ... | ds4mc2j |
| 3 | I totally agree with you and I will add: make ... | ds4mtd9 |
| 4 | I agree and disagree with little bits and piec... | ds52syn |
| 5 | I got used to the timer after a while. nowaday... | ds56b9r |
| 6 | Is anyone interested in conducting a usability... | ds5htj3 |
| 7 | THANK YOU! I think they should hire you to do... | dsda76f |
| 8 | Too bad I can't hear a dang thing he says with... | 7o0v80 |
| 9 | I have this problem with pretty much all the v... | ds68bj4 |


**Preprocessing**

- Tokenization
- Lemmatization
- Stemming
- Remove stopwords
- Remove words with <= 3 characters
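
To make the filtering steps in this list concrete, here is a minimal standard-library sketch. It is not what gensim or NLTK do internally: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for gensim's `STOPWORDS`, and lemmatization and stemming are left out entirely.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# gensim's STOPWORDS is far larger.
TOY_STOPWORDS = {"the", "that", "this", "with", "have", "from", "should"}

def toy_preprocess(text):
    """Lowercase, split on non-letters, drop stopwords and tokens of <= 3 characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 3]

toy_tokens = toy_preprocess(
    "Thank you for getting rid of the invisible button to control the sound."
)
print(toy_tokens)
```

Short words such as "you", "for", and "rid" fall out because of the length rule alone, even though they are not in the toy stopword set.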

In [3]:
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [4]:
# Select the post body by column name rather than by positional index.
doc_sample = documents.loc[documents['id'] == "7nraji", 'body'].iloc[0]
print('Original document: ')
words = doc_sample.split(' ')
   
print(words)
print('\n\nTokenized and lemmatized document: ')
print(preprocess(doc_sample))

Original document: 
['Seriously', 'Waze', 'whomever', 'is', 'in', 'charge', 'of', 'UX', 'needs', 'to', 'be', 'fired.', '', "It's", 'that', 'bad.', '', 'If', "you've", 'changed', 'UX', 'lead', 'in', 'the', 'past', 'few', 'years', 'then', 'whomever', 'is', 'in', 'charge', 'of', 'hiring', 'them', 'should', 'also', 'be', 'fired.', '', 'Please', 'consider', 'hiring', 'a', 'UX', 'consulting', 'team', 'with', 'a', 'solid', 'reputation', 'to', 'come', 'fix', 'your', 'UX', 'and', 'then', 'have', 'them', 'help', 'you', 'hire', 'a', 'competent', 'person', 'on', 'your', 'behalf.', '', 'Credit', 'where', 'credit', 'is', 'due.', '', 'You', 'fixed', 'this', 'a', 'while', 'ago,', 'but', 'I', "don't", 'want', 'you', 'to', 'think', 'it', 'went', 'unnoticed.', '', 'Thank', 'you', 'for', 'getting', 'rid', 'of', 'the', 'invisible', 'button', 'to', 'control', 'the', 'sound.', '', 'Not', 'having', 'invisible', 'buttons', 'should', 'be', 'a', 'no', 'brainer.', '', 'Thankfully', 'that', 'was', 'just', 'a', 'ni

In [5]:
processed_docs = documents['body'].map(preprocess)

In [6]:
processed_docs

0       [serious, waze, whomev, charg, need, fire, cha...
1       [agre, blue, button, stay, screen, second, wan...
2       [select, destin, seach, present, option, later...
3       [total, agre, screen, mind, option, choos, def...
4       [agre, disagre, littl, bit, piec, item, spot, ...
5       [timer, nowaday, start, drive, usual, know, se...
6       [interest, conduct, usabl, feedback, session, ...
7                                    [thank, think, hire]
8                                [hear, dang, thing, say]
9                            [problem, pretti, voic, tri]
10      [phone, hard, wire, stereo, cours, mute, stere...
11                                                     []
12                        [tri, chang, voic, show, wrong]
13                                        [updat, recent]
14      [android, auto, phone, know, plan, make, waze,...
15      [radio, silenc, launch, waze, android, auto, u...
16      [launch, beta, disabl, decid, wasn, readi, bet...
17      [waze,

**Create Bag of Words on the dataset**

Get all the unique words and assign an ID to them.
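
Conceptually, this step builds a token-to-ID mapping. The sketch below shows the idea in plain Python; `build_token_ids` is a hypothetical helper, and it assigns IDs in order of first appearance, which is an assumption and may not match the order gensim's `Dictionary` uses.

```python
def build_token_ids(tokenized_docs):
    """Map each unique token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Two toy documents, already tokenized and stemmed.
toy_docs = [["waze", "rout", "time"], ["rout", "speed", "limit"]]
toy_ids = build_token_ids(toy_docs)
print(toy_ids)
```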

In [7]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [8]:
count = 0
for k, v in dictionary.items():  # items() is available on Python 3; iteritems() is a Python 2-era alias
    print(k, v)
    count += 1
    if count > 10:
        break

0 accid
1 accur
2 adsthey
3 area
4 aspect
5 avoid
6 base
7 behalf
8 better
9 brainer
10 briefli


**Filter words** | [Source](https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html)

Filter out tokens that appear in

- fewer than `no_below` documents (an absolute number), or
- more than `no_above` documents (a fraction of the total corpus size, not an absolute number).

After both filters, keep only the `keep_n` most frequent tokens (or keep all of them if `keep_n` is `None`).
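
The filtering rules above can be sketched in plain Python as follows. This is an approximation for illustration, not gensim's implementation: `toy_filter_extremes` is a hypothetical helper, it compares against `no_above * num_docs` without rounding, and it breaks frequency ties alphabetically.

```python
from collections import Counter

def toy_filter_extremes(tokenized_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(tokenized_docs)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    # Keep tokens that are neither too rare nor too common.
    kept = [tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs]
    # Most frequent first; ties broken alphabetically.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept if keep_n is None else kept[:keep_n]

toy_docs = [["waze", "rout"], ["waze", "speed"], ["waze", "rout", "map"], ["waze", "map"]]
toy_kept = toy_filter_extremes(toy_docs, no_below=2, no_above=0.5)
print(toy_kept)
```

Here "waze" is dropped for appearing in every document and "speed" for appearing in only one.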

In [9]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

**Convert corpus to use the BoW IDs and count the word frequency per document**
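
As a rough picture of what a bag-of-words conversion produces, the sketch below counts token IDs with the standard library. `toy_doc2bow` and `toy_vocab` are hypothetical names; the sketch assumes that tokens missing from the vocabulary are simply skipped and that the result is sorted by ID.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_vocab = {"waze": 0, "rout": 1, "speed": 2}
toy_bow = toy_doc2bow(["rout", "waze", "rout", "unknown"], toy_vocab)
print(toy_bow)
```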

In [24]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus

[[(0, 3),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 3),
  (7, 1),
  (8, 1),
  (9, 2),
  (10, 2),
  (11, 1),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 2),
  (21, 2),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 2),
  (27, 1),
  (28, 3),
  (29, 3),
  (30, 1),
  (31, 2),
  (32, 1),
  (33, 1),
  (34, 2),
  (35, 3),
  (36, 1),
  (37, 3),
  (38, 1),
  (39, 1),
  (40, 2),
  (41, 1),
  (42, 1),
  (43, 2),
  (44, 1),
  (45, 4),
  (46, 6),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 2),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 2),
  (60, 1),
  (61, 1),
  (62, 2),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 2),
  (70, 1),
  (71, 1),
  (72, 7),
  (73, 1),
  (74, 2),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 3),
  (81, 1),
  (82, 1)],
 [(6, 1),
  (59, 1),
  (60, 1),
  (77, 1),
  (83, 1),
  (84, 1),
  (85, 1),
  (86, 1),
  (87, 1)

In [11]:
bow_doc_32 = bow_corpus[32]

# Each entry is a (token_id, count) pair; look the token up in the dictionary.
for token_id, token_count in bow_doc_32:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 2 ("area") appears 1 time.
Word 11 ("concern") appears 1 time.
Word 51 ("person") appears 1 time.
Word 168 ("decid") appears 1 time.
Word 206 ("minut") appears 1 time.
Word 213 ("car") appears 1 time.
Word 214 ("comment") appears 2 time.
Word 215 ("commut") appears 1 time.
Word 216 ("exit") appears 3 time.
Word 217 ("hand") appears 1 time.
Word 218 ("life") appears 1 time.
Word 219 ("liter") appears 1 time.
Word 220 ("mile") appears 1 time.
Word 221 ("near") appears 1 time.
Word 222 ("number") appears 1 time.
Word 223 ("posit") appears 1 time.
Word 224 ("previous") appears 1 time.
Word 225 ("shoulder") appears 1 time.
Word 226 ("stand") appears 1 time.
Word 227 ("store") appears 1 time.
Word 228 ("time") appears 1 time.
Word 229 ("wait") appears 1 time.


TF-IDF
===

First, calculate the inverse document frequencies for all terms in the training corpus.

In [12]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

Then, transform the count representations into the TF-IDF space.

In [13]:
corpus_tfidf = tfidf[bow_corpus]
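
The two steps above can be illustrated with a small standard-library sketch. It assumes the weighting that gensim's `TfidfModel` documents as its default (raw term count, times log base 2 of total documents over document frequency, then scaling to unit length); `toy_tfidf` and the numbers are hypothetical.

```python
import math

def toy_tfidf(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # Terms that occur in every document get weight 0 and are dropped.
    return [(tid, w / norm) for tid, w in weights if w > 0]

# Toy corpus of 4 documents: token 0 is in all of them, token 2 in only one.
toy_doc_freq = {0: 4, 1: 2, 2: 1}
toy_weights = toy_tfidf([(0, 3), (1, 1), (2, 1)], toy_doc_freq, num_docs=4)
print(toy_weights)
```

Token 0 appears three times in the document but in every document of the toy corpus, so its weight is zero; the rarer token 2 ends up with twice the weight of token 1.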

In [14]:
from pprint import pprint
    
pprint(corpus_tfidf[0])

[(0, 0.22462224687181556),
 (1, 0.069323819729907574),
 (2, 0.04884387983019952),
 (3, 0.061742209422726929),
 (4, 0.12715537262900292),
 (5, 0.056366136014774977),
 (6, 0.16811453628230397),
 (7, 0.087427413294406064),
 (8, 0.050183026438512711),
 (9, 0.12860797330864307),
 (10, 0.11186081705474693),
 (11, 0.083689436863604569),
 (12, 0.14623868245464181),
 (13, 0.070119622435790063),
 (14, 0.086602919282682239),
 (15, 0.058370873243421946),
 (16, 0.050480585626894135),
 (17, 0.085820726059680899),
 (18, 0.083689436863604569),
 (19, 0.048510181075915294),
 (20, 0.070729405949789173),
 (21, 0.13431770988288486),
 (22, 0.070674208081860118),
 (23, 0.055506786160318816),
 (24, 0.069584500001751023),
 (25, 0.085076704132095102),
 (26, 0.14494066319620202),
 (27, 0.048979431163222738),
 (28, 0.12336512421790684),
 (29, 0.15982774776502137),
 (30, 0.062884411270872714),
 (31, 0.097285571206559479),
 (32, 0.068083345538079626),
 (33, 0.076032823646398573),
 (34, 0.1537146353162448),
 (35, 0.

LDA using Bag of Words
===

In [39]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=20, id2word=dictionary, passes=50, workers=2)

In [40]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.033*"report" + 0.027*"waze" + 0.022*"road" + 0.021*"speed" + 0.020*"rout" + 0.017*"time" + 0.015*"think" + 0.013*"work" + 0.012*"limit" + 0.012*"map"
Topic: 1 
Words: 0.070*"waze" + 0.025*"point" + 0.025*"googl" + 0.019*"map" + 0.017*"report" + 0.015*"featur" + 0.014*"want" + 0.013*"rout" + 0.012*"drive" + 0.009*"need"
Topic: 2 
Words: 0.048*"waze" + 0.025*"drive" + 0.022*"turn" + 0.017*"android" + 0.013*"like" + 0.013*"time" + 0.011*"know" + 0.011*"go" + 0.010*"road" + 0.010*"phone"
Topic: 3 
Words: 0.104*"waze" + 0.022*"https" + 0.018*"problem" + 0.015*"googl" + 0.015*"volum" + 0.012*"forum" + 0.011*"phone" + 0.011*"map" + 0.010*"rout" + 0.010*"user"
Topic: 4 
Words: 0.050*"work" + 0.037*"waze" + 0.024*"issu" + 0.020*"time" + 0.019*"rout" + 0.013*"server" + 0.010*"home" + 0.009*"phone" + 0.009*"problem" + 0.009*"month"
Topic: 5 
Words: 0.060*"waze" + 0.021*"thank" + 0.021*"rout" + 0.013*"like" + 0.012*"want" + 0.012*"play" + 0.011*"screen" + 0.011*"traffic" + 0.011

LDA using TF-IDF
===

In [15]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [16]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.014*"thank" + 0.013*"road" + 0.012*"rout" + 0.011*"work" + 0.011*"waze" + 0.009*"googl" + 0.008*"user" + 0.008*"https" + 0.008*"think" + 0.007*"speed"
Topic: 1 Word: 0.014*"speed" + 0.013*"phone" + 0.010*"imgur" + 0.010*"waze" + 0.010*"yeah" + 0.009*"drive" + 0.009*"limit" + 0.009*"https" + 0.008*"googl" + 0.008*"go"
Topic: 2 Word: 0.022*"waze" + 0.014*"editor" + 0.014*"https" + 0.013*"like" + 0.009*"say" + 0.009*"android" + 0.009*"need" + 0.009*"camera" + 0.008*"know" + 0.008*"auto"
Topic: 3 Word: 0.018*"issu" + 0.015*"waze" + 0.011*"screen" + 0.011*"iphon" + 0.009*"thing" + 0.009*"appl" + 0.008*"drive" + 0.008*"map" + 0.008*"navig" + 0.008*"phone"
Topic: 4 Word: 0.013*"waze" + 0.012*"time" + 0.012*"updat" + 0.012*"delet" + 0.011*"work" + 0.011*"know" + 0.011*"voic" + 0.010*"beta" + 0.008*"issu" + 0.007*"sound"
Topic: 5 Word: 0.013*"rout" + 0.013*"waze" + 0.009*"batteri" + 0.009*"better" + 0.009*"map" + 0.009*"go" + 0.008*"drive" + 0.008*"destin" + 0.008*"time" + 0.00

Sample classification of documents to topics
===

**Using the LDA Bag of Words model**

In [17]:
processed_docs[0]

['serious',
 'waze',
 'whomev',
 'charg',
 'need',
 'fire',
 'chang',
 'lead',
 'past',
 'year',
 'whomev',
 'charg',
 'hire',
 'fire',
 'consid',
 'hire',
 'consult',
 'team',
 'solid',
 'reput',
 'come',
 'help',
 'hire',
 'compet',
 'person',
 'behalf',
 'credit',
 'credit',
 'fix',
 'want',
 'think',
 'go',
 'unnot',
 'thank',
 'get',
 'invis',
 'button',
 'control',
 'sound',
 'have',
 'invis',
 'button',
 'brainer',
 'thank',
 'nitpick',
 'world',
 'unsaf',
 'sad',
 'current',
 'usabl',
 'unsaf',
 'float',
 'instruct',
 'horizont',
 'mode',
 'import',
 'rule',
 'design',
 'element',
 'base',
 'user',
 'input',
 'form',
 'button',
 'move',
 'user',
 'move',
 'mous',
 'real',
 'world',
 'exampl',
 'mous',
 'toolbar',
 'icon',
 'magnifi',
 'nitpick',
 'exampl',
 'world',
 'unfortun',
 'safeti',
 'concern',
 'float',
 'instruct',
 'move',
 'base',
 'vehicl',
 'data',
 'intent',
 'purpos',
 'move',
 'float',
 'instruct',
 'come',
 'stop',
 'light',
 'need',
 'turn',
 'leav',
 'need',


In [18]:
# Requires lda_model from the "LDA using Bag of Words" section; train it before running this cell.
for index, score in sorted(lda_model[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

NameError: name 'lda_model' is not defined

**Using the LDA TFIDF model**

In [20]:
# lda_model_tfidf was trained on corpus_tfidf, but this scores the raw bag-of-words vector.
for index, score in sorted(lda_model_tfidf[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.25986236333847046	 
Topic: 0.022*"waze" + 0.014*"editor" + 0.014*"https" + 0.013*"like" + 0.009*"say" + 0.009*"android" + 0.009*"need" + 0.009*"camera" + 0.008*"know" + 0.008*"auto"

Score: 0.23338556289672852	 
Topic: 0.014*"waze" + 0.011*"report" + 0.011*"point" + 0.010*"carplay" + 0.010*"rout" + 0.010*"problem" + 0.009*"phone" + 0.009*"start" + 0.008*"know" + 0.007*"turn"

Score: 0.2006094604730606	 
Topic: 0.013*"rout" + 0.013*"waze" + 0.009*"batteri" + 0.009*"better" + 0.009*"map" + 0.009*"go" + 0.008*"drive" + 0.008*"destin" + 0.008*"time" + 0.007*"button"

Score: 0.13958364725112915	 
Topic: 0.018*"waze" + 0.017*"work" + 0.016*"report" + 0.012*"rout" + 0.011*"time" + 0.010*"sure" + 0.009*"drive" + 0.009*"stop" + 0.009*"googl" + 0.008*"map"

Score: 0.062335409224033356	 
Topic: 0.023*"thank" + 0.015*"android" + 0.014*"phone" + 0.011*"waze" + 0.011*"problem" + 0.010*"auto" + 0.009*"work" + 0.008*"live" + 0.008*"remov" + 0.008*"point"

Score: 0.05805535241961479	 
Topic: 

**On an unseen document**

In [23]:
unseen_document = 'Waze does not recommend my usual route. Why?'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))

Score: 0.8199449181556702	 Topic: 0.023*"thank" + 0.015*"android" + 0.014*"phone" + 0.011*"waze" + 0.011*"problem"
Score: 0.020008079707622528	 Topic: 0.013*"waze" + 0.012*"time" + 0.012*"updat" + 0.012*"delet" + 0.011*"work"
Score: 0.020007383078336716	 Topic: 0.018*"waze" + 0.017*"work" + 0.016*"report" + 0.012*"rout" + 0.011*"time"
Score: 0.020007187500596046	 Topic: 0.014*"thank" + 0.013*"road" + 0.012*"rout" + 0.011*"work" + 0.011*"waze"
Score: 0.02000717632472515	 Topic: 0.013*"rout" + 0.013*"waze" + 0.009*"batteri" + 0.009*"better" + 0.009*"map"
Score: 0.020006539300084114	 Topic: 0.014*"waze" + 0.011*"report" + 0.011*"point" + 0.010*"carplay" + 0.010*"rout"
Score: 0.02000603638589382	 Topic: 0.022*"waze" + 0.014*"editor" + 0.014*"https" + 0.013*"like" + 0.009*"say"
Score: 0.020004646852612495	 Topic: 0.014*"speed" + 0.013*"phone" + 0.010*"imgur" + 0.010*"waze" + 0.010*"yeah"
Score: 0.020004315301775932	 Topic: 0.018*"issu" + 0.015*"waze" + 0.011*"screen" + 0.011*"iphon" + 0.009