This notebook follows the tutorial from this [post](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) by Susan Li.

In [1]:
import pandas as pd
import numpy as np
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

np.random.seed(2018)

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brianesamson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
import json

# Use a context manager so the file handle is closed after loading.
with open("../raw/raw_textonly_waze.json") as json_file:
    json_data = json.load(json_file)
documents = pd.DataFrame(json_data)
documents

|   | body | sub_id |
|---|------|--------|
| 0 | Seriously Waze whomever is in charge of UX nee... | 7nraji |
| 1 | I agree. And what's with the blue Go buttons t... | ds4cv5q |
| 2 | Yes! When selecting a destination from seach, ... | ds4mc2j |
| 3 | I totally agree with you and I will add: make ... | ds4mtd9 |
| 4 | I agree and disagree with little bits and piec... | ds52syn |
| 5 | I got used to the timer after a while. nowaday... | ds56b9r |
| 6 | Is anyone interested in conducting a usability... | ds5htj3 |
| 7 | THANK YOU! I think they should hire you to do... | dsda76f |
| 8 | Too bad I can't hear a dang thing he says with... | 7o0v80 |
| 9 | I have this problem with pretty much all the v... | ds68bj4 |


**Preprocessing**

- Tokenization
- Lemmatization
- Stemming
- Remove stopwords
- Remove words with 3 or fewer characters

In [25]:
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [26]:
# Select the body text of the submission with ID "7nraji" by column name.
doc_sample = documents.loc[documents['sub_id'] == "7nraji", 'body'].iloc[0]
print('Original document: ')
words = doc_sample.split(' ')

print(words)
print('\n\nTokenized and lemmatized document: ')
print(preprocess(doc_sample))

Original document: 
['Seriously', 'Waze', 'whomever', 'is', 'in', 'charge', 'of', 'UX', 'needs', 'to', 'be', 'fired.', '', "It's", 'that', 'bad.', '', 'If', "you've", 'changed', 'UX', 'lead', 'in', 'the', 'past', 'few', 'years', 'then', 'whomever', 'is', 'in', 'charge', 'of', 'hiring', 'them', 'should', 'also', 'be', 'fired.', '', 'Please', 'consider', 'hiring', 'a', 'UX', 'consulting', 'team', 'with', 'a', 'solid', 'reputation', 'to', 'come', 'fix', 'your', 'UX', 'and', 'then', 'have', 'them', 'help', 'you', 'hire', 'a', 'competent', 'person', 'on', 'your', 'behalf.', '', 'Credit', 'where', 'credit', 'is', 'due.', '', 'You', 'fixed', 'this', 'a', 'while', 'ago,', 'but', 'I', "don't", 'want', 'you', 'to', 'think', 'it', 'went', 'unnoticed.', '', 'Thank', 'you', 'for', 'getting', 'rid', 'of', 'the', 'invisible', 'button', 'to', 'control', 'the', 'sound.', '', 'Not', 'having', 'invisible', 'buttons', 'should', 'be', 'a', 'no', 'brainer.', '', 'Thankfully', 'that', 'was', 'just', 'a', 'ni

In [27]:
processed_docs = documents['body'].map(preprocess)

In [28]:
processed_docs

0       [serious, waze, whomev, charg, need, fire, cha...
1       [agre, blue, button, stay, screen, second, wan...
2       [select, destin, seach, present, option, later...
3       [total, agre, screen, mind, option, choos, def...
4       [agre, disagre, littl, bit, piec, item, spot, ...
5       [timer, nowaday, start, drive, usual, know, se...
6       [interest, conduct, usabl, feedback, session, ...
7                                    [thank, think, hire]
8                                [hear, dang, thing, say]
9                            [problem, pretti, voic, tri]
10      [phone, hard, wire, stereo, cours, mute, stere...
11                                                     []
12                        [tri, chang, voic, show, wrong]
13                                        [updat, recent]
14      [android, auto, phone, know, plan, make, waze,...
15      [radio, silenc, launch, waze, android, auto, u...
16      [launch, beta, disabl, decid, wasn, readi, bet...
17      [waze,

**Create Bag of Words on the dataset**

Get all the unique words and assign an ID to them.

In [29]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [30]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 accid
1 accur
2 adsthey
3 area
4 aspect
5 avoid
6 base
7 behalf
8 better
9 brainer
10 briefli
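As a rough illustration of what the dictionary step does, the sketch below builds a token-to-ID mapping in plain Python. The function name `build_token_ids` is hypothetical, and this is not gensim's implementation. It assumes IDs are handed out document by document, with each document's new tokens taken in sorted order, which would be consistent with the alphabetical listing above.

```python
def build_token_ids(docs):
    """Assign an integer ID to every unique token in a list of token lists."""
    token2id = {}
    for doc in docs:
        # Visit each document's unique tokens in sorted order (an assumption
        # about ordering; only first-seen tokens receive a new ID).
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


sample_docs = [["waze", "button", "screen"], ["screen", "agre", "waze"]]
token2id = build_token_ids(sample_docs)
for token, token_id in token2id.items():
    print(token_id, token)
```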


**Filter words** | [Source](https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html)

Filter out tokens that appear in

- fewer than `no_below` documents (absolute number) or
- more than `no_above` documents (fraction of total corpus size, not absolute number).

After both filters, keep only the `keep_n` most frequent tokens (or keep all if `keep_n` is `None`).

In [31]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
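The filtering rules can be sketched in plain Python as follows. This is an illustration of the idea rather than gensim's code: `filter_tokens` is a hypothetical helper, "most frequent" is assumed to mean highest document frequency, and ties are broken alphabetically purely for determinism.

```python
from collections import Counter


def filter_tokens(docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    # no_above is a fraction of the corpus, so convert it to a document count.
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]

    # Most frequent first; the alphabetical tie-break is an assumption.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept if keep_n is None else kept[:keep_n]


toy_docs = [["a", "b"], ["a", "c"], ["a", "b"], ["d", "b"]]
print(filter_tokens(toy_docs, no_below=2, no_above=0.75))
```

With `no_below=15` and `no_above=0.5`, as in the cell above, a token would need to appear in at least 15 documents but in no more than half of the corpus to be kept.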

**Convert the corpus to use the BoW IDs and count the word frequency per document**

In [34]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[1]

[(6, 1),
 (59, 1),
 (60, 1),
 (77, 1),
 (83, 1),
 (84, 1),
 (85, 1),
 (86, 1),
 (87, 1)]
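Each pair in the output above is `(token_id, count)`. A minimal sketch of that conversion, assuming a plain `token2id` dict and a hypothetical helper name `to_bow`, might look like this. Tokens missing from the mapping (for example, ones removed by filtering) are simply skipped.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


# Illustrative mapping, not the notebook's actual dictionary.
example_ids = {"exit": 0, "wait": 1}
print(to_bow(["exit", "wait", "exit", "unknown"], example_ids))
```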

In [35]:
bow_doc_32 = bow_corpus[32]

# Each entry is a (token_id, count) pair.
for token_id, count in bow_doc_32:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 2 ("area") appears 1 time.
Word 11 ("concern") appears 1 time.
Word 51 ("person") appears 1 time.
Word 168 ("decid") appears 1 time.
Word 204 ("minut") appears 1 time.
Word 211 ("car") appears 1 time.
Word 212 ("comment") appears 2 time.
Word 213 ("commut") appears 1 time.
Word 214 ("exit") appears 3 time.
Word 215 ("hand") appears 1 time.
Word 216 ("life") appears 1 time.
Word 217 ("liter") appears 1 time.
Word 218 ("mile") appears 1 time.
Word 219 ("near") appears 1 time.
Word 220 ("number") appears 1 time.
Word 221 ("posit") appears 1 time.
Word 222 ("previous") appears 1 time.
Word 223 ("shoulder") appears 1 time.
Word 224 ("stand") appears 1 time.
Word 225 ("store") appears 1 time.
Word 226 ("time") appears 1 time.
Word 227 ("wait") appears 1 time.


TF-IDF
===

First, calculate the inverse document frequencies for all terms in the training corpus.

In [36]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

Then, transform the count representations into the TF-IDF space.

In [37]:
corpus_tfidf = tfidf[bow_corpus]
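To make the transformation concrete, here is a small sketch of how one document's weights could be computed by hand. It assumes the weighting commonly described for gensim's defaults (raw term count times a base-2 log of `num_docs / doc_freq`, followed by scaling the vector to unit length). The helper `tfidf_vector` and the toy numbers are hypothetical.

```python
import math


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by TF-IDF and scale to unit length."""
    weights = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in bow
    ]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # Drop zero weights: a token present in every document carries no signal.
    return [(token_id, w / norm) for token_id, w in weights if w != 0]


# Toy corpus of 4 documents: token 0 occurs in 1 of them, token 1 in 2.
vector = tfidf_vector([(0, 1), (1, 2)], doc_freq={0: 1, 1: 2}, num_docs=4)
print(vector)
```

Because of the final normalization, the squared weights of each document sum to 1, which is why the values printed for `corpus_tfidf[0]` all lie between 0 and 1.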

In [38]:
from pprint import pprint
    
pprint(corpus_tfidf[0])

[(0, 0.22415795657886764),
 (1, 0.070042165222550662),
 (2, 0.048477793833333407),
 (3, 0.061799360543494372),
 (4, 0.12869334462209533),
 (5, 0.056127824314874397),
 (6, 0.16668270915926744),
 (7, 0.0873808230982311),
 (8, 0.050022327196284647),
 (9, 0.12714887024636062),
 (10, 0.11179704072384869),
 (11, 0.08415869212205758),
 (12, 0.14575986394475424),
 (13, 0.069757139624532311),
 (14, 0.086509247282035046),
 (15, 0.058769945251189872),
 (16, 0.051654732083255342),
 (17, 0.085684807285501111),
 (18, 0.084902665306432257),
 (19, 0.049267300422793432),
 (20, 0.070151807706285083),
 (21, 0.13733500586923783),
 (22, 0.070930674328125456),
 (23, 0.055450004924145452),
 (24, 0.069757139624532311),
 (25, 0.084902665306432257),
 (26, 0.14374835362629998),
 (27, 0.048903358910249224),
 (28, 0.12338539359054072),
 (29, 0.15929834929250397),
 (30, 0.063387335530188493),
 (31, 0.097663202962533779),
 (32, 0.068150557648929536),
 (33, 0.076369405645102792),
 (34, 0.15362283988416073),
 (35, 0.2

LDA using Bag of Words
===

In [39]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [40]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.033*"report" + 0.027*"waze" + 0.022*"road" + 0.021*"speed" + 0.020*"rout" + 0.017*"time" + 0.015*"think" + 0.013*"work" + 0.012*"limit" + 0.012*"map"
Topic: 1 
Words: 0.070*"waze" + 0.025*"point" + 0.025*"googl" + 0.019*"map" + 0.017*"report" + 0.015*"featur" + 0.014*"want" + 0.013*"rout" + 0.012*"drive" + 0.009*"need"
Topic: 2 
Words: 0.048*"waze" + 0.025*"drive" + 0.022*"turn" + 0.017*"android" + 0.013*"like" + 0.013*"time" + 0.011*"know" + 0.011*"go" + 0.010*"road" + 0.010*"phone"
Topic: 3 
Words: 0.104*"waze" + 0.022*"https" + 0.018*"problem" + 0.015*"googl" + 0.015*"volum" + 0.012*"forum" + 0.011*"phone" + 0.011*"map" + 0.010*"rout" + 0.010*"user"
Topic: 4 
Words: 0.050*"work" + 0.037*"waze" + 0.024*"issu" + 0.020*"time" + 0.019*"rout" + 0.013*"server" + 0.010*"home" + 0.009*"phone" + 0.009*"problem" + 0.009*"month"
Topic: 5 
Words: 0.060*"waze" + 0.021*"thank" + 0.021*"rout" + 0.013*"like" + 0.012*"want" + 0.012*"play" + 0.011*"screen" + 0.011*"traffic" + 0.011

LDA using TF-IDF
===

In [41]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [42]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.018*"thank" + 0.013*"waze" + 0.012*"rout" + 0.012*"android" + 0.011*"updat" + 0.010*"look" + 0.009*"minut" + 0.009*"turn" + 0.008*"https" + 0.007*"possibl"
Topic: 1 Word: 0.021*"waze" + 0.019*"phone" + 0.013*"issu" + 0.012*"work" + 0.012*"problem" + 0.012*"beta" + 0.011*"android" + 0.008*"carplay" + 0.008*"happen" + 0.008*"time"
Topic: 2 Word: 0.014*"waze" + 0.012*"voic" + 0.011*"need" + 0.010*"sure" + 0.008*"road" + 0.008*"chang" + 0.007*"problem" + 0.007*"know" + 0.007*"set" + 0.007*"alert"
Topic: 3 Word: 0.016*"report" + 0.012*"speed" + 0.012*"waze" + 0.011*"delet" + 0.010*"road" + 0.009*"point" + 0.009*"time" + 0.009*"rout" + 0.008*"camera" + 0.008*"limit"
Topic: 4 Word: 0.016*"updat" + 0.012*"waze" + 0.011*"fix" + 0.011*"rout" + 0.011*"phone" + 0.009*"thank" + 0.009*"version" + 0.009*"point" + 0.009*"better" + 0.008*"think"
Topic: 5 Word: 0.024*"thank" + 0.014*"remov" + 0.011*"waze" + 0.011*"imgur" + 0.010*"issu" + 0.010*"like" + 0.009*"https" + 0.009*"edit" + 0.0

Sample classification of documents to topics
===

**Using the LDA Bag of Words model**

In [43]:
processed_docs[0]

['serious',
 'waze',
 'whomev',
 'charg',
 'need',
 'fire',
 'chang',
 'lead',
 'past',
 'year',
 'whomev',
 'charg',
 'hire',
 'fire',
 'consid',
 'hire',
 'consult',
 'team',
 'solid',
 'reput',
 'come',
 'help',
 'hire',
 'compet',
 'person',
 'behalf',
 'credit',
 'credit',
 'fix',
 'want',
 'think',
 'go',
 'unnot',
 'thank',
 'get',
 'invis',
 'button',
 'control',
 'sound',
 'have',
 'invis',
 'button',
 'brainer',
 'thank',
 'nitpick',
 'world',
 'unsaf',
 'sad',
 'current',
 'usabl',
 'unsaf',
 'float',
 'instruct',
 'horizont',
 'mode',
 'import',
 'rule',
 'design',
 'element',
 'base',
 'user',
 'input',
 'form',
 'button',
 'move',
 'user',
 'move',
 'mous',
 'real',
 'world',
 'exampl',
 'mous',
 'toolbar',
 'icon',
 'magnifi',
 'nitpick',
 'exampl',
 'world',
 'unfortun',
 'safeti',
 'concern',
 'float',
 'instruct',
 'move',
 'base',
 'vehicl',
 'data',
 'intent',
 'purpos',
 'move',
 'float',
 'instruct',
 'come',
 'stop',
 'light',
 'need',
 'turn',
 'leav',
 'need',


In [45]:
for index, score in sorted(lda_model[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6332421898841858	 
Topic: 0.048*"waze" + 0.025*"drive" + 0.022*"turn" + 0.017*"android" + 0.013*"like" + 0.013*"time" + 0.011*"know" + 0.011*"go" + 0.010*"road" + 0.010*"phone"

Score: 0.21702523529529572	 
Topic: 0.048*"waze" + 0.038*"rout" + 0.030*"phone" + 0.024*"googl" + 0.022*"map" + 0.014*"like" + 0.014*"road" + 0.014*"traffic" + 0.012*"actual" + 0.012*"time"

Score: 0.10926144570112228	 
Topic: 0.028*"waze" + 0.021*"report" + 0.017*"screen" + 0.016*"googl" + 0.015*"like" + 0.015*"updat" + 0.014*"phone" + 0.014*"time" + 0.014*"appl" + 0.013*"direct"

Score: 0.03578241169452667	 
Topic: 0.067*"waze" + 0.025*"phone" + 0.024*"drive" + 0.023*"issu" + 0.020*"turn" + 0.015*"work" + 0.014*"updat" + 0.013*"editor" + 0.013*"need" + 0.012*"know"


**Using the LDA TFIDF model**

In [46]:
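# Describe topics with the TF-IDF model too; the recorded output below was printed via
# lda_model.print_topic, so its topic strings are the bag-of-words model's.
for index, score in sorted(lda_model_tfidf[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))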


Score: 0.4499287009239197	 
Topic: 0.048*"waze" + 0.025*"drive" + 0.022*"turn" + 0.017*"android" + 0.013*"like" + 0.013*"time" + 0.011*"know" + 0.011*"go" + 0.010*"road" + 0.010*"phone"

Score: 0.4364989697933197	 
Topic: 0.048*"waze" + 0.038*"rout" + 0.030*"phone" + 0.024*"googl" + 0.022*"map" + 0.014*"like" + 0.014*"road" + 0.014*"traffic" + 0.012*"actual" + 0.012*"time"

Score: 0.10810196399688721	 
Topic: 0.028*"waze" + 0.021*"report" + 0.017*"screen" + 0.016*"googl" + 0.015*"like" + 0.015*"updat" + 0.014*"phone" + 0.014*"time" + 0.014*"appl" + 0.013*"direct"


**On an unseen document**

In [47]:
unseen_document = 'Waze does not recommend my usual route. Why?'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.819949209690094	 Topic: 0.067*"waze" + 0.025*"phone" + 0.024*"drive" + 0.023*"issu" + 0.020*"turn"
Score: 0.020009329542517662	 Topic: 0.048*"waze" + 0.038*"rout" + 0.030*"phone" + 0.024*"googl" + 0.022*"map"
Score: 0.020007828250527382	 Topic: 0.104*"waze" + 0.022*"https" + 0.018*"problem" + 0.015*"googl" + 0.015*"volum"
Score: 0.020006118342280388	 Topic: 0.033*"report" + 0.027*"waze" + 0.022*"road" + 0.021*"speed" + 0.020*"rout"
Score: 0.020005986094474792	 Topic: 0.050*"work" + 0.037*"waze" + 0.024*"issu" + 0.020*"time" + 0.019*"rout"
Score: 0.02000591717660427	 Topic: 0.028*"waze" + 0.021*"report" + 0.017*"screen" + 0.016*"googl" + 0.015*"like"
Score: 0.020005403086543083	 Topic: 0.070*"waze" + 0.025*"point" + 0.025*"googl" + 0.019*"map" + 0.017*"report"
Score: 0.0200053583830595	 Topic: 0.060*"waze" + 0.021*"thank" + 0.021*"rout" + 0.013*"like" + 0.012*"want"
Score: 0.020002558827400208	 Topic: 0.048*"waze" + 0.025*"drive" + 0.022*"turn" + 0.017*"android" + 0.013*"like"
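The loops above sort `(topic_id, score)` pairs from highest to lowest score. When only the dominant topic (or the top few) is needed, that ranking can be wrapped in a small helper. The sketch below uses a hypothetical function `rank_topics` and made-up scores, not values from the models in this notebook.

```python
def rank_topics(topic_scores, top_n=None):
    """Sort (topic_id, score) pairs by descending score, optionally truncated."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Illustrative distribution over three topics.
example_scores = [(0, 0.1), (3, 0.7), (5, 0.2)]
best_topic, best_score = rank_topics(example_scores, top_n=1)[0]
print(best_topic, best_score)
```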
