In [0]:
## Import libs
import numpy as np
import pandas as pd
import os 
import random
from tqdm import tqdm
random.seed(1234)
np.random.seed(1234)  # the topic assignments below are drawn with np.random

![Alternative text…](https://github.com/Yachad/IASD_TD_1/blob/master/logo.png?raw=true)

## LDA from scratch

###### Text clustering is a widely used technique for automatically drawing out patterns from a set of documents, especially for organizing or indexing (tagging) them.
###### With the vast amount of information available on the Internet, knowledge management has become ever more important.
###### Everyone thinks about things slightly differently, so a team of information architects may argue for years over which word is the right term to represent a document. With tagging, on the other hand, users can use whatever term works for them.
###### This is now a common way (e.g. on Twitter, StackOverflow) to group relevant topics together so that they can be easily found by people with the same interests.


## Latent Dirichlet Allocation

**Latent Dirichlet Allocation** (LDA) is a probabilistic topic modeling method that gives us an approach to find out possible topics from documents that we do not know of beforehand. 

The key assumption behind LDA is that each document is a mixture of multiple topics. Given a set of documents, one can use the LDA framework to learn not only the topic mixture (distribution) that represents each document, but also the words (distribution) associated with each topic, which help us understand what the topic might be referring to. 

The topic distribution for each document is distributed as 

$$ \theta \sim Dirichlet(\alpha) $$

Where $Dirichlet(\alpha)$ denotes the Dirichlet distribution for parameter $\alpha$.

The term (word) distribution of each topic is also modeled by a Dirichlet distribution, just under a different parameter $\eta$ (pronounced "eta"; other sources refer to it as $\beta$).

$$ \phi \sim Dirichlet(\eta) $$

The ultimate goal of LDA is to estimate $\theta$ and $\phi$, which is equivalent to estimating which topics are important for a particular document and which words are important for which topic, respectively.



![Alternative text…](https://raw.githubusercontent.com/Yachad/IASD_TD_1/master/lda.png)

The basic idea behind the parameters of the Dirichlet distribution is as follows. For $\alpha$, the higher the value, the more likely each document is to contain a mixture of most of the topics instead of any single topic. The same goes for $\eta$, where a higher value means that each topic is likely to contain a mixture of most of the words rather than a few specific words.

There are different approaches to fitting this model; the one we'll be using is Gibbs sampling.
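
To make the effect of the concentration parameter concrete, here is a minimal sketch that draws topic mixtures from a symmetric Dirichlet for a few values of $\alpha$. The number of topics (`K = 5`), the sample size and the seed are assumptions chosen only for this illustration; they are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # hypothetical number of topics, for illustration only

mean_max = {}
for alpha in (0.1, 1.0, 10.0):
    # each row is one sampled topic mixture theta (non-negative, sums to 1)
    theta = rng.dirichlet([alpha] * K, size=1000)
    # average share of the dominant topic: close to 1 means "one topic per document"
    mean_max[alpha] = theta.max(axis=1).mean()
    print(f"alpha={alpha:>4}: mean share of the largest topic = {mean_max[alpha]:.2f}")
```

A small $\alpha$ should concentrate most of the mass on a single topic, while a large $\alpha$ should spread it almost evenly across the topics.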

Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers), and is an alternative to deterministic algorithms for statistical inference such as the expectation-maximization algorithm (EM).

We'll use 8 short strings to represent our set of documents. 
The following section creates the set of documents and converts each document into word ids, where a word id is simply the index assigned to each unique word in the set of documents. We skip stemming, as this is a fairly simple set of documents.

In [0]:
rawdocs = ['eat turkey on turkey day holiday',
          'i like to eat cake on holiday',
          'turkey trot race on thanksgiving holiday',
          "snail race the turtle",
          'time travel space race',
          'movie on thanksgiving',
          'movie at air and space museum is cool movie',
          'aspiring movie star'
]



docs = [rawdoc.split(' ') for rawdoc in rawdocs]


# unique words
vocab = list(set(sum(docs, [])))
vocab.sort()
vocab = np.array(vocab)


# replace words in documents with wordIDs
docs_index = [[np.argwhere(vocab == word).item() for word in doc] for doc in docs] 

In [4]:
docs

[['eat', 'turkey', 'on', 'turkey', 'day', 'holiday'],
 ['i', 'like', 'to', 'eat', 'cake', 'on', 'holiday'],
 ['turkey', 'trot', 'race', 'on', 'thanksgiving', 'holiday'],
 ['snail', 'race', 'the', 'turtle'],
 ['time', 'travel', 'space', 'race'],
 ['movie', 'on', 'thanksgiving'],
 ['movie', 'at', 'air', 'and', 'space', 'museum', 'is', 'cool', 'movie'],
 ['aspiring', 'movie', 'star']]

In [5]:
docs_index

[[7, 25, 14, 25, 6, 8],
 [9, 11, 22, 7, 4, 14, 8],
 [25, 24, 15, 14, 19, 8],
 [16, 15, 20, 26],
 [21, 23, 17, 15],
 [12, 14, 19],
 [12, 3, 0, 1, 17, 13, 10, 5, 12],
 [2, 12, 18]]

In [6]:
vocab

array(['air', 'and', 'aspiring', 'at', 'cake', 'cool', 'day', 'eat',
       'holiday', 'i', 'is', 'like', 'movie', 'museum', 'on', 'race',
       'snail', 'space', 'star', 'thanksgiving', 'the', 'time', 'to',
       'travel', 'trot', 'turkey', 'turtle'], dtype='<U12')

A slight drawback of Latent Dirichlet Allocation is that you have to specify the number of clusters first. In other words, you have to specify upfront the number of topics (denoted by K) into which you wish to group the set of documents. In our case we'll use 2.

The first step of the algorithm is to go through each document and randomly assign each word in the document to one of the K topics. Apart from generating this **topic assignment list**, we'll also create a **word-topic matrix**, which holds the count of each word being assigned to each topic, and a **document-topic matrix**, which holds the number of words assigned to each topic for each document (the distribution of the topic assignment list). We'll be using the latter two matrices throughout the algorithm.


In [0]:
# cluster number 
K = 2

# %%
# initialize count matrices 
# @wt = word-topic matrix 
wt = pd.DataFrame(data = np.zeros(shape = [K, len(vocab)]),
                  columns= vocab,
                  index= np.arange(1, K+1))


# @ta : topic assignment list
ta = [[np.random.randint(1, high=K+1, size=1).item() for word in doc] for doc in docs]

# %%
# @dt : counts correspond to the number of words assigned to each topic for each document
dt = np.zeros([len(docs), K])


for index_doc, doc in enumerate(docs):
    # randomly assign topic to word w
    for index_word,  word in enumerate(doc):
        ta[index_doc][index_word] = np.random.randint(1, high=K+1, size=1).item()
        
        # extract the topic index, word id and update the corresponding cell
        # in the word-topic matrix
        topic = ta[index_doc][index_word]
        wt.loc[topic, word] +=  1

    # count words in document d assigned to each topic t
    for t in np.arange(1, K+1) :
        dt[index_doc,t-1] =  np.sum((np.array(ta[index_doc]) == t) * 1)

In [8]:
wt

topic,air,and,aspiring,at,cake,cool,day,eat,holiday,i,is,like,movie,museum,on,race,snail,space,star,thanksgiving,the,time,to,travel,trot,turkey,turtle
1,0.0,0.0,1.0,1.0,1.0,1.0,0.0,2.0,3.0,1.0,0.0,1.0,1.0,0.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,2.0,0.0
2,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0


In [9]:
ta

[[1, 1, 2, 2, 2, 1],
 [1, 1, 2, 1, 1, 1, 1],
 [1, 1, 2, 1, 1, 1],
 [2, 1, 2, 2],
 [1, 2, 1, 2],
 [1, 2, 2],
 [2, 1, 2, 2, 2, 2, 2, 1, 2],
 [1, 2, 2]]

In [10]:
dt

array([[3., 3.],
       [6., 1.],
       [5., 1.],
       [1., 3.],
       [2., 2.],
       [1., 2.],
       [2., 7.],
       [1., 2.]])

Notice that this random assignment already gives you both the topic representations of all the documents and word distributions of all the topics, albeit not very good ones. 

So to improve them, we'll employ the Gibbs sampling method, which performs the following steps for a user-specified number of iterations: 

For each document d, go through each word w (a double for loop). Reassign a new topic to w, where we choose topic t with probability proportional to the probability of word w given topic t $\times$ the probability of topic t given document d, as expressed by the following formula:

$$ P( z_i = j \text{ }| \text{ } z_{-i}, w_i, d_i ) 
    \propto  \frac{ C^{WT}_{w_ij} + \eta }{ \sum^W_{ w = 1 }C^{WT}_{wj} + W\eta } \times
      \frac{ C^{DT}_{d_ij} + \alpha }{ \sum^T_{ t = 1 }C^{DT}_{d_it} + T\alpha }
$$


Let's try and break that down piece by piece.

Starting from the left side of the proportionality sign:

- **$P(z_i = j)$ :** The probability that token i is assigned to topic j.
- **$z_{-i}$ :** Represents topic assignments of all other tokens.
- **$w_i$ :** Word (index) of the $i$-th token.
- **$d_i$ :** Document containing the $i$-th token.

For the right side of the proportionality :

- **$C^{WT}$ :** Word-topic matrix, the `wt` matrix we generated.
- **$\sum^W_{ w = 1 }C^{WT}_{wj}$ :** Total number of tokens (words) assigned to topic $j$.
- **$C^{DT}$ :** Document-topic matrix, the `dt` matrix we generated.
- **$\sum^T_{ t = 1 }C^{DT}_{d_it}$ :** Total number of tokens (words) in document $d_i$.
- **$\eta$ :** Parameter that sets the topic distribution for the words; the higher it is, the more spread out the words will be across the specified number of topics (K). 
- **$\alpha$ :** Parameter that sets the topic distribution for the documents; the higher it is, the more spread out the documents will be across the specified number of topics (K).
- **$W$ :** Number of unique words (the vocabulary size) in the set of documents. 
- **$T$ :** Number of topics, equivalent of the K we defined earlier. 

All of that notation may still be confusing, so the following section goes through the computation for one iteration, in which the topic of the first word in the first document is resampled. The output will not be printed during the process, since it would probably make the documentation messier.

LDA Gibbs Algorithm 


#### for each iteration $i $ :
#### &nbsp;&nbsp;  for each document  $d_i  \in \mathcal D $ :
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  for each word  $w_i  \in d_i $ :
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; $z_{old}$ : topic assigned to $w_i$ 
#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.&nbsp;&nbsp;&nbsp; Decrement &nbsp;&nbsp;   $dt_{d_i, z_{old}}$, $wt_{z_{old}, w_i}$ 

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.&nbsp;&nbsp;&nbsp; Sample $z_{new}$ &nbsp;&nbsp; &nbsp;&nbsp; from $ P( z_i = j \text{ }| \text{ } z_{-i}, w_i, d_i ) $ 

#### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.&nbsp;&nbsp;&nbsp; Increment &nbsp;&nbsp;   $dt_{d_i, z_{new}}$, $wt_{z_{new}, w_i}$ 

Gibbs sampling one iteration 

In [0]:
## 1. Randomly assign topics to
alpha = 1
eta = 1


# initial topics assigned to the first word of the first document 
# and its corresponding word id 

t0 = ta[0][0]
word_id = docs_index[0][0]

# Z_-i means that we do not include token w in our word-topic and document-topic
# count matrix when sampling for token w,
# only leave the topic assignments of all other tokens for document 1

## ! decrement dt for the first document and topic t0 
..
## ! decrement wt for the  topic t0  and word_id
..

# calculate the left and right factors of the proportionality

### ! calculate the left factor (word given topic)
left = ..

### ! calculate the right factor (topic given document) for the first document
right = ..

# transform the proportionality to a probability
prob_topic = (left * right) /(left * right).sum()

# %%
# draw new topic for the first word in the first document 
### ! draw a new topic with the probability prob_topic
new_t0 = ..

# refresh the dt and wt with the newly assigned topic 
###! refresh the value of the topic in ta for the first doc and the first word 
...

## ! increment dt for the first document and topic new_t0 
..
## ! increment wt for the  topic new_t0   and word_id
..
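
One possible way to carry out the step outlined above is sketched below. It is not the reference solution of this notebook: it assumes plain NumPy arrays for `wt` (K × W) and `dt` (D × K) instead of the pandas `DataFrame`, a `word2id` dictionary for the id lookup, and a seeded generator `rng`, all of which are choices made for this illustration. The topic assignments are copied from the `ta` output printed earlier, with `alpha = eta = 1`.

```python
import numpy as np

rawdocs = ['eat turkey on turkey day holiday', 'i like to eat cake on holiday',
           'turkey trot race on thanksgiving holiday', 'snail race the turtle',
           'time travel space race', 'movie on thanksgiving',
           'movie at air and space museum is cool movie', 'aspiring movie star']
docs = [d.split(' ') for d in rawdocs]
vocab = sorted({w for doc in docs for w in doc})
word2id = {w: i for i, w in enumerate(vocab)}      # assumed helper for id lookup
docs_index = [[word2id[w] for w in doc] for doc in docs]

K, alpha, eta = 2, 1, 1
# topic assignments copied from the `ta` output above (topics are 1-based)
ta = [[1, 1, 2, 2, 2, 1], [1, 1, 2, 1, 1, 1, 1], [1, 1, 2, 1, 1, 1], [2, 1, 2, 2],
      [1, 2, 1, 2], [1, 2, 2], [2, 1, 2, 2, 2, 2, 2, 1, 2], [1, 2, 2]]

# rebuild the count matrices from the assignments
wt = np.zeros((K, len(vocab)))
dt = np.zeros((len(docs), K))
for d, doc in enumerate(docs_index):
    for n, w in enumerate(doc):
        wt[ta[d][n] - 1, w] += 1
        dt[d, ta[d][n] - 1] += 1

t0, word_id = ta[0][0], docs_index[0][0]

# z_-i: remove the current token from both count matrices
dt[0, t0 - 1] -= 1
wt[t0 - 1, word_id] -= 1

# word-given-topic factor and topic-given-document factor, one entry per topic
left = (wt[:, word_id] + eta) / (wt.sum(axis=1) + len(vocab) * eta)
right = (dt[0] + alpha) / (dt[0].sum() + K * alpha)
prob_topic = (left * right) / (left * right).sum()

# draw the new topic and put the token back under it
rng = np.random.default_rng(1234)
new_t0 = int(rng.choice(np.arange(1, K + 1), p=prob_topic))
ta[0][0] = new_t0
dt[0, new_t0 - 1] += 1
wt[new_t0 - 1, word_id] += 1
```

With these counts the two unnormalized products are $\frac{2}{47}\cdot\frac{3}{7}$ and $\frac{1}{48}\cdot\frac{4}{7}$, so the token is reassigned to topic 1 with probability of roughly 0.605 and to topic 2 with probability of roughly 0.395.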

After the first iteration, the topic for the first word in the first document is updated to `new_t0`. Hopefully, that clears up the confusion around all those mathematical notations. We can now apply the whole thing for a user-specified number of iterations. Just remember that after drawing the new topic we also have to update the topic assignment list with the newly sampled topic for token w, and re-increment the word-topic and document-topic count matrices with the newly sampled topic for token w.

To conserve space, we'll put all of it into a function `lda_scratch`, which takes the following parameters:

- `docs` Tokenized documents (lists of words); they are converted to token (word) ids inside the function.
- `vocab` Unique tokens (words) across the whole document collection.
- `K` Number of topic groups.
- `alpha` and `eta` Distribution parameters as explained earlier.
- `iterations` Number of iterations to run Gibbs sampling to train our model.
- Returns a tuple containing the final word-topic count matrix `wt` and document-topic matrix `dt`.

In [0]:
def lda_scratch(docs, vocab, K, alpha, eta, iterations):

    # replace words in documents with wordIDs
    docs_index = [[np.argwhere(vocab == word).item() for word in doc] for doc in docs] 

    #initialize count matrices
    # @wt : word-topic matrix
    wt = 

    # @ta : topic assignment list
    ta = 

    # @dt : counts correspond to the number of words assigned to each topic for each document
    dt = 

    for index_doc, doc in enumerate(docs):
    # randomly assign topic to word w
        for index_word,  word in enumerate(doc):
            
            # extract the topic index, word id and update the corresponding cell
            # in the word-topic matrix
            topic = 
            wt.loc[topic, word] =  

            # count words in document d assigned to each topic t
            for t in np.arange(1, K+1) :
                dt[index_doc,t-1] =  


    # for each pass through the corpus

    for i in tqdm(np.arange(iterations)):

        # for each document 
        for id_doc in np.arange(len(docs)):
            
            # for each id_word in id_doc
            for id_word in np.arange(len(docs[id_doc])):

                # current topic assigned to this token
                # and its corresponding word id 

                ti = 
                word_id = 

                dt[id_doc, ti-1] = 
                wt.iloc[ti-1, word_id] = 

                left = 
                right =


                # transform the proportionality to a probability
                prob_topic = 

                # draw new topic for id_word in the id_doc document 
                new_ti = 

                # refresh the dt and wt with the newly assigned topic 
                ta[id_doc][id_word] = 
                dt[id_doc, new_ti-1] = 
                wt.iloc[new_ti-1, word_id] = 
    
    return wt, dt 
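
For comparison, a compact sketch of the whole sampler is given below. It is one possible implementation under stated assumptions, not the solution expected for `lda_scratch`: the name `lda_gibbs_sketch`, the `seed` argument and the `word2id` lookup are hypothetical, topics are numbered from 0 instead of 1, and the counts are kept in NumPy arrays, which should also run faster than cell-by-cell updates on a `DataFrame`.

```python
import numpy as np

def lda_gibbs_sketch(docs, vocab, K, alpha, eta, iterations, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the count matrices (wt, dt)."""
    rng = np.random.default_rng(seed)
    word2id = {w: i for i, w in enumerate(vocab)}
    docs_index = [[word2id[w] for w in doc] for doc in docs]
    W = len(vocab)

    wt = np.zeros((K, W))            # word-topic counts
    dt = np.zeros((len(docs), K))    # document-topic counts
    # random initial topic (0-based) for every token
    ta = [rng.integers(0, K, size=len(doc)) for doc in docs_index]
    for d, doc in enumerate(docs_index):
        for n, w in enumerate(doc):
            wt[ta[d][n], w] += 1
            dt[d, ta[d][n]] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs_index):
            for n, w in enumerate(doc):
                t = ta[d][n]
                # remove the token from the counts (z_-i)
                dt[d, t] -= 1
                wt[t, w] -= 1
                # conditional distribution over topics for this token
                left = (wt[:, w] + eta) / (wt.sum(axis=1) + W * eta)
                right = (dt[d] + alpha) / (dt[d].sum() + K * alpha)
                p = left * right
                t_new = rng.choice(K, p=p / p.sum())
                # add the token back under its new topic
                ta[d][n] = t_new
                dt[d, t_new] += 1
                wt[t_new, w] += 1
    return wt, dt

rawdocs = ['eat turkey on turkey day holiday', 'i like to eat cake on holiday',
           'turkey trot race on thanksgiving holiday', 'snail race the turtle',
           'time travel space race', 'movie on thanksgiving',
           'movie at air and space museum is cool movie', 'aspiring movie star']
docs = [d.split(' ') for d in rawdocs]
vocab = sorted({w for doc in docs for w in doc})
wt, dt = lda_gibbs_sketch(docs, vocab, K=2, alpha=1, eta=0.001, iterations=200)
```

Because every token is removed and re-added exactly once per visit, the totals of `wt` and `dt` stay equal to the number of tokens in the corpus, which is a cheap sanity check on any implementation.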

In [0]:
K=2
alpha = 1
eta = 0.001
iterations = 1000

# %%
wt , dt = lda_scratch(docs, vocab, K, alpha, eta, iterations)

100%|██████████| 1000/1000 [01:31<00:00, 10.88it/s]


After we're done learning the topics for `iterations` iterations, we can use the count matrices to obtain the word-topic distribution and document-topic distribution.

To compute the probability of word given topic:

$$\phi_{ij} = \frac{C^{WT}_{ij} + \eta}{\sum^W_{ k = 1 }C^{WT}_{kj} + W\eta}$$

Where $\phi_{ij}$ is the probability of word i for topic j.

In [0]:
# topic probability of every word

#! calculate the phi value
phi_values = 
phi = pd.DataFrame(data= phi_values,
                   columns= vocab,
                   index= np.arange(1, K+1))
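
As a hedged illustration of the $\phi$ formula, the sketch below applies it to a small made-up word-topic count matrix. The three-word vocabulary and the counts are hypothetical values, not the ones produced by the sampler above.

```python
import numpy as np
import pandas as pd

# toy word-topic counts (hypothetical): 2 topics x 3 words
vocab = np.array(['cake', 'movie', 'turkey'])
wt = pd.DataFrame([[2., 0., 3.],
                   [0., 4., 1.]], columns=vocab, index=[1, 2])
eta = 0.001

counts = wt.values
# smoothed counts divided by the smoothed total of each topic (row)
phi_values = (counts + eta) / (counts.sum(axis=1, keepdims=True) + len(vocab) * eta)
phi = pd.DataFrame(phi_values, columns=vocab, index=wt.index)
print(phi)
```

Each row of `phi` is a probability distribution over the vocabulary, so it should sum to 1.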

In [0]:
phi

topic,air,and,aspiring,at,cake,cool,day,eat,holiday,i,is,like,movie,museum,on,race,snail,space,star,thanksgiving,the,time,to,travel,trot,turkey,turtle
1,4.8e-05,4.8e-05,4.8e-05,4.8e-05,0.047605,0.047605,0.047605,0.095163,0.142721,0.047605,4.8e-05,0.047605,4.8e-05,4.8e-05,0.190279,4.8e-05,4.8e-05,4.8e-05,4.8e-05,0.095163,4.8e-05,4.8e-05,4.8e-05,0.047605,0.047605,0.142721,4.8e-05
2,0.047605,0.047605,0.047605,0.047605,4.8e-05,4.8e-05,4.8e-05,4.8e-05,4.8e-05,4.8e-05,0.047605,4.8e-05,0.190279,0.047605,4.8e-05,0.142721,0.047605,0.095163,0.047605,4.8e-05,0.047605,0.047605,0.047605,4.8e-05,4.8e-05,4.8e-05,0.047605


$$\theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum^T_{ k = 1 }C^{DT}_{dk} + T\alpha}$$

Where $\theta_{dj}$ is the proportion of topic j in document d.


In [0]:
# topic probability of every document
## calculate the theta value
theta = 
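
The $\theta$ formula can be sketched the same way. The document-topic counts below are hypothetical and only illustrate the arithmetic, with `alpha = 1` and two topics.

```python
import numpy as np

# toy document-topic counts (hypothetical): 2 documents x 2 topics
dt = np.array([[6., 0.],
               [1., 3.]])
alpha = 1
K = dt.shape[1]

# smoothed counts divided by the smoothed length of each document (row)
theta = (dt + alpha) / (dt.sum(axis=1, keepdims=True) + K * alpha)
print(theta)
```

A document with six tokens all assigned to the first topic gets $\theta = (7/8, 1/8)$: the prior $\alpha$ keeps the second topic from dropping to exactly zero.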

In [0]:
theta

array([[0.875     , 0.125     ],
       [0.77777778, 0.22222222],
       [0.75      , 0.25      ],
       [0.16666667, 0.83333333],
       [0.33333333, 0.66666667],
       [0.6       , 0.4       ],
       [0.18181818, 0.81818182],
       [0.2       , 0.8       ]])

Recall that LDA assumes that each document is a mixture of all topics. After computing the probability that each document belongs to each topic (and likewise for words and topics), we can use this information to see which topic each document belongs to and which words are most likely associated with each topic.

In [0]:
# topic assigned to each document, the one with the highest probability
topics = np.array([np.argmax(doc_prob) + 1 for doc_prob in theta])

In [0]:
# possible words under each topic
# sort the probability and obtain the user-specified number n
def find_top_term (phi, n):
    top_term = {}
    for topic in phi.index.values :
        top_term[str(topic)] = phi.loc[topic].sort_values(ascending=False)[:n]

    return top_term

top_terms = find_top_term(phi, 3)

We specified that we wanted to see the top 3 terms associated with each topic. The following section prints out the original raw documents, grouped into the `K` groups that we specified, along with the words that are likely to go along with each topic.

In [0]:
### Topic 
topic = 1
for i, rawdoc in enumerate(rawdocs):
    if topics[i] == topic:
        print(rawdoc)
        
top_terms[str(topic)]

eat turkey on turkey day holiday
i like to eat cake on holiday
turkey trot race on thanksgiving holiday
movie on thanksgiving


on         0.190279
turkey     0.142721
holiday    0.142721
Name: 1, dtype: float64

In [0]:
### Topic 
topic = 2
for i, rawdoc in enumerate(rawdocs):
    if topics[i] == topic:
        print(rawdoc)
        
top_terms[str(topic)]

snail race the turtle
time travel space race
movie at air and space museum is cool movie
aspiring movie star


movie    0.190279
race     0.142721
space    0.095163
Name: 2, dtype: float64

The output tells us that the first topic seems to be about turkey and holiday, while the second is about movie and race. 

### Yelp review

In this part of the tutorial, we will take a real example, the 'Yelp reviews' dataset, and use LDA to extract the naturally discussed topics.

In [0]:
import nltk; nltk.download('stopwords')
#!pip3 install  spacy download en

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyldavis
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

Collecting pyldavis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)

In [0]:
link = 'https://raw.githubusercontent.com/Yachad/IASD_TD_1/master/yelp.csv'

In [0]:
df = pd.read_csv(link)

In [0]:
# Convert to list
data = df.text.values.tolist()

# Remove Emails (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse new lines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data[:1])

['My wife took me here on my birthday for breakfast and it was excellent. The '
 'weather was perfect which made sitting outside overlooking their grounds an '
 'absolute pleasure. Our waitress was excellent and our food arrived quickly '
 'on the semi-busy Saturday morning. It looked like the place fills up pretty '
 'quickly so the earlier you get here the better. Do yourself a favor and get '
 'their Bloody Mary. It was phenomenal and simply the best Ive ever had. Im '
 'pretty sure they only use ingredients from their garden and blend them fresh '
 'when you order it. It was amazing. While EVERYTHING on the menu looks '
 'excellent, I had the white truffle scrambled eggs vegetable skillet and it '
 'was tasty and delicious. It came with 2 pieces of their griddled bread with '
 'was amazing and it absolutely made the meal complete. It was the best '
 '"toast" Ive ever had. Anyway, I cant wait to go back!']


In [0]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [0]:
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; the tokenizer itself already drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])

[['my', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'the', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi', 'busy', 'saturday', 'morning', 'it', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'do', 'yourself', 'favor', 'and', 'get', 'their', 'bloody', 'mary', 'it', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'ive', 'ever', 'had', 'im', 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'it', 'was', 'amazing', 'while', 'everything', 'on', 'the', 'menu', 'looks', 'excellent', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', '

In [0]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])



['my', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'the', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi', 'busy', 'saturday', 'morning', 'it', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'do_yourself_favor', 'and', 'get', 'their', 'bloody_mary', 'it', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'ive', 'ever', 'had', 'im', 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'it', 'was', 'amazing', 'while', 'everything', 'on', 'the', 'menu', 'looks', 'excellent', 'had', 'the', 'white', 'truffle', 'scrambled_eggs', 'vegetable', 'skillet', 'and', 'it', '

In [0]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [0]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
# (on spaCy 3+ the 'en' shortcut no longer exists: download and load 'en_core_web_sm' instead)
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['wife', 'take', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'make', 'sit', 'overlooking', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrive', 'quickly', 'semi', 'busy', 'saturday', 'morning', 'look', 'place', 'fill', 'pretty', 'quickly', 'earlier', 'get', 'well', 'favor', 'get', 'bloody_mary', 'phenomenal', 'simply', 'good', 'have', 'ever', 'be', 'pretty', 'sure', 'ingredient', 'garden', 'blend', 'fresh', 'order', 'amazing', 'everything', 'menu', 'look', 'excellent', 'white', 'truffle', 'scrambled_eggs', 'vegetable', 'skillet', 'tasty', 'delicious', 'come', 'piece', 'griddle', 'bread', 'amazing', 'absolutely', 'make', 'meal', 'complete', 'good', 'toast', 'have', 'ever', 'anyway', 'not', 'wait', 'go', 'back']]


In [0]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 3), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 2), (28, 1), (29, 1), (30, 2), (31, 1), (32, 2), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1)]]


In [0]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('absolute', 1),
  ('absolutely', 1),
  ('amazing', 2),
  ('anyway', 1),
  ('arrive', 1),
  ('back', 1),
  ('be', 1),
  ('birthday', 1),
  ('blend', 1),
  ('bloody_mary', 1),
  ('bread', 1),
  ('breakfast', 1),
  ('busy', 1),
  ('come', 1),
  ('complete', 1),
  ('delicious', 1),
  ('earlier', 1),
  ('ever', 2),
  ('everything', 1),
  ('excellent', 3),
  ('favor', 1),
  ('fill', 1),
  ('food', 1),
  ('fresh', 1),
  ('garden', 1),
  ('get', 2),
  ('go', 1),
  ('good', 2),
  ('griddle', 1),
  ('ground', 1),
  ('have', 2),
  ('ingredient', 1),
  ('look', 2),
  ('make', 2),
  ('meal', 1),
  ('menu', 1),
  ('morning', 1),
  ('not', 1),
  ('order', 1),
  ('overlooking', 1),
  ('perfect', 1),
  ('phenomenal', 1),
  ('piece', 1),
  ('place', 1),
  ('pleasure', 1),
  ('pretty', 2),
  ('quickly', 2),
  ('saturday', 1),
  ('scrambled_eggs', 1),
  ('semi', 1),
  ('simply', 1),
  ('sit', 1),
  ('skillet', 1),
  ('sure', 1),
  ('take', 1),
  ('tasty', 1),
  ('toast', 1),
  ('truffle', 1),
  ('veget

In [0]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=8, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [0]:
# Print the Keyword in the 8 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.032*"dog" + 0.029*"room" + 0.024*"stay" + 0.016*"hotel" + 0.015*"pool" + '
  '0.014*"cake" + 0.014*"park" + 0.013*"parking" + 0.012*"mall" + '
  '0.012*"cookie"'),
 (1,
  '0.029*"not" + 0.022*"go" + 0.021*"get" + 0.016*"be" + 0.016*"do" + '
  '0.016*"good" + 0.015*"time" + 0.011*"come" + 0.011*"make" + 0.011*"food"'),
 (2,
  '0.031*"location" + 0.016*"new" + 0.013*"local" + 0.012*"spend" + '
  '0.010*"trip" + 0.009*"year" + 0.009*"smell" + 0.008*"pho" + 0.008*"book" + '
  '0.008*"include"'),
 (3,
  '0.043*"great" + 0.039*"place" + 0.028*"food" + 0.027*"good" + 0.025*"love" '
  '+ 0.015*"always" + 0.014*"bar" + 0.014*"drink" + 0.013*"friendly" + '
  '0.013*"night"'),
 (4,
  '0.032*"course" + 0.016*"able" + 0.010*"office" + 0.010*"step" + '
  '0.010*"play" + 0.007*"card" + 0.007*"foot" + 0.007*"color" + 0.007*"road" + '
  '0.007*"cash"'),
 (5,
  '0.039*"store" + 0.012*"car" + 0.012*"class" + 0.009*"young" + '
  '0.007*"cocktail" + 0.007*"wear" + 0.006*"dress" + 0.006*"movie" + 

### Evaluate topic model 

Two metrics are commonly used to evaluate a topic model fitted on a set of documents:

1. **Perplexity** :
    - Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of k, you estimate the LDA model.
    - Then, given the theoretical word distributions represented by the topics, you compare them to the actual topic mixtures, or distribution of words, in your documents. 
    - However, the statistic is somewhat meaningless on its own. Its benefit comes from comparing perplexity across different models with varying k. The model with the lowest perplexity is generally considered the “best”.
    - ! Optimizing for perplexity may not yield human interpretable topics ! 



2. **Coherence** :
    - Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.
    - C_v is one variant of topic coherence. It is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.

In [0]:
# Compute Perplexity
# log_perplexity returns the per-word likelihood bound (higher is better); perplexity = 2**(-bound), lower is better
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.789951700286562

Coherence Score:  0.3424798013925847
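
The value labelled "Perplexity" above is negative because gensim's `log_perplexity` returns a per-word likelihood bound, not the perplexity itself. A minimal sketch of the conversion, using the bound copied from the output above (the variable names are chosen for this illustration):

```python
# per-word likelihood bound, copied from the printed output above
bound = -7.789951700286562

# gensim reports perplexity as 2 ** (-bound); lower perplexity is better
perplexity = 2 ** (-bound)
print(f"perplexity = {perplexity:.1f}")
```

When comparing models with different numbers of topics, it is this converted value (or equivalently the bound, with the opposite sign convention) that should be compared.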


In [0]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

NameError: ignored