# Topic Clustering Using LDA

The purpose of this tutorial is to give you a basic understanding of Latent Dirichlet Allocation (LDA) from http://ai.stanford.edu/~ang/papers/jair03-lda.pdf and use it to implement a simple downscaled topic clustering on the newsgroup dataset from scikit-learn.

Section 1 will give some background to the mechanics and theory behind LDA. Section 2 will then tackle the task of implementing LDA to infer topics in documents based on their content. This section will provide you with skeleton code already written in Python 3 using the numpy, scipy, and scikit-learn libraries. 

If you do not have jupyter notebook installed then you probably aren't reading this, but see http://jupyter.readthedocs.io/en/latest/install.html

If you do not have a python 3 kernel installed for jupyter notebook see https://ipython.readthedocs.io/en/latest/install/kernel_install.html or https://stackoverflow.com/questions/28831854/how-do-i-add-python3-kernel-to-jupyter-ipython

If you do not have some of the libraries installed for your python 3 kernel, use the "Kernel -> Conda packages" dropdown menu in Jupyter if you used anaconda for your python 3 kernel, if not use the normal pip install commands.

## 1: Theory

### Background and terminology

Since we will be working in the setting of text corpora, we should clarify some of the terminology used in this setting:
<ul>
<li>A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, . . . ,V}. 
<li> A document is a sequence of <i>N</i> words denoted by <b>w</b>=(<i>w</i><sub>1</sub>,<i>w</i><sub>2</sub>, . . . ,<i>w</i><sub>N</sub>), where <i>w</i><sub>n</sub> is the <i>n</i>th word in the sequence.</li>
<li>A corpus is a collection of <i>M</i> documents denoted by $D$ ={<b>w</b><sub>1</sub>,<b>w</b><sub>2</sub>, . . . ,<b>w</b><sub>M</sub>}</li>
</ul>

It is important to note that LDA works in other domains besides text collections, but this is the setting in which we will use it.

LDA is a generative probabilistic model that is used for modeling collections of discrete data. In our application we will be using it to model text corpora, or more specifically news group e-mails. The purpose of the model is to give us compact representations of the data in these collections, allowing us to process large collections while still retaining sufficient information to be able to perform for example classification and relevance measures. 

There have been several solutions for this type of information retrieval problem, such as the tf-idf (term frequency - inverse document frequency) scheme by Salton and McGill, 1983. This approach produces a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus. This representation however did not provide significantly shortened representation of the corpora, or represent the inter- or intra- document statistics in a intuitive way. A step forward from this was given by LSI (latent semantic indexing) where singular value decomposition was used on the matrix X to offer a more compact representation. The authors of the method also argued that since the LSI features are linnear combinations of the basic tf-idf features, they incorporate some linguistical notions such as synonomy and polysemy.
The first step to providing a generative model was the <i>probabilistic</i> LSI (pLSI), which uses mixture models to model each word in a document. The mixture components are the "topics" and represented as multinomial random variables, allowing different words in the document to be genereated by different topics. The compact representation for each document is then the list of numbers representing the mixing proportions for the fixed set of topics. The method however gives no generative probabilistic model for getting these numbers, causing the number of parameters in the model to grow linearly with the corpus size. Also, since there is no probabilistic model for the mixture components that represent a document, there is no clear way of assigning a probability to a document that is outside the training set.

Both LSI and pLSI use the "bag-of-words" approach which assumes exchangeability within the words of the document as well as the documents themselves, meaning their order is of no importance. A theorem due to de Finetti (1990) states that any collection of exchangeable random variables has a representation as a mixture distribution—in general an infinite mixture. This means we must consider mixture models that capture the exchangeability of both documents and words if we wish to achieve exchangeable representations for them. It is this line of thinking that leads to LDA.




### Theory Behind LDA

As mentioned earlier, LDA is a generative probabilistic model for a corpus. It can be seen as a hierarchical Bayesian model with three levels: each document in a corpus is modeled as a finite random mixture over a latent set of topics, and each of these topics are characterized by a distribution of words. A graphical model for LDA using plate notation can be seen below:
![title](imgs/LDAPlateGM.png)
From here we can see the three levels of the model. $\alpha$ and $\beta$ and corpus level parameters, $\theta$ is a document level parameter for the M documents in the corpus, and $z$ and $w$ are word level parameters for the N words in a document.

The generative process according to LDA for each document <b>w</b> is then:
<ol>
<li>Choose N ∼Poisson(ξ)</li>
<li>Choose $\theta$∼Dir($\alpha$)</li>
<li>For each of the N words w<sub>n</sub>:
<ol type="a">
    <li>Choose a topic z<sub>n</sub> ∼Multinomial($\theta$).</li>
    <li>Choose a word w<sub>n</sub> from p(w<sub>n</sub> |z<sub>n</sub>,$\beta$), a multinomial probability conditioned on the topic z<sub>n</sub>.</li>
    </ol></li>
</ol>

There are however some simplifications to these steps that we will utilize. First, we assume that the dimensionality of the Dirichlet distribution, and therefore the dimensionality for the topic variable $z$ is known and fixed, meaning we assume a fixed known number of topics, $k$. Furthermore, the probabilities for words ($w$) are parameterized by a $k \times V$ matrix $\beta$ which defines $p(w^j = 1| z^i = 1) = \beta_{i,j}$, that we will estimate later and keep fixed. We also note that $N$ is independant of the other data generating variables $\theta$ and <b>z</b> so we will ignore the Poisson assumption and set it to a known fixed value (the length of the document).  

#### Dirichlet Distribution in LDA

The probability distribution for a $k$-dimensional Dirichlet random variable $\theta$ is defined as follows: 

<b>Eq. 1:</b>
![Eq 1](imgs/LDAEq1.png "Eq 1")


where $\alpha$ is $k$-dimensional with all elements larger than 0 and $\Gamma(x)$ is the Gamma function. The Dirichlet distribution has some advantegous advantageous qualities; it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution. These properties help us in running variational inference for the parameters later.

We can now express the joint distribution of a topic mixture $\theta$, a set of $N$ topics <b>z</b>, and
a set of $N$ words <b>w</b> given the corpus level parameters $\alpha,\beta$ as:

<b>Eq. 2:</b>
![Eq. 2](imgs/LDAEq2.png)
where the probability $p(z_n |\theta)$ is simply $\theta_i$ for the unique $i$ such that $z^i_n=1$. We can then obtain the marginal distribution over a document by integrating over $\theta$ and summing over $z$:

<b>Eq. 3:</b>
![Eq. 3](imgs/LDAEq3.png)


#### Comparison to other Latent Variable Models
In order to get feeling for how LDA works and what highlights its strengths, it can be helpful to relate it to other related models:

  a) Unigram Model

  b) Mixture of Unigrams Model

  c) pLSI Model
  


We will begin by examing the absolute simplest model, the unigram model: 

![Eq. 3](imgs/UniGramMdl.png)

This method has no latent variables and instead states that each word in a document is independantly drawn from a single multinomial distribution as seen here:

![Eq. 3](imgs/UniGramEq.png)


A slighly more complex model is the mixture of unigrams:

![Eq. 3](imgs/MixUniGramMdl.png)

This model incorporates a discrete latent topic variable, $z$. Here, each document <b>w</b> is generated by first sampling the topic variable $z$, and then generating all words from a conditional probability on that choice:

![Eq. 3](imgs/MixUniGramEq.png)

This effectively limits the modeling of words in a document to only being representative of one topic. The LDA model on the other hand allows for documents to exhibit multiple topics with different mixtures.

Finally we have the pLSI model which we mentioned earlier. It was a relatively popular model around the time that LDA was proposed, and is the model with highest generative capabilities of these three mentioned. 

![Eq. 3](imgs/PISLMdl.png)

pLSI proposes that each word is conditionally independant a "document label", $d$, given an unobserved topic $z$:

![Eq. 3](imgs/PISLEq.png)

This proposal aims to soften the constraint of having each document modeled as being generated from only one topic, as it is in the mixture of unigrams approach. It does so by incorporating the probability, $p(z | d)$ for a certain document $d$ as the mixture of topics that document. A true generative model cannot be created for this mixture however; as d is only a dummy index to the documents pISL was trained with, meaning it is a multinomial random variable with the same amount of possible values as training documents. This leads to the method only learning the topic mixtures, $p(z | d)$, for documents it has already seen, so there is no natural way to assign probability to an unseen document with it.
Another problem is that to model $k$ topics with pLSI you need K multinomial distributions with vocabulary size $V$ and $M$ mixtures over the hidden topics $k$ for each training document, resulting in $kV + kM$ parameters. Not only does this not scale well but it is also prone to overfitting.

LDA however treats the topic mixture weights as a $k$-parameter hidden variable, meaning the amount of parameters does not scale linnearly with the number training documents, and the generative model can still be used even with unseen documents.

We can see these differences geometrically as well if we examine the distribution over words as a $V$-1 dimensional on a vocabulary of size $V$ with another $k$-1 dimensional simplex spanning $k$ topics. We can set $V$ and $k$ to 3 for simplicity (3 words gives a two-dimensional triangle):

![title](imgs/UnigramSampling.png)

How this distribution is spread out and how it uses the topics distribution differs among the methods. The mixture of unigrams method pics a random point on the word simplex that corresponds to one of the topic simplex vertices k, and draws all the words for a document from the distribution corresponding to that point. pLSI assumes that all words in training documents belong to a single randomly chosen topic. The topics are drawn from a document-specific distribution, meaning each document has a topic distribution that sits on the topic simplex. The training documents then give an empirical distribution (with the marked 'x's) over the topic simplex. LDA instead models that <b>each word</b> in a document is drawn from a randomly chosen topic that is sampled from a distribution governed by a random parameter. Since this parameter is sampled once per document, it gives a smooth probability distribution over the topic simplex (the circular topology markers). 

Now that we know how LDA compares with other methods, lets take a look at how to do inference in LDA:

### Inference and Parameter Estimation with LDA

The main inference problem we will be interested in solving is the posterior distribution of the latent variables given a document, which would allow us to infer the topics associated with the document. This is given by the following equation:

\begin{equation}
p( \theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p( \theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)} 
\end{equation}

However, to compute the normalizing denominator we would rewrite equation 3 using equation 1 and $p(z_n \mid\theta)=\theta_i$ and then integrate, resulting in:

\begin{equation}
p( \mathbf{w} \mid \alpha, \beta) = \frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\int\Bigg(\prod_{i=1}^k\theta_i^{\alpha_i-1}\Bigg)\Bigg(\prod_{n=1}^N\sum_{i=1}^k\prod_{j=1}^V(\theta_i\beta_{ij})^{w_n^j}\Bigg)d\theta
\end{equation}

Unfortunately, this expression is intractable due to the coupling of $\theta$ and $\beta$ in the summation over topics. We can however solve this problem approximately using variational inference methods. 

#### Variational Inference for LDA

It is possible to use several VI methods for LDA, including La Place approximation, variational approximation, and MCMC methods. In our case, we will be using the convexity-based variational approximation that was mentioned in  Olga's tutorial. From there we learned that in this VI we attempt to reformulate/simplify the original graphical model by removing some dependencies and introducing free variational parameters. This leads to a family of distributions dependant on these variational parameters which form a lower bound on the log likelyhood. We then aim to find the parameter values that give the tightest lower bound. 

In our case, the problematic dependancy is between $\theta$ and $\beta$ which is introduced by the edges between $\theta, \mathbf{z}$ and $\mathbf{w}$ (remember w is a 'collision' node and is observed). If we simplify our model by removing these edges along with the <b>w</b> node, and introduce two variational parameters $\gamma$ and $\phi$ which give a family of distributions over the remaining latent variables, we are left with the graphical model shown on the right in the figure below:

![LDA VI](imgs/LDAVIGM.png "GM for the VI used for our LDA")

This results in the following distribution over the latent variables:

\begin{equation}
p( \theta, \mathbf{z} \mid \gamma, \phi) = q(\theta\mid\gamma)\prod_{n=1}^Nq(z_n\mid\phi_n)
\end{equation}

where the new Dirichlet parameters $\gamma$ and the multinomial parameters $\phi$ are the free variational parameters. Now having simplified our graphical model, we need to find the optimal values for the variational parameters ($\gamma^*, \phi^*$). From Olga's tutorial we know that this is equivalent to finding the values which minimize the KL divergence between the simplified distribution and the true posterior distribution:

\begin{equation}
( \gamma^*, \phi^*) = \arg\!\min_{(\gamma,\phi)}D\big(q( \theta, \mathbf{z} \mid \gamma, \phi) \mid\mid p( \theta, \mathbf{z} \mid \mathbf{w},\alpha, \beta\big)
\end{equation}

We do this by setting the derivatives of the KL divergence to zero w.r.t $\gamma$ and $\phi$ we get the following update equations for the parameters:

\begin{equation}
\phi_{ni} \propto \beta_{iw_n}\mathrm{exp}\big\lbrace\mathrm{E}_q\big[log(\theta_i)\mid\gamma\big]\big\rbrace = \beta_{iw_n}\mathrm{exp}\big\lbrace\Psi(\gamma_i)\big\rbrace
\end{equation}

\begin{equation}
\gamma_i = \alpha_i + \sum_{n=1}^N\phi_{ni}
\end{equation}

where $\Psi$ is the digamma function (first derivative of log $\Gamma$). It is important to note that these update equations derived from the KL divergence are dependant on a certain choice of <b>w</b>. This means that the shown approximation for the variational parameters is only valid for one set of words, and must therefore be calculated for each document when we use them later on. 

We must also find a way of estimating the $\beta$ matrix, as it is used in the approximations for the variational parameters. The log likelihood of the data given $\beta$ and $\alpha$ is intractable as we saw at the end of the previous section. However, it is possible to implement a variational EM procedure that gives us an approximation of the best value for $\beta$ by first maximizing a lower bound for the optimal variational parameters $\gamma^*,\phi^*$, then maximizing the lower bound w.r.t $\beta$ with the previously acquired variational parameters. Essentially we will iterate the following steps until a sufficient level of convergence:
<ol>
<li>(E-step) Find the optimizing values of the variational parameters {$\gamma^∗
_d,\phi^∗_d : d\in D$},for each document as described earlier.</li>
<li>(M-step) Maximize the resulting lower bound on the log likelihood w.r.t $\beta$ using:
\begin{equation}
\beta_{ij}\propto\sum_{d=1}^M\sum_{n=1}^{N_d}\phi^*_{dni}w^j_{dn}
\end{equation}
as well as maximize the resulting lower bound on the log likelihood w.r.t $\alpha$ (this will be given to you).
</ol>

In laymans terms, what we are essentially doing in the E-step is finding out "How prevalent are topics in the document across its words?". In the M-step we then ask "How prevalent are specific words across topics?". By using the answer from one question as a starting point for the other, we iteratively gain the answer to both. 

<span style="color:red">For proof for the update equations, see appendix of http://ai.stanford.edu/~ang/papers/jair03-lda.pdf</span>. This appendix also includes the derivation of the Newton-Raphson based method for updating $\alpha$.  

We have now seen the basic intuition behind LDA, and gone through methods for running inference based on the LDA model. In the next section we will put this knowledge in to practice.


## 2. Implementation

The goal of this task is to use LDA to create topics newsgroup documents, and infer the topic that new documents would belong to. In this setting, our document corpus is the Newsgroup dataset from scikit-learn, and a document is a certain document/e-mail. 

### Dataset

First we will load the dataset we will use for training and testing. We will simplify the example from the original paper to only do clustering for 4 topics, and only use 250 documents with a vocabulary of 750 words. While this does have an effect on the accuracy and performance of the algorithm, it's more important for you to be able to run the code in a managable amount of time. The documents I have chosen come from 4 different categories; "Christian Religion", "Hockey", "Space" and "Cars". This means that we have slightly unrealistic prior knowledge by assuming the exact correct number of topics, but don't worry there are bonus assignments in the end where you can play around with the number of topics. I have already compiled and done some preprocessing on the documents, as well as built the vocabulary dataset as pickle files. Run the code in the cell below and double check that you get the output "found 200 training and 50 test documents, with a vocabulary of 750 words". Do not worry if you get a warning regarding the version of CountVectorizer which is used for handling the vocabulary of the dataset. 

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle as pickle

#vectorizer = pickle.load(open("newsVectorizer2.p", "rb"))
#trainDocuments = pickle.load(open("newsTrainDocs.p", "rb"))
#testDocuments = pickle.load(open("newsTestDocs.p", "rb"))

vectorizer = pickle.load(open("vectorizerNews.p", "rb"))
trainDocs = pickle.load(open("trainDocsNews.p", "rb"))
testDocs = pickle.load(open("testDocsNews.p", "rb"))

origTrainDocs = pickle.load(open("trainDocsNewsOrig.p", "rb"))
origTestDocs = pickle.load(open("testDocsNewsOrig.p", "rb"))

print("Found ", len(trainDocs), "training and ", len(testDocs), " test documents, with a vocabulary of ", len(vectorizer.get_feature_names()), " words.")

Found  200 training and  50  test documents, with a vocabulary of  750  words.


### Parameter estimation

We must now find the optimal values for the variational parameters, as well as the values for $\alpha$ and the $\beta$ matrix that were introduced in the variational inference section. In order to follow the instructions given in the VI section we will need to do some setup first, so run the following cells:

In [2]:
#Imports
import numpy as np
import scipy.special as special
import scipy.optimize 
import time

#diGamma func from scipy, use this in your code!
diGamma = special.digamma

#Function definitions for maximizing the VI parameters. This will later be completed by you.
def maxVIParam(phi, gamma, B, alpha, M, k, Wd, eta):
    
    for d in range(M):
        N = len(Wd[d])
        #Initialization of vars, as shown in E-step. 
        phi[d] = np.ones((N,k))*1.0/k
        gamma[d] = np.ones(k)*(N/k) + alpha
        converged = False
        j = 0 #you can use this to print the update error to check your code in the beginning with something like:
        '''if(j%10==0 and d==0):
                    print("u e: ", updateError)'''
        
        #-----Complete the method here by implementing the pseudo code for the E-Step----
        while(not converged):
            for n in range(N):
                for i in range(k):
                    #print("old,", phi[n,i])
                    phi[d][n,i] = B[i,Wd[d][n]]*np.exp(diGamma(gamma[d][i]))
                    #print("new:", phi[n,i])
                if(np.sum(phi[d][n])==0):
                    print("crap, phi sum is: ", np.sum(phi[d][n]), "for doc ",d)
                    phi[d][n] = np.zeros(k)
                else:
                    phi[d][n] = phi[d][n]/np.sum(phi[d][n])

            updateError = 0
            for i in range(k):
                gammaOld = gamma[d][i]
                #print("old:,", gammaOld)
                gamma[d][i] = alpha[i] + np.sum(phi[d][:,i])
                #print("new: ", gamma[d][i], "der: ",np.exp(diGamma(gammaOld)))
                updateError += abs(gammaOld - gamma[d][i]) 
            if(updateError < eta):
                converged = True
            
    
    return gamma, phi

#Function definitions for maximizing the B parameter. This will later be completed by you.
def MaxB(B, phi, k, V, M, Wd):
    
    #YOUR CODE FOR THE M-STEP HERE
    for i in range(k):
        for j in range(V):
            B[i,j] = 0
            for d in range(M):    

                if(j in Wd[d]):
                    Wj = 1
                else:
                    Wj = 0

                for n in range(len(phi[d])):
                    B[i,j] += phi[d][n,i]*Wj
    return B


In [3]:
'''Here are the functions needed for updating the alpha parameter, as shown in the start of appendix A.4.2.
These are all provided for you as it is just plugging in the definition for the gradient and hessian into the 
Newton-Raphson based method to find a stationary point using SciPy. Feel free to take a look at the appendix to
see where these values come from.'''

#value of Likelihood(gamma,phi,alpha,beta) function w.r.t. alpha terms (see start of appendix A.4.2) 
def L_alpha_val(a):
    val = 0
    M = len(gamma)
    k = len(a)
    for d in range(M):
        val += (np.log(scipy.special.gamma(np.sum(a))) - np.sum([np.log(scipy.special.gamma(a[i])) for i in range(k)]) + np.sum([((a[i] -1)*(diGamma(gamma[d][i]) - diGamma(np.sum(gamma[d])))) for i in range(k)]))

    return -val

#value of the derivative of above func w.r.t. alpha_i (2nd eq of appendix A.4.2) 
def L_alpha_der(a):
    M = len(gamma)
    k = len(a)
    der = np.array(
    [(M*(diGamma(np.sum(a)) - diGamma(a[i])) + np.sum([diGamma(gamma[d][i]) - diGamma(np.sum(gamma[d])) for d in range(M)])) for i in range(k)]
    )
    return -der

def L_alpha_hess(a):
    hess = np.zeros((len(a),len(a)))
    for i in range(len(a)):
        for j in range(len(a)):
            k_delta = 1 if i == j else 0
            hess[i,j] = k_delta*M*scipy.special.polygamma(1,a[i]) - scipy.special.polygamma(1,np.sum(a))
    return -hess

def MaxA(a):
    res = scipy.optimize.minimize(L_alpha_val, a, method='Newton-CG',
        jac=L_alpha_der, hess=L_alpha_hess,
        options={'xtol': 1e-8, 'disp': False})
    print(res.x)
    
    return res.x

With that in place, we can now initialize the required parameters and define the skeleton of our loop for the parameter estimation:

In [4]:
eta = 10e-5 #threshold for convergence

#hyperparamater init.
V = len(vectorizer.get_feature_names()) #vocab. cardinality
M = int(len(trainDocs)) #number of documents
k = 4 #amount of emotions

nIter=200 # number of iterations to run until parameter estimation is considered converged

#initialize B matrix as random valid distr (most common according to https://profs.sci.univr.it/~bicego/papers/2015_SIMBAD.pdf)
B = np.random.rand(k,V)

#normalize B
for i in range(k):
    B[i,:] = B[i]/np.sum(B[i])
    
alpha = np.ones(k)
#variational params (one for each doc)
phi = [None]*M
gamma = [None]*M

#word lists for all docs
Wd = [None]*M

'''Since scikit gives a matrix of counts of all words, and we want a list of words,
we do some quick transformations here. This gives us a representation of the documents 
as a list of numbers, where each number is the vocabulary index of a word. This way, to access
B_ij where i is the ith topic and j is the nth word in the document d, you can simply write B[i][Wd[d][n]]. If you want
replace this code a different representation for the words in a document, such as a one-hot vector for each word, you are
of course free to do so but make sure to keep track of your indexes'''

for d in range(M):
    Wmat = vectorizer.transform([trainDocs[d]]).toarray()[0] #get vocabulary matrix for document
    WVidxs = np.where(Wmat!=0)[0]
    WVcounts = Wmat[WVidxs]
    N = np.sum(WVcounts)
    W = np.zeros((N)).astype(int)

    i = 0
    for WVidx, WV in enumerate(WVidxs):
        for wordCount in range(WVcounts[WVidx]):
            W[i] = WV
            i+=1
    Wd[d] = W #We save the list of words for the document for analysis later

start = time.time()

#start of parameter estimation loop
for j in range(nIter):
    print("on iter: ", j)
    #Variational EM for gamma and phi (E-step from VI section)
    start = time.time()
    gamma, phi = maxVIParam(phi, gamma, B, alpha, M, k, Wd, eta)
    end = time.time()
    #Bold = np.copy(B)
    B = MaxB(B,phi,k,V,M,Wd) #first half of M-step from VI section 
    #renormalize B
    for i in range(k):
        B[i,:] = B[i]/np.sum(B[i])
    
    #print("B max diff: ", np.amax(abs(B-Bold)))
    end = time.time()
    
    alpha = MaxA(alpha) #second half of M-step from VI section 
    end = time.time()
    print("iter took: ", end-start)
    print("new alpha: ", alpha)
    
end = time.time()
print("took: ", end-start)

on iter:  0
B max diff:  3063.96729225
[ 1.63959055  2.15508332  1.85917669  1.37267296]
iter took:  91.53012180328369
new alpha:  [ 1.63959055  2.15508332  1.85917669  1.37267296]
on iter:  1
B max diff:  4381.47512277
[ 1.33817758  1.92420598  1.1402266   1.06614479]
iter took:  102.99530720710754
new alpha:  [ 1.33817758  1.92420598  1.1402266   1.06614479]
on iter:  2
B max diff:  4730.54074284
[ 0.68782902  0.98267654  0.52623431  0.58272761]
iter took:  61.22580361366272
new alpha:  [ 0.68782902  0.98267654  0.52623431  0.58272761]
on iter:  3
B max diff:  4782.60032849
[ 0.41393411  0.57621589  0.3022423   0.37951737]
iter took:  64.54079818725586
new alpha:  [ 0.41393411  0.57621589  0.3022423   0.37951737]
on iter:  4
B max diff:  4767.71837023
[ 0.26221524  0.38130595  0.18435596  0.2719242 ]
iter took:  62.026681423187256
new alpha:  [ 0.26221524  0.38130595  0.18435596  0.2719242 ]
on iter:  5
B max diff:  4668.59704455
[ 0.17298086  0.25695241  0.12308987  0.21966763]
iter

#### VI Parameter Estimation
We can now begin with implementing the "E step" in the previous section which updates the variational parameters. The pseudo code for this is the following (remember these have to be calculated separately for each document):
![VIPseudo](imgs/VIPseudo.png)

Since we are working with four topics, k will be set to 4 and N will be the amount of words in the current document. Regarding the "until convergence" condition, it is sufficient to check if the largest difference between the previous and new gamma is less than $10^{-5}$. Now, use the pseudo code to fill in the missing code in the "MaxVIParam" function defined earlier and remember to use the provided diGamma function. To see that your implementation seems to be working, set nIter to 1 and add some printouts of the difference between updates for gamma, then check that they are converging to something smaller.

#### Beta Matrix Estimation
After you have implemented the MaxVIParam function, it's time to update the Beta matrix. Recall that the update function for Beta was:
\begin{equation}
\beta_{ij}\propto\sum_{d=1}^M\sum_{n=1}^{N_d}\phi^*_{dni}w^j_{dn}
\end{equation}
Implement this in the definition for MaxB above. To verify your code, you may set nIter to something low such as 10 and uncomment the "oldB" and "B max diff" lines in the code. The diff might increase at first but should start decreasing before/around iteration 10. After you have verified this, set nIter to 100 (updates should be negligable by then) and let it run unattended as it might take a couple hours. You can use the code in the following cell to save/load the parameter values you calculated for later so you don't have to re-run everything.

In [5]:
pickle.dump(alpha, open("myAlphaNews100.p", "wb"))
pickle.dump(B, open("myBetaNews100.p", "wb"))

#alpha = pickle.load(open("myAlphaNews.p", "rb"))
#B = pickle.load(open("myBetaNews.p", "rb"))

#### Verification
Let's take a look at what we've done so far. We can get an idea of what our implementation has done up to this point by inspecting the B matrix. As you may remember, B$_{ij}$ holds the probability of a vocabulary word j being representative of a certain topic i. Using the code in the following cell we can see the most representative words for our 4 topics:

In [7]:
#representation of top words for each topic:
nTop = 20
for i in range(k):
    topVocabs = np.argsort(B[i])[-nTop:]
    topWords = np.array(vectorizer.get_feature_names())[topVocabs]
    print("top words for topic ",i,": ")
    print(topWords)

top words for topic  0 : 
['data' 'satellite' 'launch' 'spacecraft' 'development' 'national' 'world'
 'question' 'new' 'science' 'year' 'distribution' '1993' 'high' 'program'
 'nntp' 'nasa' 'post' 'use' 'space']
top words for topic  1 : 
['reason' 'bible' 'word' 'like' 'make' 'use' 'come' 'way' 'christ' 'thing'
 'question' 'good' 'think' 'time' 'believe' 'christian' 'know' 'say'
 'people' 'god']
top words for topic  2 : 
['blue' 'point' '92' '1992' '93' '90' 'star' '86' 'red' 'playoff' 'wing'
 'contact' 'nhl' 'season' 'game' 'hockey' 'player' 'year' 'play' 'team']
top words for topic  3 : 
['usa' 'work' 'make' 'look' 'really' 'thing' 'way' 'year' 'come' 'use'
 'people' 'good' 'nntp' 'like' 'say' 'time' 'car' 'know' 'post' 'think']


Since there are no guarantees regarding the order of the topics (LDA is unsupervised) or what your initial B matrix values were, it is difficult to say exactly what results you should be seeing. Hopefully, you can see the four original topics to some extent in your result. For example, one of my topics had top words like "christ", and "god", meaning it was most likely the topic for "Christian Religion" documents, while another literally had the words "Hockey" and "NHL" in it. Our vocabulary is as mentioned earlier quite limited, so it may be possible that one of your topics is a bit unclear or close to another. You can also load the $\alpha$ and $\beta$ values in the cell below which are pre-calculated for 200 iterations and compare your topic results to those available there.

In [9]:
alphaTest = pickle.load(open("CompareAlphaNews200.p", "rb"))
BTest = pickle.load(open("CompareBetaNews200.p", "rb"))
vecTest = pickle.load(open("vectorizerNews.p", "rb"), encoding='latin1')
nTop = 20
for i in range(k):
    topVocabs = np.argsort(BTest[i])[-nTop:]
    topWords = np.array(vecTest.get_feature_names())[topVocabs]
    print("top words for topic ",i,": ")
    print(topWords)

top words for topic  0 : 
['data' 'satellite' 'launch' 'spacecraft' 'development' 'national' 'world'
 'question' 'new' 'science' 'year' 'distribution' '1993' 'high' 'program'
 'nntp' 'nasa' 'post' 'use' 'space']
top words for topic  1 : 
['reason' 'bible' 'word' 'like' 'make' 'use' 'come' 'way' 'christ' 'thing'
 'question' 'good' 'think' 'time' 'believe' 'christian' 'know' 'say'
 'people' 'god']
top words for topic  2 : 
['blue' 'point' '92' '1992' '93' '90' 'star' '86' 'red' 'playoff' 'wing'
 'contact' 'nhl' 'season' 'game' 'hockey' 'player' 'year' 'play' 'team']
top words for topic  3 : 
['usa' 'work' 'make' 'look' 'really' 'thing' 'way' 'year' 'come' 'use'
 'people' 'good' 'nntp' 'like' 'say' 'time' 'car' 'know' 'post' 'think']


#### Inferring the topic of a test document

In this section we will be using our estimated parameter values to infer the topic of some test documents. In order to do this, we will have to calculate the phi and gamma values for each new document we would like to do inference on. This is rather straight forward, and you should be able to reuse your code from the previous sections together with the test documents as a corpus instead.

In [10]:

eta = 10e-5 #threshold for convergence

#we are not re-initializing beta and alpha, we calculated them using the training docs.

V = len(vectorizer.get_feature_names()) #vocab. cardinality
M = int(len(testDocs)) #number of documents
k = 4 #amount of emotions

#variational params (one for each doc)
phi = [None]*M
gamma = [None]*M
WdTest = [None]*M

'''Same magic from before to get the word matrix correct, replace this if you redid this earlier.'''

for d in range(M):
    Wmat = vectorizer.transform([testDocs[d]]).toarray()[0] #get vocabulary matrix for document
    WVidxs = np.where(Wmat!=0)[0]
    WVcounts = Wmat[WVidxs]
    N = np.sum(WVcounts)
    W = np.zeros((N)).astype(int)

    i = 0
    for WVidx, WV in enumerate(WVidxs):
        for wordCount in range(WVcounts[WVidx]):
            W[i] = WV
            i+=1
    WdTest[d] = W #We save the list of words for the document for analysis later

'''Now that you have your variables initialized for the test documents, you should be able to use your previous code for 
maximizing the VI parameters with those variables instead. Remember, we're just calculating the variational parameters
gamma and phi for each test document so there is no iteration between maximizing Beta and maximizing gamma and phi.'''

#Run the gamma/phi maximization here.
gamma, phi = maxVIParam(phi, gamma, B, alpha, M, k, WdTest, eta)


We have now calculated the variational parameters for our test documents, so let us see what information we can infer from them. If you take a look at the pseudo code we used for the MaxVIParam method, you can see that the posterior gamma parameter $\gamma_i $we are calculating is approximately the prior Dichlet parameter $\alpha_i$ added to the expected number of words that were generated by that $i^{th}$ topic for a certain document. Looking at the values for the different $\gamma_i$ over all words for a test document tells us what mixture of topics form such a document. Let us now take a look at the mixtures for some of our test documents by running the code in the cell below: 

In [11]:
#take a look at some example test documents (14-24 has a nice mix of topics, with a couple difficult ones)
dStart = 14
dEnd = 24
for d in range(dStart,dEnd):
    print("Estimated mixture for document ", d," is: ")
    print("_______________________")
    for i in range(len(gamma[d])):
        print("topic ", i,": ", gamma[d][i]/np.sum(gamma[d]))
    print("_______________________")
    print("Which has the following text:")
    print(" ")
    print(origTestDocs[d])
    print("__________________________________________")
    print("__________________________________________")

Estimated mixture for document  14  is: 
_______________________
topic  0 :  0.000592033243315
topic  1 :  0.450022968527
topic  2 :  0.297347701863
topic  3 :  0.252037296367
_______________________
Which has the following text:
 
From: colombo@bronco.fnal.gov (Rick 'Open VMS 4ever' Colombo)
Subject: Re: Do trains/busses have radar?
Nntp-Posting-Host: bronco.fnal.gov
Organization: Fermilab Computing Division
Lines: 27

In article <C5FqFy.Fpq@usenet.ucs.indiana.edu>, mliggett@silver.ucs.indiana.edu (matthew liggett) writes:
> In <1993Apr13.111652@usho72.hou281.chevron.com> hhtra@usho72.hou281.chevron.com (T.M.Haddock) writes:
> 
> 
>> While taking an extended Easter vacation, I was going north on I-45
>> somewhere between Centerville, TX and Dallas, TX and I came upon a 
>> train parked on a trestle with its locomotive sitting directly over
>> the northbound lanes.  There appeared to be movement within the cab 
>> and out of curiosity I slowed to 85 to get a better look.  Just as I
>> 

Revisit the cell that presented the top words for your topics. Do the presented mixtures above make sense if you look at the document content? Recall which original categories ("Religion", "Cars", "Hockey", "Space") you (to your best ability) assigned to which numbers. Do the texts seem to be discussing those topics?

<span style="color:blue">If you're re-doing this test with the MoodyLyrics dataset from the bonus section, you may be noticing some weird results. LDA can experience some issues in this setting, as for example many words that would be present in a happy song could also be present in a sad song ('love', 'hold', 'forever') but in different order or with certain "negating" words between them. It is possible to alleviate this problem by using a vocabulary of n-grams, however this increases the total size of the vocabulary (and therefore the run time as well) substantially. </span>

It is also possible to gain some more insight by examining the $\phi$ values for the documents. Recall that the $\phi$ values for the document approximate $p(z_n | \mathbf{w})$, showing how the topics are mixed for the words in the document. The cell below provides a method for printing the phi values for each word in a document. Apart from just examining how the topics are mixed for specific words, take a look at the topic mixtures for the same word that appears in several different documents. As stated in the theory section, in LDA the distribution of topic mixtures that are assigned to each word is sampled differently for each document. This means that hopefully it should be apparent from your results how the topic mixture for the same word can be differ in different test documents.

In [12]:
#14-24 gives a good mix, but try whatever you like
dStart = 14 
dEnd = 24 


def getWordsFromMatrix(WdTest):
    originalWords  = np.array(vectorizer.get_feature_names())[WdTest] 
    return originalWords

for dk in range(dStart,dEnd):
    
    origWords = getWordsFromMatrix(WdTest[dk])
    wordMixtures = [origWords[n] + "\t: " + str(phi[dk][n]) for n in range(len(phi[dk]))]
    for wm in set(wordMixtures):
        print(wm)
    print("________________________________")


ucs	: [  2.59726755e-19   5.02561244e-01   5.62327252e-19   4.97438756e-01]
27	: [  5.86807521e-18   1.80996807e-01   5.83163425e-01   2.35839768e-01]
know	: [  1.90743410e-18   5.60808416e-01   1.00444381e-01   3.38747203e-01]
explain	: [  9.28633566e-19   6.31275223e-01   1.07737769e-02   3.57951000e-01]
open	: [  5.59818834e-18   6.35751703e-01   5.40271583e-02   3.10221139e-01]
sign	: [  2.17991145e-35   7.24872425e-02   5.82853519e-01   3.44659238e-01]
strong	: [  5.16714643e-20   5.77298820e-01   4.13754990e-01   8.94619003e-03]
usa	: [  5.03755456e-19   3.47041142e-02   2.24883836e-01   7.40412050e-01]
indiana	: [  4.01707639e-19   7.77288771e-01   5.10691350e-19   2.22711229e-01]
dallas	: [  7.13459885e-36   4.78322208e-14   7.55501129e-01   2.44498871e-01]
nntp	: [  5.27721318e-18   9.67768185e-02   3.72268236e-01   5.30954946e-01]
gov	: [  1.51998664e-17   7.64540936e-01   3.06874785e-02   2.04771586e-01]
radio	: [  9.83470753e-18   6.97552131e-02   6.35577975e-01   2.9466681

#### Bonus Objectives

Well done! You have now implemented LDA, approximated the necessary variational parameters, and examined the results to infer information about topics in documents. If you feel like you would like to experiment some more, there some variants that you could try:

1. Load the provided dataset from the Associated Press docs dataset. This has random news articles from an undisclosed number of topics. Replace the dataset code in the beginning with what is provided in the next cell and redo your tests. What kind of topics does your result have? How many topics did you assume there were? (Some interesting cases I got were general topics like Crime and Economics and then one focusing solely on foreign affairs with President Bush)

2. Run the tests using the MoodyLyrics dataset instead. This dataset includes the lyrics from songs in many different genres (I've included has slightly less than 200 / 50 documents and V=500). Run the tests again and see what kind of sense LDA tries to make out of these song lyrics. The dataset also provides an annotation as to what emotion ("Angry", "Sad", "Happy", "Relaxed") the song exhibits. Can you find a resemblence in your topics to these emotions? (<i>Disclaimer: The lyrics provided are not censored and some are not exactly "PG-13")</i>

<b>It is possible to start to see results after ~50-60 iterations so if you would like to try out these bonus exercises you need not wait overnight</b>

In [None]:
#loading the AP docs dataset instead:
#(everything else should work like before)
vectorizer = pickle.load(open("vectorizerAP.p", "rb"), encoding='latin1')
trainDocs = pickle.load(open("trainDocsAP.p", "rb"), encoding='latin1')
testDocs = pickle.load(open("testDocsAP.p", "rb"), encoding='latin1')


#loading the moodyLyrics dataset instead:
vectorizer = pickle.load(open("vectorizerMoodyLyrics.p", "rb"), encoding='latin1')
trainLyricsFile = pickle.load(open("trainDocsMoodyLyrics.p", "rb"), encoding='latin1')
testLyricsFile = pickle.load(open("testDocsMoodyLyrics.p", "rb"), encoding='latin1')

trainDocs = trainLyricsFile['lyrics']
testDocs = testLyricsFile['lyrics']
#original moods can be seen with: trainGT = trainLyricsFile['groundTruth'] but the labeling is not perfect. 

In [120]:
#loading a larger vocabulary: 

for dk in range(10,20):
    print(testLyricsFile['groundTruth'][dk])

#for dk in range(0,len(trainLyrics['lyrics'])):
#    print(trainLyrics['groundTruth'][dk])

happy
angry
relaxed
happy
relaxed
happy
angry
sad
angry
angry


In [21]:
from sklearn.datasets import fetch_20newsgroups
categories=['rec.autos', 'soc.religion.christian', 'sci.space', 'rec.sport.hockey']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print("\n".join(twenty_train.data[4].split("\n\n")[1:]))

As a matter of interest does anyone know why autos are so popular in the US while 
here in Europe they are rare??? Just wondering.....
-- 
___________________________________________________________________ ____/|
John Kissane                           | Motorola Ireland Ltd.,   | \'o.O'
UUCP    : ..uunet!motcid!glas!kissanej | Mahon Industrial Estate, | =() ()=
Internet: kissanej@glas.rtsg.mot.com   | Blackrock, Cork, Ireland |    U



In [55]:
#print("\n".join(twenty_train.data[113].split("\n")))
print(twenty_train.data[68])

From: revdak@netcom.com (D. Andrew Kille)
Subject: Re: Easter: what's in a name? (was Re: New Testament Double Stan
Organization: NETCOM On-line Communication Services (408 241-9760 guest)
Lines: 40

Daniel Segard (dsegard@nyx.cs.du.edu) wrote:

[a lot of stuff deleted]

:      For that matter, stay Biblical and call it Omar Rasheet (The Feast of
: First Fruits).  Torah commands that this be observed on the day following
: the Sabbath of Passover week.  (Sunday by any other name in modern
: parlance.)  Why is there so much objection to observing the Resurrection
: on the 1st day of the week on which it actually occured?  Why jump it all
: over the calendar the way Easter does?  Why not just go with the Sunday
: following Passover the way the Bible has it?  Why seek after unbiblical
: methods?
:  
In fact, that is the reason Easter "jumps all over the calendar"- Passsover
itself is a lunar holiday, not a solar one, and thus falls over a wide
possible span of times.  The few times that E

In [2]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

NameError: name 'CountVectorizer' is not defined

In [117]:
#subTrainTops = twenty_train.ca
for t in twenty_train.target[250:300]:
    print(twenty_train.target_names[t])

soc.religion.christian
rec.sport.hockey
rec.autos
rec.autos
soc.religion.christian
rec.autos
rec.sport.hockey
rec.autos
rec.sport.hockey
sci.space
rec.autos
sci.space
rec.sport.hockey
rec.autos
rec.autos
sci.space
soc.religion.christian
rec.sport.hockey
rec.autos
rec.sport.hockey
soc.religion.christian
sci.space
rec.autos
sci.space
sci.space
rec.sport.hockey
soc.religion.christian
sci.space
sci.space
rec.autos
soc.religion.christian
rec.sport.hockey
soc.religion.christian
rec.sport.hockey
rec.sport.hockey
rec.autos
sci.space
rec.autos
sci.space
rec.autos
soc.religion.christian
soc.religion.christian
sci.space
rec.sport.hockey
soc.religion.christian
rec.sport.hockey
soc.religion.christian
soc.religion.christian
rec.autos
rec.sport.hockey


In [118]:
newsTrain = []
for newsD in range(200):
    newsTrain.append(''.join([i for i in twenty_train.data[newsD] if not i.isdigit()]))
  
newsTest = []
for newsD in range(250,300):
    newsTest.append(''.join([i for i in twenty_train.data[newsD] if not i.isdigit()]))
#newsTrain = twenty_train.data[:200]
#newsTest = twenty_train.data[200:250]
count_vect = CountVectorizer(max_df=0.5, min_df=0.01, stop_words='english', max_features=750)
X_train_counts = count_vect.fit_transform(newsTrain + newsTest)
X_train_counts.shape

(250, 750)

In [119]:
#print(count_vect.get_feature_names())
pickle.dump(newsTrain, open("newsTrainDocs.p", "wb"))
pickle.dump(newsTest, open("newsTestDocs.p", "wb"))
pickle.dump(count_vect, open("newsVectorizer2.p", "wb"))

In [121]:
print((newsTrain[0]))

From: darling@cellar.org (Thomas Darling)
Subject: Re: WHERE ARE THE DOUBTERS NOW?  HMM?
Organization: The Cellar BBS and public access system
Lines: 

jason@studsys.mscs.mu.edu (Jason Hanson) writes:

> In article <Apr..@ramsey.cs.laurentian.ca> maynard@ramsey.cs.
> >
> >And after the Leafs make cream cheese of the Philadelphia side tomorrow
> >night the Leafs will be without equal.
> 
> Then again, maybe not.

To put it mildly.  As I watched the Flyers demolish Toronto last night, -,
I realized that no matter how good the Leafs' # line may be, they'll need
one or two more decent lines to go far in the playoffs.  And, of course, a
healthy Felix Potvin.

^~^~^~^~^~^~^~^~^\\\^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^
Thomas A. Darling \\\ The Cellar BBS & Public Access System: ..
darling@cellar.org \\\ GEnie: T.DARLING \\ FactHQ "Truth Thru Technology"
v~v~v~v~v~v~v~v~v~v~\\\~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v~v



In [124]:
print(count_vect.get_feature_names())



In [126]:
np.ones((4))*3 - np.ones((4))*1.5

array([ 1.5,  1.5,  1.5,  1.5])

In [22]:
pp = NLTKPreprocessor()
trainDocsFiles = twenty_train.data[:200]
ppDoc = []
trainDocs = []
for dd in range(len(trainDocsFiles)):
    derp = pp.tokenize(trainDocsFiles[dd])
    ppDoc = []
    for wor in derp:
        ppDoc.append(wor)
    trainDocs.append(" ".join(ppDoc))
print(trainDocs[0])


darling cellar thomas darling doubter hmm organization cellar bb public access system 18 jason studsys msc mu jason hanson write article 1993apr4 051942 27095 ramsey c laurentian ca maynard ramsey c leaf make cream cheese philadelphia side tomorrow night leaf without equal maybe put mildly watch flyer demolish toronto last night realize matter good leaf line may need one two decent go far playoff course healthy felix potvin thomas darling cellar bb public access system 215 539 3043 darling cellar genie darling facthq truth thru technology


In [25]:
pickle.dump(trainDocsFiles, open("trainDocsNews2Orig.p", "wb"))
pickle.dump(testDocsFiles, open("testDocsNews2Orig.p", "wb"))

In [23]:
testDocsFiles = twenty_train.data[250:300]
ppDoc = []
testDocs = []
for dd in range(len(testDocsFiles)):
    derp = pp.tokenize(testDocsFiles[dd])
    ppDoc = []
    for wor in derp:
        ppDoc.append(wor)
    testDocs.append(" ".join(ppDoc))
print(testDocs[0])

rick_granberry pt mot rick granberry help reply rick_granberry pt mot rick granberry organization motorola paging telepoint system group 46 article apr 21 03 26 51 1993 1379 geneva rutgers lmvec westminster ac uk william hargreaves write hi everyone commited christian battle problem know romans talk save faith deed yet hebrew james say faith without deed useless say fool still think believing enough someone fully believe life totally lead god accord roman person still save faith 02 yes believe scenario possible either believe living least part lead god else believing intellectually wait enough especially important remember one judge whether committed judge someone else guess close come know someone situation listen statement fallible sense communion one another bit say god prefer someone cold know condemn lukewarm christian someone know believe god make attempt live bible regard passage need remember letter church laodicea people body christ rev 14 16 talk work translation could say sa

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle as pickle
count_vect = CountVectorizer(max_df=0.5, min_df=0.01, stop_words='english', max_features=750)
X_train_counts = count_vect.fit_transform(trainDocs + testDocs)
X_train_counts.shape


pickle.dump(trainDocs, open("trainDocsNews2.p", "wb"))
pickle.dump(testDocs, open("testDocsNews2.p", "wb"))
pickle.dump(count_vect, open("vecNews2.p", "wb"))

In [20]:
import string

from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

from sklearn.base import BaseEstimator, TransformerMixin


class NLTKPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):
        # Break the document into sentences
        for sent in sent_tokenize(document):
            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                #no 1 letter words
                if len(token)<2:
                    continue
                    
                if token=='org' or token=='edu' or token=='lines' or token=='university' or token=='subject' or token=='posted' or token=='hosted' or token=='host' or token=='com':
                    continue
                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

