## Topic modelling using Gensim

Topic modelling differs from rule-based text mining approaches that rely on regular expressions or dictionary-based keyword searches.

It is an unsupervised approach for discovering groups of words (called “topics”) in large collections of text.

A topic can be defined as “a repeating pattern of co-occurring terms in a corpus”.

A good topic model should produce “health”, “doctor”, “patient”, “hospital” for a “Healthcare” topic,
and “farm”, “crops”, “wheat” for a “Farming” topic.

![Modeling1.png](attachment:Modeling1.png)

We will use latent Dirichlet allocation (LDA) to find the topics in our documents.

* Latent Dirichlet allocation (LDA) is a generative statistical model.
* It allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. 

If observations are words collected into documents:

* It posits that each document is a mixture of a small number of topics
* And that each word's creation is attributable to one of the document's topics.
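
The generative story described above can be sketched with the standard library alone. This is a toy illustration, not how gensim trains a model: the function names, the two hand-written topics and the `alpha` values are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate a toy document following the LDA generative story."""
    # 1. Choose the document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    names = list(topics)
    words = []
    for _ in range(length):
        # 2. Choose a topic for this word position from the mixture.
        topic = rng.choices(names, weights=theta, k=1)[0]
        # 3. Choose a word from that topic's word distribution.
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Hypothetical topics, reusing the example words from the introduction.
toy_topics = {
    "healthcare": {"health": 0.3, "doctor": 0.3, "patient": 0.2, "hospital": 0.2},
    "farming": {"farm": 0.4, "crops": 0.3, "wheat": 0.3},
}

rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", [round(p, 3) for p in theta])
print("document:", doc)
```

Training an LDA model runs this story in reverse: given only the documents, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.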




In [1]:
import json
import logging
import os
import re
import warnings

import nltk
import numpy as np
import pyLDAvis.gensim  # In newer pyLDAvis releases (3.x) this module is named pyLDAvis.gensim_models
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from numpy import array

from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

warnings.filterwarnings('ignore')  # Suppress all warnings to keep the notebook output readable

### Set up logging

In [2]:
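logger = logging.getLogger()
logger.setLevel(logging.ERROR)  # Emit only messages at ERROR level or above
logging.debug("test")  # Sanity check: prints nothing, because DEBUG is below ERROR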

### Set up corpus

As stated in Table 2 of [this paper](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), this corpus essentially has two classes of documents.

* The first set of documents consists of Enron emails.
* The second set of documents consists of scientific papers.

### Experimental setup
We will set up two LDA models:
* one trained with 50 iterations
* the other trained with just 1.

The model trained with 50 iterations (the "good" model) should capture the two-class structure of the corpus better than the "bad" model.

Therefore, in theory, topic coherence should be higher for the good LDA model than for the bad one.

In [3]:

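class MyCorpus(corpora.TextCorpus):
    """Streams every file in `path` as one preprocessed document."""

    def __init__(self, input=None, path='assets/topicmodelling/'):
        self.path = path
        # Calls the initializer of TextCorpus's parent class, bypassing TextCorpus.__init__
        super(corpora.TextCorpus, self).__init__()
        self.input = input
        self.dictionary = Dictionary()
        self.metadata = False
        # Build the vocabulary with one full pass over the documents
        self.dictionary.add_documents(self.get_texts())

    def __len__(self):
        # One document per file in the corpus directory
        return len(os.listdir(self.path))

    def get_texts(self):
        """Yield each file in the corpus directory as a list of stemmed tokens."""
        for count, fl in enumerate(os.listdir(self.path)):
            # Print progress every 1000 documents
            if count % 1000 == 0:
                print(count)
            # Close the file promptly; undecodable bytes are skipped
            with open(os.path.join(self.path, fl), errors='ignore') as input_file:
                dat = input_file.read()
            yield ie_preprocess(dat)


def ie_preprocess(document):
    """Strip non-letters, tokenize, drop stopwords and stem the remaining words."""
    # `stop` and `stemmer` are globals defined in the next cell; they must exist before this is called.
    # Note that line breaks are removed too, so words separated only by a newline are joined.
    document = re.sub('[^A-Za-z ]+', '', document)
    document = [stemmer.stem(w) for w in nltk.word_tokenize(document) if w.lower() not in stop]
    return document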



### Initialize the stopwords and stemmer used to preprocess the documents

In [4]:
stop = stopwords.words('english')
add_stopwords = ['said', 'mln', 'billion', 'million', 'pct', 'would', 'inc', 'company', 'corp']
stop += add_stopwords
stemmer = EnglishStemmer()


### Load the corpus and build the dictionary

In [5]:
corpus = MyCorpus()

0


### Set up two topic models

We'll set up two different LDA topic models: a good one and a bad one. To build the "good" model, we simply train it for more iterations than the bad one. In theory, the `u_mass` coherence should therefore be better for the good model than for the bad one, since the good model should produce more "human-interpretable" topics.

In [6]:
goodLdaModel = LdaModel(corpus=corpus, id2word=corpus.dictionary, iterations=50, num_topics=2, minimum_probability=0)

0


In [7]:
badLdaModel = LdaModel(corpus=corpus, id2word=corpus.dictionary, iterations=1, num_topics=2, minimum_probability=0)

0


### Using u_mass coherence to check how well our models are trained

In [8]:
goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=corpus.dictionary, coherence='u_mass')

In [9]:
badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=corpus.dictionary, coherence='u_mass')
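
To make the idea behind `u_mass` more concrete, here is a simplified sketch of a document co-occurrence coherence score. It is not gensim's implementation: the function name, the smoothing constant `eps` and the toy documents are assumptions chosen for illustration. For each pair of top words in a topic, it asks how often the two words appear in the same document, relative to how often the higher-ranked word appears at all.

```python
import math


def umass_coherence(top_words, documents, eps=1e-12):
    """Average log conditional co-occurrence over ordered pairs of top words.

    `top_words` is assumed to be ordered from most to least probable.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # Skip words that never occur in the reference documents
            d_ml = doc_freq(top_words[m], top_words[l])
            scores.append(math.log((d_ml + eps) / d_l))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical, already-tokenized documents
toy_docs = [
    ["enron", "rate"],
    ["enron", "rate"],
    ["model", "learn"],
    ["model", "learn"],
]

coherent = umass_coherence(["enron", "rate"], toy_docs)
mixed = umass_coherence(["enron", "model"], toy_docs)
print(coherent > mixed)
```

Words that always occur together score close to zero, while words that never share a document score strongly negative, so in this sketch a higher (less negative) value means a more coherent topic.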

### Interpreting the topics

As the LDA visualizations below show, the two models come up with topics composed of the following words:
1. goodLdaModel:
    - __Topic 1__: More weight is assigned to words such as "enron", "kevin", "color" and "rate", which captures the first set of documents.
    - __Topic 2__: More weight is assigned to words such as "learn", "al" and "model", which captures the second set of documents.
2. badLdaModel:
    - __Topic 1__: More weight is assigned to words such as "et", "feature", "event" and "model", which does not make the topic clear.
    - __Topic 2__: More weight is assigned to words such as "et", "feature", "event" and "model", which is almost the same as the first topic. Hence neither topic is human-interpretable.

Therefore the topic coherence of goodLdaModel should, in theory, be higher than that of badLdaModel, since its topics are more human-interpretable. Note that `u_mass` scores are typically negative, with values closer to zero indicating higher coherence; in the run recorded below, the good model actually scores lower (about -2.30) than the bad one (about -1.42), so `u_mass` does not confirm this expectation here.

### Visualize topic models

In [10]:
pyLDAvis.enable_notebook()

In [11]:
pyLDAvis.gensim.prepare(goodLdaModel, corpus, corpus.dictionary)

0
0


In [12]:
pyLDAvis.gensim.prepare(badLdaModel, corpus, corpus.dictionary)

0
0


In [13]:
print(goodcm.get_coherence())

0
-2.301717143335696


In [14]:
print(badcm.get_coherence())

0
-1.4188193109539793


## Predicting unseen documents using LDAModel 

Ultimately, a topic model estimates the probability that a given document belongs to each of the topics it has found.

A topic in topic modelling is an abstract concept and can be very different from what we usually imagine a topic to be.


### Types of documents:

1. Enron emails: files with a numeric prefix, for example `0112.1999-12-31.farmer.ham.txt`
2. ACL Anthology papers: files starting with `P17`

As we saw above, one topic represents emails and the other represents scientific papers.

When we predict the topic of an unknown document, we get a probability for each of the topics found:

Document topic = (probability of the document belonging to topic 1) × Topic 1 + (probability of the document belonging to topic 2) × Topic 2
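
One way to read the formula above is as a weighted blend of the topics' word distributions. The sketch below assumes a prediction in the `[(topic_id, probability), ...]` form that the model returns later in this notebook; the helper name `mix_topics` and the per-topic word probabilities are hypothetical values made up for the example.

```python
def mix_topics(prediction, topic_word_probs):
    """Blend per-topic word distributions using the document's topic probabilities."""
    mixed = {}
    for topic_id, weight in prediction:
        for word, prob in topic_word_probs[topic_id].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# Hypothetical word distributions for an "email" topic (0) and a "paper" topic (1)
toy_topic_words = {
    0: {"enron": 0.6, "rate": 0.4},
    1: {"model": 0.7, "learn": 0.3},
}

# A document that is mostly about topic 0
prediction = [(0, 0.95), (1, 0.05)]
doc_word_probs = mix_topics(prediction, toy_topic_words)
print({word: round(p, 3) for word, p in doc_word_probs.items()})
```

Because both the topic probabilities and each topic's word probabilities sum to one, the blended distribution also sums to one.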

### Expectation

 * The good LDA model, trained for 50 iterations, should predict the topic of each document correctly.
 * The bad LDA model, which is undertrained, should give random or near-equal probabilities for both topics.

We test both models on the unseen documents in `assets/topicmodellingtest`.



In [15]:
test_dir = 'assets/topicmodellingtest'
for fl in os.listdir(test_dir):
    # Read the unseen document, closing the file afterwards
    with open(os.path.join(test_dir, fl), errors='ignore') as input_file:
        dat = input_file.read()
    # Convert the preprocessed tokens to bag-of-words using the training dictionary
    doc_bow = corpus.dictionary.doc2bow(ie_preprocess(dat))
    prediction = goodLdaModel[doc_bow]
    print("File : ", fl, " Prediction ", str(prediction))


File :  .DS_Store  Prediction  [(0, 0.49999875), (1, 0.5000012)]
File :  0109.2000-01-07.kaminski.ham.txt  Prediction  [(0, 0.95304316), (1, 0.04695687)]
File :  0112.1999-12-31.farmer.ham.txt  Prediction  [(0, 0.949702), (1, 0.05029799)]
File :  0112.2001-03-06.kitchen.ham.txt  Prediction  [(0, 0.9594932), (1, 0.040506788)]
File :  0113.2001-03-06.kitchen.ham.txt  Prediction  [(0, 0.9951991), (1, 0.0048009525)]
File :  0116.2001-03-07.kitchen.ham.txt  Prediction  [(0, 0.5343351), (1, 0.46566492)]
File :  0119.2000-01-04.farmer.ham.txt  Prediction  [(0, 0.98242134), (1, 0.017578613)]
File :  P17-1176.pdf.json  Prediction  [(0, 0.004903869), (1, 0.9950961)]
File :  P17-1177.pdf.json  Prediction  [(0, 0.057066027), (1, 0.942934)]
File :  P17-1178.pdf.json  Prediction  [(0, 0.16296355), (1, 0.83703643)]
File :  P17-1179.pdf.json  Prediction  [(0, 0.0009285536), (1, 0.9990715)]


In [16]:
test_dir = 'assets/topicmodellingtest'
for fl in os.listdir(test_dir):
    # Read the unseen document, closing the file afterwards
    with open(os.path.join(test_dir, fl), errors='ignore') as input_file:
        dat = input_file.read()
    # Convert the preprocessed tokens to bag-of-words using the training dictionary
    doc_bow = corpus.dictionary.doc2bow(ie_preprocess(dat))
    prediction = badLdaModel[doc_bow]
    print("File : ", fl, " Prediction ", str(prediction))


File :  .DS_Store  Prediction  [(0, 0.50000024), (1, 0.4999998)]
File :  0109.2000-01-07.kaminski.ham.txt  Prediction  [(0, 0.43157545), (1, 0.5684245)]
File :  0112.1999-12-31.farmer.ham.txt  Prediction  [(0, 0.49547005), (1, 0.50452995)]
File :  0112.2001-03-06.kitchen.ham.txt  Prediction  [(0, 0.42775166), (1, 0.5722484)]
File :  0113.2001-03-06.kitchen.ham.txt  Prediction  [(0, 0.4546653), (1, 0.54533476)]
File :  0116.2001-03-07.kitchen.ham.txt  Prediction  [(0, 0.4574061), (1, 0.5425939)]
File :  0119.2000-01-04.farmer.ham.txt  Prediction  [(0, 0.4996802), (1, 0.5003198)]
File :  P17-1176.pdf.json  Prediction  [(0, 0.462134), (1, 0.537866)]
File :  P17-1177.pdf.json  Prediction  [(0, 0.5255268), (1, 0.4744732)]
File :  P17-1178.pdf.json  Prediction  [(0, 0.44755605), (1, 0.552444)]
File :  P17-1179.pdf.json  Prediction  [(0, 0.48721746), (1, 0.51278245)]
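
The two sets of predictions above can be compared with a small helper that reports a topic only when the model clearly prefers it. This is a sketch: the function name `dominant_topic` and the `margin` threshold of 0.2 are assumptions, and the two sample predictions are copied from the outputs above for `0109.2000-01-07.kaminski.ham.txt`.

```python
def dominant_topic(prediction, margin=0.2):
    """Return the most probable topic id, or None if its lead is smaller than `margin`."""
    ranked = sorted(prediction, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # The model does not clearly prefer one topic
    return ranked[0][0]


# Predictions for the same email from the good and the bad model
good_prediction = [(0, 0.95304316), (1, 0.04695687)]
bad_prediction = [(0, 0.43157545), (1, 0.5684245)]

print(dominant_topic(good_prediction))  # 0
print(dominant_topic(bad_prediction))   # None
```

With this threshold, the good model assigns a clear topic to most test files, while every prediction from the bad model stays too close to 50/50 to call.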
