# Topic Model Analysis

This notebook extracts topics from text data. It supports the following analyses:
1. Latent Dirichlet Allocation (LDA)
2. Supervised Latent Dirichlet Allocation (sLDA)

LDA can be estimated with either Gensim or Tomotopy; supervised LDA is estimated with Tomotopy. For sLDA, the dependent variable can be linear (continuous) or binary.

The code also lets users evaluate models with measures such as coherence and perplexity. Visualisations can be used to inspect the results of the topic models as well. These include:
1. pyLDAvis to understand topics and the inter-topic distance
2. Word clouds for topics

## Importing the Libraries

In [None]:
### ************************** Importing Packages ************************ ###
from __future__ import division
import re                     # regular expressions
import numpy as np            # scientific computing
import pandas as pd           # datastructures and computing
import pprint                 # pretty-printing
import os
import os.path

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

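# Stemming and tokenization
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Plotting tools
#import graphlab as gl
import pyLDAvis               # interactive topic model visualisation
#import pyLDAvis.graphlab
try:
    import pyLDAvis.gensim_models   # module name in pyLDAvis 3.x and later
except ImportError:
    import pyLDAvis.gensim          # module name in older pyLDAvis releases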
import matplotlib.pyplot as plt

# %matplotlib inline          # to ensure that the matplotlib plots are printed in the Jupyter notebooks

# Libraries for Topic Models
import sys
import tomotopy as tp

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Creating the list of Stop Words

In [None]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['app', 'apps', 'also', 'android', 'atm', 'atms', 'call', 'calls', 'calling', 'browsing', 'browse', 
                   'contact','clock', 'communication', 'dk', 'edu', 'e-mails', 'email', 'emails', 'etc', 'etfc', 
                   'entertainment', 'fb', 'files', 'from', 'food', 'images', 'info', 'internet', 'jpg', 'online', 
                   'mail', 'make', 'much', 'mean', 'music', 'messaging', 'messenger', 'mobilepay', 'nd', 'networks', 
                   'news', 'parents', 'friends', 'pdf', 'player', 'photos', 'photographs', 'photography', 'receiving', 
                   're', 'related', 'reviews', 'rooms', 'social', 'subject', 'sms', 'storage', 'skype', 'twitter', 
                   'text', 'texting', 'use', 'using', 'www', 'wifi'])
#stop_words

## Importing the Dataset

Dataset preparation matters here. For plain LDA, only the text data is required.

Supervised LDA additionally requires the response (dependent) variable for each document.
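Because sLDA needs one response per document, it can help to drop rows where either the text or the response is missing before modelling. The sketch below shows one way to do that with plain Python lists; `paired_documents` and the sample data are hypothetical and not part of this notebook.

```python
import math

def paired_documents(texts, responses):
    """Keep only (text, response) pairs where both values are usable."""
    pairs = []
    for text, resp in zip(texts, responses):
        # Skip missing or empty text
        if not isinstance(text, str) or not text.strip():
            continue
        # Skip missing responses (None or NaN)
        if resp is None or (isinstance(resp, float) and math.isnan(resp)):
            continue
        pairs.append((text, float(resp)))
    return pairs

# Made-up sample data for illustration
texts = ["I use my phone for banking", "", "Mostly for maps", None]
ratings = [4, 5, float("nan"), 3]
print(paired_documents(texts, ratings))
```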

In [None]:
### ************************** Importing Datasets ************************ ###
directo = "C:\\Users\\vibabu\\Dropbox\\Doctoral_Research\\STSM\\Analysis\\Combined_Dataset\\SEM_October_19\\India_OE\\Results_Feb_10\\Open_Ended_Q1"

df = pd.read_csv(directo + "\\Open_Ended_Q1_90_pct.csv", encoding='latin-1')
print(df.Response.unique())

In [None]:
df.head()

In [None]:
# convert the content field in dataset into a list
data = df.Response.values.tolist()
resp = df.Rating.values.tolist()
data[:5]

In [None]:
resp[:5]

## Cleaning the Dataset

This section applies the following cleaning steps:

1. Removing e-mail addresses and newline characters
2. Removing stop words
3. Forming bigrams and trigrams
4. Stemming


### 1. Removing emails and new line characters

In [None]:
### ************************** Datasets Cleaning ************************* ###

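# Remove e-mail addresses (raw string avoids invalid-escape warnings; str() guards against NaN entries)
data = [re.sub(r'\S*@\S*\s?', '', str(sent)) for sent in data]

# Replace newline characters with spaces
data = [re.sub(r'\n', ' ', sent) for sent in data]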

pprint.pprint(data[:5])
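To see what the two substitutions do, the sketch below applies the same patterns to a made-up sentence. The helper name `strip_emails_and_newlines` and the sample text are assumptions for illustration only.

```python
import re

# Same pattern as in the cleaning cell: any run of non-space characters around "@",
# plus one optional trailing whitespace character
EMAIL_RE = re.compile(r'\S*@\S*\s?')

def strip_emails_and_newlines(text):
    """Remove e-mail-like tokens, then replace newlines with spaces."""
    text = EMAIL_RE.sub('', text)
    return text.replace('\n', ' ')

sample = "Contact me at jane.doe@example.com for details\nThanks"
print(strip_emails_and_newlines(sample))
```

Note that the pattern removes any token containing `@`, not only well-formed e-mail addresses.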

### 2. Removing StopWords

In [None]:
# Function to remove StopWords

def remove_stopwords(texts):
    """
    objective:
        remove stop words from each paragraph/sentence
        tokenises with gensim's simple_preprocess before filtering
    input:
        list of paragraphs/sentences
    output:
        list of token lists with the stop words removed
    """
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

In [None]:
data_words_nostops = remove_stopwords(data)
data_words_nostops[:5]
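The filtering step can be illustrated without Gensim or NLTK. The sketch below uses a rough stand-in tokenizer (lowercase alphabetic tokens of 2 to 15 characters, which approximates the defaults of `simple_preprocess`) and a toy stop-word set; `tokenize`, `remove_stopwords_toy` and `TOY_STOP_WORDS` are hypothetical names.

```python
import re

# Toy stop-word set for illustration; the notebook uses NLTK's list plus custom terms
TOY_STOP_WORDS = {"i", "my", "the", "for", "and", "use"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 2 to 15 characters."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 15]

def remove_stopwords_toy(texts, stop_words=TOY_STOP_WORDS):
    """Tokenize each document and drop tokens found in stop_words."""
    return [[w for w in tokenize(doc) if w not in stop_words] for doc in texts]

docs = ["I use my phone for banking", "Maps and the weather"]
print(remove_stopwords_toy(docs))
```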

### 3. Forming Bigrams and Trigrams

In [None]:
def make_bigrams(texts):
    """
    objective:
        takes the processed text- after preprocessing and stop word removal
    input:
        preprocessed text
    output:
        text with bigrams
    """
    return [bigram_mod[text] for text in texts]

def make_trigrams(texts):
    """
    objective:
        generate trigrams for the text
    input:
        text with bigrams
    output:
        text with trigrams    
    """
    return [trigram_mod[bigram_mod[text]] for text in texts]

In [None]:
# Calibration Dataset

# Train the bigram and trigram phrase models on the tokenised, stop-word-free documents
# (Phrases expects lists of tokens, not raw strings)
bigram = gensim.models.phrases.Phrases(data_words_nostops, min_count=5, threshold=100)
trigram = gensim.models.phrases.Phrases(bigram[data_words_nostops], threshold=100)

# Wrap the trained models in Phraser objects for faster phrase detection
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

data_words_bigrams = make_bigrams(data_words_nostops)
data_words_bigrams[:5]
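The `min_count` and `threshold` arguments control which word pairs are merged into a single token. The sketch below reimplements, in plain Python, the collocation score of Mikolov et al. that Gensim's default scorer is understood to follow: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. It is a simplified illustration, not Gensim's code: here `vocab_size` counts unigrams only, and `phrase_scores` and the sample documents are hypothetical.

```python
from collections import Counter

def phrase_scores(docs, min_count=1):
    """Score adjacent word pairs; higher scores suggest a stronger collocation."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: unigrams only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pairs rarer than min_count are never merged
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# Made-up tokenised documents for illustration
docs = [["mobile", "banking", "is", "fast"],
        ["mobile", "banking", "works"],
        ["fast", "mobile", "banking"]]
scores = phrase_scores(docs, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

A pair is merged only when its score exceeds `threshold`, so a high value such as 100 keeps only strongly associated pairs.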

### 4. Stemming

In [None]:
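ps = PorterStemmer()

# Stem every token with the Porter stemmer; the result keeps the name
# data_lemmatized because later cells refer to it
data_lemmatized = []
for doc in data_words_bigrams:
    data_lemmatized.append([ps.stem(word) for word in doc])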
    
data_lemmatized[:5]

## Writing the Cleaned Data to a File

In [None]:
df['Cleaned_Data'] = data_lemmatized
df.head()
df.to_csv(directo + "\\Output_Q1_Words_Calibration.csv")

## LDA Model

In [None]:
# Defining the LDA Function

def lda_model(input_list, save_path):
    """
    desc:
        estimates the LDA model, saves it, and prints the estimated topics
    input:
        list of tokenised documents (one list of words per response)
        path where the trained model is saved
    output:
        prints the top words of each topic and their probabilities
        returns the trained model
    """
    mdl = tp.LDAModel(tw=tp.TermWeight.ONE,             # Term weighting
                      min_cf=3,                         # Minimum frequency of words
                      rm_top=0,                         # Number of top frequency words to be removed
                      k=3)                              # Number of topics
    for line in input_list:
        ch = " ".join(line)
        docu = ch.strip().split()
        mdl.add_doc(docu)
    mdl.burn_in = 100
    mdl.train(0)                                        # initialise the model without sampling
    print('Num docs: ', len(mdl.docs), 'Vocab size: ', mdl.num_vocabs, 'Num words: ', mdl.num_words)
    print('Removed words: ', mdl.removed_top_words)
    print('Training...', file=sys.stderr, flush=True)
    for i in range(0, 1000, 10):
        mdl.train(10)                                   # 10 iterations per step, 1000 in total
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

    print('Saving...', file=sys.stderr, flush=True)
    mdl.save(save_path, True)

    for k in range(mdl.k):
        print('Topic #{}'.format(k))
        for word, prob in mdl.get_topic_words(k):
            print('\t', word, prob, sep='\t')
    return mdl

### Estimating the Topic Model

In [None]:
print('Running LDA')
# Store the result under a different name so the lda_model function is not overwritten
lda_fit = lda_model(data_lemmatized, 'test.lda.bin')

## Supervised LDA

In [None]:
def slda_model(documents, dep_var, save_path):
    """
    desc:
        estimates the sLDA model, saves it, and prints the estimated topics
    input:
        list of tokenised documents (one list of words per response)
        dependent variable (one value per document)
        path where the trained model is saved
    output:
        prints the top words of each topic and their probabilities
        returns the trained model
    """
    smdl = tp.SLDAModel(tw=tp.TermWeight.ONE,             # Term weighting
                        min_cf=3,                         # Minimum frequency of words
                        rm_top=0,                         # Number of top frequency words to be removed
                        k=3,                              # Number of topics
                        vars=['l'])                       # One response variable: 'l' = linear, 'b' = binary
    for row, pred in zip(documents, dep_var):
        ch = " ".join(row)
        docu = ch.strip().split()
        smdl.add_doc(words=docu, y=[pred])                # y holds one value per response variable

    smdl.burn_in = 100
    smdl.train(0)                                         # initialise the model without sampling

    # Printing the output statistics
    print('Num docs: ', len(smdl.docs), 'Vocab size: ', smdl.num_vocabs, 'Num words: ', smdl.num_words)
    print('Removed top words: ', smdl.removed_top_words)
    print('Training...', file=sys.stderr, flush=True)
    for i in range(0, 1000, 10):
        smdl.train(10)                                    # 10 iterations per step, 1000 in total
        print('Iteration: {}\tLog-likelihood: {}'.format(i, smdl.ll_per_word))

    print('Saving...', file=sys.stderr, flush=True)
    smdl.save(save_path, True)

    for k in range(smdl.k):
        print('Topic #{}'.format(k))
        for word, prob in smdl.get_topic_words(k):
            print('\t', word, prob, sep='\t')
    return smdl

In [None]:
print('Running Supervised LDA')
# Store the result under a different name so the slda_model function is not overwritten
slda_fit = slda_model(data_lemmatized, resp, 'test.slda.bin')
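The `vars` argument tells Tomotopy whether each response variable is linear (`'l'`) or binary (`'b'`). The sketch below is a simple heuristic for picking the code from the observed responses; `response_var_code` is a hypothetical helper, and the decision should still be checked against how the variable was measured.

```python
def response_var_code(values):
    """Return 'b' when every response is 0 or 1, otherwise 'l'."""
    if values and all(v in (0, 1) for v in values):
        return 'b'
    return 'l'

# Made-up responses for illustration
print(response_var_code([1, 0, 1, 1]))   # binary outcome
print(response_var_code([4, 5, 2, 3]))   # rating treated as linear
```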

## Visualising the Results of LDA

pyLDAvis has no module for plotting topic models estimated with Tomotopy directly. It can, however, plot any topic model once the following quantities have been computed:

1. phi

    a. probabilities of each word(W) for a given topic(K) under consideration
    
    b. is a K x W matrix
    
    
2. theta

    a. probability mass function over "K" topics for all the documents in the corpus (D)
    
    b. is a D x K matrix
    
    
3. n(d)

    a. number of tokens for each document


4. vocab

    a. vector of terms in the vocabulary
    
    b. presented in the same order as in "phi"
    
    
5. M(w)

    a. frequency of term "w" across the entire corpus
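Before passing these five quantities to pyLDAvis, it can be useful to check that their dimensions agree and that each row of phi and theta sums to one. The sketch below does this with plain lists; `check_vis_inputs` and the toy values are assumptions for illustration, not part of pyLDAvis.

```python
import math

def check_vis_inputs(phi, theta, doc_lengths, vocab, term_frequency, tol=1e-6):
    """Return a list of problems found in the inputs (empty when consistent)."""
    problems = []
    n_topics, n_terms, n_docs = len(phi), len(vocab), len(theta)
    if any(len(row) != n_terms for row in phi):
        problems.append("phi must be K x W with W = len(vocab)")
    if any(len(row) != n_topics for row in theta):
        problems.append("theta must be D x K with K = number of rows in phi")
    if len(doc_lengths) != n_docs:
        problems.append("doc_lengths needs one entry per document")
    if len(term_frequency) != n_terms:
        problems.append("term_frequency needs one entry per vocabulary term")
    # Each row of phi and theta is a probability distribution
    for name, matrix in (("phi", phi), ("theta", theta)):
        for i, row in enumerate(matrix):
            if not math.isclose(sum(row), 1.0, abs_tol=tol):
                problems.append("{} row {} does not sum to 1".format(name, i))
    return problems

# Toy example with K = 2 topics, W = 3 terms, D = 2 documents
phi = [[0.5, 0.25, 0.25], [0.2, 0.3, 0.5]]
theta = [[0.75, 0.25], [0.4, 0.6]]
print(check_vis_inputs(phi, theta, [4, 6], ["bank", "map", "pay"], [5, 3, 2]))
```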

### Computing the value of "Phi" for the Model

In [None]:
def compute_phi(model):
    """
    desc:
        computes phi for visualising the results of the topic model:
        the probability of each word under each topic
    input:
        the topic model
    output:
        vocabulary (list of W words, sorted alphabetically)
        K x W matrix of probabilities, columns in the same order as the vocabulary
        K = number of topics
        W = number of words
    """
    # Word -> probability lookup for every topic, covering the full vocabulary
    topic_probs = [dict(model.get_topic_words(k, model.num_vocabs))
                   for k in range(model.k)]

    # Vocabulary in alphabetical order; every row of phi follows this order
    vocab = sorted(topic_probs[0])
    phi = [[probs.get(word, 0.0) for word in vocab] for probs in topic_probs]

    # Choose the output file from the model type rather than from which files
    # already exist, so re-running the notebook cannot write LDA results to the sLDA file
    suffix = 'slda' if isinstance(model, tp.SLDAModel) else 'lda'
    with open(directo + '\\topic_word_prob_' + suffix + '.csv', 'w') as f:
        for row in [vocab] + phi:
            for item in row:
                f.write("%s, " % item)
            f.write("\n")

    return vocab, phi

### Computing the value of "Theta" for the Model

#### For LDA Model

In [None]:
def compute_theta_lda(model, data):
    """
    desc:
        computes theta for visualising the results of the topic model:
        the probability mass function over the K topics for every document (D) in the corpus
    input:
        the topic model
        dataset
    output:
        D x K matrix
        D = number of documents
        K = number of topics
    """
    mat_theta = []
    for line in data:
        ch = " ".join(line)
        docu = ch.strip().split()
        theta_val = model.infer(doc=model.make_doc(docu),
                                iter=100,
                                workers=0,
                                together=False)
        mat_theta.append(theta_val[0])      # first element is the topic distribution

    # The with-block closes the file automatically
    with open(directo + '\\topic_probabilities_lda.csv', 'w') as f:
        for item in mat_theta:
            for items in item:
                f.write("%s, " % items)
            f.write("\n")

    return mat_theta

#### For sLDA Model

In [None]:
def compute_theta_slda(model, data, dep_var):
    """
    desc:
        computes theta for visualising the results of the topic model:
        the probability mass function over the K topics for every document (D) in the corpus
    input:
        the topic model
        dataset
        dependent variable
    output:
        D x K matrix
        D = number of documents
        K = number of topics
    """
    mat_theta = []
    for line, dep in zip(data, dep_var):
        ch = " ".join(line)
        docu = ch.strip().split()
        theta_val = model.infer(doc=model.make_doc(words=docu, y=[dep]),
                                iter=100,
                                workers=0,
                                together=False)
        mat_theta.append(theta_val[0])      # first element is the topic distribution

    # The with-block closes the file automatically
    with open(directo + '\\topic_probabilities_slda.csv', 'w') as f:
        for item in mat_theta:
            for items in item:
                f.write("%s, " % items)
            f.write("\n")

    return mat_theta

### Number of Tokens per document

In [None]:
def num_token(data):
    """
    desc:
        computes the number of tokens in each document of the corpus
    input:
        dataset
    output:
        D x 1 vector of token counts
        D = number of documents
    """
    return [len(text) for text in data]

### Frequency of Words in the Corpus

In [None]:
from collections import Counter

def freq_words(vocabs, data):
    """
    desc:
        computes the frequency of each vocabulary word across the entire corpus
    input:
        list of words
        dataset
    output:
        W x 1 vector of frequencies, in the same order as the list of words
        W = number of words in the vocabulary
    """
    # Count every token once, then look up each vocabulary word
    counts = Counter(word for line in data for word in line)
    return [counts[word] for word in vocabs]

## Visualising the Results of LDA Model

### Computing the Parameters for Visualising LDA Model

In [None]:
# Loading the LDA model (stored as lda_fit so the lda_model function stays available)
lda_fit = tp.LDAModel.load('test.lda.bin')
#lda_fit.get_topic_word_dist(2)
lvocab, lphi_val = compute_phi(lda_fit)
ltheta_val = compute_theta_lda(lda_fit, data_lemmatized)
lnum_token = num_token(data_lemmatized)
lfreq_terms = freq_words(lvocab, data_lemmatized)

### Plotting in pyLDAvis (LDA)

In [None]:
# Visualising the Results
pyLDAvis.enable_notebook()
data_lda = {'topic_term_dists': lphi_val,
            'doc_topic_dists' : ltheta_val,
            'doc_lengths'     : lnum_token,
            'vocab'           : lvocab,
            'term_frequency'  : lfreq_terms}
print('Topic-Term shape: %s' % str(np.array(data_lda['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(data_lda['doc_topic_dists']).shape))

In [None]:
vis_lda = pyLDAvis.prepare(**data_lda)
pyLDAvis.display(vis_lda)

## Visualising the Results of Supervised LDA Model

### Computing the Parameters for Visualising Supervised LDA Model

In [None]:
# Loading the sLDA model (stored as slda_fit so the slda_model function stays available)
slda_fit = tp.SLDAModel.load('test.slda.bin')
#slda_fit.get_topic_word_dist(2)
svocab, sphi_val = compute_phi(slda_fit)
stheta_val = compute_theta_slda(slda_fit, data_lemmatized, resp)
snum_token = num_token(data_lemmatized)
sfreq_terms = freq_words(svocab, data_lemmatized)

### Plotting in pyLDAvis (sLDA)

In [None]:
# Visualising the Results
pyLDAvis.enable_notebook()
data_slda = {'topic_term_dists': sphi_val,
             'doc_topic_dists' : stheta_val,
             'doc_lengths'     : snum_token,
             'vocab'           : svocab,
             'term_frequency'  : sfreq_terms}
print('Topic-Term shape: %s' % str(np.array(data_slda['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(data_slda['doc_topic_dists']).shape))

In [None]:
vis_slda = pyLDAvis.prepare(**data_slda)
pyLDAvis.display(vis_slda)

## Computing Scores for use in Estimation

This part of the code computes one score per topic for each document in the corpus. A document's score for a topic is the sum, over the words in the document, of each word's probability under that topic.

In [None]:
def compute_scores(list_dataset, list_word_prob):
    """
    desc:
        takes the cleaned dataset and the word probabilities per topic and computes the scores
    input:
        cleaned dataset as a list
        word probabilities as a dataframe (a 'Word' column plus one column per topic)
    output:
        for each document, the score per topic followed by the total across topics
    """
    # Topic columns are taken from the dataframe, so the number of topics is not hard-coded
    prob_cols = [col for col in list_word_prob.columns if col != 'Word']

    # Map each word to its per-topic probabilities for fast lookup
    word_probs = {row['Word']: [row[col] for col in prob_cols]
                  for _, row in list_word_prob.iterrows()}

    prob_list = []
    for document in list_dataset:
        totals = [0] * len(prob_cols)
        for word in document:
            # Words missing from the vocabulary contribute nothing
            for k, prob in enumerate(word_probs.get(word, [])):
                totals[k] += prob
        prob_list.append(totals + [sum(totals)])

    return prob_list
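
The scoring rule can be traced by hand on a toy example. The sketch below uses a plain dictionary in place of the dataframe; `score_document` and the probabilities are hypothetical values chosen for illustration, and the dictionary is assumed to be non-empty.

```python
def score_document(tokens, word_probs):
    """Sum per-topic word probabilities over the tokens and append the total."""
    n_topics = len(next(iter(word_probs.values())))
    totals = [0.0] * n_topics
    for token in tokens:
        # Tokens outside the vocabulary contribute zero to every topic
        for k, prob in enumerate(word_probs.get(token, [0.0] * n_topics)):
            totals[k] += prob
    return totals + [sum(totals)]

# Made-up probabilities for two topics
word_probs = {"bank": [0.5, 0.125], "map": [0.25, 0.5]}
print(score_document(["bank", "bank", "map", "unknown"], word_probs))
```

Repeated words are counted each time they occur, so longer documents tend to receive higher scores.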

### Computing the Scores for LDA

In [None]:
lda_dist = pd.read_csv(directo + "\\topic_word_prob_lda.csv", header=None)
lda_distT = lda_dist.T
lda_distT.columns = ['Word', 'Prob_1', 'Prob_2', 'Prob_3']
lda_distT['Word']   = lda_distT['Word'].str.strip()
lda_distT['Prob_1'] = pd.to_numeric(lda_distT.Prob_1, errors='coerce')
lda_distT['Prob_2'] = pd.to_numeric(lda_distT.Prob_2, errors='coerce')
lda_distT['Prob_3'] = pd.to_numeric(lda_distT.Prob_3, errors='coerce')
probab_lda = compute_scores(data_lemmatized, lda_distT)
df['probab_lda'] = probab_lda

### Computing the Scores for sLDA

In [None]:
slda_dist = pd.read_csv(directo + "\\topic_word_prob_slda.csv", header=None)
slda_distT = slda_dist.T
slda_distT.columns = ['Word', 'Prob_1', 'Prob_2', 'Prob_3']
slda_distT['Word']   = slda_distT['Word'].str.strip()
slda_distT['Prob_1'] = pd.to_numeric(slda_distT.Prob_1, errors='coerce')
slda_distT['Prob_2'] = pd.to_numeric(slda_distT.Prob_2, errors='coerce')
slda_distT['Prob_3'] = pd.to_numeric(slda_distT.Prob_3, errors='coerce')
probab_slda = compute_scores(data_lemmatized, slda_distT)
df['probab_slda'] = probab_slda

df.to_csv(directo + "\\Open_Ended_Q1_Scores_Calibration.csv")