# Topic Model Analysis

The code is written to extract topics from text data. The code supports performing the following analysis:-
1. Latent Dirichlet Allocation (LDA)
2. Supervised Latent Dirichlet Allocation (sLDA)

Latent Dirichlet allocation can be performed using Gensim or Tomotopy. Supervised LDA can be performed using Tomotopy. The dependent variable can be linear or binary in nature.

In addition to this, the code also allows users to evaluate models using measures such as Coherence and Perplexity. Various visualisations can also be used to evaluate the results from the topic models. These include:-
1. pyLDAvis to understand topics and the inter-topic distance
2. Word clouds for topics

## Importing the Libraries

In [1]:
### ************************** Importing Packages ************************ ###
from __future__ import division
import re                     # regular expressions
import numpy as np            # scientific computing
import pandas as pd           # datastructures and computing
import pprint as pprint       # better printing
import os
import os.path

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Lemmatization
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Plotting tools
#import graphlab as gl
import pyLDAvis               # interactive topic model visualisation
#import pyLDAvis.graphlab
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# %matplotlib inline          # to ensure that the matplotlib plots are printed in the Jupyter notebooks

# Libraries for Topic Models
import sys
import tomotopy as tp

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Creating the list of Stop Words

In [2]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['app', 'apps', 'also', 'android', 'atm', 'atms', 'call', 'calls', 'calling', 'browsing', 'browse', 
                   'contact','clock', 'communication', 'dk', 'edu', 'e-mails', 'email', 'emails', 'etc', 'etfc', 
                   'entertainment', 'fb', 'files', 'from', 'food', 'images', 'info', 'internet', 'jpg', 'online', 
                   'mail', 'make', 'much', 'mean', 'music', 'messaging', 'messenger', 'mobilepay', 'nd', 'networks', 
                   'news', 'parents', 'friends', 'pdf', 'player', 'photos', 'photographs', 'photography', 'receiving', 
                   're', 'related', 'reviews', 'rooms', 'social', 'subject', 'sms', 'storage', 'skype', 'twitter', 
                   'text', 'texting', 'use', 'using', 'www', 'wifi'])
#stop_words

## Importing the Dataset

The dataset preparation is quite important here. If the model used is simple LDA, only the text data is mandatory.

However, for the supervised LDA, it requires the response variable (dependent variable) along with the text data.

In [3]:
### ************************** Importing Datasets ************************ ###
directo = "C:\\Users\\vibabu\\Dropbox\\Doctoral_Research\\STSM\\Analysis\\Combined_Dataset\\SEM_October_19\\India_OE\\Results_Feb_10\\Open_Ended_Q1"

df = pd.read_csv(directo + "\\Open_Ended_Q1_90_pct.csv", encoding='latin-1')
print(df.Response.unique())

['Google Maps,Redbus,IRCTC,Make My Trip,Goibibo'
 'Booking tickets online, Tracking where am I (if i am traveling in a place which not so familiar) in google maps, check the time to reach my destination'
 'Google maps, bus timing, train booking' 'Maps, ticket booking'
 'Locating destination  Taxi booking' 'Maps,hotel booking'
 'Book travel (Ola, Uber, Train, Flights and Bus)  Book food while on travel  Book hotels'
 'shortest route to destination and time taken  booking of travel tickets'
 'knowing the status of train    '
 'Navigation, finding eateries, hotels,   Banking while travels  News  Entertainment  '
 '1 . GPS  2 . Online reservations ' 'Booking cabs'
 'Find routes. Book cab. Find train times'
 'GPS  GOOGLE MAP & NAVIGATOR   TO FIND TOURIST SPOTS'
 'For navigation and photography ' 'Maps, Booking tickets'
 'Reviews about the place,route map.' 'Navigation, bookings'
 'Taxi booking, train timings, airline check in '
 'Ticket Booking   Online check in'
 'For checking schedules, a

In [4]:
df.head()

Unnamed: 0,ID,Rating,Response
0,6664092292,1,"Google Maps,Redbus,IRCTC,Make My Trip,Goibibo"
1,6624275913,4,"Booking tickets online, Tracking where am I (i..."
2,6623765999,3,"Google maps, bus timing, train booking"
3,6623636399,2,"Maps, ticket booking"
4,6613595569,2,Locating destination Taxi booking


In [5]:
# convert the content field in dataset into a list
data = df.Response.values.tolist()
resp = df.Rating.values.tolist()
data[:5]

['Google Maps,Redbus,IRCTC,Make My Trip,Goibibo',
 'Booking tickets online, Tracking where am I (if i am traveling in a place which not so familiar) in google maps, check the time to reach my destination',
 'Google maps, bus timing, train booking',
 'Maps, ticket booking',
 'Locating destination  Taxi booking']

In [6]:
resp[:5]

[1, 4, 3, 2, 2]

## Cleaning the Dataset

In this section of the code, the following data cleaning techniques are used:-

1. E-mail id  and New line characters
2. Remove "StopWords" from the dataset
3. Forming bigrams and trigrams
4. Stemming


### 1. Removing emails and new line characters

In [7]:
### ************************** Datasets Cleaning ************************* ###

# Remove Emails
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub('\n', ' ', sent) for sent in data]

pprint.pprint(data[:5])

['Google Maps,Redbus,IRCTC,Make My Trip,Goibibo',
 'Booking tickets online, Tracking where am I (if i am traveling in a place '
 'which not so familiar) in google maps, check the time to reach my '
 'destination',
 'Google maps, bus timing, train booking',
 'Maps, ticket booking',
 'Locating destination  Taxi booking']


### 2. Removing StopWords

In [8]:
# Function to remove StopWords

def remove_stopwords(texts):
        """
        objective:
            function to remove stopwords from the paragraph/sentence
            uses the preprocess
        input:
            paragraph/sentences
        output:
            wordlist after the stopwords are removed
        """
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
             for doc in texts]

In [9]:
data_words_nostops = remove_stopwords(data)
data_words_nostops[:5]

[['google', 'maps', 'redbus', 'irctc', 'trip', 'goibibo'],
 ['booking',
  'tickets',
  'tracking',
  'traveling',
  'place',
  'familiar',
  'google',
  'maps',
  'check',
  'time',
  'reach',
  'destination'],
 ['google', 'maps', 'bus', 'timing', 'train', 'booking'],
 ['maps', 'ticket', 'booking'],
 ['locating', 'destination', 'taxi', 'booking']]

### 3. Forming Bigrams and Trigrams

In [10]:
def make_bigrams(texts):
    """
    objective:
        takes the processed text- after preprocessing and stop word removal
    input:
        preprocessed text
    output:
        text with bigrams
    """
    return [bigram_mod[text] for text in texts]

def make_trigrams(texts):
    """
    objective:
        generate trigrams for the text
    input:
        text with bigrams
    output:
        text with trigrams    
    """
    return [trigram_mod[bigram_mod[text]] for text in texts]

In [11]:
# Calibration Dataset

# Build functions to remove stopwords, bigram and trigram models- calibration dataset
bigram = gensim.models.phrases.Phrases(data, min_count=5, threshold=100)
trigram = gensim.models.phrases.Phrases(bigram[data], threshold=100)

# Passing the parameters to the bigram/trigram- calibration dataset
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

data_words_bigrams = make_bigrams(data_words_nostops)
data_words_bigrams[:5]

[['google', 'maps', 'redbus', 'irctc', 'trip', 'goibibo'],
 ['booking',
  'tickets',
  'tracking',
  'traveling',
  'place',
  'familiar',
  'google',
  'maps',
  'check',
  'time',
  'reach',
  'destination'],
 ['google', 'maps', 'bus', 'timing', 'train', 'booking'],
 ['maps', 'ticket', 'booking'],
 ['locating', 'destination', 'taxi', 'booking']]

### 4. Lemmatization

In [12]:
ps = PorterStemmer()

data_lemmatized = []
for texts in data_words_bigrams:
    data_lemmatized.append([ps.stem(doc) for doc in texts])
    
data_lemmatized[:5]

[['googl', 'map', 'redbu', 'irctc', 'trip', 'goibibo'],
 ['book',
  'ticket',
  'track',
  'travel',
  'place',
  'familiar',
  'googl',
  'map',
  'check',
  'time',
  'reach',
  'destin'],
 ['googl', 'map', 'bu', 'time', 'train', 'book'],
 ['map', 'ticket', 'book'],
 ['locat', 'destin', 'taxi', 'book']]

## Writing the Files to the dataset

In [13]:
df['Cleaned_Data'] = data_lemmatized
df.head()
df.to_csv(directo + "\\Output_Q1_Words_Calibration.csv")

## LDA Model

In [14]:
# Defining the LDA Function
    
def lda_model(input_list, save_path):
    """
    desc:
        the function estimates the LDA model and outputs the estimated topics
    input:
        list with documents as responses
    output:
        prints the topics
        words and their corresponding proportions
    """
    mdl = tp.LDAModel(tw=tp.TermWeight.ONE,             # Term weighting
                      min_cf=3,                         # Minimum frequency of words
                      rm_top=0,                         # Number of top frequency words to be removed
                      k=3)                              # Number of topics
    for n, line in enumerate(input_list):
        ch = " ".join(line)
        docu = ch.strip().split()
        mdl.add_doc(docu)
    mdl.burn_in = 100
    mdl.train(0)
    print('Num docs: ', len(mdl.docs), 'Vocab size: ', mdl.num_vocabs, 'Num words: ', mdl.num_words)
    print('Removed words: ', mdl.removed_top_words)
    print('Training...', file=sys.stderr, flush=True)
    for i in range(0, 1000, 10):
        mdl.train(1000)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))
        
    print('Saving...', file=sys.stderr, flush=True)
    mdl.save(save_path, True)
    
    for k in range(mdl.k):
        print('Topic #{}'.format(k))
        for word, prob in mdl.get_topic_words(k):
            print('\t', word, prob, sep='\t')
    return mdl

### Estimating the Topic Model

In [15]:
print('Running LDA')
lda_model = lda_model(data_lemmatized, 'test.lda.bin')

Training...


Running LDA
Num docs:  143 Vocab size:  42 Num words:  577
Removed words:  []
Iteration: 0	Log-likelihood: -3.6185059650208755
Iteration: 10	Log-likelihood: -3.6328995266293322
Iteration: 20	Log-likelihood: -3.6750923707383962
Iteration: 30	Log-likelihood: -3.643572142854514
Iteration: 40	Log-likelihood: -3.6101904176240156
Iteration: 50	Log-likelihood: -3.549582522084849
Iteration: 60	Log-likelihood: -3.5713083892027675
Iteration: 70	Log-likelihood: -3.661895139972507
Iteration: 80	Log-likelihood: -3.5634897314722243
Iteration: 90	Log-likelihood: -3.590429211813356
Iteration: 100	Log-likelihood: -3.684176232904661
Iteration: 110	Log-likelihood: -3.660799719832282
Iteration: 120	Log-likelihood: -3.6109065015249895
Iteration: 130	Log-likelihood: -3.7042781510672804
Iteration: 140	Log-likelihood: -3.654083575026828
Iteration: 150	Log-likelihood: -3.5632668006616406
Iteration: 160	Log-likelihood: -3.6729990164370014
Iteration: 170	Log-likelihood: -3.6446837721912244
Iteration: 180	Log-lik

Saving...


Iteration: 930	Log-likelihood: -3.659135707174678
Iteration: 940	Log-likelihood: -3.7194633921882723
Iteration: 950	Log-likelihood: -3.675747972737152
Iteration: 960	Log-likelihood: -3.6495084011317127
Iteration: 970	Log-likelihood: -3.6223882290064977
Iteration: 980	Log-likelihood: -3.6350645303364644
Iteration: 990	Log-likelihood: -3.640377993961761
Topic #0
		map	0.17195110023021698
		googl	0.14330054819583893
		find	0.12420017272233963
		rout	0.10032470524311066
		locat	0.08122433722019196
		place	0.06689905375242233
		reserv	0.05734886974096298
		navig	0.05257377773523331
		gp	0.04302358999848366
		search	0.03824849799275398
Topic #1
		travel	0.19430840015411377
		train	0.12675224244594574
		check	0.10986319929361343
		time	0.06764060258865356
		statu	0.06764060258865356
		know	0.05919608473777771
		transport	0.05075156316161156
		track	0.0423070453107357
		locat	0.03386252745985985
		destin	0.03386252745985985
Topic #2
		book	0.363429456949234
		ticket	0.2156776636838913
		map	0.

## Supervised LDA

In [16]:
def slda_model(documents, dep_var, save_path):
    """
    desc:
        the function estimates the sLDA model and outputs the estimated topics
    input:
        list with documents as responses
        dependent variable
    output:
        prints the topics
        words and their corresponding proportions
    """
    smdl = tp.SLDAModel(tw=tp.TermWeight.ONE,             # Term weighting
                        min_cf=3,                         # Minimum frequency of words
                        rm_top=0,                         # Number of top frequency words to be removed
                        k=3,                              # Number of topics
                        vars=['l'])                       # Number of dependent variables
    for row, pred in zip(documents, dep_var):
        pred_1 = []
        pred_1.append(pred)
        ch = " ".join(row)
        docu = ch.strip().split()
        smdl.add_doc(words=docu, y=pred_1)
        
    smdl.burn_in = 100
    smdl.train(0)
    
    # Printing the output statistics
    print('Num docs: ', len(smdl.docs), 'Vocab size: ', smdl.num_vocabs, 'Num words: ', smdl.num_words)
    print('Removed top words: ', smdl.removed_top_words)
    print('Training...', file=sys.stderr, flush=True)
    for i in range(0, 1000, 10):
        smdl.train(1000)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, smdl.ll_per_word))
        
    print('Saving...', file=sys.stderr, flush=True)
    smdl.save(save_path, True)
    
    for k in range(smdl.k):
        print('Topic #{}'.format(k))
        for word, prob in smdl.get_topic_words(k):
            print('\t', word, prob, sep='\t')
    return smdl

In [17]:
print('Running Supervised LDA')
slda_model = slda_model(data_lemmatized, resp, 'test.slda.bin')

Training...


Running Supervised LDA
Num docs:  143 Vocab size:  42 Num words:  577
Removed top words:  []
Iteration: 0	Log-likelihood: -3.764507998975052
Iteration: 10	Log-likelihood: -3.7388537478386272
Iteration: 20	Log-likelihood: -3.718330618570296
Iteration: 30	Log-likelihood: -3.7325866818365494
Iteration: 40	Log-likelihood: -3.8479303536626737
Iteration: 50	Log-likelihood: -3.71068759553015
Iteration: 60	Log-likelihood: -3.7813160386335176
Iteration: 70	Log-likelihood: -3.7319791300691976
Iteration: 80	Log-likelihood: -3.7666003134382477
Iteration: 90	Log-likelihood: -3.7131842947121148
Iteration: 100	Log-likelihood: -3.7982367141078672
Iteration: 110	Log-likelihood: -3.8403937091043727
Iteration: 120	Log-likelihood: -3.753743842678335
Iteration: 130	Log-likelihood: -3.8167568571103714
Iteration: 140	Log-likelihood: -3.772952949969611
Iteration: 150	Log-likelihood: -3.7133875892479846
Iteration: 160	Log-likelihood: -3.798783546865658
Iteration: 170	Log-likelihood: -3.7401982605847914
Iterati

Saving...


Iteration: 970	Log-likelihood: -3.753612796607326
Iteration: 980	Log-likelihood: -3.8139419800313785
Iteration: 990	Log-likelihood: -3.7782158874850107
Topic #0
		map	0.3136875033378601
		googl	0.2767939567565918
		find	0.14766648411750793
		place	0.12921971082687378
		restaur	0.0462091900408268
		direct	0.0462091900408268
		reach	0.027762405574321747
		destin	0.009315624833106995
		travel	9.223390225088224e-05
		rout	9.223390225088224e-05
Topic #1
		travel	0.12969225645065308
		locat	0.1184195727109909
		rout	0.09587419778108597
		train	0.0846015140414238
		check	0.07332882285118103
		find	0.056419797241687775
		time	0.045147109776735306
		statu	0.045147109776735306
		know	0.045147109776735306
		traffic	0.03951076790690422
Topic #2
		book	0.3112304210662842
		ticket	0.18470007181167603
		navig	0.08210792392492294
		map	0.07868818193674088
		hotel	0.06158949434757233
		taxi	0.04107106104493141
		reserv	0.04107106104493141
		cab	0.04107106104493141
		flight	0.030811846256256104
		gp	0.0

## Visualising the Results of LDA

pyLDAvis does not have a module that allows topic models estimated using Tomotopy to be used directly for plotting the graphs. It however allows plotting after the following parameters are computed for each of the topic models:-

1. phi

    a. probabilities of each word(W) for a given topic(K) under consideration
    
    b. is a K x W vector
    
    
2. theta

    a. probability mass function over "K" topics for all the documents in the corpus (D)
    
    b. is a D x K matrix
    
    
3. n(d)

    a. number of tokens for each document


4. vocab

    a. vector of terms in the vocabulary
    
    b. presented in the same order as in "phi"
    
    
5. M(w)

    a. frequency of term "w" across the entire corpus

### Computing the value of "Phi" for the Model

In [18]:
def compute_phi(model):
    """
    desc:
        this function computes the value of phi for visualising the results of topic model
        probabilities of each word for a given topic
    input:
        the topic model
    output:
        K x W vector
        K = number of topics
        W = number of words
    """
    mat_phi1 = []
    for i in range(model.k):
        #print(model.get_topic_words(i,model.num_vocabs))
        mat_phi1.append(model.get_topic_words(i,model.num_vocabs))
    
    list_words = []
    for text in mat_phi1[0]:
            list_words.append(text[0])
    
    #print(list_words)
    list_words.sort()
    #print(list_words)
    
    mat_phi2 = [[i * j for j in range(model.num_vocabs)] for i in range(model.k+1)]
    for i in range(model.num_vocabs):
        mat_phi2[0][i] = list_words[i]

    
    j1 = []
    k1 = []
    m = 0
    while m < model.k:
        j1.append(m)
        m += 1
        
    n = 1
    while n <= model.k:
        k1.append(n)
        n += 1
        
    for j, k in zip(j1, k1):
        for index, word in enumerate(mat_phi2[0]):
            #print(word)
            for item in mat_phi1[j]:
                #print(item)
                if word == item[0]:
                    mat_phi2[k][index] = item[1]
    
    if os.path.isfile(directo + '\\topic_word_prob_lda.csv'):
        with open(directo + '\\topic_word_prob_slda.csv', 'w') as f:
            for item in mat_phi2:
                for items in item:
                    f.writelines("%s, " % items)
                f.writelines("\n")
            f.close()
    else:
        with open(directo + '\\topic_word_prob_lda.csv', 'w') as f:
            for item in mat_phi2:
                for items in item:
                    f.writelines("%s, " % items)
                f.writelines("\n")
            f.close()
        
    return mat_phi2[0], mat_phi2[1:]

### Computing the value of "Theta" for the Model

#### For LDA Model

In [19]:
def compute_theta_lda(model, data):
    """
    desc:
        this function computes the value of theta for visualising the results of topic model
        probabilities mass function over "K" topics for all documents (D) in the corpus
    input:
        the topic model
        dataset
    output:
        D x K vector
        D = number of documents
        K = number of topics
    """
    mat_theta = []
    for n, line in enumerate(data):
        ch = " ".join(line)
        docu = ch.strip().split()
        theta_val = model.infer(doc=model.make_doc(docu),
                                     iter=100,
                                     workers=0,
                                     together=False)
        mat_theta.append(theta_val[0])
    
    with open(directo + '\\topic_probabilities_lda.csv', 'w') as f:
        for item in mat_theta:
            for items in item:
                f.writelines("%s, " %items)
            f.writelines("\n")
        f.close()
    
    return mat_theta

#### For sLDA Model

In [20]:
def compute_theta_slda(model, data, dep_var):
    """
    desc:
        this function computes the value of theta for visualising the results of topic model
        probabilities mass function over "K" topics for all documents (D) in the corpus
    input:
        the topic model
        dataset
        dependent variable
    output:
        D x K vector
        D = number of documents
        K = number of topics
    """
    mat_theta = []
    for line, dep in zip(data, dep_var):
        pred_1 = []
        pred_1.append(dep)
        ch = " ".join(line)
        docu = ch.strip().split()
        theta_val = model.infer(doc=model.make_doc(words=docu, y=pred_1),
                                     iter=100,
                                     workers=0,
                                     together=False)
        mat_theta.append(theta_val[0])
        
    with open(directo + '\\topic_probabilities_slda.csv', 'w') as f:
        for item in mat_theta:
            for items in item:
                f.writelines("%s, " %items)
            f.writelines("\n")
        f.close()
    
    return mat_theta

### Number of Tokens per document

In [21]:
def num_token(data):
    """
    desc:
        this function computes number of tokens per document for the entire corpus
    input:
        dataset
    output:
        N x 1 vector
        N = number of tokens in the document
    """
    numb_tok = []
    for text in data:
        numb_tok.append(len(text))
    return numb_tok

### Frequency of Words in the Corpus

In [22]:
def freq_words(vocabs, data):
    """
    desc:
        this function computes the frequency of words in the entire corpus
    input:
        list of words
        dataset
    output:
        N x 1 vector
        N = frequency of words in the document
    """
    fre_words = []
    for words in vocabs:
        words_freq = 0
        for line in data:
            for ind_words in line:
                if words == ind_words:
                    words_freq += 1
        fre_words.append(words_freq)
    return fre_words

## Visualising the Results of LDA Model

### Computing the Parameters for Visualising LDA Model

In [23]:
# Loading the LDA model
lda_model = tp.LDAModel.load('test.lda.bin')
#lda_model.get_topic_word_dist(2)
lvocab, lphi_val = compute_phi(lda_model)
ltheta_val = compute_theta_lda(lda_model, data_lemmatized)
lnum_token = num_token(data_lemmatized)
lfreq_terms = freq_words(lvocab, data_lemmatized)

### Plotting in pyLDAvis (LDA)

In [24]:
# Visualising the Results
pyLDAvis.enable_notebook()
data_lda = {'topic_term_dists': lphi_val,
            'doc_topic_dists' : ltheta_val,
            'doc_lengths'     : lnum_token,
            'vocab'           : lvocab,
            'term_frequency'  : lfreq_terms}
print('Topic-Term shape: %s' % str(np.array(data_lda['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(data_lda['doc_topic_dists']).shape))

Topic-Term shape: (3, 42)
Doc-Topic shape: (145, 3)


In [25]:
vis_lda = pyLDAvis.prepare(**data_lda)
pyLDAvis.display(vis_lda)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


## Visualising the Results of Supervised LDA Model

### Computing the Parameters for Visualising Supervised LDA Model

In [26]:
# Loading the sLDA model
slda_model = tp.SLDAModel.load('test.slda.bin')
#slda_model.get_topic_word_dist(2)
svocab, sphi_val = compute_phi(slda_model)
stheta_val = compute_theta_slda(slda_model, data_lemmatized, resp)
snum_token = num_token(data_lemmatized)
sfreq_terms = freq_words(svocab, data_lemmatized)

### Plotting in pyLDAvis (sLDA)

In [27]:
# Visualising the Results
pyLDAvis.enable_notebook()
data_slda = {'topic_term_dists': sphi_val,
             'doc_topic_dists' : stheta_val,
             'doc_lengths'     : snum_token,
             'vocab'           : svocab,
             'term_frequency'  : sfreq_terms}
print('Topic-Term shape: %s' % str(np.array(data_slda['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(data_slda['doc_topic_dists']).shape))

Topic-Term shape: (3, 42)
Doc-Topic shape: (145, 3)


In [28]:
vis_slda = pyLDAvis.prepare(**data_slda)
pyLDAvis.display(vis_slda)

## Computing Scores for use in Estimation

In this portion of the code, values are computed for each document in the corpus. The values are computed based on the words used in each of the documents in the corpus. Scores will be computed for each topic. This will be based on the probability values in each of the topics.

In [29]:
def compute_scores(list_dataset, list_word_prob):
    """
    desc:
        this function will take the cleaned dataset and list of word probabilities per topic and compute the scores
    input:
        cleaned dataset as a list
        word probabilities as a dataframe
    output:
        scores for each document in the corpus
    """
    n = len(list_dataset)
    prob_list = [[0 for i in range(4)] for i in range(n)]
    for index, document in enumerate(list_dataset):
        # remember to change the number of variables based on the number of topics
        probab_1 = 0
        probab_2 = 0
        probab_3 = 0
        for word in document:
            for index1, row in list_word_prob.iterrows():
                item  = row['Word']
                prob1 = row['Prob_1']
                prob2 = row['Prob_2']
                prob3 = row['Prob_3']
                if word == item:
                    probab_1 += prob1
                    probab_2 += prob2
                    probab_3 += prob3
            
            prob_list[index][0] = probab_1
            prob_list[index][1] = probab_2
            prob_list[index][2] = probab_3
            prob_list[index][3] = probab_1 + probab_2 + probab_3
    
    return prob_list

### Computing the Scores for LDA

In [30]:
lda_dist = pd.read_csv(directo + "\\topic_word_prob_lda.csv", header=None)
lda_distT = lda_dist.T
lda_distT.columns = ['Word', 'Prob_1', 'Prob_2', 'Prob_3']
lda_distT['Word']   = lda_distT['Word'].str.strip()
lda_distT['Prob_1'] = pd.to_numeric(lda_distT.Prob_1, errors='coerce')
lda_distT['Prob_2'] = pd.to_numeric(lda_distT.Prob_2, errors='coerce')
lda_distT['Prob_3'] = pd.to_numeric(lda_distT.Prob_3, errors='coerce')
probab_lda = compute_scores(data_lemmatized, lda_distT)
df['probab_lda'] = probab_lda

### Computing the Scores for sLDA

In [31]:
slda_dist = pd.read_csv(directo + "\\topic_word_prob_slda.csv", header=None)
slda_distT = slda_dist.T
slda_distT.columns = ['Word', 'Prob_1', 'Prob_2', 'Prob_3']
slda_distT['Word']   = slda_distT['Word'].str.strip()
slda_distT['Prob_1'] = pd.to_numeric(slda_distT.Prob_1, errors='coerce')
slda_distT['Prob_2'] = pd.to_numeric(slda_distT.Prob_2, errors='coerce')
slda_distT['Prob_3'] = pd.to_numeric(slda_distT.Prob_3, errors='coerce')
probab_slda = compute_scores(data_lemmatized, slda_distT)
df['probab_slda'] = probab_slda

df.to_csv(directo + "\\Open_Ended_Q1_Scores_Calibration.csv")