<h1 style='color:#00868b'>LDA with 15 features; 4 runs: [20, 30, 40, 50] topics<span class="tocSkip"></span></h1>

## Reading the data

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_complaints = pd.read_csv("../../corpus_sprint2_pc_15_LDA.csv", encoding="utf-8")

In [5]:
features = 15

## LDA

Parameters ([source](https://radimrehurek.com/gensim/models/ldamulticore.html)):
* alpha (α): Document-topic smoothing parameter; can be set to a 1D array of length equal to the number of expected topics that expresses our a priori belief about each topic's probability
* eta (η): Topic-word smoothing parameter; a scalar gives a symmetric prior over topic-word probabilities

Most topic modeling analyses in the literature ([Blei et al., 2003](https://www.researchgate.net/publication/326505884_Latent_Dirichlet_Allocation_LDA_for_Topic_Modeling_of_the_CFPB_Consumer_Complaints); [Blei and Lafferty, 2009](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.1205&rep=rep1&type=pdf#page=96); [Kaplan and Vakili, 2015](https://onlinelibrary.wiley.com/doi/abs/10.1002/smj.2294); [Blei, 2012](https://dl.acm.org/doi/pdf/10.1145/2133806.2133826)) suggest a value of 0.1 for both of these hyperparameters. According to these sources, this setting tends to yield semantically meaningful topics.
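To get a feeling for what a prior of 0.1 means, the sketch below draws samples from a symmetric Dirichlet distribution using only the standard library. The helper names `sample_symmetric_dirichlet` and `mean_largest_share`, the comparison value 10.0, and the choice of 20 components are assumptions made for illustration; they are not part of the original notebook or of gensim.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(concentration, size=20, trials=500, seed=42):
    """Average weight of the largest component over many samples."""
    rng = random.Random(seed)
    largest = [
        max(sample_symmetric_dirichlet(concentration, size, rng))
        for _ in range(trials)
    ]
    return sum(largest) / trials


# A small concentration (0.1) puts most of the mass on a few components,
# while a large one (10.0, chosen only for contrast) spreads it out evenly.
sparse = mean_largest_share(0.1)
flat = mean_largest_share(10.0)
print("mean largest share with 0.1: ", round(sparse, 3))
print("mean largest share with 10.0:", round(flat, 3))
```

With a value below 1, each sampled distribution is dominated by a few components, which is why a small alpha favors documents that are about only a handful of topics.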

* number of topics: The number of topics LDA attempts to identify; chosen by trial and error
* iterations: Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
* passes: Number of passes through the corpus during training

We used **parallelized Latent Dirichlet Allocation** which uses multiprocessing to speed up learning ([source](https://radimrehurek.com/gensim/models/ldamulticore.html)).

In [6]:
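from gensim import corpora, models

# ASSUMPTION: the first CSV column holds one whitespace-tokenized document per row;
# adjust the column selection to match the actual file layout.
texts = df_complaints.iloc[:, 0].astype(str).str.split().tolist()
dictionary = corpora.Dictionary(texts)  # maps each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

alpha = 0.1
eta = 0.1
num_topics = [20, 30, 40, 50]  # number of topics for each run
# Defined but not passed to LdaMulticore below, so gensim's defaults apply
chunksize = 2000
passes = 20
iterations = 400

for topics in num_topics:
    print("start with number of topics:", topics)
    lda_model = models.LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=topics,
        alpha=alpha,
        eta=eta,
        random_state=42,
    )
    # log_perplexity returns the per-word likelihood bound (higher = better);
    # perplexity = 2^(-bound), where lower = better.
    print("Perplexity: ", np.exp2(-lda_model.log_perplexity(corpus)))
    # save the trained model
    print("Saving the model...")
    lda_model.save("/runs_Sprint2/LDA/lda_model_" + str(features) + "fea_" + str(topics) + "topics" + "no_p_and_i")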

start with number of topics: 20


KeyboardInterrupt: 
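
A note on reading the reported number: gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and gensim's own log output derives the perplexity estimate as 2^(-bound). The sketch below shows that conversion; the function name `perplexity_from_bound` and the bound values are made up for illustration and are not results from this corpus.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)


# Hypothetical bound values, for illustration only.
for bound in (-7.0, -8.0):
    print("bound:", bound, "perplexity:", perplexity_from_bound(bound))
```

A bound closer to zero gives a smaller perplexity, so when comparing runs the model with the higher bound is the one with the lower perplexity.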

In [None]:
# load the saved model
lda_model20 = models.LdaMulticore.load("/runs_Sprint2/LDA/lda_model_" + str(features) + "fea_" + str(20) + "topics" + "no_p_and_i")
lda_model30 = models.LdaMulticore.load("/runs_Sprint2/LDA/lda_model_" + str(features) + "fea_" + str(30) + "topics" + "no_p_and_i")
lda_model40 = models.LdaMulticore.load("/runs_Sprint2/LDA/lda_model_" + str(features) + "fea_" + str(40) + "topics" + "no_p_and_i")
lda_model50 = models.LdaMulticore.load("/runs_Sprint2/LDA/lda_model_" + str(features) + "fea_" + str(50) + "topics" + "no_p_and_i")
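
Because the save and load cells assemble the same file name by hand, a small helper can keep the two in sync. This is only a sketch: `model_path` and its `base_dir` parameter are hypothetical names that do not appear in the original notebook, and the path pattern is copied from the cells above.

```python
def model_path(features, topics, base_dir="/runs_Sprint2/LDA"):
    """Build the file name used for a saved model with the given settings."""
    return f"{base_dir}/lda_model_{features}fea_{topics}topicsno_p_and_i"


# One path per run; 15 features as set earlier in the notebook.
paths = {topics: model_path(15, topics) for topics in [20, 30, 40, 50]}
for topics, path in paths.items():
    print(topics, path)
```

The f-string also sidesteps having to convert the integers to strings before concatenating them.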

Show the topics of the models

In [None]:
print("Model with 20 topics:\n ", lda_model20.show_topics())
print("Model with 30 topics:\n ", lda_model30.show_topics())
print("Model with 40 topics:\n ", lda_model40.show_topics())
print("Model with 50 topics:\n ", lda_model50.show_topics())