## Topic Modeling 
We will use Latent Dirichlet Allocation (LDA) to automatically find 10 topics in the tweets.
LDA is an unsupervised algorithm: it learns the relations between words, tweets and topics, and
for each topic it selects the words that are most relevant across all the tweets.

In [None]:
# Import libraries.
import pandas as pd

# Gensim
import gensim
import gensim.corpora as corpora
from pprint import pprint
from gensim.models import CoherenceModel
import pyLDAvis
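# Note: pyLDAvis >= 3.0 renamed this module to pyLDAvis.gensim_models;
# with those versions, adjust this import and the prepare() call accordingly.
import pyLDAvis.gensim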

In [None]:
# Load the CSV of cleaned tweets in Spanish
mx_twitts = pd.read_csv('data/clean/mx_twitts.csv')

## Create the Dictionary and Corpus needed for Topic Modeling

In [None]:
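# Split each cleaned tweet into a list of tokens.
# Rows with missing text are dropped, because str.split() leaves them as NaN
# and the Dictionary built later expects lists of strings.
token_tweets_list = mx_twitts['clean_text'].dropna().str.split().tolist()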

In [None]:
len(token_tweets_list)

Gensim assigns a unique integer id to each word in the collection. The corpus built below represents each tweet as a list of (word_id, word_frequency) pairs, which is the input format the LDA model expects.

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(token_tweets_list)

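# Tokenized tweets used to build the corpus
texts = token_tweets_list

# Bag-of-words corpus: one list of (word_id, word_frequency) pairs per tweet
corpus = [id2word.doc2bow(text) for text in texts]

# View the first five documents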
print(corpus[:5])
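
To make the (word_id, word_frequency) structure concrete, here is a minimal pure-Python sketch of the same idea. It is not Gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy tweets are hypothetical, and the ids Gensim assigns to words may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (word_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy tweets, not taken from the dataset
toy_docs = [["hola", "mundo", "hola"], ["mundo", "feliz"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Each inner list describes one document, and words that do not occur in a document are simply absent from its list, which keeps the representation sparse.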

## Building the Topic Model

In [None]:
# Build the LDA model; set parameters such as the number of topics and passes
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

## View the topics in LDA model

In [None]:
# Print the keywords of the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
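
Because the model was built with `per_word_topics=True`, indexing it with a single bag-of-words document should return a tuple whose first element is the list of (topic_id, probability) pairs for that document. The sketch below shows one way to pick the dominant topic from such a list. The helper `dominant_topic` and the example probabilities are hypothetical and only illustrate the idea.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical topic distribution for one tweet
example_doc_topics = [(0, 0.05), (3, 0.80), (7, 0.15)]
topic_id, probability = dominant_topic(example_doc_topics)

print(topic_id, probability)
```

With the model above, a call such as `dominant_topic(lda_model[corpus[0]][0])` would apply the same helper to the first tweet, assuming the tuple layout described above.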

## Compute Model Perplexity and Coherence Score

In [None]:
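# Compute the per-word likelihood bound.
# log_perplexity() returns a log-likelihood bound, so a higher value (closer to 0) is better;
# the perplexity itself is 2 ** (-bound), and for that lower is better.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))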

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=token_tweets_list, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
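
Gensim documents the value returned by `log_perplexity` as a per-word likelihood bound and reports the perplexity as 2^(-bound). The following minimal sketch makes that conversion explicit. The helper name `bound_to_perplexity` and the example value are hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)

# Hypothetical bound, not a result from this dataset
example_bound = -8.0
print(bound_to_perplexity(example_bound))
```

With the model above, `bound_to_perplexity(lda_model.log_perplexity(corpus))` would give the perplexity, for which lower values indicate a better fit.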

## Visualize the topics and keywords

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds='mmds')
vis
