### Topic Modeling for Twitter

In this notebook I apply topic modeling to the Twitter data I cleaned in a separate notebook.
I use the LDA (Latent Dirichlet Allocation) algorithm to model the topics.
Here are some useful resources I used to learn the intuition behind topic modeling:
    [Analytics Vidhya introduction to topic modeling](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/) 
    [LDA Video](https://www.youtube.com/watch?v=3mHy4OSyRf0) 
    [Topic Modeling With Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
To start, I will use the Python libraries scikit-learn and gensim.

In [23]:
# Uncomment the next line to install pyLDAvis
# !pip install pyLDAvis

In [26]:
import pandas as pd
import gensim as gs
import numpy as np
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint
from matplotlib import pyplot as plt
import pyLDAvis

In [2]:
cleanned_tweets = pd.read_csv('../data/cleanned_tweets_04-06-2019-23-07.csv', index_col='Unnamed: 0')

In [3]:
cleanned_tweets.dropna(inplace=True)

In [4]:
tweets_array = cleanned_tweets.get('cleanned_tweet').values

In [5]:
tweets_array = [tweet.split(' ') for tweet in tweets_array]

In [6]:
# Create Dictionary

id2word = corpora.Dictionary(tweets_array)

In [7]:
# Tokenized tweets that the corpus is built from
texts = tweets_array

In [8]:
# Term-document frequency: (token_id, count) pairs for each tweet
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]]
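
To make the output above concrete, here is a minimal pure-Python sketch of what a bag-of-words encoding does: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The helper names `build_vocab` and `to_bow` and the sample tweets are made up for illustration, and gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of doc found in vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy tokenized tweets (hypothetical, not taken from the dataset)
docs = [["retour", "kinshasa", "dimanche"], ["kinshasa", "ebola", "kinshasa"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

A token that appears twice in a tweet shows up once in the list with a count of 2, which is why every pair in the first real document above has a count of 1.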


In [11]:
lda_model = gs.models.ldamodel.LdaModel(corpus=corpus,
                                        id2word=id2word,
                                        num_topics=5, 
                                        random_state=100,
                                        update_every=1,
                                        chunksize=100,
                                        passes=100,
                                        alpha='auto',
                                        per_word_topics=True)

In [27]:
topic_term_matrix = lda_model.get_topics()

In [32]:
topic_term_matrix[0]

array([5.6825645e-02, 5.4356784e-02, 4.9937952e-02, ..., 4.8323946e-05,
       1.0451641e-03, 1.0451641e-03], dtype=float32)

In [12]:
for label, words in lda_model.print_topics():
    print('==================')
    print("topic label ", label, ' =>word ',  words)

topic label  0  =>word  0.066*"kinshasa" + 0.065*"juin" + 0.062*"retour" + 0.057*"annonce" + 0.054*"cher" + 0.053*"dimanche" + 0.050*"joie" + 0.050*"compatriote" + 0.050*"enthousiasme" + 0.018*"bemba"
topic label  1  =>word  0.012*"nouveau" + 0.009*"ebola" + 0.008*"má" + 0.008*"jeanpierre" + 0.008*"passer" + 0.008*"argent" + 0.007*"créer" + 0.007*"affaire" + 0.007*"an" + 0.007*"journaliste"
topic label  2  =>word  0.016*"président" + 0.011*"vouloir" + 0.011*"semaine" + 0.010*"mwilambwe" + 0.009*"national" + 0.008*"gouvernement" + 0.008*"Monsieur" + 0.008*"paul" + 0.007*"politique" + 0.007*"leader"
topic label  3  =>word  0.028*"ebola" + 0.021*"outbreak" + 0.016*"kabila" + 0.015*"bureau" + 0.015*"demande" + 0.014*"assemblée" + 0.014*"provincial" + 0.012*"kinshasa" + 0.010*"ngobila" + 0.010*"révoquer"
topic label  4  =>word  0.037*"tshisekedi" + 0.013*"félix" + 0.012*"kinshaser" + 0.010*"fidèle" + 0.009*"hommage" + 0.009*"martyr" + 0.008*"président" + 0.008*"chebeya" + 0.008*"assassinat"

At first glance the model is not very accurate: some topics contain words that make no sense together. The only topic I can identify from these tweets is the holiday on Thursday due to Tshisekedi's death.

The next step to improve the model is to remove mentions, English words, articles, and other uninformative words, which should give more coherent topics.
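
The cleanup described above could look roughly like the sketch below. It is only an illustration: the function name `clean_tokens`, the regular expression for mentions, and the tiny stop-word set are my assumptions. A real run would need proper French and English stop-word lists, and removing English words in general would need a language-detection step that is not shown here.

```python
import re

MENTION_RE = re.compile(r"@\w+")  # Twitter mentions such as @user

# Hypothetical, deliberately tiny stop-word set for illustration only
STOP_WORDS = {"le", "la", "les", "de", "the", "of", "an"}

def clean_tokens(tweet, stop_words=STOP_WORDS):
    """Strip mentions, lowercase, split on whitespace and drop stop-words."""
    without_mentions = MENTION_RE.sub(" ", tweet)
    tokens = without_mentions.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

# Made-up example tweet
print(clean_tokens("@user le retour de Bemba à Kinshasa"))
```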

In [13]:
print('\nPerplexity: ', lda_model.log_perplexity(corpus))


Perplexity:  -7.109698002056578
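
Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. According to the gensim documentation, the perplexity is `2^(-bound)`. Here is a small sketch of the conversion, using the value printed above (the helper name `bound_to_perplexity` is my own):

```python
def bound_to_perplexity(per_word_bound):
    """Convert gensim's per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

bound = -7.109698002056578  # value printed by lda_model.log_perplexity(corpus)
perplexity = bound_to_perplexity(bound)
print(f"Perplexity: {perplexity:.1f}")
```

A lower perplexity (that is, a bound closer to zero) indicates a better fit to the evaluation corpus.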


In [14]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4456736448363941


In [36]:
import pyLDAvis.gensim  # in pyLDAvis 3.3+ this module is named pyLDAvis.gensim_models

In [47]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis


In [50]:
lda_model.get_topic_terms(1,  topn=10)

[(556, 0.017516058),
 (177, 0.017165974),
 (628, 0.016905999),
 (424, 0.016500577),
 (485, 0.01478237),
 (663, 0.014174746),
 (427, 0.013618593),
 (1675, 0.012578938),
 (432, 0.012284329),
 (426, 0.012159651)]
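
The pairs returned by `get_topic_terms` are `(token_id, probability)`, so they are easier to read once the ids are mapped back to words. Below is a minimal sketch, assuming the dictionary supports lookup by id (gensim's `Dictionary` does, via `id2word[token_id]`). The helper name `label_terms` and the toy mapping are hypothetical stand-ins so the snippet runs without the trained model.

```python
def label_terms(topic_terms, id2word):
    """Replace token ids with their words, keeping the probabilities."""
    return [(id2word[token_id], float(prob)) for token_id, prob in topic_terms]

# Toy stand-ins (hypothetical ids and probabilities, not the real dictionary)
toy_id2word = {0: "ebola", 1: "outbreak", 2: "kabila"}
toy_terms = [(0, 0.028), (1, 0.021), (2, 0.016)]

for word, prob in label_terms(toy_terms, toy_id2word):
    print(f"{word}: {prob:.3f}")
```

With the real objects this would be `label_terms(lda_model.get_topic_terms(1, topn=10), id2word)`. gensim's `lda_model.show_topic(1, topn=10)` returns word-probability pairs directly.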

### II. Trying the Mallet Model

Let's first download Mallet.

In [45]:
# Uncomment to download Mallet
# !wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip -O ../.venv3/models/mallet.zip

In [46]:
mallet_path = '../.venv3/models/mallet.zip' # update this path: it must point to the unzipped mallet executable, not the zip
ldamallet = gs.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

CalledProcessError: Command '../.venv3/models/mallet.zip import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /var/folders/nl/l4lf_0kn47d8xzr0ktf_k4xc0000gn/T/da0ad0_corpus.txt --output /var/folders/nl/l4lf_0kn47d8xzr0ktf_k4xc0000gn/T/da0ad0_corpus.mallet' returned non-zero exit status 2.
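
The error above most likely comes from `mallet_path` pointing at the downloaded zip archive rather than at an executable: the wrapper tries to run that path as a command. Here is a hedged sketch of the missing step, unzipping the archive and locating the launcher script. The helper name `extract_mallet` is my own, and the `mallet-2.0.8/bin/mallet` layout is an assumption about the archive contents.

```python
import os
import stat
import zipfile

def extract_mallet(zip_path, target_dir, relative_bin="mallet-2.0.8/bin/mallet"):
    """Unzip the archive and return the path to the mallet launcher script."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    bin_path = os.path.join(target_dir, *relative_bin.split("/"))
    if not os.path.isfile(bin_path):
        raise FileNotFoundError(f"mallet launcher not found at {bin_path}")
    # zipfile does not restore permission bits, so mark the script executable
    os.chmod(bin_path, os.stat(bin_path).st_mode | stat.S_IXUSR)
    return bin_path

# Possible usage (paths taken from the cells above):
# mallet_path = extract_mallet('../.venv3/models/mallet.zip', '../.venv3/models')
```

The returned path can then be passed to `LdaMallet` in place of the zip path. Mallet itself also needs a working Java installation.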