### Topic Modeling for Twitter

In this notebook I apply topic modeling to the Twitter data I cleaned in a separate notebook.
I use the LDA (Latent Dirichlet Allocation) algorithm to model the topics.
Here are some useful resources I used to learn the intuition behind topic modeling:
    [AV introduction to topic modeling](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/) 
    [LDA Video](https://www.youtube.com/watch?v=3mHy4OSyRf0) 
    [Topic Modeling With Gensim](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
To start, I will use Python's scikit-learn and gensim libraries.

In [91]:
# Install pyLDAvis (comment this line out if it is already installed)
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Collecting numexpr (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/a4/2c/71676625624fe67b8ea2236455ceaed634bcef995bbe250f014c5d9508fd/numexpr-2.6.9-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (182kB)
Collecting pytest (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/56/53/0ae37ab12c457945ae0152c6571d6d40eecccddf25f71fe328f9aefe90ca/pytest-4.5.0-py2.py3-none-any.whl (227kB)
Collecting future (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/90/52/e20466b

In [92]:
import pandas as pd
import gensim as gs
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint
import pyLDAvis

In [45]:
cleanned_tweets = pd.read_csv('../data/cleanned_tweets.csv', index_col='Unnamed: 0')

In [46]:
cleanned_tweets.dropna(inplace=True)

In [47]:
tweets_array = cleanned_tweets.get('cleanned_tweet').values

In [48]:
tweets_array = [tweet.split(' ') for tweet in tweets_array]

In [49]:
# Create Dictionary

id2word = corpora.Dictionary(tweets_array)

In [50]:
# Create Corpus
texts = tweets_array

In [51]:
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]]
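To make the bag-of-words output above concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`: each document becomes a sorted list of `(token_id, count)` pairs. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and ids are assigned here in order of first appearance, which is an assumption that may not match how gensim numbers its tokens.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy documents (illustrative only, not taken from the tweet data set)
docs = [["kabila", "président", "kabila"], ["président", "pays"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

In the real corpus above every count is 1 for the first tweet, which simply means that none of its tokens is repeated.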


In [99]:
lda_model = gs.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=100,
                                           alpha='auto',
                                           per_word_topics=True)

In [101]:
for label, words in lda_model.print_topics():
    print('==================')
    print("topic label ", label, ' =>word ',  words)

topic label  0  =>word  0.017*"come" + 0.017*"mort" + 0.013*"@fcc_rdc" + 0.013*"@sindika_dokolo" + 0.011*"tshimboj" + 0.011*"devoir" + 0.011*"discussion" + 0.011*"outbreak" + 0.010*"@elogemwandwe" + 0.010*"news"
topic label  1  =>word  0.013*"@actualitecd" + 0.010*"around" + 0.010*"@martinfayulu" + 0.010*"décider" + 0.009*"population" + 0.009*"drcongo" + 0.009*"petit" + 0.009*"enfant" + 0.009*"kalemie" + 0.008*"cependant"
topic label  2  =>word  0.040*"@presidence_rdc" + 0.035*"tshisekedi" + 0.026*"de" + 0.021*"etienne" + 0.019*"national" + 0.017*"@topcongo" + 0.017*"@fatshi13" + 0.016*"jeudi" + 0.016*"mai" + 0.015*"rdcongo"
topic label  3  =>word  0.040*"le" + 0.033*"de" + 0.030*"-" + 0.020*"président" + 0.019*"non" + 0.018*"tshisekedi" + 0.015*"mai" + 0.015*"kabila" + 0.014*"étienne" + 0.013*"dépouill"
topic label  4  =>word  0.030*"être" + 0.028*"@moise_katumbi" + 0.022*"@fatshi13" + 0.021*"@vitalkamerhe1" + 0.021*"ce" + 0.017*"@abelamundala" + 0.013*"pays" + 0.013*"@mwema_y" + 0.01

At first glance the model is not very accurate: several topics contain words that carry no meaning. The only topic I can identify from these tweets is the holiday on Thursday related to Tshisekedi's death.

To improve the model, the next step is to remove mentions, English words, articles, and other uninformative words.
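The cleanup steps described above could be sketched roughly as follows. This is only an illustration: the two word sets are tiny, hypothetical lists built from tokens visible in the topic output, and the `clean_tokens` helper is not part of the original notebook. A real run would rely on fuller stop-word lists.

```python
# Hypothetical, deliberately tiny word lists; replace them with complete ones.
FRENCH_STOPWORDS = {"le", "de", "ce", "être", "non"}
ENGLISH_WORDS = {"come", "news", "around", "outbreak"}


def clean_tokens(tokens):
    """Drop mentions, punctuation-only tokens, stop words and English words."""
    kept = []
    for token in tokens:
        token = token.strip().lower()
        if not token or token.startswith("@"):
            continue  # empty string or Twitter mention
        if not any(ch.isalnum() for ch in token):
            continue  # punctuation-only token such as "-"
        if token in FRENCH_STOPWORDS or token in ENGLISH_WORDS:
            continue
        kept.append(token)
    return kept


sample = ["@fatshi13", "le", "président", "-", "news", "tshisekedi"]
cleaned = clean_tokens(sample)
print(cleaned)
```

Applying such a filter to every tweet before building the dictionary should keep mentions and function words out of the topics.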

In [102]:
# Note: log_perplexity returns the per-word likelihood bound, not the perplexity itself
print('\nPerplexity: ', lda_model.log_perplexity(corpus))


Perplexity:  -7.954524536656986
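The negative value above is the per-word likelihood bound returned by `log_perplexity`, not a perplexity in the usual sense. gensim's own logging derives a perplexity estimate from it as 2 raised to the negated bound, so the conversion can be sketched as below. The function name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


bound = -7.954524536656986  # value printed by the cell above
perplexity = bound_to_perplexity(bound)
print(round(perplexity, 1))
```

Lower perplexity is better, but it is most useful for comparing models trained on the same corpus rather than as an absolute score.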


In [103]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4338861927091099


In [104]:
import pyLDAvis.gensim

In [105]:

vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis



### II. Trying the Mallet Model

Let's first download the model.

In [117]:
# Download the Mallet archive (comment this line out once it has been downloaded)
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip -O ../.venv3/models/mallet.zip

--2019-05-30 20:17:45--  http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Resolving mallet.cs.umass.edu (mallet.cs.umass.edu)... 128.119.246.70
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16184794 (15M) [application/zip]
Saving to: ‘../.venv3/models/mallet.zip’


2019-05-30 20:17:58 (1.24 MB/s) - ‘../.venv3/models/mallet.zip’ saved [16184794/16184794]



In [120]:
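# Update this path: LdaMallet expects the unzipped bin/mallet executable, not the zip archive
mallet_path = '../.venv3/models/mallet.zip'
# Note: gensim.models.wrappers was removed in gensim 4.0, so this call needs gensim 3.x
#ldamallet = gs.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)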
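Since the download is a zip archive and the wrapper needs the path to the Mallet launcher itself, the archive has to be unpacked first. Below is a minimal sketch using only the standard library. The `extract_mallet` helper is hypothetical, and it assumes the archive unpacks into a single top-level folder containing `bin/mallet`.

```python
import zipfile
from pathlib import Path


def extract_mallet(zip_path, dest_dir):
    """Unzip the Mallet archive and return the path to its bin/mallet launcher."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Assumption: one top-level folder in the archive holds bin/mallet.
    candidates = sorted(dest.glob("*/bin/mallet"))
    if not candidates:
        raise FileNotFoundError(f"no bin/mallet found under {dest}")
    launcher = candidates[0]
    # zipfile does not restore Unix permission bits, so mark the launcher executable.
    launcher.chmod(launcher.stat().st_mode | 0o111)
    return str(launcher)
```

With the paths used in this notebook, the call might look like `mallet_path = extract_mallet('../.venv3/models/mallet.zip', '../.venv3/models')`. Mallet also needs a Java runtime installed to run.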