# Topic modeling and visualization of tweets

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import pyLDAvis.gensim_models
import seaborn as sns
import os
import sys


In [2]:
sys.path.append(".")
sys.path.append("..")

In [3]:
# Local imports
from text_cleaner import TextCleaner
from build_model import BuildModel
from visualize import Visualize



## Load The Data 

In [4]:
tweets_df = pd.read_csv("../data/processed_tweet_data.csv")

In [5]:
tweets_df.head()

| Unnamed: 0 | created_at | source | original_text | polarity | subjectivity | lang | favorite_count | retweet_count | original_author | followers_count | friends_count | possibly_sensitive | hashtags | user_mentions | place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sun Aug 07 22:31:20 +0000 2022 | Twitter for Android | RT @i_ameztoy: Extra random image (I):\n\nLets... | -0.125 | 0.190625 | en | 15760 | 2 | i_ameztoy | 20497 | 2621 | | [{'text': 'City', 'indices': [132, 137]}] | [{'screen_name': 'i_ameztoy', 'name': 'Iban Am... | |
| 1 | Sun Aug 07 22:31:16 +0000 2022 | Twitter for Android | RT @IndoPac_Info: #China's media explains the ... | -0.1 | 0.1 | en | 6967 | 201 | ZIisq | 65 | 272 | | [{'text': 'China', 'indices': [18, 24]}, {'tex... | [{'screen_name': 'IndoPac_Info', 'name': 'Indo... | |
| 2 | Sun Aug 07 22:31:07 +0000 2022 | Twitter for Android | China even cut off communication, they don't a... | 0.0 | 0.0 | en | 2166 | 0 | Fin21Free | 85 | 392 | | [{'text': 'XiJinping', 'indices': [127, 137]}] | [{'screen_name': 'ZelenskyyUa', 'name': 'Волод... | Netherlands |
| 3 | Sun Aug 07 22:31:06 +0000 2022 | Twitter for Android | Putin to #XiJinping : I told you my friend, Ta... | 0.1 | 0.35 | en | 2166 | 0 | Fin21Free | 85 | 392 | | [{'text': 'XiJinping', 'indices': [9, 19]}] | [] | Netherlands |
| 4 | Sun Aug 07 22:31:04 +0000 2022 | Twitter for iPhone | RT @ChinaUncensored: I’m sorry, I thought Taiw... | -6.938894e-18 | 0.55625 | en | 17247 | 381 | VizziniDolores | 910 | 2608 | | [] | [{'screen_name': 'ChinaUncensored', 'name': 'C... | Ayent, Schweiz |


In [6]:
# Let's look at a sample tweet
tweets_df['original_text'][100]

'RT @anku5hdilraaj_: I guess #WWIII on its way for #Taiwan https://t.co/oomVltBmKF'

## Standardize tweets
To use machine learning algorithms, we first need to translate the raw text into meaningful features.
Here we normalize the following, with the tool used for each in parentheses:
1. Word capitalization (string class functions)
2. Punctuation (regular expressions)
3. Singular-plural versions of same word (lemmatization)
4. Common words like 'and' (stopwords)

We will use the `NLTK` and `re` packages to clean the text and `gensim` to implement various learning algorithms.
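
The internals of `TextCleaner` are not shown in this notebook, so the following is only a minimal sketch of the kind of cleaning described above. It uses just the standard-library `re` module; the function name `clean_tweet` and the tiny `STOPWORDS` set are assumptions made for illustration (the notebook itself imports NLTK's stopword list), and lemmatization is left out to keep the example dependency-free.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is far larger.
STOPWORDS = {"rt", "i", "on", "its", "for", "the", "and", "a", "to", "of", "in", "is"}

URL_RE = re.compile(r"https?://\S+")   # links such as https://t.co/...
MENTION_RE = re.compile(r"@\w+")       # user mentions such as @someone
TOKEN_RE = re.compile(r"#?\w+")        # words, optionally prefixed by '#'


def clean_tweet(text):
    """Lowercase a tweet, drop URLs, mentions and punctuation, then remove stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)    # punctuation is discarded by the token pattern
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "RT @anku5hdilraaj_: I guess #WWIII on its way for #Taiwan https://t.co/oomVltBmKF"
print(clean_tweet(sample))
```

With this particular stopword set, the sample tweet reduces to the tokens `guess`, `#wwiii`, `way` and `#taiwan`. A fuller pipeline would add a lemmatization step after tokenizing.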

In [6]:
text_cleaner = TextCleaner(tweets_df['original_text'].tolist())
text_cleaner.filterTweetList()

In [7]:
tweets_df['original_text'][100]

'RT @anku5hdilraaj_: I guess #WWIII on its way for #Taiwan https://t.co/oomVltBmKF'

Now let's look at the standardized version of the same tweet:

In [8]:
text_cleaner.tweetList[100]

['guess', '#wwiii', 'way', '#taiwan']

In [9]:
clean_tweets = text_cleaner.tweetList

### Now the data are ready for processing.
Many algorithms expect a similar input format, which is built in two steps:

1. Build a dictionary with all words in the dataset
2. Store the word counts (using the above dictionary) of each tweet in a corpus

Note that step 2 uses only word frequencies. This is the so-called "bag-of-words" approach, which ignores the order in which words appear. Analyses based on bigrams or trigrams could be used instead if word order were highly conserved.
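
To make the two steps concrete, here is a small pure-Python sketch of a dictionary and a bag-of-words corpus. It is a stand-in for illustration only, not the code behind `BuildModel.makeDict` or `BuildModel.makeCorpus` (which is not shown here); the helper names `make_dictionary` and `doc_to_bow` are hypothetical.

```python
from collections import Counter


def make_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy documents (hypothetical), already tokenized and cleaned.
docs = [["guess", "#wwiii", "way", "#taiwan"], ["#taiwan", "way", "way"]]
token2id = make_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]

# Word order is lost: a reordered document yields the same bag of words.
same_bag = doc_to_bow(["way", "#taiwan", "way"], token2id) == corpus[1]
```

Each document becomes a list of `(token_id, count)` pairs, which is all that a bag-of-words model sees.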

Because the data are unlabeled and I want to build intuition for them, I chose Latent Dirichlet Allocation (LDA), a topic modeling approach that probabilistically learns the latent (unobserved) topics of a group of documents. Alternatives such as LSA (also known as LSI) or TF-IDF weighting were either less accurate at predicting similarity or better suited to supervised learning.
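
For intuition, the generative story that LDA assumes can be written as below. This is the standard textbook formulation rather than anything specific to this notebook; the symbols $K$ (number of topics), $D$ (number of tweets), $\alpha$ and $\beta$ (Dirichlet hyperparameters) are introduced here for illustration.

$$
\begin{aligned}
\phi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d) \\
w_{d,n} &\sim \operatorname{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic mixture of tweet $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the hidden topic of the $n$-th word in tweet $d$, and $w_{d,n}$ is the observed word. Fitting the model means inferring $\theta$ and $\phi$ from the observed words alone.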

In [10]:
buildModel = BuildModel()

Create the model objects

In [11]:
twtDict = buildModel.makeDict(clean_tweets)
twtCorpus = buildModel.makeCorpus(clean_tweets, twtDict)
twtLda = buildModel.createLDA(twtCorpus, twtDict)

Save model objects

In [12]:
# Save the model objects
buildModel.saveModelObjects(twtLda, 'twtLDAmodel', twtCorpus, 'twtCorpus.mm', twtDict, 'twtDictionary.dict')

Model Objects Successfully SAVED!


## Topic Visualization of tweets on US-China Conflict using LDA

We can visualize our results using the `pyLDAvis` package. The plot is interactive: each circle is a topic, and its size represents the prevalence of that topic in the corpus. Each topic is shown alongside its associated words.

In [18]:
pyLDAvis.enable_notebook()
ldaViz = pyLDAvis.gensim_models.prepare(twtLda, twtCorpus, twtDict)



In [19]:
ldaViz

## Visualization-guided analysis
After finding a topic of interest, we can sort the data by that topic to learn more, for example which country has the most tweets in it. To do this, we first need to match the indices between the visualization and our LDA model.
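
One possible way to do this matching is sketched below in plain Python. It rests on several assumptions that should be checked against the installed versions. The plot is assumed to number topics from 1 in display order. The prepared data is assumed to expose a `topic_order` list of 1-based model topic ids in that same order. Per-tweet topic distributions are assumed to be available as `(topic_id, probability)` pairs. The helper names and toy inputs are hypothetical.

```python
from collections import Counter


def viz_to_model_ids(topic_order):
    """Map 1-based bubble numbers in the plot to 0-based model topic ids.

    Assumes topic_order lists 1-based model topic ids in display order.
    """
    return {viz_id: model_id - 1
            for viz_id, model_id in enumerate(topic_order, start=1)}


def dominant_topic(doc_topics):
    """Return the topic id with the highest probability among (topic_id, prob) pairs."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


def places_for_topic(doc_topic_lists, places, model_topic_id):
    """Count the places of tweets whose dominant topic is model_topic_id."""
    counts = Counter()
    for doc_topics, place in zip(doc_topic_lists, places):
        # Skip tweets without a place (None or empty string in this toy data).
        if place and dominant_topic(doc_topics) == model_topic_id:
            counts[place] += 1
    return counts.most_common()


# Hypothetical toy inputs standing in for real model output.
topic_order = [3, 1, 2]
doc_topic_lists = [[(0, 0.1), (2, 0.9)], [(0, 0.7), (1, 0.3)], [(2, 0.6), (1, 0.4)]]
places = ["Netherlands", "Ayent, Schweiz", "Netherlands"]

mapping = viz_to_model_ids(topic_order)
top_places = places_for_topic(doc_topic_lists, places, mapping[1])
```

In this toy setup, bubble 1 in the plot corresponds to model topic 2, and both tweets dominated by that topic come from the Netherlands. With real data, missing places read from the CSV would likely be `NaN` rather than `None`, so the emptiness check would need adjusting.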