# Topic_Modeling_On_Tweets_Around_COVID19
In this notebook, we explore the topics people are discussing on Twitter about Coronavirus Disease 2019 (COVID-19).

In [1]:
# Import all required libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tweepy as tw
import json
import requests
import re

# Load Gensim and NLTK libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# Seed NumPy's global random number generator for repeatability
np.random.seed(400)

import nltk
nltk.download('wordnet')

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come'])
# Other stop_words: gensim.parsing.preprocessing.STOPWORDS


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/deepakawari/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Configure notebook display to show data from pandas dataframe more clearly.
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',150)
pd.set_option('display.max_colwidth',1000)

# 1. Loading data
If the tweets data file is available, load the data from it; otherwise, fetch the tweets through the Twitter API using Tweepy.

## Loading data from Twitter API:
To load data from Twitter with Tweepy, you first need a Twitter Developer account. Then download the credentials used to authenticate through Tweepy. Do not share these credentials with anyone.
* Here is the link to [apply for Twitter developer access](https://developer.twitter.com/en/apply-for-access)
* The code below uses Tweepy to authenticate and load the data. See the [Tweepy documentation](http://docs.tweepy.org/en/latest/) for reference.
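
The loading cell imports `getTwitterDevCreds` from a local `TwitterDevSecrets` module that is not shown in this notebook. A minimal sketch of such a module, assuming the credentials are kept in environment variables (the variable names below are hypothetical, not something the notebook specifies), could look like this:

```python
import os

# Hypothetical environment variable names; the notebook does not specify them.
_ENV_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def getTwitterDevCreds(environ=os.environ):
    """Return (consumer_key, consumer_secret, access_token, access_secret).

    Raises KeyError naming the missing variables instead of returning a
    partial set of credentials.
    """
    missing = [name for name in _ENV_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return tuple(environ[name] for name in _ENV_NAMES)
```

Reading secrets from the environment keeps them out of the notebook and out of version control, which fits the advice not to share them.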

In [3]:
# Load the data. Set LoadFromTwitter to True to ignore any saved file
# and fetch the tweets afresh from Twitter.
LoadFromTwitter = False

fileName = '../data/tweets.csv'
tweetsDF = None

# Load the data
if os.path.exists(fileName) and not LoadFromTwitter:
    tweetsDF = pd.read_csv(fileName)
else:
    from TwitterDevSecrets import getTwitterDevCreds
    consumer_key, consumer_secret, access_token, access_secret = getTwitterDevCreds()

    auth = tw.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Set the wait_on_rate_limit and wait_on_rate_limit_notify to True
    # wait_on_rate_limit – 
    #    Whether or not to automatically wait for rate limits to replenish
    # wait_on_rate_limit_notify – 
    #    Whether or not to print a notification when Tweepy is waiting 
    #    for rate limits to replenish
    api = tw.API(
        auth, 
        wait_on_rate_limit=True, 
        wait_on_rate_limit_notify=True)

    # Define the search term and the date_since date as variables
    search_words = "#covid OR #covid19 OR #COVID OR #COVID19 OR #ncov OR #corona OR #coronavirus"
    date_since = "2020-03-16"
    
    # Read the tweets
    tweets = tw.Cursor(api.search, 
                   q=search_words,
                   lang="en",
                   since=date_since)

    # Extract the data into a pandas DataFrame.
    # Other available fields: tweet.user.screen_name, tweet.retweet_count, tweet.favorite_count
    # Rows are collected in a list first because DataFrame.append is slow in a
    # loop and is no longer available in recent pandas releases.
    rows = []
    for tweet in tweets.items(1000):
        rows.append({'Id': tweet.id, 'Location': tweet.user.location, 'Text': tweet.text})
    tweetsDF = pd.DataFrame(rows, columns=['Id', 'Location', 'Text'])
    
    tweetsDF['index'] = tweetsDF.index
    
    # Save the new set of tweets in the file.
    tweetsDF.to_csv(fileName,index=False)

# Tweets loaded
tweetsDF.head()

Unnamed: 0,Id,Location,Text,index
0,1.246511e+18,,RT @ALPublicHealth: State Health Officer Dr. Scott Harris has issued a stay at home order and strict quarantine requirements. Read our full…,0
1,1.246511e+18,"Portland, oregon",RT @Carol_D_Johnson: Thank you nurses for helping to keep us healthy ❤ #COVID19 \n#StayHomeSaveLives \n#coronavirus https://t.co/HGv0HfuTgt,1
2,1.246511e+18,,RT @Surgeon_General: #TogetherApart we can slow the spread of #COVID19. https://t.co/8JIBxQFpjv,2
3,1.246511e+18,,RT @SkyNews: Are smokers at greater risk of contracting #coronavirus?\n\nDr Ellie Cannon says while we are all at equal risk of contracting #…,3
4,1.246511e+18,,RT @evankirstel: 😱 The video of a 3D model from a CT scan shows the extent to which the #COVID19 has damaged the lung tissue #StayHome #St…,4


In [4]:
tweetsDF.Text[1]

'RT @Carol_D_Johnson: Thank you nurses for helping to keep us healthy  ❤ #COVID19 \n#StayHomeSaveLives \n#coronavirus https://t.co/HGv0HfuTgt'

# 2: Data preprocessing
We will perform the following data processing steps:

* Tweet preprocessing:
> * Remove the leading **RT**. RT indicates that the user is re-posting someone else's tweet, so the token carries no topical information.
> * Remove references to other accounts, which are written with a leading '@' symbol.
> * Remove URLs mentioned in the tweets.

* Generic text preprocessing:
> * **Tokenization**: Split the text into words. Lowercase the words and remove punctuation.
> * Remove words that have 3 or fewer characters.
> * Remove all **stopwords**.
> * **Lemmatize** the words: words in third person are changed to first person, and verbs in past and future tenses are changed into present.  
> Lemmatization, unlike stemming, reduces inflected words to a root that is itself a valid word in the language. This root is called the lemma: the canonical, dictionary, or citation form of a set of words (plural lemmas or lemmata).
>> WordNetLemmatizer: looks up the lemma in the NLTK WordNet corpus, so it returns a valid word.
> * **Stem** the words: words are reduced to their root form.  
> Stemming strips inflections so that a group of related words maps to the same stem, even if the stem itself is not a valid word in the language.
>> PorterStemmer: known for its simplicity. Rather than following linguistic analysis, the algorithm applies sets of suffix rules in five sequential phases to generate stems, which is why its stems are often not actual English words.
>> SnowballStemmer: Snowball is a framework for writing stemming rules for any language. NLTK's SnowballStemmer builds on it to offer stemmers for English and several other languages.
>> LancasterStemmer: simple but aggressive. Its iterative rules can over-stem, producing stems that are not linguistic or carry no meaning.
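
The tweet-specific and generic steps above can be sketched with the standard library alone. This is only an illustration: the stop-word set is a tiny made-up stand-in for NLTK's list, the helper names are hypothetical, the sample is an abridged version of the tweet shown earlier, and lemmatization and stemming are left out because they need NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook uses NLTK's list).
DEMO_STOP_WORDS = {"the", "and", "for", "you", "has", "our", "thank", "from", "with"}


def clean_tweet(text):
    """Strip the leading RT marker, @mentions, and URLs from a tweet."""
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker only
    text = re.sub(r"@\w+", "", text)          # account references
    text = re.sub(r"https?://\S+", "", text)  # links
    return text


def tokenize(text, stop_words=DEMO_STOP_WORDS, min_len=4):
    """Lowercase, keep alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", clean_tweet(text).lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]


sample = ("RT @Carol_D_Johnson: Thank you nurses for helping to keep us healthy "
          "#COVID19 https://t.co/HGv0HfuTgt")
print(tokenize(sample))  # ['nurses', 'helping', 'keep', 'healthy', 'covid']
```

Here `min_len=4` plays the role of the `len(token) > 3` check in the notebook's `preprocess` function.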
    

In [5]:
# Perform data preprocessing for all tweets.

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the lemma
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def tweet_cleanup(text):
    # Remove only the leading RT marker; str.replace would also strip
    # 'RT' from inside upper-case words
    text = re.sub(r'^RT\s+', '', text)
    # Remove account references such as '@Carol_D_Johnson' (letters, digits, underscores)
    text = re.sub(r'@\w+', '', text)
    # Remove the URLs in the tweet
    text = re.sub(r'https?://\S+', '', text)
    
    return text
  
# Tokenize and lemmatize
def preprocess(text, stop_words=stop_words):
    result=[]
    text = tweet_cleanup(text)
    for token in gensim.utils.simple_preprocess(text) :
        if token not in stop_words and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result



In [6]:
# Test the preprocessing on a sample tweet
tweet_num = 0
sampleTweet = tweetsDF[tweetsDF['index'] == tweet_num].Text.iloc[0]

print("Original tweet: ")
words = sampleTweet.split()
print(words)
print("\n\nPreprocessed tweet: ")
print(preprocess(sampleTweet))

Original tweet: 
['RT', '@ALPublicHealth:', 'State', 'Health', 'Officer', 'Dr.', 'Scott', 'Harris', 'has', 'issued', 'a', 'stay', 'at', 'home', 'order', 'and', 'strict', 'quarantine', 'requirements.', 'Read', 'our', 'full…']


Preprocessed tweet: 
['state', 'health', 'offic', 'scott', 'harri', 'issu', 'stay', 'home', 'order', 'strict', 'quarantin', 'requir', 'read', 'full']


In [7]:
# Preprocess all tweets and generate a new processed tweet text dataset.

processed_tweets = tweetsDF['Text'].map(preprocess)
processed_tweets[:10]

0        [state, health, offic, scott, harri, issu, stay, home, order, strict, quarantin, requir, read, full]
1                                                             [nurs, help, keep, healthi, covid, coronavirus]
2                                                                        [togetherapart, slow, spread, covid]
3                    [smoker, greater, risk, contract, coronavirus, elli, cannon, say, equal, risk, contract]
4                                      [video, model, scan, show, extent, covid, damag, lung, tissu, stayhom]
5                      [leader, hous, parti, caucus, arizona, andi, bigg, think, spread, covid, much, possib]
6                                                                                     [covid, test, administ]
7           [keep, think, master, public, health, write, doctor, dissert, global, effort, tackl, aid, pandem]
8    [ceylonblacktea, rich, theaflavin, help, increas, human, immun, covid, srilankatea, industri, successfu]
9         

# 3.1: Bag of words on the dataset
Create a dictionary of the words present in the `processed_tweets` dataset; Gensim offers a convenient API for this. The dictionary assigns a numerical id to each word, so later steps can work with numbers, which is much easier than working with strings.

Then create a bag-of-words corpus, in which each tweet is represented by the numerical ids of its words along with how often each word occurs in that tweet.
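
To make the idea concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words conversion do. It is a simplified stand-in, not Gensim's implementation, and the function names are hypothetical.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to an integer id (tokens are sorted within each document)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["stay", "home", "stay", "safe"], ["mask", "home"]]
token2id = build_dictionary(docs)
print(token2id)                       # {'home': 0, 'safe': 1, 'stay': 2, 'mask': 3}
print(doc_to_bow(docs[0], token2id))  # [(0, 1), (1, 1), (2, 2)]
```

Each tweet becomes a short list of `(token_id, count)` pairs, which is the same shape as the output of `dictionary.doc2bow` used in the next cell.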

In [8]:
dictionary = gensim.corpora.Dictionary(processed_tweets)

# Create Corpus: Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in processed_tweets]

In [9]:
# Check the id to word mapping from the dictionary created above
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 5:
        break

0 full
1 harri
2 health
3 home
4 issu
5 offic


Since the corpus is large and sparse, we should reduce the vocabulary used for modeling by removing very rare and very common words. The Gensim dictionary object provides an API for this. We drop:
- words appearing in fewer than 15 tweets
- words appearing in more than 10% of all tweets

Then convert the tweets into a bag-of-words corpus with the very rare and very common words filtered out.
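
The filtering rule can be sketched in plain Python as a document-frequency filter. This is a simplified illustration with a hypothetical helper name and toy thresholds; it ignores details of Gensim's `filter_extremes` such as the `keep_n` cap and the re-numbering of ids.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["covid", "mask", "home"],
    ["covid", "mask", "test"],
    ["covid", "home", "rare"],
    ["covid", "test", "mask"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['home', 'mask', 'test']
```

In this toy corpus, `covid` is dropped because it occurs in every document and `rare` because it occurs in only one, which mirrors why the sample tweet keeps only a handful of its words after filtering.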

In [10]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_tweets]

# Test the Bag of Words representation of the tweet --> (token_id, token_count)
bow_corpus[tweet_num]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]

In [11]:
# Preview BOW for our sample preprocessed tweet
bow_tweet_0 = bow_corpus[tweet_num]

for i in range(len(bow_tweet_0)):
    print("Word {} (\"{}\") appears {} time.".format(bow_tweet_0[i][0], 
                                                     dictionary[bow_tweet_0[i][0]], 
                                                     bow_tweet_0[i][1]))

Word 0 ("health") appears 1 time.
Word 1 ("home") appears 1 time.
Word 2 ("order") appears 1 time.
Word 3 ("state") appears 1 time.
Word 4 ("stay") appears 1 time.


# 3.2: TF-IDF on the data set
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Summing the TF-IDF of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution.

* TF (Term Frequency): the number of times a term occurs in a document.
* IDF (Inverse Document Frequency): diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
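
As a worked illustration, the sketch below computes TF-IDF weights by hand. It assumes the weighting that Gensim's `TfidfModel` is understood to use by default (raw count times `log2(N / df)`, followed by L2 normalization); the helper name and the toy numbers are made up for the example.

```python
import math


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(num_docs / df), then L2-normalize."""
    weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    weights = [(tid, w) for tid, w in weights if w > 0.0]  # drop zero-weight terms
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []


# Toy numbers: 8 documents; token 0 appears in 4 of them, token 1 in 2.
doc_freq = {0: 4, 1: 2}
vector = tfidf_vector([(0, 1), (1, 1)], doc_freq, num_docs=8)
print(vector)  # roughly [(0, 0.447), (1, 0.894)]
```

Both tokens occur once in the document, so the rarer token 1 ends up with the larger weight purely because of its IDF. The L2 normalization is also why the squared scores of a single tweet sum to about 1.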

In [12]:
# Create tf-idf model object using models.TfidfModel
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)

# Apply transformation to the entire corpus
corpus_tfidf = tfidf[bow_corpus]

In [13]:
# Preview TF-IDF for our sample preprocessed tweet
tfidf_tweet_0 = corpus_tfidf[tweet_num]

for i in range(len(tfidf_tweet_0)):
    print("Word {} (\"{}\") TF-IDF score: {}.".format(tfidf_tweet_0[i][0], 
                                                     dictionary[tfidf_tweet_0[i][0]], 
                                                     tfidf_tweet_0[i][1]))

Word 0 ("health") TF-IDF score: 0.40009295170061265.
Word 1 ("home") TF-IDF score: 0.44472924895798494.
Word 2 ("order") TF-IDF score: 0.45253501051552114.
Word 3 ("state") TF-IDF score: 0.47433139975516847.
Word 4 ("stay") TF-IDF score: 0.4608289406979316.


# 4: Topic modeling, visualizations and evaluations
In this section we'll build the topic models, visualize them, and then evaluate them.

## 4.1: Modeling using Bag of Words
In topic modeling, we have to tell the model how many topics to cluster the tweets into. But how do we choose that number? A practical way is to visualize the clusters themselves. Start with a high number of topics, such as 10 or 20, map the clusters into a vector space, and check whether they have clear boundaries. If the clusters overlap, reduce the number of topics and visualize again. Repeat until you are satisfied with the separation of the clusters.

### 4.1.1: Running LDA using bag of words

In [14]:
# Train the lda model using gensim.models.LdaMulticore on Bag of word corpus
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=3, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [15]:
# Explore the words occurring in each topic and their relative weights
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.062*"peopl" + 0.033*"death" + 0.031*"coronavirus" + 0.030*"keep" + 0.027*"like" + 0.024*"care" + 0.023*"spread" + 0.023*"today" + 0.021*"face" + 0.021*"human"


Topic: 1 
Words: 0.054*"case" + 0.048*"coronavirus" + 0.045*"work" + 0.041*"natur" + 0.038*"speak" + 0.037*"caus" + 0.037*"father" + 0.036*"murder" + 0.036*"poverti" + 0.036*"racism"


Topic: 2 
Words: 0.052*"pandem" + 0.044*"health" + 0.044*"mask" + 0.041*"home" + 0.039*"stay" + 0.035*"test" + 0.033*"order" + 0.028*"presid" + 0.024*"help" + 0.023*"public"




From the top words in each topic, we can infer a general theme for that cluster. For the clustering above, the themes could be:  
Topic 0: Take precautions  
Topic 1: Impact of COVID-19 on work, racism, and poverty  
Topic 2: Quarantine and fight the coronavirus  
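
When labelling topics by hand, it can help to turn the printed topic strings into word and weight pairs. The sketch below parses such a string with a regular expression; `parse_topic` is a hypothetical helper, and the sample string is the first half of Topic 2 from the output above. Gensim's `show_topic` method should give similar pairs directly.

```python
import re


def parse_topic(topic_str):
    """Turn '0.062*"peopl" + 0.033*"death"' into [('peopl', 0.062), ('death', 0.033)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]


topic_2 = '0.052*"pandem" + 0.044*"health" + 0.044*"mask" + 0.041*"home" + 0.039*"stay"'
for word, weight in parse_topic(topic_2):
    print(f"{word:<8} {weight:.3f}")
```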

### 4.1.2. Visualization using pyLDAVis for LDA with BOW

In [16]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary=lda_model.id2word)
vis



Based on the visualizations, it'd be best to create 3 clusters instead of 10.

### 4.1.3: Model evaluation
Classify a sample tweet with the model, then check whether the highest-scoring topic matches the tweet better than the other topics do.

In [27]:
# Our test tweet is 
print('Our test tweet is: {}: {}'.format(tweet_num, [dictionary[word[0]] for word in bow_corpus[tweet_num]]))

for index, score in sorted(lda_model[bow_corpus[tweet_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {} : {}".format(score, index, lda_model.print_topic(index, 10))) 

Our test tweet is: 0: ['health', 'home', 'order', 'state', 'stay']

Score: 0.8827516436576843	 
Topic: 2 : 0.052*"pandem" + 0.044*"health" + 0.044*"mask" + 0.041*"home" + 0.039*"stay" + 0.035*"test" + 0.033*"order" + 0.028*"presid" + 0.024*"help" + 0.023*"public"

Score: 0.06072410196065903	 
Topic: 0 : 0.062*"peopl" + 0.033*"death" + 0.031*"coronavirus" + 0.030*"keep" + 0.027*"like" + 0.024*"care" + 0.023*"spread" + 0.023*"today" + 0.021*"face" + 0.021*"human"

Score: 0.05652424693107605	 
Topic: 1 : 0.054*"case" + 0.048*"coronavirus" + 0.045*"work" + 0.041*"natur" + 0.038*"speak" + 0.037*"caus" + 0.037*"father" + 0.036*"murder" + 0.036*"poverti" + 0.036*"racism"


The sample tweet is assigned to Topic 2 with 88% probability. Topic 2 centers on quarantining, so the sample tweet is classified correctly. The BOW-based LDA model appears to work well on this example.

## 4.2: Modeling using TF-IDF
TF-IDF aims to reflect how important each word in a tweet is relative to the other tweets, so it may yield a better model than the raw term frequencies used in the bag-of-words model. For TF-IDF to work well, however, each document needs a reasonable amount of text, and a tweet is usually very short. Most words therefore occur only once per tweet, which leaves the term-frequency part with little to contribute. We therefore expect TF-IDF to bring little benefit on short texts, but let's train the model and see how it performs.

### 4.2.1. Running LDA using TF-IDF

In [17]:
# Train lda model using corpus_tfidf
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=3, 
                                             id2word = dictionary, 
                                             passes = 2, 
                                             workers=4)

In [18]:
# Explore the words occurring in each topic and their relative weights
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.044*"test" + 0.034*"help" + 0.031*"time" + 0.030*"coronavirus" + 0.030*"spread" + 0.027*"like" + 0.027*"govern" + 0.023*"case" + 0.022*"health" + 0.022*"world"


Topic: 1 Word: 0.052*"peopl" + 0.035*"work" + 0.032*"say" + 0.031*"home" + 0.031*"speak" + 0.030*"natur" + 0.029*"make" + 0.029*"caus" + 0.029*"father" + 0.028*"pandem"


Topic: 2 Word: 0.051*"coronavirus" + 0.043*"mask" + 0.043*"case" + 0.028*"public" + 0.027*"keep" + 0.027*"protect" + 0.026*"first" + 0.025*"death" + 0.023*"stay" + 0.023*"patient"




Topics from the above words:  
Topic 0: President Trump's announcements   
Topic 1: Impact of COVID in terms of patients, deaths and lockdowns.   
Topic 2: Quarantine and fight the spread of COVID-19

### 4.2.2. Visualization using pyLDAVis

In [19]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_tfidf, corpus_tfidf, dictionary=lda_model_tfidf.id2word)
vis



Judging from the visualization, 3 topics may be appropriate.

### 4.2.3: Model evaluation

In [26]:
# Our test tweet is 
print('Our test tweet is: {}: {}'.format(tweet_num, [dictionary[word[0]] for word in corpus_tfidf[tweet_num]]))

# Print each topic with the TF-IDF model's own top words (not the BOW model's)
for index, score in sorted(lda_model_tfidf[corpus_tfidf[tweet_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {} : {}".format(score, index, lda_model_tfidf.print_topic(index, 10)))

Our test tweet is: 0: ['health', 'home', 'order', 'state', 'stay']

Score: 0.4609680473804474	 
Topic: 1 : 0.052*"peopl" + 0.035*"work" + 0.032*"say" + 0.031*"home" + 0.031*"speak" + 0.030*"natur" + 0.029*"make" + 0.029*"caus" + 0.029*"father" + 0.028*"pandem"

Score: 0.41478097438812256	 
Topic: 2 : 0.051*"coronavirus" + 0.043*"mask" + 0.043*"case" + 0.028*"public" + 0.027*"keep" + 0.027*"protect" + 0.026*"first" + 0.025*"death" + 0.023*"stay" + 0.023*"patient"

Score: 0.12425100058317184	 
Topic: 0 : 0.044*"test" + 0.034*"help" + 0.031*"time" + 0.030*"coronavirus" + 0.030*"spread" + 0.027*"like" + 0.027*"govern" + 0.023*"case" + 0.022*"health" + 0.022*"world"


As can be seen above, the sample tweet is split between Topics 1 and 2. As we saw in section 4.2.1, Topic 1 centered on the impact of COVID in terms of patients, deaths and lockdowns, and Topic 2 on quarantine and fighting the spread of COVID-19. The sample tweet matches Topic 2 better, but the model did not classify it that way, or rather did not have much confidence in its classification. This was expected, as TF-IDF does not work well for short text documents.
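
One simple way to put a number on "confidence" here is the gap between the two highest topic scores. The sketch below is only an illustration of that idea, not a standard evaluation metric; `top_two_margin` is a hypothetical helper, and the scores are rounded copies of the two evaluation outputs shown in sections 4.1.3 and 4.2.3.

```python
def top_two_margin(topic_scores):
    """Return (best_topic, margin), where margin is the gap between the two highest scores."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    best_topic, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_topic, best_score - runner_up


# (topic_id, score) pairs, rounded from the BOW and TF-IDF evaluation outputs.
bow_scores = [(2, 0.8828), (0, 0.0607), (1, 0.0565)]
tfidf_scores = [(1, 0.4610), (2, 0.4148), (0, 0.1243)]

print(top_two_margin(bow_scores))    # topic 2, margin about 0.82
print(top_two_margin(tfidf_scores))  # topic 1, margin about 0.05
```

A margin near zero, as in the TF-IDF case, signals that the model is nearly undecided between two topics for this tweet.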