# Topic Modeling for COVID19
COVID-19 is the biggest pandemic in recent memory. It almost feels like an apocalyptic movie come to life: people are sheltering in their homes to avoid infection, while some brave souls look for a safer place to ride out the pandemic. With so much happening around the world, I have one question. **What are people around the world thinking about COVID-19?**

In this notebook, we will try to answer that question using topic modeling. Let's explore the topics people are discussing on Twitter about Coronavirus Disease 2019 (COVID-19).

In [2]:
# Import all required libraries
import os
import json
import requests
import re

# Twitter data collection library
import tweepy as tw

# Data processing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Loading Gensim and nltk libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
#stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come'])
#Other stop_words: gensim.parsing.preprocessing.STOPWORDS


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/deepakawari/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Configure notebook display to show data from pandas dataframe more clearly.
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',100)
pd.set_option('display.max_colwidth',800)

# Step 1. Gather the textual data for Topic Modeling
Topic modeling starts with text. What people are saying about COVID-19 can be collected from many places, such as social media, news articles, and scraped web pages. In this notebook, we'll download the data from Twitter. Tweepy is a Python library for pulling data from Twitter with your Twitter Developer account. 

## Extracting tweets from Twitter API:
To load data from Twitter with Tweepy, you first need to create a Twitter Developer account. Then download the credentials that Tweepy uses to authenticate. **Please do not share these credentials with anybody else.**
* Here is the link to [apply for twitter developer access](https://developer.twitter.com/en/apply-for-access)
* You can follow the code below to authenticate with Tweepy and load the data. Here is the [Tweepy Documentation for reference](http://docs.tweepy.org/en/latest/) 
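
The data-loading cell below imports `getTwitterDevCreds` from a local `TwitterDevSecrets` module that is not shown in this notebook. One way to write such a helper without hard-coding keys in a file is to read them from environment variables. This is only a sketch: the environment variable names are my own assumption, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
_ENV_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)

def getTwitterDevCreds(environ=os.environ):
    """Return (consumer_key, consumer_secret, access_token, access_secret).

    Raises KeyError naming the missing variables, so a misconfigured
    environment fails early instead of at the first API call.
    """
    missing = [name for name in _ENV_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return tuple(environ[name] for name in _ENV_NAMES)
```

Keeping the keys out of the notebook and out of version control makes it harder to share them by accident.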

In [75]:
'''
LoadFromTwitter - 
    If true, pull the latest set of tweets from Twitter using the Tweepy library.
    If false, load the data from the datafile '../data/tweets.csv' if it exists, 
    otherwise load the tweets from Twitter using the Tweepy library.
    Set LoadFromTwitter to True to ignore the saved file and load the tweets afresh from Twitter.
'''
LoadFromTwitter = False

fileName = '../data/tweets.csv'
tweetsDF = None

# Load the data
if os.path.exists(fileName) and not LoadFromTwitter:
    tweetsDF = pd.read_csv(fileName)
else:
    from TwitterDevSecrets import getTwitterDevCreds
    consumer_key, consumer_secret, access_token, access_secret = getTwitterDevCreds()

    auth = tw.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    # Set the wait_on_rate_limit and wait_on_rate_limit_notify to True
    # wait_on_rate_limit – 
    #    Whether or not to automatically wait for rate limits to replenish
    # wait_on_rate_limit_notify – 
    #    Whether or not to print a notification when Tweepy is waiting 
    #    for rate limits to replenish
    api = tw.API(
        auth, 
        wait_on_rate_limit=True, 
        wait_on_rate_limit_notify=True)

    # Define the search term and the date_since date as variables
    search_words = "#covid OR #covid19 OR #COVID OR #COVID19 OR #ncov OR #corona OR #coronavirus"
    date_since = "2020-05-16"
    
    # Read the tweets
    tweets = tw.Cursor(api.search, 
                   q=search_words,
                   lang="en",
                   since=date_since)

    # Extract the data into a pandas DataFrame.
    # Other available fields: tweet.user.screen_name, tweet.retweet_count, tweet.favorite_count
    # Collect the rows in a list first: appending to a DataFrame inside the loop
    # copies the whole frame on every iteration.
    rows = []
    for tweet in tweets.items(10000):
        rows.append({'Id': tweet.id, 'Text': tweet.text, 'Location': tweet.user.location})
    tweetsDF = pd.DataFrame(rows, columns=['Id', 'Text', 'Location'])
    
    tweetsDF['index'] = tweetsDF.index
    
    # Save the new set of tweets in the file.
    tweetsDF.to_csv(fileName,index=False)



Let's see what the text data looks like.

In [76]:
tweetsDF.Text[1]

'RT @Carol_D_Johnson: Thank you nurses for helping to keep us healthy  ❤ #COVID19 \n#StayHomeSaveLives \n#coronavirus https://t.co/HGv0HfuTgt'

# Step 2. Data preprocessing
As you can see from the above text, a tweet contains a lot of text that probably carries no useful information for topic modeling. So, these tweets need to be processed to extract only the useful text for further analysis. We will perform the following data processing steps:

* Tweet Preprocessing:
> * Remove the leading **RT** - RT indicates that the user is re-posting someone else's tweet. We can remove this token.
> * Remove the references to other accounts. The other accounts are usually referenced with '@' symbol.
> * Remove urls mentioned in the tweets.

* Generic text preprocessing:
> * **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
> * Remove words that have 3 or fewer characters.
> * Remove all **stopwords**. [Stop words](https://en.wikipedia.org/wiki/Stop_words) usually do not carry any useful information, so they are generally removed from the text in the preprocessing stage. 
> * **Lemmatize** the words: words in third person are changed to first person and verbs in past and future tenses are changed into present.  
> Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. 
>> **WordNetLemmatizer**: looks up the lemma in the NLTK WordNet corpus, so it returns a valid word of the language.
> * **Stem** the words: words are reduced to their root form.  
> Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
>> **PorterStemmer**: is known for its simplicity and speed. The algorithm does not follow linguistics; it applies a set of rules in phases (step by step) to generate stems. This is why PorterStemmer often generates stems that are not actual English words.  
>> **SnowballStemmer**: is built on Snowball, a language for writing stemming rules. NLTK's SnowballStemmer supports several languages besides English, so it can also be used for non-English text.  
>> **LancasterStemmer**: is simple but aggressive. It applies its rules iteratively, so over-stemming may occur. Over-stemmed stems may not be linguistically valid, or they may have no meaning.
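
To make the order of these steps concrete, here is a small standard-library sketch of the tweet cleanup, tokenization, and filtering stages. It is a toy version under my own assumptions: the stop-word list is a tiny hand-written one rather than NLTK's, the function name `toy_preprocess` is hypothetical, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# A tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "and", "for", "our", "has", "you", "to", "us", "a", "at"}

def toy_preprocess(tweet, min_len=4):
    """Clean a tweet and return lowercase tokens (no stemming or lemmatization)."""
    text = re.sub(r"^RT\s+", "", tweet)           # leading retweet marker
    text = re.sub(r"@\w+", "", text)              # account mentions
    text = re.sub(r"https?://\S+", "", text)      # URLs
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic tokens
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOP_WORDS]

# A shortened version of the sample tweet shown earlier.
sample = ("RT @Carol_D_Johnson: Thank you nurses for helping to keep us healthy "
          "#COVID19 https://t.co/HGv0HfuTgt")
print(toy_preprocess(sample))
# ['thank', 'nurses', 'helping', 'keep', 'healthy', 'covid']
```

A stemmer applied to this output would then reduce "nurses" and "helping" to shorter stems.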
    

In [6]:
# Perform data preprocessing for all tweets.
stemmer = SnowballStemmer("english")

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def tweet_cleanup(text):
    # Remove only the leading RT (retweet marker), not every 'RT' inside the text
    text = re.sub(r'^RT\s+', '', text)
    # Remove the references to the account names starting with '@'
    # (handles may contain letters, digits and underscores)
    text = re.sub(r'@\w+', '', text)
    # Remove the urls in the tweet.
    text = re.sub(r'((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)','',text)
    
    return text
  
# Clean up the tweets and then Tokenize and lemmatize
def preprocess(text, stop_words=stop_words):
    result=[]
    text = tweet_cleanup(text)
    for token in gensim.utils.simple_preprocess(text) :
        if token not in stop_words and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result



In [7]:
# Test the preprocessing step on a sample tweet
tweet_num = 0
sampleTweet = tweetsDF[tweetsDF['index'] == tweet_num].Text.iloc[0]

print("Original tweet: ")
words = []
for word in sampleTweet.split():
    words.append(word)
print(words)
print("\n\nPreprocessed tweet: ")
print(preprocess(sampleTweet))

Original tweet: 
['RT', '@ALPublicHealth:', 'State', 'Health', 'Officer', 'Dr.', 'Scott', 'Harris', 'has', 'issued', 'a', 'stay', 'at', 'home', 'order', 'and', 'strict', 'quarantine', 'requirements.', 'Read', 'our', 'full…']


Preprocessed tweet: 
['state', 'health', 'offic', 'scott', 'harri', 'issu', 'stay', 'home', 'order', 'strict', 'quarantin', 'requir', 'read', 'full']


In [8]:
# Preprocess all tweets and generate a new processed tweet text dataset.
processed_tweets = tweetsDF['Text'].map(preprocess)
processed_tweets[:10]

0        [state, health, offic, scott, harri, issu, stay, home, order, strict, quarantin, requir, read, full]
1                                                      [thank, nurs, help, keep, healthi, covid, coronavirus]
2                                                                        [togetherapart, slow, spread, covid]
3                    [smoker, greater, risk, contract, coronavirus, elli, cannon, say, equal, risk, contract]
4                                      [video, model, scan, show, extent, covid, damag, lung, tissu, stayhom]
5                      [leader, hous, parti, caucus, arizona, andi, bigg, think, spread, covid, much, possib]
6                                                                                     [covid, test, administ]
7           [keep, think, master, public, health, write, doctor, dissert, global, effort, tackl, aid, pandem]
8    [ceylonblacktea, rich, theaflavin, help, increas, human, immun, covid, srilankatea, industri, successfu]
9         

# Step 3. Text representation
Computers don't understand natural language. To a computer, text is a mere sequence of characters. And even if a computer could tell what each sequence of letters means, language is far more complicated than that. For example, consider the idiom "kicked the bucket". You know where I am going, right? When I first heard that phrase as a kid, I thought it meant someone was actually kicking a bucket. That was fun! But when I realized that it meant someone had died, it was no longer fun. So, natural language is hard and computers don't understand it. 

Computers love numbers. At the core, computers perform their operations on numbers. So it helps to represent natural language text as numbers that algorithms can process easily. In the sections below, we'll explore two models for text representation: bag of words and TF-IDF.

## 3.1: Bag of words on the dataset
Create a dictionary of the words present in the processed_tweets dataset. Gensim offers a convenient API for this. The dictionary assigns a numerical id to each word, so that you can work with numbers rather than strings, which makes the data processing much easier. 

Then create a bag-of-words corpus in which each tweet is represented by the numerical ids of its words, along with how often each word occurs in that tweet.
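
The idea can be sketched with the standard library alone. The two toy documents and the helper name `to_bow` are made up for illustration; the point is only the shape of the result, a list of (token_id, count) pairs per document.

```python
from collections import Counter

docs = [
    ["stay", "home", "stay", "safe"],
    ["health", "order", "stay", "home"],
]

# Assign an integer id to every distinct token (sorted here for a stable order).
token2id = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(token2id)
# {'health': 0, 'home': 1, 'order': 2, 'safe': 3, 'stay': 4}
print([to_bow(d) for d in docs])
# [[(1, 1), (3, 1), (4, 2)], [(0, 1), (1, 1), (2, 1), (4, 1)]]
```

Word order is lost in this representation, which is why it is called a "bag" of words.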

In [9]:
dictionary = gensim.corpora.Dictionary(processed_tweets)

# Create Corpus: Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in processed_tweets]

In [10]:
# Check the id to word mapping from the dictionary created above
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 5:
        break

0 full
1 harri
2 health
3 home
4 issu
5 offic


Since the text corpus is large and sparse, we should try to reduce the vocabulary used for modeling. For this reason, let us remove very rare and very common words. The Gensim dictionary object provides an API for this operation. We remove:
- words appearing in fewer than 15 tweets
- words appearing in more than 10% of all tweets

Then convert the tweets into a bag-of-words corpus with the very rare and very common words filtered out.
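
As a rough illustration of what this filtering does, the sketch below applies the same two thresholds to a toy corpus using only the standard library. The function name `filter_by_doc_freq` and the toy data are hypothetical, and the thresholds are scaled down so that four documents are enough to show the effect.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than a fraction `no_above` of all documents."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]

docs = [
    ["covid", "stay", "home"],
    ["covid", "test", "home"],
    ["covid", "test", "rare"],
    ["covid", "stay", "mask"],
]
print(filter_by_doc_freq(docs, no_below=2, no_above=0.75))
# [['stay', 'home'], ['test', 'home'], ['test'], ['stay']]
```

Here "covid" is dropped for being in every document, while "rare" and "mask" are dropped for appearing only once.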

In [11]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_tweets]

# Test the Bag of Words representation of the tweet --> (token_id, token_count)
bow_corpus[tweet_num]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]

In [12]:
# Preview BOW for our sample preprocessed tweet
bow_tweet_0 = bow_corpus[tweet_num]

for i in range(len(bow_tweet_0)):
    print("Word {} (\"{}\") appears {} time.".format(bow_tweet_0[i][0], 
                                                     dictionary[bow_tweet_0[i][0]], 
                                                     bow_tweet_0[i][1]))

Word 0 ("health") appears 1 time.
Word 1 ("home") appears 1 time.
Word 2 ("order") appears 1 time.
Word 3 ("state") appears 1 time.
Word 4 ("stay") appears 1 time.


## 3.2: TF-IDF on the dataset
TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Summing the TF-IDF of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution.

TF (Term Frequency) - number of times a term occurs in a document.  
IDF (Inverse Document Frequency) diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
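
A small worked example may help. The sketch below weights each term count by log2(N / df) and then scales the vector to unit length. That is my understanding of the default behaviour of gensim's `TfidfModel`, not something stated in this notebook, so treat the exact formula as an assumption. The document frequencies are made up.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then scale to unit length."""
    weighted = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in weighted]

# Made-up document frequencies: token 0 is rare, token 2 is common.
doc_freq = {0: 2, 1: 4, 2: 8}
vec = tfidf_vector([(0, 1), (1, 1), (2, 1)], doc_freq, num_docs=16)
print([(tid, round(w, 3)) for tid, w in vec])
```

All three tokens occur once in the document, yet the rarest token gets the largest weight. With short tweets, where nearly every count is 1, the IDF part does almost all of the work.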

In [13]:
# Create tf-idf model object using models.TfidfModel
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)

# Apply transformation to the entire corpus
tfidf_corpus = tfidf[bow_corpus]

# Test the tf-idf representation of the sample tweet. Each word is represented by (token_id, tf-idf score).
tfidf_corpus[tweet_num]

[(0, 0.40009295170061265),
 (1, 0.44472924895798494),
 (2, 0.45253501051552114),
 (3, 0.47433139975516847),
 (4, 0.4608289406979316)]

In [14]:
# Preview TF-IDF for our sample preprocessed tweet
tfidf_tweet_0 = tfidf_corpus[tweet_num]

for i in range(len(tfidf_tweet_0)):
    print("Word {} (\"{}\") TF-IDF score: {}.".format(tfidf_tweet_0[i][0], 
                                                     dictionary[tfidf_tweet_0[i][0]], 
                                                     tfidf_tweet_0[i][1]))

Word 0 ("health") TF-IDF score: 0.40009295170061265.
Word 1 ("home") TF-IDF score: 0.44472924895798494.
Word 2 ("order") TF-IDF score: 0.45253501051552114.
Word 3 ("state") TF-IDF score: 0.47433139975516847.
Word 4 ("stay") TF-IDF score: 0.4608289406979316.


# Step 4: Topic modeling using LDA
A topic model is a statistical model for discovering the abstract topics in a collection of documents. Probabilistic Latent Semantic Analysis (PLSA) is one of the earliest topic models. Latent Dirichlet Allocation (LDA), a generalization of PLSA, is the most common topic model in use today. 

LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions. This algorithm tries to model the intuition that each document has different abstract topics and that each topic is generalized by a small number of words.
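
One way to build intuition is to run the story LDA assumes in the forward direction: draw a topic mixture for a document, then draw a topic and a word for each position. The sketch below does this with the standard library. The vocabulary, the two topic-word distributions, and the prior values are hand-picked for illustration and are not taken from the models trained in this notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one toy document: a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # document-topic proportions
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["test", "hospit", "stay", "home"]
# Two hand-written topic-word distributions (in LDA these are also Dirichlet draws).
topic_word = [[0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45]]
theta, words = generate_document(topic_word, vocab, doc_alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Training LDA goes the other way: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.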

In this section we'll be building the topic models using LDA for both text representations developed above. 


## Step 4.1: Modeling using Bag of Words
The LDA algorithm requires a few inputs to build the clusters. The main parameter it requires is the number of topics we want the model to cluster the tweets into. But how do we identify the number of topics? A practical way is to visualize the clusters themselves. 

Start with a high number of topics like 10 or 20. Then map the clusters into a vector space and see if the clusters have clear boundaries. If the clusters overlap, reduce the number of clusters, build the model and visualize again. Repeat the process until you are satisfied with the segregation of the clusters.

### Step 4.1.1: Running LDA using bag of words

In [15]:
# Train the lda model using gensim.models.LdaMulticore on Bag of word corpus
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=3, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [16]:
# Explore the words occurring in each topic and their relative weights
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.043*"work" + 0.038*"natur" + 0.035*"caus" + 0.034*"father" + 0.034*"murder" + 0.033*"speak" + 0.033*"trut" + 0.033*"poverti" + 0.033*"racism" + 0.030*"mask"


Topic: 1 
Words: 0.070*"coronavirus" + 0.065*"peopl" + 0.050*"test" + 0.041*"help" + 0.029*"affect" + 0.025*"everyon" + 0.021*"hospit" + 0.020*"fight" + 0.020*"poor" + 0.019*"either"


Topic: 2 
Words: 0.044*"health" + 0.034*"case" + 0.034*"keep" + 0.033*"pandem" + 0.026*"time" + 0.025*"death" + 0.024*"care" + 0.023*"human" + 0.023*"like" + 0.021*"trump"




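From the top words in each topic, we can identify the generic theme of that cluster. LDA is initialized randomly, so the topic numbering changes from run to run unless a seed is fixed. For the run printed above, the topics could be around  
Topic 0: Impact of COVID-19 on work, racism, and poverty.     
Topic 1: Testing and help for the people affected by COVID-19.   
Topic 2: Health concerns, cases, and deaths during the pandemic.  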

### 4.1.2. Visualization using pyLDAVis for LDA with BOW
In this section, we'll visualize the topics generated by the above LDA model using pyLDAvis library. pyLDAvis provides an amazing interactive visualization tool to see how different clusters are generated. It produces the intertopic distance map and shows top relevant terms for each topic amongst other features. 

As mentioned above, we'll visualize the intertopic distance map to see if there is a good segregation of clusters. If there is an overlap between multiple clusters, we'd reduce the number of topics and run the LDA model with reduced number of clusters and visualize again. Once we are satisfied with the cluster segregation in the intertopic distance map, we can start looking into the terms to see what each topic represents.

In [17]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary=lda_model.id2word)
vis



Based on the visualizations, it'd be best to create 3 clusters instead of 10.

## 4.2: Modeling using TF-IDF
TF-IDF is meant to reflect how important each word is to a tweet relative to the other tweets, so it tries to build a better model than the raw term frequency used in the bag-of-words model. However, TF-IDF works best when each document has a good amount of text, and a tweet is usually very short, so most words end up being mentioned only once. TF-IDF is therefore not expected to work better on short texts. Still, let's train the model and see how it performs.

### 4.2.1. Running LDA using TF-IDF

In [18]:
# Train lda model using corpus_tfidf
lda_model_tfidf = gensim.models.LdaMulticore(tfidf_corpus, 
                                             num_topics=3, 
                                             id2word = dictionary, 
                                             passes = 2, 
                                             workers=4)

In [19]:
# Explore the words occurring in each topic and their relative weights
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.058*"peopl" + 0.035*"fight" + 0.033*"help" + 0.028*"medic" + 0.025*"live" + 0.023*"crisi" + 0.022*"equip" + 0.021*"virus" + 0.021*"corona" + 0.020*"first"


Topic: 1 Word: 0.038*"case" + 0.037*"make" + 0.036*"work" + 0.031*"health" + 0.027*"natur" + 0.026*"caus" + 0.025*"murder" + 0.025*"speak" + 0.025*"father" + 0.025*"racism"


Topic: 2 Word: 0.046*"coronavirus" + 0.032*"take" + 0.032*"test" + 0.027*"death" + 0.027*"time" + 0.026*"pandem" + 0.026*"state" + 0.024*"stay" + 0.023*"need" + 0.023*"home"




Below is my attempt at generalizing the topics from their most relevant terms, for the run printed above.  
Topic 0: People fighting the crisis with medical help and equipment.   
Topic 1: Cases, work, and social issues such as racism.  
Topic 2: Testing, deaths, and staying at home.

### 4.2.2. Visualization using pyLDAVis

In [20]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_tfidf, tfidf_corpus, dictionary=lda_model_tfidf.id2word)
vis



# Step 5. Model evaluation
A topic model is a statistical tool that imposes a probability distribution over words. We can use different coherence measures to quantify the quality of the clusters and the probability distribution of the words. However, before we delve into the coherence measures, let's approach the evaluation in a more intuitive way: human validation. 

## 5.1. Evaluation by human validation
In this section, we'll classify a sample tweet and see how well the result matches the overall topics generated above.

### 5.1.1: Human validation for Bag of words model
Classify a sample tweet into the topics and then evaluate if the general topic matches with the tweet better than other topics.

In [21]:
# Our test tweet is 
print('Our test tweet is: {}: {}'.format(tweet_num, [dictionary[word[0]] for word in bow_corpus[tweet_num]]))

for index, score in sorted(lda_model[bow_corpus[tweet_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {} : {}".format(score, index, lda_model.print_topic(index, 10))) 

Our test tweet is: 0: ['health', 'home', 'order', 'state', 'stay']

Score: 0.5977810025215149	 
Topic: 2 : 0.044*"health" + 0.034*"case" + 0.034*"keep" + 0.033*"pandem" + 0.026*"time" + 0.025*"death" + 0.024*"care" + 0.023*"human" + 0.023*"like" + 0.021*"trump"

Score: 0.34373247623443604	 
Topic: 0 : 0.043*"work" + 0.038*"natur" + 0.035*"caus" + 0.034*"father" + 0.034*"murder" + 0.033*"speak" + 0.033*"trut" + 0.033*"poverti" + 0.033*"racism" + 0.030*"mask"

Score: 0.05848647654056549	 
Topic: 1 : 0.070*"coronavirus" + 0.065*"peopl" + 0.050*"test" + 0.041*"help" + 0.029*"affect" + 0.025*"everyon" + 0.021*"hospit" + 0.020*"fight" + 0.020*"poor" + 0.019*"either"


In the output above, the sample tweet is assigned mainly to Topic 2, with a score of about 0.60, followed by Topic 0 at about 0.34. The top words of Topic 2 include health, keep, and care, which fits a tweet about a state health officer issuing a stay-at-home order. We could try evaluating more tweets manually, but on this sample the BOW-based LDA model seems to work well.    
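
When checking many tweets this way, it can help to reduce each result to the dominant topic and its lead over the runner-up. The helper names below are hypothetical, and the scores are rounded from the output printed above.

```python
def rank_topics(topic_scores):
    """Sort (topic_id, score) pairs from highest to lowest score."""
    return sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

def top_margin(topic_scores):
    """Return the dominant topic id and its lead over the runner-up."""
    ranked = rank_topics(topic_scores)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_id, best_score - runner_up

# Scores for the sample tweet, rounded from the output above.
scores = [(2, 0.598), (0, 0.344), (1, 0.058)]
best_id, margin = top_margin(scores)
print(best_id, round(margin, 3))
```

A small margin suggests the tweet sits between two topics, which is worth knowing before trusting the label.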

### 5.1.2 Human validation for TF-IDF model

In [22]:
# Our test tweet is 
print('Our test tweet is: {}: {}'.format(tweet_num, [dictionary[word[0]] for word in tfidf_corpus[tweet_num]]))

for index, score in sorted(lda_model_tfidf[tfidf_corpus[tweet_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {} : {}".format(score, index, lda_model_tfidf.print_topic(index, 10)))

Our test tweet is: 0: ['health', 'home', 'order', 'state', 'stay']

Score: 0.6827797293663025	 
Topic: 2 : 0.046*"coronavirus" + 0.032*"take" + 0.032*"test" + 0.027*"death" + 0.027*"time" + 0.026*"pandem" + 0.026*"state" + 0.024*"stay" + 0.023*"need" + 0.023*"home"

Score: 0.1886352002620697	 
Topic: 1 : 0.038*"case" + 0.037*"make" + 0.036*"work" + 0.031*"health" + 0.027*"natur" + 0.026*"caus" + 0.025*"murder" + 0.025*"speak" + 0.025*"father" + 0.025*"racism"

Score: 0.128585085272789	 
Topic: 0 : 0.058*"peopl" + 0.035*"fight" + 0.033*"help" + 0.028*"medic" + 0.025*"live" + 0.023*"crisi" + 0.022*"equip" + 0.021*"virus" + 0.021*"corona" + 0.020*"first"


As can be seen above, the TF-IDF model assigns the sample tweet mainly to Topic 2, with a score of about 0.68, ahead of Topic 1 at about 0.19 and Topic 0 at about 0.13. In the TF-IDF model trained in section 4.2.1, the top words of Topic 2 include state, stay, and home, so the sample tweet matches this topic well. 

In this run, that score is slightly higher than the top score of about 0.60 from the bag-of-words model, so this single tweet does not show the weakness we expected from TF-IDF on short text documents. 

One tweet is weak evidence either way, and the scores change between runs because LDA is initialized randomly, so we'll also look at coherence measures. 

## 5.2 Coherence measures
A topic coherence measure scores a single topic by computing the semantic similarity between the top words in that topic. We can then average the scores of all topics to get the overall coherence measure for the model.

There are different coherence measures. In the evaluation below we'll use the C_v measure. C_v is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
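
NPMI is the building block that is easiest to show in isolation. The sketch below computes it for a pair of words from document-level co-occurrence counts. It is a simplification of what C_v does: the real measure counts co-occurrence inside a sliding window and then compares context vectors with cosine similarity, and the toy documents here are made up.

```python
import math

def npmi(word_a, word_b, docs):
    """Normalized pointwise mutual information from document-level co-occurrence."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0.0:
        return -1.0  # the words never co-occur
    if p_ab == 1.0:
        return 1.0   # the words co-occur in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

docs = [{"stay", "home"}, {"stay", "home", "test"}, {"test", "hospit"}, {"hospit", "care"}]
print(npmi("stay", "home", docs))    # always together
print(npmi("test", "home", docs))    # together as often as chance predicts
print(npmi("stay", "hospit", docs))  # never together
```

The value ranges from -1 (never together) through 0 (independent) to 1 (always together), so a topic whose top words have high pairwise NPMI reads as coherent.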

In [23]:
from gensim.models import CoherenceModel

# Compute Coherence Score for BOW based LDA model
cv_model_bow = CoherenceModel(model=lda_model, texts=processed_tweets, dictionary=dictionary, coherence='c_v')
cv_bow_overall = cv_model_bow.get_coherence()
cv_bow_pertopic = cv_model_bow.get_coherence_per_topic()
print('\nCoherence Score: {0}\nPer Topic: {1}'.format(cv_bow_overall , cv_bow_pertopic))


Coherence Score: 0.4304417242985787
Per Topic: [0.4493676045869341, 0.33174836101253, 0.510209207296272]


In [24]:
# Compute Coherence Score for TF-IDF based LDA model
cv_model_tfidf = CoherenceModel(model=lda_model_tfidf, texts=processed_tweets, dictionary=lda_model_tfidf.id2word, coherence='c_v')
cv_tfidf_overall = cv_model_tfidf.get_coherence()
cv_tfidf_pertopic = cv_model_tfidf.get_coherence_per_topic()
print('\nCoherence Score: {0}\nPer Topic: {1}'.format(cv_tfidf_overall, cv_tfidf_pertopic))



Coherence Score: 0.5277570239510211
Per Topic: [0.5811252711244042, 0.46330507155149236, 0.5388407291771666]


Based on the coherence measures, the TF-IDF model seems to have grouped the topics with better semantic similarity between the words. This runs against the expectation from section 4.2 that TF-IDF would struggle with short tweets. However, the human validation in the previous section covered only a single sample tweet, so it cannot confirm or refute this. 

# Step 6. Hyper parameter tuning 
Now that we have established the modeling and performance evaluation methods, let's tune the number-of-topics hyperparameter. We'll define a range for the number of topics, train a model for each value, and record its coherence score. At the end we'll plot the scores to see which number of topics works best.

We'll repeat the same for both BOW and TF-IDF models and compare the scores to choose the best model.

The LDA model has more hyperparameters, namely the alpha and beta priors (gensim calls beta eta). We could tune them in a similar fashion. 

For parameter tuning, sklearn exposes a GridSearchCV API that can be configured easily with ranges for different hyperparameters. For this, we could use the LdaTransformer wrapper exposed by gensim. However, that approach uses the default log_likelihood score for tuning the hyperparameters. Since we decided to use coherence scores, we'll tune the hyperparameters in a loop instead of using GridSearchCV.
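
If alpha and beta were tuned too, the loop would simply cover every combination of values. The sketch below shows that structure with `itertools.product`. The `grid_search` helper and the `fake_coherence` scorer are hypothetical stand-ins so that the example runs on its own; in this notebook the scorer would train an LDA model with the given settings and return its coherence score.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Score every parameter combination; return (best_params, best_score, all_results)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score_fn(**params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results

# Stand-in scorer with made-up behaviour, NOT a real coherence computation.
def fake_coherence(num_topics, alpha, eta):
    return 1.0 / num_topics + 0.1 * alpha - 0.05 * eta

grid = {"num_topics": [2, 3, 4], "alpha": [0.01, 0.1], "eta": [0.01, 0.1]}
best_params, best_score, results = grid_search(fake_coherence, grid)
print(best_params, round(best_score, 4))
```

The number of models to train is the product of the list lengths (12 here), so the grid grows quickly as more values are added.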

In [69]:
def compute_coherence_values(corpus, dictionary, num_topics):
    lda_model = gensim.models.LdaMulticore(corpus, 
                                       num_topics=num_topics, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)
        
    coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_tweets, dictionary=dictionary, coherence='c_v')
    
    return lda_model, coherence_model_lda.get_coherence()

In [80]:
import tqdm

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Corpora to compare, with a display name for each
model_sets = [bow_corpus, tfidf_corpus]
model_title = ['BOW', 'TF-IDF']

# Collect the results in a list and build the DataFrame once at the end
results = []

pbar = tqdm.tqdm(total=len(model_sets)*len(topics_range))

# Iterate through the corpora
for title, corpus_set in zip(model_title, model_sets):
    # Iterate through the number of topics
    for k in topics_range:
        # Get the coherence score for the given parameters
        model, cv = compute_coherence_values(corpus=corpus_set, dictionary=dictionary, num_topics=k)

        # Save the model results
        results.append({'Model': title, 'Num_Topics': k, 'Coherence': cv})
        pbar.update(1)

pbar.close()

model_results = pd.DataFrame(results, columns=['Model', 'Num_Topics', 'Coherence'])
model_results.to_csv('lda_tuning_results.csv', index=False)



100%|██████████| 18/18 [00

In [81]:
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

fig = px.line(model_results, x='Num_Topics', y='Coherence',color='Model')
fig.show()

From the plot above, it seems that both the BOW and TF-IDF models perform best with 2 topics.