### 10-30-2020


# TWEETINSIGHTS - PART 4 

# TOPIC EXTRACTION USING Latent Dirichlet Allocation

### ALEX MAZZARELLA

### DATA SCIENCE full time course - BrainStation
### CAPSTONE PROJECT

# =============================================================

In this notebook, we will attempt to extract meaningful topics from our dataset.
To do so, we will use the `LatentDirichletAllocation` model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

Before doing so, we need to preprocess our data. In this part of the analysis, we will only use the `tweet` text from our dataset.
The steps will be:

* vectorize our tweets using a CountVectorizer. Since we are processing text generated by users on a social media channel, we will need to address several features that are not common in other types of text (e.g. scientific papers, books, news articles).
* fit the LDA model
* run a grid search for hyperparameter optimization
* visualize our results (we are going to use pyLDAvis)

Let's start by importing the required libraries and packages, and after that define and implement a few functions that we are going to use throughout the notebook!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

import nltk
# uncomment on the first run to download the required NLTK data
# nltk.download('wordnet')
# nltk.download('stopwords')
from string import punctuation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aless\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# reading csv with clean tweets
df_net = pd.read_csv('master_tweets/clean_netflix/clean_netflix.csv')

In [3]:
# limit the column width so that pandas output fits within the cells
pd.set_option('display.max_colwidth', 30)

In [4]:
df_net.head(3)

Unnamed: 0,created_at,followers_count,friends_count,handle,hashtags,retweet,tweet,tweet_id,user_mentions
0,2020-09-26 15:01:29+00:00,327,570,kaseydrzazga,[],0,@Spacefunmars @RyanPGoldch...,1309870714116337664,[{'screen_name': 'Spacefun...
1,2020-09-26 15:01:28+00:00,496,294,PlainPotatoTay,[],0,@charityfaith @netflix @Hu...,1309870713441005568,[{'screen_name': 'charityf...
2,2020-09-26 15:01:28+00:00,3,13,dsLdHzRDPbkII4p,[],1,"Episodes 13 &amp; 14 of ""A...",1309870711943487488,[{'screen_name': 'arashi5o...


In [5]:
df_net.shape

(25230, 9)

# =========================================================================

# FUNCTIONS DEFINITION

In this section, I will define most of the functions used in the notebook. Mostly these will be used for preprocessing the data.

As this section is quite 'verbose', if you are not particularly interested in knowing right now what is under the hood of each function, feel free to scroll down to the next section, 'Documents vectorization'. (You can always come back to check the code of these when they are called from the tw_tokenizer).

In [6]:
# defining function to convert emojis into words
def convert_emojis(s):
    '''
    Converts known emojis into text, using the UNICODE_EMO dictionary
    
    INPUT: string that may contain emojis
    
    OUTPUT: the string with each known emoji replaced by a human-readable description
    '''
    for emot in UNICODE_EMO:
        s = s.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return s

# uncomment to see the example
# convert_emojis('The new movie on Netflix is fun! 😎')

In [7]:
# converting emoticons into words
# this function has not actually been used in the end,
# but I will leave it for possible future use

def convert_emoticons(text):
    '''
    Converts known emoticons into text, using the EMOTICONS dictionary
    
    INPUT: string that may contain emoticons
    
    OUTPUT: the string with each known emoticon replaced by a human-readable description
    '''
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

# uncomment to see the example
# convert_emoticons('Tonight I will watch my favourite show :-) ')

In [8]:
####################################################################################
# The function in this cell converts abbreviated expressions into their extended text
# e.g. ASAP --> as soon as possible
# the expressions and their extended text description are saved in a text file that I will
# first load into a set.
# The initial file was downloaded from this repo 
# https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
# and then I have added additional expressions
####################################################################################


# initialize dictionary and list for short words
short_words_map_dict = {}
short_words_list = []

# load the text file with short words and slang
# the list of expressions can be updated in the text file
# format of each line of the file is: ASAP=As Soon As Possible
with open('data/short_words_slang.txt', 'r') as file:
    short_words_str = file.read()

# for loop to create dictionary of 
# {' slang abbreviation' : 'text expression'}

for line in short_words_str.split("\n"):
    if line != "":
        # split abbreviation and text expression at '=' sign
        sw = (line.split("="))[0]
        # changing sw to lowercase
        sw = sw.lower()
        sw_expanded = (line.split("="))[1]
        # append the abbreviation to list
        short_words_list.append(sw)
        # add abbreviation and related text expression to dictionary
        short_words_map_dict[sw] = sw_expanded.lower()

# converting the list of short expressions to a set, which drops any duplicates
short_words_list = set(short_words_list)


def short_words_conversion(s):
    '''
    converts a shortened expression into equivalent text.
    Function is not case sensitive.
    
    INPUT: parameter is a string 
    OUTPUT: the input string with short expressions converted into extended text
    '''
    
    # first step, assert that the parameter passed in input is a string
    assert isinstance(s, str), 'only str data type can be processed from short_words_conversion'
    
    s = s.lower()
    new_s = []
    for word in s.split():
        if word in short_words_list:
            new_s.append(short_words_map_dict[word])
        else:
            new_s.append(word) 
    return " ".join(new_s)



# example of function (uncomment to run)
# short_words_conversion("hello BrB 4U AsAp")
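
The lookup above can be sketched in a self-contained way. The snippet below parses `ABBR=Expansion` lines from an in-memory string instead of `data/short_words_slang.txt`. The sample entries and the names `parse_slang_lines` and `expand_short_words` are assumptions made for illustration, not part of the project files.

```python
def parse_slang_lines(text):
    """Build {abbreviation: expansion} from lines shaped like 'ASAP=As Soon As Possible'."""
    mapping = {}
    for line in text.splitlines():
        # skip blank or malformed lines that have no '=' separator
        if "=" not in line:
            continue
        # split only at the first '=', so an expansion may itself contain '='
        abbr, expansion = line.split("=", 1)
        mapping[abbr.strip().lower()] = expansion.strip().lower()
    return mapping


def expand_short_words(s, mapping):
    """Lower-case the input and replace every whitespace-separated token found in mapping."""
    return " ".join(mapping.get(word, word) for word in s.lower().split())


# hypothetical sample entries, for illustration only
sample = "ASAP=As Soon As Possible\nBRB=Be Right Back\n\n4U=For You"
slang = parse_slang_lines(sample)
print(expand_short_words("hello BrB 4U AsAp", slang))
# prints: hello be right back for you as soon as possible
```

Looking words up directly in the dictionary (`mapping.get`) makes the separate list of abbreviations unnecessary, since a dictionary already stores each key only once.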

In [9]:
# Creating customized version of punctuation
# Because we are analyzing tweets, we would not want to have the '#' and '@' removed
# from our documents. (Will call it tw_punct)
# Also, tweets have bullet points and apostrophes that are not included in the 
# punctuation records.

# initializing the punctuation string
tw_punct = punctuation.replace('#', '').replace('@', '').replace('_','')
tw_punct += '•'
tw_punct += '’'
tw_punct += '”'

def tw_remove_punctuation(s):
    '''
    Removes punctuation from a tweet text. Leaves characters "#" and "@".
    Does not assert if input is string.
    
    Input: string of tweet.
    Output: string of tweet with punctuation removed.
    '''
    
    for punct in tw_punct:
        s = s.replace(punct, '')
    return s

# example of function (uncomment to run)
# print(tw_remove_punctuation("if you $$///see $$$$$$$()()()()()()no ==++punctuation%! except #poundsign and @at_sign,:;;;--, the-'| function+=- works"))
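
As an alternative sketch, the same removal can be done in a single pass with `str.translate`. The character set mirrors `tw_punct` above; the names `chars_to_remove`, `removal_table` and `remove_punct_translate` are hypothetical and only used for this example.

```python
from string import punctuation

# same character set as tw_punct: keep '#', '@' and '_', add Twitter-specific marks
keep = "#@_"
extra = "•’”"
chars_to_remove = "".join(c for c in punctuation if c not in keep) + extra

# with three arguments, str.maketrans maps every character of the third one to None
removal_table = str.maketrans("", "", chars_to_remove)


def remove_punct_translate(s):
    """Strip punctuation in one pass, leaving '#', '@' and '_' untouched."""
    return s.translate(removal_table)


print(remove_punct_translate("great show!!! #netflix @at_sign, don’t miss it..."))
# prints: great show #netflix @at_sign dont miss it
```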

In [10]:
################################################################################
# This cell contains a dictionary with 
# - apostrophed expressions as keys (e.g. can't), 
# - extended expressions as values (e.g. cannot)
# These will be used in the function 'remove_apostrophes'.
# I have downloaded the dictionary from 
# https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view
# then duplicated the expressions using the Twitter " ’ " apostrophe,
# and added expressions without apostrophes (e.g. cannot, neednt)
################################################################################


# %load data/appos_abbr.py

appos_abbr = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "i would",
"i'll" : "i will",
"i'm" : "i am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "i have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"should've" : "should have",
"that's" : "that is",
"that'll" : "that will",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll": "we will",
"didn't": "did not",
"aren’t" : "are not",
"arent" : "are not",
"can’t" : "cannot",
"couldn’t" : "could not",
"didn’t" : "did not",
"doesn’t" : "does not",
"don’t" : "do not",
"hadn’t" : "had not",
"hasn’t" : "has not",
"haven’t" : "have not",
"he’d" : "he would",
"i’d" : "i would",
"he’ll" : "he will",
"he’s" : "he is",
"i’ll" : "i will",
"i’m" : "i am",
"isn’t" : "is not",
"it’s" : "it is",
"it’ll":"it will",
"i’ve" : "i have",
"let’s" : "let us",
"mightn’t" : "might not",
"mustn’t" : "must not",
"shan’t" : "shall not",
"she’d" : "she would",
"she’ll" : "she will",
"she’s" : "she is",
"shouldn’t" : "should not",
"should’ve" : "should have",
"that’s" : "that is",
"that’ll" : "that will",
"there’s" : "there is",
"they’d" : "they would",
"they’ll" : "they will",
"they’re" : "they are",
"they’ve" : "they have",
"we’d" : "we would",
"we’re" : "we are",
"weren’t" : "were not",
"we’ve" : "we have",
"what’ll" : "what will",
"what’re" : "what are",
"what’s" : "what is",
"what’ve" : "what have",
"where’s" : "where is",
"who’d" : "who would",
"who’ll" : "who will",
"who’re" : "who are",
"who’s" : "who is",
"who’ve" : "who have",
"won’t" : "will not",
"wouldn’t" : "would not",
"you’d" : "you would",
"you’ll" : "you will",
"you’re" : "you are",
"you’ve" : "you have",
"’re": " are",
"wasn’t": "was not",
"we’ll":"we will",
"didn’t": "did not",
"couldnt" : "could not",
"didnt" : "did not",
"doesnt" : "does not",
"dont" : "do not",
"hadnt": "had not", 
"hasnt": "has not",
"havent":"have not",
"isnt":"is not",
"mightnt": "might not",
"mustnt":"must not",
"neednt":"need not",
"shant":"shall not",
"shes":"she is",
"shouldnt":"should not",
"shouldve":"should have",
"thatll":"that will",
"wasnt":"was not",
"werent":"were not",
"wont":"will not",
"wouldnt":"would not",
"youd":"you would",
"youll":"you will",
"youre":"you are",
"youve":"you have",
"y'all" : "you all",
"cannot" : "can not"
}

In [11]:
# This function replaces expressions with apostrophes with their extended version

def remove_apostrophes(s):
    '''
    Converts known abbreviations with apostrophes to extended version.
    The dictionary "appos_abbr" must be loaded prior to initializing this function.
    Use command --> %load <your_path>/appos_abbr.py
        
    Input: lower case string containing expressions with apostrophes. The input string must be
    lower case, as the function only works with lower case.
    
    Output: string with known apostrophed expression converted to extended version.
    '''
    words = s.split()
    s = [appos_abbr[word] if word in appos_abbr else word for word in words]
    s = " ".join(s) 
    return s

# example of function in/out - uncomment to run
# print(remove_apostrophes("Hey,I can't tell you how much i'd like to, but we werent aware you'd want us to come"))

In [12]:
# this function returns a tokenized version of the text document passed as a parameter 

def tw_tokenizer(s):
    """
    Tokenizer built for social media text data.
    Returns a tokenized version of the text document passed as a parameter.
    Asserts that the parameter is a string.
    
    INPUT: text document in string format.
    
    OUTPUT: list of lemmatized tokens.
    
    Note: this function uses other subfunctions for steps 2 to 5
    
    Tokenizing steps: 
    1-lower cases the document
    2-converts (known) emojis to their human readable text meaning
    3-converts abbreviated words to their extended expressions (e.g. asap --> as soon as possible)
    4-converts expressions with apostrophes to their extended text version (e.g. can't --> cannot)
    5-removes punctuation 
    6-replaces new line characters with blank space
    7-removes web site links and words with 3 characters or less
    8-lemmatizes the tokens
    """
    
    # assert that the parameter passed in input is a string
    assert isinstance(s, str), 'only str data type can be processed from tw_tokenizer'
        
    # lower-casing the string
    s = s.lower()  
    
    # converting emojis to text
    s = convert_emojis(s)
    
    # converting shortened expressions into extended text
    s = short_words_conversion(s)
        
    # remove numbers
    #s = re.sub(r'\d+', '', s)
    
    # converting expressions with apostrophes
    s = remove_apostrophes(s)
        
    # punctuation removal
    s = tw_remove_punctuation(s)
    
    # spaces in place of new line characters
    s = s.replace('\n', ' ')
          
    # split the string at each space to make the list of tokens (uncleaned)
    tokens = s.split()
    
    # remove hyperlinks and tokens with 3 characters or less
    tokens_new = [token for token in tokens if token[:4] != 'http' and len(token) > 3] 
   
    # create WordNetLemmatizer object
    wnl = nltk.stem.WordNetLemmatizer()

    # list of part-of-speech tags
    pos_tags = ['v', 'n', 'a']
    
    # initiate empty list to collect lemmatized tokens
    tokens_lem = list()

    # loop through each token
    for token in tokens_new:

        # loop through each part-of-speech tag
        for pos_tag in pos_tags:

            # lemmatize each token using each part-of-speech tag
            token = wnl.lemmatize(word=token, pos=pos_tag)

        # append the lemmatized token to the new list
        tokens_lem.append(token)
    
    return tokens_lem
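
One detail of step 7 is worth illustrating: because punctuation is removed before the tokens are filtered, a link such as `https://t.co/abc123` reaches the filter as `httpstcoabc123`, which still starts with `http` and is therefore dropped. The helper name `filter_tokens` below is hypothetical; it is only a sketch of that filtering step, not a replacement for the tokenizer.

```python
def filter_tokens(tokens, min_len=4):
    """Drop link-like tokens and tokens shorter than min_len characters."""
    return [t for t in tokens if not t.startswith("http") and len(t) >= min_len]


# the string below imitates a tweet after lower-casing and punctuation removal
cleaned = "watch the new show httpstcoabc123 now #netflix"
print(filter_tokens(cleaned.split()))
# prints: ['watch', 'show', '#netflix']
```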

# ========================================================================

# DOCUMENTS VECTORIZATION

We will now vectorize the documents. We will use a count vectorizer, so that every occurrence of a word carries the same weight (unlike TF-IDF, which down-weights terms that appear in many documents).

We will not set any minimum or maximum document frequency (min_df, max_df) right away. We first want to see how many tokens the vectorizer returns, to evaluate what filtering is appropriate.

## Count Vectorizer

In [13]:
# initializing the variable stop_w with the NLTK stop words
stop_w = nltk.corpus.stopwords.words('english')
    
# adding additional stop words
stop_w.extend(['#netflix', '@netflix','could', 'cannot', 'might', 'must', 'need', 'neednt', 'shall', 'win','yall', 'would', 'never'])

Since we used '#netflix' and '@netflix' as search terms for the tweets, at least one of the two will be in each tweet.
They will therefore not add any value to the analysis, so we have added them to the stop words.
In addition to that, we added other stop words not included in NLTK.
Note: if these are not added, we will get a warning when fitting the vectorizer, saying that the stop words may be inconsistent with our tokenizer.

In [14]:
%%time

# instantiating the CountVectorizer
# text is already converted to lowercase in the tokenizer,
# to make sure text is lowercased before other functions that 
# require lowercase input

my_cv = CountVectorizer(lowercase = False, tokenizer = tw_tokenizer, stop_words =stop_w )

# fitting and transforming in one step - faster than doing it separately 
# the sparse matrix will be saved in bow (bag of words)
bow = my_cv.fit_transform(df_net['tweet'])

Wall time: 31.5 s


In [15]:
# checking the size and count of non zero elements in the bag of words sparse matrix
bow

<25230x44527 sparse matrix of type '<class 'numpy.int64'>'
	with 264709 stored elements in Compressed Sparse Row format>

By vectorizing the text, we get a sparse matrix of 25,230 x 44,527. Therefore, after preprocessing the text, we have 44,527 unique tokens.
Because the documents (tweets) have been collected from social media, I expect a very large number of tokens to have a low frequency.
To check that, I will create a dataframe from the dense version of the matrix and inspect those values.

I will most likely not need all those data points for the scope of this project.

**NOTE**: depending on the memory of your machine, this operation might not be executable. Evaluate whether you want to run it, as the next cells until re-vectorization are only for demonstration purposes. 
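
If the dense dataframe does not fit in memory, the same per-token totals can be computed directly on the sparse matrix. The snippet below is a minimal sketch on a toy matrix (`toy_bow` is a made-up stand-in for `bow`); it is not the approach used in the cells that follow.

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy document-term matrix (4 documents x 5 tokens), standing in for `bow`
toy_bow = csr_matrix(np.array([
    [1, 0, 0, 2, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 3, 0],
    [0, 0, 0, 1, 1],
]))

# summing a sparse matrix along axis 0 returns a 1 x n_tokens matrix; flatten it
token_totals = np.asarray(toy_bow.sum(axis=0)).ravel()

print(token_totals)                      # total count of each token
print(np.median(token_totals))           # median of the totals
print(np.quantile(token_totals, 0.75))   # 75th percentile of the totals
```

The sparse matrix is never converted with `toarray()`, so memory usage stays proportional to the number of stored elements.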

In [16]:
# creating a dataframe from the bag of words
# (in newer scikit-learn releases, get_feature_names() is replaced by get_feature_names_out())
df_net_vec = pd.DataFrame(data=bow.toarray(), columns=np.array(my_cv.get_feature_names()))

We have 44,527 columns - let's check the median of the column sums.

In [17]:
# check the median of the sums of the columns
df_net_vec.sum(axis = 0).sort_values(ascending = False).median()

1.0

The median is 1, so at least 50% of the tokens appear only once. Tokens with this frequency will likely not be relevant for our scope.

Let's also check the 75th, 95th and 96th percentiles.

In [18]:
df_net_vec.sum(axis = 0).sort_values(ascending = False).quantile(0.75)


2.0

In [19]:
df_net_vec.sum(axis = 0).sort_values(ascending = False).quantile(0.95)

17.0

In [20]:
df_net_vec.sum(axis = 0).sort_values(ascending = False).quantile(0.96)

23.0

At least 75% of the tokens appear 2 times or less. This is quite a small frequency for over 25,000 tweets!
Since the documents come from a social media platform, I expect most of the terms with frequency 1 or 2 to be typos, uncommon handles, uncommon first names, or rarely used emojis.

Also, 95% of tokens appeared 17 times or less and 96% 23 times or less.

Let's now fit and transform the documents (tweets) again and discard any token that appears in fewer than 20 tweets (20 is halfway between 17 and 23). Note that `min_df` counts the number of documents containing a token, not its total number of occurrences.
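
The `min_df` parameter of `CountVectorizer` is a document-frequency threshold. The toy corpus below (made up for this example, using the default tokenizer rather than `tw_tokenizer`) sketches the difference between document frequency and total count.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "netflix netflix netflix netflix",  # 'netflix' appears 4 times, but in one document only
    "watch show",
    "watch movie",
]

# min_df=2 keeps only the tokens that occur in at least 2 documents
cv = CountVectorizer(min_df=2)
cv.fit(docs)
print(sorted(cv.vocabulary_))
# prints: ['watch']
```

'netflix' is dropped despite its four occurrences, because they all sit in a single document.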

## Re-vectorizing data

In [21]:
%%time

# text is already converted to lowercase in the tokenizer,
# since other preprocessing functions require lowercase input

my_cv = CountVectorizer(lowercase = False, tokenizer = tw_tokenizer, stop_words =stop_w, min_df = 20 )

# fitting and transforming in one step - faster than doing it separately 
# the sparse matrix will be saved in bow (bag of words)
bow = my_cv.fit_transform(df_net['tweet'])

Wall time: 33.1 s


In [22]:
bow

<25230x1973 sparse matrix of type '<class 'numpy.int64'>'
	with 172045 stored elements in Compressed Sparse Row format>

By keeping only tokens that appear in at least 20 tweets, we end up with 1,973 unique tokens. Note that this is ~4.4% of the prior vocabulary.

Let's have a quick visualization of the most and least popular terms in our bag of words.

In [23]:
# creating dataframe from the bag of words
# (in newer scikit-learn releases, get_feature_names() is replaced by get_feature_names_out())
df_net_vec = pd.DataFrame(data=bow.toarray(), columns=np.array(my_cv.get_feature_names()))

In [24]:
# 25 most popular words
df_net_vec.sum(axis = 0).sort_values(ascending = False).head(25)

watch      4709
show       2186
netflix    2162
season     1944
movie      1579
love       1568
good       1513
like       1496
series     1341
make       1311
please     1301
time       1051
know       1013
think       958
thank       946
film        941
come        927
episode     902
want        846
look        769
take        762
really      759
great       744
back        734
give        652
dtype: int64

In [25]:
# 25 of the least popular words
df_net_vec.sum(axis = 0).sort_values(ascending = False).tail(25)

mildred             20
sparkling_heart     20
asleep              20
#jurassicpark       20
solo                20
#supportrottmnt     20
@ivankatrump        20
unfortunately       20
surfshark           20
fill                20
ultra               20
shut                20
solution            20
gold                20
#elonaholmes        20
quickly             20
site                20
grinning_face       20
@lefty_lucie        20
heist               20
propaganda          20
#fashion            20
@joebiden           20
@jasonsfolly        20
@mrsrabbitresist    20
dtype: int64

A good number of the least popular words are handles and hashtags. However, since we have almost 2,000 terms in our vocabulary, this is only a very small sample. We cannot draw relevant conclusions just by reading them, and doing so for a larger, more significant sample (say 200 words) would be highly time consuming (and prone to introducing personal bias).

Note that one of the terms is '#elonaholmes', which might be a typo for Enola Holmes (a recent movie).

Let's proceed then with the next step, where we will try to extract the most commonly discussed topics.

# ===========================================================

# TOPIC MODELLING

## LDA

LDA (Latent Dirichlet Allocation) is a statistical topic modelling technique. The idea behind it is that, in a text dataset, each document can be described by a distribution of topics and each topic can be described by a distribution of words.

The goal of LDA is to map all the documents to the topics in such a way that the words in each document are mostly captured by those (latent) topics.
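
To make the two distributions more concrete, here is a small sketch of the generative story that LDA assumes, written with NumPy. The vocabulary, the number of topics and the Dirichlet parameters are arbitrary values chosen for illustration; fitting LDA is the reverse problem of recovering such distributions from observed documents.

```python
import numpy as np

rng = np.random.default_rng(17)

vocab = ["watch", "season", "movie", "love", "renew", "trailer"]
n_topics, doc_len = 2, 8

# each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(np.ones(len(vocab)) * 0.5, size=n_topics)

# each document is a probability distribution over the topics
doc_topic = rng.dirichlet(np.ones(n_topics) * 0.5)

# generate one document: draw a topic for each word slot, then a word from that topic
topics_drawn = rng.choice(n_topics, size=doc_len, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[z])] for z in topics_drawn]

print(doc_topic)   # topic mixture of the document
print(words)       # the generated "tweet"
```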

After fitting the model, we will use the pyLDAvis package to try to visualize the topics that the model has extracted.

Let's fit the model first. I will use the LDA from scikit-learn.
One downside of fitting this model in an exploratory phase is that it requires the user to specify how many topics should be extracted. Since we don't know that information at this stage, we will have to make an uninformed decision.

We will start from 12 topics since, during my learning process, I have seen that datasets of similar size normally show between 6 and 18 topics. Once again, this is only a starting point before running further optimization steps.

We will also set the maximum number of iterations to 30, to see whether the performance improves with more iterations than the default (10).

To measure the performance, we will consider the perplexity score. We will discuss it in more detail right after we have our first score.

In [26]:
%%time
# instantiating lda model
# because we have a relatively small dataset, I will set max_iter to 30

tw_lda = LatentDirichletAllocation(n_components = 12,
                                   learning_method = 'batch',
                                   max_iter = 30,
                                   # perplexity evaluated every 5 iterations
                                   evaluate_every=5,
                                   n_jobs = -1,
                                   verbose = 1,
                                   random_state = 17)

Wall time: 1.99 ms


In [27]:
%%time
# fitting lda
# verbosity set to 1 to visualize if the performance improves 
# as we run additional iterations (evaluated every 5 iterations)

tw_lda.fit(df_net_vec)

iteration: 1 of max_iter: 30
iteration: 2 of max_iter: 30
iteration: 3 of max_iter: 30
iteration: 4 of max_iter: 30
iteration: 5 of max_iter: 30, perplexity: 1020.2187
iteration: 6 of max_iter: 30
iteration: 7 of max_iter: 30
iteration: 8 of max_iter: 30
iteration: 9 of max_iter: 30
iteration: 10 of max_iter: 30, perplexity: 947.2814
iteration: 11 of max_iter: 30
iteration: 12 of max_iter: 30
iteration: 13 of max_iter: 30
iteration: 14 of max_iter: 30
iteration: 15 of max_iter: 30, perplexity: 928.8356
iteration: 16 of max_iter: 30
iteration: 17 of max_iter: 30
iteration: 18 of max_iter: 30
iteration: 19 of max_iter: 30
iteration: 20 of max_iter: 30, perplexity: 921.8677
iteration: 21 of max_iter: 30
iteration: 22 of max_iter: 30
iteration: 23 of max_iter: 30
iteration: 24 of max_iter: 30
iteration: 25 of max_iter: 30, perplexity: 918.4118
iteration: 26 of max_iter: 30
iteration: 27 of max_iter: 30
iteration: 28 of max_iter: 30
iteration: 29 of max_iter: 30
iteration: 30 of max_iter: 3

LatentDirichletAllocation(evaluate_every=5, max_iter=30, n_components=12,
                          n_jobs=-1, random_state=17, verbose=1)

In [28]:
# final perplexity score on the training data (stored by the model after fitting)
tw_lda.bound_

917.0773102630776

The perplexity score improved significantly over the first 10 iterations. After that, it kept improving, but with much smaller gains.

The purpose of the perplexity score is to give us a measure of how well the model represents (or reproduces) the statistics of a set of documents, ideally held-out data. Note that the scores printed above were computed on the same data the model was fitted on.
Another measure that LDA models in general provide is the log-likelihood.
Without diving too deep into the theory behind the two metrics, we can say that perplexity is derived from the log-likelihood normalized by the number of words: scikit-learn defines it as exp(-1 * log-likelihood per word).
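
The relationship between the two metrics can be sketched in a few lines. The function below assumes the definition given in the scikit-learn documentation, exp(-1 * log-likelihood per word); the name `perplexity_from_log_likelihood` and the numbers are made up for illustration.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_words):
    """exp(-log-likelihood per word): the lower, the better."""
    return math.exp(-log_likelihood / n_words)


# a model that assigns probability 1/4 to each of 10 words has a perplexity close to 4:
# it is as "undecided" as a uniform choice among 4 words
log_likelihood = 10 * math.log(1 / 4)
print(perplexity_from_log_likelihood(log_likelihood, 10))
```

Read this way, a perplexity around 900 means that, on average, the model is about as uncertain as if it had to choose uniformly among roughly 900 tokens for each word.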

One thing to keep in mind is that, since text can come from very different sources (e.g. social media, scientific documents, news articles), it is hard to define a range of perplexity values that would be optimal for our specific target (although, in general, lower values are better).
Also, when evaluating such a score, we have to consider whether we are working with unigrams, bigrams or trigrams.

That considered, since we are analyzing text data from social media posts, perplexity might not be the best indicator of model performance, because human judgement and perplexity scores are often not correlated.

In a nutshell, using the perplexity score to evaluate the model for our project, and tweaking the parameters to decrease it, might not lead us to optimal results.

We will see that *coherence* might be a more appropriate indicator when we test LDA Multicore (other notebook of this project). For the current analysis, we will give only relative importance to the perplexity score, using it to narrow down a short list of candidate numbers of topics, and then evaluate the resulting topics by visual inspection of the plot (LDAvis).

Let's start by visualizing the topics extracted by our model with n_components = 12.

In [29]:
# visualizing topics extracted from model through LDAvis

tw_lda_graph = pyLDAvis.sklearn.prepare(tw_lda, bow, my_cv)
pyLDAvis.display(tw_lda_graph)

In [30]:
# saving visualization in html format
pyLDAvis.save_html(tw_lda_graph,'lda_vis/sklearn/lda_first_12topics.html')

A few considerations on the LDAvis.

* the visualization is split in two parts. On the left, each topic is represented by a 'bubble'; on the right, there is a list of the tokens most relevant to the selected topic. For each word, the chart also shows its overall frequency (in our corpus) and an estimate of its frequency inside the selected topic. (To select a topic, simply click on a bubble.)

* the visualization of an ideal model would show bubbles evenly spread across the chart with minimal overlap. My experience is still limited, but this seems quite rare for Twitter text analysis. It seems more common to have visualizations with a larger number of topics and some overlap.


That said, the discussion topics that seem most recognizable from the visualization above are:

* n 2 - 'enola holmes': mystery movie released on September 23rd (#enolaholmes, enola, holmes)

* n 8 - a combination of 'save bay yanlis' and 'the social dilemma'. Bay Yanlis is a Turkish series whose interruption was announced by FOX Turkey on September 29 (hence the word #netflixturkiye is very popular as well). The Social Dilemma is a documentary film that was released earlier this year, but it appeared to still be a popular topic. Note that in this topic, the vast majority of the terms actually refer to Bay Yanlis and The Social Dilemma.

* n 9 - 'american murder': a documentary released on September 30th (american, murder, #americanmurder, kill).

* n 10 - 'emily in paris', Resident Evil and Cobra Kai: Emily in Paris is a show that was first released on October 2, 2020. Cobra Kai was released earlier in North America, but only in late summer in European countries. Resident Evil: Infinite Darkness was announced on September 27, 2020, for release in 2021. 

* n 12 - 'teenage bounty hunters' : comedy-drama series released in August, but apparently very popular at the time of tweets collection (#teenagebountyparty, #teenagebountyhunters, teenage, bounty, #renewtbh).


For the other topics returned, I am not able to narrow down specific subjects. (n 4 seems to include 'Lucifer run', a series, but the relevance of the terms is quite low)
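
Besides the interactive plot, the top terms of each topic can also be listed directly from the topic-term weights. The sketch below uses a made-up weight matrix shaped like the `components_` attribute of a fitted model; the helper name `top_terms_per_topic` and the toy values are assumptions for illustration.

```python
import numpy as np


def top_terms_per_topic(components, feature_names, n_top=3):
    """Return the n_top highest-weighted terms for each row (topic) of components."""
    top = []
    for weights in components:
        # argsort is ascending, so take the last n_top indices and reverse them
        idx = np.argsort(weights)[-n_top:][::-1]
        top.append([feature_names[i] for i in idx])
    return top


# hypothetical topic-term weights: 2 topics x 5 terms
toy_components = np.array([
    [9.0, 0.1, 7.0, 0.2, 5.0],
    [0.3, 8.0, 0.1, 6.0, 0.2],
])
toy_vocab = ["enola", "murder", "holmes", "american", "#enolaholmes"]
print(top_terms_per_topic(toy_components, toy_vocab))
# prints: [['enola', 'holmes', '#enolaholmes'], ['murder', 'american', 'enola']]
```

With the objects of this notebook, the call would look like `top_terms_per_topic(tw_lda.components_, my_cv.get_feature_names(), n_top=10)`.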

We have been able to distinguish some movies and series names, however some of them were mixed together.
Let's now see if we can improve these results.

# ============================================================

## Hyperparameter optimization

In this section we will run some optimization tests, to explore whether we can find a more representative topic extraction.
A very important hyperparameter in the LDA model is the number of components. We will test 6, 12 and 18.
Ideally, we would test a much wider range of values (say 1 to 200), but given the time constraints and the computational power required, I will limit the range to the values above for now.

Another parameter that we can try to optimize at this early stage is the learning decay, which controls the learning rate of the online learning method. scikit-learn recommends values in the range (0.5, 1.0]; we will test 0.5, 0.7 and 0.9. Keep in mind that, according to the scikit-learn documentation, this parameter only applies when `learning_method='online'`, so with the default batch method it is not expected to change the fit.
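
To give an intuition of what the learning decay does in online learning, the sketch below assumes the update weight of online variational Bayes for LDA, (learning_offset + t) ** (-learning_decay), where t is the number of mini-batches processed so far. The function name `online_step_weight` is hypothetical and the snippet is not taken from the scikit-learn source.

```python
def online_step_weight(t, learning_decay=0.7, learning_offset=10.0):
    """Weight given to mini-batch number t: (learning_offset + t) ** -learning_decay."""
    return (learning_offset + t) ** (-learning_decay)


# compare how quickly the weight of new mini-batches shrinks for each decay value
for decay in (0.5, 0.7, 0.9):
    weights = [round(online_step_weight(t, decay), 4) for t in (1, 10, 100)]
    print(decay, weights)
```

A higher decay makes the weight of later mini-batches shrink faster, so the topics settle earlier and change less as more data is processed.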


First: set up the parameter grid and fit the model.

In [31]:
%%time
# defining hyperparameters to optimize
params_grid = {'n_components': [6, 12, 18], 'learning_decay': [0.5, 0.7, 0.9] }

# initializing the model
my_lda = LatentDirichletAllocation()

# initializing grid search
my_lda_model = GridSearchCV(my_lda, param_grid = params_grid, verbose = 2, n_jobs = -1)

Wall time: 0 ns


In [32]:
%%time
# fitting grid search
my_lda_model.fit(df_net_vec)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  9.4min finished


Wall time: 10min 24s


GridSearchCV(estimator=LatentDirichletAllocation(), n_jobs=-1,
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [6, 12, 18]},
             verbose=2)

In [33]:
#  a recap of all the parameters (most of them are the default ones)

GridSearchCV(
            cv=None, 
            error_score='raise',
            estimator=LatentDirichletAllocation(
                                                 batch_size=128, 
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1, 
                                                 learning_decay=0.7, 
                                                 learning_method=None,
                                                 learning_offset=10.0, 
                                                 max_doc_update_iter=100, 
                                                 max_iter=10,
                                                 mean_change_tol=0.001, 
                                                 n_components=10, 
                                                 n_jobs=1,
                                                 perp_tol=0.1, 
                                                 random_state=54,
                                                 topic_word_prior=None,  
                                                 verbose = 1), 
             n_jobs=1,
             param_grid = params_grid,
             pre_dispatch = '2*n_jobs',
             refit = True,
             return_train_score='warn', 
             verbose = 1)

GridSearchCV(error_score='raise',
             estimator=LatentDirichletAllocation(learning_method=None, n_jobs=1,
                                                 random_state=54, verbose=1),
             n_jobs=1,
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [6, 12, 18]},
             return_train_score='warn', verbose=1)

Second, check model results.

In [34]:
# model best hyperparameters
print("Model best hyperparameters: ", my_lda_model.best_params_)

# model Log likelihood score with best parameters
print("Best Log-likelihood Score: ", my_lda_model.best_score_)

# model perplexity with best parameters
print("Model perplexity: ", my_lda_model.best_estimator_.perplexity(bow))

Model best hyperparameters:  {'learning_decay': 0.9, 'n_components': 6}
Best Log-likelihood Score:  -265856.81369767245
Model perplexity:  979.216657108108


A couple of considerations:

* the parameters returned by the grid search are very similar to those of the base model we fitted. One difference is `max_iter`, which is only 10 here instead of 30. That saves a considerable amount of time

* the model perplexity is almost exactly the same as the one we achieved with our first base model
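The two metrics printed above are closely tied: perplexity is the exponential of the negative (approximate) log-likelihood per token, so on the same data a higher log-likelihood always means a lower perplexity. Below is a minimal sketch of that relationship. The function name `perplexity_from_log_likelihood` and the numbers passed to it are made up for illustration; they are not results from this notebook.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)


# two hypothetical models scored on the same 10,000-token corpus
print(perplexity_from_log_likelihood(-70000.0, 10000))
print(perplexity_from_log_likelihood(-65000.0, 10000))
```

Note that `best_score_` is an average over the held-out cross-validation folds, while the perplexity above is computed on `bow`, so the two printed values cannot be converted into each other directly.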


`GridSearchCV` does not seem to have a method that displays a ranking of all the hyperparameter combinations tested. However, its `cv_results_` attribute stores every combination fitted, together with the rank of each one (1 being the best).

In [35]:
# all hyperparameters combinations fitted in the grid
my_lda_model.cv_results_['params']

[{'learning_decay': 0.5, 'n_components': 6},
 {'learning_decay': 0.5, 'n_components': 12},
 {'learning_decay': 0.5, 'n_components': 18},
 {'learning_decay': 0.7, 'n_components': 6},
 {'learning_decay': 0.7, 'n_components': 12},
 {'learning_decay': 0.7, 'n_components': 18},
 {'learning_decay': 0.9, 'n_components': 6},
 {'learning_decay': 0.9, 'n_components': 12},
 {'learning_decay': 0.9, 'n_components': 18}]

In [36]:
# rank of each hyperparameter combination, in the same order as the list above (1 = best)
my_lda_model.cv_results_['rank_test_score']

array([2, 6, 7, 3, 4, 9, 1, 5, 8])

This is not very practical to read (especially with a complex grid search), so we can use a for loop to print the combinations in rank order.

Note: if the code below seems confusing, run the command

`my_lda_model.cv_results_`

It will show the cross-validation results dictionary, which might help in understanding how I pointed at the parameters using the 'rank_test_score' list.


In [37]:
# I will leave it commented to not clutter the notebook, uncomment to run
# my_lda_model.cv_results_
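For readers who prefer not to scan the full dictionary, the relationship between the two entries we care about can be sketched with plain lists: the value at position `i` of `rank_test_score` is the rank of the `i`-th entry of `params`. The labels and ranks below are made up for illustration and are not taken from our grid search.

```python
# toy stand-ins for cv_results_['params'] and cv_results_['rank_test_score']
params = ["combo A", "combo B", "combo C"]
rank_test_score = [2, 3, 1]  # combo A is 2nd, combo B is 3rd, combo C is 1st

# sort the positions by their rank, then read the parameters in that order
order = sorted(range(len(params)), key=lambda i: rank_test_score[i])
for i in order:
    print(rank_test_score[i], params[i])
```

The key point is that the ranks are sorted to obtain positions. The rank values themselves are not positions in the `params` list.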

In [38]:
# rank of hyperparameters, from best to worst
# the rank is based on the Log likelihood scores
# rank_test_score[i] is the rank of params[i], so argsort gives the positions in rank order

print('Hyperparameters combinations, from best to worst:')
ranks = my_lda_model.cv_results_['rank_test_score']
for idx in np.argsort(ranks):
    print(f'Rank {ranks[idx]}')
    display(my_lda_model.cv_results_['params'][idx])

Hyperparameters combinations, from best to worst:
Rank 1


{'learning_decay': 0.9, 'n_components': 6}

Rank 2


{'learning_decay': 0.5, 'n_components': 6}

Rank 3


{'learning_decay': 0.7, 'n_components': 6}

Rank 4


{'learning_decay': 0.7, 'n_components': 12}

Rank 5


{'learning_decay': 0.9, 'n_components': 12}

Rank 6


{'learning_decay': 0.5, 'n_components': 12}

Rank 7


{'learning_decay': 0.5, 'n_components': 18}

Rank 8


{'learning_decay': 0.9, 'n_components': 18}

Rank 9


{'learning_decay': 0.7, 'n_components': 18}

Now, let's check the 'best' combination with 6, 12 and 18 components. The reason for checking all three is that, even though 6 components gave us a better log-likelihood score (and again, this is not the most relevant factor for our final decision), 12 or 18 topics might reveal more exhaustive information than only 6.
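One caveat when comparing learning decay values: as far as I understand scikit-learn's implementation, `learning_decay` only enters the online update, so with `learning_method='batch'` (used in the cells below) and a fixed `random_state` it should not change the fitted topics. The sketch below checks this on a small random document-term matrix. `toy_counts` and `fit_components` are made up for illustration and have nothing to do with our tweets.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# made-up document-term matrix: 40 "documents", 15 "terms"
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(40, 15))


def fit_components(learning_decay):
    """Fit a small batch-mode LDA and return its topic-term weights."""
    lda = LatentDirichletAllocation(
        n_components=3,
        learning_method="batch",
        learning_decay=learning_decay,
        max_iter=5,
        random_state=17,
    )
    return lda.fit(toy_counts).components_


# compare the topics obtained with two different decay values
same = np.allclose(fit_components(0.5), fit_components(0.9))
print("identical topics in batch mode:", same)
```

If this holds for our data as well, the score differences between decay values in the grid search (which was fitted without a fixed `random_state`) would mostly reflect the random initialization rather than the decay itself.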

### Visualizing topics with LDAvis (6 components).

In [39]:
%%time

# fitting LDA with 6 components, learning decay of 0.5
final_lda_6 = LatentDirichletAllocation( 
                                     n_components=6,
                                     learning_decay=0.5, 
                                     batch_size=128, 
                                     doc_topic_prior=None,
                                     evaluate_every=-1, 
                                     learning_method='batch',
                                     learning_offset=10.0, 
                                     max_doc_update_iter=100, 
                                     max_iter=10,
                                     mean_change_tol=0.001, 
                                     n_jobs=1,
                                     perp_tol=0.1, 
                                     random_state=17,
                                     topic_word_prior=None,  
                                     verbose = 0)

final_lda_6.fit(df_net_vec)

Wall time: 55.8 s


LatentDirichletAllocation(learning_decay=0.5, n_components=6, n_jobs=1,
                          random_state=17)

In [40]:
# visualizing topics extracted from model through LDAvis

final_lda_6_vis =  pyLDAvis.sklearn.prepare(final_lda_6, bow, my_cv)
pyLDAvis.display(final_lda_6_vis)
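A practical note on the import used above: to my knowledge, newer pyLDAvis releases (3.4.0 onwards) renamed the `pyLDAvis.sklearn` module to `pyLDAvis.lda_model`, keeping the same `prepare` function. Below is a small sketch of a version-tolerant import, assuming that rename. `load_sklearn_adapter` is a hypothetical helper name, not part of pyLDAvis.

```python
import importlib


def load_sklearn_adapter():
    """Return the pyLDAvis module that wraps scikit-learn models, or None."""
    # try the newer module name first, then the one used in this notebook
    for name in ("pyLDAvis.lda_model", "pyLDAvis.sklearn"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


adapter = load_sklearn_adapter()
print("adapter module:", getattr(adapter, "__name__", None))
```

With such a helper, the call above would become `adapter.prepare(final_lda_6, bow, my_cv)`.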

With only 6 topics we observe much larger blobs, however:

* with such a small number of components, we would have expected to see minimal overlap, if any
* some of the topics that we recognized before no longer show up!

Topic number 6 seems to include mostly 'save Bay Yanlis', so we see more coherence in the terms of the topic.

The downside is that we are not able to spot as many subjects of discussion as we did in the base model.

Let's save the visualization in an HTML file and move on to visualizing 12 topics.


In [41]:
pyLDAvis.save_html(final_lda_6_vis,'lda_vis/sklearn/lda_final_6topics.html')

### Visualizing topics with LDAvis (12 components).


In [42]:
%%time


# fitting LDA with 12 components, learning decay of 0.7
final_lda_12 = LatentDirichletAllocation( 
                                     n_components=12,
                                     learning_decay=0.7, 
                                     batch_size=128, 
                                     doc_topic_prior=None,
                                     evaluate_every=-1, 
                                     learning_method='batch',
                                     learning_offset=10.0, 
                                     max_doc_update_iter=100, 
                                     max_iter=10,
                                     mean_change_tol=0.001, 
                                     n_jobs=1,
                                     perp_tol=0.1, 
                                     random_state=17,
                                     topic_word_prior=None,  
                                     verbose = 0)

final_lda_12.fit(df_net_vec)

Wall time: 46.9 s


LatentDirichletAllocation(n_components=12, n_jobs=1, random_state=17)

In [43]:
# visualizing topics extracted from model through LDAvis

final_lda_12_vis =  pyLDAvis.sklearn.prepare(final_lda_12, bow, my_cv)
pyLDAvis.display(final_lda_12_vis)

Unsurprisingly, with 12 components we get pretty much the same result as our first base model. The difference I noticed (by visual inspection) is that the terms representing each subject seem to rank a bit higher within their respective topics.

We still see a fair amount of overlap. This might be because some topics contain terms with similar meanings, and also because, for our dataset (~25,000 documents), the ideal number of distinct topics might be higher than 12.

These are elements to keep in mind for future exploration. For now, let's save the visualization and move to the next one.

In [44]:
pyLDAvis.save_html(final_lda_12_vis,'lda_vis/sklearn/lda_final_12topics.html')

### Visualizing topics with LDAvis (18 components).

In [45]:
%%time

# fitting LDA with 18 components, learning decay of 0.9
final_lda_18 = LatentDirichletAllocation( 
                                     n_components=18,
                                     learning_decay=0.9, 
                                     batch_size=128, 
                                     doc_topic_prior=None,
                                     evaluate_every=-1, 
                                     learning_method='batch',
                                     learning_offset=10.0, 
                                     max_doc_update_iter=100, 
                                     max_iter=10,
                                     mean_change_tol=0.001, 
                                     n_jobs=1,
                                     perp_tol=0.1, 
                                     random_state=17,
                                     topic_word_prior=None,  
                                     verbose = 0)

final_lda_18.fit(df_net_vec)

Wall time: 45.8 s


LatentDirichletAllocation(learning_decay=0.9, n_components=18, n_jobs=1,
                          random_state=17)

In [46]:
# visualizing topics extracted from model through LDAvis

final_lda_18_vis =  pyLDAvis.sklearn.prepare(final_lda_18, bow, my_cv)
pyLDAvis.display(final_lda_18_vis)

Some of the topics we see were already found and described when we fitted the base model.
We still see overlap, although we can now spot more human-recognizable topics, and the top terms in each of them seem to be more related to the topic.


* topic 2: Enola Holmes 
* topic 7: The Social Dilemma
* topic 8: American Murder
* topic 9: Lucifer Run (very weak term relevance though)
* topic 10: A Life on Our Planet, documentary presented by David Attenborough (both terms have a relevant appearance in the topic)
* topic 14: Emily in Paris, Resident Evil, Cobra Kai
* topic 16: Save Bay Yanlis
* topic 17: Teenage Bounty Hunters
* topic 18: Record of Youth, a Korean television series (which might explain the terms in Korean characters) released at the beginning of September

Note that topic 10 overlaps completely with topic 2, but each of them has top terms related to its own subject.
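The top terms we inspected visually can also be listed programmatically, which makes it easier to compare overlapping topics side by side. Below is a minimal sketch. The `top_terms` helper, the toy weights and the toy vocabulary are made up for illustration. With the fitted model, the inputs would presumably be `final_lda_18.components_` and the vocabulary from `my_cv.get_feature_names_out()` (`get_feature_names()` on older scikit-learn versions).

```python
def top_terms(components, vocabulary, n_terms=3):
    """Return the n_terms highest-weighted terms for each topic row."""
    result = []
    for weights in components:
        # positions of the largest weights in this topic, highest first
        best = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:n_terms]
        result.append([vocabulary[j] for j in best])
    return result


# made-up topic-term weights (2 topics, 4 terms) and vocabulary
toy_components = [[0.1, 5.0, 0.2, 3.0], [4.0, 0.3, 2.5, 0.1]]
toy_vocabulary = ["attenborough", "enola", "planet", "holmes"]

for k, terms in enumerate(top_terms(toy_components, toy_vocabulary, n_terms=2), start=1):
    print(f"topic {k}:", terms)
```

Keep in mind that pyLDAvis ranks terms by its relevance metric and, by default, renumbers topics by size, so these lists may not match the visualization exactly.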

Let's save the visualization and then draw some considerations.

In [47]:
pyLDAvis.save_html(final_lda_18_vis,'lda_vis/sklearn/lda_final_18topics.html')

# ======================================================================

# CONSIDERATIONS

* We have been able to recognize 9 subjects of discussion, which is a great start.

* One relevant challenge is finding the right number of topics to request from the model. Especially when we start having tens of thousands of tweets, it becomes increasingly difficult to achieve a good result by just guessing that value.

* Even though there are other hyperparameters we can optimize, the number of topics is a fundamental one to fine-tune.

* Besides our visual inspection, we have only used perplexity/log-likelihood to get an idea of the model performance, which, as we have already seen, is not the best indicator.

* Better perplexity scores don't necessarily pair with better subject interpretation, as we have seen just above.

In the next notebook, rather than continuing to optimize the hyperparameters of this model, we will try another LDA model, and calculate the coherence value as a way of scoring the model's performance.