# Lab 7 - Topic Modeling

In this lab, you will learn:
* How to find topics in a corpus using topic modeling
* How to apply Latent Dirichlet Allocation (LDA), a topic modeling technique, to texts
* How to find the distribution of LDA topics in a corpus

This lab was written by Jisun AN (jisunan@smu.edu.sg) and Michelle KAN (michellekan@smu.edu.sg), based on an existing tutorial titled "[Topic Modeling with Gensim (Python)](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)" by Selva Prabhakaran.


 

# 0. Introduction 

One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such text include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and customer complaint e-mails.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. However, it is very hard to read through such large volumes manually and compile the topics.

What is needed is an automated algorithm that can read through the text documents and output the topics discussed. This is called topic modeling. The ultimate goal of topic modeling is to find the various topics present in your corpus. Each document in the corpus is made up of at least one topic, if not several.

In this tutorial, we will take a real example, the COVID-19 Twitter dataset, and use **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques. LDA was designed specifically for text data and extracts the topics that are naturally discussed in it.

We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is.


In [None]:
!pip install pyldavis
!pip install -U gensim
# !pip install en_core_web_sm

You must restart the runtime after updating the libraries.

In [None]:
import pandas as pd
import re 

# Gensim for topic modeling
import gensim
from gensim.utils import simple_preprocess
from gensim import matutils, models
import gensim.corpora as corpora
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy
import scipy.sparse

# NLTK Stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Plotting tools
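import pyLDAvis
# pyLDAvis 3.3+ renamed the gensim helper module to gensim_models
try:
    import pyLDAvis.gensim_models
except ImportError:
    import pyLDAvis.gensim  # older pyLDAvis releases
import matplotlib.pyplot as plt
%matplotlib inline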


# 1. Getting the data

We will use the COVID-19 Twitter dataset, which was collected using COVID-19 related keywords (covid, coronavirus, etc.) from January to April 2020.


In [None]:
ori_df = pd.read_table("https://raw.githubusercontent.com/anjisun221/css_codes/main/sample_covid19_tweet_20200101_20200412_en.tsv", sep="\t")

print(ori_df.shape)
ori_df.head()

In [None]:
df = ori_df.sample(n=5000, random_state=999)
print(df.shape)

In [None]:
df.text.head()

# (Preview) Let's build a quick LDA topic model 

In [None]:
# Convert to list
data = df.text.values.tolist()
data[:5]

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; tokenization itself drops punctuation

data_words = list(sent_to_words(data))

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Let's start with 2 topics.
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda_model.print_topics()

In [None]:
# Now try 10 topics. This may take a while.
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=10)
lda_model.print_topics()

#### Do the topics above make sense to you?

# 2. Data cleaning

We will do the following:
* Remove @mentions and URLs
* Tokenization: We will use Gensim's `gensim.utils.simple_preprocess` function to tokenize the sentences in our corpus. It converts a document into a list of tokens. Read more [here](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html).
* Removing stop words
* Bigram extraction: extracting pairs of words that frequently occur together in the documents, e.g., ‘front_bumper’ and ‘oil_leak’. Applying the same step again can produce longer phrases such as ‘maryland_college_park’.
* Lemmatization
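
The first three steps can be sketched without any library. This is a minimal illustration on a made-up tweet: the `clean_tweet` helper, the tiny stop word set, and the simplified tokenizer are assumptions for illustration, not what Gensim or NLTK do internally.

```python
import re

# Hypothetical mini stop word list; the lab uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "is", "in", "a", "of", "and", "to"}

def clean_tweet(text):
    """Strip @mentions and URLs, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"@\w+", "", text)      # remove @mentions
    text = re.sub(r"http\S+", "", text)   # remove URLs
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [t for t in tokens if t not in TOY_STOP_WORDS]

tweet = "@who The number of cases is rising in Italy https://t.co/abc123"
print(clean_tweet(tweet))
```

Only the content-bearing words survive, which is what we want the topic model to see.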




In [None]:
# Convert to list
data = df.text.values.tolist()
data[:5]

2-1. Remove @mentions and URLs

In [None]:
# Remove @mentions 
data = [re.sub(r'@\w+', '', sent) for sent in data]

# Remove urls (remove a word starting with http)
data = [re.sub(r'http\S+', '', sent) for sent in data]

data[:5]

2-2. Tokenization

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; tokenization itself drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

2-3. Remove stopwords

In [None]:
# Prepare stopwords using NLTK
stop_words = stopwords.words('english')

# You can add other words to the list of stop words as well
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'amp'])

def remove_stopwords(texts):
    # Each doc is already a list of tokens, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
print(data_words_nostops[:1])

2-4. Bigram extraction
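
To get a feel for what `min_count` and `threshold` control, here is a plain-Python sketch of a Mikolov-style bigram score on a toy corpus. The `bigram_scores` helper and the toy sentences are made up for illustration. Gensim's default scorer is, to our understanding, of this form, but treat the exact formula as an assumption and check the Gensim documentation before relying on it.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs; a higher score means the pair co-occurs more than chance."""
    unigrams = Counter(w for sent in sentences for w in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Discount rare pairs by min_count, then normalize by the individual word counts
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

toy = [
    ["social", "distancing", "saves", "lives"],
    ["practice", "social", "distancing", "today"],
    ["social", "distancing", "works"],
    ["social", "media", "today"],
]
scores = bigram_scores(toy)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Pairs that score above the threshold are merged into a single token such as `social_distancing`, so a higher threshold yields fewer phrases.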

In [None]:
# Build the bigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10) # higher threshold fewer phrases.

# Faster way to get a sentence clubbed as a bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

print(data_words_bigrams[10])

2-5. Lemmatization

Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma.

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming applies a fixed sequence of rule-based steps to each word, which makes it faster.
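
The difference is easy to see on a toy example. Both helpers below are deliberately crude and made up for illustration: `crude_stem` is not a real stemming algorithm such as Porter's, and `TOY_LEMMAS` is a hypothetical four-word dictionary standing in for spaCy's lemmatizer.

```python
def crude_stem(word):
    """Toy stemmer: chop common suffixes by rule, with no vocabulary lookup."""
    for suffix in ("ies", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini lemma dictionary, for illustration only.
TOY_LEMMAS = {"studies": "study", "viruses": "virus", "infected": "infect", "was": "be"}

def toy_lemma(word):
    """Toy lemmatizer: look the word up and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "viruses", "infected", "was"]:
    print(w, "-> stem:", crude_stem(w), "| lemma:", toy_lemma(w))
```

Note that the stem of "studies" is not a dictionary word, while its lemma is, and that only the lemmatizer can map an irregular form such as "was" to "be".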

We use spaCy for lemmatization. 
It also lets us keep only terms with a particular part-of-speech tag.
We will use nouns (NOUN) and proper nouns (PROPN) in this example. A proper noun is a specific name for a particular person, place, or thing. See the part-of-speech options here: https://spacy.io/usage/linguistic-features 



In [None]:
def lemmatization(texts, allowed_postags=('NOUN',)):
    """Lemmatize each tokenized document, keeping only tokens whose POS tag is in allowed_postags."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Initialize spacy 'en' model, keeping only tagger component
# For normal use
# !python -m spacy download en_core_web_sm
# nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# For Colab use
import en_core_web_sm
nlp = en_core_web_sm.load()

print("Before Lemmatization:", data_words_bigrams[:1])

# Do lemmatization keeping only nouns and proper nouns
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'PROPN'])

print("After Lemmatization: ", data_lemmatized[:1])


# 2. Building topic model

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them.

Gensim assigns a unique id to each word in the corpus (id2word). The corpus is then represented, document by document, as a list of (word_id, word_frequency) pairs.

Check [`gensim.corpora`](https://radimrehurek.com/gensim/corpora/dictionary.html) for details about `filter_extremes()`
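
The idea behind document-frequency filtering can be sketched in plain Python. The `filter_by_doc_frequency` helper below is hypothetical and only mirrors the meaning of the `no_below` and `no_above` arguments. Gensim's own implementation differs in details (for example, it also caps the vocabulary size with `keep_n`).

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=2, no_above=0.8):
    """Keep words found in at least `no_below` docs and in at most a `no_above` fraction of docs."""
    # Count each word once per document, regardless of how often it repeats there
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    return {
        word
        for word, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

toy_docs = [
    ["covid", "case", "china"],
    ["covid", "case", "lockdown"],
    ["covid", "vaccine", "trial"],
    ["covid", "lockdown", "school"],
    ["covid", "case", "death"],
]
kept = filter_by_doc_frequency(toy_docs)
print(sorted(kept))
```

Here "covid" is dropped because it appears in every document and therefore cannot help tell topics apart, while words seen in a single document are dropped as too rare.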


In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
id2word.filter_extremes(no_below=2, no_above=0.8)  # drop words that appear in fewer than 2 documents or in more than 80% of documents

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

For example, (0, 1) above means that word id 0 occurs once in the first document. Likewise, word id 1 occurs once, and so on.

This is used as the input by the LDA model.
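
As a rough mental model, the bag-of-words conversion can be written in a few lines of plain Python. The `build_vocab` and `to_bow` helpers are made up for illustration, and the ids they assign will generally not match the ids Gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct word an integer id in order of first appearance."""
    word2id = {}
    for doc in docs:
        for word in doc:
            word2id.setdefault(word, len(word2id))
    return word2id

def to_bow(doc, word2id):
    """Return sorted (word_id, count) pairs for the known words in one document."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    return sorted(counts.items())

toy_docs = [["virus", "case", "virus"], ["case", "lockdown"]]
word2id = build_vocab(toy_docs)
print(word2id)
print([to_bow(doc, word2id) for doc in toy_docs])
```

Word order is discarded: each document is reduced to which words it contains and how often.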

If you want to see what word a given id corresponds to, pass the id as a key to the dictionary (id2word).

In [None]:
print(id2word[0], id2word[1], id2word[2], id2word[3], id2word[4])


To build an LDA model, you need to specify the number of topics in addition to the dictionary (id2word) and the corpus.

`passes` is the total number of training passes over the corpus. More passes refine the assignment of words to topics, at the cost of longer training.

Check other parameters of LDA model [here](https://radimrehurek.com/gensim/models/ldamodel.html). 

In [None]:
# Build LDA model

# Let's start with 2 topics.
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda_model.print_topics()

How do you interpret an LDA topic?

Topic 0 is represented as `0.050*"coronavirus" + 0.021*"china" + 0.011*"case" + 0.010*"virus" + 0.008*"day" + 0.007*"cdc" + 0.007*"world" + 0.007*"death" + 0.006*"covid" + 0.006*"time"`

This means the top 10 keywords that contribute to this topic are 'coronavirus', 'china', 'case', and so on, and the weight of 'coronavirus' in topic 0 is 0.05.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? You may summarise it as either 'covid-19 update' or 'covid-19 news.'

Likewise, Topic 1 could be 'Trump' or 'politics.'
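
Under the hood, a topic is a probability distribution over the vocabulary, and the printed string is just its highest-probability words. The sketch below rebuilds that string from a word-to-weight mapping, using the first five weights of the Topic 0 example above. The `format_topic` helper is hypothetical, not a Gensim function.

```python
def format_topic(word_probs, topn=3):
    """Render the `topn` highest-probability words as weight*"word" terms."""
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:topn]
    return " + ".join(f'{p:.3f}*"{w}"' for w, p in top)

# First five weights of Topic 0 as printed above; the rest of the vocabulary is omitted.
toy_topic = {"day": 0.008, "china": 0.021, "coronavirus": 0.050, "virus": 0.010, "case": 0.011}
print(format_topic(toy_topic))
```

Because the weights are spread over the whole vocabulary, even the top keyword usually carries only a few percent of the topic's probability mass.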


In [None]:
# 3 topics.
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda_model.print_topics()

In [None]:
# 10 topics.
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=10)
lda_model.print_topics()

# 3. Topic coherence

Model perplexity and [topic coherence](https://rare-technologies.com/what-is-topic-coherence/) provide convenient measures of how good a given topic model is. Topic coherence measures score a single topic by the degree of semantic similarity between its high-scoring words. Two widely used measures are c_v, which typically lies between 0 and 1, and u_mass, which typically lies between -14 and 14. With c_v, a coherence score above 0.5 is generally considered good, and a score above 0.9 is rare. See more [here](https://stackoverflow.com/questions/54762690/what-is-the-meaning-of-coherence-score-0-4-is-it-good-or-bad).
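
To make the idea concrete, here is a simplified plain-Python sketch of the UMass measure, which rewards topics whose top words tend to appear in the same documents. This lab uses c_v, which is more involved. The `umass_coherence` helper and the toy documents are made up for illustration, and Gensim's implementation has more steps, so its scores will not match this sketch exactly.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Simplified UMass coherence for one topic; `top_words` is ordered by topic weight."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # combinations() yields (earlier, later) pairs, i.e. the higher-ranked word first.
    # Assumes every top word occurs in at least one document.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

toy_docs = [
    ["virus", "case", "death"],
    ["virus", "case", "china"],
    ["trump", "election", "vote"],
    ["trump", "vote", "virus"],
]
coherent = umass_coherence(["virus", "case"], toy_docs)
mixed = umass_coherence(["virus", "election"], toy_docs)
print(round(coherent, 3), round(mixed, 3))
```

The pair that co-occurs in documents scores higher than the pair that never appears together, which matches the intuition that a coherent topic groups words used in the same contexts.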


In [None]:
# Compute the per-word likelihood bound; Gensim derives perplexity as 2**(-bound), so a lower perplexity (higher bound) is better.
print('\nLog perplexity (per-word bound): ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# 4. How to find the optimal number of topics for LDA?

To find the optimal number of topics, we will build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.

Choosing a 'k' that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.

If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.

We will use the elbow method: a plot of the coherence value against k, from which we read off the number of topics that gives the best coherence. Check out [this](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/) for more on the elbow method.
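
Both selection rules described above can be written down in a few lines. The sketch below is only one possible reading of them: the `pick_num_topics` helper and its `min_gain` threshold are assumptions, and the coherence values are made up for illustration, not results from this lab.

```python
def pick_num_topics(ks, coherences, min_gain=0.01):
    """Return (k with the highest coherence, k where the rapid growth of coherence ends)."""
    best_k = ks[max(range(len(ks)), key=lambda i: coherences[i])]
    elbow_k = ks[-1]
    for i in range(len(ks) - 1):
        # The elbow is the first k after which coherence improves by less than min_gain
        if coherences[i + 1] - coherences[i] < min_gain:
            elbow_k = ks[i]
            break
    return best_k, elbow_k

# Made-up values, for illustration only.
ks = [2, 4, 6, 8, 10, 12]
coherences = [0.30, 0.38, 0.44, 0.445, 0.45, 0.43]
print(pick_num_topics(ks, coherences))
```

The two rules can disagree, as they do here. In practice, inspect the topics at both candidate values of k and keep the one whose topics are easier to interpret.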

The `compute_coherence_values()` function (see below) trains multiple LDA models and returns the models and their corresponding coherence scores.


In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        print(f'Training model for num_topics= {num_topics}')
        model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)  # use the dictionary argument, not the global id2word
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
start = 2
limit = 60
step = 6

In [None]:
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=start, limit=limit, step=step)
print('Completed!')

In [None]:
# Show graph
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # pass a list; a bare string would be read character by character
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

# 5. Best LDA model

For the following steps, we choose the model with 26 topics.
Increasing the number of passes can refine the topics further.

In [None]:
best_num_topics = 26

In [None]:
# Build LDA model
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=best_num_topics, passes= 60)
lda_model.print_topics()


# 6. Visualize LDA topics

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.


In [None]:
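# Visualize the topics (pyLDAvis 3.3+ renamed its gensim helper module to gensim_models)
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis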

How do you interpret pyLDAvis’s output?

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

Next, we will find the dominant topic of each document.

# 7. Finding the dominant topic in each sentence

One practical application of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.
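
For a single document this boils down to taking a maximum. In the minimal sketch below, the topic distribution is made up, but it has the same list-of-(topic_id, probability) shape that `lda_model[bow]` returns. The `dominant_topic` helper is hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distribution for one document.
doc_topics = [(0, 0.12), (3, 0.71), (7, 0.17)]
topic_id, share = dominant_topic(doc_topics)
print(f"dominant topic: {topic_id} ({share:.0%} of the document)")
```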

The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.

In [None]:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        if not row:  # no topic above the probability threshold
            continue
        # Sort topics by contribution; the first one is the dominant topic
        topic_num, prop_topic = sorted(row, key=lambda x: x[1], reverse=True)[0]
        # Get the dominant topic, percentage contribution and keywords for each document
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['dominant_topic', 'topic_perc_contrib', 'keywords'])
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus)
df_topic_sents_keywords.head()


In [None]:
# Combine the original data with inferred topics

dominant_topic = pd.Series(df_topic_sents_keywords.dominant_topic.values.tolist())
topic_perc_contrib = pd.Series(df_topic_sents_keywords.topic_perc_contrib.values.tolist())
keywords = pd.Series(df_topic_sents_keywords.keywords.values.tolist())

text_no = pd.Series(df.text_no.values.tolist())
timestampStr = pd.Series(df.timestampStr.values.tolist())
user_location_state = pd.Series(df.user_location_state.values.tolist())
text = pd.Series(df.text.values.tolist())

pd.set_option('display.max_colwidth', 150)
new_df = pd.concat([text_no, timestampStr, user_location_state, dominant_topic, topic_perc_contrib, keywords, text], axis=1)
new_df.columns = ['text_no', 'timestampStr', 'user_location_state', 'dominant_topic', 'topic_perc_contrib', 'keywords', 'text']
new_df.head()


In [None]:
new_df.dominant_topic.value_counts()

In [None]:
new_df.keywords.value_counts()

In [None]:
plt.hist(new_df.dominant_topic, bins=best_num_topics)

# 8. Find the most representative documents for each topic. 

Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading that document. 
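
The selection logic is a group-then-sort operation. Here is a plain-Python sketch on made-up records. The `top_docs_per_topic` helper is hypothetical, and the lab itself does the same thing with a pandas `groupby`.

```python
from collections import defaultdict

def top_docs_per_topic(records, n=2):
    """Group (doc_id, topic_id, contribution) records by topic and keep the `n` strongest."""
    by_topic = defaultdict(list)
    for doc_id, topic_id, contrib in records:
        by_topic[topic_id].append((contrib, doc_id))
    return {
        topic_id: [doc_id for contrib, doc_id in sorted(pairs, reverse=True)[:n]]
        for topic_id, pairs in by_topic.items()
    }

# Made-up records, for illustration only.
records = [("a", 0, 0.55), ("b", 1, 0.91), ("c", 0, 0.83), ("d", 0, 0.62), ("e", 1, 0.40)]
print(top_docs_per_topic(records))
```

Reading the few documents with the highest contribution is usually the quickest way to put a human-readable label on a topic.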




In [None]:
# Group top 3 sentences under each topic
sent_topics_sorteddf_lda = pd.DataFrame()

sent_topics_outdf_grpd = new_df.groupby('dominant_topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_lda = pd.concat([sent_topics_sorteddf_lda, 
                                             grp.sort_values(['topic_perc_contrib'], ascending=False).head(3)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_lda.reset_index(drop=True, inplace=True)

# # Show
sent_topics_sorteddf_lda

# Exercise 1

Create a new topic model that includes terms from other [parts of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), such as adjectives and verbs, and see if you can get better topics. 

After you complete and run the code below, you will need to rerun almost all of the code above (from Section 2). 
See options for part of speech here: https://spacy.io/api/annotation

Question to answer: 
Find the best LDA model. How many topics does it have? 



In [None]:
print("Before Lemmatization:", data_words_bigrams[:1])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = #WRITE YOUR CODE

print("After Lemmatization: ", data_lemmatized[:1])


# Exercise 2

Chunksize controls how many documents are processed at a time in the LDA training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fit into memory. 
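
Conceptually, chunking just means walking through the corpus in fixed-size batches. The sketch below shows that idea in plain Python. The `iter_chunks` helper and the toy corpus are made up for illustration, and this is not Gensim's internal code.

```python
from itertools import islice

def iter_chunks(documents, chunksize):
    """Yield successive lists of at most `chunksize` documents."""
    it = iter(documents)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

toy_corpus = [f"doc{i}" for i in range(7)]
sizes = [len(chunk) for chunk in iter_chunks(toy_corpus, chunksize=3)]
print(sizes)
```

A larger chunk size means fewer, bigger batches per pass over the corpus, which is why it trades memory for speed.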

#### Exercise 2a) 
Update the `compute_coherence_values` function below (duplicated for you from Section 4) by setting the chunk size of the LdaModel.
See [Set LdaModel parameters](https://radimrehurek.com/gensim/models/ldamodel.html).

Rerun the compute_coherence_values function based on chunk sizes of 100 to 800 (both inclusive) in steps of 200. 

In [None]:
### Update the following compute_coherence_values function to define chunk size for LdaModel 

def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    """
    Compute c_v coherence for various document chunksize

    """
    coherence_values = []
    model_list = []
    
    for ??? in range(start, limit, step):
        print(f'Training model based on chunk size= {???}')
        model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=best_num_topics, passes=60, ???)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
### Set values of chunk size and run this cell after updating compute_coherence_values function above 

# setting values for chunk size
start = ??
limit = ??
step = ??

# Run LdaModel
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=start, limit=limit, step=step)
print('Completed')

#### Exercise 2b) 
Generate a coherence graph based on the chunk sizes defined and the coherence values computed in Exercise 2a, where the x-axis represents 
the 'Chunk Size' and the y-axis represents the 'Coherence score'.

Question to answer: 
According to the graph, what is the optimal chunk size? 

In [None]:
## Write your code below


