# **Introduction**

This is the course material for 'Topic Modeling'. The notebook is prepared in Python 3.7+.

Author: Avinash OK ( okavinashok@gmail.com )

In this notebook we try to give you answers to the following questions:

* What is Topic Modeling?
* What are the different steps involved in the process?
* Which approaches can be used for Topic Modeling?
* When should you use Topic Modeling?
* How can this Notebook help you to build a Topic Model from scratch?

In [None]:
# Importing all Required Libraries
import nltk, string, re, os, pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")

from nltk.stem.porter import *
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
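# the `stop_words` module is not importable in recent scikit-learn releases; the list lives here
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS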
from collections import Counter
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline

import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show, output_notebook

import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("lda").setLevel(logging.WARNING)

### Definition

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.


Contents covered in this notebook:
1. Exploratory Data Analysis 
2. Text Processing  
    2.1. Tokenizing and tf-idf algorithm  
    2.2. K-means Clustering  
    2.3. Latent Dirichlet Allocation (LDA) / Topic Modelling


#### 1. Exploratory Data Analysis 

Here, we use a very popular NLP dataset, [`20 News Groups`](http://qwone.com/~jason/20Newsgroups/). It can be downloaded from this [link](https://www.kaggle.com/crawford/20-newsgroups/download) or loaded directly from scikit-learn, a popular data science library in Python.

#####  What is 20 News Groups?
- The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang for his Newsweeder. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

##### **Here is a list of the 20 newsgroups, partitioned according to subject matter:**
<img src="./resources/20NewsGroup.PNG">

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

df = pd.DataFrame([newsgroups_train.data, newsgroups_train.target.tolist()]).T
df.columns = ['text', 'target']

targets = pd.DataFrame( newsgroups_train.target_names)
targets.columns=['title']

dataframe = pd.merge(df, targets, left_on='target', right_index=True)
dataframe.head()

##### Some sample texts from the dataset:

In [None]:
for text, topic in zip(dataframe['text'][125:130], dataframe['title'][125:130]):
    print("#"*125)
    print("Topic: "+ topic) 
    print("Text: "+ text)

#### Topic level split in the dataset

In [None]:
x = dataframe['title'].value_counts().index.values.astype('str')
y = dataframe['title'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(dataframe))]

trace1 = go.Bar(x=x, y=y, text=pct,
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                ))
layout = dict(title= 'Topic Level split in the dataset',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Titles'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

#### Our Game-Plan

Now that we have an initial understanding of the dataset, let's quickly move on to the meat of the problem. For now, let's assume that we don't have the column `title` and see whether we can cluster the documents in the column `text` based on the words they share. `text` is our "focused column" throughout this process.

#### 2. Text pre-processing
In this stage, we perform basic preprocessing on our focused column `text`.

Parsing this column is slightly challenging because it is unstructured data. As part of text preprocessing, we strip out all punctuation, remove English stop words (i.e. very common words such as "a", "the", etc.) and drop any other words with fewer than 3 characters.

<img src="./resources/TopicModelingPreprocessing.png">

In [None]:
stop = set(stopwords.words('english'))
def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try: 
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(" ", text) # remove punctuation
        
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w)>=3]
        
        return filtered_tokens
            
    except TypeError as e: print(text,e)

In [None]:
# create a dictionary of words for each category
cat_desc = dict()
for cat in newsgroups_train.target_names: 
    text = " ".join(dataframe.loc[dataframe['title']==cat, 'text'].values)
    cat_desc[cat] = tokenize(text)

# flat list of all words combined
flat_lst = [item for sublist in list(cat_desc.values()) for item in sublist]
allWordsCount = Counter(flat_lst)
all_top20 = allWordsCount.most_common(20)  # the 20 most frequent words overall
x = [w[0] for w in all_top20]
y = [w[1] for w in all_top20]

In [None]:
# reuse x (words) and y (word counts) from the previous cell;
# each bar is labelled with the word's share of all tokens
pct = [("%.2f"%(v*100))+"%" for v in (np.array(y)/len(flat_lst))]

trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Overall Word Frequency in the dataset',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Word'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

##### 2.1.a Tokenizing

Most of the time, the first step of an NLP project is to **"tokenize"** your documents, the main purpose of which is to normalize the texts. The fundamental stages usually include: 
* break the descriptions into sentences and then break the sentences into tokens
* remove punctuation and stop words
* lowercase the tokens
* here, we additionally keep only words that are at least 3 characters long
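
The stages above can be sketched without any external dependency. This is only a minimal illustration: `TOY_STOP_WORDS` and `simple_tokenize` are hypothetical names, the stop-word list is a tiny hand-written one, and whitespace splitting stands in for NLTK's sentence and word tokenizers used in this notebook.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself relies on NLTK's English stop words.
TOY_STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}

def simple_tokenize(text, min_len=3):
    # 1. replace punctuation and digits with spaces
    cleaned = re.sub("[" + re.escape(string.punctuation) + "0-9]", " ", text)
    # 2. split on whitespace and lowercase every token
    tokens = [t.lower() for t in cleaned.split()]
    # 3. drop stop words and tokens shorter than min_len characters
    return [t for t in tokens if t not in TOY_STOP_WORDS and len(t) >= min_len]

example_tokens = simple_tokenize("The engine of the car is LOUD, and it's 20 years old!")
print(example_tokens)
```

Short leftovers such as the "s" from "it's" are removed by the length filter rather than by the stop-word list.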

In [None]:
%%time
# apply the tokenizer to the "text" column
dataframe['tokens'] = dataframe['text'].map(tokenize)

dataframe.reset_index(drop=True, inplace=True)

print("Let's eyeball how the sentences have been tokenized:")

for description, tokens in zip(dataframe['text'].head(), dataframe['tokens'].head()):
    print('description:', description)
    print('tokens:', tokens)
    print()

We could also use the package `WordCloud` to easily visualize which words have the highest frequencies within each title:

In [None]:
# build dictionary with key=title and values as all the descriptions related.
cat_desc = dict()
for cat in newsgroups_train.target_names: 
    text = " ".join(dataframe.loc[dataframe['title']==cat, 'text'].values)
    cat_desc[cat] = tokenize(text)


# find the 100 most common words for four selected categories
autos100 = Counter(cat_desc['rec.autos']).most_common(100)
space100 = Counter(cat_desc['sci.space']).most_common(100)
christian100 = Counter(cat_desc['soc.religion.christian']).most_common(100)
mideast100 = Counter(cat_desc['talk.politics.mideast']).most_common(100)

In [None]:
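def generate_wordcloud(tup):
    # tup is a list of (word, count) pairs, e.g. from Counter.most_common();
    # passing the counts directly lets word size reflect frequency
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate_from_frequencies(dict(tup))
    return wordcloud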

In [None]:
fig,axes = plt.subplots(2, 2, figsize=(30, 15))

ax = axes[0, 0]
ax.imshow(generate_wordcloud(autos100), interpolation="bilinear")
ax.axis('off')
ax.set_title("Title: Automobile; Top: 100", fontsize=30)

ax = axes[0, 1]
ax.imshow(generate_wordcloud(space100))
ax.axis('off')
ax.set_title("Title: Space; Top: 100", fontsize=30)

ax = axes[1, 0]
ax.imshow(generate_wordcloud(christian100))
ax.axis('off')
ax.set_title("Title: Christian; Top: 100", fontsize=30)

ax = axes[1, 1]
ax.imshow(generate_wordcloud(mideast100))
ax.axis('off')
ax.set_title("Title: Mid-East; Top: 100", fontsize=30)

##### 2.1.b tf-idf algorithm

tf-idf stands for **Term Frequency–Inverse Document Frequency**. It quantifies the importance of a word in a document relative to a collection of documents (the corpus). The metric depends on two factors: 
- **Term Frequency**: the number of occurrences of a word in a given document (i.e. bag of words)
- **Inverse Document Frequency**: a score that shrinks as the number of documents containing the word grows (typically the logarithm of the inverse fraction of documents that contain it)

Think of it this way: if a word is used extensively in all documents, its presence in a specific document tells us little about that document. The second term can therefore be seen as a penalty for common words such as "a", "the", "and", etc. tf-idf can thus be seen as a weighting scheme for the relevance of words in a specific document. For more information, refer to this [link](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
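
As a rough worked example, the weighting can be computed by hand on a toy corpus. The sketch below assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default for `TfidfVectorizer`, and skips the L2 normalization that scikit-learn applies to each row afterwards. The corpus and the helper names `idf` and `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized
docs = [
    ["car", "engine", "car"],
    ["space", "orbit", "engine"],
    ["car", "space", "launch"],
]

n_docs = len(docs)
# document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # smoothed inverse document frequency (assumed formula, see lead-in)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tf_idf(doc):
    # raw term count in the document multiplied by the term's idf
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights_doc0 = tf_idf(docs[0])
for term, weight in sorted(weights_doc0.items()):
    print(term, round(weight, 3))
```

A term that appears in a single document, such as "orbit", gets a higher idf than "car", which appears in two of the three documents.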

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10, max_features=180000, tokenizer=tokenize, ngram_range=(1, 2))

In [None]:
all_desc = dataframe['text'].values
vz = vectorizer.fit_transform(list(all_desc))

`vz` is a tf-idf matrix where:
* the number of rows is the total number of descriptions
* the number of columns is the number of unique terms (unigrams and bigrams) kept after the `min_df` and `max_features` filters

In [None]:
# create a DataFrame mapping each token to its IDF value
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index')
tfidf.columns = ['idf']

Below are the 10 tokens with the **lowest IDF score**. Unsurprisingly, these are very generic words that we could not use to distinguish one description from another.

In [None]:
tfidf.sort_values(by=['idf'], ascending=True).head(10)

Below are the 10 tokens with the **highest IDF score**. These words are much more specific, so by looking at them we could guess the categories they belong to: 

In [None]:
tfidf.sort_values(by=['idf'], ascending=False).head(10)

Given the high dimensionality of our tf-idf matrix, we first reduce it using Singular Value Decomposition (SVD). To visualize the result, we then use t-SNE to reduce the dimension from the 30 SVD components to 2. t-SNE is best suited to reducing data to 2 or 3 dimensions. 

#### **t-Distributed Stochastic Neighbor Embedding (t-SNE)**

t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The goal is to take a set of points in a high-dimensional space and find a representation of those points in a lower-dimensional space, typically the 2D plane. It is based on probability distributions with random walk on neighborhood graphs to find the structure within the data. Because t-SNE is computationally expensive, we usually apply another dimensionality reduction technique before running it.

First, let's take a sample from our focused column `text`, since t-SNE can take a very long time to execute. We can then reduce the dimension of each vector to `n_components` (30) using SVD.

In [None]:
sample_sz = 10000

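dataframe_sample = dataframe.sample(n=sample_sz, random_state=42)
# reuse the vocabulary fitted on the full corpus so that the columns of
# vz_sample match those of vz (refitting here would change the feature indices)
vz_sample = vectorizer.transform(list(dataframe_sample['text']))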

In [None]:
from sklearn.decomposition import TruncatedSVD

n_comp= 30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)

In [None]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

In [None]:
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

It's now possible to visualize our data points. Note that in a t-SNE plot, the spread and the size of the clusters carry little information.

In [None]:
output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600, title="tf-idf clustering of 'text'", tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave", x_axis_type=None, y_axis_type=None, min_border=1)

In [None]:
dataframe_sample.reset_index(inplace=True, drop=True)

tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
tfidf_df['text'] = dataframe_sample['text']
tfidf_df['tokens'] = dataframe_sample['tokens']
tfidf_df['title'] = dataframe_sample['title']

In [None]:
plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"text": "@text", "tokens": "@tokens", "title":"@title"}
show(plot_tfidf)

In [None]:
from IPython.display import Video
Video("./resources/Topic_model_scheme.webm")
# Video("https://upload.wikimedia.org/wikipedia/commons/7/70/Topic_model_scheme.webm")

#### 2.2. K-means Clustering

The objective of K-means clustering is to minimize the average squared Euclidean distance of the documents / texts from their cluster centroids. 
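
To make the objective concrete, here is a minimal pure-Python sketch of the classic assign-and-update loop on six 2D points. It is not the mini-batch variant used in this notebook, and the point coordinates, starting centroids, and helper names are all made up for illustration.

```python
def squared_distance(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    # label each point with the index of its nearest centroid
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    # move each centroid to the mean of the points assigned to it
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster where it was
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def objective(points, labels, centroids):
    # average squared distance of each point to its own centroid
    total = sum(squared_distance(p, centroids[label])
                for p, label in zip(points, labels))
    return total / len(points)

toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
toy_centroids = [(0.0, 0.0), (10.0, 9.0)]  # arbitrary starting guesses

for _ in range(5):
    toy_labels = assign(toy_points, toy_centroids)
    toy_centroids = update(toy_points, toy_labels, toy_centroids)

print(toy_labels, objective(toy_points, toy_labels, toy_centroids))
```

Each pass can only lower (or keep) the objective, which is why the loop settles after the first iteration on this well-separated toy data.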

In [None]:
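# import first so that the help lookup can resolve the name
from sklearn.cluster import MiniBatchKMeans
?MiniBatchKMeans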

In [None]:
from sklearn.cluster import MiniBatchKMeans

num_clusters = 20 # number of clusters; needs to be chosen carefully
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1, init_size=1000, batch_size=1000, verbose=0, max_iter=1000)

kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()

for i in range(num_clusters):
    print("Cluster %d:" % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        try:
            aux += terms[j] + ' | '
        except:
            pass
    print(aux)
    print() 

In order to plot these clusters, we first need to reduce the dimension of the distances to 2 using t-SNE: 

In [None]:
# repeat the same steps for the sample
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)

In [None]:
colormap = np.array(["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9", "#68af4e", "#6e6cd5",
"#e3be38", "#4e2d7c", "#5fdfa8", "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053", "#5e9981",
"#803a62", "#9b9e39", "#c88cca", "#e1c37b", "#34223b", "#bdd8a3", "#6e3326", "#cfbdce", "#d07d3c",
"#52697d", "#194196", "#d27c88", "#36422b", "#b68f79"])

kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['text'] = dataframe_sample['text']
kmeans_df['title'] = dataframe_sample['title']

plot_kmeans = bp.figure(plot_width=700, plot_height=600, title="KMeans clustering of 'text'", tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave", x_axis_type=None, y_axis_type=None, min_border=1)
source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'], color=colormap[kmeans_clusters], text=kmeans_df['text'], title=kmeans_df['title'], cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"text": "@text", "title": "@title", "cluster":"@cluster" }
show(plot_kmeans)

These are some of the approaches to do Topic Modeling:
- **Latent Dirichlet Allocation**
- Hierarchical Dirichlet process (HDP)
- Gibbs Sampling Dirichlet Mixture Model (GSDMM)
- Nonnegative Matrix Factorization (NMF)
- Latent Semantic Analysis (LSA/LSI)
- Probabilistic Latent Semantic Analysis (pLSA)
- Explicit semantic analysis (ESA)

#### 2.3 **Latent Dirichlet Allocation**

Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus.

>  LDA starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.
> 
> Reference: https://medium.com/intuitionmachine/the-two-paths-from-natural-language-processing-to-artificial-intelligence-d5384ddbfc18
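
The quoted description can be read as a generative story: pick a topic for each word position according to the document's topic mixture, then pick a word from that topic's word distribution. The sketch below only illustrates that story; the two topics, their word probabilities, and the function name `generate_document` are invented for the example and are not learned from the data.

```python
import random

# Hypothetical topics: each one is a probability distribution over words
toy_topics = {
    "autos": {"car": 0.5, "engine": 0.3, "road": 0.2},
    "space": {"orbit": 0.4, "launch": 0.4, "engine": 0.2},
}

def generate_document(topic_mixture, n_words, rng):
    # topic_mixture: the document's distribution over topics
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # 1. draw a topic for this word position
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # 2. draw a word from the chosen topic's word distribution
        vocab = list(toy_topics[topic])
        word_weights = [toy_topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

rng = random.Random(42)
toy_document = generate_document({"autos": 0.8, "space": 0.2}, n_words=10, rng=rng)
print(toy_document)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and word distributions that most plausibly produced them.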

Its input is a **bag of words**: each document is represented as a row, and each column contains the count of one vocabulary term in that document. We are going to use a powerful tool called pyLDAvis that gives us an interactive visualization for LDA. 
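
For intuition, a bag-of-words matrix can be built by hand for two tiny documents. This is a simplified sketch with made-up tokens; the notebook uses `CountVectorizer` for the real data.

```python
from collections import Counter

# Two toy documents, already tokenized
toy_docs = [["car", "engine", "car"], ["space", "orbit"]]

# Column order: the sorted set of all terms in the toy corpus
vocabulary = sorted({term for doc in toy_docs for term in doc})

# One row per document, one column per vocabulary term, values are raw counts
bow = [[Counter(doc)[term] for term in vocabulary] for doc in toy_docs]

print(vocabulary)
for row in bow:
    print(row)
```

Word order inside each document is discarded; only the counts survive, which is what the name "bag of words" refers to.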

<img src="./resources/LDA.PNG">

### How to choose hyperparameters?

There are primarily three hyperparameters that directly influence the output: the number of topics (`n_components`) and the two priors described below.

###### *doc_topic_prior : float, optional (default=None)*
Prior of document topic distribution theta. If the value is None, defaults to 1 / n_components. This is called `alpha`.

###### *topic_word_prior : float, optional (default=None)*
Prior of topic word distribution `beta`. If the value is None, defaults to 1 / n_components.
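
To get a feel for what such a prior does, the sketch below draws topic mixtures from a symmetric Dirichlet distribution for a small and a large concentration value and compares how dominant the largest topic is on average. It uses only the standard library (normalized Gamma draws give a Dirichlet sample); the function names and the values 0.1 and 5.0 are arbitrary choices for illustration, not recommendations.

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    # A symmetric Dirichlet sample: normalize independent Gamma(alpha, 1) draws
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, n_topics=20, n_docs=2000, seed=0):
    # Average weight of the single most dominant topic across simulated documents
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, n_topics, rng))
               for _ in range(n_docs)) / n_docs

sparse_peak = mean_largest_share(alpha=0.1)  # small prior: few topics per document
dense_peak = mean_largest_share(alpha=5.0)   # large prior: topics spread evenly

print(sparse_peak, dense_peak)
```

A small value concentrates each simulated document on a handful of topics, while a large value spreads the weight almost evenly across all of them.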

<img src="./resources/LDA_Hyperparameters.PNG">

In [None]:
?LatentDirichletAllocation
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [None]:
cvectorizer = CountVectorizer(min_df=4, max_features=180000, tokenizer=tokenize, ngram_range=(1,2))
cvz = cvectorizer.fit_transform(dataframe_sample['text'])
lda_model = LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20, random_state=42)
X_topics = lda_model.fit_transform(cvz)

n_top_words = 20
topic_summaries = []

topic_word = lda_model.components_  # get the topic words
vocab = cvectorizer.get_feature_names()

#### LDA Results

LDA treats each document as a mixture of multiple topics.

<img src="./resources/LDA_Results.png">

In [None]:
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

Inference:

#### How to interpret the results

<img src="./resources/LDA_Space.png">

In [None]:
# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(X_topics)

In [None]:
# normalize each row so that the topic weights of a document sum to 1
unnormalized = np.asarray(X_topics)
doc_topic = unnormalized / unnormalized.sum(axis=1, keepdims=True)

# dominant topic (highest weight) for each document
lda_keys = doc_topic.argmax(axis=1).tolist()

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['description'] = dataframe_sample['text']
lda_df['category'] = dataframe_sample['title']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)

plot_lda = bp.figure(plot_width=700,
                     plot_height=600,
                     title="LDA topic visualization",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

In [None]:
source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    description=lda_df['description'],
                                    topic=lda_df['topic'],
                                    category=lda_df['category']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"description":"@description",
                "topic":"@topic", "category":"@category"}
show(plot_lda)

In [None]:
def prepareLDAData():
    data = {
        'vocab': vocab,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
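        # corpus-wide count of each term (vocabulary_ only maps terms to column indices)
        'term_frequency': np.asarray(cvz.sum(axis=0)).ravel(),
        # pyLDAvis expects each topic's term distribution to sum to 1
        'topic_term_dists': lda_model.components_ / lda_model.components_.sum(axis=1, keepdims=True)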
    } 
    return data

In [None]:
import pyLDAvis

lda_df['len_docs'] = dataframe_sample['tokens'].map(len)
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)
pyLDAvis.display(prepared_data)

### To summarize:

A general rule of thumb for Topic Modeling using LDA:

<img src="./resources/LDA_Steps.png">

### Some general Use-cases:

1. Text categorization problems where the labels provided by the business are not very reliable.
2. Recommender engines that suggest similar articles or books to potential customers.
3. Clustering the text descriptions of a fixed set of products when no label information is available.

## **References**

1. [Topic Modeling Wikipedia Page](https://en.wikipedia.org/wiki/Topic_model)
2. [Topic Modeling in Python: Latent Dirichlet Allocation (LDA); Author - Shashank Kapadia](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)
3. [Beginners Guide to Topic Modeling in Python; Author - Shivam Bhansal](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)
4. [Topic Modeling with Gensim (Python); Author - Selva Prabhakaran](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
5. [Python for NLP: Topic Modeling; Author - Usman Malik](https://stackabuse.com/python-for-nlp-topic-modeling/)
6. [Mercari Interactive EDA + Topic Modelling; Author - ThyKhuely](https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling?scriptVersionId=1923301)
7. [Topic Modeling for The New York Times News Dataset; Author - Moorissa Tjokro](https://towardsdatascience.com/topic-modeling-for-the-new-york-times-news-dataset-1f643e15caac)
8. [LDA Topic Models (YouTube Video); Author - Andrius Knispelis](https://www.youtube.com/watch?v=3mHy4OSyRf0)

In [None]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')