<center><b>DIGHUM101</b></center>
<center>5-2: Topic Modeling</center>

---

# Learning objectives 

- Preprocess text data
- Create an LDA topic model using scikit-learn
- Visualize and interpret a topic model using pyLDAvis

In [None]:
# Install new libraries if needed

# !pip install wordcloud
# !pip install pyldavis

In [None]:
# Import libraries

from collections import Counter # Count most common words
%matplotlib inline
import nltk # natural language toolkit
from nltk.corpus import stopwords
import numpy as np 
import os
import pandas as pd
import pyLDAvis.sklearn # visualize our topic models! (this module is named pyLDAvis.lda_model in pyLDAvis 3.4 and later)
import re # regular expressions
# Preprocessing
import gensim
# Algorithms (unsupervised)
from sklearn.decomposition import LatentDirichletAllocation
# Tools to create our DTMs
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
# Visualize word clouds 
from wordcloud import WordCloud
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

# Topic modeling

There are many topic modeling algorithms, but we'll start with [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). This is a standard **unsupervised** machine learning text-mining tool that can be used to discover abstract "topics" contained within texts.

See [this cool animation](https://en.wikipedia.org/wiki/File:Topic_model_scheme.webm) on Wikipedia to get an idea of how topic modeling works.


## Vocabulary

- **Topic Modeling:** A general class of statistical models that uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
    
- **LDA:** Latent Dirichlet Allocation. A particular model for topic modeling. It does not take document order into account, unlike some other topic modeling algorithms. Also see word2vec and BERT! (Week 5)

As in the rest of this class, the goal is not to learn everything about topic modeling. Instead, this notebook provides starter code for running some simple models, so that you can use this base of knowledge to explore further. Use the `sklearn` help files, Stack Overflow, and Google searching to review and learn more about what the code is doing and how to go further. 

Can you make this code work for your own data? Can you tweak the parameters to get better output?

# Create a dataframe from individual text files

You've gathered a bunch of text files, so now what? It is useful to get these files into a dataframe. Python does not make this terribly easy for the beginner, so use the boilerplate code below to help you.

Let's concatenate the eleven text files in the "Data/human-rights/" folder into a dataframe so we can manipulate that text like we have seen in the previous few notebooks.

In [None]:
# Where am I?
%pwd

In [None]:
# Define a variable holding the list of file names in the directory containing the text files
# Go two directories up (../../) 
# and into the Data directory
# then into the human-rights subdirectory
dir_path = os.listdir("../../Data/human-rights/")

# View the contents of this directory
dir_path

In [None]:
# Designate an empty dictionary to store the filename and text as columns
for_dataframe = {}

# Loop through the directory of text files and open and read them
for file in dir_path:
    with open("../../Data/human-rights/" + file, "r", encoding="utf-8") as to_open:
         for_dataframe[file] = to_open.read()
            
# Create the dataframe with two columns - the file name and the text itself
human_rights = (pd.DataFrame.from_dict(for_dataframe, 
                                       orient = "index")
                .reset_index().rename(index = str, 
                                      columns = {"index": "File", 0: "Text"}))
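If the folder might contain files other than `.txt`, a slightly more defensive way to collect the texts is sketched below. It uses only the standard library; the helper name `read_text_files` and the throwaway demo folder are made up for illustration and are not part of the course data.

```python
from pathlib import Path
import tempfile

def read_text_files(folder):
    """Return {filename: text} for every .txt file in `folder`, sorted by name."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts

# Demo on a temporary folder with made-up files (not the human-rights data)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo = read_text_files(tmp)

print(demo)
```

The resulting dictionary can be turned into the same two-column layout with `pd.DataFrame(list(demo.items()), columns = ["File", "Text"])`.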

In [None]:
human_rights

# Review - manipulate and explore text

In [None]:
# Check out text of one row to make sure it looks okay...
human_rights.iloc[0,1][:1000]

# Basic preprocessing

Preprocess the text! What else might you want to do that is not included here? Lemmatization? 

In [None]:
human_rights["Text_processed"] = human_rights["Text"].apply(gensim.utils.simple_preprocess)

human_rights["Text_processed"] 
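To get a feel for what this kind of preprocessing does to a single string, here is a rough standard-library approximation: lowercase the text, keep alphabetic runs, and drop very short or very long tokens. The function `simple_tokens` is a hypothetical stand-in written for this example, not gensim's actual implementation, and its output may differ from `simple_preprocess` in edge cases.

```python
import re

def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, drop tokens that are too short or too long."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "Everyone has the right to life, liberty and security of person (Article 3)."
print(simple_tokens(sample))
```

Notice that punctuation and the digit disappear, but stopwords such as "the" and "of" are still there, and "has" is not reduced to "have". Stopword removal and lemmatization are separate steps.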

In [None]:
# Using gensim for preprocessing using .apply()
processed = human_rights["Text"].apply(gensim.utils.simple_preprocess)

# Stopword removal using NLTK stopword list and a lambda function
# (load the list once, as a set, instead of reloading it for every word)
stop = set(stopwords.words('english'))
no_stop = processed.apply(lambda x: [w for w in x if w not in stop]) 

# Convert list back to str
human_rights["Text_processed"] = [' '.join(t) for t in no_stop]

In [None]:
human_rights

In [None]:
human_rights['Text_processed'][0][:1000]

In [None]:
# Get top-10 words

hr_str = ' '.join(human_rights['Text_processed'].tolist())
hr_tok = hr_str.split()
hr_freq = Counter(hr_tok)

# Print the 10 most common words
hr_df = pd.DataFrame(hr_freq.most_common(10), columns = ["Word", "Frequency"])
hr_df

In [None]:
# Save to csv!
human_rights.to_csv('../../Data/human_rights.csv', index=False)

# Define a BOW model

In [None]:
# Define an empty bag (of words)
vectorizer = CountVectorizer()

# Use the .fit method to tokenize the text and learn the vocabulary
vectorizer.fit(human_rights["Text_processed"])

# Create the DTM

Recall that a [document term matrix](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) displays term frequencies or TFIDF scores that occur across a collection of documents. We want to encode the documents into a [sparse matrix](https://sebastianraschka.com/faq/docs/bag-of-words-sparsity.html#:~:text=By%20definition%2C%20a%20sparse%20matrix,as%20a%20word%2Dcount%20vector.&text=Thus%2C%20if%20most%20of%20your,most%20likely%20sparse%20as%20well!) to represent the frequencies or TFIDF scores of each vocabulary word across the documents.

Again, when the sparse matrix is printed, each entry reads **(document number, term index)   frequency**

In [None]:
# Encode the documents
vector = vectorizer.transform(human_rights["Text_processed"])

print(type(vector))
print(vector.shape)
print(vector) 
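To make the sparse printout less mysterious, the sketch below builds a tiny document term matrix by hand with `Counter` and then lists only its non-zero cells. The three toy documents are invented for illustration, and the code only mimics the idea behind `CountVectorizer` rather than reproducing it.

```python
from collections import Counter

docs = ["human rights are universal",
        "rights and freedoms",
        "universal declaration of human rights"]

# Vocabulary: every distinct word, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

# Dense DTM: one row per document, one column per vocabulary term
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

# Sparse-style view: keep only non-zero cells as (document, term index, count)
sparse_view = [(d, t, n) for d, row in enumerate(dtm)
               for t, n in enumerate(row) if n > 0]

print(vocab)
for row in dtm:
    print(row)
print(sparse_view[:4])
```

Most cells in a real DTM are zero, which is why storing only the non-zero cells saves so much memory.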

In [None]:
# View as a multidimensional array before converting to data frame
# Rows are the documents, columns are the terms

print(vector.toarray())

In [None]:
# Preview the terms

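# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
vectorizer.get_feature_names_out()[0:10]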

# Define a bigram bag of words

In [None]:
# Note we are passing a regular expression as the token_pattern argument

bigram_vectorizer = CountVectorizer(ngram_range = (1,2),
                                    stop_words = "english",
                                    token_pattern = r'\b\w+\b', 
                                    min_df = 1)

bigram_vectorizer

In [None]:
# Analyze string in the bigram bag of words

analyze = bigram_vectorizer.build_analyzer()
vocab = analyze(hr_str)

vocab[0:10]
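With `ngram_range = (1,2)` the vocabulary contains both single words and pairs of adjacent words. A minimal pure-Python sketch of that idea follows; the helper `ngrams` is hypothetical and skips the tokenizing and stopword steps that the real analyzer performs.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams between n_min and n_max words long, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams(["universal", "human", "rights"]))
```

Three words yield three unigrams and two bigrams, so the vocabulary grows quickly once bigrams are included.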

In [None]:
# Show the 20 most common terms
freq = Counter(vocab)
stop_df = pd.DataFrame(freq.most_common(20), columns = ["Word", "Frequency"])
stop_df

In [None]:
# Define a word cloud variable
cloud = WordCloud(background_color = "white", 
                  max_words = 20, 
                  contour_width = 5, 
                  width = 600, height = 300, 
                  random_state = 5)

# Process the word cloud
cloud.generate(hr_str)

# Visualize!
cloud.to_image()

Learn about using [custom colors here](https://amueller.github.io/word_cloud/auto_examples/a_new_hope.html)

In [None]:
# Visualize word frequencies in a horizontal bar plot

sns.barplot(x = "Frequency",
            y = "Word",
            data = stop_df,
            orient = "h");

# Finally! Fit the topic model

The input to LDA should be a DTM.

In [None]:
# Predetermine the number of topics

n_topics = 5

In [None]:
# CountVectorizer to create the DTM, using some arguments to filter words!

tf_vectorizer = CountVectorizer(max_df = 0.90, # ignore terms that appear in more than 90% of the documents 
                                max_features = 500, # keep only the 500 most frequent words across all documents
                                stop_words = "english") # use scikit-learn's built-in English stopword list

# Fit on the documents (one string per document), not on the flat list of tokens
cv = tf_vectorizer.fit_transform(human_rights["Text_processed"])

[Check out this question](https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer) to learn more about the `max_df` and `min_df` arguments. 
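The filtering idea behind these two arguments can be sketched in plain Python. In this simplified version `min_df` is an absolute document count and `max_df` is a proportion of documents (scikit-learn accepts either form for both arguments). The helper `filter_by_doc_freq` and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=1, max_df=1.0):
    """Keep terms found in at least min_df documents and at most max_df (a share) of them."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term at least once
    df = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, n in df.items() if min_df <= n <= max_df * n_docs)

docs = ["rights law", "rights treaty", "rights law court", "rights"]

print(filter_by_doc_freq(docs, max_df=0.90))            # drops "rights", found in every document
print(filter_by_doc_freq(docs, min_df=2, max_df=0.90))  # also drops terms found in only one document
```

Terms that appear in nearly every document tell us little about what separates topics, and terms that appear in only one document give the model very little to work with.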

Finally, let's run our LDA model! Remember that LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and words in topics. We are using raw frequencies here but could also use TFIDF. This would increase the chance of rare words being sampled, making them have a stronger influence in topic assignment. Try it out if you feel like it!
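To see why TFIDF shifts weight toward rarer words, here is a small sketch using a textbook formula: raw count times log(N / document frequency). This is an assumption made for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ. The toy documents are invented.

```python
import math
from collections import Counter

docs = [["rights", "law", "rights"], ["rights", "treaty"], ["rights", "court"]]
n_docs = len(docs)

# Document frequency of each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw count times log(N / document frequency) for each term in one document."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document "rights" occurs twice but appears in every document, so its weight collapses to zero, while the rarer "law" ends up with the larger weight.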

In [None]:
# Instantiate our LDA model (this might take a minute or two)
lda = LatentDirichletAllocation(n_components = n_topics, 
                                max_iter = 20, # the maximum number of passes over the training data (aka epochs) 
                                random_state = 42) # setting random_state creates replicable results 
lda = lda.fit(cv)
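A fitted model holds both outputs described in the Vocabulary section: `components_` has one row of word weights per topic, and `transform()` returns one row of topic weights per document. The sketch below checks this on a four-document toy corpus invented for illustration. The variable names are hypothetical and the exact weights will depend on your scikit-learn version.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["court judge trial law court",
            "trial law judge verdict",
            "school teacher student education",
            "student education school classroom"]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components = 2, random_state = 0).fit(toy_dtm)

# One row per document, one column per topic
doc_topics = toy_lda.transform(toy_dtm)

print(doc_topics.shape)            # (documents, topics)
print(doc_topics.sum(axis = 1))    # each document's topic weights sum to 1
print(toy_lda.components_.shape)   # (topics, vocabulary terms)
```

Running `lda.transform(cv)` on the notebook's own model gives the same kind of table for the human rights documents.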

In [None]:
# Here is a function to print out the top words for each topic in a pretty way:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx+1))
        print(", ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
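The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but cryptic. Here is a small sketch of what it does, using made-up weights and terms:

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.3, 0.9])   # made-up word weights for one topic
terms = ["law", "rights", "court", "treaty"]
n_top = 2

order = weights.argsort()       # indices sorted from smallest to largest weight
top = order[:-n_top - 1:-1]     # walk backwards from the end, taking n_top indices

print(order)
print([terms[i] for i in top])
```

An equivalent and arguably more readable spelling is `np.argsort(-weights)[:n_top]`.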

In [None]:
# Return the topics
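# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
tf_feature_names = tf_vectorizer.get_feature_names_out()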
print_top_words(lda, tf_feature_names, 20)

In [None]:
panel = pyLDAvis.sklearn.prepare(lda_model = lda, 
                                 dtm = cv, 
                                 vectorizer = tf_vectorizer, 
                                 mds = "tsne") # method for dimensionality reduction

pyLDAvis.display(panel)

# Interpreting pyLDAvis output
- Similar topics should appear close together on the plot; dissimilar topics should appear far apart. 
- The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.

## Salience
When no topic is selected in the plot on the left, the right bar chart shows the top-30 most **salient** terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
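As a rough sketch, saliency can be computed as a term's overall probability multiplied by its distinctiveness, where distinctiveness measures how far the term's topic distribution is from the overall topic distribution (a KL divergence). This follows the definition in the Termite paper linked under Challenge 2. The function and all probabilities below are made up for illustration.

```python
import math

def saliency(p_w, p_topic_given_w, p_topic):
    """P(term) times the KL divergence between P(topic | term) and P(topic)."""
    distinctiveness = sum(ptw * math.log(ptw / pt)
                          for ptw, pt in zip(p_topic_given_w, p_topic) if ptw > 0)
    return p_w * distinctiveness

p_topic = [0.5, 0.5]   # made-up overall topic proportions

spread = saliency(0.05, [0.5, 0.5], p_topic)      # frequent, but split evenly across topics
focused = saliency(0.02, [0.95, 0.05], p_topic)   # rarer, but concentrated in one topic

print(spread, focused)
```

The evenly spread term gets a saliency of zero no matter how frequent it is, because seeing it tells you nothing about which topic you are in.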

## Probability vs. exclusivity
When you select a particular topic, this bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart:

* Setting λ close to 1.0 (the default) will rank the terms according to their probability within the topic.
* Setting λ close to 0.0 will rank the terms according to their "distinctiveness" or "exclusivity" within the topic. This favors terms that occur mostly in this topic and rarely in other topics.

You can move the slider between 0.0 and 1.0 to weigh term probability and exclusivity.
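The relevance score behind the slider can be written as λ · log P(term | topic) + (1 − λ) · log(P(term | topic) / P(term)). The sketch below applies it to two terms with made-up probabilities to show how the ranking flips as λ moves from 1 to 0. The names `relevance`, `rank`, and `terms` are hypothetical.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Weighted mix of in-topic probability and lift over the corpus-wide probability."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up values: (P(term | topic), P(term) in the whole corpus)
terms = {"rights": (0.10, 0.08),    # common in this topic, but common everywhere
         "asylum": (0.04, 0.005)}   # less common, but concentrated in this topic

def rank(lam):
    """Terms ordered from most to least relevant for a given lambda."""
    return sorted(terms, key = lambda t: relevance(*terms[t], lam), reverse = True)

print(rank(1.0))   # pure probability
print(rank(0.0))   # pure exclusivity
```

Intermediate values of λ blend the two rankings, which is why moving the slider often surfaces more interpretable words than either extreme.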

# Challenge 1

1. What is a topic in LDA? 
2. What is the relevance metric lambda in the pyLDAvis plot?
3. What do you know about the eleven human rights documents we used to do this exercise? 
4. Are the topics similar in size in the left plot? Why might that be?
5. Plug in your own data! 

# Challenge 2

Read up on LDA and its visualizations by following the links below:
- https://www.objectorientedsubject.net/2018/08/experiments-on-topic-modeling-pyldavis/
- http://www.cs.columbia.edu/~blei/papers/ChaneyBlei2012.pdf
- https://shravan-kuchkula.github.io/topic-modeling/#lda-results
- https://markroxor.github.io/gensim/static/notebooks/gensim_news_classification.html
- http://vis.stanford.edu/files/2012-Termite-AVI.pdf