![Save2Drive](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/save2drive.png)

# Outline

0. Instructions
1. Intro to topic modeling reading (~5 minutes)
2. Setup
3. Preview
4. Load data
5. First preprocessing experiment: implement a preprocessing function
6. Create your first topic model
7. Qualitative observations of the first topic model
8. Quantitative observations of the first topic model
9. Second preprocessing experiment: implement a different preprocessing function and run the topic modeling pipeline on the newly preprocessed data
10. Parameter Experiment
11. Conclusions

# Instructions

1. This notebook will guide you through each step of the project. 
2. Throughout the project, you will record your experiments and observations. We have created a slide deck template that you can download and edit in Google Slides or Microsoft PowerPoint. The template is at **`Experiment-Report-Templates/2-Topic-Modeling-Template.pptx`**.
3. We are happy to help with *anything*. Ask away :) 

# Introduction to Topic Modeling Reading

![TM](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tm-0.png)

![TM](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tm-1.png)
![TM](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tm-2.png)
![TM](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tm-3.png)
![TM](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tm-4.png)

# Run the cell below to get set up

In [None]:
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  !rm -rf AI4All2020-Michigan-NLP  # -f avoids an error when the folder does not exist yet
  !git clone https://github.com/alahnala/AI4All2020-Michigan-NLP.git
  !cp -r AI4All2020-Michigan-NLP/utils/ .
  !cp -r AI4All2020-Michigan-NLP/Data/ .
  !cp -r AI4All2020-Michigan-NLP/slides/ .
  !cp -r AI4All2020-Michigan-NLP/Experiment-Report-Templates/ .
  !echo "=== Files Copied ==="

import pandas as pd
# used to visualize the topic model
%pip install --upgrade gensim
%pip install pyldavis
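import pyLDAvis
try:
    import pyLDAvis.gensim_models  # module name in pyLDAvis 3.3 and later
except ImportError:
    import pyLDAvis.gensim  # module name in older pyLDAvis releases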
import nltk
nltk.download('punkt')
from nltk.stem.snowball import PorterStemmer
from utils.nlp_basics import *
from utils.syllable import *

import utils.tm_helpers as helpers

# general
from tqdm import tqdm
import os
import regex as re

# preprocess functions
from nltk.tokenize import word_tokenize
import spacy
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


# topic modeling packages
import gensim
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel

print('Done')

# Preview

By the end of this project, you will have experimented with preprocessing steps and generated an interactive visual to explore the topics. Run the cell below to get a preview of the end goal.

In [None]:
helpers.show_model()

# Data

## Load data

Run the cell below to load the data that we will use for topic modeling. The data is saved into a *list* data structure called `data`.

In [None]:
data = helpers.load_data()

**To get an idea of what `data` looks like, run the cell below to see what the first ten items look like.**

In [None]:
print("Data length:", len(data), "First 10 items in data:\n")

# data is a list, data[:10] is the first ten items of that list
for i, item in enumerate(data[:10]):
    print("{}:".format(i), item)

# Preprocessing Experiment 1

## Write the function `preprocess_line`.
1. At this point, *preprocessing* is the main piece to experiment with, so that you can observe how your implementation decisions affect the topic model. Using at least one new thing that you learned in 1-Intro-to-NLP, write the function `preprocess_line`. It takes a string of text in the input variable `line` and returns a list of tokens called `preprocessed_line`.
2. At the bottom of the cell, we call `preprocess_line` on `test_string` so you can quickly observe how your function is working. `test_string` is initially set to be the first string from `data`, but you can change `test_string` to another string from `data` or your own string to get a fuller idea of the effects of your function.
3. Take a second to record your decisions in the slide titled **Preprocessing Experiment 1**.


In [None]:
def preprocess_line(line):
    '''
    Fill in this function. Refer to 1-Intro-to-NLP for preprocessing ideas.
    '''
    preprocessed_line = line.split()

    
    return preprocessed_line

test_string = data[0]
print(preprocess_line(test_string))
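
If you are unsure where to start, the sketch below shows one possible direction: lowercasing, keeping only alphabetic tokens, and dropping very common words. It is only an illustration, not the expected answer. The function name `example_preprocess_line` and the tiny stopword set `EXAMPLE_STOPWORDS` are made up for this example, and the test sentence is not taken from `data`.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real experiment would likely use a fuller list.
EXAMPLE_STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def example_preprocess_line(line):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    # re.findall returns every run of lowercase letters, so digits and
    # punctuation are discarded
    tokens = re.findall(r"[a-z]+", line.lower())
    return [tok for tok in tokens if tok not in EXAMPLE_STOPWORDS]

print(example_preprocess_line("The history of the Roman Empire, in 3 volumes."))
# ['history', 'roman', 'empire', 'volumes']
```

Each choice here (removing numbers, removing stopwords) is a decision you could keep, drop, or replace, which is exactly what the experiment slides ask you to record.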

## Run `preprocess` on all of the data.
1. Run the cell below to `preprocess` all of the strings in `data` and save the result to `preprocessed_data`.
2. The cell will output a preview of our `preprocessed_data`. How does it look different from our earlier preview? Do you have a bug, or does it look how you want it to look?
3. Take a second to record your observations in the slide deck, in the slide titled **Preprocessing Output Observations**.

In [None]:
def preprocess(data):
    preprocessed_data = []
    for line in tqdm(data):
        preprocessed_line = preprocess_line(line)
        preprocessed_data.append(preprocessed_line)
    return preprocessed_data
preprocessed_data = preprocess(data)

# preprocessed_data is a list, preprocessed_data[:10] is the first ten items of that list
for i, item in enumerate(preprocessed_data[:10]):
    print("{}:".format(i), item)
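
Looking at ten items is a good start, but a few summary numbers can also help you spot bugs, such as items that became empty after preprocessing. The sketch below is one way to do that. The helper name `summarize_tokens` and the toy input are assumptions made for this example; you could call the function on `preprocessed_data` instead.

```python
from collections import Counter

def summarize_tokens(token_lists):
    """Return basic statistics for a list of token lists."""
    # Count every token across all documents
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        "documents": len(token_lists),
        "empty_documents": sum(1 for tokens in token_lists if not tokens),
        "vocabulary_size": len(counts),
        "most_common": counts.most_common(3),
    }

# Toy data standing in for preprocessed_data
toy = [["cat", "sat", "mat"], [], ["cat", "cat", "ran"]]
print(summarize_tokens(toy))
```

A large number of empty documents, or a most-common list full of words like "the", would both be worth noting in your observations.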

## Final data preparation for gensim topic modeling

1. In this step, we use functions from the `gensim` module to create `dictionary` and `corpus` in the format that the topic modeling function requires.
2. `dictionary`: This is our *set of words* after our preprocessing, which we will use for creating different probability distributions. This *set of words* is often referred to as the *vocabulary* in NLP terminology.
3. `corpus`: Words are typically converted into a numerical representation before using them in a model. To demonstrate this, the cell will print each word for the first item in your preprocessed data next to its numerical encoding. Why is the word **information** represented as **(3, 1)**?

In [None]:
print("Creating dictionary and corpus instances for gensim...", end='')

dictionary = corpora.Dictionary(preprocessed_data)
corpus = [dictionary.doc2bow(x) for x in preprocessed_data]

print("complete.\n")

print(dictionary, '\n')


# corpus[0] holds (token_id, count) pairs sorted by id, so look each word up by its id
for token_id, count in corpus[0]:
    print("Original: {}".format(dictionary[token_id]), "--->  Encoded: {}".format((token_id, count)))
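
To make the encoding less mysterious, here is a hand-rolled sketch of the same idea in plain Python. It is not gensim's implementation, and the ids that gensim assigns may differ; the function names `build_vocabulary` and `to_bag_of_words` and the toy documents are made up for this example.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign each distinct token an integer id (alphabetical order here)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bag_of_words(tokens, token2id):
    """Encode one document as (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[tok] for tok in tokens)
    return sorted(counts.items())

docs = [["data", "science", "data"], ["science", "fiction"]]
token2id = build_vocabulary(docs)
print(token2id)                             # {'data': 0, 'fiction': 1, 'science': 2}
print(to_bag_of_words(docs[0], token2id))   # [(0, 2), (2, 1)]
```

Notice that word order is lost: the encoding only records which words appear and how often.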

# Create Topic Model


1. Run the cell below to create a topic model for our data. The topic model will be saved in the variable `lda_model`. For now, do not worry about the parameters; we will experiment with those later.

In [None]:
'''
The following lines are used to setup the model's parameters. 
'''
UPDATE_EVERY = 10
CHUNKSIZE = 10
PASSES = 10
topic_model_settings = [{'num-topics':15, 'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':5,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':10,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}]
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']


'''
Here, we plug in our parameters and data into models.LdaModel to train the topic model. The topic model will be saved in the variable "lda_model".
'''
print("Training topic model of {} topics (this might take a moment)...".format(NUM_TOPICS), end='')
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, random_state = setting['parameters']['random_state'], update_every = setting['parameters']['update_every'], chunksize = setting['parameters']['chunksize'], passes = setting['parameters']['passes'], alpha = setting['parameters']['alpha'], per_word_topics = setting['parameters']['per_word_topics'])
print("complete.")


# Qualitative Observations

We have created a topic model and saved it in `lda_model`. But what does this mean? Let's learn more by making some qualitative observations. Later, we will discuss two quantitative metrics, but in practice topic models are often judged by manual inspection.

## Topic Terms
1. Run the cell below to observe the topic terms.  
2. By creating the topic model, we have created a probability distribution over our set of words (`dictionary` or the vocabulary) for each of the 15 topics. We have created a helper function called `show_topic_terms` that will extract the ten words with the highest probability in each topic.
3. The columns of `topic_terms` are the top ten words of each topic (Wn) and Wn's probability of belonging to each topic (Wn Pr). Each row is a topic.
4. What do you think of these topics? Can you see clear divisions between the 15 topics? Would you be able to give a name to each topic? How could you modify the `preprocess_line` function to possibly improve the topic model and get a clearer division between topics?
5. Before you make any changes or proceed, make notes of your observations in the slide **Topic Model Qualitative Observations**. You can include a screenshot of part of this table, or paste the table values into the slides.

In [None]:
topic_terms = helpers.show_topic_terms(lda_model, NUM_TOPICS)
topic_terms
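
The idea of "top ten words of a topic" can be sketched in a few lines. The example below assumes a topic is given as a plain dictionary from word to probability; the toy topic and the function name `top_terms` are made up, and this is not necessarily how `helpers.show_topic_terms` works internally.

```python
def top_terms(word_probs, n=10):
    """Return the n (word, probability) pairs with the highest probability."""
    ranked = sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# A toy topic: probabilities over a five-word vocabulary that sum to 1
toy_topic = {"planet": 0.30, "star": 0.25, "orbit": 0.20, "the": 0.15, "recipe": 0.10}
print(top_terms(toy_topic, n=3))
# [('planet', 0.3), ('star', 0.25), ('orbit', 0.2)]
```

A word like "the" ranking high in many topics is a hint that the preprocessing step could remove it.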

## Topic Model Visualization

1. Run the cell below to generate an interactive visual to explore the topic model.
2. Add observations to the **Topic Model Qualitative Observations** slide.

In [None]:
helpers.show_model(lda_model, corpus, dictionary)

# Quantitative Observations

1. We will compute *perplexity* and *coherence,* which are two quantitative metrics for evaluating topic models. Better scores do not necessarily mean better models in topic modeling. 
2. Add the scores to the **Topic Model 1 Quantitative Observations** slide.


## Perplexity

"In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample ([Wikipedia](https://en.wikipedia.org/wiki/Perplexity))."

Intuition: the dictionary definition of perplexed is "Completely baffled; very puzzled [Dictionary Reference](https://www.lexico.com/en/definition/perplexed)." The more perplexed you are, the more puzzled you are. Use this to remember the intuition behind the *perplexity* metric. We aim for a model that is less perplexed by new data, so a smaller perplexity is usually the goal.
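
A small worked example can make the number concrete. The sketch below uses the standard definition (2 raised to the negative average log2 probability of the observed words) on made-up probabilities; the function name `perplexity` and the two toy lists are assumptions for this example only.

```python
import math

def perplexity(word_probabilities):
    """Perplexity given the probability a model assigned to each observed word."""
    avg_log2 = sum(math.log2(p) for p in word_probabilities) / len(word_probabilities)
    return 2 ** (-avg_log2)

confident = [0.5, 0.5, 0.5, 0.5]    # model gave each observed word probability 1/2
unsure = [0.25, 0.25, 0.25, 0.25]   # model gave each observed word probability 1/4
print(perplexity(confident))  # 2.0
print(perplexity(unsure))     # 4.0
```

One way to read the result: a perplexity of 4 means the model was as puzzled as if it were choosing uniformly among 4 words at every step.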


In [None]:
'''
Measure quality of topic models with perplexity
'''
print("Measuring model perplexity...",end="")
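# log_perplexity returns the per-word likelihood bound, not the perplexity itself;
# gensim defines perplexity as 2 ** (-bound), so we convert before reporting
bound = lda_model.log_perplexity(corpus)
ppl = 2 ** (-bound)
print('complete. Per-word bound:', bound, 'Perplexity:', ppl, '\n')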

## Coherence

The coherence metric captures the "degree of semantic similarity between high scoring words in the topic ([Towards Data Science](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0))."

The idea behind the *coherence* metric aligns with your experience observing your topic terms. Did you observe a very clear theme in each topic, or were there a lot of overlapping themes between the topics? Clear themes correspond to higher coherence scores, whereas overlapping or muddled themes correspond to lower coherence scores.
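
To build intuition, here is a deliberately simplified stand-in for coherence: for each pair of top words, it measures how often both appear in the same document. This is not the `c_v` measure used in this notebook, which is more involved; the function name `cooccurrence_score` and the toy documents are made up for this sketch.

```python
from itertools import combinations

def cooccurrence_score(top_words, documents):
    """Average, over word pairs, of the fraction of documents containing both words."""
    doc_sets = [set(doc) for doc in documents]
    pairs = list(combinations(top_words, 2))
    total = 0.0
    for w1, w2 in pairs:
        both = sum(1 for doc in doc_sets if w1 in doc and w2 in doc)
        total += both / len(doc_sets)
    return total / len(pairs)

docs = [
    ["planet", "star", "orbit"],
    ["planet", "orbit"],
    ["recipe", "flour"],
    ["star", "planet"],
]
# Words that share a theme tend to appear together...
print(cooccurrence_score(["planet", "star", "orbit"], docs))
# ...while a mixed bag of words rarely does
print(cooccurrence_score(["planet", "recipe", "flour"], docs))
```

The first topic scores higher than the second, matching the intuition that its words belong together.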

In [None]:
'''
Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = preprocessed_data, corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n') 

# Preprocessing Experiment 2

## Write the function `preprocess_line_2`.
1. As you did with `preprocess_line`, implement the function `preprocess_line_2`. Try to base your new decisions on your previous observations.
2. Test your function on a `test_string`.
3. Take a second to record your decisions in the slide titled **Preprocessing Experiment 2**.


In [None]:
def preprocess_line_2(line):
    '''
    Fill in this function. Refer to 1-Intro-to-NLP for preprocessing ideas.
    '''
    preprocessed_line = line.split()
    

    
    return preprocessed_line

test_string = data[0]
print(preprocess_line_2(test_string))

# Run Topic Modeling pipeline for preprocessing experiment 2

1. Run the following cell to create a new topic model and output the quantitative metrics. Add your observations to the **Topic Model 2 Quantitative Observations** slide.
2. The next cell will output the topic terms. Add your observations to the **Topic Model 2 Qualitative Observations** slide.
3. The third cell will output the interactive visual. Add your observations to the **Topic Model 2 Qualitative Observations** slide.

In [None]:
'''
1. Preprocess all data
'''
def preprocess_2(data):
    preprocessed_data = []
    for line in tqdm(data):
        preprocessed_line = preprocess_line_2(line)
        preprocessed_data.append(preprocessed_line)
    return preprocessed_data
preprocessed_data = preprocess_2(data)

for i, item in enumerate(preprocessed_data[:10]):
    print("{}:".format(i), item)
    
    
'''
2. Create corpus and dictionary
'''   
print("Creating dictionary and corpus instances for gensim...", end='')
dictionary = corpora.Dictionary(preprocessed_data)
corpus = [dictionary.doc2bow(x) for x in preprocessed_data]
print("complete.\n")

'''
3. Create topic model
'''
print("Training topic model of {} topics (this might take a moment)...".format(NUM_TOPICS), end='')
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, random_state = setting['parameters']['random_state'], update_every = setting['parameters']['update_every'], chunksize = setting['parameters']['chunksize'], passes = setting['parameters']['passes'], alpha = setting['parameters']['alpha'], per_word_topics = setting['parameters']['per_word_topics'])
print("complete.")

'''
4. Measure quality of topic model with perplexity
'''
print("Measuring model perplexity...",end="")
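# log_perplexity returns the per-word likelihood bound, not the perplexity itself;
# gensim defines perplexity as 2 ** (-bound), so we convert before reporting
bound = lda_model.log_perplexity(corpus)
ppl = 2 ** (-bound)
print('complete. Per-word bound:', bound, 'Perplexity:', ppl, '\n')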

'''
5. Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = preprocessed_data, corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n') 

In [None]:
'''
6. Print topic terms
'''
topic_terms = helpers.show_topic_terms(lda_model, NUM_TOPICS)
topic_terms

In [None]:
'''
7. Generate interactive visual
'''
helpers.show_model(lda_model, corpus, dictionary)

# Parameter Experiment

1. The model's parameters also affect the quality of the topics.
2. Some people find it fun to learn by experience: trying different values and observing the effect. Others prefer to look at documentation or tutorials for guidance. Reading documentation for code and libraries is another skill worth practicing. Try taking a look at the [Gensim LDA Documentation](https://radimrehurek.com/gensim/models/ldamodel.html) and/or [a tutorial](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) to see what you can learn about the different parameters.
3. Try experimenting with changing at least one parameter in the cell below, and report what parameters you are experimenting with in the **Topic Model Parameter Experiment** slide. (Hint: a good start is NUM_TOPICS)



In [None]:
UPDATE_EVERY = 10
CHUNKSIZE = 10
PASSES = 10
topic_model_settings = [{'num-topics':15, 'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':5,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':10,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}]
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']
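
If you want to try several values, writing each settings dictionary by hand gets repetitive. The sketch below builds dictionaries in the same shape as `topic_model_settings` from a few arguments. The helper name `make_setting` and the variable `example_settings` are made up for this example, and the default values simply mirror the ones in the cell above.

```python
def make_setting(num_topics, passes=10, chunksize=10, update_every=10, random_state=100):
    """Build one settings dictionary in the same shape as topic_model_settings."""
    return {
        'num-topics': num_topics,
        'parameters': {
            'random_state': random_state,
            'update_every': update_every,
            'chunksize': chunksize,
            'passes': passes,
            'alpha': 'auto',
            'per_word_topics': False,
        },
    }

# One setting per candidate number of topics
example_settings = [make_setting(n) for n in (5, 10, 15)]
for s in example_settings:
    print(s['num-topics'], s['parameters']['passes'])
```

Changing only one argument at a time makes it easier to tell which parameter caused a difference in your results.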



## Create Topic Model with New Parameters

Run the cell to create your new topic model.

In [None]:
print("Training topic model (this will take a moment)...", end='')
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, random_state = setting['parameters']['random_state'], update_every = setting['parameters']['update_every'], chunksize = setting['parameters']['chunksize'], passes = setting['parameters']['passes'], alpha = setting['parameters']['alpha'], per_word_topics = setting['parameters']['per_word_topics'])
print("complete.")


## Quantitative Metrics

Report the scores in the **Parameter Experiment Quantitative Observations** slides.

In [None]:
'''
Measure quality of topic model with perplexity
'''
print("Measuring model perplexity...",end="")
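# log_perplexity returns the per-word likelihood bound, not the perplexity itself;
# gensim defines perplexity as 2 ** (-bound), so we convert before reporting
bound = lda_model.log_perplexity(corpus)
ppl = 2 ** (-bound)
print('complete. Per-word bound:', bound, 'Perplexity:', ppl, '\n')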

'''
Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = preprocessed_data, corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n') 

## Qualitative Observations

The columns of `topic_terms` are the top ten words of each topic (Wn) and Wn's probability of belonging to each topic (Wn Pr). 

Report your observations in the **Parameter Experiment Qualitative Observations** slides.

In [None]:
topic_terms = helpers.show_topic_terms(lda_model, NUM_TOPICS)
topic_terms

In [None]:
helpers.show_model(lda_model, corpus, dictionary)

# Conclusions

By now, you have experimented with two preprocessing functions and seen their impact on topic modeling. Spend a few minutes reflecting on what you learned from your experience and put a few notes in the **Topic Modeling Conclusions** slide.

# References

1. https://dl.acm.org/doi/fullHtml/10.1145/2133806.2133826
2. https://nlpforhackers.io/topic-modeling/
3. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/  -- for more analyses of topic models