# Outline

1. Intro to topic modeling reading (~5 minutes)
2. Setup
3. Preview

![TM](./slides/tm-0.png)

![TM](./slides/tm-1.png)
![TM](./slides/tm-2.png)
![TM](./slides/tm-3.png)
![TM](./slides/tm-4.png)

# Setup

Run the cell below to import functions written by staff and functions from libraries. You do not need to worry about the details of these, but if you are curious to see what they look like, check out the file `tm_helpers.py`. If you want to learn more about how these functions work or have questions, let us know! :)

In [1]:
import utils.tm_helpers as helpers

# general
from tqdm import tqdm
import os
import regex as re

# preprocess functions
from nltk.tokenize import word_tokenize
import spacy
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


# topic modeling packages
import gensim
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel

# used to visualize the topic model
import pyLDAvis.gensim
import pyLDAvis



# Preview

By the end of this project, you will have experimented with preprocessing steps and generated an interactive visual to explore the topics. Run the cell below to get a preview of the end goal.

In [23]:
helpers.show_model()

Generating visual, this will take a few moments...


# Data

## Load data

Run the cell below to load the data that we will use for topic modeling. The data is saved into a *list* data structure called `data`.

In [3]:
data = helpers.load_data()

**To get an idea of what `data` looks like, run the cell below to see the first ten items.**

In [26]:
print("Data length:", len(data), "First 10 items in data:\n")

# data is a list, data[:10] is the first ten items of that list
for i, item in enumerate(data[:10]):
    print("{}:".format(i), item)

Data length: 494 First 10 items in data:

0: Computer science is the study of computation and information.

1: Computer science deals with theory of computation, algorithms, computational problems and the design of computer systems hardware, software and applications.

2: Computer science addresses both human-made and natural information processes, such as communication, control, perception, learning and intelligence especially in human-made computing systems and machines.

3: According to Peter Denning, the fundamental question underlying computer science is, What can be automated?Its fields can be divided into theoretical and practical disciplines.

4: Computational complexity theory is highly abstract, while computer graphics and computational geometry emphasizes real-world applications.

5: Algorithmics is called the heart of computer science.

6: Programming language theory considers approaches to the description of computational processes, while software engineering involves the 

# Preprocessing Experiment 1

## Write the function `preprocess_line`.
1. At this point, *preprocessing* is the main piece to experiment with, so that you can observe the effect of your implementation decisions on topic modeling. Using at least one new thing that you learned in 1-Intro-to-NLP, write the function `preprocess_line`, which takes a string of text in the input variable `line` and returns a list of tokens called `preprocessed_line`.
2. At the bottom of the cell, we call `preprocess_line` on `test_string` so you can quickly observe how your function is working. `test_string` is initially set to be the first string from `data`, but you can change `test_string` to another string from `data` or your own string to get a fuller idea of the effects of your function.
3. Take a second to record your decisions in the slide titled **Preprocessing Experiment 1**.


In [29]:
def preprocess_line(line):
    '''
    Fill in this function. Refer to 1-Intro-to-NLP for preprocessing ideas.
    '''
    preprocessed_line = []
    tokens = word_tokenize(line)
    
    # use spacy pipeline
    doc = nlp(" ".join(tokens))
    
    allowed_postags=['NOUN']
    
    # get pos_tags
    pos_tags = [token.pos_ for token in doc]
    
    # get_lemmas, also remove words that aren't in allowed pos tags, also remove stopwords
    lemmas = [token.lemma_ for token in doc if token.pos_ in allowed_postags and not token.is_stop]
    
    preprocessed_line = lemmas
    
    return preprocessed_line

test_string = data[0]
print(preprocess_line(test_string))

['computer', 'science', 'study', 'computation', 'information']
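
For comparison, here is a minimal sketch of a much simpler `preprocess_line` variant that uses only the standard library. The name `simple_preprocess_line` and the tiny `STOPWORDS` set are hypothetical and exist only for illustration. Unlike the spaCy version above, it does no lemmatization or part-of-speech filtering.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "is", "of", "the"}

def simple_preprocess_line(line):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(simple_preprocess_line(
    "Computer science is the study of computation and information."
))
```

Swapping a step in or out of a function like this (for example, adding lemmatization) is one way to run a controlled preprocessing experiment.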


## Run `preprocess` on all of the data.
1. Run the cell below to `preprocess` all of the strings in `data` and save the result to `preprocessed_data`.
2. The cell will output a preview of our `preprocessed_data`. How does it look different from our earlier preview? Do you have a bug, or does it look how you want it to look?
3. Take a second to record your observations in the slide deck, in the slide titled **Preprocessing Output Observations**.

In [28]:
def preprocess(data):
    preprocessed_data = []
    for line in tqdm(data):
        preprocessed_line = preprocess_line(line)
        preprocessed_data.append(preprocessed_line)
    return preprocessed_data
preprocessed_data = preprocess(data)

# data is a list, data[:10] is the first ten items of that list
for i, item in enumerate(preprocessed_data[:10]):
    print("{}:".format(i), item)

100%|██████████| 494/494 [00:01<00:00, 279.47it/s]

0: ['computer', 'science', 'study', 'computation', 'information']
1: ['computer', 'science', 'deal', 'theory', 'computation', 'algorithm', 'problem', 'design', 'computer', 'system', 'hardware', 'software', 'application']
2: ['computer', 'science', 'human', 'information', 'process', 'communication', 'control', 'perception', 'learning', 'intelligence', 'human', 'computing', 'system', 'machine']
3: ['question', 'computer', 'science', 'field', 'discipline']
4: ['complexity', 'theory', 'computer', 'graphic', 'geometry', 'world', 'application']
5: ['heart', 'computer', 'science']
6: ['programming', 'language', 'theory', 'approach', 'description', 'process', 'software', 'engineering', 'use', 'programming', 'language', 'system']
7: ['computer', 'architecture', 'computer', 'engineering', 'deal', 'construction', 'computer', 'component', 'computer', 'equipment']
8: ['computer', 'interaction', 'challenge', 'computer']
9: ['intelligence', 'goal', 'process', 'problem', 'decision', 'making', 'adaptat




## Final data preparation for gensim topic modeling

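1. In this step, we use functions from the `gensim` module to create `dictionary` and `corpus` in the format that the topic modeling function requires. Both are built from `data_words_trigrams`, which is produced in the **Join bigrams and trigrams** section.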
2. `dictionary`: This is our *set of words* after our preprocessing, which we will use for creating different probability distributions. This *set of words* is often referred to as the *vocabulary* in NLP terminology.
3. `corpus`: Words are typically converted into a numerical representation before using them in a model. To demonstrate this, the cell will print each word for the first item in your preprocessed data next to its numerical encoding. Why is the word **information** represented as **(3, 1)**?

In [40]:
print("Creating dictionary and corpus instances for gensim...", end='')

dictionary = corpora.Dictionary(data_words_trigrams)
corpus = [dictionary.doc2bow(x) for x in data_words_trigrams]

print("complete.\n")

print(dictionary, '\n')


for original, encoded in zip(data_words_trigrams[0], corpus[0]):
    print("Original: {}".format(original), "--->  Encoded: {}".format(encoded))

Creating dictionary and corpus instances for gensim...complete.

Dictionary(1310 unique tokens: ['computation', 'computer_science', 'information', 'study', 'algorithm']...) 

Original: computer_science --->  Encoded: (0, 1)
Original: study --->  Encoded: (1, 1)
Original: computation --->  Encoded: (2, 1)
Original: information --->  Encoded: (3, 1)
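
To make the encoding concrete, here is a minimal pure-Python sketch of a bag-of-words representation. Each pair is `(token_id, count)`, so `(3, 1)` means "the word with id 3 appears once in this line". The helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and assigning ids in first-seen order is an assumption of this sketch; gensim may assign ids differently.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for tokens in token_lists:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(tokens, token2id):
    """Encode one token list as (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())

docs = [["computer", "science", "study"], ["computer", "system", "computer"]]
token2id = build_vocabulary(docs)
id2token = {token_id: token for token, token_id in token2id.items()}

# Decode each pair by looking up its id, not by its position in the line.
for token_id, count in to_bag_of_words(docs[1], token2id):
    print(id2token[token_id], "->", (token_id, count))
```

In gensim, `doc2bow` returns its pairs ordered by token id, so looking up an id in `dictionary` (for example `dictionary[3]`) is the reliable way to recover the word behind a pair.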


# Create Topic Model


1. Run the cell below to create a topic model for our data. The topic model will be saved in the variable `lda_model`. For now, do not worry about the parameters; we will experiment with those later.

In [43]:
'''
The following lines are used to setup the model's parameters. 
'''
UPDATE_EVERY = 10
CHUNKSIZE = 10
PASSES = 10
topic_model_settings = [{'num-topics':15, 'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':5,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':10,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}]
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']


'''
Here, we plug in our parameters and data into models.LdaModel to train the topic model. The topic model will be saved in the variable "lda_model".
'''
print("Training topic model of {} topics (this might take a moment)...".format(NUM_TOPICS), end='')
lda_model = models.LdaModel(
    corpus=corpus,
    num_topics=NUM_TOPICS,
    id2word=dictionary,
    # unpacks random_state, update_every, chunksize, passes, alpha, per_word_topics
    **setting['parameters'],
)
print("complete.")


Training topic model of 15 topics (this might take a moment)...complete.


# Qualitative Observations

We have created a topic model and saved it in `lda_model`. But what does this mean? Let's learn more as we perform some qualitative observations. Later, we will discuss two quantitative metrics, but in practice topic models are often chosen by manual inspection.

## Topic Terms
1. Run the cell below to observe the topic terms.  
2. By creating the topic model, we have created, for each of the 15 topics, a probability distribution over our set of words (`dictionary`, or the vocabulary). We have written a helper function called `show_topic_terms` that extracts the ten words with the highest probability in each topic.
3. The columns of `topic_terms` are the top ten words of each topic (Wn) and each word's probability within that topic (Wn Pr). Each row is a topic.
4. What do you think of these topics? Can you see clear divisions between the 15 topics, or could you give a name to each topic? How could you modify the `preprocess_line` function to get a clearer division between topics?
5. Before you make any changes or proceed, make notes of your observations in the slide **Topic Model Qualitative Observations**. You can include a screenshot of part of this table or paste the table values into the slides.

In [44]:
topic_terms = helpers.show_topic_terms(lda_model, NUM_TOPICS)
topic_terms

| Topic | W1 | W1 Pr | W2 | W2 Pr | W3 | W3 Pr | W4 | W4 Pr | W5 | W5 Pr | W6 | W6 Pr | W7 | W7 Pr | W8 | W8 Pr | W9 | W9 Pr | W10 | W10 Pr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | object | 0.045 | computer | 0.043 | network | 0.027 | student | 0.024 | cryptography | 0.022 | system | 0.02 | database | 0.019 | procedure | 0.017 | feature | 0.015 | computation | 0.014 |
| 1 | form | 0.103 | verb | 0.067 | noun | 0.039 | case | 0.036 | flour | 0.032 | class | 0.032 | adjective | 0.029 | declension | 0.029 | sugar | 0.025 | singular | 0.024 |
| 2 | vowel | 0.095 | romance | 0.077 | nasal | 0.034 | paradigms | 0.029 | work | 0.028 | distinction | 0.021 | consonant | 0.02 | e | 0.019 | diphthong | 0.014 | style | 0.011 |
| 3 | computer | 0.096 | system | 0.04 | computer_science | 0.039 | programming | 0.03 | paradigm | 0.029 | engineering | 0.029 | computation | 0.021 | study | 0.019 | software_engineering | 0.016 | order | 0.014 |
| 4 | difference | 0.04 | group | 0.026 | note | 0.022 | system | 0.02 | devoicing | 0.015 | country | 0.014 | fricative | 0.012 | affricate | 0.012 | resource | 0.012 | domain | 0.012 |
| 5 | century | 0.125 | period | 0.034 | way | 0.029 | literature | 0.026 | diphthong | 0.025 | part | 0.024 | pronunciation | 0.023 | chanson | 0.021 | spelling | 0.02 | variety | 0.018 |
| 6 | language | 0.091 | computer_science | 0.048 | langue | 0.035 | dialect | 0.03 | time | 0.027 | area | 0.026 | type | 0.023 | discipline | 0.02 | datum | 0.019 | field | 0.018 |
| 7 | purpose | 0.052 | texture | 0.027 | substitute | 0.017 | field | 0.01 | cornstarch | 0.009 | tablespoon | 0.009 | cup | 0.009 | percentage | 0.009 | transmission | 0.007 | method | 0.007 |
| 8 | alternation | 0.076 | development | 0.073 | loss | 0.04 | je | 0.036 | subjunctive | 0.021 | li | 0.017 | cluster | 0.017 | desjun | 0.015 | lave | 0.015 | leve | 0.015 |
| 9 | number | 0.045 | method | 0.035 | system | 0.02 | journal | 0.019 | man | 0.018 | computing | 0.018 | day | 0.016 | computer_science | 0.016 | paper | 0.015 | conference | 0.015 |
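
As a rough illustration of what "top words of a topic" means, the sketch below sorts a word-to-probability mapping and keeps the highest entries. The function name `top_terms` is hypothetical and is not the staff helper `show_topic_terms`; the four probabilities are copied from the Topic 1 row above.

```python
def top_terms(word_probs, n=10):
    """Return the n (word, probability) pairs with the highest probability."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# A few entries from Topic 1 in the table above, in arbitrary order.
topic_1 = {"noun": 0.039, "form": 0.103, "case": 0.036, "verb": 0.067}

print(top_terms(topic_1, n=3))
```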


## Topic Model Visualization

1. Run the cell below to generate an interactive visual to explore the topic model.
2. Add observations to the **Topic Model Qualitative Observations** slide.

In [45]:
helpers.show_model(lda_model, corpus, dictionary)

Generating visual, this will take a few moments...


# Quantitative Observations

1. We will compute *perplexity* and *coherence,* which are two quantitative metrics for evaluating topic models. Better scores do not necessarily mean better models in topic modeling. 
2. Add the scores to the **Topic Model 1 Quantitative Observations** slide.


## Perplexity

"In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample" ([Wikipedia](https://en.wikipedia.org/wiki/Perplexity)).

Intuition: the dictionary definition of perplexed is "Completely baffled; very puzzled" ([Dictionary Reference](https://www.lexico.com/en/definition/perplexed)). The more perplexed you are, the more puzzled you are. Use this to remember the intuition behind the *perplexity* metric. We aim for a model that is less perplexed by new data, so a smaller perplexity is usually the goal.


In [47]:
'''
Measure quality of topic models with perplexity
'''
print("Measuring model perplexity...",end="")
ppl = lda_model.log_perplexity(corpus)
print('complete. Perplexity:', ppl, '\n')  # log_perplexity returns a per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better.

Measuring model perplexity...complete. Perplexity: -9.470029050649728 
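
The printed value is negative because, according to the gensim documentation, `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the perplexity is `2 ** (-bound)`. The sketch below applies that conversion to the value printed above. The function name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base-2 log) into a perplexity."""
    return 2 ** (-per_word_bound)

first_model_bound = -9.470029050649728  # value printed by the cell above

print("Perplexity:", round(bound_to_perplexity(first_model_bound), 1))
```

A bound closer to zero therefore corresponds to a lower perplexity, which is the direction we usually want.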



## Coherence

The coherence metric captures the "degree of semantic similarity between high scoring words in the topic" ([Towards Data Science](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)).

The idea behind the *coherence* metric aligns with your experience observing your topic terms. Did you observe a very clear theme in each topic, or were there many overlapping themes between the topics? Clear themes correspond to higher coherence scores, whereas overlapping themes correspond to lower coherence scores.

In [48]:
'''
Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = data_words_trigrams[-10000:], corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n') 

Measuring model coherence...complete. Coherence Score: 0.42054033797171636 
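
To build intuition for how a coherence score rewards words that appear together, here is a minimal sketch of a simpler, UMass-style document co-occurrence score. This is not the `c_v` measure used in the cell above, which is more involved; the function name `umass_style_coherence`, the averaging over word pairs, and the toy documents are assumptions made for illustration.

```python
import math

def umass_style_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    total, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            freq_j = doc_freq(top_words[j])
            if freq_j == 0:
                continue  # skip words that never occur in the documents
            total += math.log((doc_freq(top_words[i], top_words[j]) + 1) / freq_j)
            pairs += 1
    return total / pairs if pairs else 0.0

toy_docs = [
    ["cake", "flour", "sugar"],
    ["cake", "sugar"],
    ["computer", "system"],
    ["computer", "network"],
]

print("clear theme:", umass_style_coherence(["cake", "sugar"], toy_docs))
print("mixed theme:", umass_style_coherence(["cake", "computer"], toy_docs))
```

Words that share documents ("cake", "sugar") score higher than words that never co-occur ("cake", "computer"), which mirrors the clear-versus-overlapping distinction described above.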



# Preprocessing Experiment 2

## Write the function `preprocess_line`.
1. At this point, *preprocessing* is the main piece to experiment with, so that you can observe the effect of your implementation decisions on topic modeling. Using at least one new thing that you learned in 1-Intro-to-NLP, write the function `preprocess_line`, which takes a string of text in the input variable `line` and returns a list of tokens called `preprocessed_line`.
2. At the bottom of the cell, we call `preprocess_line` on `test_string` so you can quickly observe how your function is working. `test_string` is initially set to be the first string from `data`, but you can change `test_string` to another string from `data` or your own string to get a fuller idea of the effects of your function.
3. Take a second to record your decisions in the slide titled **Preprocessing Experiment 2**.


In [None]:
def preprocess_line(line):
    '''
    Fill in this function. Refer to 1-Intro-to-NLP for preprocessing ideas.
    '''
    preprocessed_line = []
    tokens = word_tokenize(line)
    
    # use spacy pipeline
    doc = nlp(" ".join(tokens))
    
    allowed_postags=['NOUN']
    
    # get pos_tags
    pos_tags = [token.pos_ for token in doc]
    
    # get_lemmas, also remove words that aren't in allowed pos tags, also remove stopwords
    lemmas = [token.lemma_ for token in doc if token.pos_ in allowed_postags and not token.is_stop]
    
    preprocessed_line = lemmas
    
    return preprocessed_line

test_string = data[0]
print(preprocess_line(test_string))

# Learn about Topic Model Parameters

In [None]:
UPDATE_EVERY = 10
CHUNKSIZE = 10
PASSES = 10
topic_model_settings = [{'num-topics':15, 'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':5,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}, 
                        {'num-topics':10,'parameters':{'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE, 'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}}]
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']

**We've already come up with a few different parameter settings; the main difference is the number of topics we are targeting. We start from the first setting, which specifies 15 topics, and the cell below then overrides `NUM_TOPICS` to 3.**

In [14]:
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']
NUM_TOPICS = 3

## 2. Create a topic model

**Now we create a topic model for our text using functions from gensim! This will take a few minutes, so take this time to review some content. You can check out [this resource](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) on topic modeling as well, which will show several of the steps we have already covered in a bit more detail, and give you a preview of what we will do next!**

In [15]:
print("Training topic model (this will take a moment)...", end='')
lda_model = models.LdaModel(
    corpus=corpus,
    num_topics=NUM_TOPICS,
    id2word=dictionary,
    # unpacks random_state, update_every, chunksize, passes, alpha, per_word_topics
    **setting['parameters'],
)
print("complete.")


Training topic model (this will take a moment)...complete.


## 3. Observe words associated with the topic

The columns of `topic_terms` are the top ten words of each topic (Wn) and each word's probability within that topic (Wn Pr). Each row is a topic.

In [16]:
topic_terms = helpers.show_topic_terms(lda_model, NUM_TOPICS)
topic_terms

| Topic | W1 | W1 Pr | W2 | W2 Pr | W3 | W3 Pr | W4 | W4 Pr | W5 | W5 Pr | W6 | W6 Pr | W7 | W7 Pr | W8 | W8 Pr | W9 | W9 Pr | W10 | W10 Pr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cake | 0.038 | computer | 0.017 | century | 0.014 | computer_science | 0.011 | information | 0.01 | programming | 0.009 | language | 0.009 | word | 0.009 | study | 0.009 | example | 0.008 |
| 1 | cake | 0.031 | computer | 0.02 | form | 0.014 | computer_science | 0.013 | verb | 0.013 | language | 0.009 | vowel | 0.008 | system | 0.008 | romance | 0.007 | alternation | 0.007 |
| 2 | language | 0.015 | noun | 0.014 | declension | 0.013 | class | 0.013 | adjective | 0.011 | form | 0.011 | history | 0.008 | case | 0.008 | term | 0.008 | computer | 0.008 |


## 4. Evaluate quantitatively

**We will use two measures to evaluate the model: perplexity and coherence**


In [17]:
'''
6. Measure quality of topic models with perplexity
'''
print("Measuring model perplexity...",end="")
ppl = lda_model.log_perplexity(corpus)
print('complete. Perplexity:', ppl, '\n')  # log_perplexity returns a per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better.

'''
7. Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = data_words_trigrams[-10000:], corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n') 

Measuring model perplexity...complete. Perplexity: -6.982638798070078 

Measuring model coherence...complete. Coherence Score: 0.524049569197438 



## 5. Assign text lines to a topic

In [18]:
from collections import defaultdict
import operator
'''
9. Save topic assignments for unique texts in a datastructure
'''
print("Making document topic assignments...")
text2distro = {}
for x in range(len(corpus)):
    topicdistribution = lda_model[corpus[x]]     # a list of tuples, e.g., [(8, 0.14625458), (10, 0.79183161)]
    topicarray = [0]*NUM_TOPICS

    for (topicid,topicprc) in topicdistribution:
        topicarray[topicid] = topicprc
    try:
        text2distro[' '.join(data_words_trigrams[-10000:][x])] = topicarray
    except IndexError:  # no line in data_words_trigrams at position x
        print('x:', x, "len(data_words_trigrams[-10000:]):", len(data_words_trigrams[-10000:]))

top_topics = defaultdict(lambda:0)
second_top_topics = defaultdict(lambda:0)
text2scores = defaultdict(lambda:0)

for text in text2distro:
    if len(text) > 1:
        distro = text2distro[text]
        idx2score = {i:score for i, score in enumerate(distro)}
        scores_sorted = sorted(idx2score.items(), key=operator.itemgetter(1), reverse=True)
        top_topic = scores_sorted[0][0]
        top_topics[top_topic] += 1
        second_top_topics[scores_sorted[1][0]] += 1
        text2scores[text] = scores_sorted

# print("Top topic distribution:\n", top_topics)
# print("\n\nSecond top topic distribution:\n", second_top_topics)

print("Topic #", '\t', "% docs assigned")
total = sum(top_topics.values())
for i in range(NUM_TOPICS):
    print(i, '\t', "{:.2%}".format(top_topics[i]/total))
print('complete.\n')

Making document topic assignments...
Topic # 	 % docs assigned
0 	 33.19%
1 	 49.12%
2 	 17.70%
complete.
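
The core of the assignment step can be sketched in a few lines: each document receives a list of `(topic_id, probability)` pairs, and the document is assigned to the topic with the highest probability. The function name `dominant_topic` is hypothetical; the example pairs are the ones shown in the code comment in the cell above.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

example_distribution = [(8, 0.14625458), (10, 0.79183161)]

topic_id, probability = dominant_topic(example_distribution)
print("Assigned topic:", topic_id, "with probability", probability)
```

Counting how often each topic id wins across all documents gives the "% docs assigned" column printed above.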



## 6. Visualize the topics

**It will take a few moments to load up the visual**

In [19]:
helpers.show_model(lda_model, corpus, dictionary)

Generating visual, this will take a few moments...




## Join bigrams and trigrams
1. Next we will train a bigram model using functions that the gensim library provides.
2. Then we apply our bigram model to our data to join unigrams into bigrams where appropriate. To understand what the changes look like, the `show_ngrams` function will output a few examples of lines that were changed by this process.

In [8]:
bigram_model, bigram_phrases = helpers.train_bigram_model(preprocessed_data)
data_words_bigrams = helpers.make_bigrams(preprocessed_data, bigram_model)

helpers.show_ngrams(preprocessed_data, data_words_bigrams)

3. Now we will train a trigram model and apply it to our data to join unigrams and bigrams where appropriate.
4. To understand what the changes look like, the `show_ngrams` function will output a few examples of lines that were changed by this process.

In [30]:
trigram_model = helpers.train_trigram_model(preprocessed_data, bigram_phrases)

data_words_trigrams = helpers.make_trigrams(data_words_bigrams, bigram_model, trigram_model)
helpers.show_ngrams(data_words_bigrams, data_words_trigrams)

100%|██████████| 494/494 [00:00<00:00, 20309.81it/s]


Unnamed: 0,Before,After
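
To show the idea behind joining unigrams into bigrams, here is a simplified pure-Python sketch that merges adjacent word pairs appearing at least `min_count` times across the data. It is not how the staff helpers or gensim decide which pairs to join (gensim scores candidate phrases against a threshold); the function name `join_frequent_bigrams` and the `min_count` rule are assumptions of this sketch. The three input lines are taken from the preprocessed preview earlier in the notebook.

```python
from collections import Counter

def join_frequent_bigrams(token_lists, min_count=2):
    """Join adjacent token pairs that occur at least min_count times overall."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))

    joined_lists = []
    for tokens in token_lists:
        joined, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                joined.append("_".join(pair))  # e.g. computer_science
                i += 2
            else:
                joined.append(tokens[i])
                i += 1
        joined_lists.append(joined)
    return joined_lists

lines = [
    ["computer", "science", "study", "computation", "information"],
    ["heart", "computer", "science"],
    ["question", "computer", "science", "field", "discipline"],
]

for before, after in zip(lines, join_frequent_bigrams(lines)):
    print(before, "->", after)
```

Applying the same procedure a second time to the joined output is one way to form trigrams such as a bigram followed by a third word.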


# References

1. https://dl.acm.org/doi/fullHtml/10.1145/2133806.2133826
2. https://nlpforhackers.io/topic-modeling/
3. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/  -- for more analyses of topic models