# Setup

Here we import functions written by staff to help with the project. You will not need to worry about these, but if you are curious to see what they look like, check out the file `tm_helpers.py`. If you want to learn more about how these functions work or have questions, let us know! :)

In [None]:
from tm_helpers import *

# general
from tqdm import tqdm
import os
import regex as re

# preprocess functions
from nltk.tokenize import word_tokenize
import spacy
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


# topic modeling packages
import gensim
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel

# used to visualize the topic model
import pyLDAvis.gensim
import pyLDAvis

In [None]:
show_model()

# Data

In this section you will:

1. Load bookcorpus
2. Preprocess the text and clean up stopwords
3. Join bigrams and trigrams
4. Final data preparation for gensim topic modeling

## 1. Load data

In [None]:
data = load_data()

**Let's preview what the data looks like.**

In [None]:
print("Data length:", len(data))
print("data[:10]", data[:10]) # data is a list, data[:10] is the first ten items of that list

## 2. Preprocessing

In [None]:
def preprocess_line(line):
    '''
    Tokenize a line, then keep the lemmas of the non-stopword tokens whose
    part of speech is allowed. Refer to 1-Intro-to-NLP for preprocessing ideas.
    '''
    tokens = word_tokenize(line)

    # run the spaCy pipeline on the re-joined tokens
    doc = nlp(" ".join(tokens))

    # parts of speech to keep; swap in the longer list to keep more word types
    # allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    allowed_postags = ['NOUN']

    # get lemmas, dropping stopwords and words whose POS tag is not allowed
    preprocessed_line = [token.lemma_ for token in doc if token.pos_ in allowed_postags and not token.is_stop]

    return preprocessed_line

def preprocess(data):
    preprocessed_data = []
    for line in tqdm(data):
        preprocessed_line = preprocess_line(line)
        preprocessed_data.append(preprocessed_line)
    return preprocessed_data

In [None]:
'''
A stopword solution
'''

# STOPWORDS = load_stopwords()
# print("Cleaning data")
# tokenized_data = []
# for text in tqdm(data):
#     cleaned = clean_text(text, STOPWORDS)
#     tokenized_data.append(cleaned)
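
The commented-out cell above relies on `load_stopwords` and `clean_text` from `tm_helpers.py`, which are not shown in this notebook. As a rough sketch of what stopword removal does, here is a toy version. The stopword set, the regex tokenizer, and the name `remove_stopwords` are all hypothetical and are not the staff implementation.

```python
import re

# Hypothetical mini stopword list; the real one would come from load_stopwords().
TOY_STOPWORDS = {"the", "a", "of", "and", "was", "in"}

def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]

cleaned_example = remove_stopwords("The captain of the ship was in a storm", TOY_STOPWORDS)
print(cleaned_example)
```

Only the content words survive, which is the kind of input a topic model can work with.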

**Call `preprocess` on `data`, save the result to `preprocessed_data`, and then preview `preprocessed_data`. How does it look different from our earlier preview? Do you have a bug, or does it look how you want it to look?**

In [None]:
preprocessed_data = preprocess(data)
print("preprocessed_data[:10]", preprocessed_data[:10]) # preprocessed_data is a list of token lists, [:10] is the first ten items of that list

## 3. Join bigrams and trigrams
**Next we will train a bigram model using functions that others (gensim) have already implemented for us!**

TODO: I think I will have a "basic nlp" preprocessing project, where they'll learn what bigrams and trigrams are. They would do this before the topic modeling lesson, so we wouldn't need to explain what bigrams and trigrams are.

TODO: Add a link for reference to the gensim bigram model

In [None]:
bigram_model, bigram_phrases = train_bigram_model(preprocessed_data)

**Now we apply our bigram model to our data to join unigrams into bigrams where appropriate. To understand what the changes look like, the `show_ngrams` function will output a few examples of lines that were changed by this process.**

In [None]:
data_words_bigrams = make_bigrams(preprocessed_data, bigram_model)

show_ngrams(preprocessed_data, data_words_bigrams)
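
To get a feel for what "joining" means, here is a deliberately simplified sketch. It is not how gensim's phrase model works internally (gensim scores candidate pairs statistically and applies a threshold); this toy version just merges adjacent pairs that occur at least `min_count` times. The function names and the toy documents are made up for illustration.

```python
from collections import Counter

def train_toy_bigrams(docs, min_count=2):
    """Return the set of adjacent word pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}

def join_toy_bigrams(doc, frequent_pairs):
    """Scan left to right, merging each frequent pair into one `a_b` token."""
    joined, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent_pairs:
            joined.append(doc[i] + "_" + doc[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            joined.append(doc[i])
            i += 1
    return joined

toy_docs = [
    ["ice", "cream", "shop"],
    ["ice", "cream", "cone"],
    ["cream", "cheese"],
]
toy_pairs = train_toy_bigrams(toy_docs)
toy_joined = [join_toy_bigrams(doc, toy_pairs) for doc in toy_docs]
print(toy_joined)
```

Only "ice cream" appears often enough to be merged, so "cream cheese" stays as two separate tokens.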

**Now we will train a trigram model!**

TODO: Add a link for reference

In [None]:
trigram_model = train_trigram_model(preprocessed_data, bigram_phrases)

**Now we apply our trigram model to our data to join bigrams into trigrams where appropriate. To understand what the changes look like, the `show_ngrams` function will output a few examples of lines that were changed by this process.**

In [None]:
data_words_trigrams = make_trigrams(data_words_bigrams, bigram_model, trigram_model)
show_ngrams(data_words_bigrams, data_words_trigrams)

## 4. Final data preparation for gensim topic modeling

Todo: for now, I'm just taking a chunk of the data to speed up my developing process. I can go through the data and choose a couple appropriate books that the students can actually use.

In [None]:
print("Creating dictionary and corpus instances for gensim...", end='')

dictionary = corpora.Dictionary(data_words_trigrams[-10000:])
corpus = [dictionary.doc2bow(x) for x in data_words_trigrams[-10000:]]

print("complete.")
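
If you are curious what a dictionary and a bag-of-words corpus contain, the sketch below mimics the idea in plain Python. It is only an illustration: the helper names are hypothetical, and the ids gensim assigns to words may differ from the first-appearance order used here.

```python
from collections import Counter

def build_toy_dictionary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in `doc`."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_docs = [["ship", "storm", "ship"], ["storm", "captain"]]
toy_token2id = build_toy_dictionary(toy_docs)
toy_corpus = [toy_doc2bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Each document becomes a list of (word id, count) pairs, so word order is discarded and only word frequencies remain.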

# Create Topic Model

In this section, you will:

1. Learn about topic model parameters
2. Create a topic model
3. Observe words associated with the topics
4. Evaluate quantitatively
5. Assign text lines to a topic
6. Visualize the topics

In [None]:
UPDATE_EVERY = 10
CHUNKSIZE = 100
# CHUNKSIZE = 10 soln

PASSES = 10
BASE_PARAMETERS = {'random_state':100, 'update_every':UPDATE_EVERY, 'chunksize':CHUNKSIZE,
                   'passes':PASSES, 'alpha':'auto', 'per_word_topics':False}
topic_model_settings = [{'num-topics':n, 'parameters':dict(BASE_PARAMETERS)} for n in (15, 5, 10)]


## 1. Learn about topic model parameters

Todo: add some things to show what the purpose of the main parameters are

**We've already come up with a few different parameter settings, the main difference being the number of topics we are targeting. Let's start with the first setting, which will create 15 topics.**

In [None]:
setting = topic_model_settings[0]
NUM_TOPICS = setting['num-topics']
# NUM_TOPICS = 3  # uncomment to override the setting with fewer topics for a faster development run
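
To see how `chunksize`, `update_every`, and `passes` interact, here is a back-of-the-envelope sketch. It assumes (our reading of gensim's online training, not something stated in this notebook) that the model is updated once every `update_every` chunks of `chunksize` documents, with one final update for any leftover documents. The function name is hypothetical.

```python
import math

def updates_per_pass(num_docs, chunksize, update_every):
    """Estimate model updates per pass under the assumption stated above."""
    docs_per_update = chunksize * update_every
    return math.ceil(num_docs / docs_per_update)

# Values matching the settings cell: 10,000 lines, CHUNKSIZE = 100, UPDATE_EVERY = 10.
toy_updates = updates_per_pass(num_docs=10000, chunksize=100, update_every=10)
toy_total_updates = toy_updates * 10  # PASSES = 10
print(toy_updates, toy_total_updates)
```

Under that assumption, a smaller `chunksize` means more frequent updates on smaller batches of text, which is one reason the results can change when you vary it.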

## 2. Create a topic model

**Now we create a topic model for our text using functions from gensim! This will take a few minutes, so take this time to review some content. You can check out [this resource](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) on topic modeling as well, which will show several of the steps we have already covered in a bit more detail, and give you a preview of what we will do next!**

In [None]:
print("Training topic model (this will take a moment)...", end='')
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, **setting['parameters'])
print("complete.")


## 3. Observe words associated with the topics

The columns of `topic_terms` are the top ten words of each topic (Wn) and each word's probability within that topic (Wn Pr).

In [None]:
topic_terms = show_topic_terms(lda_model, NUM_TOPICS)
topic_terms
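
`show_topic_terms` is a staff helper, so its code is not shown here. As a sketch of how such a table could be built, the example below takes (word, probability) pairs per topic, which is the shape gensim's `LdaModel.show_topic` returns, and flattens them into Wn / Wn Pr columns. The function name and the toy topics are hypothetical.

```python
def toy_topic_rows(topics, top_n=3):
    """Flatten per-topic (word, probability) pairs into W1, W1 Pr, W2, ... rows."""
    rows = []
    for topic_id, word_probs in enumerate(topics):
        row = {"Topic": topic_id}
        for rank, (word, prob) in enumerate(word_probs[:top_n], start=1):
            row[f"W{rank}"] = word
            row[f"W{rank} Pr"] = round(prob, 3)
        rows.append(row)
    return rows

# Made-up word lists for two topics, already sorted by probability.
toy_topics = [
    [("ship", 0.12), ("storm", 0.08), ("captain", 0.05)],
    [("garden", 0.10), ("flower", 0.07), ("rain", 0.04)],
]
toy_rows = toy_topic_rows(toy_topics, top_n=2)
for row in toy_rows:
    print(row)
```

Reading across a row gives you the words that best characterize one topic, which is what you will use later if you decide to name the topics.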

## 4. Evaluate quantitatively

**We will use two measures to evaluate the model: perplexity and coherence.**

Todo: I'll expand on this a little bit to explain what these metrics mean

In [None]:
'''
Measure quality of topic models with perplexity
'''
print("Measuring model perplexity...",end="")
ppl = lda_model.log_perplexity(corpus)  # gensim returns a per-word likelihood bound (a log value, usually negative)
print('complete. Log perplexity (per-word bound):', ppl, '\n')  # closer to zero is better; perplexity itself is 2 ** (-bound), where lower is better

'''
Measure quality of topic models with coherence
'''
print("Measuring model coherence...",end="")
coherence_model_lda = CoherenceModel(model=lda_model, texts = data_words_trigrams[-10000:], corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('complete. Coherence Score:', coherence_lda, '\n')  # higher is better
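
A note on reading the first number: as we understand it, `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and gensim derives its perplexity estimate as 2 raised to the negative of that bound. The sketch below shows the conversion with two made-up bound values; the function name is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

toy_bound_a = -8.0  # hypothetical bound for one model
toy_bound_b = -9.0  # hypothetical bound for another model
toy_ppl_a = perplexity_from_bound(toy_bound_a)
toy_ppl_b = perplexity_from_bound(toy_bound_b)
print(toy_ppl_a, toy_ppl_b)
```

The model with the bound closer to zero ends up with the lower perplexity, so when comparing settings, a less negative bound is the better result.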

## 5. Assign text lines to a topic

In [None]:
from collections import defaultdict
import operator
'''
Save topic assignments for unique texts in a data structure
'''
print("Making document topic assignments...")
texts = data_words_trigrams[-10000:]  # the same slice that was used to build the corpus
text2distro = {}
for x in range(len(corpus)):
    topicdistribution = lda_model[corpus[x]]     # a list of tuples, e.g., [(8, 0.14625458), (10, 0.79183161)]
    topicarray = [0]*NUM_TOPICS

    for (topicid,topicprc) in topicdistribution:
        topicarray[topicid] = topicprc
    try:
        text2distro[' '.join(texts[x])] = topicarray
    except IndexError:
        print('x:', x, "len(texts):", len(texts))

top_topics = defaultdict(lambda:0)
second_top_topics = defaultdict(lambda:0)
text2scores = defaultdict(lambda:0)

for text in text2distro:
    if len(text) > 1:
        distro = text2distro[text]
        idx2score = {i:score for i, score in enumerate(distro)}
        scores_sorted = sorted(idx2score.items(), key=operator.itemgetter(1), reverse=True)
        top_topic = scores_sorted[0][0]
        top_topics[top_topic] += 1
        second_top_topics[scores_sorted[1][0]] += 1
        text2scores[text] = scores_sorted

# print("Top topic distribution:\n", top_topics)
# print("\n\nSecond top topic distribution:\n", second_top_topics)

print("Topic #", '\t', "% docs assigned")
total = sum(top_topics.values())
for i in range(NUM_TOPICS):
    print(i, '\t', "{:.2%}".format(top_topics[i]/total))
print('complete.\n')
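
The cell above does a lot at once, so here is the core idea on a single document. This is a minimal sketch using the example values from the loop's comment and assuming a 15-topic model; the helper names are hypothetical.

```python
def dense_distribution(sparse_distro, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_distro:
        dense[topic_id] = prob
    return dense

def rank_topics(dense):
    """Return topic ids ordered from most to least probable."""
    return sorted(range(len(dense)), key=lambda i: dense[i], reverse=True)

toy_sparse = [(8, 0.14625458), (10, 0.79183161)]  # one document's topic distribution
toy_dense = dense_distribution(toy_sparse, num_topics=15)
toy_ranking = rank_topics(toy_dense)
print("top topic:", toy_ranking[0], "second:", toy_ranking[1])
```

Topics the model leaves out of the sparse list are treated as probability 0, and the document is assigned to whichever topic ranks first.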

## 6. Visualize the topics

**It will take a few moments to load up the visual.**

In [None]:
# lda_model = models.LdaModel.load(model_names[0])
# Visualize the topics

print("Generating visual...this will take a few moments.")
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
LDAvis_prepared

# Name Topics

We might not do this, but I have the code...it depends on if we want to do any analysis with this topic later.

In [None]:
keep_threshold = .01 # minimum share of documents (.01 = 1%) that must be assigned to a topic for it to be considered for naming
KEPT_TOPICS = [i for i in range(NUM_TOPICS) if top_topics[i]/total >= keep_threshold]

print("Kept topics, each with at least {:.2%} of the docs assigned to it:".format(keep_threshold))
print(KEPT_TOPICS)
KEPT_TOPIC_NAMES = {}

In [None]:
name_topics(KEPT_TOPICS, KEPT_TOPIC_NAMES)

# References

In [None]:
#https://nlpforhackers.io/topic-modeling/
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/  -- for more analyses of topic models