<a href="https://colab.research.google.com/github/ryanleeallred/DS-Unit-4-Sprint-1-NLP/blob/main/module4-topic-modeling/LS_DS_414_Topic_Modeling_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling (Prepare)

On Monday, we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach - learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we can tell they're there when we read the documents.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
**Primary use case**: What are your documents about? Who in industry might want to know that?
* Identifying common themes in customer reviews
* Grouping job ads into categories 
* Monitoring communications (Email - State Department, Google) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* Part 1: Describe how an LDA Model works
* Part 2: Build an LDA Model with Gensim
* Part 3: Interpret LDA results & Select the appropriate number of topics

In [None]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

from sklearn.datasets import fetch_20newsgroups
from pandarallel import pandarallel


import spacy
spacy.util.fix_random_seed(0)

import pyLDAvis
import pyLDAvis.gensim  # note: this module is named pyLDAvis.gensim_models in pyLDAvis 3.x
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

# Part 1: Describe How an LDA Model Works

We will focus on a high level of understanding of how LDA works, meaning we will focus on "what it does" instead of "how it does it." I realize that this may be unsatisfying, so I've included some resources that serve as a prerequisite for understanding how LDA works at a mathematical level. 

LDA is a [**Probabilistic Graphical Model (PGM)**](https://en.wikipedia.org/wiki/Graphical_model)

A PGM is represented by a graph that expresses the conditional dependence structure between random variables. Here's the dependency graph for LDA: 

![](https://filebox.ece.vt.edu/~s14ece6504/projects/alfadda_topic/main_figure_3.png)

This image communicates the hierarchical dependency between probability distributions and their parameters. This is an application of Bayesian Probability - on steroids. 
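To make the dependency graph a little more concrete, here is a minimal sketch of the generative story that LDA assumes: each topic is a distribution over words, each document is a distribution over topics, and every word is produced by first drawing a topic and then drawing a word from that topic. The toy vocabulary, the hyperparameter values `alpha` and `beta`, and the helper names are assumptions made for illustration; none of this is Gensim code, and only the standard library is used.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topic_word_dists, vocab, alpha, doc_length):
    """Generate one toy document following the LDA generative story."""
    num_topics = len(topic_word_dists)
    # theta: this document's mixture over topics
    theta = sample_dirichlet(rng, alpha, num_topics)
    words = []
    for _ in range(doc_length):
        # z: pick a topic for this word position
        z = rng.choices(range(num_topics), weights=theta)[0]
        # w: pick a word from the chosen topic's word distribution
        w = rng.choices(vocab, weights=topic_word_dists[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["orbit", "launch", "circuit", "voltage", "rifle", "law"]  # toy vocabulary (assumed)
num_topics, alpha, beta = 3, 0.5, 0.5  # illustrative hyperparameters (assumed)

# phi: one word distribution per topic
topic_word_dists = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(num_topics)]
theta, words = generate_document(rng, topic_word_dists, vocab, alpha, doc_length=8)
print(theta)
print(words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.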


To understand how LDA works, one must first understand how PGMs work. If this is something that you're interested in learning more about, here are some resources: 

This GitHub repo has transformed a textbook into a collection of Jupyter Notebooks. The repo is called [**"Bayesian Methods for Hackers"**](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers). The cool thing about it is that each chapter covers the same material in several notebooks, each written with a different Python package: **PyMC2, PyMC3, Pyro, and TensorFlow Probability.** So you can learn a new library if you want or stick to what you know. 

[**Pyro**](https://pyro.ai/) is considered a very powerful probabilistic programming library that combines probabilistic programming with deep learning. 

### Resources for LDA

[**Your Guide to Latent Dirichlet Allocation**](https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d) Here's a Medium article that works through an example of LDA. This article is useful if you'd like to get exposure to LDA outside of this notebook.

[**LDA Topic Modeling**](https://lettier.com/projects/lda-topic-modeling/) This interactive data visualization tool allows us to explore a simple and visual example of LDA. We'll be using this to learn about LDA in class. 

[**Topic Modeling with Gensim**](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) This is an example of implementing LDA using the same dataset we are using in the guided project.  

# Problem Statement

We are going to load some emails. Those emails belong to topics; however, those topics are hierarchical.

    sci
        \_ electronics, space


    talk
        \_politics 
                  \_ guns, middle east
              
So what's the best way to categorize these emails - is it between science and talk? 

Is it between electronics, space, guns, and the Middle East? 

The Middle East is a pretty broad topic in and of itself; should that topic be broken down into further sub-topics?


Let's learn about Topic Modeling and how it can help us answer these questions!

### Load Email Corpus

In [None]:
# notice that the categories are hierarchical
# so there is a sense in which we have 2 topics but also as many as 4 topics
categories = ['sci.electronics', 'sci.space', 
              'talk.politics.guns', 'talk.politics.mideast']
data = fetch_20newsgroups(categories=categories)

In [None]:
# create X and Y from data

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# load data into a dataframe


data = {
    'content': X,
    'target': Y,
    'target_names': [target_names[target_id] for target_id in Y]
}

df = pd.DataFrame(data=data)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# clean our text data and save it to a new column
df["clean_data"] = df["content"].apply(clean_data)

In [None]:
df.head()

### Create Tokens 

We must first create tokens before using the Gensim library to create bag-of-words vectors in precisely the format that the LDA model expects. 

Let’s use spaCy to create some lemmas. But first, let’s initialize our multi-processing library pandarallel, which will empower us to use the same dataframe that our data is stored in but create tokens in parallel to save time.

Here's the documentation for [**pandarallel**](https://github.com/nalepae/pandarallel)

In [None]:
# we must initialize pandarallel before we can use it
pandarallel.initialize(progress_bar=True, nb_workers=10)

In [None]:
# load in our spaCy language model
nlp = spacy.load("en_core_web_lg")

In [None]:
%%time
# create our tokens in the form of lemmas 
df['lemmas'] = df['clean_data'].parallel_apply(lambda x: [token.lemma_ for token in nlp(x) if not token.is_stop and not token.is_punct])

### Take a Look at Our Lemmas

In [None]:
# print out the lemmas from the first article

### Filter Out Low-Quality Lemmas

In [None]:
def filter_lemmas(lemmas):
    """
    Filter out any lemmas that are 2 characters or shorter
    """
    return [lemma for lemma in lemmas if len(lemma) > 2]

In [None]:
# apply filter_lemmas

### The Two Main Inputs to the LDA Topic Model are the Dictionary (id2word) and the Corpus.
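As a rough illustration of what these two inputs contain, the sketch below builds them by hand in plain Python. It is not Gensim's implementation: `build_id2word` and `doc_to_bow` are hypothetical helpers, the toy documents are made up, and the ids Gensim assigns may differ. In Gensim, the corresponding pieces are `corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter

def build_id2word(tokenized_docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    # invert so that ids map back to tokens
    return {token_id: token for token, token_id in token2id.items()}

def doc_to_bow(doc, id2word_map):
    """Convert one tokenized document to sorted (token id, token count) pairs."""
    token2id = {token: token_id for token_id, token in id2word_map.items()}
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# toy documents (assumed), already tokenized
toy_docs = [
    ["space", "shuttle", "launch", "space"],
    ["circuit", "voltage", "circuit"],
]
toy_id2word = build_id2word(toy_docs)
toy_corpus = [doc_to_bow(doc, toy_id2word) for doc in toy_docs]

print(toy_id2word)
print(toy_corpus)
# human-readable view: swap each token id back for its token
print([[(toy_id2word[i], n) for i, n in bow] for bow in toy_corpus])
```

The dictionary answers "which word does this id stand for?", while the corpus stores each document as a compact list of ids and counts, dropping word order entirely.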

In [None]:
# Create Dictionary

# Term Document Frequency

# stores (token id, token count) for each doc in the corpus

# Human readable format of the corpus (term-frequency)


# YOUR CODE HERE
raise NotImplementedError()

# Part 2: Estimate an LDA Model with Gensim

### Train an LDA model

In [None]:
### This cell runs the single-processor version of the model (slower)
# %%time
# lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
#                                            id2word=id2word,
#                                            num_topics=20, 
#                                            chunksize=100,
#                                            passes=10,
#                                            per_word_topics=True)
# lda_model.save('lda_model.model')
# # https://radimrehurek.com/gensim/models/ldamodel.html

In [None]:
%%time

num_topics = 2
### This cell runs the multi-processor version of the model (faster)
lda_multicore_2_topics = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,# runtime related parameter
                                                        per_word_topics=True,
                                                        workers=10, # runtime related parameter
                                                        random_state=1234, 
                                                        iterations=20) # runtime related parameter

num_topics = 6
### This cell runs the multi-processor version of the model (faster)
lda_multicore_6_topics = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,# runtime related parameter
                                                        per_word_topics=True,
                                                        workers=10, # runtime related parameter
                                                        random_state=1234, 
                                                        iterations=20) # runtime related parameter

In [None]:
from gensim import models
#lda_multicore =  models.LdaModel.load('lda_multicore.model')

# Part 3: Interpret LDA Results & Select the Appropriate Number of Topics

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore_2_topics, corpus, id2word)
vis

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_multicore_6_topics, corpus, id2word)
vis

## What is Topic Coherence?
Topic coherence scores a single topic by **measuring the degree of semantic similarity between its high-scoring words**. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.


A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is **“the game is a team sport,”** **“the game is played with a ball,”** **“the game demands great physical efforts.”**
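To give a feel for how a coherence score can be computed, here is a simplified sketch of a UMass-style measure, which scores a topic by how often its top words appear together in the same documents. This is not the `c_v` measure used below (which is more involved) and it is not Gensim's implementation; the function name and the toy data are assumptions for illustration.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, tokenized_docs):
    """Average log co-occurrence score over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # number of documents that contain every given word
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # top_words is assumed to be ordered from most to least probable
    for higher, lower in combinations(top_words, 2):
        denom = doc_freq(higher)
        if denom == 0:
            continue  # skip words that never occur in the reference documents
        # +1 smoothing avoids log(0) when the pair never co-occurs
        scores.append(math.log((doc_freq(higher, lower) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# toy reference documents (assumed)
reference_docs = [
    ["space", "orbit", "launch"],
    ["space", "orbit", "nasa"],
    ["gun", "law", "rifle"],
    ["gun", "rifle", "space"],
]
coherent_score = umass_style_coherence(["space", "orbit"], reference_docs)
mixed_score = umass_style_coherence(["space", "law"], reference_docs)

print(coherent_score)
print(mixed_score)
```

Words that tend to show up in the same documents ("space", "orbit") earn a higher score than words that rarely do ("space", "law"), which matches the intuition that a coherent topic's words support each other.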


In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary: Gensim dictionary
    corpus: Gensim corpus
    texts: List of input texts
    limit: Max num of topics

    Returns:
    -------
    model_list: List of LDA topic models
    coherence_values: Coherence values corresponding to the LDA model with a respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=dictionary,
                                                        num_topics=num_topics, 
                                                        chunksize=100,
                                                        passes=10,
                                                        random_state=1234,
                                                        per_word_topics=True,
                                                        workers=10)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
%%time
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=df['lemmas'], start=2, limit=10, step=1)

In [None]:
start, limit, step = 2, 10, 1
x = range(start, limit, step)

plt.figure(figsize=(20,5))
plt.grid()
plt.title("Coherence Score vs. Number of Topics")
plt.xticks(x)
plt.plot(x, coherence_values, "-o")

plt.xlabel("Num Topics")
plt.ylabel("Coherence score")

plt.show()

### Index for Model 

Due to the probabilistic nature of this model, the results can and usually do vary from run to run. Despite this, we will select 8 as the number of topics, even if 8 doesn't have the highest coherence score in your run. Coherence is only one input: we also need to ask ourselves how many topics are useful for our corpus, even if that number doesn't have the top score.
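Because `model_list` is filled in the same order as `range(start, limit, step)`, a list index can be translated to a topic count (and back) using that same range. A small sketch, using the `start`, `limit`, and `step` values from the call above; `topic_counts` is a name introduced here for illustration.

```python
start, limit, step = 2, 10, 1

# one entry per trained model, in the order compute_coherence_values appends them
topic_counts = list(range(start, limit, step))

print(topic_counts)           # [2, 3, 4, 5, 6, 7, 8, 9]
print(topic_counts[-2])       # 8
print(topic_counts.index(8))  # 6
```

So with these settings, `model_list[-2]` and `model_list[6]` both refer to the 8-topic model.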

In [None]:
lda_trained_model = model_list[-2]

In [None]:
lda_trained_model

In [None]:
# visualize the topics of the selected model
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_trained_model, corpus, id2word)
vis

## Create a Topic ID/Name Dictionary 

When populating your topic id/name dictionary, use the index ordering as shown in the viz tool. 

We'll use a function to map the viz tool's index ordering to the trained LDA model's ordering. 
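The idea behind that mapping can be sketched with plain dictionaries. The orderings and topic names below are made up for illustration; in the notebook, the model-to-viz mapping comes from the pyLDAvis output instead.

```python
# key: topic id in the trained LDA model (starts at 0)
# value: topic id shown in the viz tool (starts at 1); hypothetical ordering
model_to_vis = {0: 3, 1: 1, 2: 2}

# names chosen while looking at the viz tool, keyed by viz topic id (hypothetical)
vis_names = {1: "space", 2: "guns", 3: "electronics"}

# re-key the names by the model's topic ids
model_names = {model_id: vis_names[vis_id] for model_id, vis_id in model_to_vis.items()}
print(model_names)  # {0: 'electronics', 1: 'space', 2: 'guns'}
```

After re-keying, a topic id returned by the model can be looked up directly, without having to remember how the viz tool ordered its bubbles.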

In [None]:
# keys - use topic ids from pyLDAvis visualization 
# values - topic names that you create 
# save dictionary to `vis_topic_name_dict`
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
def get_topic_id_lookup_dict(vis, vis_topic_name_dict):
    """
    The starting index and the ordering of topic ids between the trained LDA model and the viz tool are different. So we need to create a dictionary that maps the correct association between topic ids from both sources. 
    """
    # value is order of topic ids according to pyLDAvis tool 
    # key is order of topic ids according to lda model
    model_vis_tool_topic_id_lookup = vis.topic_coordinates.topics.to_dict()

    # invert dictionary so that 
    # key is order of topic ids according to pyLDAvis tool 
    # value is order of topic ids according to lda model
    topic_id_lookup =  {v:k for k, v in model_vis_tool_topic_id_lookup.items()}
    
    # iterate through topic_id_lookup and index vis_topic_name_dict using the keys 
    # in order to swap the viz topic ids in vis_topic_name_dict for the lda model topic ids 
    return {v:vis_topic_name_dict[k]  for k, v in topic_id_lookup.items()}

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
topic_name_dict

## Assign Each Document a Topic Name

Now that we have a topic id/name look-up dict aligned with the index ordering of the trained LDA model, we can give each document a topic name. 

The function below has been given to you. However, you are highly encouraged to read through it and make sure that you understand what it is doing each step of the way. An excellent way to do this is to copy and paste the code inside the function into a new cell, comment out all the lines of code and line by line, uncomment the code and see the output.
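The core step of that function, picking the most probable topic for one document, can be sketched in isolation. The `(topic id, probability)` pairs below are invented; in the notebook they come from the trained model's `get_document_topics`.

```python
# hypothetical output for one document: (topic id, probability) pairs
topic_id_prob_tuples = [(0, 0.05), (3, 0.72), (5, 0.23)]

# choose the pair with the largest probability, then keep its topic id
best_topic_id, best_prob = max(topic_id_prob_tuples, key=lambda pair: pair[1])
print(best_topic_id, best_prob)  # 3 0.72
```

Note that the list does not need one entry per topic. Topics with very small probabilities may be left out, which is why the topic id is read from the tuple instead of being taken from the position in the list.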

In [None]:
def get_topic_ids_for_docs(lda_model, corpus):
    
    """
    Passes a Bag-of-Words vector into a trained LDA model to get the topic id of that document. 
    
    Parameters
    ----------
    lda_model: Gensim object
        Must be a trained model 
        
    corpus: nested lists of tuples, 
        i.e. [[(),(), ..., ()], [(),(), ..., ()], ..., [(),(), ..., ()]]
        
    Returns
    -------
    doc_topic_ids: list
        Contains topic ids for all document vectors in corpus 
    """
    
    # store topic ids for each document
    doc_topic_ids = []

    # iterate through the bow vectors for each doc
    for doc_bow in corpus:
        
        # store the topic ids for the doc
        topic_ids = []
        # store the topic probabilities for the doc
        topic_probs = []

        # list of tuples
        # each tuple has a topic id and the prob that the doc belongs to that topic 
        topic_id_prob_tuples = lda_model.get_document_topics(doc_bow)
        
        # iterate through the topic id/prob pairs 
        for topic_id_prob in topic_id_prob_tuples:
            
            # index for topic id
            topic_id = topic_id_prob[0]
            # index for prob that doc belongs to the corresponding topic
            topic_prob = topic_id_prob[1]

            # store all topic ids for doc
            topic_ids.append(topic_id)
            # store all topic probs for doc
            topic_probs.append(topic_prob)

        # get index for largest prob score
        max_topic_prob_ind = np.argmax(topic_probs)
        # get corresponding topic id
        max_prob_topic_id = topic_ids[max_topic_prob_ind]
        # store topic id that had the highest prob for doc being a member of that topic
        doc_topic_ids.append(max_prob_topic_id)
        
    return doc_topic_ids

In [None]:
# get the topic id for each doc in the corpus 
topic_id_list = get_topic_ids_for_docs(lda_trained_model, corpus)

# create a feature for each document's topic id
df["topic_id"] = topic_id_list

# iterate through the topic id and use the lookup table to assign each document with a topic name
df["new_topic_name"] = df["topic_id"].apply(lambda topic_id: topic_name_dict[topic_id])

In [None]:
# cool! so now all of our documents have topic ids and names 
df.head()

In [None]:
# you can mask for all Space articles (assuming topic id 3 is the Space topic in your run)
science_mask = df.topic_id == 3
df[science_mask]

-----

## Where Do We Go From Here?

What exactly did we accomplish?

Outside of this guided project (i.e., in your job), you may or may not have access to existing article topic names like we did with this data set, which means that we won't always have a point of reference to "check our answers." So let's explore two possible situations in which you might find yourself using this Unsupervised Learning Topic Model. 

### 1. You have access to existing document topic labels

In this case, why would we bother with Topic Modeling? It could be the case that the current topic labels are not helpful for whatever task you're working on. For instance, our email dataset here has topic names; however, those topic labels are hierarchical, which doesn't suit your needs for some reason. So one option is to generate new labels that suit your needs (like we did here). 

### 2. Your corpus doesn't have any document topic labels

In this case, you don't have any pre-existing topic labels. Maybe you work at Indeed or LinkedIn or Google, and your job is to bring some structure to a huge collection of emails and messages that aren't labeled in any meaningful way, so it isn't easy to sort through these documents. This is a perfect use case for Topic Modeling. After you apply topic modeling, you’ll have organized your emails into broad categories. You can then start structuring and analyzing your corpus and maybe even build a supervised learning model to predict a document's topic!