<a href="https://colab.research.google.com/github/bs3537/DS-Unit-4-Sprint-1-NLP/blob/master/Bruno_completed_Topic_Modeling_Lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 4, Sprint 1, Module 4*

---

# Topic Modeling (Prepare)

On Monday we talked about summarizing your documents using just token counts. Today, we're going to learn about a much more sophisticated approach: learning 'topics' from documents. Topics are a latent structure. They are not directly observable in the data, but we can recognize them when we read the documents.

> **latent**: existing but not yet developed or manifest; hidden or concealed.

## Use Cases
Primary use case: what the hell are your documents about? Where does that question come up in industry?
* Identifying common themes in customer reviews
* Discovering the needle in a haystack 
* Monitoring communications (Email - State Department) 

## Learning Objectives
*At the end of the lesson you should be able to:*
* <a href="#p1">Part 1</a>: Describe how an LDA Model works
* <a href="#p2">Part 2</a>: Estimate an LDA Model with Gensim
* <a href="#p3">Part 3</a>: Interpret LDA results
* <a href="#p4">Part 4</a>: Select the appropriate number of topics


# Latent Dirichlet Allocation Models (Learn)
<a id="#p1"></a>

## Overview
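LDA is a "generative probabilistic model": it assumes each document is a mixture of topics, and each topic is a probability distribution over words. Fitting the model means inferring the topics most likely to have generated the documents we observe.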

Let's play with a model available [here](https://lettier.com/projects/lda-topic-modeling/)
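To make the "generative" idea concrete, here is a minimal sketch of the story LDA assumes for a single document, using only the standard library. The vocabulary, the two topics, and the `alpha` values are made up for illustration, and the topic-word distributions are fixed by hand here, whereas a real model would learn them from data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following LDA's generative story."""
    # 1) Pick this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2) Pick a topic for this word slot from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3) Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy vocabulary and two hand-written topics (not fitted to any data).
vocab = ["marriage", "ball", "estate", "school", "governess", "night"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "courtship" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "schooling" topic
]

rng = random.Random(42)
theta, words = generate_document(topic_word_probs, vocab,
                                 alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)  # the document's topic mixture (sums to 1)
print(words)  # the generated bag of words
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.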

## Follow Along

## Challenge 

# Estimating LDA Models with Gensim (Learn)
<a id="#p2"></a>

## Overview
### A Literary Introduction: *Jane Austen V. Charlotte Bronte*
Although they were born roughly forty years apart, Jane Austen and Charlotte Bronte are often pitted against one another by modern fans in a battle for literary supremacy. The battle centers on the topics of education for women, courting, and marriage. The authors' similar backgrounds naturally draw comparisons, but the modern fascination is probably due to the novelty of British women publishing novels during the early 19th century.

Can we help close this battle for literary supremacy by simply acknowledging that the authors addressed different topics and that each is an excellent author in her own right?

We're going to apply Latent Dirichlet Allocation, a machine learning algorithm for topic modeling, to each author's novels and compare the distributions of topics in them.

In [0]:
import numpy as np
import gensim
import os
import re

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

from gensim.models.ldamulticore import LdaMulticore

import pandas as pd



In [0]:
gensim.__version__

'3.4.0'

### Novel Data
I grabbed the novel data pre-split into a bunch of smaller chunks, one `.txt` file per chunk.

In [0]:
path = './data/austen-brontë-split'

### Text Preprocessing
**Activity**: update the function `tokenize` with any technique you have learned so far this week. 

In [0]:
# 1) Plain Python - ''.split command
# 2) Spacy - just the lemmas from the document
# 3) Gensim - simple_preprocess

STOPWORDS = set(STOPWORDS).union(set(['said', 'mr', 'mrs']))

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
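To see why the choice of tokenizer matters, here is a small standard-library sketch comparing option 1 (plain `split`) with a rough regex approximation of what `simple_preprocess` does. The stop list `TOY_STOPWORDS` and the sample sentence are made up for illustration, and the regex version is only an approximation, not gensim's actual implementation.

```python
import re

# Hypothetical stop list for illustration; the lesson uses gensim's STOPWORDS.
TOY_STOPWORDS = {"the", "a", "is", "and", "of", "said", "mr", "mrs"}

def split_tokenize(text):
    """Option 1: plain whitespace split. Punctuation and case are kept."""
    return text.split()

def regex_tokenize(text, min_len=2, max_len=15):
    """Lowercase alphabetic tokens of bounded length, minus stop words."""
    candidates = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in candidates
            if min_len <= len(t) <= max_len and t not in TOY_STOPWORDS]

sample = "Mr. Darcy said: 'The ball is at Netherfield, 8 o'clock!'"
print(split_tokenize(sample))
print(regex_tokenize(sample))
# → ['darcy', 'ball', 'at', 'netherfield', 'clock']
```

The plain split leaves tokens like `Mr.` and `Netherfield,` that would each become separate vocabulary entries, which is noise for a topic model.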

In [0]:
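import os

def gather_data(path_to_data):
    """Read every .txt file in path_to_data and return a list of token lists."""
    data = []
    for f in os.listdir(path_to_data):
        full_path = os.path.join(path_to_data, f)
        # Check the full path; a bare filename is resolved against the working directory
        if not os.path.isdir(full_path) and f.endswith('.txt'):
            with open(full_path) as t:
                text = t.read().strip('\n')
                data.append(tokenize(text))
    return data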

In [0]:
tokens = gather_data(path)

In [0]:
tokens[0][0:10]

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 'handsome',
 'clever',
 'rich']

In [0]:
"this is a sample string with a \n newline character".replace('\n', '')

'this is a sample string with a  newline character'

## Follow Along

In [0]:
# Keep only the .txt files so titles line up with the documents read by gather_data
titles = [t[:-4] for t in os.listdir(path) if t.endswith('.txt')]

In [0]:
len(titles)

813

In [0]:
len(tokens)

813

### Author DataFrame


In [0]:
df = pd.DataFrame(index=titles, data={'tokens':tokens})

In [0]:
df.head()

| | tokens |
|---|---|
| Austen_Emma0000 | [emma, jane, austen, volume, chapter, emma, wo... |
| Austen_Emma0001 | [taylor, wish, pity, weston, thought, agree, p... |
| Austen_Emma0002 | [behaved, charmingly, body, punctual, body, be... |
| Austen_Emma0003 | [native, highbury, born, respectable, family, ... |
| Austen_Emma0004 | [mention, handsome, letter, weston, received, ... |


In [0]:
df['author'] = df.reset_index()['index'].apply(lambda x: x.split('_')[0]).tolist()
df['book'] = df.reset_index()['index'].apply(lambda x: x.split('_')[1][:-4]).tolist()
df['section'] = df.reset_index()['index'].apply(lambda x: x[-4:]).tolist()
df['section'] = df['section'].astype('int')

In [0]:
df['author'] = df['author'].map({'Austen':1, 'CBronte':0})

In [0]:
df.author.value_counts()

0    441
1    372
Name: author, dtype: int64

In [0]:
df.head()

| | tokens | author | book | section |
|---|---|---|---|---|
| Austen_Emma0000 | [emma, jane, austen, volume, chapter, emma, wo... | 1 | Emma | 0 |
| Austen_Emma0001 | [taylor, wish, pity, weston, thought, agree, p... | 1 | Emma | 1 |
| Austen_Emma0002 | [behaved, charmingly, body, punctual, body, be... | 1 | Emma | 2 |
| Austen_Emma0003 | [native, highbury, born, respectable, family, ... | 1 | Emma | 3 |
| Austen_Emma0004 | [mention, handsome, letter, weston, received, ... | 1 | Emma | 4 |


### Streaming Documents
Here we use a new pythonic thingy: the `yield` statement in our function. It turns the function into a generator, which hands back one document at a time, so we can iterate over a bunch of documents without reading them all into memory at once. You can see how we use this function later on.

In [0]:
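def doc_stream(path):
    """Yield the tokens of one .txt file at a time."""
    for f in os.listdir(path):
        full_path = os.path.join(path, f)
        # Check the full path; a bare filename is resolved against the working directory
        if not os.path.isdir(full_path) and f.endswith('.txt'):
            with open(full_path) as t:
                text = t.read().strip('\n')
                tokens = tokenize(text)
            yield tokens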

In [0]:
streaming_data = doc_stream(path)

In [0]:
type(streaming_data)

generator

In [0]:
# gather_data => returns a list
# doc_stream => returns a generator
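One practical consequence of the list versus generator difference is that a generator can only be consumed once. The sketch below uses a made-up `toy_stream` function and two made-up documents as a stand-in for `doc_stream`, which is why the notebook calls `doc_stream(path)` again each time it needs another pass over the corpus.

```python
def toy_stream(docs):
    """Yield one tokenized document at a time (stand-in for doc_stream)."""
    for doc in docs:
        yield doc.lower().split()

# Hypothetical raw documents for illustration.
raw_docs = ["Emma Woodhouse handsome clever rich", "Jane Eyre Thornfield Hall"]

stream = toy_stream(raw_docs)
first_pass = list(stream)   # consumes every document
second_pass = list(stream)  # the generator is now exhausted

print(len(first_pass))   # → 2
print(len(second_pass))  # → 0

# Calling the function again creates a fresh generator.
fresh = toy_stream(raw_docs)
print(next(fresh))  # → ['emma', 'woodhouse', 'handsome', 'clever', 'rich']
```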

In [0]:
next(streaming_data) # Returns one document at a time from the generator

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 'handsome',
 'clever',
 'rich',
 'comfortable',
 'home',
 'happy',
 'disposition',
 'unite',
 'best',
 'blessings',
 'existence',
 'lived',
 'nearly',
 'years',
 'world',
 'little',
 'distress',
 'vex',
 'youngest',
 'daughters',
 'affectionate',
 'indulgent',
 'father',
 'consequence',
 'sister',
 'marriage',
 'mistress',
 'house',
 'early',
 'period',
 'mother',
 'died',
 'long',
 'ago',
 'indistinct',
 'remembrance',
 'caresses',
 'place',
 'supplied',
 'excellent',
 'woman',
 'governess',
 'fallen',
 'little',
 'short',
 'mother',
 'affection',
 'sixteen',
 'years',
 'miss',
 'taylor',
 'woodhouse',
 'family',
 'governess',
 'friend',
 'fond',
 'daughters',
 'particularly',
 'emma',
 'intimacy',
 'sisters',
 'miss',
 'taylor',
 'ceased',
 'hold',
 'nominal',
 'office',
 'governess',
 'mildness',
 'temper',
 'hardly',
 'allowed',
 'impose',
 'restraint',
 'shadow',
 'authority',
 'long',
 'passed',
 'away',


### Gensim LDA Topic Modeling

In [0]:
# A Dictionary Representation of all the words in our corpus
id2word = corpora.Dictionary(doc_stream(path))

In [0]:
id2word.token2id['tea']

284

In [0]:
id2word.doc2bow(tokenize("This is a sample message Darcy England England England"))

[(2754, 1), (3987, 3), (6602, 1), (6819, 1)]
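Each pair in that output is `(token_id, count)`. As a rough mental model, here is a plain-Python sketch of a dictionary plus bag-of-words conversion. The helper names and the toy corpus are hypothetical, and gensim assigns its ids differently, so the numbers will not match the output above.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical two-document corpus.
docs = [["darcy", "england", "tea"], ["sample", "message", "tea"]]
token2id = build_token2id(docs)

new_doc = ["sample", "message", "darcy", "england", "england", "england", "unseen"]
print(doc2bow(new_doc, token2id))
# → [(0, 1), (1, 3), (3, 1), (4, 1)]
```

Note that `unseen` is silently dropped because it is not in the vocabulary, and word order is lost entirely: only the counts survive.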

In [0]:
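import sys
# Note: sys.getsizeof is shallow; it reports the container object only,
# not the tokens or dictionary entries it references.
print(sys.getsizeof(id2word))
print(sys.getsizeof(tokens))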

56
7056


In [0]:
len(id2word.keys())

22096

In [0]:
# Let's remove extreme values from the dataset:
# drop tokens that appear in fewer than 5 documents or in more than 95% of documents
id2word.filter_extremes(no_below=5, no_above=0.95)

In [0]:
len(id2word.keys())

8103
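The drop from 22096 to 8103 tokens comes from filtering on document frequency, meaning the number of documents a token appears in. Here is a plain-Python sketch of that idea on a made-up four-document corpus. It mirrors the `no_below` and `no_above` parameters but is an approximation, since gensim's own method also caps the vocabulary size and may round the upper threshold differently.

```python
from collections import Counter

def filter_vocab(tokenized_docs, no_below=5, no_above=0.95):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

# Hypothetical corpus: "miss" is everywhere, "jane" and "rochester" are rare.
docs = [
    ["miss", "emma", "tea"],
    ["miss", "jane", "tea"],
    ["miss", "emma"],
    ["miss", "rochester"],
]
kept = filter_vocab(docs, no_below=2, no_above=0.75)
print(sorted(kept))
# → ['emma', 'tea']
```

Very rare tokens carry too little evidence to shape a topic, and tokens found in nearly every document cannot help tell topics apart.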

In [0]:
# A bag-of-words (BoW) representation of our corpus
# Note: the raw text is streamed one document at a time; only the BoW vectors are kept in memory
# Although abstracted away, tokenization IS happening inside doc_stream
corpus = [id2word.doc2bow(text) for text in doc_stream(path)]

In [0]:
corpus[345][:10]

[(0, 1),
 (2, 1),
 (11, 1),
 (21, 2),
 (32, 1),
 (34, 1),
 (35, 1),
 (37, 1),
 (53, 1),
 (54, 1)]

In [0]:
lda = LdaMulticore(corpus = corpus,
                   id2word = id2word,
                   random_state = 42,
                   num_topics = 15,
                   passes = 10,
                   workers = 4)

In [0]:
dir(lda)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_apply',
 '_load_specials',
 '_save_specials',
 '_smart_save',
 'alpha',
 'batch',
 'bound',
 'callbacks',
 'chunksize',
 'clear',
 'decay',
 'diff',
 'dispatcher',
 'distributed',
 'do_estep',
 'do_mstep',
 'dtype',
 'eta',
 'eval_every',
 'expElogbeta',
 'gamma_threshold',
 'get_document_topics',
 'get_term_topics',
 'get_topic_terms',
 'get_topics',
 'id2word',
 'inference',
 'init_dir_prior',
 'iterations',
 'load',
 'log_perplexity',
 'minimum_phi_value',
 'minimum_probability',
 'num_terms',
 'num_topics',
 'num_updates',
 'numworkers',
 'offset',
 'optimize_al

In [0]:
lda.print_topics()

[(0,
  '0.008*"frances" + 0.007*"lucy" + 0.006*"felt" + 0.005*"elinor" + 0.005*"edward" + 0.004*"mother" + 0.004*"heart" + 0.004*"little" + 0.004*"john" + 0.003*"monsieur"'),
 (1,
  '0.006*"like" + 0.005*"little" + 0.004*"rochester" + 0.004*"thought" + 0.004*"room" + 0.004*"night" + 0.004*"long" + 0.004*"door" + 0.004*"day" + 0.003*"looked"'),
 (2,
  '0.007*"helen" + 0.006*"brocklehurst" + 0.006*"temple" + 0.005*"thought" + 0.005*"little" + 0.005*"miss" + 0.005*"know" + 0.005*"time" + 0.004*"good" + 0.003*"like"'),
 (3,
  '0.022*"emma" + 0.015*"harriet" + 0.011*"weston" + 0.011*"knightley" + 0.010*"miss" + 0.009*"elton" + 0.008*"thing" + 0.008*"think" + 0.008*"good" + 0.008*"little"'),
 (4,
  '0.010*"marianne" + 0.006*"house" + 0.006*"elinor" + 0.005*"little" + 0.005*"know" + 0.004*"lady" + 0.004*"edward" + 0.004*"like" + 0.003*"day" + 0.003*"saw"'),
 (5,
  '0.009*"little" + 0.006*"madame" + 0.006*"like" + 0.005*"know" + 0.004*"good" + 0.004*"thought" + 0.004*"bretton" + 0.004*"monsieu

In [0]:
words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]

In [0]:
topics = [' '.join(t[0:10]) for t in words]

In [0]:
for id, t in enumerate(topics): 
    print(f"------ Topic {id} ------")
    print(t, end="\n\n")

------ Topic 0 ------
frances lucy felt elinor edward mother heart little john monsieur

------ Topic 1 ------
like little rochester thought room night long door day looked

------ Topic 2 ------
helen brocklehurst temple thought little miss know time good like

------ Topic 3 ------
emma harriet weston knightley miss elton thing think good little

------ Topic 4 ------
marianne house elinor little know lady edward like day saw

------ Topic 5 ------
little madame like know good thought bretton monsieur graham day

------ Topic 6 ------
hunsden like little good time eye thought hand think man

------ Topic 7 ------
jane miss know shall think like time come day fairfax

------ Topic 8 ------
miss jane emma woodhouse elton thing bessie like shall frank

------ Topic 9 ------
monsieur know good little time like sister old night miss

------ Topic 10 ------
room know little time good house come harriet papa came

------ Topic 11 ------
miss helen temple heart eyes time elton little harriet

## Challenge 

You will apply an LDA model to a customer review dataset to practice the fitting and estimation of LDA. 

# Interpret LDA Results (Learn)
<a id="#p3"></a>

## Overview

## Follow Along

### Topic Distance Visualization

In [0]:
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [0]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)



### Overall Model / Documents

In [0]:
len(corpus[0])

310

In [0]:
lda[corpus[0]]

[(3, 0.62296575), (7, 0.09147917), (14, 0.28370324)]

In [0]:
distro = [lda[d] for d in corpus]

In [0]:
distro[3]

[(3, 0.598606),
 (5, 0.05823973),
 (7, 0.0619056),
 (12, 0.18525653),
 (14, 0.09435414)]

In [0]:
distro = [lda[d] for d in corpus]

def update(doc):
    # Start every topic at 0, then fill in the topics the model reported
    d_dist = {k: 0 for k in range(lda.num_topics)}
    for t in doc:
        d_dist[t[0]] = t[1]
    return d_dist

new_distro = [update(d) for d in distro]

In [0]:
len(new_distro)

813

In [0]:
new_distro[0]

{0: 0,
 1: 0,
 2: 0,
 3: 0.622963,
 4: 0,
 5: 0,
 6: 0,
 7: 0.09152189,
 8: 0,
 9: 0,
 10: 0,
 11: 0,
 12: 0,
 13: 0,
 14: 0.28366324}
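The zeros above are filled in by us, not by the model: `lda[...]` only reports topics whose probability clears a minimum threshold. A minimal sketch of the same sparse-to-dense conversion, using a list instead of a dict and the values printed for `lda[corpus[0]]` earlier, shows how much probability mass the omitted topics account for. The helper name `to_dense` is made up for illustration.

```python
def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

# Values copied from the lda[corpus[0]] output earlier in this section.
sparse = [(3, 0.62296575), (7, 0.09147917), (14, 0.28370324)]
dense = to_dense(sparse, num_topics=15)

print(dense[3], dense[0])
# Probability mass spread across the topics that were left out
print(round(1 - sum(dense), 4))
```

The leftover is tiny, so treating the unreported topics as 0 loses very little information.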

In [0]:
df.head()

| | tokens | author | book | section |
|---|---|---|---|---|
| Austen_Emma0000 | [emma, jane, austen, volume, chapter, emma, wo... | 1 | Emma | 0 |
| Austen_Emma0001 | [taylor, wish, pity, weston, thought, agree, p... | 1 | Emma | 1 |
| Austen_Emma0002 | [behaved, charmingly, body, punctual, body, be... | 1 | Emma | 2 |
| Austen_Emma0003 | [native, highbury, born, respectable, family, ... | 1 | Emma | 3 |
| Austen_Emma0004 | [mention, handsome, letter, weston, received, ... | 1 | Emma | 4 |


In [0]:
df = pd.DataFrame.from_records(new_distro, index=titles)
df.columns = topics
df['author'] = df.reset_index()['index'].apply(lambda x: x.split('_')[0]).tolist()

In [0]:
df.head()

| | frances lucy felt elinor edward mother heart little john monsieur | like little rochester thought room night long door day looked | helen brocklehurst temple thought little miss know time good like | emma harriet weston knightley miss elton thing think good little | marianne house elinor little know lady edward like day saw | little madame like know good thought bretton monsieur graham day | hunsden like little good time eye thought hand think man | jane miss know shall think like time come day fairfax | miss jane emma woodhouse elton thing bessie like shall frank | monsieur know good little time like sister old night miss | room know little time good house come harriet papa came | miss helen temple heart eyes time elton little harriet minutes | bingley miss elizabeth bennet jane darcy room good know sure | elizabeth darcy bennet jane miss know wickham bingley soon collins | elinor marianne sister mother time think edward know dashwood miss | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austen_Emma0000 | 0.0 | 0.0 | 0.0 | 0.622963 | 0.0 | 0.0 | 0.0 | 0.091522 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.283663 | Austen |
| Austen_Emma0001 | 0.0 | 0.0 | 0.0 | 0.997371 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen |
| Austen_Emma0002 | 0.0 | 0.0 | 0.0 | 0.997563 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen |
| Austen_Emma0003 | 0.0 | 0.0 | 0.0 | 0.598332 | 0.0 | 0.058142 | 0.0 | 0.061645 | 0.0 | 0.0 | 0.0 | 0.0 | 0.185168 | 0.0 | 0.095076 | Austen |
| Austen_Emma0004 | 0.0 | 0.0 | 0.0 | 0.708458 | 0.0 | 0.0 | 0.0 | 0.198859 | 0.090836 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen |


In [0]:
df.groupby('author').mean()

| author | frances lucy felt elinor edward mother heart little john monsieur | like little rochester thought room night long door day looked | helen brocklehurst temple thought little miss know time good like | emma harriet weston knightley miss elton thing think good little | marianne house elinor little know lady edward like day saw | little madame like know good thought bretton monsieur graham day | hunsden like little good time eye thought hand think man | jane miss know shall think like time come day fairfax | miss jane emma woodhouse elton thing bessie like shall frank | monsieur know good little time like sister old night miss | room know little time good house come harriet papa came | miss helen temple heart eyes time elton little harriet minutes | bingley miss elizabeth bennet jane darcy room good know sure | elizabeth darcy bennet jane miss know wickham bingley soon collins | elinor marianne sister mother time think edward know dashwood miss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austen | 0.003704 | 0.003903 | 0.003012 | 0.230096 | 0.029158 | 0.003323 | 0.00172 | 0.092028 | 0.038381 | 0.00318 | 0.003461 | 0.001526 | 0.048813 | 0.249736 | 0.285426 |
| CBronte | 0.000948 | 0.458366 | 0.02347 | 0.002322 | 0.01201 | 0.278114 | 0.098264 | 0.072756 | 0.008513 | 0.014664 | 0.002263 | 0.002262 | 0.002721 | 0.006095 | 0.014713 |
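One quick way to read that table is to rank each author's mean topic weights. The sketch below copies the two rows from the `groupby` output above into plain lists, indexed by topic number, and pulls out the three heaviest topics per author. The helper name `top_topics` is made up for illustration.

```python
# Mean topic weights per author, copied from the groupby output above.
austen = [0.003704, 0.003903, 0.003012, 0.230096, 0.029158, 0.003323, 0.00172,
          0.092028, 0.038381, 0.00318, 0.003461, 0.001526, 0.048813, 0.249736,
          0.285426]
cbronte = [0.000948, 0.458366, 0.02347, 0.002322, 0.01201, 0.278114, 0.098264,
           0.072756, 0.008513, 0.014664, 0.002263, 0.002262, 0.002721, 0.006095,
           0.014713]

def top_topics(weights, n=3):
    """Return the n (topic_id, weight) pairs with the largest weights."""
    ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

print("Austen: ", top_topics(austen))
# → [(14, 0.285426), (13, 0.249736), (3, 0.230096)]
print("CBronte:", top_topics(cbronte))
# → [(1, 0.458366), (5, 0.278114), (6, 0.098264)]
```

The two authors share none of their top three topics. Looking at the topic words, most of these topics are dominated by character names, so the model is largely separating the novels by cast rather than by theme.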


## Challenge
### *Can we see if one of the authors focuses more on men than women?*

*  Use Spacy for text preprocessing
*  Extract the Named Entities from the documents using Spacy (the command is fairly straightforward)
*  Create a unique list of names from each author's documents (you'll find that there are different types of named entities, and not all of them are people)
*  Label the names with genders (you can do this by hand or use the US Census name lists)
*  Customize your processing to replace each proper name with its gender from the previous step's lookup table
*  Then follow the rest of the LDA flow


# Selecting the Number of Topics (Learn)
<a id="#p4"></a>

## Overview

## Follow Along

In [0]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [0]:
from gensim.models.coherencemodel import CoherenceModel

def compute_coherence_values(dictionary, corpus, limit, start=2, step=3, passes=5):
    """
    Compute u_mass coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between topic counts
    passes : the number of times each LDA model is refit and its coherence recalculated

    Returns:
    -------
    coherence_values : list of dicts, one per fitted model, with keys 'pass', 'num_topics', and 'coherence_score'
    """
    
    coherence_values = []
    
    for iter_ in range(passes):
        print(f'PASS #{iter_}')
        for num_topics in range(start, limit, step):
            model = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=4)
            coherencemodel = CoherenceModel(model=model,dictionary=dictionary,corpus=corpus, coherence='u_mass')
            coherence_values.append({'pass': iter_, 
                                     'num_topics': num_topics, 
                                     'coherence_score': coherencemodel.get_coherence()
                                    })
            print(f'Evaluating Topic Model with {num_topics} topics...')

    return coherence_values
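For intuition about what a `u_mass` score measures, here is a plain-Python sketch of the UMass coherence idea for a single topic: for each pair of top words, take the log of how often they co-occur in documents relative to how often the higher-ranked word occurs. The corpus and word lists are made up, and gensim's exact smoothing and aggregation may differ, so treat this as an illustration rather than a reimplementation.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words,
    where D counts the documents containing all of the given words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occur = doc_count(top_words[m], top_words[l])
            score += math.log((co_occur + 1) / doc_count(top_words[l]))
    return score

# Hypothetical corpus: "emma" and "harriet" co-occur, "emma" and "door" never do.
docs = [
    ["emma", "harriet", "weston"],
    ["emma", "harriet"],
    ["emma", "rochester"],
    ["rochester", "night", "door"],
]
coherent = umass_coherence(["emma", "harriet"], docs)
mixed = umass_coherence(["emma", "door"], docs)
print(coherent, mixed)
```

Scores are at most around zero, and a score closer to zero means the topic's top words tend to appear in the same documents.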

In [0]:
# Can take a long time to run.
coherence_values = compute_coherence_values(dictionary=id2word,
                                            corpus=corpus,
                                            start=5,
                                            limit=30,
                                            step=3,
                                            passes=10)

PASS #0
Evaluating Topic Model with 5 topics...
Evaluating Topic Model with 8 topics...
Evaluating Topic Model with 11 topics...
Evaluating Topic Model with 14 topics...
Evaluating Topic Model with 17 topics...
Evaluating Topic Model with 20 topics...
Evaluating Topic Model with 23 topics...
Evaluating Topic Model with 26 topics...
Evaluating Topic Model with 29 topics...
PASS #1
Evaluating Topic Model with 5 topics...
Evaluating Topic Model with 8 topics...
Evaluating Topic Model with 11 topics...
Evaluating Topic Model with 14 topics...
Evaluating Topic Model with 17 topics...
Evaluating Topic Model with 20 topics...
Evaluating Topic Model with 23 topics...
Evaluating Topic Model with 26 topics...
Evaluating Topic Model with 29 topics...
PASS #2
Evaluating Topic Model with 5 topics...
Evaluating Topic Model with 8 topics...
Evaluating Topic Model with 11 topics...
Evaluating Topic Model with 14 topics...
Evaluating Topic Model with 17 topics...
Evaluating Topic Model with 20 topics..

In [0]:
topic_coherence = pd.DataFrame.from_records(coherence_values)

In [0]:
topic_coherence.head()

In [0]:
import seaborn as sns

ax = sns.lineplot(x="num_topics", y="coherence_score", data=topic_coherence)

In [0]:
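# Print the mean coherence score for each number of topics
mean_scores = topic_coherence.groupby('num_topics')['coherence_score'].mean()
for m, cv in mean_scores.items():
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))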

In [0]:
lda[id2word.doc2bow(tokenize("This is a sample document to score with a topic distribution."))]

# Sources

### *References*
* [Blei, Ng & Jordan (2003) paper on LDA](https://ai.stanford.edu/~ang/papers/jair03-lda.pdf)
* On [Coherence](https://pdfs.semanticscholar.org/1521/8d9c029cbb903ae7c729b2c644c24994c201.pdf)

### *Resources*

* [Gensim](https://radimrehurek.com/gensim/): Python package for topic modeling, NLP, word vectorization, and a few other things. Well maintained and well documented.
* [Topic Modeling with Gensim](http://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#11createthedictionaryandcorpusneededfortopicmodeling): A kind of cookbook for LDA with gensim. Excellent overview, but you need to be aware of missing import statements and assumed prior knowledge.
* [Chinese Restaurant Process](https://en.wikipedia.org/wiki/Chinese_restaurant_process): That really obscure stats thing I mentioned...
* [PyLDAvis](https://github.com/bmabey/pyLDAvis): Library for visualizing the topic model and performing some exploratory work. Works well. Has a direct parallel implementation in R as well.
* [Rare Technologies](https://rare-technologies.com/): The people that made & maintain gensim and a few other libraries.
* [Jane Austen v. Charlotte Bronte](https://www.literaryladiesguide.com/literary-musings/jane-austen-charlotte-bronte-different-alike/)