# Topic Modeling Lab

### To do list:
- Use scikit-learn because
    - It has both LDA and NMF
    - It is the same API they'll use for many stats and ML tasks
    - There's a good example guide here: https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730
    
- Steps 
    - preprocessing (use 'clean' data from lab notebook 1)
    - fitting LDA
    - viewing, interpreting topics
    - cross tabs of topics and personal attributes of interest
    - fitting NMF
    - viewing, interpreting results
    - comparing LDA & NMF topics (deal with alignment)
    - cross tabs
    - compare LDA & NMF cross tabs (undermine singular, objective truth)
    
- Functionality
    - Display topics found
    - Undo stems

# 0. Setup
#### Import the packages we'll use

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from bs4 import BeautifulSoup
import re

#### Read in our data and peek at the first rows

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')
#profiles = profiles.sample(5000)
profiles.head(2)

#### Clean up the text: strip HTML, lowercase, and remove URL fragments and stray `nan` strings

In [None]:
def clean(text):
    t = BeautifulSoup(text, 'lxml').get_text()
    t = t.lower()
    
    bad_words = ['http', 'www', '\nnan']
    
    for b in bad_words:
        t = t.replace(b, '')
    
    return t

profiles['clean'] = profiles.text.apply(clean)
profiles.head(2)

# 1. Topic Modeling
#### Some parameters

In [None]:
#how many topics we want our model to find
ntopics = 15

#how many top words we want to display for each topic
nshow = 10

#what we will use as our documents, here the cleaned up text of each profile
documents = profiles.clean.values

#### Convert text to count vectors

In [None]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, 
                                stop_words='english')

print("Vectorizing text by word counts...")
tf_text = tf_vectorizer.fit_transform(documents)

tmp = tf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()

print("The first few words (alphabetically) are:\n", tf_feature_names[:20])

#### Build a topic model using LDA

- LDA can be a little slow. We'll use a faster method (NMF) later on.
- Set `n_jobs=` to the number of processors you want to use to compute LDA. If you set it to `-1`, it will use all available processors. 

In [None]:
model = LatentDirichletAllocation(n_components=ntopics, max_iter=10, 
                                  learning_method='online', n_jobs=-1)

print('Performing LDA on vectors...')
lda = model.fit(tf_text)

print('Done!')

#### Define a function for showing our topics

In [None]:
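def display_topics(model, feature_names, n_words):
    """Print the n_words highest-weighted words for each topic in a fitted model."""
    # model.components_ has one row per topic and one column per word
    for topic_idx, topic in enumerate(model.components_):
        # argsort() orders word indices from lowest to highest weight;
        # the slice takes the last n_words of them, highest weight first
        top_indices = topic.argsort()[:-n_words - 1:-1]
        words = [feature_names[i] for i in top_indices]

        print("Topic", topic_idx, ":  ", " ".join(words))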
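The slice `[:-n_words - 1:-1]` is the least obvious part of this function, so here is a small worked example. The weights and words are made up for illustration.

```python
import numpy as np

# Made-up weights for one topic, and the word each weight belongs to
toy_weights = np.array([0.1, 0.7, 0.05, 0.9, 0.3])
toy_words = ["music", "travel", "work", "food", "friends"]
n_words = 3

# argsort() gives the indices that would sort the weights from smallest to largest
order = toy_weights.argsort()

# Walk backwards from the end and stop after n_words items: largest weights first
top = order[:-n_words - 1:-1]

top_words = [toy_words[i] for i in top]
print(top_words)  # ['food', 'travel', 'friends']
```
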

#### Show our topics with the top words in each

In [None]:
display_topics(lda, tf_feature_names, n_words=nshow)
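One way to approach the cross tabs from the to-do list is to give each document its single strongest topic and tabulate that against a profile attribute. The sketch below uses a made-up document-topic array and a hypothetical `group` column. In this notebook the array could come from `lda.transform(tf_text)`, and the attribute would be whichever column of `profiles` is of interest.

```python
import numpy as np
import pandas as pd

# Made-up document-topic weights: one row per document, one column per topic
doc_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
])

# Hypothetical attribute column; a stand-in for a real column in profiles
toy_profiles = pd.DataFrame({'group': ['a', 'b', 'a', 'b']})

# The strongest topic for each document is the column with the largest weight
toy_profiles['top_topic'] = doc_topics.argmax(axis=1)

# Count how many documents fall in each (topic, group) combination
table = pd.crosstab(toy_profiles['top_topic'], toy_profiles['group'])
print(table)
```

Keeping only the strongest topic throws away the fact that a document can mix several topics, so this is a simplification rather than the only way to build the table.
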

### Let's try again with a different model: NMF
#### Convert the words to a TF-IDF vector

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000, 
                                   stop_words='english')

print("Vectorizing text by TF-IDF...")
tfidf_text = tfidf_vectorizer.fit_transform(documents)
tmp = tfidf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

#### The features are the same as before: both vectorizers build the same vocabulary from the text, and only the weights differ

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print("The first few words (alphabetically) are:\n", tfidf_feature_names[:20])

#### Build a topic model using NMF

- NMF is faster than LDA and often works a little better for small documents.
- NMF has no `n_jobs=` parameter.

In [None]:
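# Older scikit-learn versions accepted alpha=.1, l1_ratio=.5 here. Since version 1.2
# the alpha parameter is gone; regularization is set with alpha_W / alpha_H, which are
# scaled differently, so the old value does not carry over directly.
# We use the default (no regularization) here.
model = NMF(n_components=ntopics, init='nndsvd')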

print('Performing NMF on vectors...')
nmf = model.fit(tfidf_text)

print('Done!')

#### Show our topics with the top words in each

In [None]:
display_topics(nmf, tfidf_feature_names, nshow)
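LDA and NMF number their topics independently, so topic 3 in one model has nothing to do with topic 3 in the other. A simple way to line them up, sketched below, is to compare the top words of each pair of topics and match each LDA topic to the NMF topic it overlaps with most. The word lists are made up, and the Jaccard overlap used here is one possible similarity measure, not the only one.

```python
def jaccard(a, b):
    """Share of words two lists have in common, out of all words in either list."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Made-up top words per topic, as if copied from the displayed output of each model
lda_topics = {
    0: ["music", "band", "guitar", "concert"],
    1: ["food", "cooking", "wine", "restaurant"],
}
nmf_topics = {
    0: ["cooking", "food", "eat", "wine"],
    1: ["guitar", "music", "play", "band"],
}

# For each LDA topic, find the NMF topic with the largest word overlap
best_match = {}
for lda_id, lda_words in lda_topics.items():
    scores = {nmf_id: jaccard(lda_words, nmf_words)
              for nmf_id, nmf_words in nmf_topics.items()}
    best_match[lda_id] = max(scores, key=scores.get)

print(best_match)  # {0: 1, 1: 0}
```

This matching is greedy: two LDA topics can end up paired with the same NMF topic, and some topics may have no good match at all. That is itself a useful reminder that the two models do not uncover one shared set of "true" topics.
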