# Topic Modeling Lab

### To do list:
- Use scikit-learn because
    - It has both LDA and NMF
    - It is the same API they'll use for many stats and ML tasks
    - There's a good example guide here: https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730
    
- Steps 
    - preprocessing (use 'clean' data from lab notebook 1)
    - fitting LDA
    - viewing, interpreting topics
    - cross tabs of topics and personal attributes of interest
    - fitting NMF
    - viewing, interpreting results
    - comparing LDA & NMF topics (deal with alignment)
    - cross tabs
    - compare LDA & NMF cross tabs (to undermine the idea of a single, objective truth)
    
- Functionality
    - Display topics found
    - Undo stems
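
One way the "Undo stems" item could work is to remember which surface forms produced each stem and display the most frequent one. The notebook does not show a stemming step, so this is only a sketch: `toy_stem` is a hypothetical stand-in for a real stemmer, and `build_unstem_map` is an illustrative helper name, not part of any library.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_unstem_map(tokens, stem=toy_stem):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


tokens = "hiking hiking hiked camping camps camping books books book".split()
unstem = build_unstem_map(tokens)

# Show each stem next to the readable form that would be displayed instead.
for stem_form, readable in unstem.items():
    print(stem_form, "->", readable)
```

Passing a real stemmer's function as the `stem` argument would reuse the same mapping logic; the map has to be built from the same tokens that were stemmed before vectorizing.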

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from bs4 import BeautifulSoup
import re

In [2]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')
#profiles = profiles.sample(5000)
profiles.head(2)

Unnamed: 0,age_group,body,alcohol_use,drug_use,edu,race_ethnicity,height_group,industry,kids,orientation,pets_likes,pets_has,pets_any,religion,sex,smoker,languages,text
0,20,overweight,yes,no,HS,multiple,over_6,other,no,straight,both,neither,no,none,m,yes,English_only,about me:<br />\n<br />\ni would love to think...
1,30,average,yes,yes,unknown,White,under_6,other,no,straight,both,neither,no,none,m,no,multiple,i am a chef: this is what that means.<br />\n1...


In [3]:
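def clean(text):
    """Strip HTML markup, lowercase, and remove a few junk substrings."""
    # Parse as HTML and keep only the visible text (drops tags such as <br />).
    t = BeautifulSoup(text, 'lxml').get_text()
    t = t.lower()

    # Substrings to delete wherever they occur: URL fragments, plus 'nan'
    # when it directly follows a newline.
    bad_words = ['http', 'www', '\nnan']

    for b in bad_words:
        t = t.replace(b, '')

    return t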

profiles['clean'] = profiles.text.apply(clean)
profiles.head(2)

Unnamed: 0,age_group,body,alcohol_use,drug_use,edu,race_ethnicity,height_group,industry,kids,orientation,pets_likes,pets_has,pets_any,religion,sex,smoker,languages,text,clean
0,20,overweight,yes,no,HS,multiple,over_6,other,no,straight,both,neither,no,none,m,yes,English_only,about me:<br />\n<br />\ni would love to think...,about me:\n\ni would love to think that i was ...
1,30,average,yes,yes,unknown,White,under_6,other,no,straight,both,neither,no,none,m,no,multiple,i am a chef: this is what that means.<br />\n1...,i am a chef: this is what that means.\n1. i am...
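
The `clean` function removes `nan` only when it directly follows a newline, and the token still surfaces in topics from both models later in the notebook. A possible extra pass, sketched here as an assumption rather than as the notebook's method, drops `nan` whenever it appears as a whole word. The helper name `drop_nan_tokens` is hypothetical.

```python
import re

# Matches "nan" only as a whole word, so "banana" and "nanny" are untouched.
NAN_TOKEN = re.compile(r"\bnan\b")


def drop_nan_tokens(text):
    """Remove standalone 'nan' tokens and collapse the leftover whitespace."""
    without_nan = NAN_TOKEN.sub(" ", text)
    return re.sub(r"\s+", " ", without_nan).strip()


sample = "i love banana bread\nnan\nnan and my nanny"
cleaned = drop_nan_tokens(sample)
print(cleaned)
```

The trade-off is that a profile using "nan" as a real word (for a grandmother, say) loses it too, so whether this pass is appropriate depends on how the placeholder got into the text.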


In [9]:
ntopics = 15
documents = profiles.clean.values

print("Vectorizing text by word counts...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, 
                                max_features=1000, 
                                stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use the _out variant.
tf_feature_names = tf_vectorizer.get_feature_names_out()

print('Performing LDA on vectors...')
lda = LatentDirichletAllocation(n_components=ntopics, 
                                max_iter=10, 
                                learning_method='online', 
                                n_jobs=-1 ).fit(tf)

print('Done!')

Vectorizing text by word counts...
Performing LDA on vectors...
Done!


In [10]:
nshow = 10

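def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words for each topic in a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        # argsort() sorts ascending, so this reversed slice walks the last
        # no_top_words indices backwards, i.e. from highest weight down.
        words = " ".join([feature_names[i] 
                          for i in topic.argsort()[:-no_top_words - 1:-1]])
        print("Topic", topic_idx, ":  ", words)
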
display_topics(lda, tf_feature_names, no_top_words=nshow)

Topic 0 :   work sports enjoy friends time play working hard playing family
Topic 1 :   like really just don ve good things pretty think ll
Topic 2 :   love like friends good just music people family going fun
Topic 3 :   don people like know just want life think love time
Topic 4 :   food movies books harry potter family italian star friends thai
Topic 5 :   san bay years francisco area moved ve school new city
Topic 6 :   new hiking bike city things fun camping travel love biking
Topic 7 :   music tv movies shows men bad books mad rock development
Topic 8 :   love friends good food new family time life wine enjoy
Topic 9 :   like people lot things books read time favorite movies reading
Topic 10 :   things making work time working world art people make new
Topic 11 :   life love people good looking open friends enjoy things time
Topic 12 :   music love like food art movies books rock good dance
Topic 13 :   games video game playing fi board sci movies play friends
Topic 14 :   im nan
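
For the cross-tab step in the to-do list, one option is to give each profile its highest-weight topic and tabulate that against an attribute column. The sketch below uses a small hand-written matrix in place of the real document-topic weights, which in this notebook would come from `lda.transform(tf)`. The toy `sex` values and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: one row per profile, one column per topic.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
])

# Toy attribute column standing in for something like profiles.sex.
people = pd.DataFrame({"sex": ["m", "f", "m", "f", "f", "m"]})

# Assign each profile the topic with the largest weight.
people["top_topic"] = doc_topic.argmax(axis=1)

# Rows are topics, columns are attribute values, cells are column shares.
table = pd.crosstab(people["top_topic"], people["sex"], normalize="columns")
print(table.round(2))
```

Taking only the top topic discards how mixed a profile is; averaging the topic weights within each attribute group is an alternative that keeps that information.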

In [11]:
print("Vectorizing text by TF-IDF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   max_features=1000, 
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use the _out variant.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

print('Performing NMF on vectors...')
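# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2 in favor
# of `alpha_W` / `alpha_H`, which are scaled differently, so newer versions
# will not reproduce the topics below exactly.
nmf = NMF(n_components=ntopics, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)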

print('Done!')

Vectorizing text by TF-IDF...
Performing NMF on vectors...
Done!


In [12]:
display_topics(nmf, tfidf_feature_names, nshow)

Topic 0 :   don really just ve know think ll things time say
Topic 1 :   nan smile eyes want working hair coffee iphone phone money
Topic 2 :   friends family hanging food time enjoy home smile dinner going
Topic 3 :   im dont lol alot wanna life just haha know really
Topic 4 :   music books movies food art rock shows black tv playing
Topic 5 :   life enjoy world open live great nature share looking relationship
Topic 6 :   bay area sf years ve work moved city year east
Topic 7 :   love music food laugh cook dancing dance travel great really
Topic 8 :   like lot stuff don movies things read going watch think
Topic 9 :   new things trying try places meeting learning exploring enjoy meet
Topic 10 :   fun looking just want guy lol know going meet working
Topic 11 :   good food really wine humor sense pretty conversation time great
Topic 12 :   favorite book movie enjoy movies games read time shows books
Topic 13 :   san francisco city moved living born years raised lived live
Topic 14 :  
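
Topic numbers are arbitrary in both models, so comparing LDA and NMF means pairing topics by content first. A minimal sketch, assuming Jaccard overlap of the top-word lists is an acceptable similarity measure: the word lists are copied from a few of the topics printed in this notebook, and the helper names are hypothetical.

```python
# Top words copied from selected LDA and NMF topics printed in the notebook.
lda_topics = {
    5: "san bay years francisco area moved ve school new city",
    6: "new hiking bike city things fun camping travel love biking",
    13: "games video game playing fi board sci movies play friends",
}
nmf_topics = {
    6: "bay area sf years ve work moved city year east",
    9: "new things trying try places meeting learning exploring enjoy meet",
    12: "favorite book movie enjoy movies games read time shows books",
    13: "san francisco city moved living born years raised lived live",
}


def jaccard(a, b):
    """Share of words two sets have in common, out of all words in either."""
    return len(a & b) / len(a | b)


def best_matches(left, right):
    """For each left topic, find the right topic with the highest overlap."""
    matches = {}
    for left_id, left_words in left.items():
        scores = {
            right_id: jaccard(set(left_words.split()), set(right_words.split()))
            for right_id, right_words in right.items()
        }
        best = max(scores, key=scores.get)
        matches[left_id] = (best, scores[best])
    return matches


matches = best_matches(lda_topics, nmf_topics)
for lda_id, (nmf_id, score) in matches.items():
    print(f"LDA {lda_id} -> NMF {nmf_id} (Jaccard {score:.2f})")
```

On these rows LDA topic 5 pairs with NMF topic 6, yet NMF topic 13 overlaps it almost as much, so a one-to-one alignment is not guaranteed to exist. Comparing the full `components_` weight vectors would be a finer-grained alternative, but only if both models share one vocabulary.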