# Topic Modeling Lab

### To-do list:

- interpreting topics (out of notebook placeholder)
- top topics by personal attributes
- comparing LDA & NMF topics (deal with alignment)
- credit to https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730


# 0. Setup
### Step 1: Import the packages we'll use

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from bs4 import BeautifulSoup
import warnings

### Step 2: Read in our data 

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')
profiles.head(2)

### Step 3: Pick which section of the profiles you want to analyze.
#### Options:
- text - All of the text from a profile
- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, show, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if...

#### Replace `'essay0'` in the cell below with the essay you want to look at.

In [None]:
profile_section_to_use = 'essay0'

### Step 4: Clean up the text for that essay.
#### Helper function for cleaning up text
- removes HTML code, link artefacts
- converts to lowercase

In [None]:
# Some of the essays have just a link in the text. BeautifulSoup sees that and gets 
# the wrong idea. This line hides those warnings.
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

def clean(text):
    if pd.isnull(text):
        t = np.nan
    else:
        t = BeautifulSoup(text, 'lxml').get_text()
        t = t.lower()

        bad_words = ['http', 'www', '\nnan']

        for b in bad_words:
            t = t.replace(b, '')
    if t == '':
        t = np.nan
    
    return t

#### Clean and select the text.

In [None]:
print('Cleaning up profile text for', profile_section_to_use, '...')
profiles['clean'] = profiles[profile_section_to_use].apply(clean)

print('We started with', profiles.shape[0], 'profiles.')
print("Dropping profiles that didn't fill out the essay we chose...")
profiles.dropna(axis=0, subset=['clean'], inplace=True)

print('We have', profiles.shape[0], 'profiles left.')

# 1. Topic Modeling
#### Some parameters: change these to get different numbers of topics or words per topic

In [None]:
#how many topics we want our model to find
ntopics = 10

#how many top words we want to display for each topic
nshow = 10

#what we will use as our documents, here the cleaned up text of each profile
documents = profiles['clean'].values

## 1.1 LDA
### Step 1: Convert text to numbers the computer understands
- LDA takes "count vectors" as input, that is, a count of how many times each word shows up in each document. 
    - Here we tell it to only use the 1,000 most popular words, ignoring stop words
- [Learn more](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) about LDA
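
To make "count vectors" more concrete, here is a small sketch on three made-up sentences. The sentences and the names `toy_docs`, `toy_vectorizer`, `toy_counts`, and `toy_vocab` are only for illustration and are not part of the profile data. Each row is a document, each column is a word, and each cell is how many times that word shows up in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three tiny made-up documents (illustration only)
toy_docs = ["i love hiking and i love coffee",
            "coffee and books",
            "hiking with my dog"]

# same kind of vectorizer as in the lab, without the 1,000-word limit
toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# the words that were kept, in column order (stop words like "and" are dropped)
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# one row per document, one column per word
print(toy_counts.toarray())
```

The word "love" appears twice in the first sentence, so its cell in the first row is 2.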

In [None]:
tf_vectorizer = CountVectorizer(max_features=1000, stop_words='english')

print("Vectorizing text by word counts...")
tf_text = tf_vectorizer.fit_transform(documents)

tmp = tf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_feature_names = tf_vectorizer.get_feature_names_out()

print("The first few words (alphabetically) are:\n", tf_feature_names[:20])

### Step 2: Build a topic model using LDA

- LDA can be a little slow. We'll use a faster method later on.
- Set `n_jobs=` to the number of processors you want to use to compute LDA. If you set it to `-1`, it will use all available processors. 

In [None]:
model = LatentDirichletAllocation(n_components=ntopics, max_iter=10, 
                                  learning_method='online', n_jobs=-1)

print('Performing LDA on vectors...')
lda = model.fit(tf_text)

print('Done!')

#### A function to show us the most important words in each topic

In [None]:
def display_topics(model, feature_names, n_words=10):
    # loop through each topic (component) in the model
    for topic_idx, topic in enumerate(model.components_):
        words = []
        # sort the words in the topic by importance
        topic = topic.argsort() 
        # select the n_words most important words
        topic = topic[:-n_words - 1:-1]
        # for each important word, get its name (i.e. the word) from our list of names
        for i in topic:
            words.append(feature_names[i])
        # print the topic number and its most important words, separated by spaces
        print("Topic", topic_idx, ":  ", " ".join(words))
    return
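
The slicing step in `display_topics` can look cryptic, so here is a small sketch of what `argsort()` followed by `[:-n_words - 1:-1]` does. The words and scores are made up for illustration (`toy_words` and `toy_scores` are not from a real model).

```python
import numpy as np

# made-up importance scores for a 5-word vocabulary (illustration only)
toy_words = ["books", "coffee", "dog", "hiking", "love"]
toy_scores = np.array([0.1, 0.7, 0.05, 0.3, 0.9])

n_words = 3
# positions of the words, ordered from least to most important
order = toy_scores.argsort()
# start at the end of that list and walk backwards for n_words steps
top = order[:-n_words - 1:-1]
top_words = [toy_words[i] for i in top]
print(top_words)  # ['love', 'coffee', 'hiking']
```

Because `argsort()` sorts from smallest to largest, reading the result backwards gives the most important words first.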

### Step 3: Show our topics with the top words in each

In [None]:
display_topics(lda, tf_feature_names, n_words=nshow)

### Step 4: Interpret these topics
- This part is for you to do: code can't do it for you.
- Look at the list of important words for each topic, and think about these questions.
    - What do the words have in common?
    - What could someone write that would use most of those words?
    - What does this topic seem to be about?
- Try to come up with a short, catchy name for each topic.
    - For example, if the words were "san francisco city moved living born years raised lived live", you might call it "places lived" because the topic seems to be about where people currently live (San Francisco) and where they were born / raised / moved from. 
- Try other numbers of topics.
    - If the topics seem repetitive, you might want to try looking for fewer topics.
    - If the topics seem confusing or vague, you might want to try looking for more topics (so that they can be more specific).
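
One thing that may help with the question "What could someone write that would use most of those words?" is to read the documents that score highest on each topic. The sketch below is one possible way to do that, not part of the original lab: `top_documents` is a hypothetical helper, and the toy sentences are made up. It relies on `transform`, which returns one row per document and one column per topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_documents(model, vectors, documents, n_docs=2):
    # hypothetical helper: find the documents that lean most on each topic
    doc_topics = model.transform(vectors)
    results = {}
    for topic_idx in range(doc_topics.shape[1]):
        # documents with the largest weight on this topic, biggest first
        best = doc_topics[:, topic_idx].argsort()[::-1][:n_docs]
        results[topic_idx] = [documents[i] for i in best]
    return results

# made-up documents (illustration only)
toy_docs = ["i moved to san francisco and love the city",
            "born and raised in the city of san francisco",
            "i love music movies and books",
            "books and movies and good food",
            "the city has great food and music",
            "raised on books and music"]

toy_vectors = CountVectorizer(stop_words='english').fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_vectors)

for topic_idx, docs in top_documents(toy_lda, toy_vectors, toy_docs).items():
    print("Topic", topic_idx)
    for d in docs:
        print("   ", d)
```

In the lab itself, the equivalent call would be `top_documents(lda, tf_text, documents)`.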

## 1.2 NMF
### Step 1: Convert text to numbers the computer understands
- NMF takes "tf-idf vectors" as input. Tf-idf stands for "term frequency - inverse document frequency." 
    - Term frequency is the same as the count vectors above: how often does each word appear in the text. 
    - Inverse document frequency means we divide ("inverse") by the number of documents the word is in. (If everyone uses the word, it isn't very helpful for figuring out what different people are talking about. So this measurement gives the most weight to words that are used a lot in some documents, and rarely or not at all in others.)
    - Here we tell it to only use the 1,000 most popular words, ignoring stop words
- [Learn more](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Text_mining) about NMF.
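
To see the "inverse document frequency" part in action, here is a small sketch on made-up sentences (`toy_docs` and `toy_tfidf` are illustrative names). The word "coffee" is in every sentence, while "dog" is in only one, so "dog" should get the larger idf value. With its default settings, scikit-learn works with a smoothed logarithm of the ratio rather than a plain division, so the exact numbers are on a log scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents (illustration only): "coffee" is in all of them
toy_docs = ["coffee and hiking",
            "coffee and books",
            "coffee with my dog"]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_weights = toy_tfidf.fit_transform(toy_docs)

# idf_ holds one inverse-document-frequency value per word
for word, col in sorted(toy_tfidf.vocabulary_.items()):
    print(word, round(float(toy_tfidf.idf_[col]), 2))
```

A word that everyone uses ends up with the smallest idf, so it counts for less in the tf-idf vectors.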

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

print("Vectorizing text by TF-IDF...")
tfidf_text = tfidf_vectorizer.fit_transform(documents)

tmp = tfidf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

#### The features are the same as in 1.1, because they are just the list of words in the text

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print("The first few words (alphabetically) are:\n", tfidf_feature_names[:20])

### Step 2: Build a topic model using NMF

- NMF is faster than LDA and often works a little better for small documents like we have here.

In [None]:
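# Note: scikit-learn 1.2 removed the `alpha` argument. On newer versions, pass
# `alpha_W` (and optionally `alpha_H`) instead. They are scaled differently,
# so the value needs re-tuning rather than copying .1 over.
model = NMF(n_components=ntopics, alpha=.1, l1_ratio=.5, init='nndsvd')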

print('Performing NMF on vectors...')
nmf = model.fit(tfidf_text)

print('Done!')

### Step 3: Show our topics with the top words in each

In [None]:
display_topics(nmf, tfidf_feature_names, nshow)

### Step 4: Interpret these topics
- This part is for you to do: code can't do it for you.
- Look at the list of important words for each topic, and think about these questions.
    - What do the words have in common?
    - What could someone write that would use most of those words?
    - What does this topic seem to be about?
- Try to come up with a short, catchy name for each topic.
    - For example, if the words were "san francisco city moved living born years raised lived live", you might call it "places lived" because the topic seems to be about where people currently live (San Francisco) and where they were born / raised / moved from. 
- Try other numbers of topics.
    - If the topics seem repetitive, you might want to try looking for fewer topics.
    - If the topics seem confusing or vague, you might want to try looking for more topics (so that they can be more specific).

### Step 5: Compare the topics from LDA and NMF
- Do any of the topics seem to be the same in both models?
- Are some topics in one model but not the other?
- Do the topics you get from one of the models make more sense than the ones you get from the other?
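
The topic numbers from LDA and NMF do not line up with each other, so Topic 3 in one model is not necessarily Topic 3 in the other. One rough way to pair them up is to measure how much their top-word lists overlap. The sketch below is one possible approach under that assumption, not the only way to align topics: `jaccard` and `best_matches` are hypothetical helpers, and the word lists are made up.

```python
def jaccard(words_a, words_b):
    # share of words the two lists have in common (0 = none, 1 = identical)
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def best_matches(topics_a, topics_b):
    # for each topic in the first model, find the most similar topic in the second
    matches = []
    for i, words_a in enumerate(topics_a):
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# made-up top words for two topics from each model (illustration only)
toy_lda_topics = [["san", "francisco", "city", "moved"],
                  ["music", "movies", "books", "food"]]
toy_nmf_topics = [["books", "movies", "music", "reading"],
                  ["city", "san", "francisco", "bay"]]

for i, j, score in best_matches(toy_lda_topics, toy_nmf_topics):
    print("LDA topic", i, "is closest to NMF topic", j, "with overlap", score)
```

Note that this matching is not one-to-one: two LDA topics can both pick the same NMF topic, which is itself a hint that one model split a theme the other kept together. A low best score suggests a topic that only one model found.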