# Topic modeling with nltk and scikit-learn

This will be a quick walk-through of building a topic model. I'm using [scikit-learn](http://scikit-learn.org/stable/), [nltk](http://www.nltk.org/) and [pandas](https://pandas.pydata.org/).

First, load some data. I'm using a cleaned version of the Pitchfork music review dataset from [kaggle](https://www.kaggle.com/nolanbconaway/pitchfork-data) that's hosted on my website. It has 18 thousand reviews with about 12 million words total.

In [58]:
import os
import pandas as pd
import numpy as np

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

DATA_PATH = 'reviews.csv'

# download the data if necessary
if not os.path.exists(DATA_PATH):
    f = urlopen('https://cyphe.rs/static/reviews.csv')
    with open(DATA_PATH, 'wb') as outfile:
        outfile.write(f.read())

# load it into a dataframe
df = pd.read_csv(DATA_PATH)

# we only care about the content of the reviews,
# so filter out rows with None/NaN content.
df = df[[isinstance(c, str) for c in df.content]]
docs = df.content
print('%d documents' % len(docs))

18378 documents


## Tokenize the raw text

First, create a tokenizer. This will crawl through the raw text and convert it to a series of cleaned tokens.

You can use a preconfigured tokenizer, or define one using a regex. The example below is very simple, but it works well enough.

In [59]:
from nltk.tokenize import RegexpTokenizer

# this will find all runs of 2 or more word characters (letters, digits, underscore).
# hyphenated words and contractions are split, so "don't" yields just "don".
tokenizer = RegexpTokenizer(r'\w\w+').tokenize
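
To see what this pattern keeps and drops, here is a small sketch using the standard library's `re.findall`, which, as far as I can tell, is what the regex tokenizer boils down to for a pattern like this. The sample sentence is made up for illustration and is not from the dataset.

```python
import re

# Same pattern as the tokenizer above: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r'\w\w+')

# Made-up sample text, not taken from the reviews.
sample = "Don't call it hip-hop; it's a 2-disc set."

tokens = TOKEN_PATTERN.findall(sample)
print(tokens)
# ['Don', 'call', 'it', 'hip', 'hop', 'it', 'disc', 'set']
```

Single-character fragments such as the "t" in "don't" are dropped, which is why tokens like "don" and "ve" show up in the topics later on.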

## Vectorize the documents

Now create a vectorizer. This will convert the set of tokens for each document into a vector of constant size so that documents can be analyzed with linear algebra. 

There are a variety of vectorization methods you can use. Most perform some variation of counting, for each document, how often each of the few thousand most common words in the corpus appears. [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency -- inverse document frequency) is probably the most popular method. TF-IDF deemphasizes words that are common across most documents, so it tends to be better for topic modeling. You can also try a count vectorizer (which just uses unprocessed word counts) or a hashing vectorizer (which maps all words to the vector space, not just the N most common). Keep in mind that a hashing vectorizer does not store which word maps to which feature, so it is much harder to convert topics generated by the final model into human-interpretable ones.
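
As a rough illustration of how TF-IDF weighting works, here is a plain-Python sketch on a made-up toy corpus. It assumes the smoothed idf formula and unit-length (L2) normalization that I believe scikit-learn uses by default; the `tfidf` helper and the toy documents are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
toy_docs = [
    ['album', 'guitar', 'guitar', 'noise'],
    ['album', 'rap', 'beats'],
    ['album', 'guitar', 'folk'],
]

def tfidf(docs):
    """Hypothetical helper: one {word: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # smoothed idf: a word found in every document gets the minimum value of 1
    idf = {w: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
           for w, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {w: counts[w] * idf[w] for w in counts}
        # scale each document vector to unit length
        norm = math.sqrt(sum(x * x for x in weights.values()))
        vectors.append({w: x / norm for w, x in weights.items()})
    return vectors

toy_vecs = tfidf(toy_docs)
for vec in toy_vecs:
    # print each document's words from heaviest to lightest
    print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

In the third toy document every word occurs once, yet "album" (present in all three documents) ends up with the smallest weight and "folk" (present in only one) with the largest.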

The important parameters here are `max_df` (words that appear in more than this share of the documents are dropped, which limits how many very common words make it into your topics), `max_features` (which determines the size of the vectors), and `stop_words` (if set to "english", common words like "the", "and", etc. will be ignored).

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df and max_df control which tokens are considered by the vectorizer.
# only words which are in at least min_df documents but not in more than max_df documents are counted.
# a float between 0 and 1 is read as a proportion of the documents; an integer is an absolute count.

# max_features is the size of the vector -- the total number of distinct words which the
# vectorizer counts for each document.
vectorizer = TfidfVectorizer(max_df=0.6,
                             min_df=2,
                             max_features=1000,
                             tokenizer=tokenizer,
                             stop_words='english')

doc_vecs = vectorizer.fit_transform(docs)
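
The filtering that `min_df`, `max_df` and `max_features` perform can be sketched in plain Python. This is only an approximation of the idea under my own assumptions, not scikit-learn's code: `select_vocabulary` is a hypothetical helper, ties are broken alphabetically for determinism, and the toy documents are made up.

```python
from collections import Counter

def select_vocabulary(docs, min_df=2, max_df=0.6, max_features=1000):
    """Hypothetical helper: filter by document frequency, then by total count."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    total = Counter(w for doc in docs for w in doc)
    # floats are proportions of the corpus, integers are absolute counts
    lo = min_df * n_docs if isinstance(min_df, float) else min_df
    hi = max_df * n_docs if isinstance(max_df, float) else max_df
    kept = [w for w, df in doc_freq.items() if lo <= df <= hi]
    # of the survivors, keep the max_features most frequent words
    kept.sort(key=lambda w: (-total[w], w))
    return sorted(kept[:max_features])

# Made-up tokenized documents.
toy_docs = [
    ['the', 'guitar', 'solo'],
    ['the', 'guitar', 'riff'],
    ['the', 'rap', 'beat'],
    ['the', 'rap', 'verse'],
    ['the', 'piano', 'solo'],
]

print(select_vocabulary(toy_docs))
# ['guitar', 'rap', 'solo']
```

Here "the" is dropped because it appears in more than 60% of the documents, and words like "riff" are dropped because they appear in only one.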

## Generate a topic model

Finally, use the vectorized documents to train a topic model. At this point, each document is represented by a 1000-dimensional TF-IDF vector, and all the doc vectors form a `(n_docs x n_features)` matrix. A topic model will attempt to find an approximate factorization of the matrix, into a `(n_docs x n_topics)` matrix of per-document topic weights and a `(n_topics x n_features)` matrix of topics, that retains most of the relevant information.

scikit-learn implements two methods, latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF). They basically solve the same problem, but LDA is supposed to be qualitatively better for big-data settings. I've had better luck with NMF for datasets around this size. This [quora post](https://www.quora.com/What-is-the-difference-between-NMF-and-LDA-Why-are-the-priors-of-LDA-sparse-induced) goes into more detail about the differences.
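
To make the idea of a non-negative factorization concrete, here is a toy sketch in plain Python using the classic multiplicative-update rule. This is not how scikit-learn's `NMF` is implemented (it ships its own solvers); the rule is swapped in here because it is short. The `nmf` function, its parameters and the toy matrix are all made up for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Frobenius norm of v - w*h: how much the factorization loses."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Toy NMF: v (docs x features) ~ w (docs x topics) * h (topics x features)."""
    rng = random.Random(seed)
    n_docs, n_feats = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() for _ in range(n_feats)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # update the topics, holding the document weights fixed
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_feats)]
             for k in range(n_topics)]
        # update the document weights, holding the topics fixed
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Made-up matrix: documents 0-1 use features 0-1, documents 2-3 use features 2-3.
toy_v = [[1, 1, 0, 0],
         [2, 2, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 3, 3]]

toy_w, toy_h = nmf(toy_v, n_topics=2)
print('reconstruction error: %.4f' % frob_error(toy_v, toy_w, toy_h))
```

Because the updates only multiply by non-negative ratios, both factors stay non-negative, which is what lets the rows of `toy_h` be read as weighted groups of features.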

The main parameter choice you have to make is the number of topics you want. Too many, and you might have lots of near-duplicates; too few, and you might miss important themes in the data. You can try a few different values and see what works.
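
One rough way to spot near-duplicate topics is to compare their top words. The sketch below is a heuristic of my own, not something the notebook does: `near_duplicates` is a hypothetical helper, the 0.4 threshold is arbitrary, and the word lists are invented examples.

```python
def jaccard(a, b):
    """Share of words two lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

def near_duplicates(topic_words, threshold=0.4):
    """Hypothetical helper: index pairs of topics whose top words overlap heavily."""
    pairs = []
    for i in range(len(topic_words)):
        for j in range(i + 1, len(topic_words)):
            if jaccard(topic_words[i], topic_words[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Invented top-word lists for three topics.
toy_topics = [
    ['rap', 'hip', 'hop', 'rapper', 'beats'],
    ['hop', 'hip', 'beats', 'beat', 'samples'],
    ['house', 'dance', 'techno', 'dj', 'disco'],
]

print(near_duplicates(toy_topics))
# [(0, 1)]
```

If many pairs show up, that is a hint the number of topics may be higher than the data supports.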

Each "topic" is a linear combination of all the features created by the vectorizer, and each feature corresponds to a token. Topics tend to be dominated by just a few features, so the topics can be thought of as weighted groups of words that tend to occur together. The utility functions below convert the raw topic vectors into readable strings.

In [61]:
from sklearn.decomposition import NMF #, LatentDirichletAllocation

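def name_topics(vectorizer, model):
    """
    Map topic vectors to human-readable groups of words. Returns a list with
    one string per topic, showing its five heaviest words and each word's
    share of the topic's total weight.
    """
    topics = []
    # get_feature_names() was removed in newer scikit-learn releases
    # in favor of get_feature_names_out(); use whichever exists.
    if hasattr(vectorizer, 'get_feature_names_out'):
        feature_names = vectorizer.get_feature_names_out()
    else:
        feature_names = vectorizer.get_feature_names()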
    for comp in model.components_:
        total = sum(comp)
        topic = ', '.join('(%.3f) %s' % (comp[i] / total, feature_names[i])
                          for i in comp.argsort()[:-6:-1])
        topics.append(topic)
    return topics

def print_top_topics(model, topics, vectors):
    # rank topics by their total weight across all documents
    sums = vectors.sum(axis=0)
    print('Top topics:')
    for i in np.argsort(-sums):
        print('%.2f: %s' % (sums[i], topics[i]))
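
The `argsort()[:-6:-1]` slice in `name_topics` is terse, so here is the same trick spelled out in plain Python. `sorted(range(...))` stands in for numpy's `argsort()`; the `top_indices` helper, the vocabulary and the weights are made up for illustration.

```python
def top_indices(weights, n=5):
    # indices sorted by ascending weight, like numpy's argsort()
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # walk backwards from the end and stop after n items:
    # the n largest weights, biggest first
    return order[:-(n + 1):-1]

# Made-up topic vector over a seven-word vocabulary.
vocab = ['band', 'rock', 'love', 'metal', 'punk', 'pop', 'jazz']
comp = [0.1, 0.7, 0.05, 0.3, 0.9, 0.2, 0.0]

total = sum(comp)
best = top_indices(comp)
print(', '.join('(%.3f) %s' % (comp[i] / total, vocab[i]) for i in best))
# (0.400) punk, (0.311) rock, (0.133) metal, (0.089) pop, (0.044) band
```

The number in parentheses is each word's share of the topic's total weight, which is what the topic strings in the outputs below show.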

Try with just a few topics first. This should train very quickly.

In [70]:
model5 = NMF(n_components=5)
topic_vecs = model5.fit_transform(doc_vecs)
topics5 = name_topics(vectorizer, model5)
print_top_topics(model5, topics5, topic_vecs)

Top topics:
512.86: (0.009) guitar, (0.007) sounds, (0.007) work, (0.006) track, (0.006) piano
479.83: (0.009) love, (0.009) pop, (0.006) don, (0.005) life, (0.005) good
374.92: (0.040) band, (0.019) rock, (0.012) metal, (0.011) punk, (0.009) bands
326.55: (0.018) house, (0.018) dance, (0.013) techno, (0.012) disco, (0.012) tracks
234.34: (0.028) rap, (0.022) hop, (0.021) hip, (0.012) rapper, (0.009) beats


More topics (will take a little longer):

In [71]:
model20 = NMF(n_components=20)
topic_vecs = model20.fit_transform(doc_vecs)
topics20 = name_topics(vectorizer, model20)
print_top_topics(model20, topics20, topic_vecs)

Top topics:
323.19: (0.013) don, (0.013) good, (0.012) know, (0.012) really, (0.011) ve
297.86: (0.014) guitar, (0.012) noise, (0.011) track, (0.010) sounds, (0.009) electronic
282.48: (0.015) record, (0.012) feels, (0.010) work, (0.010) feel, (0.009) sense
272.28: (0.021) live, (0.017) disc, (0.013) set, (0.012) version, (0.011) tracks
252.57: (0.122) band, (0.011) group, (0.010) members, (0.008) ve, (0.008) drummer
221.39: (0.122) rock, (0.028) indie, (0.021) roll, (0.019) guitar, (0.017) bands
210.25: (0.088) pop, (0.023) indie, (0.010) synth, (0.009) melodies, (0.008) debut
186.15: (0.041) house, (0.036) dance, (0.031) techno, (0.020) dj, (0.019) disco
180.61: (0.031) folk, (0.021) country, (0.017) guitar, (0.017) acoustic, (0.017) voice
166.96: (0.107) punk, (0.030) post, (0.029) hardcore, (0.015) wave, (0.015) new
161.02: (0.093) love, (0.014) sings, (0.014) life, (0.012) girl, (0.012) heart
145.66: (0.065) rap, (0.029) rapper, (0.017) mixtape, (0.012) beats, (0.009) young
145.32

Tons of topics (might take a while):

In [31]:
model50 = NMF(n_components=50)
topic_vecs = model50.fit_transform(doc_vecs)
topics50 = name_topics(vectorizer, model50)
print_top_topics(model50, topics50, topic_vecs)

Top topics:
296.08: (0.014) noise, (0.014) guitar, (0.011) track, (0.010) electronic, (0.010) sounds
273.76: (0.014) don, (0.013) good, (0.012) know, (0.012) really, (0.012) ve
270.94: (0.021) record, (0.013) feels, (0.012) feel, (0.012) work, (0.010) sense
250.19: (0.126) band, (0.011) group, (0.010) members, (0.008) ve, (0.008) bands
238.04: (0.028) live, (0.021) disc, (0.016) set, (0.015) version, (0.015) tracks
225.36: (0.035) new, (0.018) years, (0.015) old, (0.015) young, (0.014) city
210.06: (0.093) pop, (0.024) indie, (0.011) synth, (0.009) melodies, (0.008) group
208.04: (0.086) love, (0.015) sings, (0.013) life, (0.012) soul, (0.012) heart
204.14: (0.128) rock, (0.029) indie, (0.022) roll, (0.020) guitar, (0.018) bands
188.75: (0.041) house, (0.036) dance, (0.030) techno, (0.023) disco, (0.020) dj
180.13: (0.031) folk, (0.019) guitar, (0.018) country, (0.017) acoustic, (0.017) voice
150.12: (0.086) jazz, (0.028) funk, (0.020) soul, (0.020) free, (0.011) musicians
143.20: (0.1

Finally, you can put the vectorizer and the topic model together into a pipeline and compute topic weights for arbitrary documents with the new model. You can use the topics as features for classification, or to cluster documents together.
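
As one example of using topic weights as features, the sketch below finds the document whose topic mix is closest to a given one using cosine similarity. It is a minimal illustration under my own assumptions: `most_similar` is a hypothetical helper and the topic weights are invented. In practice the rows returned by the pipeline's `transform` would play the role of `toy_topic_vecs`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(index, topic_vecs):
    """Hypothetical helper: index of the document with the closest topic mix."""
    scores = [(cosine(topic_vecs[index], vec), j)
              for j, vec in enumerate(topic_vecs) if j != index]
    return max(scores)[1]

# Invented topic weights for four documents over three topics.
toy_topic_vecs = [
    [0.040, 0.002, 0.001],
    [0.035, 0.004, 0.000],
    [0.001, 0.038, 0.003],
    [0.000, 0.002, 0.041],
]

print(most_similar(0, toy_topic_vecs))
# 1
```

Cosine similarity ignores the overall size of the vectors, so a short review and a long one can still match if they lean on the same topics.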

In [73]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(vectorizer, model20)

def rank_topics(topics, vec):
    # show the five topics with the largest weights for this document
    for i in np.argsort(-vec)[:5]:
        print('%.3f: %s' % (vec[i], topics[i]))

# pick a random review
example = df.iloc[np.random.randint(len(df))]

top_vec = pipeline.transform([example.content])
rank_topics(topics20, top_vec[0])

print('')
print('"%s" by %s (%s)' % (example.title, example.artist, example.genre))
print('')
print(example.content)

0.040: (0.014) guitar, (0.012) noise, (0.011) track, (0.010) sounds, (0.009) electronic
0.038: (0.041) house, (0.036) dance, (0.031) techno, (0.020) dj, (0.019) disco
0.036: (0.101) hop, (0.096) hip, (0.018) beats, (0.014) beat, (0.013) samples
0.034: (0.109) ep, (0.023) track, (0.014) release, (0.014) new, (0.014) year
0.024: (0.015) record, (0.012) feels, (0.010) work, (0.010) feel, (0.009) sense

"lp" by container (experimental)

Ren Schofield was a member of the noise scene before he started his one-man techno project Container, and you can hear this anarchic influence seep into the clean lines of his current work. "I like things raw and kind of sloppy," he told Resident Advisor in 2012. "I like things when they're not perfect." That sensibility is what makes Container’s records so compelling and unique. His songs live on the verge of chaos, and though they never actually fall apart, the threat remains. At the same time, his adherence to regular rhythms and logical changes makes ea