# Text

There are lots of things you might want to do with documents:
 - group them together
 - summarize one or several of them
 - classify them (e.g. sentiment, tagging)
 - explore, e.g. temporal trends in what they discuss

To do any of these things using a computer, you first need to convert words into numbers.

[https://github.com/fastforwardlabs/usaa-lda](https://github.com/fastforwardlabs/usaa-lda)

## Bags of words

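The "bag of words" approach is the simplest method for this. Each document becomes a vector with one entry per word in the vocabulary, recording how many times (or simply whether) that word occurs in the document.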

![Bag of words](img/bagwords.png)

There are at least three problems with this approach:

 - **long, sparse data**: a short text is turned into a very long series of numbers, most of which are zero ("sparse"). The problem is not that this is wasteful (although it is that!), it's that this sparsity makes applying even simple algorithms tough. You have to apply clever or heuristic algorithms to group documents.
 
 - **synonyms and multiple meanings**: The bag of words version of a document that mentions "movies" and a document that mentions "films" are different (they should not be), and a document that mentions the kind of "bow" that shoots an arrow is similar to one about the kind of "bow" you make to a King.
 
 - **word order**: totally lost ("man bites dog" and "dog bites man" are the same).
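
To make the word-order problem concrete, here is a minimal pure-Python sketch of a bag-of-words vectorizer. The `bag_of_words` helper and the eight-word vocabulary are made up for illustration; a real vectorizer (such as the scikit-learn one used later) also deals with punctuation, stop words and a much larger vocabulary.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Toy vocabulary; a real one would contain thousands of words.
vocabulary = ["arrow", "bites", "bow", "dog", "film", "king", "man", "movie"]

v1 = bag_of_words("man bites dog", vocabulary)
v2 = bag_of_words("dog bites man", vocabulary)

print(v1)        # [0, 1, 0, 1, 0, 0, 1, 0]
print(v1 == v2)  # True: the two sentences are indistinguishable
```

Even with only eight words in the vocabulary, most entries are zero, which hints at how sparse the vectors become with a realistic vocabulary.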

We're going to look at a technique that addresses the first two of these problems.

# Topic modeling

Topic modeling is a statistical method to find groups of words that tend to co-occur in a corpus of documents.

For example, maybe the words "movie", "film" and "director" often occur in the same documents. That would make them a "topic".

Topic modeling algorithms find these groups automatically. They are an instance of the class of algorithms known as "unsupervised machine learning".

In doing this, we become able to express documents as a combination of a relatively small number (~100) of topics, rather than thousands of words (most of which don't occur), and we can treat documents about "films" and "movies" similarly.
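
Here is a toy sketch of why that helps. The word-to-topic assignment below is written by hand purely for illustration (a real topic model learns soft weights from data, and nothing here is LDA itself), but it shows how two documents that share no words can end up identical once they are expressed as topics.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vocabulary = ["director", "film", "movie", "tank", "water"]

# Hand-made assignment of words to two topics (an assumption for this sketch).
word_topic = {"director": 0, "film": 0, "movie": 0, "tank": 1, "water": 1}

def to_topics(word_counts, n_topics=2):
    """Collapse word counts into topic proportions that sum to 1."""
    topic_counts = [0.0] * n_topics
    for word, count in zip(vocabulary, word_counts):
        topic_counts[word_topic[word]] += count
    total = sum(topic_counts)
    return [c / total for c in topic_counts]

doc_films = [0, 2, 0, 0, 0]   # mentions "film" twice
doc_movies = [0, 0, 2, 0, 0]  # mentions "movie" twice

print(cosine(doc_films, doc_movies))                        # 0.0 in word space
print(cosine(to_topics(doc_films), to_topics(doc_movies)))  # 1.0 in topic space
```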

# Topic modeling workflow

Topic modeling has two steps:

 - learn topics from a corpus of representative documents
 - figure out which of these topics occur in the particular document(s) you're interested in
 
If you're a machine learning person, you'll recognize these as training and evaluation.

<img src="img/lda_topics.png" width=50%>

<img src="img/lda_evaluate.png" width=50%>

Once you've done the second step, you've expressed your new document as a short vector of numbers that you can now do all sorts of things with:

 - group documents together
 - summarize one or several documents
 - classify them (e.g. sentiment, tagging)
 - explore, e.g. temporal trends in what people are discussing, e.g. [Time-series plots of 1000 topics extracted from 20 years of the New York Times.](http://christo.cs.umass.edu/NYT/)

## Latent Dirichlet Allocation

The best-known algorithm for finding topics in a corpus is Latent Dirichlet Allocation (LDA). It's got a complicated name and, to be frank, it's a complicated algorithm. If you'd like to begin to dig into the details, there are two resources I recommend very highly:

 - [Tim Hopper's PyData NYC 2015 talk](https://www.youtube.com/watch?v=_R66X_udxZQ)
 - [David Blei's ACM article](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
 
For the purposes of this talk, I'm just going to say it finds groups of words that co-occur by magic!

The good news is, there are several excellent open source implementations of the algorithm, and we're going to use one of those today.

We're going to apply it to a public dataset of Amazon product reviews.

# Load data

The cell below opens the file and loads it into a pandas dataframe.

There's a lot going on in this line, all of which is useful to understand if you're a Python programmer, but none of which is necessary to understand if you're only interested in LDA.

If you would like a more detailed explanation of what's going on, please see [the Data notebook](data.ipynb).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import json
import numpy as np
import pandas as pd

with open("reviews.json", 'rb') as f:
    reviews = pd.DataFrame(json.loads(l.decode()) for l in f)

A pandas dataframe is a structured table-like object that, among many other things, supports a bunch of SQL-like operations and handles fiddly data types like dates and times well. I don't know much about R, but I understand R has objects like this too.

`head` allows us to see the first few rows:

In [2]:
reviews.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1223000893 | [0, 0] | 3.0 | I purchased the Trilogy with hoping my two cat... | 01 12, 2011 | A14CK12J7C7JRK | Consumer in NorCal | Nice Distraction for my cats for about 15 minutes | 1294790400 |
| 1 | 1223000893 | [0, 0] | 5.0 | There are usually one or more of my cats watch... | 09 14, 2013 | A39QHP5WLON5HV | Melodee Placial | Entertaining for my cats | 1379116800 |
| 2 | 1223000893 | [0, 0] | 4.0 | I bought the triliogy and have tested out all ... | 12 19, 2012 | A2CR37UY3VR7BN | Michelle Ashbery | Entertaining | 1355875200 |
| 3 | 1223000893 | [2, 2] | 4.0 | My female kitty could care less about these vi... | 05 12, 2011 | A2A4COGL9VW2HY | Michelle P | Happy to have them | 1305158400 |
| 4 | 1223000893 | [6, 7] | 3.0 | If I had gotten just volume two, I would have ... | 03 5, 2012 | A2UBQA85NIGLHA | Tim Isenhour "Timbo" | You really only need vol 2 | 1330905600 |


Individual columns can be accessed as keys:

In [3]:
reviews['asin'].head()

0    1223000893
1    1223000893
2    1223000893
3    1223000893
4    1223000893
Name: asin, dtype: object

In [4]:
print("{} reviews".format(len(reviews)))
print("of {} products ".format(len(reviews.asin.unique())))
print("by {} unique authors ".format(len(reviews.reviewerID.unique())))

150000 reviews
of 8019 products 
by 19834 unique authors 


# Vectorize reviews

You have to preprocess the text a little before you apply LDA: you need to split documents into words, and you need to turn those words into vectors of numbers.

Ironically, in order to get the benefits of LDA, we first need to run bag of words on our data!

The code to do this is built into scikit-learn.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', binary=True, max_features=10000)
X = vectorizer.fit_transform(reviews['reviewText'])

In [6]:
X.shape

(150000, 10000)
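
The three options passed to `CountVectorizer` above each do a simple job: `stop_words='english'` drops very common words, `max_features=10000` keeps only the most frequent words in the corpus, and `binary=True` records whether a word is present rather than how often. The pure-Python sketch below imitates those three ideas on a toy corpus. It is an approximation for intuition only: the helper names and the five-word stop list are made up, and scikit-learn's real implementation differs in its details.

```python
import re
from collections import Counter

# Tiny stand-in for scikit-learn's built-in English stop word list.
STOP_WORDS = {"the", "and", "my", "this", "is"}

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs, max_features):
    # Keep only the `max_features` most frequent non-stop words.
    counts = Counter(t for doc in docs for t in tokenize(doc) if t not in STOP_WORDS)
    return sorted(word for word, _ in counts.most_common(max_features))

def vectorize(doc, vocabulary):
    # Like binary=True: record presence (1) or absence (0), not the count.
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocabulary]

docs = ["My cat loves this toy",
        "The dog loves the toy and the cat",
        "This fish tank is great"]
toy_vocabulary = fit_vocabulary(docs, max_features=3)

print(toy_vocabulary)                                # ['cat', 'loves', 'toy']
print(vectorize("toy toy toy cat", toy_vocabulary))  # [1, 0, 1]
```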

# Learn topics using LDA

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

In [8]:
%%time
lda = LatentDirichletAllocation(n_topics=100, learning_method='batch', n_jobs=-2, random_state=0)
lda.fit(X[:20000])

CPU times: user 2.18 s, sys: 3.91 s, total: 6.09 s
Wall time: 1min 50s


# Inspect topics

The first thing you should do when you fit a topic model is inspect a few of the words that dominate each topic to check that the topics are coherent.

To do this, we need to look at the `components_` attribute now attached to `lda`. This is an array with `n_topics` rows and a number of columns equal to the size of the vocabulary.

In [9]:
print(lda.components_.shape)

(100, 10000)


Each number in this array is the weight of the corresponding word in the corresponding topic.

For each topic to be a probability distribution over words, its weights should add up to one, i.e. each row of `lda.components_` should add up to 1.

In [10]:
lda.components_.sum(axis=1)

array([  3408.92129467,   3102.9078119 ,   6336.39414052,  11663.78298759,
         6490.58482843,   2169.81552908,   8859.48340892,   4534.03122498,
         5448.7790058 ,   5755.0881906 ,   4889.51340858,   5075.79388066,
         8843.02416106,   7355.26193235,   9079.94863263,   5157.55260889,
         5280.97273017,   4702.69994325,   2572.84577759,   4126.57521393,
         4053.29010868,   5431.65181866,   2312.3476769 ,   3141.97971945,
         3274.40201268,   2722.35552177,  10258.55265971,   2627.74462264,
         3631.71145661,   3163.75557003,   5398.16158546,   3139.12988856,
         3338.19691485,   3291.67852031,  12333.21738803,  11198.25555185,
         2398.44318405,   2716.57729782,   7590.03002085,   4553.21892308,
         6128.35697499,   3245.63827339,   2892.78490628,   2806.79505779,
         2739.62518746,   2296.39971362,   2874.21059793,   3977.09070756,
         3425.73783267,   3456.18167458,   3552.11374289,  12284.0741342 ,
         2782.1136586 ,  

Oh dear! The rows don't add up to 1. It turns out scikit-learn stores unnormalized weights (pseudo-counts) in `components_` rather than probabilities, so we have to normalize each row ourselves. Let's fix it here.

In [11]:
lda.components_ /= lda.components_.sum(axis=1)[:, None]
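
The `[:, None]` in the normalization above turns the vector of row sums into a column, so that NumPy broadcasting divides each row by its own sum. Here is a small sketch with made-up numbers (two topics, three words) showing the effect:

```python
import numpy as np

# Made-up unnormalized topic-term weights: 2 topics (rows), 3 words (columns).
weights = np.array([[2.0, 6.0, 2.0],
                    [1.0, 1.0, 2.0]])

row_sums = weights.sum(axis=1)             # shape (2,)
normalized = weights / row_sums[:, None]   # row_sums[:, None] has shape (2, 1)

print(normalized)
print(normalized.sum(axis=1))              # every row now sums to 1
```

Without the `[:, None]`, NumPy would try to line the sums up against the columns instead of the rows, which either fails or silently divides by the wrong numbers.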

This array of topics and words (or terms) is usually called the topic-term matrix, so let's save it under that name:

In [12]:
topic_term = lda.components_

The word corresponding to each column in this array, which we'll call the `vocabulary`, is available as a list from `vectorizer.get_feature_names()`

In [13]:
vocabulary = vectorizer.get_feature_names()
print(vocabulary[:10000:1000]) # print the 0th, 1000th, 2000th, 3000th, etc. item in the vocabulary

['00', 'bits', 'contest', 'emits', 'grid', 'leading', 'opt', 'really', 'slice', 'tightening']


Now let's look at the top 10 words that dominate each topic.

The code below does that by going through each row (i.e. each topic) in the `topic_term` array, finding the biggest numbers in that row, then finding the corresponding words.

In [14]:
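def print_topic(topic_term, topic_id, vocabulary):
    # Indices of the 10 largest weights in this topic's row, biggest first
    top_words = topic_term[topic_id].argsort()[:-11:-1]
    print("{:2d} ".format(topic_id) + " ".join(vocabulary[i] for i in top_words))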

for i, _  in enumerate(topic_term):
    print_topic(topic_term, i, vocabulary)

 0 good product used airline use just tubing like basic 25
 1 okay hamster thank just dwarf hamsters like don use area
 2 water fish tank stress conditioner change product use tap coat
 3 price store amazon pet local cheaper stores buy product buying
 4 test kit water accurate ammonia easy use testing tests ph
 5 packaged bunnies arrived hedgehog heavy easy use bought gerbils duty
 6 fit cage fits easy perfectly nice great room size bought
 7 dried worms use liver freeze easy pills make like product
 8 really like just product think don time use sure did
 9 apart tough toys stuffing rip ripped squeaker tear dog toy
10 like 34 dog don make just use great easy works
11 turtle bags tank use just seachem purigen filter bulb product
12 feeder set food time cat just 12 review cats hours
13 works handle use great grass poop tool job dog good
14 food container bag dog holds open great lid just easy
15 don didn know dogs doesn away bought ingredient won dog
16 directions followed instructions b

In this case, the topics are reasonably coherent, so I'm going to move on.
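
The slice `argsort()[:-11:-1]` used in `print_topic` is compact but a little cryptic. `argsort` returns the indices that would sort the weights in ascending order, and the slice then walks backwards from the end to pick off the last ten. Here is the same idea on a made-up five-word topic, taking the top three instead of the top ten:

```python
import numpy as np

vocabulary = ["cat", "dog", "fish", "tank", "water"]
topic_weights = np.array([0.05, 0.10, 0.40, 0.15, 0.30])  # made-up weights

order = topic_weights.argsort()  # ascending order of weight: [0 1 3 4 2]
top3 = order[:-4:-1]             # last three indices, reversed: [2 4 3]

print([vocabulary[i] for i in top3])  # ['fish', 'water', 'tank']
```

In general, `[:-(n + 1):-1]` gives the indices of the `n` largest values, biggest first.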

If things look messy:
 - `n_topics` might be too large, either for the diversity in the corpus (maybe there really aren't 1000 topics), or for the number of documents you have (you just don't have enough data)
 - `n_topics` might be too low (real topics have to be merged together by the algorithm, which doesn't work well)
 - you've got a bug!
 
Setting `n_topics` very small (say 5) or very high (say 1000) is a good way of building up some intuition for what works (although beware `n_topics=1000` will take a long time to run).

# Inspect a document in the corpus

The topic model is the lens through which we're going to view future documents.

But let's first look at our existing documents through this lens.

To do that we have to transform the documents we trained on to be distributions of topics (e.g. document 1 is 20% topic A, 30% topic B, etc.)

We do that by running the `lda.transform` method on the vectorized documents `X`:

In [15]:
doc_topic = lda.transform(X)

`doc_topic` has a row for each document, and a column for each topic.

In [16]:
doc_topic.shape

(150000, 100)

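Again, each row should add up to 1. In the version of scikit-learn used here they don't (newer versions normalize the output of `transform` for you), so let's normalize them ourselves, which is harmless either way: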

In [17]:
doc_topic /= doc_topic.sum(axis=1)[:, None]

Finally, let's look at the topic distribution of an arbitrary document, number 7:

In [18]:
reviews.loc[7, 'reviewText']

"My toy poodle loves this stuff and will let me &#34;sort&#34; of brush her teeth because of it.  I was hoping it would help with her doggy breath and it does some.  Interestingly... it says &#34;peanutbutter&#34; but it doesn't smell like peanutbutter."

In [19]:
top_topics = (doc_topic[7]).argsort()[:-6:-1]
print(top_topics)

[97 10 91 36 26]


What are these topics?

In [20]:
for i in top_topics:
    print_topic(topic_term, i, vocabulary)

97 teeth dog toothpaste flavor like dogs brush breath loves taste
10 like 34 dog don make just use great easy works
91 keeps warm busy nice heat hot bag stay cold night
36 kinda just chocolate use like cheaply thought got ignored didn
26 food fish eat love feed like feeding pellets tank water


# Visualization

pyLDAvis is a comprehensive package for visualizing the results of a topic model. It's useful for understanding the structure of the model you've just discovered. The topics exist in a huge space. This package squeezes things down to 2D so we can look at it on the screen.

In my experience, it generates a ton of spurious warnings, so let's disable warnings for this package when we import it (a useful trick!)

In [21]:
import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    try:
        import pyLDAvis
    except ImportError:
        print('ERROR: pyLDAvis not installed! Skip to next section!')

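In addition to the `topic_term` and `doc_topic` matrices, pyLDAvis needs to know how often each word occurs in the entire corpus, and how long each document is. Here are calculations that give those. Note that `doc_lengths` here counts characters rather than words, which serves as a rough proxy for document length.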

In [22]:
term_frequency = np.asarray(X.sum(axis=0)).squeeze()
doc_lengths = reviews['reviewText'].str.len()
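
If you would rather measure document length in vocabulary words than in characters, both quantities can be read off a count matrix by summing along its two axes. The sketch below uses a small made-up dense matrix. With the real sparse `X` you would wrap the sums in `np.asarray(...).squeeze()` as in the cell above, and because `X` was built with `binary=True`, its sums count distinct words per document and documents per word rather than raw occurrences.

```python
import numpy as np

# Made-up count matrix: 3 documents (rows) x 4 vocabulary words (columns).
counts = np.array([[1, 0, 2, 0],
                   [0, 1, 1, 0],
                   [3, 0, 0, 1]])

term_frequency_toy = counts.sum(axis=0)  # total occurrences of each word
doc_lengths_toy = counts.sum(axis=1)     # number of vocabulary words per document

print(term_frequency_toy)  # [4 1 3 1]
print(doc_lengths_toy)     # [3 2 4]
```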

In [23]:
lda_vis = pyLDAvis.prepare(topic_term_dists=topic_term,
                           doc_topic_dists=doc_topic,
                           doc_lengths=doc_lengths,
                           vocab=vocabulary,
                           term_frequency=term_frequency)

In [None]:
pyLDAvis.display(lda_vis)

# Put all this together in a Pipeline and persist the model

The process of getting from document to topic distribution is a little fiddly. We need to:
 - Vectorize the document (using the same vocabulary we used when training above)
 - Transform the document using the LDA object
 
scikit-learn allows us to bundle these steps (and more!) together in an object called a `Pipeline`, which we can save to disk, reload, and work with again. Let's build one, train it, and save it.

**WARNING**: this next cell will take a while to execute the first time you run it. After that though, the model will be loaded from disk.

In [44]:
import pickle
from sklearn.pipeline import make_pipeline

try:
    with open('topic_model.pkl', 'rb') as f:
        topic_pipeline = pickle.load(f)
    pipeline_vocabulary = topic_pipeline.steps[0][1].get_feature_names()
except IOError:
    topic_pipeline = make_pipeline(
        CountVectorizer(stop_words='english', binary=True, max_features=10000),
        LatentDirichletAllocation(n_topics=100, learning_method='batch', n_jobs=-2, random_state=0)
    )
    topic_pipeline.fit(reviews.loc[:20000, 'reviewText'])
    with open('topic_model.pkl', 'wb') as f:
        pickle.dump(topic_pipeline, f)
        
pipeline_topic_term = topic_pipeline.steps[1][1].components_
pipeline_vocabulary = topic_pipeline.steps[0][1].get_feature_names()
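
Conceptually, a pipeline just feeds the output of one step into the next, both when fitting and when transforming. The toy classes below are hypothetical stand-ins written for this sketch, not scikit-learn's `Pipeline`, and they skip everything that makes the real one robust, but they show why fitting once and then calling `transform` on new text reuses the vocabulary learned during training.

```python
class Lowercase:
    """Hypothetical first step: a stateless transformer."""
    def fit(self, docs):
        return self

    def transform(self, docs):
        return [d.lower() for d in docs]


class WordCount:
    """Hypothetical second step: learns a vocabulary, then counts words."""
    def fit(self, docs):
        self.vocabulary_ = sorted({w for d in docs for w in d.split()})
        return self

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocabulary_] for d in docs]


class TinyPipeline:
    """Toy illustration of chaining steps; not scikit-learn's class."""
    def __init__(self, *steps):
        self.steps = steps

    def fit(self, docs):
        # Fit each step on the output of the previous one.
        for step in self.steps:
            docs = step.fit(docs).transform(docs)
        return self

    def transform(self, docs):
        for step in self.steps:
            docs = step.transform(docs)
        return docs


pipeline = TinyPipeline(Lowercase(), WordCount()).fit(["Dog toy", "Cat toy"])

# The vocabulary learned at fit time is ['cat', 'dog', 'toy'].
print(pipeline.transform(["dog DOG cat"]))  # [[1, 2, 0]]
```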

# Determine topics of a new document

The single document we looked at above was a pretty short document. Let's make a more interesting, longer document out of all the reviews of that product.

In [45]:
reviews.query('asin == "4847676011"')['reviewText']

5    My Rottie has food allergies to poultry, beef ...
6    My puppy loves this stuff! His tail starts wag...
7    My toy poodle loves this stuff and will let me...
8    Works great and dog doesn't hate the taste.  G...
9    Yes , my Princess is enjoying the taste showin...
Name: reviewText, dtype: object

In [46]:
reviews4847676011 = reviews.query('asin == "4847676011"')['reviewText'].str.cat(sep=' ')

In [47]:
print(reviews4847676011)

My Rottie has food allergies to poultry, beef and dairy. I've had a difficult time finding a toothpaste that doesn't make him allergic and he enjoys the taste. This toothpaste is peanut flavor (smells like black licorice). He loves the taste and doesn't wiggle as much when I brush his teeth every night. The price is ok, but I do wish that the tube came in a larger size. Soooo, if your pup has allergies or doesn't like his/her current toothpaste you might want to try this one. My puppy loves this stuff! His tail starts wagging as soon as I ask him if he's ready to brush his teeth! It is actually an enjoyable daily experience! Definitely my &#34;Go To&#34; dog toothpaste. My toy poodle loves this stuff and will let me &#34;sort&#34; of brush her teeth because of it.  I was hoping it would help with her doggy breath and it does some.  Interestingly... it says &#34;peanutbutter&#34; but it doesn't smell like peanutbutter. Works great and dog doesn't hate the taste.  Gum health is important

In [48]:
doc_topic = topic_pipeline.transform([reviews4847676011])

In [49]:
print(doc_topic.shape)

(1, 100)


In [50]:
top_topics = (doc_topic[0]).argsort()[:-6:-1]
print(top_topics)

[51  2 57 84 66]


In [51]:
for i in top_topics:
    print_topic(pipeline_topic_term, i, pipeline_vocabulary)

51 teeth dog dogs taste like flavor toothpaste brush loves breath
 2 trimmed midwest cool just crates gobble don spending like got
57 treats dog dogs treat extreme pieces small just don kong
84 dog skin coat dogs oil soft food product coats dry
66 fight love round ve make time great residue sure biggest


# What next?

Use the `doc_topic` array for a downstream task, e.g.
 - corpus exploration (remember the [NYT visualization](http://christo.cs.umass.edu/NYT/))
 - document clustering, e.g. use something like `KMeans` (in scikit-learn) to visualize which documents are most similar in terms of their topics, which may surface groups of topics or groups of documents
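
As a first step towards clustering, you can simply ask which document is closest to a given one in topic space. The sketch below uses cosine similarity on a made-up four-document, three-topic matrix. This is a simpler technique swapped in for illustration rather than `KMeans`, and the `most_similar` helper is hypothetical. With the real data you would pass in the `doc_topic` array instead.

```python
import numpy as np

# Made-up document-topic matrix: 4 documents, 3 topics, rows sum to 1.
doc_topic_toy = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
])

def most_similar(doc_topic, doc_id):
    """Return the index of the document whose topic mix is closest to doc_id's."""
    unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
    similarity = unit @ unit[doc_id]  # cosine similarity to every document
    similarity[doc_id] = -np.inf      # exclude the document itself
    return int(similarity.argmax())

print(most_similar(doc_topic_toy, 0))  # 1
print(most_similar(doc_topic_toy, 3))  # 1
```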

## Summarization

Here's a short algorithm, but see [Fast Forward Labs Report 04](http://ff04.fastforwardlabs.com) for details:
  - Train LDA on all products of a certain type (e.g. all the books)
  - Treat all the reviews of a particular product as one document, and infer their topic distribution
  - Infer the topic distribution for each sentence
  - For each topic that dominates the reviews of a product, pick some sentences that are themselves dominated by that topic
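
The last two steps can be sketched in a few lines. Everything below is a simplified assumption rather than the method from the report: the topic distributions are made-up numbers (in practice they would come from something like `topic_pipeline.transform`), `pick_sentences` is a hypothetical helper, and "dominated by" is reduced to "has the largest weight for that topic".

```python
import numpy as np

def pick_sentences(product_topics, sentence_topics, sentences, n_topics=2):
    """For each of the product's top topics, pick the sentence strongest in it."""
    top_topics = np.argsort(product_topics)[::-1][:n_topics]
    summary = []
    for topic in top_topics:
        best = int(np.argmax(sentence_topics[:, topic]))
        if sentences[best] not in summary:  # avoid repeating a sentence
            summary.append(sentences[best])
    return summary

sentences = ["He loves the taste.", "The tube is too small.", "Shipping was fast."]

# Made-up topic distributions: one row per sentence, one column per topic.
sentence_topics = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
])
product_topics = np.array([0.5, 0.4, 0.1])  # all reviews treated as one document

summary = pick_sentences(product_topics, sentence_topics, sentences)
print(summary)  # ['He loves the taste.', 'The tube is too small.']
```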

Be aware of limitations:
 - Choosing `n_topics` is an art rather than a science!
 - The topics don't come with names. Sometimes they overlap. Sometimes they're not what you want them to be. For example, if you run a topic model on the NYT corpus, there's no guarantee you'll get topics that correspond to the sections of the newspaper (business, metro, world, sport, etc.!)

The way you used `fit` and `transform` on the `vectorizer`, `lda`, and `topic_pipeline` objects is generic across scikit-learn, so go and play with it. [Andreas Mueller's presentation](https://www.youtube.com/watch?v=8CzwlZbwDkI) is a good place to start.
 
Remember if you're interested in the LDA algorithm itself, take a look at

 - [Tim Hopper's PyData NYC 2015 talk](https://www.youtube.com/watch?v=_R66X_udxZQ)
 - [David Blei's ACM article](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)