In [5]:
import csv
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import DBSCAN, KMeans

In [2]:
def clusters_to_csv(labels, types, coords):
    '''
    Helper function to turn scikit-learn clusters into abbreviated CSVs
    '''
    for k in set(labels):
        class_members = [index[0] for index in np.argwhere(labels == k)]
        for index in class_members:
            # Emit one CSV row per point: cluster label, type, "x,y" coordinates
            print('%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index])))

## Clustering documents

As we've discussed, the same principles that apply to clustering crime in two dimensions also apply to clustering documents in much higher-dimensional spaces. We'll demonstrate this using a selection of Jeb Bush's e-mails from his time serving as Florida's governor.

But first, we'll need to talk about what makes two documents "similar," which can be defined in a number of ways.

In [3]:
sample_docs = [
    'The quick brown fox jumped over the lazy dog',
    'The dog jumped over squirrel',
    'Four score and seven years ago'
]

We'll use those sample docs to start. Intuitively, you should be able to see that documents 0 and 1 have some similar elements ("dog," "jumped over," etc.) but document 2 is pretty different from the other two. Let's quantify that using two different distance measures: Euclidean and cosine.

In [11]:
# First we'll vectorize our documents, as we did last week
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(sample_docs).toarray()
print(features)

[[0 0 0 1 1 0 1 1 1 1 1 0 0 0 2 0]
 [0 0 1 0 1 0 0 0 0 0 0 0 0 1 2 0]
 [1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 1]]


## Euclidean distance

We'll start by looking at Euclidean distance.

Euclidean distance is probably what comes to mind when you think of distance: the length of the straight line between two points. On a plane, that's the line you would draw with a ruler. The same idea extends to any number of dimensions: take the difference along each dimension, square it, add the squares up and take the square root.

<img src="http://i.stack.imgur.com/w7kJP.png">

In [10]:
# We'll use a helpful scikit-learn function to calculate their pairwise distances, starting with Euclidean
euclidean_distances = pairwise_distances(features, metric='euclidean')
print(euclidean_distances)

[[ 0.          2.82842712  4.12310563]
 [ 2.82842712  0.          3.60555128]
 [ 4.12310563  3.60555128  0.        ]]


According to our Euclidean distance measure, documents 0 and 1 are 2.8 units apart, documents 0 and 2 are 4.1 units apart and documents 1 and 2 are 3.6 units apart. So this does capture the distances we're looking for: the first two documents are closer to each other than either is to the third. But in practice, there's another measure that's more often used for comparing documents, known as cosine similarity.
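To see where the 2.8 figure comes from, the distance can be worked out by hand. The sketch below is not part of the original notebook: it hardcodes rows 0 and 1 of the feature matrix printed above and uses a hypothetical helper, `euclidean`, written in plain Python rather than scikit-learn.

```python
import math

# Rows 0 and 1, copied from the feature matrix printed earlier
doc0 = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0]
doc1 = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

def euclidean(a, b):
    # Hypothetical helper: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The two rows differ by 1 in eight columns, so this is sqrt(8)
print(round(euclidean(doc0, doc1), 8))
```

The result should agree with the 2.82842712 entry that `pairwise_distances` reported for documents 0 and 1.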

## Cosine similarity

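Unlike Euclidean distance, which measures the absolute distance between two points, cosine similarity treats each document as a vector from the origin and measures the angle between the two vectors. Documents that use the same words in the same proportions point in the same direction, no matter how long each document is: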

<img src="https://engineering.aweber.com/wp-content/uploads/2013/02/4AUbj.png">

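scikit-learn calculates it in much the same way as Euclidean distance. Note that `metric='cosine'` returns cosine *distance*, which is 1 minus the cosine similarity, so 0 means two documents point in exactly the same direction and 1 means they share no words at all: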

In [12]:
cosine_distances = pairwise_distances(features, metric='cosine')
print(cosine_distances)

[[  0.00000000e+00   4.30197118e-01   1.00000000e+00]
 [  4.30197118e-01   3.33066907e-16   1.00000000e+00]
 [  1.00000000e+00   1.00000000e+00  -2.22044605e-16]]


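Here documents 0 and 1 are about 0.43 apart, while document 2 sits at a distance of 1 from both because it shares no words with either. The tiny values on the diagonal, such as 3.33e-16, are floating-point rounding error and are effectively zero.

In practice, either one of these metrics can work for document similarity tasks. For now it's mostly important to know that there's more than one definition of similarity. Usually I start with cosine distance and test other metrics to see which works best for the task at hand.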
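The 0.43 entry in the cosine output can also be checked by hand, and doing so shows one reason cosine distance is popular for text. The sketch below is an illustration rather than the notebook's own code: it reuses rows 0 and 1 of the printed feature matrix, and `cosine_distance`, `euclidean` and `doc1_doubled` are hypothetical names introduced here.

```python
import math

# Rows 0 and 1, copied from the feature matrix printed earlier
doc0 = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0]
doc1 = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

def cosine_distance(a, b):
    # Hypothetical helper: 1 minus the cosine of the angle between a and b
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean(a, b):
    # Hypothetical helper: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# dot product is 5, the vector lengths are sqrt(11) and sqrt(7)
print(round(cosine_distance(doc0, doc1), 6))

# Pretend document 1 was pasted twice: every word count doubles
doc1_doubled = [2 * x for x in doc1]
print(cosine_distance(doc1, doc1_doubled))  # effectively zero
print(euclidean(doc1, doc1_doubled))        # clearly non-zero
```

Doubling every count leaves the direction of the vector unchanged, so the cosine distance stays at essentially zero, while the Euclidean distance grows with the length of the document.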

## Clustering e-mails

After that little digression into distance metrics, we can now move on to clustering real documents -- in this case, subject lines from a selection of Jeb Bush's e-mails. Conveniently, we can use basically the same code as we used for the crime clustering example to accomplish this task.

In [16]:
data = open('data/jeb_subjects.csv').read().split('\n')
print(data[:100])

Budget Power Pt.
Re: Personal e-mail address
RE: I'M BEGINNING TO WONDER IF I'VE BEEN DECEIVED....
G


