In [5]:
%matplotlib inline
# Plot everything as SVG
%config InlineBackend.figure_formats=['svg']

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure some styling
palette = ['#386DF9', '#FFDC52', '#FF1614', '#62F591', '#AA22FF', '#34495E']
sns.set(font_scale=1.1, style='darkgrid', palette=palette, context='notebook')

from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
emails = pd.read_csv('../data/emails.csv')
emails.head()

Unnamed: 0,author,email
0,Sebastian Jaiden,fire threat news derail image derail cover up ...
1,Francine Loyd,people people support presentation presentatio...
2,Harrison Chip,threat fire scandal undermine undermine danger...
3,Sebastian Jaiden,press release support public present journalis...
4,Harrison Chip,danger image cover up whistleblower whistleblo...


We'll pull out just the email text for now and vectorize it using `scikit-learn`'s built-in TF-IDF vectorizer, which turns each email into a vector of per-term weights.

In [7]:
email_texts = emails['email'].values
vectr = TfidfVectorizer()
email_vecs = vectr.fit_transform(email_texts)
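
To get a feel for what the vectorizer computes, here is a rough pure-Python sketch of TF-IDF weighting on three made-up documents. It assumes the smoothed IDF and L2 row normalisation that `TfidfVectorizer` uses by default, and a plain whitespace tokenizer instead of the library's own tokenizer. The `tfidf` helper and `toy_docs` are hypothetical names used only for illustration, not part of `scikit-learn`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed default weighting)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        # Scale each row to unit length; guard against an empty document
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical toy documents, not taken from emails.csv
toy_docs = [
    'fire threat news',
    'threat fire scandal',
    'voters campaign polls',
]
vocab, rows = tfidf(toy_docs)

# Show the non-zero weights of the first document
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f'{term}: {weight:.3f}')
```

In the first toy document, `news` gets a higher weight than `fire` or `threat` because it appears in only one document, which is the property that makes TF-IDF vectors useful for telling topics apart.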

Now that we have our emails represented as vectors, we can run them through a clustering algorithm.

There are many clustering algorithms available. A popular one is _k-means_, which is available in `scikit-learn`.

In [8]:
from sklearn.cluster import KMeans

# The k-means algorithm requires that we specify how many clusters we expect
model = KMeans(n_clusters=5)

# fit_predict returns one label (i.e. a cluster id) for each email
labels = model.fit_predict(email_vecs)
labels[:10]

array([2, 3, 2, 3, 2, 2, 3, 1, 2, 1], dtype=int32)
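
For intuition about what k-means does with those vectors, here is a minimal sketch of the basic (Lloyd's) iteration on a handful of 2D points. This is not how `scikit-learn` implements `KMeans`; in particular, the library picks its starting centers with k-means++ by default, while this sketch takes hand-picked starting centers. The `kmeans` function and the toy points are hypothetical.

```python
import math

def kmeans(points, centers, n_iter=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers

# Hypothetical toy data: two well-separated groups of points
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
toy_labels, toy_centers = kmeans(toy_points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)
```

With these starting centers the first three points land in cluster 0 and the last three in cluster 1. Different starting centers can swap or change the cluster ids, which is why the labels above may differ from run to run unless `KMeans` is given a fixed `random_state`.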

Let's use the labels to group the original emails and see the authors of each email group.

In [9]:
clusters = [{
    'authors': set(), # to avoid duplicates
    'emails': []
} for _ in range(model.n_clusters)]  # one bucket per cluster

for i, label in enumerate(labels):
    email = emails.iloc[i]
    clusters[label]['authors'].add(email['author'])
    clusters[label]['emails'].append(email['email'])

# Print the first 5 emails from each cluster to get a sense of what's in them
for i, cluster in enumerate(clusters):
    print('Cluster', i)
    print(cluster['emails'][:5])
    print('---')

Cluster 0
['voting campaigning voters behind candidacy finance voters politician fund candidacy campaign polls citizens behind election politics ahead behind points behind pac campaigning fund politician politician politician voters', 'politics vote election behind finance election vote fund candidacy appeal politics politics pac voters voter ahead pac voting finance appeal politics behind election finance finance campaign points behind polls', 'points ahead citizens appeal behind ahead appeal citizens behind voter fund fund citizens election campaign voters campaigning appeal points voting ahead points politics', 'politics pac points fund politician politician citizens fund points campaigning voter finance polls points politician fund politics polls appeal vote campaign voters voter finance citizens voter vote candidacy', 'politics pac candidacy appeal ahead election candidacy candidacy campaign voter polls citizens ahead voting candidacy finance election pac candidacy points politics

It looks like cluster 3 is all about the crime coefficients. Now let's look at the authors of those emails.

In [10]:
# Authors were collected in a set, so each name appears only once
clusters[3]['authors']

{'Dustin Randell',
 'Francine Loyd',
 'Harrison Chip',
 'Ian Upton',
 'Izabelle Rene',
 'Reid Kiley',
 'Sabine Finnian',
 'Sebastian Jaiden',
 'Sheba Mate',
 'Torkel Whitney'}
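
A set only tells us who wrote at least one email in a cluster, not how many. One possible follow-up is to count emails per author within each cluster. The sketch below uses made-up labels and author names as stand-ins for `labels` and `emails['author']`; the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for `labels` and the `author` column
toy_labels = [2, 3, 2, 3, 2, 1]
toy_authors = ['Ann', 'Bo', 'Cy', 'Bo', 'Ann', 'Cy']

# One Counter per cluster id, mapping author -> number of emails
author_counts = defaultdict(Counter)
for label, author in zip(toy_labels, toy_authors):
    author_counts[label][author] += 1

# List authors in each cluster, most frequent first
for label in sorted(author_counts):
    print('Cluster', label, author_counts[label].most_common())
```

Sorting by count makes it easier to see whether a cluster is dominated by a few authors or spread evenly across many.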