# Detecting Deception in Enron Emails
## 3. Modeling

In this notebook, we try three approaches to clustering and topic modeling. We begin with K-means on simple N-gram features, then reduce dimensionality for K-means using paragraph vectors (Doc2Vec). We also try variants of Latent Dirichlet Allocation (LDA) to identify sub-topics within emails for comparison to our keyword-targeted labels.

### 3.0 Libraries and Preprocessed Data

In [1]:
from __future__ import print_function
from __future__ import division

import os, sys
import collections
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

import email
import nltk
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from scipy import sparse
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
# run 'pip install -U gensim' to install gensim, then restart the Jupyter kernel
import gensim


# Helper libraries
import constants
import utils
import vocabulary



#### Load Preprocessed Dataset

In [2]:
# test import; runtime ~ 5 minutes for full set
emails_df = pd.read_pickle(path + '/cleaned_emails.pkl')
print(emails_df.shape)
emails_df.head()

# ENTER number of rows for training; None if using all
num_rows = 50000

# take the first num_rows emails, or the full set when num_rows is None
mini_df = emails_df if num_rows is None else emails_df.iloc[:num_rows]

Shape: (517401, 2)


| | file | message |
|---|---|---|
| 0 | allen-p/_sent_mail/1. | Message-ID: <18782981.1075855378110.JavaMail.e... |
| 1 | allen-p/_sent_mail/10. | Message-ID: <15464986.1075855378456.JavaMail.e... |
| 2 | allen-p/_sent_mail/100. | Message-ID: <24216240.1075855687451.JavaMail.e... |
| 3 | allen-p/_sent_mail/1000. | Message-ID: <13505866.1075863688222.JavaMail.e... |
| 4 | allen-p/_sent_mail/1001. | Message-ID: <30922949.1075863688243.JavaMail.e... |


### 3.1 Baseline: K-Means Clustering with Simple N-gram Features

The baseline uses only the first 50,000 emails (`mini_df`) to keep the dimensionality of the unigram and bigram feature matrix manageable.

In [14]:
# consider longer n-grams
transformer = TfidfTransformer()
vectorize = CountVectorizer(ngram_range=(1, 2), max_df=.01, min_df=5)
n_grams = vectorize.fit_transform(mini_df["email_str"])
print("N-grams shape: ", n_grams.shape)

n_grams_idf = transformer.fit_transform(n_grams)
print("N-grams TF-IDF:", n_grams_idf.shape)

N-grams shape:  (50000, 168196)
N-grams TF-IDF: (50000, 168196)
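The count-then-reweight step above can be illustrated on a small corpus. The sketch below uses a hypothetical four-document corpus (`toy_docs`, not part of the Enron data) with the frequency thresholds relaxed so that no term is filtered out. It also checks that the two-step pipeline should match scikit-learn's one-step `TfidfVectorizer` under default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus standing in for mini_df["email_str"]
toy_docs = [
    "please review the loan agreement",
    "the loan agreement looks fine",
    "lunch meeting moved to friday",
    "gas forecast for friday attached",
]

# Two-step pipeline as in the cell above: raw counts, then TF-IDF weights
# (min_df=1 and no max_df cap, since the corpus has only four documents)
counts = CountVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent with the same tokenization and weighting defaults
one_step = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(toy_docs)

print("TF-IDF shape:", two_step.shape)
print("Pipelines agree:", np.allclose(two_step.toarray(), one_step.toarray()))

# Each row is L2-normalized by default
row_norms = np.sqrt(np.asarray(two_step.multiply(two_step).sum(axis=1)).ravel())
print("Row norms:", row_norms)
```

On the full dataset, `max_df=.01` drops any n-gram that appears in more than 1% of emails and `min_df=5` drops any that appear in fewer than five, which is what keeps the vocabulary at the size reported above.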


In [27]:
# fit k-means with 5 clusters (a run with n_clusters=4, tol=.01, max_iter=100 took ~20 mins to train)
kmeans = KMeans(n_clusters=5, tol=.01, max_iter=100)
kmeans.fit(n_grams_idf)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.01, verbose=0)

In [28]:
# evaluate predictions on positively labeled examples
base_preds = kmeans.predict(n_grams_idf)
print("First 30 clusters:", base_preds[:30])

# get row positions of positive labels (the label values themselves are all 1,
# so they cannot be used to index the predictions)
positive_ids = np.flatnonzero(mini_df["suspicious_ind"].values == 1)

print("Positive labels:  ", base_preds[positive_ids])

First 30 clusters: [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Positive labels:   [2 2 2 2 2]
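Both the first 30 emails and all five positive examples fall in cluster 2, which may indicate that a single cluster absorbs most of the data. A minimal sketch for checking cluster balance follows. It uses a hypothetical assignment array (`toy_preds`) in place of `base_preds`; the same two lines with `np.bincount` should apply to the real predictions.

```python
import numpy as np

# Hypothetical cluster assignments standing in for base_preds
toy_preds = np.array([2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 3])
n_clusters = 5

# Count members per cluster; minlength keeps empty clusters in the result
sizes = np.bincount(toy_preds, minlength=n_clusters)
shares = sizes / float(sizes.sum())

for cluster_id, (size, share) in enumerate(zip(sizes, shares)):
    print("cluster %d: %d emails (%.1f%%)" % (cluster_id, size, 100 * share))

largest = int(np.argmax(sizes))
print("largest cluster:", largest)
```

If one cluster holds the large majority of emails, agreement between the positive labels and that cluster carries little information about whether the clustering separates suspicious emails.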


In [29]:
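# evaluate closest examples in cluster 2 based on cosine similarity
# NOTE: label_dict (row id -> suspicious label) and feature_vects (sparse
# feature matrix) are not defined in the cells shown in this section
suspicious_ids = np.zeros(len(base_preds))
for k, v in label_dict.items():
    suspicious_ids[k] = v

# enter index of typical cluster
key_cluster = 2

cluster_ids = (base_preds == key_cluster).astype(int)

# keep rows that are both labeled suspicious and assigned to the key cluster;
# indexing the sparse matrix directly avoids densifying every row
mask = np.multiply(suspicious_ids, cluster_ids).astype(bool)
labeled_feats = feature_vects[mask]
print(labeled_feats.shape)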
print(labeled_feats.shape)

(5, 75864)


In [43]:
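# cosine similarity of each labeled email against all emails
print(label_dict.keys())
cos_sims = cosine_similarity(labeled_feats, feature_vects)
print(cos_sims.shape)
# argsort is ascending, so the last 10 columns are the 10 most similar emails
closest = np.argsort(cos_sims, axis=1)
print(closest[:, -10:])
print(cos_sims[1, 985])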

dict_keys([346, 27091, 373, 13966, 405])
(5, 50000)
[[21177 31738 39744 13961 11650 15532  7014   966   346  2759]
 [  985  2779  1893  2769   356   976  1879   373  2787   995]
 [  357   358  1892  2771   978   977  2820  1855   405  1027]
 [17248 10977 14371 11429 16116 13982 11606 13966  9998 17433]
 [20382 19225 29563 21933 27092 30259 24055 31124 27091 20569]]
0.164362146998
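Each row of nearest indices above includes the labeled email itself (346, 373, 405, 13966 and 27091), since every document has cosine similarity 1 with itself. The sketch below shows one way to exclude the query from its own neighbor list and return neighbors in descending order of similarity. The feature matrix, `query_ids`, and `top_k` are hypothetical toy values, not the notebook's data.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature matrix: 6 documents, 4 features
toy_feats = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2, 0.0],
]))
query_ids = np.array([0, 2])  # hypothetical labeled rows
top_k = 2

# Similarity of each query row against every document (dense result)
sims = cosine_similarity(toy_feats[query_ids], toy_feats)

# Mask each query's similarity to itself so it is not returned as a neighbor
sims[np.arange(len(query_ids)), query_ids] = -np.inf

# Sort descending and keep the top_k document indices per query
neighbors = np.argsort(-sims, axis=1)[:, :top_k]
print(neighbors)
```

Exact duplicates of a labeled email would still appear with similarity 1, so near-duplicate filtering remains a separate step.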


In [37]:
# nearest emails flagged:
emails_df.loc[985, "email_str"]

"<s> jacques , the agreement looks fine . </s> <s> my only comment is that george and larry might object to the language that `` the bank that was requested to finance the construction of the project declined to make the loan based on the high costs of the construction of the project '' . </s> <s> <unk> , that bank lowered the loan amount based on lower estimates of <unk> which <unk> the amount of equity that would be required . </s> <s> did i loan them $ DGDGDGDGDGDGDG ? </s> <s> i thought it was less . </s> <s> regarding exhibit a , the assets include : the land , <unk> plans , engineering completed , appraisal , and <unk> study . </s> <s> most of these items are in a state of partial completion by the consultants . </s> <s> i have been speaking directly to the architect , engineer , and <unk> engineer . </s> <s> i am unclear on what is the best way to proceed with these consultants . </s> <s> the obligations should include the fees owed to the consultants above . </s> <s> do we need

### 3.2 K-Means Clustering using Paragraph Vectors (doc2vec)

https://radimrehurek.com/gensim/models/doc2vec.html

In [9]:
# create tagged documents for model training
tagger = gensim.models.doc2vec.TaggedDocument
tagged_docs = []
# the loop variable is named tokens to avoid shadowing the imported email module
for i, tokens in enumerate(emails_df["email_list"]):
    tagged_docs.append(tagger(tokens, [i]))

print(tagged_docs[0])

TaggedDocument(['here', 'is', 'our', 'forecast', '<s>'], [0])


In [10]:
doc_model = gensim.models.doc2vec.Doc2Vec(
    documents=tagged_docs, vector_size=1000, window=5, alpha=.01, min_count=5)



In [11]:
print(doc_model.docvecs.most_similar(0))

[(40709, 0.6937914490699768), (40678, 0.6881959438323975), (40849, 0.6868060231208801), (44116, 0.684868574142456), (7543, 0.6820092797279358), (44276, 0.680620014667511), (42515, 0.6795379519462585), (42274, 0.6784195899963379), (40589, 0.6751448512077332), (42558, 0.673206090927124)]


Doc2Vec help:

https://stackoverflow.com/questions/41709318/what-is-gensims-docvecs

In [None]:
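# matrix of learned document vectors, one row per tagged document
# (in gensim 4.x the same matrix is exposed as doc_model.dv.vectors)
doc_feats = doc_model.docvecs.doctag_syn0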
print(doc_feats.shape)

In [None]:
kmeans_doc = KMeans(n_clusters=5, tol=.01, max_iter=100)
kmeans_doc.fit(...)
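As an illustration of K-means on dense document vectors, the sketch below clusters synthetic vectors rather than the Doc2Vec output. The L2 normalization step is an assumption that does not appear in the notebook: for unit-length rows, squared Euclidean distance equals 2 minus twice the cosine similarity, so K-means then groups vectors by the same measure that `most_similar` reports. All names here (`toy_vecs`, `unit_vecs`, `toy_kmeans`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)

# Hypothetical stand-in for document vectors: two well-separated groups, 8 dims
group_a = rng.normal(loc=1.0, scale=0.1, size=(20, 8))
group_b = rng.normal(loc=-1.0, scale=0.1, size=(20, 8))
toy_vecs = np.vstack([group_a, group_b])

# L2-normalize each row so Euclidean distance tracks cosine similarity
unit_vecs = normalize(toy_vecs)

# A fixed random_state makes the assignments reproducible across runs
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(unit_vecs)

print("cluster sizes:", np.bincount(toy_labels))
```

Whether normalization helps on the real paragraph vectors is an empirical question; comparing cluster sizes with and without it is one quick check.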