# Detecting Deception in Enron Emails
## 3. Modeling

In this notebook, we try three approaches to clustering and topic modeling. We begin with K-means on simple N-gram features, then reduce dimensionality for K-means using paragraph vectors (Doc2Vec). Finally, we try variants of Latent Dirichlet Allocation (LDA) to identify sub-topics within emails and compare them to our keyword-targeted labels.

### 3.0 Libraries and Preprocessed Data

In [1]:
from __future__ import print_function
from __future__ import division

import os, sys
import collections
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

import email
import nltk
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from scipy import sparse
from scipy.sparse import hstack
from sklearn.feature_extraction.text import *
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
# gensim is needed for Doc2Vec in section 3.2; run 'pip install -U gensim' to install it and then re-launch jupyter notebook
import gensim


# Helper libraries
import constants
import utils
import vocabulary

#### Load Preprocessed Dataset

In [2]:
# test import; runtime ~ 5 minutes for full set
path = '/home/cmiller11/NLP_Enron'
emails_df = pd.read_pickle(path + '/cleaned_emails.pkl')
print(emails_df.shape)
emails_df.head()

# ENTER number of rows for training; None if using all
num_rows = 50000

# take the first num_rows emails by position, or the full set if num_rows is None
mini_df = emails_df if num_rows is None else emails_df.iloc[:num_rows]

(517401, 3)


## 3.1 Baseline: K-Means Clustering with Simple N-gram Features

To keep the feature space manageable, the baseline unigram and bigram model uses only the first 50,000 emails (`mini_df`).

In [14]:
# consider longer n-grams
transformer = TfidfTransformer()
vectorize = CountVectorizer(ngram_range=(1, 2), max_df=.01, min_df=5)
n_grams = vectorize.fit_transform(mini_df["email_str"])
print("N-grams shape: ", n_grams.shape)

n_grams_idf = transformer.fit_transform(n_grams)
print("N-grams TF-IDF:", n_grams_idf.shape)

N-grams shape:  (50000, 168196)
N-grams TF-IDF: (50000, 168196)
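As a sanity check on the two-step featurization above, the sketch below runs the same `CountVectorizer` then `TfidfTransformer` sequence on a tiny corpus and compares it with the one-step `TfidfVectorizer`. The corpus is made up for illustration and stands in for `mini_df["email_str"]`; the `max_df` and `min_df` filters are omitted because they would remove everything from a four-document sample.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus standing in for mini_df["email_str"]
toy_docs = [
    "please review the attached agreement",
    "the agreement looks fine to me",
    "gas prices rose in california",
    "california power prices rose again",
]

# Two-step route, as in the cell above: raw n-gram counts, then TF-IDF weighting
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same settings
one_step = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(toy_docs)

print(counts.shape, two_step.shape, one_step.shape)
same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical TF-IDF matrices:", same)
```

The TF-IDF step only reweights the count matrix, which is why the two shapes printed in the notebook output match.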


In [27]:
# fit k-means with 5 clusters (tol=.01, max_iter=100; training takes roughly 20 mins)
kmeans = KMeans(n_clusters=5, tol=.01, max_iter=100)
kmeans.fit(n_grams_idf)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.01, verbose=0)

In [28]:
# evaluate predictions on positively labeled examples
base_preds = kmeans.predict(n_grams_idf)
print("First 30 clusters:", base_preds[:30])

# get row ids of positively labeled emails within the training partition
positive_ids = mini_df.index[mini_df["suspicious_ind"] == 1].values

print("Positive labels:  ", base_preds[positive_ids])

First 30 clusters: [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Positive labels:   [2 2 2 2 2]
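Because the first 30 emails and all five positive examples land in the same cluster, it is worth checking how evenly the clusters are populated. The sketch below counts assignments per cluster with `np.bincount`. The helper `cluster_sizes` and the array `toy_preds` are hypothetical and are not the notebook's real predictions.

```python
import numpy as np

def cluster_sizes(preds, n_clusters):
    """Return per-cluster counts and fractions for an array of cluster ids."""
    counts = np.bincount(np.asarray(preds), minlength=n_clusters)
    return counts, counts / counts.sum()

# Hypothetical assignments standing in for base_preds (not the notebook's real output)
toy_preds = np.array([2, 2, 2, 2, 0, 2, 1, 2, 2, 4])
counts, fractions = cluster_sizes(toy_preds, n_clusters=5)
for cluster_id, (n, frac) in enumerate(zip(counts, fractions)):
    print("cluster {}: {} emails ({:.0%})".format(cluster_id, n, frac))
```

Running the same helper on `base_preds` would show whether cluster 2 simply absorbs most of the partition, in which case the positive labels falling there says little about the clustering.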


In [29]:
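# evaluate closest examples in cluster 2 based on cosine similarity
# NOTE: label_dict (email row id -> suspicious label) and feature_vects (sparse
# feature matrix, one row per email) are not defined in the cells shown here;
# they are assumed to carry over from an earlier session.
suspicious_ids = np.zeros(len(base_preds))
for k, label in label_dict.items():
    suspicious_ids[k] = label

# enter index of typical cluster
key_cluster = 2

cluster_ids = (base_preds == key_cluster)

# select labeled rows in the key cluster directly from the sparse matrix,
# which avoids converting the full feature matrix to a dense array
labeled_mask = suspicious_ids.astype(bool) & cluster_ids
labeled_feats = feature_vects[labeled_mask]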
print(labeled_feats.shape)

(5, 75864)


In [43]:
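# cosine similarity between each labeled email and every email in the partition
print(label_dict.keys())
cos_sims = cosine_similarity(labeled_feats, feature_vects)
print(cos_sims.shape)
# argsort is ascending, so the last 10 columns hold the 10 most similar emails
closest = np.argsort(cos_sims, axis = 1)
print(closest[:,-10:])
print(cos_sims[1,985])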

dict_keys([346, 27091, 373, 13966, 405])
(5, 50000)
[[21177 31738 39744 13961 11650 15532  7014   966   346  2759]
 [  985  2779  1893  2769   356   976  1879   373  2787   995]
 [  357   358  1892  2771   978   977  2820  1855   405  1027]
 [17248 10977 14371 11429 16116 13982 11606 13966  9998 17433]
 [20382 19225 29563 21933 27092 30259 24055 31124 27091 20569]]
0.164362146998
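The nearest-neighbour lookup above can be illustrated on a small dense example. This is a minimal sketch that computes cosine similarity by hand with NumPy rather than with `cosine_similarity`; the function `top_k_cosine` and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def top_k_cosine(query_vecs, corpus_vecs, k):
    """Indices and similarities of the k most similar corpus rows per query, best first."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T
    # negate so that argsort (ascending) returns the largest similarities first
    order = np.argsort(-sims, axis=1)[:, :k]
    return order, np.take_along_axis(sims, order, axis=1)

# Hypothetical feature rows; row 0 plays the role of a labeled email
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.8, 0.6],
])
toy_query = toy_corpus[[0]]
idx, sims = top_k_cosine(toy_query, toy_corpus, k=2)
print(idx, sims)
```

A labeled email has similarity 1 with itself, so it shows up among its own neighbours, as the ids 346, 373, 405, 13966, and 27091 do in the output above. Dropping the query id from each row would leave only genuinely different emails.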


In [37]:
# nearest emails flagged:
emails_df.loc[985, "email_str"]

"<s> jacques , the agreement looks fine . </s> <s> my only comment is that george and larry might object to the language that `` the bank that was requested to finance the construction of the project declined to make the loan based on the high costs of the construction of the project '' . </s> <s> <unk> , that bank lowered the loan amount based on lower estimates of <unk> which <unk> the amount of equity that would be required . </s> <s> did i loan them $ DGDGDGDGDGDGDG ? </s> <s> i thought it was less . </s> <s> regarding exhibit a , the assets include : the land , <unk> plans , engineering completed , appraisal , and <unk> study . </s> <s> most of these items are in a state of partial completion by the consultants . </s> <s> i have been speaking directly to the architect , engineer , and <unk> engineer . </s> <s> i am unclear on what is the best way to proceed with these consultants . </s> <s> the obligations should include the fees owed to the consultants above . </s> <s> do we need

## 3.2 K-Means Clustering using Paragraph Vectors (doc2vec)

https://radimrehurek.com/gensim/models/doc2vec.html

In [9]:
# create tagged documents for model training
tagger = gensim.models.doc2vec.TaggedDocument
# the loop variable is named email_tokens to avoid shadowing the imported `email` module
tagged_docs = [tagger(email_tokens, [i])
               for i, email_tokens in enumerate(emails_df["email_list"])]

print(tagged_docs[0])

TaggedDocument(['here', 'is', 'our', 'forecast', '<s>'], [0])


In [10]:
doc_model = gensim.models.doc2vec.Doc2Vec(documents=tagged_docs, vector_size = 1000, window = 5, alpha = .01, min_count = 5)



In [11]:
print(doc_model.docvecs.most_similar(0))

[(40709, 0.6937914490699768), (40678, 0.6881959438323975), (40849, 0.6868060231208801), (44116, 0.684868574142456), (7543, 0.6820092797279358), (44276, 0.680620014667511), (42515, 0.6795379519462585), (42274, 0.6784195899963379), (40589, 0.6751448512077332), (42558, 0.673206090927124)]


Doc2Vec help:

https://stackoverflow.com/questions/41709318/what-is-gensims-docvecs

In [None]:
doc_feats = doc_model.docvecs.doctag_syn0
print(doc_feats.shape)

In [None]:
kmeans_doc = KMeans(n_clusters=5, tol=.01, max_iter=100)
kmeans_doc.fit(...)
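The cells above are left unexecuted in the notebook. As a hedged illustration of the intended step, the sketch below clusters a small synthetic matrix of dense vectors with `KMeans`. The array `toy_feats` is an assumption standing in for `doc_feats`, and the settings are chosen for the toy data rather than taken from any actual run.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic stand-in for doc_feats: two well-separated groups of 20-dimensional vectors
toy_feats = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 20)),
    rng.normal(loc=3.0, scale=0.1, size=(50, 20)),
])

# random_state is fixed so that repeated runs give the same assignment
toy_kmeans = KMeans(n_clusters=2, n_init=10, tol=.01, max_iter=100, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_feats)
print(toy_feats.shape, np.bincount(toy_labels))
```

Unlike the sparse TF-IDF matrix in section 3.1, paragraph vectors are dense and low-dimensional, so K-means on them should train much faster.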

## 3.3 LDA Topic Assignment

In [3]:
#Implement LDA for topics of each document
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

#hyperparameters
random.seed(24)
#n_samples = 2000
n_features = 3000
n_components = 10
n_top_words = 20
rand_num = 10000

rand_ids = random.sample(range(emails_df.shape[0]), rand_num)
email_samples = emails_df.loc[rand_ids,]
#training_emails = all_emails.loc[all_ids,]

train_content = list(email_samples['email_str'])
#dev_content = list(email_samples['content_str'])

#tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
#                                   max_features=n_features,
#                                   stop_words='english')

#tfidf = tfidf_vectorizer.fit_transform(all_content)

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.4, min_df=10,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range = (1,1))
#t0 = time()
#tf = tf_vectorizer.fit_transform(data_samples)
#print("done in %0.3fs." % (time() - t0))
#print()

tf_train = tf_vectorizer.fit_transform(train_content)
tf_vocab = tf_vectorizer.vocabulary_
tf_feature_names = tf_vectorizer.get_feature_names()
print(tf_train.shape)

Extracting tf features for LDA...
(10000, 3000)


In [169]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

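#hyperparameters
# upper bound on passes over the data; because evaluate_every is set below, fitting
# can stop early once the change in perplexity falls below the model's perp_tol
lda_iterations = 1000
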
print("Fitting LDA models with tf features, "
      "n_features=%d..."
      % (n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=lda_iterations,
                                learning_method='online',
                                learning_offset=50., n_jobs = -1,
                                random_state=0, evaluate_every = 50, verbose = False)
t0 = time()
lda.fit(tf_train)
print("done in %0.3fs." % (time() - t0))
print("with {} iterations".format(lda.n_iter_))

print("\nTopics in LDA model:")
print_top_words(lda, tf_feature_names, n_top_words)


Fitting LDA models with tf features, n_features=3000...
done in 1418.911s.
with 349 iterations

Topics in LDA model:
Topic #0: thanks know meeting need let time cc like kay attached issues questions group information work pm forward review et discuss
Topic #1: image font td 09 br size width com tr dgdgdg href height face border table arial fantasy img align sportsline
Topic #2: power energy california said state market dgdgdg price prices iso electricity ferc new utilities final customers commission gas electric rate
Topic #3: com dgdgdg mail sent message original pm cc fax net dgdgdgdgdg monday aol mailto john fw chris thanks october sara
Topic #4: com message error vince intended mail recipient kaminski database information confidential use email dgdgdg doc corp delete distribution occurred sender
Topic #5: 20 company business gas new market million services energy 01 management president companies group year dgdgdg markets risk houston natural
Topic #6: dgdgdg com pm information cli
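The reversed slice in `print_top_words` is easy to misread, so here is a small worked example on made-up weights. The names `toy_feature_names` and `toy_topic` are hypothetical. The second slice is shown only for comparison, to make clear that omitting the `- 1` returns one word fewer.

```python
import numpy as np

# Hypothetical topic-word weights for one topic over a six-word vocabulary
toy_feature_names = ["gas", "power", "meeting", "price", "thanks", "ferc"]
toy_topic = np.array([0.30, 0.25, 0.01, 0.20, 0.02, 0.22])
n_top = 3

# argsort is ascending; the reversed slice [:-n_top - 1:-1] walks back from the
# largest weight and stops after n_top entries
top_idx = toy_topic.argsort()[:-n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_idx]
print(top_words)

# For comparison, [:-n_top:-1] stops one entry earlier
shorter = [toy_feature_names[i] for i in toy_topic.argsort()[:-n_top:-1]]
print(shorter)
```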

In [6]:
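# NOTE: `import lda` is omitted because it would rebind the name `lda`,
# which refers to the fitted scikit-learn model from the previous cell
import guidedlda
import math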

t0 = time()
random.seed(24)
pct_keyword_docs = 0.35
sample_keyword_doc_idx = set()
all_keyword_doc_idx = set()
phrases = ["raptor", "ljm", "fraud", "manipulation", "condor", "trutta", "swap",
           "differential", "ferc", "merlin", "whitewing", "jedi", "vehicle"]
for phrase in phrases:
    query = emails_df[emails_df['email_str'].str.contains(phrase, case=False)]
    matching_indices = query.index.tolist()
    all_keyword_doc_idx = all_keyword_doc_idx.union(set(matching_indices))
    if len(matching_indices) == 0:
        print(phrase + " did not yield any results")
        continue
    num_keyword_docs = math.ceil(len(matching_indices)*pct_keyword_docs)
    rand_ids = random.sample(matching_indices, num_keyword_docs)
    sample_keyword_doc_idx = sample_keyword_doc_idx.union(set(rand_ids))
    print("after {} there are {} indices".format(phrase,len(sample_keyword_doc_idx)))     

guided_emails = emails_df.loc[list(sample_keyword_doc_idx)]

#hyperparameters
random.seed(24)
n_features = None
n_components = 10
n_top_words = 20
rand_num = 10000
seed_confidence = 0.99
max_iters = 700

all_idx = set(emails_df.index)
available_emails_idx = all_idx.difference(all_keyword_doc_idx)
print("there are {} emails non relevant emails to sample from after removing {} relevant emails"\
.format(len(available_emails_idx),len(all_keyword_doc_idx)))
rand_ids = random.sample(list(available_emails_idx), rand_num)
train_idx = list(sample_keyword_doc_idx.union(set(rand_ids)))     
email_samples = emails_df.loc[train_idx,]

print("There are {} irrelevant docs".format(rand_num))
print("the train email df will be {}".format(email_samples.shape[0]))

train_content = list(email_samples['email_str'])
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.40, min_df=10,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range = (1,1))

tf_train = tf_vectorizer.fit_transform(train_content)
tf_vocab = tf_vectorizer.vocabulary_
tf_feature_names = tf_vectorizer.get_feature_names()
print("Output of CountVectorizer: ",tf_train.shape)
             
seed_topic_list = [phrases]
seed_topics = {}
for t_id, keywords in enumerate(seed_topic_list):
    for word in keywords:
        # tf_vocab maps each term to its column index (same result as tf_feature_names.index(word))
        seed_topics[tf_vocab[word]] = t_id
    
glda = guidedlda.GuidedLDA(n_topics = n_components, n_iter = max_iters, random_state = 7, refresh = 50)
glda.fit(tf_train, seed_topics = seed_topics, seed_confidence = seed_confidence)
topic_word = glda.topic_word_
print(topic_word.shape)

for i, topic_dist in enumerate(topic_word):
    # note: the slice [:-n_top_words:-1] keeps the top n_top_words - 1 words (19 here)
    topic_words = np.array(tf_feature_names)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
print("finished glda in {}".format(time() - t0))

after raptor there are 229 indices
after ljm there are 544 indices
after fraud there are 892 indices
after manipulation there are 1326 indices
after condor there are 1356 indices
after trutta there are 1368 indices
after swap there are 4900 indices
after differential there are 5187 indices
after ferc there are 10836 indices
after merlin there are 10935 indices
after whitewing there are 10979 indices
after jedi there are 11094 indices
after vehicle there are 12206 indices
there are 484399 emails non relevant emails to sample from after removing 33002 relevant emails
There are 10000 irrelevant docs
the train email df will be 22206
Extracting tf features for LDA...


INFO:guidedlda:n_documents: 22206
INFO:guidedlda:vocab_size: 22830
INFO:guidedlda:n_words: 5675974
INFO:guidedlda:n_topics: 10
INFO:guidedlda:n_iter: 700


Output of CountVectorizer:  (22206, 22830)


INFO:guidedlda:<0> log likelihood: -62165857
INFO:guidedlda:<50> log likelihood: -47299041
INFO:guidedlda:<100> log likelihood: -46943247
INFO:guidedlda:<150> log likelihood: -46820467
INFO:guidedlda:<200> log likelihood: -46746635
INFO:guidedlda:<250> log likelihood: -46716167
INFO:guidedlda:<300> log likelihood: -46703662
INFO:guidedlda:<350> log likelihood: -46680900
INFO:guidedlda:<400> log likelihood: -46670569
INFO:guidedlda:<450> log likelihood: -46664131
INFO:guidedlda:<500> log likelihood: -46658764
INFO:guidedlda:<550> log likelihood: -46651523
INFO:guidedlda:<600> log likelihood: -46646283
INFO:guidedlda:<650> log likelihood: -46645347
INFO:guidedlda:<699> log likelihood: -46645476


(10, 22830)
Topic 0: ees ferc ect hou na et transmission rto corp market pm staff commission james order comments meeting need steffes
Topic 1: energy gas power natural new pipeline electric capacity paso prices el market com plant california dgdgdgdgdg price 09 services
Topic 2: 20 market information new rate risk time business provide 01 use contract issues trading service order need customers project
Topic 3: 20 3d power state california energy said electricity market prices utilities price font davis federal new utility ferc commission
Topic 4: com mail message gov doc org net intended ca recipient mailto aol bpa bracepatt received email confidential energy john
Topic 5: company said million new financial stock business dow jones trading billion energy news credit year investors 20 shares companies
Topic 6: com california market iso mail ferc power energy price electricity 20 prices state order cap wholesale jeff utilities generators
Topic 7: ect hou corp pm thanks enron_developmen
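The seed mapping passed to `GuidedLDA.fit` is a plain dictionary from vocabulary column id to topic id. The sketch below builds such a dictionary while skipping seed words that the vectorizer filtered out, for example through `min_df`. The helper `build_seed_topics`, the toy vocabulary, and the two-topic seed list are all hypothetical; the notebook itself seeds a single topic with `phrases`.

```python
def build_seed_topics(seed_topic_list, vocab):
    """Map vocabulary column ids to seed topic ids, skipping out-of-vocabulary words."""
    seed_topics, missing = {}, []
    for topic_id, keywords in enumerate(seed_topic_list):
        for word in keywords:
            if word in vocab:
                seed_topics[vocab[word]] = topic_id
            else:
                missing.append(word)
    return seed_topics, missing

# Hypothetical vocabulary standing in for tf_vectorizer.vocabulary_
toy_vocab = {"energy": 0, "ferc": 1, "ljm": 2, "meeting": 3, "raptor": 4}
toy_seeds = [["raptor", "ljm", "trutta"], ["ferc"]]
seed_topics, missing = build_seed_topics(toy_seeds, toy_vocab)
print(seed_topics, missing)
```

Reporting the skipped words makes it visible when a rare seed term never reaches the model, instead of stopping the cell with a lookup error.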

In [161]:
raptor_idx = tf_feature_names.index('raptor')
print(raptor_idx)
print(topic_word[:, raptor_idx])
print(glda.nzw_[:, raptor_idx])
#email_samples['email_str']

349
[0.00691522 0.02868753 0.00651866]
[ 272 1125  250]


In [None]:
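# apply the fitted LDA model to held-out emails with transform
# NOTE: dev_content is assumed to be a list of held-out email strings; it is not defined in the cells shown
# the vocabulary is fixed, so transform (rather than fit_transform) is sufficient
dev_vectorizer = CountVectorizer(vocabulary = tf_vocab)
tf_dev = dev_vectorizer.transform(dev_content)
doc_topic_distr = lda.transform(tf_dev)
print(doc_topic_distr[1])
print(dev_content[1])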
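Since the cell above is unexecuted, here is a minimal runnable sketch of the same idea on toy data: fit a vectorizer and an LDA model on training text, then reuse the training vocabulary to score unseen text. The documents and all `toy_` names are assumptions for illustration, and the resulting topics are not meaningful at this scale.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and held-out documents
toy_train = [
    "gas power prices california market",
    "power market prices energy gas",
    "meeting thanks review attached agenda",
    "thanks meeting agenda attached notes",
]
toy_dev = ["california energy prices", "please review the meeting notes"]

train_vectorizer = CountVectorizer()
toy_tf_train = train_vectorizer.fit_transform(toy_train)

toy_lda = LatentDirichletAllocation(n_components=2, max_iter=20,
                                    learning_method='online', random_state=0)
toy_lda.fit(toy_tf_train)

# Reuse the training vocabulary so held-out counts line up with the model's columns
dev_vec = CountVectorizer(vocabulary=train_vectorizer.vocabulary_)
toy_tf_dev = dev_vec.transform(toy_dev)
toy_doc_topics = toy_lda.transform(toy_tf_dev)
print(toy_doc_topics.shape)
print(toy_doc_topics.sum(axis=1))
```

Words in the held-out text that are absent from the training vocabulary are ignored, so each row of the result is a topic distribution over the known words only.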
