# Enron Fraud

## Introduction

In this project, I examined the Enron fraud data, which is publicly available here: https://www.cs.cmu.edu/~./enron/
The objective of this study was to develop a predictive model that identifies persons of interest (POIs) from the available data. The dataset includes financial data for each person, along with both detailed and summary information on the emails they exchanged.

The analysis was performed using Python's scikit-learn (sklearn) package and the classifiers it provides. As a first step, I imported the packages needed during the course of this exercise.

In [42]:
#!/usr/bin/python

import sys
import pickle
import re
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix
from scipy.stats import scoreatpercentile
from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import RandomizedPCA
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import roc_curve
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from IPython.display import display, HTML
%matplotlib inline

In the following step, I changed the current working directory so that I could import existing user-defined helper functions. These read the source data, which saves reimplementing that functionality.

In [43]:
os.chdir("/Users/ambarishbanerjee/Downloads/ud120-projects/tools/")
from feature_format import featureFormat, targetFeatureSplit
os.chdir("/Users/ambarishbanerjee/Downloads/ud120-projects/final_project/")
from tester import dump_classifier_and_data

Next, we read the income data and the email summary statistics for each person in our dataset.

In [44]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','salary','deferral_payments',\
                 'total_payments','exercised_stock_options',\
                'from_messages','from_this_person_to_poi',\
                'bonus','restricted_stock','shared_receipt_with_poi',\
                 'to_messages','from_poi_to_this_person',\
                'restricted_stock_deferred','total_stock_value',\
                'expenses','loan_advances',\
                'other','director_fees','deferred_income',\
                'long_term_incentive']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

## Feature Engineering

We begin the analysis by inspecting the dataset. Because the dataset comprises only about 150 records, I looked at the individual entries and noticed two that did not correspond to any specific person. The first of these was the record identified as "Total", which is the grand total across all the people in our dataset. The second, "Travel Agency in the Park", could have been any agency contracted by Enron, so I decided to drop this record as well. I also observed that all fields for "Eugene Lockhart" contained NaN, so I dropped this observation too.

In [45]:
### Task 2: Remove outliers
my_dataset = data_dict.copy()

outlier_list = ["TOTAL","THE TRAVEL AGENCY IN THE PARK","LOCKHART EUGENE E"]

for record in outlier_list:
    if record in data_dict:
        data_dict.pop(record)
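
The all-NaN check described above can also be automated rather than done by eye. The following is a minimal sketch, assuming that missing values are stored as the string 'NaN' (as in the cells of this notebook); the helper name `find_empty_records` and the toy records are hypothetical and used only for illustration.

```python
def find_empty_records(dataset, ignore=("poi", "email_address")):
    """Return the keys whose every field (other than `ignore`) is the string 'NaN'."""
    empty = []
    for name, fields in dataset.items():
        values = [v for k, v in fields.items() if k not in ignore]
        if values and all(v == "NaN" for v in values):
            empty.append(name)
    return sorted(empty)


# Toy records shaped like the entries of data_dict (the values are made up).
toy = {
    "DOE JOHN": {"poi": False, "salary": 100000, "bonus": "NaN", "email_address": "NaN"},
    "ROE JANE": {"poi": False, "salary": "NaN", "bonus": "NaN", "email_address": "NaN"},
}
empty_records = find_empty_records(toy)
```

Applied to the real dictionary, a check of this kind would be expected to flag records that carry no financial information, which can then be reviewed before being dropped.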

In addition to these corrections, I observed that some of the values for two specific people, Sanjay Bhatnagar and Robert Belfer, did not tally with those provided in enron61702insiderpay.pdf. The financial information for these two records was therefore updated manually in the following section.

In [46]:
for record in data_dict:
    if record == "BELFER ROBERT":
        data_dict[record]['salary'] = 'NaN'
        data_dict[record]['bonus'] = 'NaN'
        data_dict[record]['long_term_incentive'] = 'NaN'
        data_dict[record]['deferred_income'] = -102500
        data_dict[record]['deferral_payments'] = 'NaN'
        data_dict[record]['loan_advances'] = 'NaN'
        data_dict[record]['other'] = 'NaN'
        data_dict[record]['expenses'] = 3285
        data_dict[record]['director_fees'] = 102500
        data_dict[record]['total_payments'] = 3285
        data_dict[record]['exercised_stock_options'] = 'NaN'
        data_dict[record]['restricted_stock'] = 44093
        data_dict[record]['restricted_stock_deferred'] = -44093
        data_dict[record]['total_stock_value'] = 'NaN'
        #print data_dict[record]
    if record == "BHATNAGAR SANJAY":
        data_dict[record]['salary'] = 'NaN'
        data_dict[record]['bonus'] = 'NaN'
        data_dict[record]['long_term_incentive'] = 'NaN'
        data_dict[record]['deferred_income'] = 'NaN'
        data_dict[record]['deferral_payments'] = 'NaN'
        data_dict[record]['loan_advances'] = 'NaN'
        data_dict[record]['other'] = 'NaN'
        data_dict[record]['expenses'] = 137864
        data_dict[record]['director_fees'] = 'NaN'
        data_dict[record]['total_payments'] = 137864
        data_dict[record]['exercised_stock_options'] = 15456290
        data_dict[record]['restricted_stock'] = 2604490
        data_dict[record]['restricted_stock_deferred'] = -2604490
        data_dict[record]['total_stock_value'] = 15456290
        #print data_dict[record]

In the next step, I created a list that maps each email address to the person-of-interest variable, with the intent of looking up the "poi" value for an email address while analyzing the Enron corpus. Following this, I added a synthetic variable called person_id to serve as a durable key in lieu of the email address. To that end, I also stored person_id, email_address, poi, and the person's name in a tuple of tuples, so that the other values can be looked up once the person_id is known. Note that the email address was not known for many records, but fortunately all such records correspond to non-POIs in the dataset.

In [47]:
### Task 3: Create new feature(s)
### Store to data_dict for easy export below.
email_list = []
       
person_id = 0
for record in data_dict:
    person_id += 1
    data_dict[record]['person_id'] = person_id
    
email_lookup_tup = ()
for record in data_dict:
    email_list.append([data_dict[record]['email_address'],data_dict[record]['poi'],data_dict[record]['person_id']])
    email_lookup_tup = email_lookup_tup+((data_dict[record]['person_id'],\
                                          data_dict[record]['email_address'],\
                                          data_dict[record]['poi'],record),)
    
def tuple_lookup(val,col):
    #print val
    for id in range(len(email_lookup_tup)):
        if email_lookup_tup[id][0] == val:
            if col == "email_address":
                return email_lookup_tup[id][1]
            elif col == "poi":
                return email_lookup_tup[id][2]
            elif col == "name":
                return email_lookup_tup[id][3]
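
The same lookup can be expressed with a dictionary keyed by person_id, which avoids scanning the tuple on every call. This is only a sketch of an alternative, not the approach used in the rest of the notebook; `build_person_index` and the toy records are hypothetical.

```python
def build_person_index(dataset):
    """Map person_id -> {'name', 'email_address', 'poi'} for direct lookups."""
    index = {}
    for name, fields in dataset.items():
        index[fields["person_id"]] = {
            "name": name,
            "email_address": fields["email_address"],
            "poi": fields["poi"],
        }
    return index


# Toy records with the same keys the notebook adds to data_dict.
toy = {
    "DOE JOHN": {"person_id": 1, "email_address": "john.doe@example.com", "poi": False},
    "ROE JANE": {"person_id": 2, "email_address": "jane.roe@example.com", "poi": True},
}
person_index = build_person_index(toy)
jane_is_poi = person_index[2]["poi"]
```

A lookup such as `person_index[2]["name"]` then plays the role of `tuple_lookup(2, "name")`.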

In the following section, the source data is split into a target vector containing the person-of-interest flag and a feature matrix containing the income and email communication summary variables. The feature set is then converted into a pandas DataFrame to facilitate the introduction of two new synthetic variables. Both express a person's email traffic with POIs as a proportion of that person's message counts: person_to_poi_proportion divides from_this_person_to_poi by to_messages, and poi_to_person_proportion divides from_poi_to_this_person by from_messages.

In [51]:
### Extract features and labels from dataset for local testing
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

people_features = []
for i in range(len(labels)):
    people_features.append(features[i])

pd.set_option('display.float_format', lambda x: '%.3f' % x)

df = pd.DataFrame(data=people_features,columns=features_list[1:])
df['target'] = pd.Series(np.asarray(labels),index=df.index)
df['person_to_poi_proportion'] = df.apply(lambda row: row['from_this_person_to_poi']/\
                                          row['to_messages'] if row['to_messages'] > 0 \
                                          else 0.0, axis=1)
df['poi_to_person_proportion'] = df.apply(lambda row: row['from_poi_to_this_person']/\
                                          row['from_messages'] if row['from_messages'] > 0 \
                                          else 0.0, axis=1)
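
For comparison, here is a vectorized sketch of POI-share features. It assumes a different pairing of denominators from the cell above: emails sent to POIs are divided by all emails sent (`from_messages`), and emails received from POIs by all emails received (`to_messages`). The helper `safe_ratio`, the column names `sent_to_poi_share` and `received_from_poi_share`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Element-wise ratio that returns 0.0 wherever the denominator is not positive."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out


# Toy message counts (made up) using the dataset's column names.
toy = pd.DataFrame({
    "from_messages": [100.0, 0.0, 40.0],
    "to_messages": [200.0, 50.0, 0.0],
    "from_this_person_to_poi": [25.0, 0.0, 10.0],
    "from_poi_to_this_person": [50.0, 5.0, 0.0],
})
toy["sent_to_poi_share"] = safe_ratio(toy["from_this_person_to_poi"], toy["from_messages"])
toy["received_from_poi_share"] = safe_ratio(toy["from_poi_to_this_person"], toy["to_messages"])
```

Rows with a zero denominator get a share of 0.0, mirroring the `else 0.0` branch in the cell above.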

In the following step, I performed an independent two-sample t-test on each of the variables in the dataset. The main purpose of this exercise was to identify and exclude variables whose means did not differ significantly (at the 5% level) between the two groups (poi = 0 and poi = 1).

In [52]:
group1 = df[df['target']==0]
group2 = df[df['target']==1]

stat_significant_cols = []

for var in list(df.columns.values):
    if var != "target":
        if ttest_ind(group1[var], group2[var])[1] <= 0.05:
            stat_significant_cols.append(var)

new_list_of_features = ["poi"]
calculated_features = []
for var in stat_significant_cols[:]:
    if var in features_list and var not in new_list_of_features:
        new_list_of_features.append(var)
    elif var == "person_to_poi_proportion":
        if 'from_this_person_to_poi' not in new_list_of_features:
            new_list_of_features.append('from_this_person_to_poi')
        if 'to_messages' not in new_list_of_features:
            new_list_of_features.append('to_messages')
    elif var == "poi_to_person_proportion":
        if 'from_poi_to_this_person' not in new_list_of_features:
            new_list_of_features.append('from_poi_to_this_person')
        if 'from_messages' not in new_list_of_features:
            new_list_of_features.append('from_messages')
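
The t-test screening can be wrapped in a small function. The sketch below is a variation on the cell above, not a copy of it: it passes `equal_var=False` to `ttest_ind`, which runs Welch's t-test and may be a safer default when the two groups differ greatly in size and variance. The function name `screen_features` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import ttest_ind


def screen_features(df, target="target", alpha=0.05):
    """Return the columns whose group means differ at level alpha (Welch's t-test)."""
    group0 = df[df[target] == 0]
    group1 = df[df[target] == 1]
    selected = []
    for col in df.columns:
        if col == target:
            continue
        _, p_value = ttest_ind(group0[col], group1[col], equal_var=False)
        if p_value <= alpha:
            selected.append(col)
    return selected


# Toy data: "signal" separates the groups, "noise" is identical in both.
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
    "signal": [1.0, 2.0, 3.0, 4.0, 101.0, 102.0, 103.0, 104.0],
    "noise": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
})
selected = screen_features(toy)
```

Only the column whose means differ between the groups survives the screen in this toy case.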

In the next section, we rebuild the list of features based on the results of the independent-sample t-tests. As part of this exercise, I also chose to drop two of the income variables included in the original dataset, "total_payments" and "total_stock_value". These two variables are sums of the others, so they can be derived from the remaining variables and add no new information.
In the next step, we perform principal component analysis (PCA) on the income variables with the intent of reducing their number. Results showed that two components account for almost 96% of the variability in the income variables.

In [50]:
new_list_of_features.remove('total_payments')
new_list_of_features.remove('total_stock_value')
income_variables = ["salary","exercised_stock_options",\
                    "bonus","restricted_stock",\
                    "expenses","loan_advances","deferred_income",\
                    "long_term_incentive","other"]

for var in income_variables:
    if var in new_list_of_features:
        new_list_of_features.remove(var)
    new_list_of_features.append(var)
new_list_of_features.append('person_id')

features_list = new_list_of_features[:]
#print features_list
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
    
test_correlation = pd.DataFrame(data=np.array(features_train)[:,5:-1],columns=income_variables)
display(test_correlation.corr(method='pearson'))

pca = RandomizedPCA(n_components=len(income_variables)).fit(np.array(features_train)[:,5:-1])
total_variance_captured = 0
transformed_features = []
eigenvectors = []
for indexval in range(len(pca.explained_variance_ratio_)):
    if total_variance_captured < 0.95:
        total_variance_captured += pca.explained_variance_ratio_[indexval]
    else:
        break
#print indexval, total_variance_captured
mod_pca = RandomizedPCA(n_components=indexval).fit(np.array(features_train)[:,5:-1])
final_features = mod_pca.transform(np.array(features_train)[:,5:-1])

| | salary | exercised_stock_options | bonus | restricted_stock | expenses | loan_advances | deferred_income | long_term_incentive | other |
|---|---|---|---|---|---|---|---|---|---|
| salary | 1.0 | 0.616 | 0.742 | 0.559 | 0.345 | 0.42 | -0.367 | 0.557 | 0.591 |
| exercised_stock_options | 0.616 | 1.0 | 0.639 | 0.75 | 0.106 | 0.725 | -0.22 | 0.48 | 0.72 |
| bonus | 0.742 | 0.639 | 1.0 | 0.539 | 0.25 | 0.563 | -0.34 | 0.459 | 0.52 |
| restricted_stock | 0.559 | 0.75 | 0.539 | 1.0 | 0.186 | 0.606 | -0.134 | 0.349 | 0.656 |
| expenses | 0.345 | 0.106 | 0.25 | 0.186 | 1.0 | 0.13 | -0.07 | 0.076 | 0.151 |
| loan_advances | 0.42 | 0.725 | 0.563 | 0.606 | 0.13 | 1.0 | -0.035 | 0.435 | 0.77 |
| deferred_income | -0.367 | -0.22 | -0.34 | -0.134 | -0.07 | -0.035 | 1.0 | -0.279 | -0.358 |
| long_term_incentive | 0.557 | 0.48 | 0.459 | 0.349 | 0.076 | 0.435 | -0.279 | 1.0 | 0.587 |
| other | 0.591 | 0.72 | 0.52 | 0.656 | 0.151 | 0.77 | -0.358 | 0.587 | 1.0 |
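
The component-selection loop can also be written with a cumulative sum. The sketch below assumes a recent scikit-learn release, where `RandomizedPCA` is no longer available and `PCA` (optionally with `svd_solver="randomized"`) takes its place. The function name `components_for_variance` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.95):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= threshold, converted to a count.
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_components, len(cumulative)), cumulative


# Synthetic data: three columns driven by one latent factor plus small noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 1))
X = latent * np.array([10.0, 5.0, 1.0]) + rng.normal(scale=0.01, size=(200, 3))
n_components, cumulative = components_for_variance(X)
```

Because the synthetic columns share a single latent factor, one component is enough here; on the income variables the text above reports that two components are needed.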


In the following step, we concatenate the transformed income features with the email features: the number of emails shared with POIs and the two proportion variables, which are recomputed here from the raw message counts. Lastly, we append the synthetic variable person_id, which serves as a pointer to the original data and facilitates pooling the results from the Enron corpus with the income information for each record in our dataset. As a final step in the data preparation process, the features were standardized so that each has a sample mean of 0 and a standard deviation of 1. The test data were transformed in the same manner.

In [53]:
final_features = np.concatenate((final_features, np.array(features_train)[:,0:5]), axis=1)
final_features = np.concatenate((final_features, np.array(features_train)[:,-1:]), axis=1)

final_features[:,3] = final_features[:,3]/final_features[:,6]
final_features[:,4] = final_features[:,4]/final_features[:,5]
## Correct this
final_features[:,3] = np.nan_to_num(final_features[:,3])
final_features[:,4] = np.nan_to_num(final_features[:,4])
final_features = np.delete(final_features, [5,6], 1)

# Standardize the first five columns (zero mean, unit standard deviation per column).
final_features[:,0:5] = (final_features[:,0:5] - np.mean(final_features[:,0:5], axis=0))/\
np.std(final_features[:,0:5], axis=0)

#print final_features[0]
final_features_test = mod_pca.transform(np.array(features_test)[:,5:-1])
final_features_test = np.concatenate((final_features_test, np.array(features_test)[:,0:5]), axis=1)
final_features_test = np.concatenate((final_features_test, np.array(features_test)[:,-1:]), axis=1)
final_features_test[:,3] = final_features_test[:,3]/final_features_test[:,6]
final_features_test[:,4] = final_features_test[:,4]/final_features_test[:,5]
final_features_test[:,3] = np.nan_to_num(final_features_test[:,3])
final_features_test[:,4] = np.nan_to_num(final_features_test[:,4])
final_features_test = np.delete(final_features_test, [5,6], 1)

# Standardize the first five columns of the test set in the same manner.
final_features_test[:,0:5] = (final_features_test[:,0:5] - np.mean(final_features_test[:,0:5], axis=0))/\
np.std(final_features_test[:,0:5], axis=0)
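
The cell above standardizes the test set with its own mean and standard deviation. A common alternative, sketched here on synthetic data, is to estimate the statistics on the training set only and reuse them for the test set, so that both sets pass through exactly the same transformation. The arrays `train` and `test` are made-up stand-ins for the notebook's feature matrices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.RandomState(42)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# Estimate the mean and standard deviation on the training data only.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
# Reuse the training statistics for the test data.
test_scaled = scaler.transform(test)
```

With this approach the scaled test columns will generally not have a mean of exactly 0, which is expected: they are measured on the training set's scale.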

## Data Analysis and Model Development

As my first analysis, I tried a logit specification with "poi" as the response variable and the income and email variables as the regressors, tuning the complexity factor (C) and the class weight as hyperparameters. The results shown below indicate that the logit specification has reasonably good precision and recall. Unfortunately, model and regressor significance cannot be obtained from the sklearn logistic regression implementation. Hence, I fit the same specification with the logistic regression implementation available in the statsmodels library.

In [54]:
param_grid_logit = {
         'C': [100000.0, 10000.0, 1000.0, 100.0, 10.0, 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001],
         'class_weight': ["balanced",None]
          }

clf = GridSearchCV(linear_model.LogisticRegression(), param_grid_logit)
clf = clf.fit(final_features[:,[0,1,2,3,4]], labels_train)
logit_model = clf.best_estimator_

predictions_logit = logit_model.predict(final_features_test[:,[0,1,2,3,4]])
c_matrix_test = confusion_matrix(labels_test, predictions_logit[:])
print c_matrix_test

[[37  1]
 [ 3  2]]
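
Precision and recall can be read directly off this matrix. The sketch below assumes scikit-learn's layout for a binary confusion matrix, with rows for true labels and columns for predicted labels, i.e. [[TN, FP], [FN, TP]]; the helper name `precision_recall` is hypothetical.

```python
import numpy as np


def precision_recall(c_matrix):
    """Precision and recall of the positive class from a 2x2 [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = np.asarray(c_matrix, dtype=float)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Confusion matrix printed above for the logistic regression model.
precision, recall = precision_recall([[37, 1], [3, 2]])
```

For the matrix above this gives a precision of 2/3 (two of the three predicted POIs are correct) and a recall of 2/5 (two of the five actual POIs are found).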


The regressor significances shown below indicate that the two synthetic variables, poi_to_person_proportion and person_to_poi_proportion, are not statistically significant. Hence, I decided to revise the model by dropping these two regressors.

In [55]:
print final_features[0]
final_features_logit = ["finance1_n",
                        "finance2_n",
                        "shared_receipt_with_poi_n",
                        "poi_to_person_proportion",
                        "person_to_poi_proportion",
                        "person_id"]
final_features_df = pd.DataFrame(final_features,columns = final_features_logit)

final_features_df['Response'] = final_features_df['person_id'].\
apply(lambda x: tuple_lookup(x,col="poi"))

y, X = dmatrices('Response ~ finance1_n + finance2_n + shared_receipt_with_poi_n + poi_to_person_proportion\
                + person_to_poi_proportion',\
                 final_features_df, return_type="dataframe")

y_flat = np.ravel(y['Response[True]'])
model = sm.Logit(y_flat, X)
model_summary = model.fit(disp=0)
print('Parameters: ', model_summary.params)

model_significance = model_summary.get_margeff()
print(model_significance.summary())

[ -0.18189243  -0.42137307  -0.62578358  -0.40237325  -0.42637617  16.        ]
('Parameters: ', Intercept                   -2.131
finance1_n                   0.575
finance2_n                   0.241
shared_receipt_with_poi_n    0.606
poi_to_person_proportion     0.016
person_to_poi_proportion    -0.063
dtype: float64)
        Logit Marginal Effects       
Dep. Variable:                      y
Method:                          dydx
At:                           overall
                               dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
finance1_n                    0.0538      0.070      0.764      0.445        -0.084     0.192
finance2_n                    0.0226      0.026      0.881      0.378        -0.028     0.073
shared_receipt_with_poi_n     0.0567      0.024      2.399      0.016         0.010     0.103
poi_to_person_proportion      0.0015      0.026      0.

I re-ran the model with the variables that were relatively significant, but there was little improvement in the significance values.

In [56]:
y, X = dmatrices('Response ~ finance1_n + finance2_n + shared_receipt_with_poi_n',\
                 final_features_df, return_type="dataframe")

y_flat = np.ravel(y['Response[True]'])
model = sm.Logit(y_flat, X)
model_summary = model.fit(disp=0)
print('Parameters: ', model_summary.params)

model_significance = model_summary.get_margeff()
print(model_significance.summary())

('Parameters: ', Intercept                   -2.131
finance1_n                   0.585
finance2_n                   0.246
shared_receipt_with_poi_n    0.603
dtype: float64)
        Logit Marginal Effects       
Dep. Variable:                      y
Method:                          dydx
At:                           overall
                               dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
finance1_n                    0.0548      0.071      0.766      0.443        -0.085     0.195
finance2_n                    0.0230      0.026      0.897      0.370        -0.027     0.073
shared_receipt_with_poi_n     0.0565      0.023      2.470      0.013         0.012     0.101


Following this, I obtained a revised confusion matrix using only three regressors: the two income components and the number of emails shared with POIs. As expected, there was no improvement, as is evident from the unchanged true positive, false positive, and false negative counts.

In [57]:
clf = GridSearchCV(linear_model.LogisticRegression(), param_grid_logit)
clf = clf.fit(final_features[:,[0,1,2]], labels_train)
logit_model_2 = clf.best_estimator_

predictions_logit_2 = logit_model_2.predict(final_features_test[:,[0,1,2]])
c_matrix_test_2 = confusion_matrix(labels_test, predictions_logit_2[:])
print c_matrix_test_2

[[37  1]
 [ 3  2]]


In the final revision, I added two interaction features. I ran a few other specifications as well and retained this one based on the statistical significance of the individual regressors. One notable observation with this specification was that the significance of "shared_receipt_with_poi" dropped considerably, mostly because of the inclusion of the two interaction terms: finance1_n with shared_receipt_with_poi_n, and finance2_n with shared_receipt_with_poi_n.

In [58]:
y, X = dmatrices('Response ~ finance1_n + finance2_n + shared_receipt_with_poi_n \
                        + finance1_n*shared_receipt_with_poi_n + finance2_n*shared_receipt_with_poi_n',\
                 final_features_df, return_type="dataframe")

y_flat = np.ravel(y['Response[True]'])
model = sm.Logit(y_flat, X)
model_summary = model.fit(disp=0)
print('Parameters: ', model_summary.params)

model_significance = model_summary.get_margeff()
print(model_significance.summary())

('Parameters: ', Intercept                                0.415
finance1_n                              24.755
finance2_n                              -3.928
shared_receipt_with_poi_n               -0.964
finance1_n:shared_receipt_with_poi_n   -15.588
finance2_n:shared_receipt_with_poi_n     2.750
dtype: float64)
        Logit Marginal Effects       
Dep. Variable:                      y
Method:                          dydx
At:                           overall
                                          dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------
finance1_n                               2.2758      2.198      1.035      0.301        -2.033     6.585
finance2_n                              -0.3611      0.386     -0.936      0.349        -1.117     0.395
shared_receipt_with_poi_n               -0.0886      0.147     -0.604      0.546        -0.376     0.199
finance1

I re-ran the sklearn implementation of the logistic regression classifier with the additional interaction features, and there was no improvement over the previous runs.

In [59]:
final_features_logit_int = np.concatenate((final_features[:,[0,1,2,5]],final_features[:,0:1]*final_features[:,2:3],\
                                           final_features[:,1:2]*final_features[:,2:3])\
                                          ,axis=1)

final_features_test_logit_int = np.concatenate((final_features_test[:,[0,1,2,5]],\
                                                final_features_test[:,0:1]*final_features_test[:,2:3],\
                                               final_features_test[:,1:2]*final_features_test[:,2:3]),axis=1)

clf = GridSearchCV(linear_model.LogisticRegression(), param_grid_logit)
clf = clf.fit(final_features[:,[0,1,2,4,5]], labels_train)
logit_model_3 = clf.best_estimator_

predictions_logit_3 = logit_model_3.predict(final_features_test[:,[0,1,2,4,5]])
c_matrix_test_3 = confusion_matrix(labels_test, predictions_logit_3[:])
print c_matrix_test_3

[[37  1]
 [ 3  2]]


As my next algorithm, I chose the support vector classifier (SVC). I tuned the complexity factor (C), the class weight, the kernel, and the degree of the polynomial kernel as part of the hyperparameter optimization process. Unfortunately, the resulting classifier did not identify a single "poi" record in the test set, and I therefore obtained zero for both precision and recall.

In [60]:
param_grid_svc = {
         'C': [100000.0, 10000.0, 1000.0, 100.0, 10.0, 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001],
         'class_weight': ["balanced",None],
         'kernel': ["linear", "poly", "sigmoid", "rbf"],
         'degree': [2,3,4,5]
          }
clf = GridSearchCV(SVC(), param_grid_svc)
clf = clf.fit(final_features[:,[0,1,2,3,4]], labels_train)
svm_model = clf.best_estimator_

predictions_svm = svm_model.predict(final_features_test[:,[0,1,2,3,4]])
c_matrix_test_svm = confusion_matrix(labels_test, predictions_svm[:])
print c_matrix_test_svm

[[38  0]
 [ 5  0]]
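
One possible contributor to this outcome is the selection metric. Unless a `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers, and on an imbalanced dataset accuracy can favor a model that predicts the majority class everywhere. The sketch below shows a search that optimizes F1 over stratified folds instead. It assumes a recent scikit-learn release (where `GridSearchCV` lives in `sklearn.model_selection` rather than `sklearn.grid_search`) and uses synthetic data, so it illustrates the technique and says nothing about the Enron results.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic imbalanced data: 85 negatives and 15 positives in two dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=0.0, size=(85, 2)), rng.normal(loc=2.5, size=(15, 2))])
y = np.array([0] * 85 + [1] * 15)

param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": ["balanced", None]}
# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# scoring="f1" ranks candidates by the F1 score of the positive class.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=cv)
search.fit(X, y)
best_params = search.best_params_
predictions = search.predict(X)
```

The same `scoring` and `cv` arguments could be passed to the other grid searches in this notebook, at the cost of results that are no longer comparable with the outputs shown here.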


Decision trees were my next choice of classification algorithm. I tuned the split criterion, the depth of the tree, the class weight, the minimum number of samples required for a split, and the minimum number of samples per leaf in the hyperparameter optimization step. Given the extremely small number of POIs (13) in the dataset, I expected the best value for the minimum number of samples per leaf to be very low, but the results suggested otherwise, and the resulting tree did not identify any of the POIs in the test set. As with the logistic regression results, the decision tree ranked the number of emails shared with POIs as the most important feature for identifying POIs in the dataset.

In [61]:
param_grid_dtree = {
         'criterion':['gini','entropy'],
         'max_depth':[2,3,4,5,6],
         'class_weight': ["balanced",None],
         'min_samples_split':[2,4,6,8],
         'min_samples_leaf':[1,2,3,4,5,6,7]
          }
clf = GridSearchCV(tree.DecisionTreeClassifier(random_state = 42), param_grid_dtree)
clf = clf.fit(final_features[:,[0,1,2,3,4]], labels_train)
dtree_model = clf.best_estimator_
print clf.best_params_

predictions_dtree = dtree_model.predict(final_features_test[:,[0,1,2,3,4]])
c_matrix_test_dtree = confusion_matrix(labels_test, predictions_dtree[:])
print c_matrix_test_dtree
print dtree_model.feature_importances_

{'min_samples_split': 2, 'min_samples_leaf': 5, 'criterion': 'gini', 'max_depth': 3, 'class_weight': None}
[[36  2]
 [ 5  0]]
[ 0.02493101  0.          0.39967268  0.31801683  0.25737948]


As a revision to the baseline decision tree model developed in the previous step, I included only the two most important features (based on Gini importance) in the dataset. The improvement was notable: the revised tree identified two of the five POIs in the test dataset, although the number of false positives rose from two to five.

In [62]:
clf = GridSearchCV(tree.DecisionTreeClassifier(random_state = 42), param_grid_dtree)
clf = clf.fit(final_features[:,[2,3]], labels_train)
dtree_model_2 = clf.best_estimator_
print clf.best_params_

predictions_dtree_2 = dtree_model_2.predict(final_features_test[:,[2,3]])
c_matrix_test_dtree_2 = confusion_matrix(labels_test, predictions_dtree_2[:])
print c_matrix_test_dtree_2
print dtree_model_2.feature_importances_

{'min_samples_split': 2, 'min_samples_leaf': 1, 'criterion': 'gini', 'max_depth': 2, 'class_weight': 'balanced'}
[[33  5]
 [ 3  2]]
[ 0.72112496  0.27887504]


My next classification algorithm of choice was Gaussian Naive Bayes. Considering the simplicity of the algorithm, the results were fairly promising. The confusion matrix shown below indicates that precision and recall were both 0.4 with Naive Bayes.

In [63]:
gnb = GaussianNB()
nb_model = gnb.fit(final_features[:,[0,1,2,3,4]], labels_train)

test_predict_nb = nb_model.predict(final_features_test[:,[0,1,2,3,4]])
print confusion_matrix(labels_test, test_predict_nb)

[[35  3]
 [ 3  2]]


My final classification algorithm of choice was the random forest (RF). A random forest is an ensemble technique that builds the final model from numerous weak classifiers, combining them through majority voting (or mean prediction in the case of a continuous response variable). Ensemble methods tend to be more robust and thus often outperform other classification algorithms on noisy datasets such as the Enron dataset, so I was more optimistic about the random forest classifier. Unfortunately, the results below indicate that the RF model fails to identify any of the POIs in the test set. However, the number of emails shared with POIs was once again the most important feature in the dataset for identifying potential POIs.

In [66]:
param_grid_RF = {
         'criterion':['gini','entropy'],
         'n_estimators':[100,250,500],
         'class_weight': ["balanced",None],
         'min_samples_split':[2,4,6,8],
         'min_samples_leaf':[1,2,3,4,5],
         'max_depth':[4,6,8,10]
          }
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_RF)
clf = clf.fit(final_features[:,[0,1,2,3,4]], labels_train)
RF_model = clf.best_estimator_
print clf.best_params_

predictions_RF = RF_model.predict(final_features_test[:,[0,1,2,3,4]])
c_matrix_test_RF = confusion_matrix(labels_test, predictions_RF[:])
print c_matrix_test_RF
print RF_model.feature_importances_

{'min_samples_leaf': 2, 'n_estimators': 500, 'criterion': 'gini', 'min_samples_split': 6, 'max_depth': 4, 'class_weight': 'balanced'}
[[36  2]
 [ 5  0]]
[ 0.218844    0.16692288  0.27486857  0.19957543  0.13978913]


As a revision to the RF model developed in the previous step, I excluded the two variables that exhibited the lowest importance among the features in the dataset. The results indicate that the revised model has a precision and a recall of 0.4 each.

In [67]:
param_grid_RF_2 = {
         'criterion':['gini','entropy'],
         'n_estimators':[100,250,500],
         'class_weight': ["balanced",None],
         'min_samples_split':[2,4,6,8],
         'min_samples_leaf':[1,2,3,4,5],
         'max_depth':[4,6,8,10]
          }
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_RF_2)
clf = clf.fit(final_features[:,[0,2,3]], labels_train)
RF_model_2 = clf.best_estimator_
print clf.best_params_

predictions_RF_2 = RF_model_2.predict(final_features_test[:,[0,2,3]])
c_matrix_test_RF_2 = confusion_matrix(labels_test, predictions_RF_2[:])
print c_matrix_test_RF_2
print RF_model_2.feature_importances_

{'min_samples_leaf': 4, 'n_estimators': 100, 'criterion': 'gini', 'min_samples_split': 2, 'max_depth': 4, 'class_weight': 'balanced'}
[[35  3]
 [ 3  2]]
[ 0.38267731  0.3524988   0.26482388]


I tried a further revision based on the feature importances from the previous step, but it failed to identify any of the POIs in the test set. Hence, I decided to keep the previous RF model as the final RF classifier for this exercise.

In [68]:
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_RF_2)
clf = clf.fit(final_features[:,[0,2]], labels_train)
RF_model_3 = clf.best_estimator_
print clf.best_params_

predictions_RF_3 = RF_model_3.predict(final_features_test[:,[0,2]])
c_matrix_test_RF_3 = confusion_matrix(labels_test, predictions_RF_3[:])
print c_matrix_test_RF_3
print RF_model_3.feature_importances_

{'min_samples_leaf': 5, 'n_estimators': 100, 'criterion': 'gini', 'min_samples_split': 2, 'max_depth': 4, 'class_weight': None}
[[38  0]
 [ 5  0]]
[ 0.4600568  0.5399432]
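
The three confusion matrices above can be compared directly. The following is a minimal sketch, assuming the layout that `confusion_matrix` uses (rows are true labels, columns are predicted labels, negative class first); the helper name `summarize` and the dictionary labels are illustrative and not part of the original notebook.

```python
def summarize(matrix):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix.

    Assumed layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / float(total)
    # Convention used here: report 0.0 when the denominator is zero,
    # i.e. when the model never predicts the positive class.
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Matrices copied from the three random forest cells above
rf_matrices = {
    "RF_model": [[36, 2], [5, 0]],
    "RF_model_2": [[35, 3], [3, 2]],
    "RF_model_3": [[38, 0], [5, 0]],
}

results = {name: summarize(m) for name, m in rf_matrices.items()}
for name in sorted(results):
    print("%-11s accuracy=%.3f precision=%.3f recall=%.3f" % ((name,) + results[name]))
```

Note that RF_model_3 has the highest accuracy of the three simply because it labels everyone as a non-poi, which is why recall rather than accuracy drives the choice of RF_model_2.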


In the next section, I looked at the email texts for each person in the dataset with the objective of finding additional information that would help us improve the precision and recall further. To that effect, my approach was to identify key words in the emails that can be associated with the persons of interest. I used a two-step approach to achieve this objective. In the first step, I tried to identify the top 10 words that best identify emails that originated from the poi-s in the training set, and in the second step I tried combining these words with the income and email summary features using a decision tree classifier.
As a first step in this process, I loaded the existing user-defined helper functions that read the email data from the Enron corpus.

In [69]:
os.chdir("/Users/ambarishbanerjee/Downloads/ud120-projects/tools/")
from parse_out_email_text import parseOutText

In this section, I start by concatenating all emails for each person in our dataset into a single string and filtering out proper nouns (the list of names was built up iteratively), as they provide no information about poi-s in general but rather bear the signature of the person who authored the email. Finally, I include the person_id and the poi variable to index and keep track of the response variable for each record.

In [70]:
os.chdir("/Users/ambarishbanerjee/Downloads/ud120-projects/final_project/emails_by_address/")
#print os.getcwd()
email_text_data = {}
all_person_email = []

for email_add in email_list[:]:
    email_filename = "from_"+email_add[0]+".txt"
    email_text_data["email_id"] = email_add[0]
    email_text_data["poi"] = email_add[1]
    email_text_data["person_id"] = email_add[2]
    if os.path.isfile(email_filename):
        email_text_data["email_text"] = open(email_filename, "r")
    else:
        email_text_data["email_text"] = None
    all_person_email.append(email_text_data.copy())
    
features_train_tm_list = []
labels_train_tm_list = []
features_test_tm_list = []
labels_test_tm_list = []

remove_words = ['[A-Za-z]*houect[A-Za-z]*','[A-Za-z]*kamins[A-Za-z]*','[A-Za-z]*keann[A-Za-z]*',\
               '[A-Za-z]*shirley[A-Za-z]*','[A-Za-z]*sbeck[A-Za-z]*','[A-Za-z]*stinson[A-Za-z]*',\
               '[A-Za-z]*shapiro[A-Za-z]*','[A-Za-z]*mcconn[A-Za-z]*','[A-Za-z]*zimin[A-Za-z]*',\
               '[A-Za-z]*jlavoransf[A-Za-z]*','[A-Za-z]*jshankmnsf[A-Za-z]*','[A-Za-z]*mhaedicnsf[A-Za-z]*',
               '[A-Za-z]*maureen[A-Za-z]*','[A-Za-z]*pallennsf[A-Za-z]*']

for record in all_person_email[:]:
    
    all_emails_concat = ""
    #print record['email_text']
    if record['email_text'] is not None:
        # each line of the from_<address>.txt file is the path to one email
        for path in record['email_text']:
            # drop the leading directory and the trailing newline, then
            # resolve the path relative to the ud120-projects root
            path = os.path.join('../..', path[path.find('/')+1:][:-1])
            email = open(path, "r")
            all_emails_concat += parseOutText(email)
            email.close()
        # release the handle opened when all_person_email was built
        record['email_text'].close()
    for word in remove_words:
        word_pattern = re.compile(word)
        all_emails_concat = re.sub(word_pattern,"",all_emails_concat)
        
    if record["person_id"] in final_features[:,5]:
        features_train_tm_list.append(all_emails_concat)
        #from_data_train_tm.append(record['email_id'])
        labels_train_tm_list.append(record['poi'])
    else:
        features_test_tm_list.append(all_emails_concat)
        #from_data_test_tm.append(record['email_id'])
        labels_test_tm_list.append(record['poi'])
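
The name-stripping loop above can be illustrated on a toy string. This is a small sketch using only two of the patterns from `remove_words`; the sample sentence and the helper name `strip_names` are made up for illustration. Because each pattern is wrapped in `[A-Za-z]*`, the whole token containing the name fragment is removed, not just the fragment itself.

```python
import re

# Two of the name fragments from remove_words, wrapped the same way
name_fragments = ["shirley", "kamins"]
patterns = [re.compile("[A-Za-z]*" + frag + "[A-Za-z]*") for frag in name_fragments]

def strip_names(text):
    """Remove every token that contains one of the name fragments."""
    for pattern in patterns:
        text = pattern.sub("", text)
    # collapse the double spaces left behind by the removed tokens
    return " ".join(text.split())

# Hypothetical stemmed email text
sample = "pleas forward to shirley and vkaminsnsf befor the ferc meet"
cleaned = strip_names(sample)
print(cleaned)
```

The whitespace cleanup at the end is an addition for readability; the original loop leaves the gaps in place, which does not matter once the text is tokenized.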

In the following step, I vectorized the data using sklearn's TfidfVectorizer and fitted a decision tree classifier on the vectorized data. As a quick note, I tried a few different values for the number of features and observed that 10 features gave the largest number of correctly identified poi-s on the training dataset.

In [71]:
features_train_tm = np.asarray(features_train_tm_list)
features_test_tm = np.asarray(features_test_tm_list)
labels_train_tm = np.asarray(labels_train_tm_list)
labels_test_tm = np.asarray(labels_test_tm_list)
#print features_train_tm[0]

param_grid_tm = {
         'criterion':['gini','entropy'],
         'max_depth':[2,3,4,5],
         'class_weight': ["balanced",None],
         'min_samples_split':[2,4,6,8],
         'min_samples_leaf':[1,2,3,4,5,6,7]
          }

vectorizer = TfidfVectorizer(stop_words="english",max_df=0.2,max_features=10)
features_train_tm = vectorizer.fit_transform(features_train_tm).toarray()
features_test_tm  = vectorizer.transform(features_test_tm).toarray()

clf = GridSearchCV(tree.DecisionTreeClassifier(random_state=42), param_grid_tm)
clf = clf.fit(features_train_tm, labels_train_tm)
tm_dtree = clf.best_estimator_

pred_train = tm_dtree.predict(features_train_tm)
print accuracy_score(labels_train_tm,pred_train)
print confusion_matrix(labels_train_tm,pred_train)

feature_words = vectorizer.get_feature_names()
print feature_words

0.9
[[86  1]
 [ 9  4]]
[u'brent', u'deriv', u'engin', u'enrononlin', u'ferc', u'hotel', u'lng', u'shall', u'sincer', u'student']
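
To make the effect of `max_df` and `max_features` more concrete, here is a simplified pure-Python stand-in for the vocabulary selection step. It is a sketch of the documented behavior (drop terms that occur in more than `max_df` of the documents, then keep the `max_features` most frequent survivors), not sklearn's implementation; the function name, the alphabetical tie-breaking, and the toy corpus are assumptions.

```python
from collections import Counter

def select_vocabulary(documents, max_df=0.2, max_features=10):
    """Approximate the vocabulary pruning controlled by max_df and max_features."""
    tokenized = [doc.split() for doc in documents]
    # number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # total number of occurrences of each term across the corpus
    term_freq = Counter(term for tokens in tokenized for term in tokens)
    limit = max_df * len(documents)
    kept = [term for term in term_freq if doc_freq[term] <= limit]
    # most frequent first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])

# Hypothetical stemmed documents, one per person
toy_docs = [
    "enron ferc ferc hotel",
    "enron lng lng lng",
    "enron ferc student",
    "enron hotel",
    "enron deriv",
]
vocab = select_vocabulary(toy_docs, max_df=0.4, max_features=2)
print(vocab)
```

In this toy corpus "enron" appears in every document and is pruned by `max_df`, which mirrors why very common words are absent from the ten stems printed above.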


When testing on the test dataset, I observed that the algorithm failed to identify even one of the pois in the test set which led me to conclude that text mining approaches may not be suitable for identifying the poi-s in the dataset.

In [72]:
pred_test = tm_dtree.predict(features_test_tm)
print accuracy_score(labels_test_tm,pred_test)
print confusion_matrix(labels_test_tm, pred_test)

0.767441860465
[[33  5]
 [ 5  0]]


## Model Evaluation

In the final section of this project, we use the helper functions provided to test the accuracy, recall, precision, and the F1 score. We start by importing the helper function.

In [73]:
os.chdir("/Users/ambarishbanerjee/Downloads/ud120-projects/final_project/")
from tester import test_classifier

In the next section, we combine the training and test data used thus far to obtain the full list of transformed features that were generated prior to model development. Following this step, these synthetic features were merged into the original dataset, as required by the framework for this project. The additional features are also appended to the final feature list for this problem.

In [74]:
combined_data = np.concatenate((final_features,final_features_test),axis=0)
for record in combined_data[:,:]:
    #print record[-1]
    data_dict[tuple_lookup(record[-1],"name")]["finance_1_n"] = record[0]
    data_dict[tuple_lookup(record[-1],"name")]["finance_2_n"] = record[1]
    data_dict[tuple_lookup(record[-1],"name")]["shared_with_poi_n"] = record[2]
    data_dict[tuple_lookup(record[-1],"name")]["from_poi_to_person_ratio_n"] = record[3]
    data_dict[tuple_lookup(record[-1],"name")]["from_person_to_poi_ratio_n"] = record[4]

if "finance_1_n" not in features_list:
    features_list.extend(["finance_1_n","finance_2_n","shared_with_poi_n",\
                         "from_poi_to_person_ratio_n","from_person_to_poi_ratio_n"])
#print features_list[:]

As our first algorithm, we evaluate the logistic regression specification with the two transformed income variables and the number of emails shared with persons-of-interest as our regressors. Please note that certain other regressors were dropped from the model progressively due to low statistical significance. The results show us that the precision and recall for this model are 0.57 and 0.2, respectively. This implies that the model is successful 1 in 5 times in identifying the poi-s given that the person is actually a poi. 

In [75]:
features_list_logit = ["poi","finance_1_n","finance_2_n","shared_with_poi_n"]
test_classifier(logit_model_2, data_dict, features_list_logit, folds = 100)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
	Accuracy: 0.87333	Precision: 0.57143	Recall: 0.20000	F1: 0.29630	F2: 0.22989
	Total predictions: 1500	True positives:   40	False positives:   30	False negatives:  160	True negatives: 1270
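
The scores printed above can be reproduced by hand from the four counts. The worked arithmetic below assumes the standard definitions of precision, recall, and the F-beta score; the source of the tester helper is not shown here, so this is a consistency check rather than a description of its internals.

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{40}{40 + 30} \approx 0.571 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{40}{40 + 160} = 0.200 \\
F_\beta &= \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \\
F_1 &= \frac{2 \cdot 40}{2 \cdot 40 + 160 + 30} = \frac{80}{270} \approx 0.296 \\
F_2 &= \frac{5 \cdot 40}{5 \cdot 40 + 4 \cdot 160 + 30} = \frac{200}{870} \approx 0.230 \\
\text{Accuracy} &= \frac{TP + TN}{\text{Total}} = \frac{40 + 1270}{1500} \approx 0.873
\end{aligned}
```

Since F2 weights recall more heavily than precision, the low recall pulls F2 below F1 for this model.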



Next, we evaluated the decision tree model we developed with two of the features from the entire dataset. The first is the number of emails shared between a given person and a poi, and the second is the ratio of the messages received from poi-s to the total messages received. Once again, we started with all five engineered features but dropped three of them as their Gini-based importance (an indication of the information contained in those features) was relatively low. The results below indicate that the decision tree model is successful in identifying roughly 1 in every 2 poi-s given that they are actually poi-s.

In [76]:
features_list_dtree = ["poi","shared_with_poi_n","from_poi_to_person_ratio_n"]
test_classifier(dtree_model_2, data_dict, features_list_dtree, folds = 100)

DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')
	Accuracy: 0.84067	Precision: 0.42412	Recall: 0.54500	F1: 0.47702	F2: 0.51561
	Total predictions: 1500	True positives:  109	False positives:  148	False negatives:   91	True negatives: 1152



Our third algorithm for this project is the Naive Bayes classifier. We included all five engineered variables for this model as there was no way to determine individual significance of each of the features in the dataset. The model performed relatively well given its simplicity; results showed that it was able to identify 1 in every 3 poi-s in the dataset.

In [77]:
features_list_nb = ["poi","finance_1_n","finance_2_n","shared_with_poi_n",\
                         "from_poi_to_person_ratio_n","from_person_to_poi_ratio_n"]
test_classifier(nb_model, data_dict, features_list_nb, folds = 100)

GaussianNB()
	Accuracy: 0.85467	Precision: 0.43750	Recall: 0.31500	F1: 0.36628	F2: 0.33369
	Total predictions: 1500	True positives:   63	False positives:   81	False negatives:  137	True negatives: 1219



The final classification algorithm that we evaluated was the Random Forest. Once again, we limited our feature set to only those features that had a relatively high Gini-based importance. The precision and recall values were 0.35 and 0.4, respectively, implying that the random forest classifier is able to identify 2 out of every 5 poi-s in the dataset.

In [78]:
features_list_rf = ["poi","finance_1_n","shared_with_poi_n",\
                         "from_poi_to_person_ratio_n"]
test_classifier(RF_model_2, data_dict, features_list_rf, folds = 100)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=4, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
	Accuracy: 0.82000	Precision: 0.34649	Recall: 0.39500	F1: 0.36916	F2: 0.38424
	Total predictions: 1500	True positives:   79	False positives:  149	False negatives:  121	True negatives: 1151



Based on the results presented above, the decision tree model outperformed the other algorithms considered in this project on recall (0.545) and F1 score (0.477). However, the feature set varied between the models, which makes the comparison inconsistent. Hence, in the final section of this project, I refit the logistic regression, naive Bayes, and random forest models on the two features used by the decision tree to obtain the precision, recall, and F1 score on a consistent feature set.

As expected, the precision, recall, and F1 score for the logit model dropped drastically as we switched from the optimal choice of features to a suboptimal choice.

In [79]:
clf = GridSearchCV(linear_model.LogisticRegression(), param_grid_logit)
clf = clf.fit(final_features[:,[2,3]], labels_train)
logit_model_comp = clf.best_estimator_

test_classifier(logit_model_comp, data_dict, features_list_dtree, folds = 100)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
	Accuracy: 0.86000	Precision: 0.18750	Recall: 0.01500	F1: 0.02778	F2: 0.01838
	Total predictions: 1500	True positives:    3	False positives:   13	False negatives:  197	True negatives: 1287



With the Naive Bayes classifier, we make a similar observation as with the logistic classifier.

In [80]:
gnb = GaussianNB()
nb_model_comp = gnb.fit(final_features[:,[2,3]], labels_train)

test_classifier(nb_model_comp, data_dict, features_list_dtree, folds = 100)

GaussianNB()
	Accuracy: 0.82333	Precision: 0.20183	Recall: 0.11000	F1: 0.14239	F2: 0.12101
	Total predictions: 1500	True positives:   22	False positives:   87	False negatives:  178	True negatives: 1213



However, the random forest based classifier showed a significant improvement in precision, recall, and F1 scores after switching to the new set of features.

In [81]:
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_RF_2)
clf = clf.fit(final_features[:,[2,3]], labels_train)
RF_model_comp = clf.best_estimator_

test_classifier(RF_model_comp, data_dict, features_list_dtree, folds = 100)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=4, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
	Accuracy: 0.85333	Precision: 0.45614	Recall: 0.52000	F1: 0.48598	F2: 0.50584
	Total predictions: 1500	True positives:  104	False positives:  124	False negatives:   96	True negatives: 1176
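
With all four models now scored on the same two features, the counts reported above can be placed side by side. This is a small sketch that only re-uses the true positive, false positive, and false negative counts printed in the outputs; the dictionary labels and the helper name `scores` are illustrative.

```python
# (true positives, false positives, false negatives) copied from the outputs above
counts = {
    "decision tree": (109, 148, 91),
    "logistic regression": (3, 13, 197),
    "naive Bayes": (22, 87, 178),
    "random forest": (104, 124, 96),
}

def scores(tp, fp, fn):
    """Return (precision, recall, F1) from raw prediction counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

table = {name: scores(*c) for name, c in counts.items()}
best_recall = max(table, key=lambda name: table[name][1])
best_f1 = max(table, key=lambda name: table[name][2])

for name in sorted(table):
    print("%-20s precision=%.3f recall=%.3f f1=%.3f" % ((name,) + table[name]))
```

On these counts the decision tree has the highest recall, while the retrained random forest is marginally ahead on precision and F1, so the ranking depends on which metric is given priority.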



Based on the precision and recall values, I chose to keep the decision tree model: it has the highest recall (0.545) and F2 score (0.516) of all the models evaluated, and its F1 score (0.477) is within 0.01 of the retrained random forest (0.486). As required by the framework of this project, I make a final call to one of the helper functions to write the classifier, dataset, and feature list to separate files so that one may validate the results and the findings of this project.

In [82]:
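# dump_classifier_and_data is defined in tester.py alongside test_classifier
from tester import dump_classifier_and_data

clf = dtree_model_2
features_list = features_list_dtree
dump_classifier_and_data(clf, data_dict, features_list)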

## Summary and Conclusions

As part of this project, I investigated the Enron fraud data with an objective of developing a classification model that can help us predict the persons of interest given their income information and summary statistics on the emails exchanged. Following are the steps I took and the findings I made as part of analyzing this dataset:
1. I started with loading the raw data and removing the outliers from the dataset. This also involved correcting the income information for two of the people in the dataset based on other reliable sources.
2. In the next step, I dropped the two artificial features, namely total_payment and total_stock_value, as they provided no additional information once all the other features are known.
3. In the next step, I undertook a two-sample independent t-test to retain those features whose means were statistically different for the two groups: poi-s and non-poi-s.
4. In the next step, I performed a principal component analysis with the objective of reducing the income variables to as many components as necessary to capture 95% of the variability in the income features. This resulted in two eigenvectors (principal components), as their eigenvalues summed to approximately 0.96.
5. In the fifth step, I included the number of emails shared between a person and a poi and two synthetic features. The first of these was the ratio of emails received by a person from a poi to the total number of emails received by that person, and the second was the ratio of emails sent by a person to a poi to the total number of emails sent by that person.
6. Lastly, I chose to standardize all the features so that they each have zero mean and a standard deviation of 1.
7. Following this, I evaluated five different classifiers, namely Logistic Regression, Decision Trees, Support Vector Classifier, Naive Bayes, and Random Forests. I began the model development process for each of them with all five engineered variables and gradually reduced the feature set based on statistical significance or the Gini-based importance (except for Naive Bayes and Support Vector Classifier). In the case of the support vector classifier, I was unable to identify even one of the poi-s in the dataset and hence decided to drop the classifier from further consideration. In addition, I also considered interaction effects as an additional specification for the Logistic Regression classifier, but it did not improve the model's performance.
8. Lastly, I hoped to improve the accuracy of the classifiers developed in the previous step by augmenting them with the information available from the Enron corpus. To that effect, my approach was to identify key terms that frequently feature in the emails sent by persons-of-interest in the dataset. Once the frequent terms are identified, they can be combined with the income and email summary features, and the combined dataset can then be used with a classification algorithm to achieve improved accuracy and F1 score. Unfortunately, the key terms identified by the text mining process proved unsuccessful in identifying the persons-of-interest in the test dataset.
9. In the final section of this project, we evaluated each of the algorithms developed in Step 7 using stratified samples. Interestingly, the logistic regression specification developed with the transformed finance features and the number of emails shared with poi-s fell short of 0.3 recall, despite having the highest precision (0.67) and recall (0.4) on the test data used earlier in this study. However, the decision tree, naive Bayes, and random forest specifications all scored higher than 0.3 on both precision and recall. The decision tree scored highest in terms of recall and F1 score. Another vital observation from this project was that the number of emails shared between any person and a poi proved to be the most powerful feature for identifying the poi-s in the dataset.

## Comment

While the five email-related features were useful during the course of this project, calculating precise values for three of them (email_sent_by_poi_to_person, email_sent_by_person_to_poi, and email_shared_with_poi) would require a priori knowledge of the poi-s in the dataset. Given the objective of this project (to identify the persons of interest), I suspect that the target variable's signature can be found in each of these three variables. This hypothesis was further strengthened when I noticed that the number of emails shared with a poi came out as the most powerful feature in the dataset.