# Identifying Fraud in the Enron Dataset with Machine Learning
## Adam Wright
## July 18, 2016

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. In this project I will use this data to build and validate a series of supervised machine learning algorithms to classify Enron employees as Persons of Interest (POIs) for potential fraud.

### Import Python Modules

In [1]:
import sys
import pickle
import numpy
import pandas
import sklearn
import time
from ggplot import *
import matplotlib
%matplotlib inline

sys.path.append('c:\\Users\\Adam\\Udacity\\Intro_to_Machine_Learning\\ud120-projects\\tools')
from feature_format import featureFormat, targetFeatureSplit

sys.path.append('c:\\Users\\Adam\\Udacity\\Intro_to_Machine_Learning\\ud120-projects\\final_project')
from tester import test_classifier, dump_classifier_and_data

### Load Data

I begin by loading the data, determining the number of cases, and examining a representative case.

In [2]:
enron_data = pickle.load( \
    open("c:\\Users\\Adam\\Udacity\\Intro_to_Machine_Learning\\ud120-projects\\final_project\\final_project_dataset.pkl", "r"))
print '{} Enron employees'.format(len(enron_data.keys()))
print '{} features in dataset'.format(len(enron_data['SKILLING JEFFREY K'].keys()))
print enron_data['SKILLING JEFFREY K']

146 Enron employees
21 features in dataset
{'salary': 1111258, 'to_messages': 3627, 'deferral_payments': 'NaN', 'total_payments': 8682716, 'exercised_stock_options': 19250000, 'bonus': 5600000, 'restricted_stock': 6843672, 'shared_receipt_with_poi': 2042, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 26093672, 'expenses': 29336, 'loan_advances': 'NaN', 'from_messages': 108, 'other': 22122, 'from_this_person_to_poi': 30, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 1920000, 'email_address': 'jeff.skilling@enron.com', 'from_poi_to_this_person': 88}


It would appear that there are 146 cases and 21 features in the dataset. The data is labeled with a boolean 'poi' indicator which - as in the case of Jeff Skilling - is set to true for employees subsequently indicted for fraud. As expected, there also appear to be two broad classes of features - financial and email. Unfortunately, even in the high profile case of Jeff Skilling, there appears to be a fair bit of missing data.
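As a rough illustration of how the missing values can be tallied straight from the raw dictionary, the sketch below counts 'NaN' strings per feature. It uses a small made-up dictionary (`toy_data`) in place of `enron_data`, and the helper name `count_missing` is hypothetical. It is written in plain Python so that it also runs under Python 3.

```python
from collections import Counter

# Toy stand-in for enron_data: a dict of per-person feature dicts in which
# missing values are stored as the string 'NaN' (all values are made up).
toy_data = {
    'PERSON A': {'salary': 100000, 'bonus': 'NaN', 'poi': False},
    'PERSON B': {'salary': 'NaN', 'bonus': 'NaN', 'poi': False},
    'PERSON C': {'salary': 250000, 'bonus': 50000, 'poi': True},
}


def count_missing(data, marker='NaN'):
    """Return a Counter mapping feature name -> number of missing entries."""
    missing = Counter()
    for features in data.values():
        for name, value in features.items():
            if value == marker:
                missing[name] += 1
    return missing


missing_counts = count_missing(toy_data)
print(sorted(missing_counts.items()))
```

The pandas route taken in the next section gives the same kind of per-feature count with far less code once the 'NaN' strings have been converted to real NaN values.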

### Missing Data

In order to more efficiently deal with missing data and identify outliers, I convert the data dictionary into a pandas dataframe.

In [3]:
df = pandas.DataFrame.from_dict(enron_data, orient = 'index')
print df.head()

                    salary to_messages deferral_payments total_payments  \
ALLEN PHILLIP K     201955        2902           2869717        4484442   
BADUM JAMES P          NaN         NaN            178980         182466   
BANNANTINE JAMES M     477         566               NaN         916197   
BAXTER JOHN C       267102         NaN           1295738        5634343   
BAY FRANKLIN R      239671         NaN            260455         827696   

                   exercised_stock_options    bonus restricted_stock  \
ALLEN PHILLIP K                    1729541  4175000           126027   
BADUM JAMES P                       257817      NaN              NaN   
BANNANTINE JAMES M                 4046157      NaN          1757552   
BAXTER JOHN C                      6680544  1200000          3942714   
BAY FRANKLIN R                         NaN   400000           145796   

                   shared_receipt_with_poi restricted_stock_deferred  \
ALLEN PHILLIP K                       1407  

Next, I examine the missing data.

In [4]:
# Convert to numpy nan
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)

# Count number of NaN's for columns
print df.isnull().sum()

# DataFrame dimensions
print df.shape

salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
dtype: int64
(146, 21)


Examining the list of missing data, I am not too concerned by the financial fields that have relatively few values as it is plausible that few employees would have been eligible for that sort of compensation. I am however concerned by the cases for which I have no total payment information and/or no email address as that could indicate a complete lack of financial and/or email data for those cases. They merit further investigation.

In [5]:
df.loc[df.total_payments.isnull()]

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
CHAN RONNIE,,,,,,,32460.0,,-32460.0,,...,,,,,False,98784.0,-98784.0,,,
CHRISTODOULOU DIOMEDES,,,,,5127155.0,,950730.0,,,6077885.0,...,,,,,False,,,,diomedes.christodoulou@enron.com,
CLINE KENNETH W,,,,,,,662086.0,,-472568.0,189518.0,...,,,,,False,,,,,
CORDES WILLIAM R,,764.0,,,651850.0,,386335.0,58.0,,1038185.0,...,,12.0,,0.0,False,,,,bill.cordes@enron.com,10.0
FOWLER PEGGY,,517.0,,,1324578.0,,560170.0,10.0,,1884748.0,...,,36.0,,0.0,False,,,,kulvinder.fowler@enron.com,0.0
GATHMANN WILLIAM D,,,,,1753766.0,,264013.0,,-72419.0,1945360.0,...,,,,,False,,,,,
GILLIS JOHN,,,,,9803.0,,75838.0,,,85641.0,...,,,,,False,,,,,
HAYSLETT RODERICK J,,2649.0,,,,,346663.0,571.0,,346663.0,...,,1061.0,,38.0,False,,,,rod.hayslett@enron.com,35.0
HUGHES JAMES A,,719.0,,,754966.0,,363428.0,589.0,,1118394.0,...,,34.0,,5.0,False,,,,james.hughes@enron.com,35.0
LEWIS RICHARD,,952.0,,,850477.0,,,739.0,,850477.0,...,,26.0,,0.0,False,,,,richard.lewis@enron.com,10.0


It looks like the people without any payments from Enron did receive equity compensation and, in several cases, exchanged emails with POIs - though none of these individuals were POIs themselves. After some research into the individual names, it appears that some of them were board members (e.g. Ronnie Chan), which makes their compensation sensible and is a strong indicator that I do not want to exclude this cohort from my analysis. This is also evidence that a sensible imputation value for financial information would be zero, as missing data seems to indicate a lack of that form of compensation.

In [6]:
df.loc[df.email_address.isnull()]

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
BADUM JAMES P,,,178980.0,182466.0,257817.0,,,,,257817.0,...,,,,,False,,,,,
BAXTER JOHN C,267102.0,,1295738.0,5634343.0,6680544.0,1200000.0,3942714.0,,,10623258.0,...,,,2660303.0,,False,,-1386055.0,1586055.0,,
BAZELIDES PHILIP J,80818.0,,684694.0,860136.0,1599641.0,,,,,1599641.0,...,,,874.0,,False,,,93750.0,,
BELFER ROBERT,,,-102500.0,102500.0,3285.0,,,,44093.0,-44093.0,...,,,,,False,3285.0,,,,
BLAKE JR. NORMAN P,,,,1279.0,,,,,,,...,,,,,False,113784.0,-113784.0,,,
CHAN RONNIE,,,,,,,32460.0,,-32460.0,,...,,,,,False,98784.0,-98784.0,,,
CLINE KENNETH W,,,,,,,662086.0,,-472568.0,189518.0,...,,,,,False,,,,,
CUMBERLAND MICHAEL S,184899.0,,,807956.0,,325000.0,207940.0,,,207940.0,...,,,713.0,,False,,,275000.0,,
DUNCAN JOHN H,,,,77492.0,371750.0,,,,,371750.0,...,,,,,False,102492.0,-25000.0,,,
FUGH JOHN L,,,50591.0,50591.0,176378.0,,,,,176378.0,...,,,,,False,,,,,


As I suspected, I am missing all email data variables for people for whom I lack an email address. These individuals did however receive significant and varied compensation, and none of them are POIs. This is useful information - even highly compensated individuals who did not communicate with the POIs appear to have been above suspicion, and I therefore do not want to exclude this cohort from the dataset.

As before, examining this data leads me to believe that imputing a value of 0 for missing email data will be the best approach. The fact that an individual did not have any recorded communication with a POI is clearly predictive and should be recorded.

This view also uncovers two obvious problem cases who are not people: 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'. The total case looks like a summary entry and the travel agency just looks like a data entry mistake. They will both be removed.

In [7]:
# drop 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'
df_drop = df.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK'], axis = 0)
print df_drop.shape

(144, 21)


I will now set missing data values equal to zero for both financial and email features, as discussed above.

In [8]:
df_imp = df_drop.replace(to_replace=numpy.nan, value=0)
print df_imp.shape
print df_imp.isnull().sum()

(144, 21)
salary                       0
to_messages                  0
deferral_payments            0
total_payments               0
exercised_stock_options      0
bonus                        0
restricted_stock             0
shared_receipt_with_poi      0
restricted_stock_deferred    0
total_stock_value            0
expenses                     0
loan_advances                0
from_messages                0
other                        0
from_this_person_to_poi      0
poi                          0
director_fees                0
deferred_income              0
long_term_incentive          0
email_address                0
from_poi_to_this_person      0
dtype: int64


### Outliers

As a final step prior to feature selection, I will examine the data for any potential outliers.

In [9]:
df_imp.describe()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,expenses,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,from_poi_to_this_person
count,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144.0,144,144.0,144.0,144.0,144.0
mean,185446.034722,1238.555556,222089.555556,2256543.0,2075801.979167,675997.354167,868536.291667,702.611111,73417.902778,2909785.611111,35375.340278,582812.5,363.583333,294745.534722,24.625,0.125,9980.319444,-193683.270833,336957.833333,38.756944
std,197042.123807,2237.564816,754101.302578,8847189.0,4795513.145239,1233155.255938,2016572.388715,1077.290736,1301983.390377,6189018.075043,45309.303038,6794471.77894,1450.675239,1131325.452833,79.778266,0.331873,31300.575144,606011.13512,687182.567651,74.276769
min,0.0,0.0,-102500.0,0.0,0.0,0.0,-2604490.0,0.0,-1787380.0,-44093.0,0.0,0.0,0.0,0.0,0.0,False,0.0,-3504386.0,0.0,0.0
25%,0.0,0.0,0.0,90192.75,0.0,0.0,24345.0,0.0,0.0,244326.5,0.0,0.0,0.0,0.0,0.0,0,0.0,-37086.0,0.0,0.0
50%,210596.0,347.5,0.0,941359.5,608293.5,300000.0,360528.0,114.0,0.0,965955.0,20182.0,0.0,17.5,919.0,0.0,0,0.0,0.0,0.0,4.0
75%,269667.5,1623.0,8535.5,1945668.0,1683580.25,800000.0,737456.0,933.75,0.0,2295175.75,53328.25,0.0,53.0,148577.0,14.0,0,0.0,0.0,374586.25,41.25
max,1111258.0,15149.0,6426990.0,103559800.0,34348384.0,8000000.0,14761694.0,5521.0,15456290.0,49110078.0,228763.0,81525000.0,14368.0,10359729.0,609.0,True,137864.0,0.0,5145434.0,528.0


By Tukey's definition of even extreme outliers (i.e. values below the 25th percentile - 3 * IQR or above the 75th percentile + 3 * IQR) many of the values in the table above qualify. However, after examining them closely, the extremely high max values are without exception attributable to one of the well known principals at Enron (e.g. Kenneth Lay, Jeffrey Skilling) who were also obviously POIs. In this case these extreme values reflect valuable information - people who were outliers in their compensation were also very likely to be POIs. Thus, despite these cases meeting the technical definition of outliers, I will not exclude them as they contain valuable, predictive information.
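To make the outlier rule concrete, here is a minimal sketch of the fence calculation with a multiplier of 3 for extreme outliers. The numbers in `sample` are made up rather than taken from the dataset, the function name `tukey_fences` is hypothetical, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def tukey_fences(values, k=3.0):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    # method='inclusive' interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up salary-like values with one very large entry.
sample = [0, 0, 200000, 210000, 240000, 260000, 270000, 1100000]
lower, upper = tukey_fences(sample)
extreme = [v for v in sample if v < lower or v > upper]
print(extreme)
```

Only the largest value falls outside the fences here, which mirrors the pattern described above: the flagged cases are the few people with very large compensation.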

### Feature Selection

Having cleaned the data, I will now use my intuition to create/select the features that will give my algorithms the most predictive power.

I will begin with the email features. As a person's email address has no predictive value, it is dropped:

In [10]:
df_noemailaddr = df_imp.drop(['email_address'], axis = 1)
df_noemailaddr.shape

(144, 20)

That leaves me with five email features: to_messages, from_messages, shared_receipt_with_poi, from_this_person_to_poi, and from_poi_to_this_person. Intuitively, people who send more emails to or receive more emails from POIs are more likely to be POIs themselves. However, if there is a large amount of variation in total email volume between persons, a simple count of POI emails might simply be indicative of a heavy email user rather than malfeasance. So, I will first examine the range of emails that people sent and received:

In [14]:
print 'Most received emails: {}'.format(df_noemailaddr['to_messages'].max(axis = 0))
print 'Fewest received emails: {}'.format(df_noemailaddr['to_messages'].min(axis = 0))
print 'Most sent emails: {}'.format(df_noemailaddr['from_messages'].max(axis = 0))
print 'Fewest sent emails: {}'.format(df_noemailaddr['from_messages'].min(axis = 0))

Most received emails: 15149.0
Fewest received emails: 0.0
Most sent emails: 14368.0
Fewest sent emails: 0.0


The variation in email volume is huge, so some sort of correction needs to be made to account for it. A simple way to control for volume is to make the feature of interest the share of a person's emails that were sent to or received from a POI. This way a high volume emailer will have to send/receive a lot of emails to/from POIs to stand out, while a low volume emailer whose emails mostly involved POIs will also stand out. To control for the potential confounder of email volume, I create three new features: poi_ratio (emails from and to POIs/emails sent and received), to_poi_ratio (emails to POI/emails sent), and from_poi_ratio (emails from POI/emails received).

In [19]:
poi_ratio = (df_noemailaddr['from_poi_to_this_person'] + df_noemailaddr['from_this_person_to_poi']) / \
(df_noemailaddr['from_messages'] + df_noemailaddr['to_messages'])
to_poi_ratio = (df_noemailaddr['from_this_person_to_poi']) / (df_noemailaddr['from_messages'])
from_poi_ratio = (df_noemailaddr['from_poi_to_this_person']) / (df_noemailaddr['to_messages'])

df_noemailaddr['poi_ratio'] = pandas.Series(poi_ratio)
df_noemailaddr['to_poi_ratio'] = pandas.Series(to_poi_ratio)
df_noemailaddr['from_poi_ratio'] = pandas.Series(from_poi_ratio)

df_emails = df_noemailaddr.drop(['to_messages', 'from_messages', 'from_this_person_to_poi', 'from_poi_to_this_person'], axis = 1)
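The ratio features can also be sketched without pandas. In the cell above, people with no emails produce a 0/0 division, which pandas stores as NaN and which is imputed as 0 further on; the hypothetical helper `safe_ratio` below folds both steps into one. The three people and their counts are invented for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return float(numerator) / denominator if denominator else 0.0


# Hypothetical people: (emails to POIs, emails sent, emails from POIs, emails received)
people = {
    'heavy emailer': (30, 3000, 60, 6000),
    'light emailer': (9, 10, 2, 40),
    'no email data': (0, 0, 0, 0),
}

ratios = {}
for name, (to_poi, sent, from_poi, received) in people.items():
    ratios[name] = {
        'to_poi_ratio': safe_ratio(to_poi, sent),
        'from_poi_ratio': safe_ratio(from_poi, received),
        'poi_ratio': safe_ratio(to_poi + from_poi, sent + received),
    }

for name in sorted(ratios):
    print('{}: {}'.format(name, ratios[name]))
```

Although the heavy emailer has far more POI emails in absolute terms, the light emailer has the much larger to_poi_ratio (0.9 versus 0.01), which is exactly the effect the ratio features are meant to capture.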

Next I consider the compensation features. There are 15 different compensation features, many of which (e.g. salary and total payments) overlap. Rather than trying to parse this complicated feature space individually, my plan will be to use L1-based feature selection in a pipeline. Specifically, since this is a classification problem, I will use a Linear SVM to select the best features.

In [24]:
# impute email ratios for people without emails as 0
df_final = df_emails.replace(to_replace=numpy.nan, value=0)

print df_final.head()
print df_final.shape
print df_final.loc['SKILLING JEFFREY K']

                    salary  deferral_payments  total_payments  \
ALLEN PHILLIP K     201955            2869717         4484442   
BADUM JAMES P            0             178980          182466   
BANNANTINE JAMES M     477                  0          916197   
BAXTER JOHN C       267102            1295738         5634343   
BAY FRANKLIN R      239671             260455          827696   

                    exercised_stock_options    bonus  restricted_stock  \
ALLEN PHILLIP K                     1729541  4175000            126027   
BADUM JAMES P                        257817        0                 0   
BANNANTINE JAMES M                  4046157        0           1757552   
BAXTER JOHN C                       6680544  1200000           3942714   
BAY FRANKLIN R                            0   400000            145796   

                    shared_receipt_with_poi  restricted_stock_deferred  \
ALLEN PHILLIP K                        1407                    -126027   
BADUM JAMES P   

The resulting dataframe of 144 cases has 18 features to explain whether or not an individual case is a POI.

### Split Data into Training and Test Sets

With my final dataset in hand, I can split it into training and test sets. Due to the small sample size, I will use a Stratified Shuffle Split in order to create more permutations of testing and training data.

In [27]:
from sklearn import cross_validation

labels = df_final['poi']
features = df_final.drop('poi', axis = 1)
shuffle = sklearn.cross_validation.StratifiedShuffleSplit(labels, n_iter=10, test_size=0.1, random_state=0)
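For intuition about what a stratified shuffle split does, the following is a simplified pure-Python sketch, not scikit-learn's implementation: it shuffles the indices of each class separately and sends roughly `test_size` of each class to the test set, so every split keeps a similar share of POIs. The function name and the per-class rounding rule are assumptions of this sketch. The toy labels use 18 POIs out of 144 cases, matching the mean of 0.125 for 'poi' in the summary table above.

```python
import random


def stratified_shuffle_split(labels, n_iter=10, test_size=0.1, seed=0):
    """Yield (train_idx, test_idx) pairs that roughly preserve class balance."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_iter):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Keep at least one test case per class so rare labels are represented.
            n_test = max(1, int(round(len(shuffled) * test_size)))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)


# 144 cases, 18 of them POIs.
toy_labels = [True] * 18 + [False] * 126
splits = list(stratified_shuffle_split(toy_labels))
train_idx, test_idx = splits[0]
n_poi_in_test = sum(toy_labels[i] for i in test_idx)
print('{} test cases, {} of them POIs'.format(len(test_idx), n_poi_in_test))
```

With so few POIs, an unstratified split could easily produce test sets with no POIs at all, which would make precision and recall impossible to estimate on those splits.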

### Train Classifiers

I will train three separate classifiers, and choose the best performing two to optimize.

#### Gaussian Naive Bayes

I envision the Naive Bayes, by far the simplest of my classifiers, as serving as a baseline to compare my other, more sophisticated classifiers against. I would be very surprised if it was the best performer.

In [28]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels, cv = 5)
print 'Gaussian Naive Bayes: {}'.format(numpy.mean(scores))

Gaussian Naive Bayes: 0.726042692939


#### Random Forest

My next classifier is a random forest, which fits a number of decision tree classifiers on random subsamples of the data and combines their predictions, so that the class favored across the trees becomes the forest's prediction. I restrict myself to the default of 10 estimators due to memory constraints.

In [29]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(n_estimators = 10)
scores = sklearn.cross_validation.cross_val_score(random_forest_clf, features, labels, cv = 5)
print 'Random Forest: {}'.format(numpy.mean(scores))

Random Forest: 0.882692939245
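One common way to picture how a forest turns many trees into a single answer is a majority vote over the trees' predicted classes. The sketch below shows only that voting step with invented votes; scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes, so treat this as a simplification.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from ten trees for a single employee (True = POI).
votes = [True, False, False, True, False, False, False, True, False, False]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```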


#### AdaBoost

My final classifier is the AdaBoost. AdaBoost fits a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases. My weak classifier will be a decision tree, like it was for the random forest.

In [72]:
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(ab_clf, features, labels, cv = 5)
print 'AdaBoost: {}'.format(numpy.mean(scores))

AdaBoost: 0.834400656814
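The reweighting idea behind AdaBoost can be sketched for a single boosting round. The code below follows the discrete two-class update (the SAMME rule with two classes): the weak learner's weighted error determines its vote weight `alpha`, and the weights of misclassified samples are multiplied by `exp(alpha)` before renormalizing. The function name `adaboost_round` and the ten-sample example are assumptions for illustration, and the sketch requires a weighted error strictly between 0 and 1.

```python
import math


def adaboost_round(weights, is_wrong, learning_rate=1.0):
    """One discrete AdaBoost (two-class SAMME) weight update.

    weights  -- current sample weights, summing to 1
    is_wrong -- booleans, True where the weak learner misclassified the sample
    Returns (alpha, new_weights).
    """
    # Weighted error of the weak learner on the current weights.
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # Vote weight of this learner: larger when the error is smaller.
    alpha = learning_rate * math.log((1.0 - err) / err)
    # Boost the misclassified samples, then renormalize to sum to 1.
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, is_wrong)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]


weights = [0.1] * 10
is_wrong = [False] * 8 + [True] * 2
alpha, new_weights = adaboost_round(weights, is_wrong)
print('alpha = {:.3f}'.format(alpha))
```

After this round the two misclassified samples carry half of the total weight between them, so the next weak learner is pushed to get them right.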


### Parameter Tuning

As expected, the two best performing classifiers were the Random Forest and the AdaBoost. As a final step, I will create a pipeline for each that will use a linear SVC to select the best features and then optimize the parameters to produce results with the best possible precision and accuracy. At a minimum, I hope that this process will produce a classifier with both a precision and a recall greater than 0.3, leading to an F1 score also higher than 0.3.

In [209]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV

pipe_rf = Pipeline([('feat', SelectKBest()), ('clf', RandomForestClassifier())])

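K = range(1, 19)
# max_depth is used in the grid below but was not defined in this cell;
# the values here are an assumption chosen to include the reported optimum of 5
max_depth = [3, 5, 10, None]
min_samples_split = [1, 2, 3, 4, 5]
n_estimators = [10, 20, 50]
min_samples_leaf = [1, 2, 3, 4]
#criterion = ['gini', 'entropy']
criterion = ['gini']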

param_grid_rf = [{'feat__k': K,
              'clf__max_depth': max_depth,
              'clf__min_samples_split': min_samples_split,
              'clf__n_estimators': n_estimators,
              'clf__min_samples_leaf': min_samples_leaf,
              'clf__criterion': criterion}]

gs_rf = GridSearchCV(estimator = pipe_rf, param_grid = param_grid_rf)
gs_rf.fit(features, labels)

print gs_rf.best_params_
print gs_rf.best_score_

{'feat__k': 13, 'clf__n_estimators': 10, 'clf__criterion': 'gini', 'clf__max_depth': 5, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 3}
0.895833333333


In [210]:
features_list = list(features.columns)
feature_indices_rf = gs_rf.best_estimator_.named_steps['feat'].get_support(indices = True)
final_feature_list_rf = [features_list[i] for i in feature_indices_rf]

print final_feature_list_rf

['salary', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'loan_advances', 'deferred_income', 'long_term_incentive', 'poi_ratio', 'to_poi_ratio']


In [211]:
absent = []
for element in features_list:
    if element in final_feature_list_rf:
        pass
    else:
        absent.append(element)
print 'Dropped features: {}'.format(absent)

Dropped features: ['deferral_payments', 'restricted_stock_deferred', 'other', 'director_fees', 'from_poi_ratio']


In [212]:
print 'Feature scores: {}'.format(sorted(zip(features_list, gs_rf.best_estimator_.named_steps['feat'].scores_), 
                                         key = lambda x: x[1], reverse = True))

Feature scores: [('exercised_stock_options', 25.097541528735491), ('total_stock_value', 24.467654047526398), ('bonus', 21.060001707536571), ('salary', 18.575703268041785), ('to_poi_ratio', 16.641707070468989), ('deferred_income', 11.595547659730601), ('long_term_incentive', 10.072454529369441), ('restricted_stock', 9.3467007910514877), ('total_payments', 8.8738352555162319), ('shared_receipt_with_poi', 8.7464855321290802), ('loan_advances', 7.2427303965360181), ('expenses', 6.2342011405067401), ('poi_ratio', 5.5185055438125579), ('other', 4.2461535406760671), ('from_poi_ratio', 3.2107619169667441), ('director_fees', 2.1076559432760908), ('deferral_payments', 0.2170589303395084), ('restricted_stock_deferred', 0.06498431172371151)]


To optimize the Random Forest classifier, I create a two step pipeline wherein I first select the K best features and then optimize the Random Forest algorithm for those features from the Enron dataset. 

The optimal number of features turns out to be 13. The only features not included are deferral payments, restricted stock deferred, other payments, director fees, and ratio of emails from POIs. Having such a large pool of features increases the variance of the algorithm, making it more prone to overtraining. If the next classifier that I will discuss hadn't worked so well, it might have been fruitful to undertake further feature selection or some sort of dimensionality reduction (e.g. PCA) to try to decrease this feature space.
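The feature scores printed above come from SelectKBest's default scoring function, the one-way ANOVA F-value (`f_classif` in scikit-learn), which compares how far apart the class means are relative to the spread within each class. The sketch below computes that statistic for a single feature in plain Python; the function name `anova_f` and the two six-value toy features are made up for illustration.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))
    n_total = len(values)
    grand_mean = sum(float(v) for v in values) / n_total
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    # Variation of the class means around the grand mean.
    between = sum(len(g) * (means[label] - grand_mean) ** 2
                  for label, g in groups.items())
    # Variation of the samples around their own class mean.
    within = sum((x - means[label]) ** 2
                 for label, g in groups.items() for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (between / df_between) / (within / df_within)


toy_labels = [False, False, False, True, True, True]
f_informative = anova_f([1, 2, 3, 7, 8, 9], toy_labels)  # classes well separated
f_noisy = anova_f([1, 8, 3, 2, 7, 9], toy_labels)        # classes overlap
print('{:.2f} vs {:.2f}'.format(f_informative, f_noisy))
```

A feature whose values separate POIs from non-POIs cleanly gets a high F-value, which is why SelectKBest ranks it ahead of features whose class distributions overlap.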

The final step in my pipeline is to optimize the parameters of the Random Forest algorithm. The parameters that I pass to the GridSearchCV() function to optimize are (optimal value in parentheses):

1. min_samples_split (3) - the minimum number of cases a node must contain before an individual decision tree will split it further
2. n_estimators (10) - how many individual trees are grown and combined into the final forest output
3. min_samples_leaf (1) - the minimum number of samples required for a leaf node
4. max_depth (5) - the maximum number of successive splits allowed along a branch
5. criterion (gini) - the impurity measure used to evaluate candidate splits

The overall accuracy score of nearly 0.9 is solid.

In [213]:
from sklearn.tree import DecisionTreeClassifier

pipe_ada = Pipeline([('feat', SelectKBest()),
                     ('clf', AdaBoostClassifier(DecisionTreeClassifier()))])

K = range(1, 19)
n_estimators_ada = [5, 10, 30, 40, 45, 50, 100, 150]
#learning_rate = [0.1, 0.5, 1, 1.5, 2, 2.5, 3, 5]
learning_rate = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
algorithm = ['SAMME']

param_grid_ada = [{'feat__k': K,
                  'clf__n_estimators': n_estimators_ada,
                  'clf__learning_rate': learning_rate,
                  'clf__algorithm': algorithm}]

gs_ada = GridSearchCV(estimator = pipe_ada, param_grid = param_grid_ada)
gs_ada.fit(features, labels)

print gs_ada.best_params_

{'clf__learning_rate': 2.1, 'feat__k': 8, 'clf__algorithm': 'SAMME', 'clf__n_estimators': 150}


In [214]:
print gs_ada.best_score_

0.875


In [215]:
feature_indices_ada = gs_ada.best_estimator_.named_steps['feat'].get_support(indices = True)
final_feature_list_ada = [features_list[i] for i in feature_indices_ada]

print 'Final feature list: {}'.format(final_feature_list_ada)

Final feature list: ['salary', 'exercised_stock_options', 'bonus', 'restricted_stock', 'total_stock_value', 'deferred_income', 'long_term_incentive', 'to_poi_ratio']


In [216]:
print 'Feature scores: {}'.format(sorted(zip(features_list, gs_ada.best_estimator_.named_steps['feat'].scores_), 
                                         key = lambda x: x[1], reverse = True))

Feature scores: [('exercised_stock_options', 25.097541528735491), ('total_stock_value', 24.467654047526398), ('bonus', 21.060001707536571), ('salary', 18.575703268041785), ('to_poi_ratio', 16.641707070468989), ('deferred_income', 11.595547659730601), ('long_term_incentive', 10.072454529369441), ('restricted_stock', 9.3467007910514877), ('total_payments', 8.8738352555162319), ('shared_receipt_with_poi', 8.7464855321290802), ('loan_advances', 7.2427303965360181), ('expenses', 6.2342011405067401), ('poi_ratio', 5.5185055438125579), ('other', 4.2461535406760671), ('from_poi_ratio', 3.2107619169667441), ('director_fees', 2.1076559432760908), ('deferral_payments', 0.2170589303395084), ('restricted_stock_deferred', 0.06498431172371151)]


To optimize the AdaBoost classifier, I create a two step pipeline wherein I first select the K best features and then optimize the AdaBoost algorithm for those features from the Enron dataset. 

Eight features were selected as part of the pipeline. They were salary, exercised stock options, bonus, restricted stock, total stock value, deferred income, long term incentives, and ratio of emails sent to POIs. Intuitively these features make sense - all major forms of compensation as well as the ratio of emails that person sent to POIs. The fact that only eight features were selected in the pipeline is also reassuring as a smaller feature space is less prone to overtraining and should be better able to deal with novel cases (i.e. test data).

The final step in my pipeline is to optimize the parameters of the AdaBoost algorithm. The parameters that I pass to the GridSearchCV() function to optimize are (optimal value in parentheses):

1. n_estimators (150) - the maximum number of estimators at which boosting is terminated
2. learning_rate (2.1) - the rate at which the contribution of each classifier shrinks
3. algorithm (SAMME) - the boosting algorithm used 

The overall accuracy score of nearly 0.88 is also solid, though not as high as the Random Forest classifier's score of nearly 0.9. However, as we will see shortly, the higher accuracy of the Random Forest classifier does not capture important differences in recall and precision.
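To give a sense of how much work the AdaBoost grid search involves, the sketch below enumerates every parameter combination in the grid used above with `itertools.product`. The fold count `n_folds = 3` is an assumption (GridSearchCV is called without a `cv` argument, and 3-fold was the default in scikit-learn releases of that period), so the total number of fits is indicative only.

```python
from itertools import product

# Same values as the AdaBoost grid searched above.
param_grid = {
    'feat__k': list(range(1, 19)),
    'clf__n_estimators': [5, 10, 30, 40, 45, 50, 100, 150],
    'clf__learning_rate': [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
    'clf__algorithm': ['SAMME'],
}

names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_folds = 3  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds
print('{} combinations, {} pipeline fits'.format(len(combinations), total_fits))
```

An exhaustive search grows multiplicatively with each added parameter value, which is one reason the learning-rate list was narrowed to a small range around 2.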

### Validation

In order to validate my optimized classifiers I will make use of the test_classifier function provided in the tester.py script included with the project materials. The test_classifier function calculates the accuracy, precision, recall, F1, and F2 scores for the passed algorithm on many iterations of test/training splits from the passed data (i.e. it uses cross validation).

I will pay special attention to three validation metrics: precision, recall, and F1 score. Precision is defined as $\text{True Positives} / (\text{True Positives} + \text{False Positives})$. Here it measures the proportion of people identified as POIs who actually are POIs. Given the serious consequences of potentially selecting an innocent person for prosecution for corporate fraud, it is important that this metric be as high as possible.

Recall is defined as $\text{True Positives} / (\text{True Positives} + \text{False Negatives})$. For this data it measures the proportion of POIs who are correctly identified as being such. The lower the recall, the more fraudsters get away, so maximizing this metric is also important.

F1 score is the harmonic mean of precision and recall: $F_1 = 2 \cdot (\text{Precision} \cdot \text{Recall}) / (\text{Precision} + \text{Recall})$.

The goal is for one of my classifiers to score 0.3 or better on each of Precision, Recall, and F1.
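As a worked check of these definitions, the sketch below recomputes the metrics from the confusion counts that test_classifier reports for the two classifiers in the validation output further down. The helper name `prf` is hypothetical; the F-beta formula with beta = 2 is the standard definition of the F2 score.

```python
def prf(true_pos, false_pos, false_neg, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Counts taken from the test_classifier output for each classifier.
ada_precision, ada_recall, ada_f1 = prf(704, 1607, 1296)
ada_f2 = prf(704, 1607, 1296, beta=2.0)[2]
rf_precision, rf_recall, rf_f1 = prf(402, 625, 1598)

print('AdaBoost      P={:.5f} R={:.5f} F1={:.5f}'.format(ada_precision, ada_recall, ada_f1))
print('Random Forest P={:.5f} R={:.5f} F1={:.5f}'.format(rf_precision, rf_recall, rf_f1))
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two inputs: the Random Forest's precision of about 0.39 cannot compensate for its recall of about 0.20.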

In [217]:
rf_best_clf = estimator_rf.best_estimator_
list_cols = list(df_final.columns.values)
list_cols.remove('poi')
list_cols.insert(0, 'poi')
data = df_final[list_cols].fillna(0).to_dict(orient='records')
enron_data_sub = {}
counter = 0
for item in data:
    enron_data_sub[counter] = item
    counter += 1
    
test_classifier(rf_best_clf, enron_data_sub, list_cols)

Pipeline(steps=[('features', SelectFromModel(estimator=LinearSVC(C=1.5, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, threshold=None)...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
	Accuracy: 0.85180	Precision: 0.39143	Recall: 0.20100	F1: 0.26561	F2: 0.22267
	Total predictions: 15000	True positives:  402	False positives:  625	False negatives: 1598	True negatives: 12375



In [218]:
ada_best_clf = estimator_ada.best_estimator_
test_classifier(ada_best_clf, enron_data_sub, list_cols)

Pipeline(steps=[('features', SelectFromModel(estimator=LinearSVC(C=1.5, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        prefit=False, threshold=None)...andom_state=None, splitter='best'),
          learning_rate=2, n_estimators=40, random_state=None))])
	Accuracy: 0.80647	Precision: 0.30463	Recall: 0.35200	F1: 0.32661	F2: 0.34138
	Total predictions: 15000	True positives:  704	False positives: 1607	False negatives: 1296	True negatives: 11393



While the Random Forest classifier has good precision - i.e. it is good at not incorrectly classifying someone as a POI - it has relatively poor recall, allowing too many fraudsters to escape detection. This leads to an unacceptably low F1 score of 0.27. Fortunately, the AdaBoost classifier has both solid recall and precision, resulting in an F1 score of 0.33, above my goal of at least 0.3.

It is worth noting that despite having a lower F1 score the Random Forest actually has a *better* accuracy score. This is because it creates many fewer false positives (i.e. improperly identifies a person as a POI) compared to the AdaBoost. Considering the potential application of this classifier - using it to pinpoint who to investigate for potential indictment on fraud charges - a conservative algorithm might actually be preferable even if it lets more actual POIs get away. This is just a reminder that no single metric can be substituted for a holistic evaluation of an algorithm's output given its ultimate purpose.

### Dump Successful Classifier

For grading purposes.

In [219]:
dump_classifier_and_data(ada_best_clf, enron_data_sub, list_cols)

Associated materials can be found on Github: link 

### References

1. sklearn documentation: http://scikit-learn.org/stable/index.html
2. pandas documentation: http://pandas.pydata.org