#Udacity - NanoDegree - Project 4 - Identifying POI

##Project Goal
The goal of this project is to develop and tune a supervised classification algorithm to identify persons of interest (POI) in the Enron scandal based on a combination of publically available Enron financial data and email records. The modest goals are to have recall and precision scores above 0.3.  

The compiled data set contains information for 144 people employed at Enron, a ‘Travel Agency in the Park’, and the total compensations for all of these sources. Additionally, the Udacity.com course designer created email features that give the total number of e-mails sent and received for each user, and the total number of emails sent to and receieved from a POI. Of the 144 people, 18 of them are labeled as POIs.

Lets start by reading in the data;


In [1]:
import sys
import pickle
import numpy as np
import pandas as pd

#Load Udacity Tools for Testing Results
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import test_classifier, dump_classifier_and_data

### Load the dictionary containing the dataset
data_dict = pickle.load(open("final_project_dataset.pkl", "r") )
data = pd.DataFrame.from_dict(data_dict)
#Remove Invalide Rows
data = data.drop(['TOTAL','THE TRAVEL AGENCY IN THE PARK'],axis=1)
data = data.transpose()
#Give Each Person their own row
data

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
ALLEN PHILLIP K,4175000,2869717,-3081055,,phillip.allen@enron.com,1729541,13868,2195,47,65,...,304805,152,False,126027,-126027,201955,1407,2902,4484442,1729541
BADUM JAMES P,,178980,,,,257817,3486,,,,...,,,False,,,,,,182466,257817
BANNANTINE JAMES M,,,-5104,,james.bannantine@enron.com,4046157,56301,29,39,0,...,,864523,False,1757552,-560222,477,465,566,916197,5243487
BAXTER JOHN C,1200000,1295738,-1386055,,,6680544,11200,,,,...,1586055,2660303,False,3942714,,267102,,,5634343,10623258
BAY FRANKLIN R,400000,260455,-201641,,frank.bay@enron.com,,129142,,,,...,,69,False,145796,-82782,239671,,,827696,63014
BAZELIDES PHILIP J,,684694,,,,1599641,,,,,...,93750,874,False,,,80818,,,860136,1599641
BECK SALLY W,700000,,,,sally.beck@enron.com,,37172,4343,144,386,...,,566,False,126027,,231330,2639,7315,969068,126027
BELDEN TIMOTHY N,5249999,2144013,-2334434,,tim.belden@enron.com,953136,17355,484,228,108,...,,210698,True,157569,,213999,5521,7991,5501630,1110705
BELFER ROBERT,,-102500,,3285,,3285,,,,,...,,,False,,44093,,,,102500,-44093
BERBERIAN DAVID,,,,,david.berberian@enron.com,1624396,11892,,,,...,,,False,869220,,216582,,,228474,2493616


In regards to the financial information, I defined outliers as having values that are more than 3 standard deviations from the mean value for the group. This is not the traditional definition or criteria for an outlier which is 1.5 times the interquartile range below the first quartile or Above the third quartile. I used my definition after I have replaced missing values with zero. Using this definition of outliers there are 25 people in the data set that are financial outliers:

In [3]:
finance = ['salary',
             'deferral_payments',
             'total_payments',
             'exercised_stock_options',
             'bonus',
             'restricted_stock',
             'restricted_stock_deferred',
             'total_stock_value',
             'expenses',
             'loan_advances',
             'other',
             'director_fees',
             'deferred_income',
             'long_term_incentive']

from scipy import stats
#Use unique because some people are financial outliers in multiple variables
data[np.abs(stats.zscore(data[finance].replace('NaN',0.0))) > 3].index.unique()

array(['ALLEN PHILLIP K', 'BELDEN TIMOTHY N', 'BHATNAGAR SANJAY',
       'BLAKE JR. NORMAN P', 'FREVERT MARK A', 'GRAMM WENDY L',
       'HANNON KEVIN P', 'HIRKO JOSEPH', 'HORTON STANLEY C',
       'HUMPHREY GENE E', 'JAEDICKE ROBERT', 'LAVORATO JOHN J',
       'LAY KENNETH L', 'LEMAISTRE CHARLES', 'MARTIN AMANDA K',
       'MCCLELLAN GEORGE', 'MENDELSOHN JOHN', 'PAI LOU L',
       'RICE KENNETH D', 'SAVAGE FRANK', 'SHANKMAN JEFFREY A',
       'SKILLING JEFFREY K', 'URQUHART JOHN A', 'WAKEHAM JOHN',
       'WHITE JR THOMAS E', 'WINOKUR JR. HERBERT S'], dtype=object)

This accounts for 17% of the data being considered an outlier and also contains 33% of the POI. Ultimately I decided that financial outliers were relevant information, and decided not to remove them from the data for this analysis.

##New Feature

I decided to create two features I thought were be informative to the data: the ratio of emails received from a POI to the total number of emails received and the ratio of emails sent to a POI to the total number of emails sent.
I thought these features would be more informative than the total number of emails sent or received from a POI because it normalizes by how active or how popular a person is.

If a person sends 10 emails, and 5 of them are to a POI, that seems more relevant than if a person sends 1000 emails and 5 of them are to a POI. A person who sends 50% of their emails to a person of interest seems more suspect than a person who only sends 0.5% of their emails to a POI.

The inverse is not true, however. If a person received 10 emails from a POI, that is as relevant regardless if the person receives 20 emails or 2000 emails. The important idea is that how often is a POI is contacting this person, and how does that affect the likelihood that a person is also a POI. This ratio does not capture that idea, but I created it because I was willing to be proven wrong.

In [4]:
data.from_this_person_to_poi = data.from_this_person_to_poi.astype(float)
data.from_poi_to_this_person = data.from_poi_to_this_person.astype(float)
data.to_messages = data.to_messages.astype(float)
data.from_messages = data.from_messages.astype(float)
data['from_this_person_to_poi_ratio'] = data.from_this_person_to_poi/data.to_messages
data['from_poi_to_this_person_ratio'] = data.from_poi_to_this_person/data.from_messages
data = data.replace('NaN',0.0)

After the creation of these features I used sklearn's Pipeline, FeatureUnion, and GridSearch to search through a combintation of variable to find the set of features that gave the best performance for a Decision Tree Classifier.  I search between 1 and 9 of the best features using a 5 folds stratified cross validation to gauge performance for each set.  I am using a f1 score to maximize the weight combintation of recall and precision. 

In [16]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve
from sklearn.learning_curve import learning_curve

tru = data.poi.values
trn = data.drop(['poi','email_address'],axis=1).values
clf = DecisionTreeClassifier()

#pca = PCA(n_components=2)
selection = SelectKBest(k=1)
#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
combined_features = FeatureUnion([("univ_select", selection)])

X_features = combined_features.fit(trn, tru).transform(trn)

pipeline = Pipeline([("features", combined_features), ("clf", clf)])

#param_grid = dict(features__pca__n_components=range(0,15),
#                  features__univ_select__k=range(1,10))

param_grid = dict(features__univ_select__k=range(1,10))

skf = StratifiedKFold(tru, n_folds=5)
grid_search = GridSearchCV(pipeline, cv=skf, param_grid=param_grid, scoring='f1')
grid_search.fit(trn, tru)
print grid_search.best_score_
print grid_search.best_params_
var_scores = grid_search.best_estimator_.steps[0][1].get_params()['univ_select'].scores_
print var_scores
num_features = grid_search.best_params_['features__univ_select__k']
best_features = data.drop(['poi','email_address'],axis=1).columns[np.argsort(var_scores)[::-1][:num_features]]
best_features

0.36813973064
{'features__univ_select__k': 3}
[ 21.06000171   0.21705893  11.59554766   2.10765594  25.09754153
   6.23420114   0.1641645    5.34494152   2.42650813   7.2427304
  10.07245453   4.24615354   9.34670079   0.06498431  18.57570327
   8.74648553   1.69882435   8.87383526  24.46765405   4.16908382
   5.20965022]


Index([u'exercised_stock_options', u'total_stock_value', u'bonus'], dtype='object')

##Result of Search

The search is finished and the best f1 score of 0.368 with 3 'best features'.  Because of the random shuffling involved in the scoring, it is possible to get a variety in number and combintations of best features.  

The best features for predicting if a person is a person of interest are financal features: Exercised stock options, total stock value, and bonus.  Even when the number of best features fluxuate, these 3 are always among them.   

##Turning Model
Using the paramemters from the above search we will now tune the classifier to optimize its performance. I am searching through the spliting and depth criteria for the Decision Tree Classifier.  


In [17]:
tru = data.poi.values
trn = data[best_features].values
clf = DecisionTreeClassifier()
pipeline = Pipeline([("clf", clf)])

param_grid = dict(clf__criterion=("gini","entropy"),
                  clf__min_samples_split=[1,2,4,8,16,32],
                   clf__min_samples_leaf=[1,2,4,8,16,32],
                   clf__max_depth=[None,1,2,4,8,16,32])

skf = StratifiedKFold(tru, n_folds=5)
grid_search = GridSearchCV(pipeline, cv=skf, param_grid=param_grid, scoring='f1')
grid_search.fit(trn, tru)
print grid_search.best_params_
print grid_search.best_score_
best_clf = grid_search.best_estimator_
best_clf

{'clf__criterion': 'gini', 'clf__max_depth': 32, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 1}
0.47075617284


Pipeline(steps=[('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=32,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=1, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best'))])

The best parameters for the Decision Tree Classifier has a max depth of 32, a min sample leaf size of 1, and requires at least 1 values to split the tree when fitting.  These results produced the best averge f1 score of 0.47.   

##Importance of Validation

Validation is an attempt to confirm that a model will give reasonable or consistent results on new, untrained data. A classic mistake is to test the results of a model on the data used to train the model. This is no doubt give the best possible score, but can over fit the data leading to less than desired results on new data. Validation protects against this mistake by training the model on one set of data and testing on yet another.

I used 5-Fold Statfied Cross Validation for investigating and comparing algorithms in this analysis. This is where there the algorithm is trained on the on the data 5 times using 80% of the data as a training set and 20% of the data as a testing set.  Each set has approxiamately the same ratio of positive and negative examples of POI.  Each time this is done, the training and testing set are shuffled to create a new 80/20 split on the data. I then used the average performance as an estimate of its performance on new data.

## Final Performance

Using the Udacity 'test_classifier' function I evaluate my classifer to see if the recall and precision scores are both greater than 0.3.


In [18]:
#tt = pd.DataFrame(pca.fit_transform(trn),index=data.index)
#data2 = data.join(tt)
#print data2.columns
my_data = data.transpose().to_dict()
features =  best_features.tolist()
features = ['poi']+features
test_classifier(best_clf, my_data, features)

Pipeline(steps=[('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=32,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=1, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best'))])
	Accuracy: 0.80492	Precision: 0.37066	Recall: 0.38400	F1: 0.37721	F2: 0.38125
	Total predictions: 13000	True positives:  768	False positives: 1304	False negatives: 1232	True negatives: 9696



The results of the tuned classifier are shown above.   The average precision of the 13000 prediction is 37% and the average recall is 38%.   The precision value is that out of all predictions of people being a POI, 38% of them are actually POI.  The recall value is that out of all of the POI, 37% are actually predicted to be POI by my tuned classifier.  