<i>Note: the following is a reference document showing the test runs that led to the choice of the optimal parameters used in the classification algorithms.  It is included in the interest of reproducibility.  Although the question-and-answer write-up refers to it a few times, it is not necessary to read all of the following in order to understand the logic of the decisions made in this project!</i>

<H2>Data and code setup and initialization</H2>

In [1]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")
import numpy as np

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, load_classifier_and_data, test_classifier

from time import time
from copy import deepcopy

# NOTE: sklearn.grid_search and sklearn.cross_validation are the legacy
# (pre-0.20) module names; this notebook relies on their older call signatures.
from sklearn import grid_search, model_selection, svm
from sklearn.cross_validation import StratifiedShuffleSplit, train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel, RFECV, RFE
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, make_scorer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier



with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

# all keys from data_dict
features_list = ['poi',
 'salary',
 'bonus',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'to_messages',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person',
 'email_address']


# put values of 0 where there are NaNs.
# If there is no email address, update new feature has_email to 0, otherwise set it to 1.
# Also, remove the actual email address, since it doesn't make sense to try to quantify 
# a text string for the email address (or do feature scaling, for instance)
for emp, emp_dict in data_dict.items():    
    has_email = 1
    is_poi = 0
    
    for key, val in emp_dict.items():
        if key == 'email_address' and val == 'NaN':
            has_email = 0
        elif key == 'poi' and val == True:
            is_poi = 1
        elif val == 'NaN':
            emp_dict[key] = 0
        
    emp_dict['has_email'] = has_email
    emp_dict['poi'] = is_poi
    emp_dict.pop('email_address', 0)
        
data_dict.pop('TOTAL') # remove the invalid entry - no individual corresponds to 'TOTAL' entry in dictionary
        
features_list.remove('email_address')
features_list.append('has_email')

my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)


# utility function to create stratified shuffle splits
def sss(n_folds=100):
    return StratifiedShuffleSplit(labels, n_iter=n_folds, random_state = 42)
    

clf_nb = GaussianNB()
clf_dt = DecisionTreeClassifier(random_state=42, class_weight = 'balanced')
clf_rf = RandomForestClassifier(random_state=42, max_depth=1, class_weight = 'balanced')
clf_ab = AdaBoostClassifier(random_state=42)
clf_ab_dt = AdaBoostClassifier(DecisionTreeClassifier(random_state=42, class_weight = 'balanced', max_depth=1), random_state=42)
clf_lr = LogisticRegression(random_state=42, penalty ='l1', class_weight = 'balanced') # l1 to use for feature selection
clf_sv = SVC(random_state=42, class_weight = 'balanced') 
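
The cleaning loop above can be hard to follow because it mutates each record in place. The following is a minimal sketch of the same rules applied to a single toy record; the helper name `clean_record` and the toy values are assumptions made for illustration and are not part of the project code.

```python
def clean_record(record):
    """Return a cleaned copy of one employee record (illustrative helper)."""
    cleaned = {}
    for key, val in record.items():
        if key == 'email_address':
            continue  # the raw address is dropped; only its presence is kept
        if key == 'poi':
            cleaned[key] = 1 if val is True else 0  # boolean label -> 0/1
        else:
            cleaned[key] = 0 if val == 'NaN' else val  # 'NaN' strings become 0
    # has_email is 0 only when the address is explicitly the string 'NaN'
    cleaned['has_email'] = 0 if record.get('email_address') == 'NaN' else 1
    return cleaned


# Toy record with made-up values
toy = {'poi': True, 'salary': 'NaN', 'bonus': 5000, 'email_address': 'NaN'}
cleaned = clean_record(toy)
print(cleaned)
```

Returning a copy rather than editing in place leaves the raw record untouched, which makes it easier to compare before and after.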




<h3>A few overview stats</h3>

In [98]:
pois = [ indiv for indiv in my_dataset.keys() if my_dataset[indiv]['poi'] == 1]
pois_with_email = [ indiv for indiv in my_dataset.keys() if my_dataset[indiv]['poi'] == 1 and 
                   my_dataset[indiv]['has_email'] == 1]
num_features = len(data_dict['LAY KENNETH L'].keys())

print("Dataset has {} individuals".format(len(my_dataset)))
print("Number of POIs: {}".format(len(pois)))
print("Number of POIs with email: {}".format(len(pois_with_email)))
print("Number of features: {}".format(num_features))

Dataset has 145 individuals
Number of POIs: 18
Number of POIs with email: 18
Number of features: 21
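
These counts show how imbalanced the classes are. As a small worked example using only the two numbers printed above, the sketch below computes the POI fraction and the accuracy that a trivial "nobody is a POI" classifier would reach; the variable names are illustrative.

```python
n_total = 145  # individuals reported above
n_poi = 18     # POIs reported above

poi_fraction = n_poi / float(n_total)
# Accuracy of a trivial classifier that labels everyone as non-POI
majority_accuracy = (n_total - n_poi) / float(n_total)

print("POI fraction: {:.3f}".format(poi_fraction))
print("Always-non-POI accuracy: {:.3f}".format(majority_accuracy))
```

Such a classifier would score roughly 0.876 accuracy while having zero recall, which is one reason to judge the runs in this document by precision and recall rather than by accuracy.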


<H2>Feature selection and recursive feature elimination using L1 penalty with logistic regression </H2>

As described in section 1.13.4.1, "L1-based feature selection", of http://scikit-learn.org/stable/modules/feature_selection.html, linear models with an L1 penalty can be used for feature selection, and this can be implemented with SelectFromModel wrapped around a logistic regression model.  I tried this below in order to get better feature selection than I got with SelectKBest and Principal Component Analysis (PCA), which actually seemed to worsen the precision and recall scores.

In addition, I used logistic regression with an L1 penalty as the estimator for recursive feature elimination, which gave an initial reduced feature set of 7 features.  Because the eliminated features are ranked, it was easy to add back the highest-ranked eliminated feature, and the runs below show that adding it helped in one classification task.
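
To make the ranking idea concrete, here is a minimal pure-Python sketch of recursive feature elimination with one feature dropped per step. It is not scikit-learn's implementation: `rfe_ranking`, `importance_fn`, and the toy importances are hypothetical, and RFECV additionally uses cross-validated scores to decide how many features to keep.

```python
def rfe_ranking(feature_names, importance_fn, n_keep):
    """Drop the least important feature one at a time (like step=1).

    importance_fn is a hypothetical callable: given the remaining feature
    names, it returns a dict {name: importance}. With a real estimator this
    would mean refitting the model and reading its coefficient magnitudes.
    """
    remaining = list(feature_names)
    ranking = {}
    rank = len(remaining) - n_keep + 1  # rank given to the first feature dropped
    while len(remaining) > n_keep:
        importances = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        ranking[weakest] = rank
        remaining.remove(weakest)
        rank -= 1
    for name in remaining:
        ranking[name] = 1  # all surviving features share rank 1
    return ranking


# Toy importances (made up); a real run would recompute them after each drop
toy_importance = {'a': 0.9, 'b': 0.1, 'c': 0.5, 'd': 0.3}
ranking = rfe_ranking(list(toy_importance), lambda names: toy_importance, n_keep=2)
print(sorted(ranking.items()))
```

The last feature to be eliminated ends up with rank 2, which is why the rank-2 feature is the natural first candidate to add back.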

In [3]:

estimator = clf_lr # Use logistic regression with L1 penalty for feature selection

selector_prec = RFECV(estimator, step=1, cv=sss(1000), scoring="precision")
selector_prec = selector_prec.fit(features, labels)
selector_prec.support_ 


array([False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False,  True, False, False, False,
        True, False])

In [4]:
# Next, repeat the RFECV run, this time scoring on recall

estimator = clf_lr

selector_recall = RFECV(estimator, step=1, cv=sss(1000), scoring="recall")
selector_recall = selector_recall.fit(features, labels)
selector_recall.support_ 

array([False, False, False, False, False, False,  True,  True, False,
       False, False,  True,  True, False,  True,  True, False, False,
        True, False])

In [5]:
selector_f1 = RFECV(estimator, step=1, cv=sss(1000), scoring="f1")
selector_f1 = selector_f1.fit(features, labels)
selector_f1.support_ 


array([False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False,  True, False, False, False,
        True, False])

In [7]:
selector = selector_recall # use the recall RFECV for the tentative feature selection since it leaves 7 features, more than for precision and f1

print(selector.support_)
print(selector.ranking_)

lr_support = selector.support_
lr_ranking = selector.ranking_

lr_features_list = list(np.array(features_list)[1:][selector.support_])
lr_features_list.sort()
lr_features_list = ["poi"] + lr_features_list
lr_features_list

[False False False False False False  True  True False False False  True
  True False  True  True False False  True False]
[ 5  8  3 10  9 12  1  1 13  2 11  1  1  6  1  1  4  7  1 14]


['poi',
 'director_fees',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'restricted_stock_deferred',
 'shared_receipt_with_poi',
 'to_messages']

In [8]:
lr_features_list = list(np.array(features_list)[1:][selector_recall.support_])
lr_features_list.sort()
lr_features_list = ["poi"] + lr_features_list
print(selector_recall.ranking_)
lr_features_list

[ 5  8  3 10  9 12  1  1 13  2 11  1  1  6  1  1  4  7  1 14]


['poi',
 'director_fees',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'restricted_stock_deferred',
 'shared_receipt_with_poi',
 'to_messages']

<H4>Using the rankings, we can list <i>below</i> the most valuable features from the L1-penalty-based feature selection</H4>  All features not eliminated by the recursive feature elimination have rank 1, and the POI label is arbitrarily assigned rank 0.

In [9]:
rank = list(selector_recall.ranking_)
rank.insert(0, 0) # put a rank of 0 in for the feature importance rank of poi, for display
for ind, feat in zip(rank, features_list):
    print("rank: {}, feature: {}".format(ind,feat))
    
# NOTE: all features selected by the RFECV have rank 1; the rank of poi is arbitrarily listed as 0

rank: 0, feature: poi
rank: 5, feature: salary
rank: 8, feature: bonus
rank: 3, feature: deferral_payments
rank: 10, feature: total_payments
rank: 9, feature: exercised_stock_options
rank: 12, feature: restricted_stock
rank: 1, feature: shared_receipt_with_poi
rank: 1, feature: restricted_stock_deferred
rank: 13, feature: total_stock_value
rank: 2, feature: expenses
rank: 11, feature: loan_advances
rank: 1, feature: to_messages
rank: 1, feature: from_messages
rank: 6, feature: other
rank: 1, feature: from_this_person_to_poi
rank: 1, feature: director_fees
rank: 4, feature: deferred_income
rank: 7, feature: long_term_incentive
rank: 1, feature: from_poi_to_this_person
rank: 14, feature: has_email
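
The ranking printed above can be turned directly into the selected feature set and an ordered list of "runner-up" features. The sketch below copies the feature names and ranks from the output above into plain lists; the variable names `selected` and `runners_up` are illustrative.

```python
feature_names = [
    'salary', 'bonus', 'deferral_payments', 'total_payments',
    'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi',
    'restricted_stock_deferred', 'total_stock_value', 'expenses',
    'loan_advances', 'to_messages', 'from_messages', 'other',
    'from_this_person_to_poi', 'director_fees', 'deferred_income',
    'long_term_incentive', 'from_poi_to_this_person', 'has_email',
]
ranks = [5, 8, 3, 10, 9, 12, 1, 1, 13, 2, 11, 1, 1, 6, 1, 1, 4, 7, 1, 14]

# Rank 1 marks the features kept by the recall-scored RFECV
selected = sorted(name for name, rank in zip(feature_names, ranks) if rank == 1)

# Eliminated features, best first: the natural candidates to add back
runners_up = [name for rank, name in sorted(zip(ranks, feature_names)) if rank > 1]

print(selected)
print(runners_up[:3])
```

The first two runners-up are the features that the later cells append to the reduced feature list.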


<H2>Algorithm tuning and validation</H2>
<br>
<i>Below are multiple runs in which grid search with cross-validation is used to home in on the optimal parameter values for the Random Forest and Adaptive Boosting algorithms.  These, used in conjunction with the L1-based feature selection in the pipeline, eventually yielded values for both precision and recall above 0.3.</i>

In [10]:
clf_ab5 = AdaBoostClassifier(random_state=42, n_estimators=33, learning_rate=0.55)
t0 = time()
test_classifier(clf_ab5, my_dataset, lr_features_list, folds = 500)
print("done in %0.3fs" % (time() - t0))

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.55, n_estimators=33, random_state=42)
	Accuracy: 0.84691	Precision: 0.21405	Recall: 0.25600	F1: 0.23315	F2: 0.24634
	Total predictions: 5500	True positives:  128	False positives:  470	False negatives:  372	True negatives: 4530

done in 30.037s
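
The summary scores in each result block can be recomputed from the four confusion counts printed beneath them. The sketch below assumes the standard definitions of precision, recall, F1, and F2 (F-beta with beta = 2); the function name `scores_from_counts` is illustrative, and the counts are taken from the run above.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Standard classification scores from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return {
        'accuracy': (tp + tn) / float(total),
        'precision': precision,
        'recall': recall,
        'f1': 2.0 * tp / (2 * tp + fp + fn),
        # F-beta with beta = 2 weights recall more heavily than precision
        'f2': 5.0 * precision * recall / (4 * precision + recall),
    }


# Counts from the AdaBoost run above (reduced feature list, 500 folds)
scores = scores_from_counts(tp=128, fp=470, fn=372, tn=4530)
for name in ('accuracy', 'precision', 'recall', 'f1', 'f2'):
    print("{}: {:.5f}".format(name, scores[name]))
```

Recomputing the scores this way is a quick consistency check on any of the result blocks in this document.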


In [11]:
clf_ab5 = AdaBoostClassifier(random_state=42, n_estimators=33, learning_rate=0.55)
t0 = time()
test_classifier(clf_ab5, my_dataset, features_list, folds = 500)
print("done in %0.3fs" % (time() - t0))

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.55, n_estimators=33, random_state=42)
	Accuracy: 0.84707	Precision: 0.40108	Recall: 0.29800	F1: 0.34194	F2: 0.31415
	Total predictions: 7500	True positives:  298	False positives:  445	False negatives:  702	True negatives: 6055

done in 33.222s
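
The "Total predictions" figure in the run above (7500 over 500 folds, so 15 per fold, with 1000 / 500 = 2 actual POIs per fold) follows from how a stratified shuffle split carves out each test set. The sketch below is a simplified pure-Python version of the idea, not scikit-learn's algorithm; the function name is hypothetical and the 10% test size is taken from the StratifiedShuffleSplit representation printed later in this document.

```python
import random


def stratified_shuffle_split(labels, test_size=0.1, seed=42):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Each class contributes its own share of the test set
        n_test = int(round(test_size * len(indices)))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [1] * 18 + [0] * 127  # 18 POIs among 145 individuals
train_idx, test_idx = stratified_shuffle_split(labels)
n_test_pois = sum(labels[i] for i in test_idx)
print("train={} test={} POIs in test={}".format(
    len(train_idx), len(test_idx), n_test_pois))
```

With only about two POIs in each test fold, a single fold says very little, which is why the runs here aggregate predictions over hundreds of folds.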


In [12]:
dt2 = DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=3, random_state=42)
cl_ab6 = AdaBoostClassifier(dt2, random_state=42, n_estimators=15, learning_rate=0.3)

t0 = time()
test_classifier(cl_ab6, my_dataset, lr_features_list, folds = 300)
print("done in %0.3fs" % (time() - t0))

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
          learning_rate=0.3, n_estimators=15, random_state=42)
	Accuracy: 0.88485	Precision: 0.35401	Recall: 0.32333	F1: 0.33798	F2: 0.32904
	Total predictions: 3300	True positives:   97	False positives:  177	False negatives:  203	True negatives: 2823

done in 9.656s


In [13]:
lr_features_list.append('expenses')  # append the next feature that would have been selected by RFECV, with rank = 2
t0 = time()
test_classifier(cl_ab6, my_dataset, lr_features_list, folds = 300)
print("done in %0.3fs" % (time() - t0))

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
          learning_rate=0.3, n_estimators=15, random_state=42)
	Accuracy: 0.82310	Precision: 0.36008	Recall: 0.30667	F1: 0.33123	F2: 0.31604
	Total predictions: 4200	True positives:  184	False positives:  327	False negatives:  416	True negatives: 3273

done in 9.302s


In [14]:
lr_features_list.append('deferral_payments')  # append the next feature in the RFECV ranking (rank = 3) and rerun
t0 = time()
test_classifier(cl_ab6, my_dataset, lr_features_list, folds = 300)
print("done in %0.3fs" % (time() - t0))

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
          learning_rate=0.3, n_estimators=15, random_state=42)
	Accuracy: 0.83452	Precision: 0.39560	Recall: 0.30000	F1: 0.34123	F2: 0.31524
	Total predictions: 4200	True positives:  180	False positives:  275	False negatives:  420	True negatives: 3325

done in 9.361s


In [15]:
# Adding the last feature caused recall to drop from 0.30667 to 0.30000, so we'll remove this feature and stop here
lr_features_list.remove('deferral_payments')

In [16]:
sss = StratifiedShuffleSplit(labels, 300, random_state=42)
pipe = Pipeline(steps=[("ADA", clf_ab)])
parameters_ada = {
                   'ADA__n_estimators':[1,4,8,12],
                'ADA__learning_rate':[0.1,0.5,1]}
                
gs = GridSearchCV(pipe, parameters_ada, scoring="f1", cv = sss)
t0 = time()
gs.fit(features, labels)
print("done in %0.3fs" % (time() - t0))
t0 = time()
test_classifier(gs.best_estimator_, my_dataset, features_list, folds = 100)
print("done in %0.3fs" % (time() - t0))

done in 52.370s
Pipeline(memory=None,
     steps=[('ADA', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
          n_estimators=12, random_state=42))])
	Accuracy: 0.85067	Precision: 0.41549	Recall: 0.29500	F1: 0.34503	F2: 0.31316
	Total predictions: 1500	True positives:   59	False positives:   83	False negatives:  141	True negatives: 1217

done in 2.497s
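
To see how much work a grid search like the one above does, the candidate settings can be enumerated by hand. The sketch below uses only the standard library; the grid and the 300 splits are copied from the cell above, and the other names are illustrative.

```python
from itertools import product

param_grid = {
    'ADA__n_estimators': [1, 4, 8, 12],
    'ADA__learning_rate': [0.1, 0.5, 1],
}
n_splits = 300  # number of stratified shuffle splits used above

names = sorted(param_grid)
# One dict per combination of parameter values
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]
n_fits = len(candidates) * n_splits  # one fit per candidate per split

# "<step>__<parameter>" routes a value to a named pipeline step
step, param = names[0].split('__', 1)

print("{} candidates, {} fits".format(len(candidates), n_fits))
print("step={} param={}".format(step, param))
```

Note that the best values reported above (n_estimators=12, learning_rate=1) both sit at the upper edge of the grid, which may suggest that a wider grid is worth trying.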


In [17]:
clf_rf1 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('select', select),('clf', clf_rf1)])

param_grid = {
    'clf__max_depth': [2, 3, 4, 5],
    'clf__n_estimators': [40, 60, 90, 120]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="f1")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 2, 'clf__n_estimators': 120}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...timators=120, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80067	Precision: 0.34185	Recall: 0.53500	F1: 0.41715	F2: 0.48068
	Total predictions: 4500	True positives:  321	False positives:  618	False negatives:  279	True negatives: 3282
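
The 'select' step in this pipeline keeps a feature only when the fitted L1-penalized model gives it a non-negligible coefficient. The following is a rough sketch of that thresholding idea in plain Python. The coefficient values are made up, the helper name is hypothetical, and the 1e-5 cutoff is my understanding of the SelectFromModel default for L1-penalized estimators, so treat it as an assumption.

```python
L1_DEFAULT_THRESHOLD = 1e-5  # assumed default cutoff for an L1-penalized estimator


def select_by_coefficient(names, coefs, threshold=L1_DEFAULT_THRESHOLD):
    """Keep the features whose absolute coefficient reaches the threshold."""
    return [name for name, coef in zip(names, coefs) if abs(coef) >= threshold]


names = ['salary', 'bonus', 'expenses', 'to_messages', 'other']
coefs = [0.0, 3.1e-4, -2.4e-4, 0.0, 2.0e-7]  # hypothetical fitted coefficients

kept = select_by_coefficient(names, coefs)
print(kept)
```

Because the L1 penalty pushes many coefficients to exactly zero, this cutoff ends up acting as the feature selector.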



In [18]:
clf_rf1 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('scaler', MinMaxScaler()),('select', select),('clf', clf_rf1)])

param_grid = {
    'clf__max_depth': [2, 3, 4, 5],
    'clf__n_estimators': [40, 60, 90, 120]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="f1")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 2, 'clf__n_estimators': 40}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.78467	Precision: 0.32070	Recall: 0.55000	F1: 0.40516	F2: 0.48119
	Total predictions: 4500	True positives:  330	False positives:  699	False negatives:  270	True negatives: 3201



In [19]:
param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 3, 'select__estimator__tol': 0.01, 'clf__n_estimators': 40}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80556	Precision: 0.32573	Recall: 0.42833	F1: 0.37005	F2: 0.40295
	Total predictions: 4500	True positives:  257	False positives:  532	False negatives:  343	True negatives: 3368



In [23]:
param_grid = {
    'select__estimator__tol': [ 0.1, 0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 90, 120]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 3, 'select__estimator__tol': 0.01, 'clf__n_estimators': 40}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80556	Precision: 0.32573	Recall: 0.42833	F1: 0.37005	F2: 0.40295
	Total predictions: 4500	True positives:  257	False positives:  532	False negatives:  343	True negatives: 3368



In [24]:
test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80127	Precision: 0.31987	Recall: 0.43550	F1: 0.36883	F2: 0.40614
	Total predictions: 15000	True positives:  871	False positives: 1852	False negatives: 1129	True negatives: 11148



In [25]:
grid_search

GridSearchCV(cv=StratifiedShuffleSplit(labels=[0. 0. ... 1. 0.], n_iter=100, test_size=0.1, random_state=42),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'clf__max_depth': [2, 3, 4], 'select__estimator__tol': [0.1, 0.01, 0.001, 0.0001, 1e-05], 'clf__n_estimators': [40, 60, 90, 120]},
       pre_dispatch='2*n_jobs', refit=True, scoring='precision', verbose=0)

In [27]:
winner1 = grid_search.best_estimator_
winner1.__dict__

{'memory': None,
 'steps': [('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
  ('select',
   SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
             fit_intercept=True, intercept_scaling=1, max_iter=100,
             multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
             solver='liblinear', tol=0.01, verbose=0, warm_start=False),
           norm_order=1, prefit=False, threshold=None)),
  ('clf', RandomForestClassifier(bootstrap=True, class_weight='balanced',
               criterion='gini', max_depth=3, max_features='auto',
               max_leaf_nodes=None, min_impurity_decrease=0.0,
               min_impurity_split=None, min_samples_leaf=1,
               min_samples_split=2, min_weight_fraction_leaf=0.0,
               n_estimators=40, n_jobs=1, oob_score=False, random_state=42,
               verbose=0, warm_start=False))]}

In [28]:
test_classifier(winner1, my_dataset, features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80127	Precision: 0.31987	Recall: 0.43550	F1: 0.36883	F2: 0.40614
	Total predictions: 15000	True positives:  871	False positives: 1852	False negatives: 1129	True negatives: 11148



In [29]:
test_classifier(winner1, my_dataset, lr_features_list, folds = 300)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.77071	Precision: 0.29945	Recall: 0.45167	F1: 0.36013	F2: 0.40998
	Total predictions: 4200	True positives:  271	False positives:  634	False negatives:  329	True negatives: 2966



In [30]:
test_classifier(winner1, my_dataset, lr_features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.76936	Precision: 0.30298	Recall: 0.47250	F1: 0.36921	F2: 0.42495
	Total predictions: 14000	True positives:  945	False positives: 2174	False negatives: 1055	True negatives: 9826



In [31]:
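# NOTE: this assignment does not copy; features_without_mine is the same list
# object as features_list, so the remove() below also drops 'has_email' from features_list
features_without_mine = features_list
features_without_mine.remove('has_email')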
features_without_mine

['poi',
 'salary',
 'bonus',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'to_messages',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person']
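
One thing to keep in mind when reading the cell above: in Python, assigning a list to a new name does not copy it. The short sketch below, using a toy three-item list, contrasts aliasing with making an independent copy.

```python
# Aliasing: both names refer to the same list object
aliased_source = ['poi', 'salary', 'has_email']
alias = aliased_source
alias.remove('has_email')
print(aliased_source)  # 'has_email' is gone from the original as well

# Copying: list(...) builds an independent shallow copy
copied_source = ['poi', 'salary', 'has_email']
independent = list(copied_source)
independent.remove('has_email')
print(copied_source)   # the original still contains 'has_email'
```

Any later cell that passes the original list therefore sees the shortened version unless a copy was made first.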

In [32]:
# Rerun the test without the added feature (has_email)
#  compare to baseline with 300 folds: Accuracy: 0.83044	Precision: 0.37519	Recall: 0.40833	F1: 0.39106	F2: 0.40124
test_classifier(winner1, my_dataset, features_without_mine, folds = 300)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.81867	Precision: 0.35165	Recall: 0.42667	F1: 0.38554	F2: 0.40921
	Total predictions: 4500	True positives:  256	False positives:  472	False negatives:  344	True negatives: 3428



So the added feature, has_email, provides no benefit, at least with the best classifier so far and 300 folds.

In [33]:
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('scaler', MinMaxScaler()), ('select', select), ('clf', GaussianNB())])

"""
param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'select__estimator__C': [0.1, 0.5, 0.9, 1.0]
}
"""

param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'select__estimator__tol': 0.01}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.01, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=None)), ('clf', GaussianNB(priors=None))])
	Accuracy: 0.28800	Precision: 0.14290	Recall: 0.86833	F1: 0.24541	F2: 0.43086
	Total predictions: 4500	True positives:  521	False positives: 3125	False negatives:   79	True negatives:  775



In [34]:
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)
features_scaled[0]

array([1.81735475e-01, 5.21875000e-01, 4.55198951e-01, 4.33029255e-02,
       5.03529074e-02, 1.57231836e-01, 2.54845137e-01, 9.63456735e-02,
       3.60830823e-02, 6.06216914e-02, 0.00000000e+00, 1.91563800e-01,
       1.52770045e-01, 1.46721985e-05, 1.06732348e-01, 0.00000000e+00,
       1.20800334e-01, 5.92379574e-02, 8.90151515e-02, 1.00000000e+00])

In [35]:
features[0]

array([ 2.019550e+05,  4.175000e+06,  2.869717e+06,  4.484442e+06,
        1.729541e+06,  1.260270e+05,  1.407000e+03, -1.260270e+05,
        1.729541e+06,  1.386800e+04,  0.000000e+00,  2.902000e+03,
        2.195000e+03,  1.520000e+02,  6.500000e+01,  0.000000e+00,
       -3.081055e+06,  3.048050e+05,  4.700000e+01,  1.000000e+00])
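
Comparing the two arrays above shows what min-max scaling does: each column is mapped onto [0, 1] independently, so dollar amounts in the millions and message counts in the hundreds end up on the same scale. Below is a minimal pure-Python sketch of the transformation, (x - min) / (max - min) per column, on made-up toy rows; it is an illustration rather than a replacement for MinMaxScaler.

```python
def min_max_scale(rows):
    """Scale each column of a list of equal-length rows onto [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has zero range; map it to 0.0
            (x - lo) / float(hi - lo) if hi != lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


rows = [
    [100.0, -50.0, 1.0],
    [300.0, 0.0, 0.0],
    [200.0, 50.0, 1.0],
]
scaled = min_max_scale(rows)
for row in scaled:
    print(row)
```

Negative values (such as deferred amounts) are handled naturally, since the column minimum simply maps to 0, and a 0/1 flag is left unchanged.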

In [36]:
clf_rf1 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('select', select),('clf', clf_rf1)])

param_grid = {
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 90]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
grid_search.fit(features_scaled, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 3, 'clf__n_estimators': 40}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.81511	Precision: 0.35204	Recall: 0.46000	F1: 0.39884	F2: 0.43342
	Total predictions: 4500	True positives:  276	False positives:  508	False negatives:  324	True negatives: 3392



In [37]:
test_classifier(grid_search.best_estimator_, my_dataset, lr_features_list, folds = 300)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.79167	Precision: 0.33087	Recall: 0.44833	F1: 0.38075	F2: 0.41861
	Total predictions: 4200	True positives:  269	False positives:  544	False negatives:  331	True negatives: 3056



In [39]:
grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 2, 'clf__n_estimators': 90}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=90, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80200	Precision: 0.34338	Recall: 0.53167	F1: 0.41727	F2: 0.47912
	Total predictions: 4500	True positives:  319	False positives:  610	False negatives:  281	True negatives: 3290



In [40]:
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('select', select),('clf', AdaBoostClassifier(random_state=42))])

param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001],
    'clf__learning_rate': [ 0.1, 0.5, 1.0],
    'clf__n_estimators': [ 10, 30, 50, 90, 140]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__learning_rate': 0.5, 'select__estimator__tol': 0.01, 'clf__n_estimators': 90}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.01, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=None)), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=90, random_state=42))])
	Accuracy: 0.85822	Precision: 0.46091	Recall: 0.37333	F1: 0.41252	F2: 0.38808
	Total predictions: 4500	True positives:  224	False positives:  262	False negatives:  376	True negatives: 3638



In [41]:
test_classifier(gs.best_estimator_, my_dataset, lr_features_list, folds = 300)
print("done in %0.3fs" % (time() - t0))

Pipeline(memory=None,
     steps=[('ADA', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
          n_estimators=12, random_state=42))])
	Accuracy: 0.84429	Precision: 0.43891	Recall: 0.32333	F1: 0.37236	F2: 0.34131
	Total predictions: 4200	True positives:  194	False positives:  248	False negatives:  406	True negatives: 3352

done in 4209.499s


In [42]:
features_list

['poi',
 'salary',
 'bonus',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'to_messages',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'from_poi_to_this_person']

In [43]:
clf_rf2 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('select', select),('clf', clf_rf2)])


param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
grid_search.fit(features, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 2, 'select__estimator__tol': 0.0001, 'clf__n_estimators': 80}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80111	Precision: 0.34292	Recall: 0.53667	F1: 0.41845	F2: 0.48218
	Total predictions: 4500	True positives:  322	False positives:  617	False negatives:  278	True negatives: 3283



In [44]:
winner_grid1 = grid_search
winner_clf1 = grid_search.best_estimator_

In [45]:
dump_classifier_and_data(winner_clf1, my_dataset, features_list)

# signature reminder: dump_classifier_and_data(clf, dataset, feature_list)

In [46]:
from tester import dump_classifier_and_data, load_classifier_and_data, test_classifier

pkl_clf, pkl_dataset, pkl_feature_list = load_classifier_and_data()
test_classifier(pkl_clf, pkl_dataset, pkl_feature_list)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.79467	Precision: 0.33466	Recall: 0.54650	F1: 0.41512	F2: 0.48509
	Total predictions: 15000	True positives: 1093	False positives: 2173	False negatives:  907	True negatives: 10827



In [47]:
def quick_test(clf):  # assumes features, labels, my_dataset and features_list are already defined
    features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)
    clf.fit(features_train, labels_train)
    labels_predicted = clf.predict(features_test)
    classification_rep = classification_report(labels_test, labels_predicted)
    print("classification_report for {}: \n{}".format(clf, classification_rep))
    test_classifier(clf, my_dataset, features_list, folds = 150)
              

In [48]:
clf_ab_dt = AdaBoostClassifier(DecisionTreeClassifier(random_state=42, class_weight = 'balanced', max_depth=1), random_state=42)

<h3>Quick tests using simple train/test splits and classification reports</h3>

In [49]:
quick_test(clf_nb)
quick_test(clf_dt)
quick_test(clf_rf)
quick_test(clf_ab)
quick_test(clf_ab_dt)
quick_test(clf_lr)
quick_test(clf_sv)

classification_report for GaussianNB(priors=None): 
             precision    recall  f1-score   support

        0.0       0.93      0.95      0.94        39
        1.0       0.50      0.40      0.44         5

avg / total       0.88      0.89      0.88        44

GaussianNB(priors=None)
	Accuracy: 0.73067	Precision: 0.23529	Recall: 0.45333	F1: 0.30979	F2: 0.38245
	Total predictions: 2250	True positives:  136	False positives:  442	False negatives:  164	True negatives: 1508

classification_report for DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'): 
             precision    recall  f1-score   support

        0.0       0.90      0.95      0.92        39
        1.0       0.33      0.

  'precision', 'predicted', average, warn_for)


Got a divide by zero when trying out: SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)
Precision or recall may be undefined due to a lack of true positive predicitons.
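
The divide-by-zero message above suggests that the SVC produced no usable positive predictions, which leaves at least one score with a zero denominator. The following is a minimal sketch of a guarded calculation using the usual definitions of precision, recall and F1. The helper names safe_ratio and guarded_scores are hypothetical and are not part of tester.py.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if denominator == 0:
        return None
    return numerator / float(denominator)


def guarded_scores(tp, fp, fn):
    """Precision, recall and F1 from confusion counts, with None for 0/0."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    f1 = safe_ratio(2 * tp, 2 * tp + fp + fn)
    return precision, recall, f1


# A classifier that never predicts the positive class: precision is 0/0.
print(guarded_scores(tp=0, fp=0, fn=5))
# A classifier that finds 2 of 5 positives and raises 2 false alarms.
print(guarded_scores(tp=2, fp=2, fn=3))
```

The second call uses counts consistent with the GaussianNB report above (precision 0.50 and recall 0.40 on a support of 5).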


<h3>... and on to a more systematic use of GridSearchCV to find optimal parameters for the algorithms and the L1-based feature selection</h3>
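
Each grid search in this section fits one pipeline per parameter combination per cross-validation split, which goes a long way toward explaining the run times. The following is a back-of-the-envelope count for the random forest grid used in the cells that follow. It assumes that the second positional argument of the old StratifiedShuffleSplit is the number of iterations (100).

```python
from itertools import product

# Parameter grid copied from the random forest searches in this section.
param_grid = {
    'select__estimator__tol': [0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80],
}
n_splits = 100  # assumed: StratifiedShuffleSplit(labels, 100, ...) means 100 iterations

# Enumerate every combination the way an exhaustive grid search would.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

n_fits = len(candidates) * n_splits
print("candidates: %d, pipeline fits: %d" % (len(candidates), n_fits))
```

With the default refit behaviour, GridSearchCV adds one more fit of the best candidate on the full data.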

In [50]:
clf_rf2 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('scaler', MinMaxScaler()),('select', select),('clf', clf_rf2)])


param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

scaler_grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
scaler_grid_search.fit(features, labels)
print(scaler_grid_search.best_params_)

test_classifier(scaler_grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 3, 'select__estimator__tol': 0.01, 'clf__n_estimators': 40}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...stimators=40, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.81867	Precision: 0.35165	Recall: 0.42667	F1: 0.38554	F2: 0.40921
	Total predictions: 4500	True positives:  256	False positives:  472	False negatives:  344	True negatives: 3428



In [51]:
clf_rf2 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)



pipeline = Pipeline([('select', select),('clf', clf_rf2)])


param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

prescaled_grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
prescaled_grid_search.fit(features, labels)
print(prescaled_grid_search.best_params_)

test_classifier(prescaled_grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__max_depth': 2, 'select__estimator__tol': 0.0001, 'clf__n_estimators': 80}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.80111	Precision: 0.34292	Recall: 0.53667	F1: 0.41845	F2: 0.48218
	Total predictions: 4500	True positives:  322	False positives:  617	False negatives:  278	True negatives: 3283
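
The imports used throughout this notebook (sklearn.grid_search, sklearn.cross_validation) were removed from later scikit-learn releases in favor of sklearn.model_selection. The following is a rough sketch of the same kind of search with the newer API. It assumes a recent scikit-learn and runs on synthetic data from make_classification rather than the project dataset. The grid and the number of splits are cut down purely to keep it quick, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the project data (an assumption, not the real dataset).
X, y = make_classification(n_samples=150, n_features=19, n_informative=5,
                           n_clusters_per_class=1, weights=[0.85, 0.15],
                           random_state=42)

# liblinear is named explicitly because it supports the L1 penalty.
select = SelectFromModel(LogisticRegression(class_weight='balanced', penalty='l1',
                                            solver='liblinear', random_state=42))
pipeline = Pipeline([('select', select),
                     ('clf', RandomForestClassifier(class_weight='balanced',
                                                    random_state=42))])

# In model_selection the splitter takes n_splits; the labels are supplied later by fit().
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

param_grid = {
    'clf__max_depth': [2, 3],
    'clf__n_estimators': [20, 40],
}

search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=sss,
                      scoring='precision')
search.fit(X, y)
print(search.best_params_)
```

The main differences from the cells above are the import paths and the splitter signature. The pipeline and the double-underscore parameter names carry over unchanged.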



In [52]:
clf_rf2 = RandomForestClassifier(criterion='gini', random_state=42, class_weight='balanced')
select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

pipeline = Pipeline([('scaler', MinMaxScaler()), 
                     ('select', SelectKBest()), 
                     ('pca', PCA(random_state=42)), 
                     ('clf', clf_rf2)])


param_grid = {
    'select__k': [7,10,15],
    'pca__n_components': [3,5,7],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

selectk_pca_grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
selectk_pca_grid_search.fit(features, labels)
print(selectk_pca_grid_search.best_params_)

test_classifier(selectk_pca_grid_search.best_estimator_, my_dataset, features_list, folds = 300)




{'pca__n_components': 7, 'clf__max_depth': 2, 'select__k': 15, 'clf__n_estimators': 80}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectKBest(k=15, score_func=<function f_classif at 0x000000000BC02208>)), ('pca', PCA(copy=True, iterated_power='auto', n_components=7, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', RandomForestCla...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.73667	Precision: 0.27517	Recall: 0.59667	F1: 0.37664	F2: 0.48365
	Total predictions: 4500	True positives:  358	False positives:  943	False negatives:  242	True negatives: 2957



In [53]:

select = SelectFromModel(LogisticRegression(class_weight='balanced', max_iter=100, penalty='l1', random_state=42))
sss = StratifiedShuffleSplit(labels, 100, random_state=42)

In [54]:
clf_ab2 = AdaBoostClassifier(random_state=42)
pipeline = Pipeline([('select', select),('clf', clf_ab2)])

param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__learning_rate': [0.1, 0.5, 1.0],
    'clf__n_estimators': [20, 40, 60, 80, 100]
}
              
adab_grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
adab_grid_search.fit(features, labels)
print(adab_grid_search.best_params_)

test_classifier(adab_grid_search.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__learning_rate': 0.5, 'select__estimator__tol': 0.01, 'clf__n_estimators': 80}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.01, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=None)), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=80, random_state=42))])
	Accuracy: 0.86156	Precision: 0.47569	Recall: 0.37500	F1: 0.41938	F2: 0.39158
	Total predictions: 4500	True positives:  225	False positives:  248	False negatives:  375	True negatives: 3652



In [55]:
test_classifier(adab_grid_search.best_estimator_, my_dataset, features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.01, verbose=0, warm_start=False),
        norm_order=1, prefit=False, threshold=None)), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=80, random_state=42))])
	Accuracy: 0.85667	Precision: 0.45253	Recall: 0.35750	F1: 0.39944	F2: 0.37317
	Total predictions: 15000	True positives:  715	False positives:  865	False negatives: 1285	True negatives: 12135



In [56]:
clf_ab3 = AdaBoostClassifier(DecisionTreeClassifier(random_state=42, class_weight = 'balanced'), random_state=42)
pipeline = Pipeline([('select', select),('clf', clf_ab3)])

param_grid = {
    'select__estimator__tol': [ 0.01, 0.001, 0.0001, 0.00001],
    'clf__learning_rate': [0.1, 0.5, 1.0],
    'clf__n_estimators': [20, 40, 60, 80, 100],
    'clf__base_estimator__max_depth': [1,2,3,4]
}

adab_grid_search2 = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
adab_grid_search2.fit(features, labels)
print(adab_grid_search2.best_params_)

test_classifier(adab_grid_search2.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__base_estimator__max_depth': 1, 'clf__learning_rate': 0.1, 'select__estimator__tol': 0.0001, 'clf__n_estimators': 20}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...e=42,
            splitter='best'),
          learning_rate=0.1, n_estimators=20, random_state=42))])
	Accuracy: 0.71089	Precision: 0.24171	Recall: 0.54667	F1: 0.33521	F2: 0.43652
	Total predictions: 4500	True positives:  328	False positives: 1029	False negatives:  272	True negatives: 2871



In [57]:
pipeline = Pipeline([('select', SelectKBest()), 
                     ('pca', PCA(random_state=42)), 
                     ('clf', clf_rf2)])


param_grid = {
    'select__k': [6,8,12],
    'pca__n_components': [3,4,5,6],
    'clf__max_depth': [2, 3, 4],
    'clf__n_estimators': [40, 60, 80]
}

features_scaled = MinMaxScaler().fit_transform(features)

prescaled_grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="precision")
prescaled_grid_search.fit(features_scaled, labels)
print(prescaled_grid_search.best_params_)

test_classifier(prescaled_grid_search.best_estimator_, my_dataset, features_list, folds = 300)


{'pca__n_components': 5, 'clf__max_depth': 2, 'select__k': 6, 'clf__n_estimators': 80}
Pipeline(memory=None,
     steps=[('select', SelectKBest(k=6, score_func=<function f_classif at 0x000000000BC02208>)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', RandomForestClassifier(bootstrap=True, class_weight='balanced',
           ...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.79067	Precision: 0.31042	Recall: 0.46667	F1: 0.37284	F2: 0.42399
	Total predictions: 4500	True positives:  280	False positives:  622	False negatives:  320	True negatives: 3278



In [58]:
test_classifier(prescaled_grid_search.best_estimator_, my_dataset, features_list, folds = 1000)

  f = msb / msw


Pipeline(memory=None,
     steps=[('select', SelectKBest(k=6, score_func=<function f_classif at 0x000000000BC02208>)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', RandomForestClassifier(bootstrap=True, class_weight='balanced',
           ...stimators=80, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))])
	Accuracy: 0.78660	Precision: 0.30305	Recall: 0.46200	F1: 0.36601	F2: 0.41814
	Total predictions: 15000	True positives:  924	False positives: 2125	False negatives: 1076	True negatives: 10875



In [59]:
clf_ab2 = AdaBoostClassifier(random_state=42)
pipeline = Pipeline([('select', select),('clf', clf_ab2)])

param_grid = {
    'select__estimator__tol': [ 0.0005, 0.001, 0.005],
    'clf__learning_rate': [0.45, 0.5, 0.55],
    'clf__n_estimators': [10, 15, 20, 25, 30]
}
              
adab_grid_search2 = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
adab_grid_search2.fit(features, labels)
print(adab_grid_search2.best_params_)

test_classifier(adab_grid_search2.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__learning_rate': 0.45, 'select__estimator__tol': 0.0005, 'clf__n_estimators': 20}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0005, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=20, random_state=42))])
	Accuracy: 0.87533	Precision: 0.54503	Recall: 0.39333	F1: 0.45692	F2: 0.41652
	Total predictions: 4500	True positives:  236	False positives:  197	False negatives:  364	True negatives: 3703



In [60]:
test_classifier(adab_grid_search2.best_estimator_, my_dataset, features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0005, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=20, random_state=42))])
	Accuracy: 0.87307	Precision: 0.53297	Recall: 0.38800	F1: 0.44907	F2: 0.41032
	Total predictions: 15000	True positives:  776	False positives:  680	False negatives: 1224	True negatives: 12320



In [61]:
adab_grid_search2.best_estimator_.__dict__

{'memory': None,
 'steps': [('select',
   SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
             fit_intercept=True, intercept_scaling=1, max_iter=100,
             multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
             solver='liblinear', tol=0.0005, verbose=0, warm_start=False),
           norm_order=1, prefit=False, threshold=None)),
  ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=0.45, n_estimators=20, random_state=42))]}

In [62]:
pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('select', SelectKBest()),
                     ('pca', PCA(random_state=42)), 
                     ('clf', AdaBoostClassifier(random_state=42))])


param_grid = {
    'select__k': [6,8,12],
    'pca__n_components': [3,5,6],
    'clf__learning_rate': [0.1, 0.5, 1.0],
    'clf__n_estimators': [20, 30, 50]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
grid_search.fit(features_scaled, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)


{'select__k': 12, 'pca__n_components': 3, 'clf__learning_rate': 1.0, 'clf__n_estimators': 20}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectKBest(k=12, score_func=<function f_classif at 0x000000000BC02208>)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=20, random_state=42))])
	Accuracy: 0.81889	Precision: 0.22078	Recall: 0.14167	F1: 0.17259	F2: 0.15260
	Total predictions: 4500	True positives:   85	False positives:  300	False negatives:  515	True negatives: 3600



In [63]:
pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('clf', AdaBoostClassifier(random_state=42))])


param_grid = {
    'clf__learning_rate': [0.1, 0.5, 1.0],
    'clf__n_estimators': [20, 30, 50]
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
grid_search.fit(features_scaled, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)


{'clf__learning_rate': 0.5, 'clf__n_estimators': 50}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=50, random_state=42))])
	Accuracy: 0.84622	Precision: 0.40043	Recall: 0.30833	F1: 0.34840	F2: 0.32320
	Total predictions: 4500	True positives:  185	False positives:  277	False negatives:  415	True negatives: 3623



In [64]:
clf_final = AdaBoostClassifier(learning_rate=0.5, n_estimators=20, random_state=42)
selection_final = SelectFromModel(estimator=
                                  LogisticRegression(class_weight='balanced', penalty='l1', random_state=42, tol=0.0005))
pipeline_final = Pipeline([('select', selection_final), ('clf', clf_final)])

dump_classifier_and_data(pipeline_final, my_dataset, features_list)
pkl_clf, pkl_dataset, pkl_feature_list = load_classifier_and_data()
test_classifier(pkl_clf, pkl_dataset, pkl_feature_list)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0005, verbose=0, warm_...thm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=20, random_state=42))])
	Accuracy: 0.87280	Precision: 0.53181	Recall: 0.38450	F1: 0.44631	F2: 0.40705
	Total predictions: 15000	True positives:  769	False positives:  677	False negatives: 1231	True negatives: 12323
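
The scores printed by test_classifier can be cross-checked from the confusion counts on the last line of its output. The following is a small sketch using the standard definitions of accuracy, precision, recall and F-beta. The function name scores_from_counts is made up for this illustration, and tester.py itself is not reproduced here.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Recompute the summary metrics from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)

    def f_beta(beta):
        # F-beta treats recall as beta times as important as precision.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {'accuracy': accuracy, 'precision': precision, 'recall': recall,
            'f1': f_beta(1), 'f2': f_beta(2)}


# Counts from the final AdaBoost pipeline run above (15000 predictions).
scores = scores_from_counts(tp=769, fp=677, fn=1231, tn=12323)
for name in ('accuracy', 'precision', 'recall', 'f1', 'f2'):
    print("%s: %.5f" % (name, scores[name]))
```

The recomputed values agree with the tester output above to five decimal places.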



In [65]:
# Compare against the reduced feature set obtained from RFECV with logistic regression

test_classifier(clf_final, my_dataset, lr_features_list, folds = 300)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=20, random_state=42)
	Accuracy: 0.85881	Precision: 0.50801	Recall: 0.37000	F1: 0.42816	F2: 0.39126
	Total predictions: 4200	True positives:  222	False positives:  215	False negatives:  378	True negatives: 3385
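
The "Total predictions" figures differ between runs (4200, 4500, 15000) because they scale with the number of folds and with the number of rows available for testing. Dividing by the fold count gives the test-set size per fold. This sketch assumes that test_classifier uses the same number of test rows in every fold, and the labels in the dictionary are illustrative only.

```python
# (total predictions, true positives + false negatives, folds) taken from the outputs above
runs = {
    'full feature list, 1000 folds': (15000, 769 + 1231, 1000),
    'full feature list, 300 folds': (4500, 236 + 364, 300),
    'RFECV feature list, 300 folds': (4200, 222 + 378, 300),
}

per_fold = {}
for label, (total, positives, folds) in sorted(runs.items()):
    # Integer division is exact here because every fold has the same size.
    per_fold[label] = (total // folds, positives // folds)
    print("%s: %d test rows and %d POIs per fold" % ((label,) + per_fold[label]))
```

With only two POIs in each test fold, per-fold scores would be very coarse, so pooling predictions over hundreds of folds, as the tester does, gives a steadier estimate. The reduced feature list yields one fewer test row per fold, presumably because fewer rows survive formatting with that list.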



<h3>Looks like we have a winner here. It is still worth testing whether other types of feature selection, or different feature sets (the features obtained from recursive feature elimination, and the full set minus the feature added for having an email address), improve the scores.</h3>

In [68]:
pipeline = Pipeline([('select', select),('clf', AdaBoostClassifier(random_state=42))])

param_grid = {
    'select__estimator__tol': [ 0.0001, 0.0005, 0.001],
    'clf__learning_rate': [0.40, 0.45, 0.5],
    'clf__n_estimators': [15, 18, 20, 22, 25]
}
 
# same pipeline and classifier as best so far, with slightly different param choices
adab_grid_search3 = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
adab_grid_search3.fit(features, labels)
print(adab_grid_search3.best_params_)

test_classifier(adab_grid_search3.best_estimator_, my_dataset, features_list, folds = 300)

{'clf__learning_rate': 0.45, 'select__estimator__tol': 0.0001, 'clf__n_estimators': 18}
Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))])
	Accuracy: 0.87422	Precision: 0.53881	Recall: 0.39333	F1: 0.45472	F2: 0.41579
	Total predictions: 4500	True positives:  236	False positives:  202	False negatives:  364	True negatives: 3698



In [69]:
test_classifier(adab_grid_search3.best_estimator_, my_dataset, features_list, folds = 1000)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))])
	Accuracy: 0.87200	Precision: 0.52736	Recall: 0.38550	F1: 0.44541	F2: 0.40742
	Total predictions: 15000	True positives:  771	False positives:  691	False negatives: 1229	True negatives: 12309



In [70]:
test_classifier(adab_grid_search3.best_estimator_, my_dataset, lr_features_list, folds = 300)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))])
	Accuracy: 0.85929	Precision: 0.51016	Recall: 0.37667	F1: 0.43337	F2: 0.39747
	Total predictions: 4200	True positives:  226	False positives:  217	False negatives:  374	True negatives: 3383



In [71]:
test_classifier(adab_grid_search3.best_estimator_, my_dataset, features_without_mine, folds = 300)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))])
	Accuracy: 0.87422	Precision: 0.53881	Recall: 0.39333	F1: 0.45472	F2: 0.41579
	Total predictions: 4500	True positives:  236	False positives:  202	False negatives:  364	True negatives: 3698



In [73]:
# Does using feature scaling before the select-from-model using L1 penalty help?
pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('select', select),
                     ('clf', adab_grid_search3.best_estimator_)])

param_grid = {
    'select__estimator__tol': [ 0.0001, 0.0005, 0.001],
}


adab_scaler_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
adab_scaler_search.fit(features, labels)
print(adab_scaler_search.best_params_)

test_classifier(adab_scaler_search.best_estimator_, my_dataset, features_list, folds = 300)

{'select__estimator__tol': 0.0001}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,...'SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))]))])
	Accuracy: 0.87644	Precision: 0.56111	Recall: 0.33667	F1: 0.42083	F2: 0.36594
	Total predictions: 4500	True positives:  202	False positives:  158	False negatives:  398	True negatives: 3742



In [74]:
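# Does using feature scaling INSTEAD of select-from-model using L1 penalty improve scores?
# Caveat: adab_grid_search3.best_estimator_ is itself a Pipeline that still contains the
# 'select' step (see the nested Pipeline in the output), so L1-based selection is not removed here.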
pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('clf', adab_grid_search3.best_estimator_)])

test_classifier(pipeline, my_dataset, features_list, folds = 300)

Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('clf', Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr'...'SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))]))])
	Accuracy: 0.87644	Precision: 0.56111	Recall: 0.33667	F1: 0.42083	F2: 0.36594
	Total predictions: 4500	True positives:  202	False positives:  158	False negatives:  398	True negatives: 3742



In [76]:
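# Does using feature scaling along with SelectKBest and PCA instead of L1-based feature selection get better scores?
# Caveat: the 'clf' step is the full best_estimator_ Pipeline, so L1-based selection still runs after PCA.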
pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('select', SelectKBest()),
                     ('pca', PCA(random_state=42)), 
                     ('clf', adab_grid_search3.best_estimator_)])


param_grid = {
    'select__k': [6,10,15,20],
    'pca__n_components': [3,5,6],
}

grid_search = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=sss, scoring="recall")
grid_search.fit(features_scaled, labels)
print(grid_search.best_params_)

test_classifier(grid_search.best_estimator_, my_dataset, features_list, folds = 300)


{'pca__n_components': 3, 'select__k': 15}
Pipeline(memory=None,
     steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('select', SelectKBest(k=15, score_func=<function f_classif at 0x000000000BC02208>)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', Pipeline(memory...'SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))]))])
	Accuracy: 0.83933	Precision: 0.22422	Recall: 0.08333	F1: 0.12151	F2: 0.09531
	Total predictions: 4500	True positives:   50	False positives:  173	False negatives:  550	True negatives: 3727



<h3>In summary, the AdaBoost classifier used in conjunction with L1-based feature selection had better recall and precision scores than Random Forest (although this was close) and than the alternative pipelines that used feature scaling, SelectKBest features, and principal component analysis (PCA).
<br><br>
The features obtained by the L1-based recursive feature elimination did OK, but not quite as well as the full set of original features, with or without the new feature indicating whether a person has an email address.
</h3>

In [77]:
print(adab_grid_search3.best_params_)
final_clf = adab_grid_search3.best_estimator_
final_clf.__dict__

{'clf__learning_rate': 0.45, 'select__estimator__tol': 0.0001, 'clf__n_estimators': 18}


{'memory': None,
 'steps': [('select',
   SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
             fit_intercept=True, intercept_scaling=1, max_iter=100,
             multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
             solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
           norm_order=1, prefit=False, threshold=None)),
  ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=0.45, n_estimators=18, random_state=42))]}

In [78]:
dump_classifier_and_data(final_clf, my_dataset, features_list)
pkl_clf, pkl_dataset, pkl_feature_list = load_classifier_and_data()
test_classifier(pkl_clf, pkl_dataset, pkl_feature_list)

Pipeline(memory=None,
     steps=[('select', SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_...hm='SAMME.R', base_estimator=None,
          learning_rate=0.45, n_estimators=18, random_state=42))])
	Accuracy: 0.87200	Precision: 0.52736	Recall: 0.38550	F1: 0.44541	F2: 0.40742
	Total predictions: 15000	True positives:  771	False positives:  691	False negatives: 1229	True negatives: 12309
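
The summary metrics printed above can be cross-checked directly from the four confusion counts. The following is a minimal sketch that assumes the standard definitions of accuracy, precision, recall and F-beta; the helper name `metrics_from_counts` is hypothetical and is not part of `tester.py`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Recompute summary metrics from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / float(tp + fp)   # share of predicted POIs that are real POIs
    recall = tp / float(tp + fn)      # share of real POIs that were found

    def f_beta(beta):
        # F-beta treats recall as beta times as important as precision
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / float(total),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


# Counts copied from the final test_classifier output above
final_metrics = metrics_from_counts(tp=771, fp=691, fn=1229, tn=12309)
for name in ("accuracy", "precision", "recall", "f1", "f2"):
    print("{}: {:.5f}".format(name, final_metrics[name]))
```

Rounded to five decimals, these values agree with the accuracy, precision, recall, F1 and F2 figures reported for the final classifier.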



<H2>A look at outliers and the top 1% and 5% of financial features</H2>

From the results below, we can see that the top 5% of salaries, bonuses, and total stock values contain a much higher percentage of POIs than
would be expected based on the baseline POI rate.  The top 1% is even more heavily skewed towards POIs for salary and total stock value, while bonus stays at 50%.
Loan advances and restricted stock show some signs of high POI skew at the 99th percentile level.

Looking at all the values displayed below, we do not see any obviously invalid outlier values.  Since the values that look
like outliers are strongly associated with POIs, it would make no sense to exclude them from the analysis, given that
there is no a priori reason to do so.

In [79]:
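def percent_poi_of_top_percent(feature, percentile):
    """Return the fraction (0.0 to 1.0) of POIs among the people whose value
    for `feature` is strictly above the given percentile of that feature."""
    # Gather the feature values, skipping missing entries stored as the string 'NaN'
    vals = [ data_dict[emp][feature] for emp in data_dict.keys() if data_dict[emp][feature] != 'NaN']
    vals = np.array(vals)
    print("Printing values for feature {}".format(feature))
    print(str(vals))
    thresh = np.percentile(vals, percentile)
    print("{} percentile threshold: {}".format(percentile, thresh))
    # POI flags (True/False) for everyone strictly above the threshold
    poi_for_feature = [ data_dict[emp]['poi'] for emp in data_dict.keys() if data_dict[emp][feature] != 'NaN' and data_dict[emp][feature] > thresh]
    print("Individuals over threshold are POI: {}".format(poi_for_feature))
    # Share of those individuals who are POIs
    return (sum(map(int, poi_for_feature)) + 0.0)/len(poi_for_feature)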

In [80]:
percent_poi_of_top_percent('salary', 95)

Printing values for feature salary
[ 365788  267102  170941       0  243293  267093       0  370448       0
  197091  130724  288589  248546  257486       0       0  288542  251654
  288558   63744       0  357091  271442       0       0  304110       0
  187922       0  213625  249201       0  231330       0  182245       0
  211788       0       0       0       0  224305  273746  339288  216582
  210500       0       0  272880     477       0  269076  428780  211844
       0  206121  174246  510364  365038       0  365163  162779       0
  236457       0 1072321  261516  329078       0  184899  192008  263413
  262663       0       0  374125  278601       0  199157       0   96840
   80818  213999  210692  222093  440698       0  240189  420636  275101
       0  314288   94941       0  239502 1111258       0       0       0
    6615  655037       0       0  404338       0  259996  317543       0
  201955  248146       0       0       0       0       0   76399  262788
       0  261809

0.375

In [90]:
percent_poi_of_top_percent('salary', 99)

Printing values for feature salary
[ 365788  267102  170941       0  243293  267093       0  370448       0
  197091  130724  288589  248546  257486       0       0  288542  251654
  288558   63744       0  357091  271442       0       0  304110       0
  187922       0  213625  249201       0  231330       0  182245       0
  211788       0       0       0       0  224305  273746  339288  216582
  210500       0       0  272880     477       0  269076  428780  211844
       0  206121  174246  510364  365038       0  365163  162779       0
  236457       0 1072321  261516  329078       0  184899  192008  263413
  262663       0       0  374125  278601       0  199157       0   96840
   80818  213999  210692  222093  440698       0  240189  420636  275101
       0  314288   94941       0  239502 1111258       0       0       0
    6615  655037       0       0  404338       0  259996  317543       0
  201955  248146       0       0       0       0       0   76399  262788
       0  261809

1.0

From the above two results we see that 37.5% of those above the 95th percentile of salaries are POIs, while less than 20% of the 
sample are POIs.  Also, 100% of those above the 99th percentile are POIs.  Thus, the "outliers" for salary are very likely to be POIs.  <b>Since there is nothing invalid about the salary values and they apparently have some predictive value, there is no reason to exclude the salary outliers</b>; indeed, this indicates that salary would be a good feature to include in our feature set for prediction.
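
The comparison against the baseline rate can be made explicit as a "lift" figure: the POI rate above the threshold divided by the POI rate in the whole sample. The sketch below illustrates this on hypothetical toy data, not the Enron values. The helpers `linear_percentile` and `poi_rate_above` are illustrative names; `linear_percentile` is a hand-written stand-in intended to mirror the default linear interpolation of `np.percentile`, and the rate is returned as `None` when nobody exceeds the threshold so that no division by zero can occur.

```python
def linear_percentile(values, q):
    """Percentile with linear interpolation between the closest ranks
    (intended to mirror the default behaviour of np.percentile)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def poi_rate_above(records, q):
    """Return (threshold, POI rate above threshold, baseline POI rate).

    `records` is a list of (value, is_poi) pairs. The rate above the
    threshold is None when nobody exceeds it.
    """
    thresh = linear_percentile([value for value, _ in records], q)
    top = [is_poi for value, is_poi in records if value > thresh]
    baseline = sum(is_poi for _, is_poi in records) / float(len(records))
    top_rate = sum(top) / float(len(top)) if top else None
    return thresh, top_rate, baseline


# Hypothetical toy data (NOT the Enron values): ten salaries, three POIs
toy = [(100, False), (200, False), (300, True), (400, False), (500, False),
       (600, False), (700, False), (800, False), (900, True), (1000, True)]

thresh, top_rate, baseline = poi_rate_above(toy, 80)
print("threshold: {:.1f}".format(thresh))
print("POI rate above threshold: {:.2f}".format(top_rate))
print("baseline POI rate: {:.2f}".format(baseline))
print("lift over baseline: {:.2f}".format(top_rate / baseline))
```

On this toy data both people above the 80th-percentile threshold are POIs, against a 30% baseline, which is the same kind of skew that the salary results show on the real data.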

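We see a similar pattern for bonus (50% POIs above both the 95th and 99th percentiles) and total stock value (62.5% and 100%), and a weaker one for restricted stock (25% and 50%) and loan advances (about 33% and 50%, based on only a few non-zero values).
There are no values shown below that stand out as being obviously invalid.  Note that values of 'NaN' were excluded.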

In [87]:
percent_poi_of_top_percent('bonus', 99)

Printing values for feature bonus
[ 600000 1200000  350000       0 1500000  325000       0 2600000       0
  400000       0  788750  850000  700000       0       0 1200000 1100000
  250000       0       0  850000 3100000       0       0 2000000       0
  250000       0 1000000  700000       0  700000       0  200000       0
 1700000       0       0       0       0  800000 1000000 8000000       0
  425000       0       0  750000       0       0  650000 1500000  200000
       0  600000       0 3000000 1100000       0 3000000  100000       0
  200000       0 7000000  750000  750000       0  325000  509870  900000
  700000       0       0 1150000 1350000       0  350000       0       0
       0 5249999  750000       0 1300000       0 1250000 1750000  400000
       0  800000       0       0  500000 5600000       0       0       0
       0  300000       0       0 1000000       0  325000  450000       0
 4175000  600000       0       0       0       0       0  100000 1000000
       0  300000 

0.5

In [81]:
percent_poi_of_top_percent('bonus', 95)

Printing values for feature bonus
[ 600000 1200000  350000       0 1500000  325000       0 2600000       0
  400000       0  788750  850000  700000       0       0 1200000 1100000
  250000       0       0  850000 3100000       0       0 2000000       0
  250000       0 1000000  700000       0  700000       0  200000       0
 1700000       0       0       0       0  800000 1000000 8000000       0
  425000       0       0  750000       0       0  650000 1500000  200000
       0  600000       0 3000000 1100000       0 3000000  100000       0
  200000       0 7000000  750000  750000       0  325000  509870  900000
  700000       0       0 1150000 1350000       0  350000       0       0
       0 5249999  750000       0 1300000       0 1250000 1750000  400000
       0  800000       0       0  500000 5600000       0       0       0
       0  300000       0       0 1000000       0  325000  450000       0
 4175000  600000       0       0       0       0       0  100000 1000000
       0  300000 

0.5

In [82]:
percent_poi_of_top_percent('total_stock_value', 95)

Printing values for feature total_stock_value
[  585062 10623258  6678735  1038185  6391065   208510   955873  1662855
  7256648   880290  2282768        0   954354   698920  2218275   372205
   698242  1416848   725735   384930  1030329  5898997   547143        0
   -44093  2072035        0   659249        0  1843816  1918887    98718
   126027  2217299  1008941        0   441096   189518   850477   151418
   758931   985032   360528  5167144  2493616  2027865        0        0
   877611  5243487   371750   987001  3128982  2493616   412878   159211
  1034346  6079137  3101279    47304  3614261  1362375   139130  3064208
        0 49110078   417619  2606763  1691366   207940   318607   947861
   668132   759557  1945360   803094   252055    85641  1621236   221141
  7890324  1599641  1110705  1640910  4817796  1794412   343434   126027
 22542539   976037        0   495633  7307594        0   511734 26093672
        0  1832468  1095040        0    28798        0   368705  6153642
  188

0.625

In [88]:
percent_poi_of_top_percent('total_stock_value', 99)

Printing values for feature total_stock_value
[  585062 10623258  6678735  1038185  6391065   208510   955873  1662855
  7256648   880290  2282768        0   954354   698920  2218275   372205
   698242  1416848   725735   384930  1030329  5898997   547143        0
   -44093  2072035        0   659249        0  1843816  1918887    98718
   126027  2217299  1008941        0   441096   189518   850477   151418
   758931   985032   360528  5167144  2493616  2027865        0        0
   877611  5243487   371750   987001  3128982  2493616   412878   159211
  1034346  6079137  3101279    47304  3614261  1362375   139130  3064208
        0 49110078   417619  2606763  1691366   207940   318607   947861
   668132   759557  1945360   803094   252055    85641  1621236   221141
  7890324  1599641  1110705  1640910  4817796  1794412   343434   126027
 22542539   976037        0   495633  7307594        0   511734 26093672
        0  1832468  1095040        0    28798        0   368705  6153642
  188

1.0

In [83]:
percent_poi_of_top_percent('restricted_stock', 95)

Printing values for feature restricted_stock
[  585062  3942714  1788391   386335   853064   208510   462384   558801
  2046079   409554        0        0   189041   698920        0   153686
   698242   360528   540672   384930        0  1552453   466101    32460
        0   630137        0   659249        0   378082   283649        0
   126027  2217299   407503        0   441096   662086        0   151418
    94556   985032   360528  1008149   869220   315068        0        0
   441096  1757552        0   379164  1293424   869220        0   141833
  1034346  2796177  1478269    47304  1323148        0        0   514847
        0 14761694   417619   969729   934065   207940   235370   441096
   480632        0   264013   524169   252055    75838   956775   161602
   381285        0   157569   189041   365320  1794412        0   126027
  2748364   126027        0   378082  2041016        0   511734  6843672
        0   405999   208809        0        0        0   463261  4131594
   560

0.25

In [89]:
percent_poi_of_top_percent('restricted_stock', 99)

Printing values for feature restricted_stock
[  585062  3942714  1788391   386335   853064   208510   462384   558801
  2046079   409554        0        0   189041   698920        0   153686
   698242   360528   540672   384930        0  1552453   466101    32460
        0   630137        0   659249        0   378082   283649        0
   126027  2217299   407503        0   441096   662086        0   151418
    94556   985032   360528  1008149   869220   315068        0        0
   441096  1757552        0   379164  1293424   869220        0   141833
  1034346  2796177  1478269    47304  1323148        0        0   514847
        0 14761694   417619   969729   934065   207940   235370   441096
   480632        0   264013   524169   252055    75838   956775   161602
   381285        0   157569   189041   365320  1794412        0   126027
  2748364   126027        0   378082  2041016        0   511734  6843672
        0   405999   208809        0        0        0   463261  4131594
   560

0.5

In [84]:
percent_poi_of_top_percent('loan_advances', 95)

Printing values for feature loan_advances
[       0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0 81525000        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0   400000        0        0        0
        0

0.3333333333333333

In [85]:
percent_poi_of_top_percent('loan_advances', 99)

Printing values for feature loan_advances
[       0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0 81525000        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0        0        0        0        0
        0        0        0        0   400000        0        0        0
        0

0.5