# Machine Learning With Enron Corpus

## Goal: Use Machine Learning to Assess Fraud from Financial and Email Data 

Assessing fraud cases for any company is a tedious task that requires analyses across a vast amount of data. Fortunately most companies already have access to the data they need to combat fraud: emails and financials. However, these datasets are very large; it is extremely difficult to cipher through all of the data by hand. Having a machine assist and find a potential, fraudulent person of interest (POI) would save both time and money.

The email data set tested is the Enron Corpus, which can be found at (https://www.cs.cmu.edu/~./enron/). Enron was one of the top energy companies in the US in the early 2000s. The company eventually filed Bankruptcy largely due to fraudulent cases of insider trading and accounting scandals. It is one of the most noteworthy cases of fraud in the 20th century. Because of the scale of the company and the size of the email database, the Enron Corpus is a prime dataset for finding clues of fraud via financials and email.

In [1]:
# !/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from pprint import pprint
import numpy as np



## Explore Data and Find Important Features

In [2]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".

all_features = ['poi', 'loan_advances', 'director_fees', 'restricted_stock_deferred',
               'deferral_payments', 'deferred_income', 'long_term_incentive', 'bonus', 
               'from_poi_to_this_person', 'shared_receipt_with_poi', 'to_messages',
               'from_this_person_to_poi', 'to_messages', 'from_this_person_to_poi',
               'from_messages', 'other', 'expenses', 'salary', 'exercised_stock_options',
               'restricted_stock', 'total_payments', 'total_stock_value', 'email_address']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [3]:
# Find number of data points and total POI
poi_list = list()

for name in data_dict:
    if data_dict[name]['poi']:
        poi_list.append(name)

print "Total Data Points:", len(data_dict)
print "Total POIs:", len(poi_list)

Total Data Points: 146
Total POIs: 18


In [4]:
import pandas as pd
data_dict = pickle.load(open("final_project_dataset.pkl", "r"))
df = pd.DataFrame.from_dict(data_dict, orient='index', dtype = np.float)
percent_nan_list = df.isnull().sum() / (df.isnull().sum() + df.notnull().sum())
print "\nPercent NaN Values:\n", percent_nan_list.sort_values(ascending = False)


Percent NaN Values:
loan_advances                0.972603
director_fees                0.883562
restricted_stock_deferred    0.876712
deferral_payments            0.732877
deferred_income              0.664384
long_term_incentive          0.547945
bonus                        0.438356
from_poi_to_this_person      0.410959
shared_receipt_with_poi      0.410959
to_messages                  0.410959
from_this_person_to_poi      0.410959
from_messages                0.410959
other                        0.363014
expenses                     0.349315
salary                       0.349315
exercised_stock_options      0.301370
restricted_stock             0.246575
total_payments               0.143836
total_stock_value            0.136986
email_address                0.000000
poi                          0.000000
dtype: float64


Initial exploration of the dataset revealed 146 data points with 18 POIs. Every data point consisted of features as mentioned above such as: salary, total_payment, to_messages, and expenses. 

Most features in the Enron Corpus contain NaN values, and these NaN values make up greater than 40% of most features. NaN values represent a lack of information and weaken the overall influence and accuracy of a feature when testing for fraud in the database. There are multiple methods to handle NaN values; in this project NaN values were changed to be either the mean or median by use of a GridCV object and the Imputer function.

In [5]:
features_list = ['poi', 'salary', 'total_payments','bonus', 'deferred_income','total_stock_value',
                'expenses', 'exercised_stock_options', 'other','long_term_incentive',
                'restricted_stock', 'to_messages','from_poi_to_this_person',
                'from_messages', 'from_this_person_to_poi','shared_receipt_with_poi',
                'email_poi_score']

Digging into the dataset revealed not all entries were people's names: 'SKILLING JEFFREY K' or 'LAY KENNETH L'; however both 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' were listed as if they were actual people that worked for Enron. Because both were not living beings, I decided to remove them from calculations. 

In [6]:
### Task 2: Remove outliers
del data_dict['TOTAL']
del data_dict['THE TRAVEL AGENCY IN THE PARK']

After removing the outliers, three additional features were calculated but were later found to be of minimal importance: percent_to_poi, percent_from_poi, and email_poi_score. 

The thought was to assign a score to each person signifying the capacity they were in contact with a POI. The score was calculated by summing a person's percent of emails _to_ or _from_ a POI. Again, this yielded no significant gain and was discarded as a feature.

In [7]:
### Task 3: Create new feature(s)

def normalize_feature(feature, data_dict):
    # initialize high and low value for normalization function
    value_high = None
    value_low = None

    # loop through persons to find high and low values for features
    for person in data_dict:
        value = data_dict[person][feature]
        if value != 'NaN':
            # If first value in feature then assign value to variables
            if value_low == None:
                value_high = value
                value_low = value
            # look to see if value is higher or lower
            if value > value_high:
                value_high = value
            elif value < value_low:
                value_low = value

    # loop to assign normalization value
    for person in data_dict:
        value = float(data_dict[person][feature])
        # if value exists between high and low
        if (value_high >= value) and (value_low <= value):
            # if denominator isn't zero
            if value_high != value_low:
                value_norm = (value - value_low) / (value_high - value_low)
                data_dict[person][feature] = value_norm
            
# find percent emails sent to poi and percent from poi to this person
for person in data_dict:
    from_messages = data_dict[person]['from_messages']
    to_messages = data_dict[person]['to_messages']
    from_poi = data_dict[person]['from_poi_to_this_person']
    to_poi = data_dict[person]['from_this_person_to_poi']
    
    # Initialize all email_poi_score as 'NaN'
    data_dict[person]['email_poi_score'] = 'NaN'

    percent_to = float(to_poi) / float(from_messages)
    percent_from = float(from_poi) / float(to_messages)

    data_dict[person]['percent_to_poi'] = percent_to
    data_dict[person]['percent_from_poi'] = percent_from

# normailize percent_to_poi and percent_from_poi and add together
normalize_feature('percent_to_poi', data_dict)
normalize_feature('percent_from_poi', data_dict)

# add normalized percent_to_poi and percent_from_poi to create email_poi_score
for person in data_dict:
    percent_to_norm = data_dict[person]['percent_to_poi']
    percent_from_norm = data_dict[person]['percent_from_poi']

    email_poi_score = percent_to_norm + percent_from_norm
    if email_poi_score >= 0:
        data_dict[person]['email_poi_score'] = email_poi_score
        
# normalize 'email_poi_score'
normalize_feature('email_poi_score', data_dict)

In [8]:
### Store to my_dataset for easy export below.
my_dataset = data_dict

##### Hand-picked Feature Selection

Feature selection was initially hand-picked from visual aide via https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard. Features were chosen based on visual clumping of POIs and non-POIs. The number of features were chosen somewhat arbitrarily; only features that appeared to have a strong visual clumping were chosen.

Hand-picked features for determining POI:
* exercised_stock_options (high values ~ POI)
* deferred_income (low values ~ POI)
* expenses (low values ~ not POI)

In [9]:
features_handpicked = ['poi', 'exercised_stock_options', 'deferred_income', 'expenses']

##### SelectKBest Feature Selection

Features were also chosen using SelectKBest. Because top features selected from SelectKBest can change depending on the randomness of training and testing the data, a tally was taken to determine which features appear in the top 3 features the most over 1000 trials. The idea behind only choosing the top 3 is only 3 features were chosen for the hand-picked test.

SelectKBest features for determining POI:
* exercised_stock_option
* total_stock_value
* bonus


In [10]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn import cross_validation
from operator import add
from heapq import nlargest

# Run loop to find how many times a feature occurs in the top 3
best_features = [0] * (len(features_list)-1)
for i in range(1000):
    # Create features and training labels
    labels, features = targetFeatureSplit(data)
    features_train, features_test, labels_train, labels_test = \
        cross_validation.train_test_split(features, labels, test_size=0.33)

    # Generate SelectKBest with k=3 features
    selector = SelectKBest(f_classif, k=3)
    selector.fit(features_train, labels_train)
    
    # Increase score of feature if it appears in the top 3   
    best_features = selector.get_support().astype(int) + best_features

    
print "In top 3:\n", best_features
# Print the top 3 features scored by which features appeared most in top 3
features_kbest = ['poi']
for e in nlargest(3, best_features):
    for index in range(len(best_features)):
        if e == best_features[index]:
            top_feature = features_list[index+1]
            if top_feature not in features_kbest:
                features_kbest.append(top_feature)

print "\nTop 3 features:\n", features_kbest        

In top 3:
[363  12 552 169 714  33 722   5 119  73   0  15   0   9  40 174]

Top 3 features:
['poi', 'exercised_stock_options', 'total_stock_value', 'bonus']


## Classifiers

To test for optimal training from the data multiple classifiers are used:
* Decision Tree
* Random Forest
* Extra Trees
* SVMs
* GaussianNB

Classifiers were tested for high scores in precision, recall, and f1. Precision is the percent of people the classifier accurately labeled as a POI. Recall is the percent of actual POIs that were correctly classified. The F1 score relates both precion and recall. F1 scores are calculated by this equation:

$$ F1 = 2 * (precision * recall)  /  (precision + recall) $$

Only classifiers with precsion and recall scores greater than or equal to .33 will be considered for the best overall classifier. Any of those best classifiers will then be ranked by highest f1 score.

During initial testings I noticed a very high variation of all scores; classifiers that yielded high precision and recall scores may have low scores on the next test. To counteract the high deviation of values, I wrote a test function to generate 1000 classifiers of each type and compare their average scores.


In [11]:
"""### Task 4: Try a varity of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

from time import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from pprint import pprint



# Create test to find average precision and recall scores
def test_prec_recall(name, clf_choice, features_list):
    precision_list = list()
    recall_list = list()
    for i in range(1000):
        ### Extract features and labels from dataset for local testing
        data = featureFormat(data_dict, features_list, sort_keys = True)
        # Create labels and features
        labels, features = targetFeatureSplit(data)

        # transform into np.array for StratifiedShuffleSplit
        features = np.array(features)
        labels = np.array(labels)

        # Shuffle and split data into training/testing sets
        sss = StratifiedShuffleSplit()
        for train_index, test_index in sss.split(features, labels):
            features_train, features_test = features[train_index], features[test_index]
            labels_train, labels_test = labels[train_index], labels[test_index]

        # Create, fit, and predict classifier
        clf = clf_choice
        clf.fit(features_train, labels_train)
        labels_pred = clf.predict(features_test)

        try:
            precision = precision_score(labels_test, labels_pred)
            recall = recall_score(labels_test, labels_pred)
            precision_list.append(precision)
            recall_list.append(recall)
        except:
            pass
    
    # F score is calculated via the mean precision and recall scores
    p_score = np.mean(precision_list)
    r_score = np.mean(recall_list)
    f_score = 2 * (p_score * r_score) / (p_score + r_score)
    
    print "\n" + "#" * 60
    print " " * 20 + name + "\n"
    print "Precision Mean Score: ", p_score
    print "Recall Mean Score: ", r_score
    print "F Score: ", f_score
    print "\n" + "#" * 60    """

'### Task 4: Try a varity of classifiers\n### Please name your classifier clf for easy export below.\n### Note that if you want to do PCA or other multi-stage operations,\n### you\'ll need to use Pipelines. For more info:\n### http://scikit-learn.org/stable/modules/pipeline.html\n\nfrom time import time\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import classification_report\nfrom sklearn.metrics import confusion_matrix\nfrom sklearn.metrics import precision_score\nfrom sklearn.metrics import recall_score\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.model_selection import StratifiedShuffleSplit\nfrom pprint import pprint\n\n\n\n# Create test to find average precision and recall scores\ndef test_prec_recall(name, clf_choice, features_list):\n    precision_list = list()\n    recall_list = list()\n    for i in range(1000):\n        ### Extract features and labels from dataset for local testing\n        data = featureFormat(data_dict, features_l

## Classifiers (Features List)

##### Classifiers: Decision Tree

In [12]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif

pipeline_dt = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', DecisionTreeClassifier()),
])

param_grid_dt = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    
]        

cross_validator = StratifiedShuffleSplit()


from sklearn.model_selection import GridSearchCV
gridCV_object_dt = GridSearchCV(estimator = pipeline_dt,
                                param_grid = param_grid_dt,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_dt.fit(features, labels)

# get the best estimator
pipeline_clf_dt_af = gridCV_object_dt.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_dt_af, my_dataset, features_list)

  'precision', 'predicted', average, warn_for)


Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)), (...it=5, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.79160	Precision: 0.18547	Recall: 0.16600	F1: 0.17520	F2: 0.16956
	Total predictions: 15000	True positives:  332	False positives: 1458	False negatives: 1668	True negatives: 11542



##### Classifiers: Random Forest 

In [14]:
from sklearn.ensemble import RandomForestClassifier

pipeline_rf = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_rf = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_rf = GridSearchCV(estimator = pipeline_rf,
                                param_grid = param_grid_rf,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_rf.fit(features, labels)

# get the best estimator
pipeline_clf_rf_af = gridCV_object_rf.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_rf_af, my_dataset, features_list)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)), (...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.83867	Precision: 0.29290	Recall: 0.14850	F1: 0.19708	F2: 0.16474
	Total predictions: 15000	True positives:  297	False positives:  717	False negatives: 1703	True negatives: 12283



##### Classifiers: Extra Trees 

In [15]:
from sklearn.ensemble import ExtraTreesClassifier

pipeline_et = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', ExtraTreesClassifier()),
])

param_grid_et = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_et = GridSearchCV(estimator = pipeline_et,
                                param_grid = param_grid_et,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_et.fit(features, labels)

# get the best estimator
pipeline_clf_et_af = gridCV_object_et.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_et_af, my_dataset, features_list)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)),...timators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])
	Accuracy: 0.87060	Precision: 0.68553	Recall: 0.05450	F1: 0.10097	F2: 0.06680
	Total predictions: 15000	True positives:  109	False positives:   50	False negatives: 1891	True negatives: 12950



##### Classifiers: SVMs 

In [16]:
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

pipeline_svc = Pipeline([
    ('imp', Imputer()),
    ('minmaxscaler', MinMaxScaler()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', SVC()),
])

param_grid_svc = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [4],
        'skb__k': [1,2,3,4],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [5],
        'skb__k': [1,2,3,4,5],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    }
]

cross_validator = StratifiedShuffleSplit()

gridCV_object_svc = GridSearchCV(estimator = pipeline_svc,
                                 param_grid = param_grid_svc,
                                 scoring = 'f1',
                                 cv = cross_validator)

# fit the data
gridCV_object_svc.fit(features_train, labels_train)


# get the best estimator
pipeline_clf_svc_af = gridCV_object_svc.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_svc_af, my_dataset, features_list)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', Sele...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.85980	Precision: 0.00952	Recall: 0.00050	F1: 0.00095	F2: 0.00062
	Total predictions: 15000	True positives:    1	False positives:  104	False negatives: 1999	True negatives: 12896



##### Classifiers: Naive Bayes

In [17]:
from sklearn.naive_bayes import GaussianNB

pipeline_nb = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_nb = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [4],
        'skb__k': [1,2,3,4],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [5],
        'skb__k': [1,2,3,4,5],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [6],
        'skb__k': [1,2,3,4,5,6],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_nb = GridSearchCV(estimator = pipeline_nb,
                                param_grid = param_grid_nb,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_nb.fit(features, labels)

# get the best estimator
pipeline_clf_nb_af = gridCV_object_nb.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_nb_af, my_dataset, features_list)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=5, score_func=<function f_classif at 0x7f23404d3b18>)), (...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.84567	Precision: 0.34664	Recall: 0.17800	F1: 0.23522	F2: 0.19719
	Total predictions: 15000	True positives:  356	False positives:  671	False negatives: 1644	True negatives: 12329



## Classifiers (Hand-Picked)

##### Classifiers: Decision Tree (Hand-Picked)

In [18]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_handpicked, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [19]:
pipeline_dt = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', DecisionTreeClassifier()),
])

param_grid_dt = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }       
]        

cross_validator = StratifiedShuffleSplit()


from sklearn.model_selection import GridSearchCV
gridCV_object_dt = GridSearchCV(estimator = pipeline_dt,
                                param_grid = param_grid_dt,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_dt.fit(features, labels)

# get the best estimator
pipeline_clf_dt_hp = gridCV_object_dt.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_dt_hp, my_dataset, features_handpicked)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=1, score_func=<function f_classif at 0x7f23404d3b18>)),...it=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.88243	Precision: 0.83022	Recall: 0.22250	F1: 0.35095	F2: 0.26066
	Total predictions: 14000	True positives:  445	False positives:   91	False negatives: 1555	True negatives: 11909



##### Classifiers: Random Forest (Hand-Picked)

In [20]:
pipeline_rf = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_rf = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


gridCV_object_rf = GridSearchCV(estimator = pipeline_rf,
                                param_grid = param_grid_rf,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_rf.fit(features, labels)

# get the best estimator
pipeline_clf_rf_hp = gridCV_object_rf.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_rf_hp, my_dataset, features_handpicked)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)),...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.84171	Precision: 0.40592	Recall: 0.23300	F1: 0.29606	F2: 0.25470
	Total predictions: 14000	True positives:  466	False positives:  682	False negatives: 1534	True negatives: 11318



##### Classifiers: Extra Trees (Hand-Picked)

In [21]:
pipeline_et = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', ExtraTreesClassifier()),
])

param_grid_et = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]              

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_etsk = GridSearchCV(estimator = pipeline_et,
                                param_grid = param_grid_et,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_et.fit(features, labels)

# get the best estimator
pipeline_clf_et_hp = gridCV_object_et.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_et_hp, my_dataset, features_handpicked)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)),...timators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])
	Accuracy: 0.85243	Precision: 0.46145	Recall: 0.19750	F1: 0.27661	F2: 0.22301
	Total predictions: 14000	True positives:  395	False positives:  461	False negatives: 1605	True negatives: 11539



##### Classifiers: SVMs (Hand-Picked)

In [22]:
pipeline_svc = Pipeline([
    ('imp', Imputer()),
    ('minmaxscaler', MinMaxScaler()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', SVC()),
])

param_grid_svc = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    }
]

cross_validator = StratifiedShuffleSplit()

gridCV_object_svc = GridSearchCV(estimator = pipeline_svc,
                                 param_grid = param_grid_svc,
                                 scoring = 'f1',
                                 cv = cross_validator)

# fit the data
gridCV_object_svc.fit(features_train, labels_train)


# get the best estimator
pipeline_clf_svc_hp = gridCV_object_svc.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_svc_hp, my_dataset, features_handpicked)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', Sele...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.85107	Precision: 0.03297	Recall: 0.00150	F1: 0.00287	F2: 0.00185
	Total predictions: 14000	True positives:    3	False positives:   88	False negatives: 1997	True negatives: 11912



##### Classifiers: Naive Bayes (Hand-Picked)

In [23]:
pipeline_nb = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_nb = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_nb = GridSearchCV(estimator = pipeline_nb,
                                param_grid = param_grid_nb,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_nb.fit(features, labels)

# get the best estimator
pipeline_clf_nb_hp = gridCV_object_nb.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_nb_hp, my_dataset, features_handpicked)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)),...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.86929	Precision: 0.64026	Recall: 0.19400	F1: 0.29777	F2: 0.22542
	Total predictions: 14000	True positives:  388	False positives:  218	False negatives: 1612	True negatives: 11782



### Classifiers: Hand-picked Summary

Both the Decison Tree and Extra Trees classifiers are close to meeting the criteria of having precision and recall scores greater than .33. However, the Extra Trees Classifier has a higher precision and higher f score compared to Decision Tree.

The **Extra Trees classifier** is the best of the **hand-picked** feature testing with ** precision = .398**, ** recall = .312**, and **F score = .350**.

## Classifiers (SelectKBest Features)

##### Classifiers: Decision Tree (SelectKBest)

In [24]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_kbest, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [25]:
pipeline_dt = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', DecisionTreeClassifier()),
])

param_grid_dt = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }       
]        

cross_validator = StratifiedShuffleSplit()


from sklearn.model_selection import GridSearchCV
gridCV_object_dt = GridSearchCV(estimator = pipeline_dt,
                                param_grid = param_grid_dt,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_dt.fit(features, labels)

# get the best estimator
pipeline_clf_dt_sk = gridCV_object_dt.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_dt_sk, my_dataset, features_kbest)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)),...it=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.86885	Precision: 0.59698	Recall: 0.45400	F1: 0.51576	F2: 0.47684
	Total predictions: 13000	True positives:  908	False positives:  613	False negatives: 1092	True negatives: 10387



##### Classifiers: Random Forest (SelectKBest)

In [26]:
pipeline_rf = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_rf = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


gridCV_object_rf = GridSearchCV(estimator = pipeline_rf,
                                param_grid = param_grid_rf,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_rf.fit(features, labels)

# get the best estimator
pipeline_clf_rf_sk = gridCV_object_rf.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_rf_sk, my_dataset, features_kbest)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=3, score_func=<function f_classif at 0x7f23404d3b18>)),...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.84662	Precision: 0.50311	Recall: 0.24300	F1: 0.32771	F2: 0.27102
	Total predictions: 13000	True positives:  486	False positives:  480	False negatives: 1514	True negatives: 10520



##### Classifiers: Extra Trees (SelectKBest)

In [27]:
pipeline_et = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', ExtraTreesClassifier()),
])

param_grid_et = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]              

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_etsk = GridSearchCV(estimator = pipeline_et,
                                param_grid = param_grid_et,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_et.fit(features, labels)

# get the best estimator
pipeline_clf_et_sk = gridCV_object_et.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_et_sk, my_dataset, features_kbest)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)), (...timators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])
	Accuracy: 0.84646	Precision: 0.50195	Recall: 0.25800	F1: 0.34082	F2: 0.28578
	Total predictions: 13000	True positives:  516	False positives:  512	False negatives: 1484	True negatives: 10488



##### Classifiers: SVMs (SelectKBest)

In [28]:
pipeline_svc = Pipeline([
    ('imp', Imputer()),
    ('minmaxscaler', MinMaxScaler()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', SVC()),
])

param_grid_svc = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__C': [10,50,100],
        'clf__kernel': ['rbf', 'poly'],
    }
]

cross_validator = StratifiedShuffleSplit()

gridCV_object_svc = GridSearchCV(estimator = pipeline_svc,
                                 param_grid = param_grid_svc,
                                 scoring = 'f1',
                                 cv = cross_validator)

# fit the data
gridCV_object_svc.fit(features_train, labels_train)


# get the best estimator
pipeline_clf_svc_sk = gridCV_object_svc.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_svc_sk, my_dataset, features_kbest)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('pca', PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', Sele...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.86062	Precision: 0.77811	Recall: 0.13150	F1: 0.22498	F2: 0.15771
	Total predictions: 13000	True positives:  263	False positives:   75	False negatives: 1737	True negatives: 10925



##### Classifiers: Naive Bayes (SelectKBest)

In [29]:
pipeline_nb = Pipeline([
    ('imp', Imputer()),
    ('pca', PCA()),
    ('skb', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier()),
])

param_grid_nb = [
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [1],
        'skb__k': [1],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [2],
        'skb__k': [1,2],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    },
    {
        'imp__strategy': ['median', 'mean'],
        'pca__n_components': [3],
        'skb__k': [1,2,3],
        'clf__min_samples_split': [2,3,5],
        'clf__max_depth': [None,2,3],
    }
]        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_nb = GridSearchCV(estimator = pipeline_nb,
                                param_grid = param_grid_nb,
                                scoring = 'f1',
                                cv=cross_validator)

# fit the data
gridCV_object_nb.fit(features, labels)

# get the best estimator
pipeline_clf_nb_sk = gridCV_object_nb.best_estimator_

# test results
from tester import test_classifier
test_classifier(pipeline_clf_nb_sk, my_dataset, features_kbest)

Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('pca', PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('skb', SelectKBest(k=2, score_func=<function f_classif at 0x7f23404d3b18>)), (...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])
	Accuracy: 0.84669	Precision: 0.50379	Recall: 0.23250	F1: 0.31817	F2: 0.26056
	Total predictions: 13000	True positives:  465	False positives:  458	False negatives: 1535	True negatives: 10542



### Classifiers: SelectKBest Summary

The Decision Tree classifier completely exceeds all expections and is the clear winner of the selectkbest feature selection. The **Decision Tree** classifier with **selectkbest** attained scores of **precision = .418**, **recall = .416** and **F score = .4023**.

#### Classifiers: Hand-picked vs. SelectKBest Summary

The classifier chosen as best across both hand-picked and selectkbest is the Decision Tree classifier with the feature selection from selectkbest. Again, the selectkbest features are:
* exercised_stock_option
* total_stock_value
* bonus


In [30]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33)

### Validation and Parameter Tuning

##### Validation
Data validation is the series of steps it takes to make sure the data is clean and useful. Dirty data (being mislabeled or unaccurate) leads to inconclusive results no matter the outcome. 

Validation in machine learning is usually overcome by splitting the data into training and testing sets. A training size too large can overfit the classifier causing low testing results, whereas a trianing size too low can underfit the classfier, again, causing low testing results.  

The Enron Corpus was split into both training (33%) and testing (67%) via function train_test_split with 'test_size = .33'. This essentially trains the data multiple times over different partitions of the dataset using the leftover data for testing and scoring.

##### Parameter Tuning
As discussed earlier, parameters of each classifier were tuned using both a pipeline and creating a gridCV object. The gridCV object tests multiple lists of parameters and returns the parameters that maximize a scoring function. In this case all parameters were tuned to maximize F1 scores.

### Summary

The Enron Corpus is one of the largest datasets on fraud. Although the dataset isn't vast, a Decision Tree classifier appears to be a strong option in predicting fraud. 

The most important features to analyze for attempting fraud are most likely:
* exercised stock options
* total stock
* bonus

The afformetioned features combined with a Decision Tree classifier yield precision, recall, and f1 scores close to .4. Additional/different features for higher scores are desired; the current scores appear mediocre but the classifier has potential to be a starting point for detecting fraud.  

#### References

Feature Visualization:
https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard

Forum Postings:
https://discussions.udacity.com/t/getting-started-with-final-project/170846

When to chose which machine learning classifier:
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier

GridCV and Pipeline testing:
https://discussions.udacity.com/t/webcast-builidng-models-with-gridsearchcv-and-pipelines-thursday-11-feb-2015-at-6pm-pacific-time/47412

In [31]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(gridCV_object_dtsk.best_estimator_, my_dataset, features_kbest)

NameError: name 'gridCV_object_dtsk' is not defined