# Machine Learning With Enron Corpus

## Goal: Use Machine Learning to Assess Fraud from Financial and Email Data 

Assessing fraud at any company is a tedious task that requires analysis across a vast amount of data. Fortunately, most companies already have access to the data they need to combat fraud: emails and financial records. However, these datasets are very large, and sifting through all of the data by hand is extremely difficult. Having a machine help identify a potentially fraudulent person of interest (POI) would save both time and money.

The email dataset tested is the Enron Corpus, which can be found at https://www.cs.cmu.edu/~./enron/. Enron was one of the top energy companies in the US in the early 2000s. The company eventually filed for bankruptcy, largely due to insider trading and accounting fraud. It is one of the most noteworthy fraud cases of the early 21st century. Because of the scale of the company and the size of the email database, the Enron Corpus is a prime dataset for finding clues of fraud via financials and email.

In [1]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from pprint import pprint
import numpy as np



## Explore Data and Find Important Features

In [2]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".

all_features = ['poi', 'salary', 'deferral_payments', 'total_payments','loan_advances',
                'bonus', 'restricted_stock_deferred', 'deferred_income','total_stock_value',
                'expenses', 'exercised_stock_options', 'other','long_term_incentive',
                'restricted_stock', 'director_fees', 'to_messages','from_poi_to_this_person',
                'from_messages', 'from_this_person_to_poi','shared_receipt_with_poi',
                'email_poi_score']

features_list = ['poi', 'exercised_stock_options', 'total_stock_value', 'bonus']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [3]:
# Find number of data points and total POI
poi_list = list()

for name in data_dict:
    if data_dict[name]['poi']:
        poi_list.append(name)

print "Total Data Points:", len(data_dict)
print "Total POIs:", len(poi_list)

Total Data Points: 146
Total POIs: 18


Initial exploration of the dataset revealed 146 data points with 18 POIs. Each data point carries the features listed above, such as salary, total_payments, to_messages, and expenses. Many of the features contained NaN values; these values were either removed or set to median/mean values when running calculations.
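
To see how widespread the missing values are, one can count them per feature. The sketch below is a minimal illustration on a toy dictionary shaped like `data_dict`; the names and numbers are hypothetical, and `count_missing` is a helper introduced here, not part of the project code. It relies only on the fact that missing values are stored as the string `'NaN'`.

```python
# Toy stand-in for data_dict; names and values are hypothetical.
toy_data = {
    "PERSON A": {"poi": False, "salary": 200000, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": "NaN", "bonus": 500000},
    "PERSON C": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}


def count_missing(data):
    """Return {feature: number of records whose value is the string 'NaN'}."""
    counts = {}
    for record in data.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value == "NaN":
                counts[feature] += 1
    return counts


missing = count_missing(toy_data)
```

Running the same helper on the real `data_dict` would show which features are too sparse to be useful.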

Digging into the dataset revealed that not all entries were people. Most keys are names such as 'SKILLING JEFFREY K' or 'LAY KENNETH L'; however, both 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' were listed as if they were people who worked for Enron. Because neither is a person, I removed them from the calculations.
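
One way such entries can surface is by ranking records on a single financial feature and inspecting the top of the list. The sketch below assumes that 'TOTAL' is an aggregate row, which is a guess rather than something the dataset documents; the toy salaries are hypothetical and `rank_by_feature` is a helper introduced only for this illustration.

```python
# Hypothetical toy records; 'TOTAL' is modeled as the sum of the other salaries.
toy = {
    "PERSON A": {"salary": 250000},
    "PERSON B": {"salary": 400000},
    "PERSON C": {"salary": "NaN"},
    "TOTAL": {"salary": 650000},
}


def rank_by_feature(data, feature):
    """Return (name, value) pairs sorted from largest to smallest, skipping 'NaN'."""
    pairs = [(name, record[feature])
             for name, record in data.items()
             if record[feature] != "NaN"]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_feature(toy, "salary")
top_name = ranked[0][0]  # the record that stands out from the rest
```

An entry that dwarfs every other record is worth checking by hand before deciding whether to delete it.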

In [4]:
### Task 2: Remove outliers
del data_dict['TOTAL']
del data_dict['THE TRAVEL AGENCY IN THE PARK']

After removing the outliers, three additional features were calculated but were later found to be of minimal importance: percent_to_poi, percent_from_poi, and email_poi_score. 

The idea was to assign each person a score reflecting how much they were in contact with POIs. The score was calculated by summing the person's normalized percentages of emails sent _to_ and received _from_ a POI. Again, this yielded no significant gain and was discarded as a feature.
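
The scoring idea can be sketched compactly on toy data. The email counts below are hypothetical, and `min_max` is a simplified helper introduced here: unlike the notebook's `normalize_feature`, it works on plain lists and returns zeros when all values are equal.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; return zeros if all values are equal."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / float(high - low) for v in values]


# Hypothetical counts per person:
# (sent to POI, total sent, received from POI, total received)
toy_counts = {
    "PERSON A": (10, 100, 5, 50),
    "PERSON B": (0, 40, 0, 80),
    "PERSON C": (30, 60, 20, 100),
}

names = sorted(toy_counts)
percent_to = [toy_counts[n][0] / float(toy_counts[n][1]) for n in names]
percent_from = [toy_counts[n][2] / float(toy_counts[n][3]) for n in names]

# Score = normalized share sent to POIs + normalized share received from POIs
normalized_sum = [a + b for a, b in zip(min_max(percent_to), min_max(percent_from))]
scores = dict(zip(names, normalized_sum))
```

Because each normalized share lies in [0, 1], the score ranges from 0 (no POI contact relative to others) to 2 (the highest share in both directions).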

In [5]:
### Task 3: Create new feature(s)

# Create feature: email_poi_score
# email_poi_score is the sum of percent of message data from_poi and to_poi
def normalize_feature(feature, data_dict):
    # initialize high and low value for normalization function
    value_high = None
    value_low = None

    # loop through persons to find high and low values for features
    for person in data_dict:
        value = data_dict[person][feature]
        if value != 'NaN':
            # If first value in feature then assign value to variables
            if value_low is None:
                value_high = value
                value_low = value
            # look to see if value is higher or lower
            if value > value_high:
                value_high = value
            elif value < value_low:
                value_low = value

    # loop to assign normalization value
    for person in data_dict:
        value = float(data_dict[person][feature])
        # if value exists between high and low
        if (value_high >= value) and (value_low <= value):
            # if denominator isn't zero
            if value_high != value_low:
                value_norm = (value - value_low) / (value_high - value_low)
                data_dict[person][feature] = value_norm
            


# find percent emails sent to poi and percent from poi to this person
for person in data_dict:
    from_messages = data_dict[person]['from_messages']
    to_messages = data_dict[person]['to_messages']
    from_poi = data_dict[person]['from_poi_to_this_person']
    to_poi = data_dict[person]['from_this_person_to_poi']
    
    # Initialize all email_poi_score as 'NaN'
    data_dict[person]['email_poi_score'] = 'NaN'

    percent_to = float(to_poi) / float(from_messages)
    percent_from = float(from_poi) / float(to_messages)

    data_dict[person]['percent_to_poi'] = percent_to
    data_dict[person]['percent_from_poi'] = percent_from

# normalize percent_to_poi and percent_from_poi before adding them together
normalize_feature('percent_to_poi', data_dict)
normalize_feature('percent_from_poi', data_dict)

# add normalized percent_to_poi and percent_from_poi to create email_poi_score
for person in data_dict:
    percent_to_norm = data_dict[person]['percent_to_poi']
    percent_from_norm = data_dict[person]['percent_from_poi']

    email_poi_score = percent_to_norm + percent_from_norm
    if email_poi_score >= 0:
        data_dict[person]['email_poi_score'] = email_poi_score


# Normalize features, DON'T normalize poi (feature[0] is 'poi')
for feature in features_list[1:]:
    normalize_feature(feature, data_dict)

### Store to my_dataset for easy export below.
my_dataset = data_dict

In [6]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, all_features, sort_keys = True)
labels, features = targetFeatureSplit(data)

##### Hand-picked Feature Selection

Features were initially hand-picked with the help of a visual aid at https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard. Features were chosen based on visual clustering of POIs and non-POIs. 

Hand-picked features for determining POI:
* exercised_stock_options (high values ~ POI)
* deferred_income (low values ~ POI)
* expenses (low values ~ not POI)

In [7]:
features_handpicked = ['poi', 'exercised_stock_options', 'deferred_income', 'expenses']

##### SelectKBest Feature Selection

Features were also chosen using SelectKBest. Because the top features can change with the random train/test split, a tally was taken over 1000 trials to determine which features appear most often in the top 3.

SelectKBest features for determining POI:
* exercised_stock_options
* total_stock_value
* bonus
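
SelectKBest with `f_classif` ranks each feature by its one-way ANOVA F-value: the variance between the class means divided by the variance within the classes. The sketch below computes that statistic in plain Python on hypothetical toy values, purely to illustrate the ranking; `anova_f` is a helper introduced here, not scikit-learn's implementation.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value: between-group mean square over within-group mean square."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))
    n = len(values)
    k = len(groups)
    grand_mean = sum(float(v) for v in values) / n
    # Sum of squares between groups and within groups
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values())
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw


# Hypothetical toy data: 2 POIs (label 1) and 4 non-POIs (label 0)
toy_labels = [1, 1, 0, 0, 0, 0]
feature_a = [9, 11, 1, 2, 3, 2]   # separates the classes well
feature_b = [5, 4, 5, 4, 5, 4]    # identical class means

f_scores = {
    "feature_a": anova_f(feature_a, toy_labels),
    "feature_b": anova_f(feature_b, toy_labels),
}
ranking = sorted(f_scores, key=f_scores.get, reverse=True)
```

A feature whose values barely differ between POIs and non-POIs gets an F-value near zero and drops out of the top k.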


In [8]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Run loop to find how many times a feature occurs in the top 3
best_features = np.zeros(len(all_features) - 1, dtype=int)
for i in range(1000):
    # Create features and training labels
    labels, features = targetFeatureSplit(data)
    features_train, features_test, labels_train, labels_test = \
        train_test_split(features, labels, test_size=0.33)

    # Generate SelectKBest with k=3 features
    selector = SelectKBest(f_classif, k=3)
    selector.fit(features_train, labels_train)

    # Increase score of feature if it appears in the top 3
    best_features += selector.get_support().astype(int)

# Print the 3 features that appeared most often in the top 3
# (index + 1 skips 'poi', which is all_features[0])
print "Top 3 features: "
for index in np.argsort(best_features)[::-1][:3]:
    print all_features[index + 1]



Top 3 features: 
exercised_stock_options
total_stock_value
bonus


In [9]:
features_kbest = ['poi', 'exercised_stock_options', 'total_stock_value', 'bonus']

## Classifiers

To find the best-performing model, multiple classifiers were tested:
* Decision Tree
* Random Forest
* Extra Trees
* SVMs
* GaussianNB

Classifiers were compared on precision, recall, and F1. Precision is the percentage of people the classifier labeled as POIs who actually are POIs. Recall is the percentage of actual POIs that the classifier correctly identified. The F1 score combines precision and recall as their harmonic mean. F1 scores are calculated by this equation:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

Only classifiers with precision and recall scores greater than or equal to .33 are considered for the best overall classifier. Those that qualify are then ranked by F1 score, highest first.
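
As a worked example of these definitions, the sketch below derives all three scores from confusion-matrix counts and applies the .33 threshold. The counts are hypothetical, and `precision_recall_f1` is a helper introduced here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test fold: 2 POIs found, 3 false alarms, 4 POIs missed
p, r, f1 = precision_recall_f1(tp=2, fp=3, fn=4)

# Apply the selection rule: both scores must reach .33
meets_threshold = p >= 0.33 and r >= 0.33
```

Here precision is 2/5 and recall is 2/6, so this hypothetical classifier just clears the threshold with an F1 of 4/11.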

During initial testing I noticed very high variation in all scores; a classifier that yielded high precision and recall on one run could score poorly on the next. To counteract this variance, I wrote a test function that fits each classifier 1000 times on different splits and compares the average scores.
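
The averaging step can be sketched on a handful of hypothetical per-run scores. The helper names below are introduced only for this illustration. Note that the F score here is computed from the mean precision and mean recall, which is how the notebook's test function reports it; averaging per-run F scores would generally give a different number.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / float(len(values))


def std(values):
    """Population standard deviation, a rough measure of run-to-run variation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / float(len(values)))


def summarize(precisions, recalls):
    """Return mean precision, mean recall, and the F score of those two means."""
    p_mean, r_mean = mean(precisions), mean(recalls)
    f_score = 2 * p_mean * r_mean / (p_mean + r_mean) if (p_mean + r_mean) else 0.0
    return p_mean, r_mean, f_score


# Hypothetical scores from four runs of the same classifier
run_precisions = [0.5, 0.25, 0.0, 0.75]
run_recalls = [0.5, 0.5, 0.0, 1.0]

p_mean, r_mean, f_score = summarize(run_precisions, run_recalls)
p_spread = std(run_precisions)
```

With only a few POIs in each test split, a single run can swing from 0 to 1, which is why the spread is worth tracking alongside the mean.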


In [10]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

from time import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from pprint import pprint



# Create test to find average precision and recall scores
def test_prec_recall(name, clf_choice):
    precision_list = list()
    recall_list = list()
    for i in range(1000):
        ### Extract features and labels from dataset for local testing
        data = featureFormat(data_dict, features_list, sort_keys = True)
        # Create labels and features
        labels, features = targetFeatureSplit(data)

        # transform into np.array for StratifiedShuffleSplit
        features = np.array(features)
        labels = np.array(labels)

        # Shuffle and split data into one stratified training/testing set
        sss = StratifiedShuffleSplit(n_splits=1)
        train_index, test_index = next(sss.split(features, labels))
        features_train, features_test = features[train_index], features[test_index]
        labels_train, labels_test = labels[train_index], labels[test_index]

        # Create, fit, and predict classifier
        clf = clf_choice
        clf.fit(features_train, labels_train)
        labels_pred = clf.predict(features_test)

        try:
            precision = precision_score(labels_test, labels_pred)
            recall = recall_score(labels_test, labels_pred)
            precision_list.append(precision)
            recall_list.append(recall)
        except:
            pass
    
    # F score is calculated via the mean precision and recall scores
    p_score = np.mean(precision_list)
    r_score = np.mean(recall_list)
    f_score = 2 * (p_score * r_score) / (p_score + r_score)
    
    print "\n" + "#" * 60
    print " " * 20 + name + "\n"
    print "Precision Mean Score: ", p_score
    print "Recall Mean Score: ", r_score
    print "F Score: ", f_score
    print "\n" + "#" * 60    

## Classifiers (Hand-picked Features)

In [11]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_handpicked, sort_keys = True)
labels, features = targetFeatureSplit(data)

##### Classifiers: Decision Tree (Hand-picked)

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier

pipeline_dt = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', DecisionTreeClassifier(random_state = 49)),
])

param_grid_dt = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__min_samples_split': [2,3,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV
gridCV_object_dt = GridSearchCV(estimator = pipeline_dt,
                                param_grid = param_grid_dt,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_dt = gridCV_object_dt.fit(features_train, labels_train)



In [13]:
test_prec_recall("Decision Tree: Hand-picked", gridCV_object_dt.best_estimator_)




############################################################
                    Decision Tree: Hand-picked

Precision Mean Score:  0.391252380952
Recall Mean Score:  0.4
F Score:  0.395577836221

############################################################


##### Classifiers: Random Forest (Hand-picked)

In [14]:
from sklearn.ensemble import RandomForestClassifier

pipeline_rfhp = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', RandomForestClassifier(random_state = 49)),
])

param_grid_rfhp = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__n_estimators': [5,10,20],
                 'clf__min_samples_split': [2,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_rfhp = GridSearchCV(estimator = pipeline_rfhp,
                                param_grid = param_grid_rfhp,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_rfhp = gridCV_object_rfhp.fit(features_train, labels_train)

In [15]:
test_prec_recall("Random Forest: Hand-picked", gridCV_object_rfhp.best_estimator_)


############################################################
                    Random Forest: Hand-picked

Precision Mean Score:  0.38880952381
Recall Mean Score:  0.3205
F Score:  0.351365513074

############################################################


##### Classifiers: Extra Trees (Hand-picked)

In [16]:
from sklearn.ensemble import ExtraTreesClassifier

pipeline_ethp = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', ExtraTreesClassifier(random_state = 49)),
])

param_grid_ethp = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__n_estimators': [5,10,20],
                 'clf__min_samples_split': [2,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_ethp = GridSearchCV(estimator = pipeline_ethp,
                                param_grid = param_grid_ethp,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_ethp = gridCV_object_ethp.fit(features_train, labels_train)

In [17]:
test_prec_recall("Extra Tree: Hand-picked", gridCV_object_ethp.best_estimator_)


############################################################
                    Extra Tree: Hand-picked

Precision Mean Score:  0.331435714286
Recall Mean Score:  0.281
F Score:  0.304141099357

############################################################


##### Classifiers: SVMs (Hand-picked)

In [18]:
from sklearn.svm import SVC

pipeline_svchp = Pipeline([
    ('imp', Imputer()),
    ('clf', SVC())
])

param_grid_svchp = {'imp__strategy': ['median', 'mean'],
                  'clf__C': [10,50,100],
                  'clf__kernel': ['rbf', 'poly'],
                 }

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_svchp = GridSearchCV(estimator = pipeline_svchp,
                                 param_grid = param_grid_svchp,
                                 scoring = 'f1',
                                 cv = cross_validator)

gridCV_object_svchp = gridCV_object_svchp.fit(features_train, labels_train)

In [19]:
test_prec_recall("SVM: Hand-picked", gridCV_object_svchp.best_estimator_)


############################################################
                    SVM: Hand-picked

Precision Mean Score:  0.127
Recall Mean Score:  0.0635
F Score:  0.0846666666667

############################################################


##### Classifiers: Naive Bayes (Hand-picked)

In [20]:
##### Naive Bayes #####

from sklearn.naive_bayes import GaussianNB

pipeline_nbhp = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', GaussianNB())
])

param_grid_nbhp = {'pca__n_components': [1,2,3],
                 'imp__strategy': ['median', 'mean'],
                 }

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_nbhp = GridSearchCV(estimator = pipeline_nbhp,
                                 param_grid = param_grid_nbhp,
                                 scoring = 'f1',
                                 cv = cross_validator)

gridCV_object_nbhp = gridCV_object_nbhp.fit(features_train, labels_train)

In [21]:
test_prec_recall("Naive Bayes: Hand-picked", gridCV_object_nbhp.best_estimator_)


############################################################
                    Naive Bayes: Hand-picked

Precision Mean Score:  0.331316666667
Recall Mean Score:  0.2215
F Score:  0.265500829087

############################################################


### Classifiers: Hand-picked Summary

The Decision Tree classifier is the only one to meet the criterion of precision and recall scores of at least .33. By default, the **Decision Tree classifier** is the best of the **hand-picked** feature testing with an **F score of .3956**.

The Random Forest classifier should be noted here: its precision was comparable to the Decision Tree's, but its recall of .3205 fell just short of the threshold.
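
The selection rule can be applied mechanically to the mean scores printed above. The sketch below copies the hand-picked precision and recall means from the outputs, keeps classifiers with both scores at or above .33, and ranks the survivors by F score; the variable names are introduced for this illustration.

```python
# (mean precision, mean recall) for the hand-picked features, from the outputs above
results = {
    "Decision Tree": (0.391252380952, 0.4),
    "Random Forest": (0.38880952381, 0.3205),
    "Extra Trees": (0.331435714286, 0.281),
    "SVM": (0.127, 0.0635),
    "Naive Bayes": (0.331316666667, 0.2215),
}

THRESHOLD = 0.33


def f_score(precision, recall):
    """F score computed from a precision/recall pair."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Keep classifiers that meet the threshold on both metrics, then rank by F score
qualifying = [(name, f_score(p, r))
              for name, (p, r) in results.items()
              if p >= THRESHOLD and r >= THRESHOLD]
ranked = sorted(qualifying, key=lambda pair: pair[1], reverse=True)
```

Only the Decision Tree survives the filter; Random Forest, Extra Trees, and Naive Bayes pass on precision but fail on recall.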

## Classifiers (SelectKBest Features)

In [22]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_kbest, sort_keys = True)
labels, features = targetFeatureSplit(data)

##### Classifiers: Decision Tree (SelectKBest)

In [23]:
pipeline_dtsk = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', DecisionTreeClassifier(random_state = 49)),
])

param_grid_dtsk = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__min_samples_split': [2,3,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_dtsk = GridSearchCV(estimator = pipeline_dtsk,
                                param_grid = param_grid_dtsk,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_dtsk = gridCV_object_dtsk.fit(features_train, labels_train)

In [24]:
test_prec_recall("Decision Tree: SelectKBest", gridCV_object_dtsk.best_estimator_)


############################################################
                    Decision Tree: SelectKBest

Precision Mean Score:  0.397788095238
Recall Mean Score:  0.407
F Score:  0.40234132617

############################################################


##### Classifiers: Random Forest (SelectKBest)

In [25]:
pipeline_rfsk = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', RandomForestClassifier(random_state = 49)),
])

param_grid_rfsk = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__n_estimators': [5,10,20],
                 'clf__min_samples_split': [2,3,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_rfsk = GridSearchCV(estimator = pipeline_rfsk,
                                param_grid = param_grid_rfsk,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_rfsk = gridCV_object_rfsk.fit(features_train, labels_train)

In [26]:
test_prec_recall("Random Forest: SelectKBest", gridCV_object_rfsk.best_estimator_)


############################################################
                    Random Forest: SelectKBest

Precision Mean Score:  0.375326190476
Recall Mean Score:  0.304
F Score:  0.335918630856

############################################################


##### Classifiers: Extra Trees (SelectKBest)

In [27]:
pipeline_etsk = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', ExtraTreesClassifier(random_state = 49)),
])

param_grid_etsk = {'pca__n_components': [2,3],
                 'imp__strategy': ['median', 'mean'],
                 'clf__n_estimators': [5,10,20],
                 'clf__min_samples_split': [2,3,5],
                 'clf__max_depth': [None,2,3],
                 }        

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_etsk = GridSearchCV(estimator = pipeline_etsk,
                                param_grid = param_grid_etsk,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_etsk = gridCV_object_etsk.fit(features_train, labels_train)

In [28]:
test_prec_recall("Extra Tree: SelectKBest", gridCV_object_etsk.best_estimator_)


############################################################
                    Extra Tree: SelectKBest

Precision Mean Score:  0.321566666667
Recall Mean Score:  0.2755
F Score:  0.296756196963

############################################################


##### Classifiers: SVMs (SelectKBest)

In [29]:
pipeline_svcsk = Pipeline([
    ('imp', Imputer()),
    ('clf', SVC())
])

param_grid_svcsk = {'imp__strategy': ['median', 'mean'],
                  'clf__C': [10,50,100],
                  'clf__kernel': ['rbf', 'poly'],
                 }

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_svcsk = GridSearchCV(estimator = pipeline_svcsk,
                                 param_grid = param_grid_svcsk,
                                 scoring = 'f1',
                                 cv = cross_validator)

gridCV_object_svcsk = gridCV_object_svcsk.fit(features_train, labels_train)

In [30]:
test_prec_recall("SVM: SelectKBest", gridCV_object_svcsk.best_estimator_)


############################################################
                    SVM: SelectKBest

Precision Mean Score:  0.099
Recall Mean Score:  0.0495
F Score:  0.066

############################################################


##### Classifiers: Naive Bayes (SelectKBest)

In [31]:
pipeline_nbsk = Pipeline([
    ('pca', PCA()),
    ('imp', Imputer()),
    ('clf', GaussianNB())
])

param_grid_nbsk = {'pca__n_components': [1,2,3],
                 'imp__strategy': ['median', 'mean'],
                 }

cross_validator = StratifiedShuffleSplit(random_state = 0)

gridCV_object_nbsk = GridSearchCV(estimator = pipeline_nbsk,
                                 param_grid = param_grid_nbsk,
                                 scoring = 'f1',
                                 cv = cross_validator)

gridCV_object_nbsk = gridCV_object_nbsk.fit(features_train, labels_train)

In [32]:
test_prec_recall("Naive Bayes: SelectKBest", gridCV_object_nbsk.best_estimator_)


############################################################
                    Naive Bayes: SelectKBest

Precision Mean Score:  0.35775
Recall Mean Score:  0.2315
F Score:  0.28110012728

############################################################


### Classifiers: SelectKBest Summary

Again, the Decision Tree classifier is the only one to meet the criterion of precision and recall scores of at least .33. By default, the **Decision Tree** classifier is the best of the **SelectKBest** feature testing with an **F score of .4023**.

#### Classifiers: Hand-picked vs. SelectKBest Summary

The classifier chosen as best across both the hand-picked and SelectKBest features is the Decision Tree classifier. It maintains the highest precision, recall, and F1 scores. Although other classifiers may obtain a higher individual precision or recall score on some runs, none appears to be a better all-around choice than DecisionTree.

Although the two feature selection methods were very close, the SelectKBest features obtained a slightly higher F1 score. Again, the SelectKBest features are:
* exercised_stock_options
* total_stock_value
* bonus


### Parameters

Multiple attempts were made to tune the SVM classifier. Two kernels were tested ('rbf' and 'poly') along with C values ranging from 1 to 1000. Attempts to tune the SVM ceased when it appeared the classifier would never outperform either DecisionTree or GaussianNB. 

The DecisionTree parameter 'criterion' was tested with both 'gini' (default) and 'entropy'. Although the scores were close, 'entropy' introduced more deviation (results below) and was not used in the final calculation. 
~~~
############################################################
                    Decision Tree (criterion='entropy')

Precision Mean:  0.414058344433
Recall Mean:  0.383863888889
STD_sum:  0.46693555759
############################################################
~~~
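
For context on what the two criteria measure, the sketch below computes Gini impurity and entropy for a single node. It assumes a root node holding 18 POIs and 126 non-POIs, the counts implied by the exploration above once the two non-person entries are removed; the function names are introduced here for illustration.

```python
import math


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c)


# Assumed root node: 18 POIs and 126 non-POIs
root = [18, 126]
root_gini = gini(root)
root_entropy = entropy(root)
```

Both measures are zero for a pure node and largest for an even class mix, so they usually prefer similar splits, which is consistent with the close scores reported above.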

In [34]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33)

### Validation

Validation measures how well a trained classifier performs on data it has not seen.

Validation is usually performed by splitting the data into training and testing sets. Without held-out test data, an overfit classifier can look better than it is. The split ratio also matters: a training set that is too small can underfit the classifier, while a testing set that is too small makes the scores unreliable.

The Enron dataset was split into training (67%) and testing (33%) sets via the function train_test_split with 'test_size = .33'. The grid search and the scoring function additionally use StratifiedShuffleSplit, which repeatedly draws random training/testing partitions that preserve the ratio of POIs to non-POIs.
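
Because POIs are rare, a purely random split can leave the test set with almost no POIs. A stratified split avoids this by splitting each class separately. The sketch below illustrates the idea in plain Python; it is not scikit-learn's StratifiedShuffleSplit, and the function name, seed, and label counts (18 POIs, 126 non-POIs) are assumptions made for this example.

```python
import random


def stratified_split(labels, test_size=0.33, seed=0):
    """Return (train_indices, test_indices) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Hold out the same fraction of every class
        n_test = int(round(len(shuffled) * test_size))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Assumed labels: 18 POIs (1) and 126 non-POIs (0)
toy_labels = [1] * 18 + [0] * 126
train_idx, test_idx = stratified_split(toy_labels)
test_poi = sum(toy_labels[i] for i in test_idx)
```

Calling the function with different seeds gives different partitions with the same class balance, which is the property repeated shuffle splits rely on.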

### Summary

The Enron Corpus is one of the best-known public datasets connected to a fraud case. Although the project dataset is small, a DecisionTree classifier appears to be the strongest option for predicting POIs, followed by Random Forest, whose recall fell just below the threshold. Additional or different features that raise precision and recall are desired; the current scores are mediocre, but the classifier has potential as a starting point for detecting fraud.  

#### References

Feature Visualization:
https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard

Forum Postings:
https://discussions.udacity.com/t/getting-started-with-final-project/170846

When to choose which machine learning classifier?
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier

In [35]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(gridCV_object_dtsk.best_estimator_, my_dataset, features_kbest)