# Enron Project Fraud Classifier

### Goal: Use Machine Learning to Assess Fraud from Financial and Email Data 

Many companies keep datasets of their financial and email data. These datasets are very large, and it is extremely difficult to sift through all of the data by hand. Having a machine help flag a potentially fraudulent person of interest (POI) would save both time and money.

The email dataset used is the Enron Corpus, which can be found at https://www.cs.cmu.edu/~./enron/. Enron was one of the top energy companies in the US in the early 2000s. The company filed for bankruptcy in 2001, largely due to insider trading and accounting fraud. It remains one of the most noteworthy cases of corporate fraud. Because of the scale of the company and the size of the email database, the Enron Corpus is a prime dataset for finding clues of fraud in financial and email data.

In [None]:
# !/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

### Explore Data and Find Important Features

Initial exploration of the dataset revealed a couple of outliers. Most entries are people's names, such as 'SKILLING JEFFREY K' or 'LAY KENNETH L'; however, both 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' were listed as if they were people who worked for Enron. Because neither is an actual person, I removed them from the dataset.

Features were hand-picked with the help of a visual aid at https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard. Further inspection of the data could be done with built-in feature-selection functions to find the best-fitting features.
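One built-in option for that kind of inspection is scikit-learn's `SelectKBest`, which scores each feature against the label and keeps the top `k`. The sketch below is an illustration only: it runs on a small synthetic array rather than the Enron data, and the feature names, `k=2`, and the `f_classif` scoring function are assumptions, not the selection method used in this project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (NOT the Enron dataset): 40 people, 3 features.
rng = np.random.RandomState(0)
toy_labels = np.array([0] * 30 + [1] * 10)
toy_names = ["informative_a", "noise", "informative_b"]  # hypothetical names
toy_features = np.column_stack([
    toy_labels * 5.0 + rng.normal(size=40),  # strongly related to the label
    rng.normal(size=40),                     # unrelated noise
    toy_labels * 3.0 + rng.normal(size=40),  # moderately related to the label
])

# Score every feature with the ANOVA F-test and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(toy_features, toy_labels)

# Pair each feature name with its score, highest first.
ranked = sorted(zip(toy_names, selector.scores_), key=lambda pair: -pair[1])
selected = [name for name, keep in zip(toy_names, selector.get_support()) if keep]
print(ranked)
print(selected)
```

On the real data the same idea would apply to the array returned by `featureFormat`, with the names taken from `all_features[1:]`.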

Important Features picked for determining POI:
* exercised_stock_options (high values ~ POI)
* deferred_income (low values ~ POI)
* expenses (low values ~ not POI)

Three additional features were calculated but were found to be of minimal importance: percent_to_poi, percent_from_poi, and email_poi_score. The idea was to assign each person a score reflecting how much email contact they had with POIs. The score was calculated by summing the normalized fraction of a person's sent emails that went _to_ a POI and the normalized fraction of their received emails that came _from_ a POI. This yielded no significant gain, so these were discarded as features.
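As a small worked example of the two fractions behind the score, the sketch below computes them for a single made-up record. The helper name `poi_email_fractions` and the example numbers are hypothetical; the field names are the ones used in the dataset. The full email_poi_score additionally min-max normalizes each fraction across all people before summing, which this sketch leaves out.

```python
def poi_email_fractions(person):
    """Return (fraction of sent mail going to POIs,
    fraction of received mail coming from POIs), or None if data is missing."""
    keys = ("from_messages", "to_messages",
            "from_poi_to_this_person", "from_this_person_to_poi")
    # The dataset marks missing values with the string 'NaN'.
    if any(person[key] == "NaN" for key in keys):
        return None
    # Avoid dividing by zero for people with no recorded mail.
    if person["from_messages"] == 0 or person["to_messages"] == 0:
        return None
    percent_to = float(person["from_this_person_to_poi"]) / person["from_messages"]
    percent_from = float(person["from_poi_to_this_person"]) / person["to_messages"]
    return percent_to, percent_from

# Made-up record for illustration (not a real Enron entry).
example = {"from_messages": 200, "to_messages": 1000,
           "from_this_person_to_poi": 50, "from_poi_to_this_person": 100}
percent_to, percent_from = poi_email_fractions(example)
print(percent_to)    # 50 of 200 sent emails went to a POI
print(percent_from)  # 100 of 1000 received emails came from a POI
```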

In [14]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
all_features = ['poi', 'salary', 'deferral_payments', 'total_payments',
'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income',
'total_stock_value', 'expenses', 'exercised_stock_options', 'other',
'long_term_incentive', 'restricted_stock', 'director_fees', 'to_messages',
'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi',
'shared_receipt_with_poi']

current_max_features_list = ["poi", "exercised_stock_options",
"deferred_income", "expenses"]

features_list = ["poi", "exercised_stock_options", "deferred_income",
 "expenses"]

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [15]:
### Task 2: Remove outliers
del data_dict['TOTAL']
del data_dict['THE TRAVEL AGENCY IN THE PARK']

In [17]:
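import numpy as np
import pandas as pd

# Load a fresh copy of the raw data into its own variable so the cleaned
# data_dict above is not overwritten. The summary below therefore still
# includes the 'TOTAL' row (see the max values).
raw_dict = pickle.load(open("final_project_dataset.pkl", "r") )
### creating DataFrame from dictionary - pandas
df = pd.DataFrame.from_dict(raw_dict, orient='index', dtype=np.float)
print df.describe().loc[:,['salary','bonus']]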

             salary         bonus
count  9.500000e+01  8.200000e+01
mean   5.621943e+05  2.374235e+06
std    2.716369e+06  1.071333e+07
min    4.770000e+02  7.000000e+04
25%    2.118160e+05  4.312500e+05
50%    2.599960e+05  7.693750e+05
75%    3.121170e+05  1.200000e+06
max    2.670423e+07  9.734362e+07


In [4]:
### Task 3: Create new feature(s)

# Create feature: email_poi_score
# email_poi_score is the sum of normalized message data for from_poi and to_poi
import math

def normalize_feature(feature, data_dict):
    # Min-max scale one feature in place; missing values are left untouched.

    # collect the usable values, skipping 'NaN' strings and float nan
    values = []
    for person in data_dict:
        value = float(data_dict[person][feature])  # 'NaN' becomes float nan
        if not math.isnan(value):
            values.append(value)

    # nothing to scale if there is no data or every value is identical
    if not values or max(values) == min(values):
        return
    value_low = min(values)
    value_high = max(values)

    # loop to assign normalization value
    for person in data_dict:
        value = float(data_dict[person][feature])
        if not math.isnan(value):
            value_norm = (value - value_low) / (value_high - value_low)
            data_dict[person][feature] = value_norm



# find percent emails sent to poi and percent from poi to this person
for person in data_dict:
    from_messages = data_dict[person]['from_messages']
    to_messages = data_dict[person]['to_messages']
    from_poi = data_dict[person]['from_poi_to_this_person']
    to_poi = data_dict[person]['from_this_person_to_poi']

    data_dict[person]['email_poi_score'] = 'NaN'

    percent_to = float(to_poi) / float(from_messages)
    percent_from = float(from_poi) / float(to_messages)

    data_dict[person]['percent_to_poi'] = percent_to
    data_dict[person]['percent_from_poi'] = percent_from

# normalize percent_to_poi and percent_from_poi and add together
normalize_feature('percent_to_poi', data_dict)
normalize_feature('percent_from_poi', data_dict)

# add normalized percent_to_poi and percent_from_poi to create email_poi_score
for person in data_dict:
    percent_to_norm = data_dict[person]['percent_to_poi']
    percent_from_norm = data_dict[person]['percent_from_poi']

    email_poi_score = percent_to_norm + percent_from_norm
    if email_poi_score >= 0:
        data_dict[person]['email_poi_score'] = email_poi_score


# Normalize features, DON'T normalize poi (features_list[0] is 'poi')
for feature in features_list[1:]:
    normalize_feature(feature, data_dict)


### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

### Classifiers

Data was tested with multiple classifiers, each in separate trials on **raw** and **normalized** data. I noticed very high variation when testing; some classifiers yielded high precision and recall scores, but when retested both scores occasionally dropped to zero. To counteract this variance, I wrote a few functions to generate 1000 classifiers of each type and compare their average scores.

Because SVM 'poly' took an extraordinary amount of time on raw data and never finished, **only** a **normalized** data classifier was tested for SVM 'poly'.
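One way to make sure an SVM always sees scaled data is to put the scaler and the classifier in a single scikit-learn `Pipeline`, so the scaling is learned from whatever data the pipeline is fitted on. The sketch below is a minimal illustration on synthetic data; the data, `C=10`, and the variable names are assumptions and it does not reproduce the runs reported here.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the Enron dataset).
rng = np.random.RandomState(0)
toy_labels = np.array([0] * 40 + [1] * 20)
toy_features = np.column_stack([
    toy_labels * 2e6 + rng.normal(scale=5e5, size=60),  # dollar-like scale
    rng.uniform(size=60),                               # already in 0-1
])

# MinMaxScaler rescales every feature to [0, 1] before the SVM sees it.
svm_pipeline = Pipeline(steps=[
    ("minmaxscaler", MinMaxScaler()),
    ("clf", SVC(kernel="rbf", C=10)),
])
svm_pipeline.fit(toy_features, toy_labels)
train_accuracy = svm_pipeline.score(toy_features, toy_labels)
print(train_accuracy)
```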

Output Data from Classifiers (normalized data only):
~~~
Note:  STD_sum is the sum of individual standard deviation scores (for simplification)
############################################################
                    DecisionTree

Precision Mean:  0.386631629482
Recall Mean:  0.381794300144
STD_sum:  0.426737950995
############################################################
                    RandomForest

Precision Mean:  0.381161507937
Recall Mean:  0.197606565657
STD_sum:  0.505336869258
############################################################
                    ExtraTree

Precision Mean:  0.4745251443
Recall Mean:  0.253108910534
STD_sum:  0.514709115944
############################################################
                    SVM 'rbf'

Precision Mean:  0.4215
Recall Mean:  0.0818132395382
STD_sum:  0.580025182133
############################################################
                    SVM 'poly'

Precision Mean:  0.024
Recall Mean:  0.00459642857143
STD_sum:  0.183943409806
############################################################
                    GaussianNB

Precision Mean:  0.522593650794
Recall Mean:  0.258691630592
STD_sum:  0.482226413296
############################################################
~~~

The classifier chosen as best is DecisionTree, followed somewhat closely by GaussianNB. Precision is the fraction of people the classifier labeled as POIs who really are POIs. Recall is the fraction of actual POIs that the classifier correctly identified. DecisionTree is the only classifier whose mean precision and mean recall both exceed 0.3, and apart from SVM 'poly', which barely identifies any POIs, it has the lowest STD_sum. Although other classifiers score higher on precision alone (GaussianNB, ExtraTree, and SVM 'rbf'), none appears to be a better all-around choice than DecisionTree.
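To make the two definitions concrete, here is a small worked example on made-up labels (1 = POI, 0 = not POI). The helper `precision_recall` is a hypothetical illustration written for this example; the project itself uses scikit-learn's `precision_score` and `recall_score`.

```python
def precision_recall(true_labels, predicted_labels):
    """Compute precision and recall for the positive class (1 = POI)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Fall back to 0.0 when nobody is flagged or there are no real POIs.
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = float(true_pos) / flagged if flagged else 0.0
    recall = float(true_pos) / actual if actual else 0.0
    return precision, recall

# Made-up labels: 4 real POIs among 10 people; the classifier flags 3 people.
true_labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(true_labels, predicted_labels)
print(precision)  # 2 of the 3 flagged people are real POIs
print(recall)     # 2 of the 4 real POIs were found
```

A test split in which the classifier flags nobody gives a precision and recall of zero, which is one reason single trials on such a small dataset can swing so widely.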

In [6]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

from time import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

def test_prec_recall(name, clf_choice, n_trials = 100):
    # Fit clf_choice on n_trials stratified shuffle splits and print the
    # mean precision/recall plus the sum of their standard deviations.
    precision_list = list()
    recall_list = list()

    ### Extract features and labels from dataset for local testing
    data = featureFormat(data_dict, features_list, sort_keys = True)
    # Create labels and features
    labels, features = targetFeatureSplit(data)

    # transform into np.array for StratifiedShuffleSplit
    features = np.array(features)
    labels = np.array(labels)

    # Shuffle and split data into training/testing sets, once per trial
    sss = StratifiedShuffleSplit(n_splits = n_trials)
    for train_index, test_index in sss.split(features, labels):
        features_train, features_test = features[train_index], features[test_index]
        labels_train, labels_test = labels[train_index], labels[test_index]

        # Fit and predict with the classifier (refit from scratch each trial)
        clf = clf_choice
        clf.fit(features_train, labels_train)
        labels_pred = clf.predict(features_test)

        # precision_score returns 0 (with a warning) when no POI is predicted
        precision_list.append(precision_score(labels_test, labels_pred))
        recall_list.append(recall_score(labels_test, labels_pred))

    print "\n" + "#" * 60
    print " " * 20 + name + "\n"
    print "Precision Mean: ", np.mean(precision_list)
    print "Recall Mean: ", np.mean(recall_list)
    print "STD_sum: ", np.std(precision_list) + np.std(recall_list)
    print "\n" + "#" * 60

##### Decision Tree #####
# Seems to work best with specific selected features
from sklearn.tree import DecisionTreeClassifier
test_prec_recall("Decision Tree", DecisionTreeClassifier())

# Baseline tree for comparison with the grid search below.
# features_train/labels_train come from the train_test_split in Task 5.
clf = DecisionTreeClassifier(random_state = 49)
clf = clf.fit(features_train, labels_train)



from pprint import pprint
### Grid Search ###
from sklearn.model_selection import GridSearchCV
#create a dictionary with all the parameters we want to search through
param_grid = {'min_samples_split': [2, 5, 10, 15, 20, 25, 30],
              'max_depth': [4, 5, 6, 7, 8]}

cross_validator = StratifiedShuffleSplit(random_state = 0)

#create GridSearchCV object using param_grid
gridCV_object = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 49),
                                         param_grid = param_grid,
                                         scoring = 'f1',
                                         cv=cross_validator)

#fit to the data
gridCV_object.fit(features_train, labels_train)

#what were the best parameters chosen from the parameter grid?
print "Best parameters from parameter grid:"
pprint(gridCV_object.best_params_)

#get the best estimator
clf_f1 = gridCV_object.best_estimator_

#get best parameters for the best estimator.
print "\nComplete set of parameters for best estimator:"
pprint(clf_f1.get_params())

#check scores (f1_score expects the true labels first, then the predictions)
from sklearn.metrics import f1_score
print "\nF1 score for default decision tree:"
predictions_1 = clf.predict(features_test)
print f1_score(labels_test, predictions_1)

print "\nF1 score for best estimator from grid search"
predictions_2 = clf_f1.predict(features_test)
print f1_score(labels_test, predictions_2)



'''##### Random Forest #####
# Does okay...
from sklearn.ensemble import RandomForestClassifier
test_prec_recall("Random Forest", RandomForestClassifier())

##### Extra Trees #####
# Current best if using all available features, but still just okay..
from sklearn.ensemble import ExtraTreesClassifier
test_prec_recall("Extra Tree", ExtraTreesClassifier())

##### SVMs #####
# So far linear, poly, and rbf SVMs are pretty bad at predicting before normalization
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline(steps=[('minmaxscaler', MinMaxScaler()),('clf', SVC())])

labels, features = targetFeatureSplit(data)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33)

pipe.fit(features_train, labels_train)

gridCV



##### Naive Bayes #####
# Naive Bayes never predicts true positive, but can predict true negative.
from sklearn.naive_bayes import GaussianNB
test_prec_recall("Naive Bayes", GaussianNB())

###### Best clf appears to be Decision Tree;
###### precision and recall mean > .3
###### generally lowest sum of precision and recall standard deviations
clf = DecisionTreeClassifier()

'''




############################################################
                    Decision Tree

Precision Mean:  0.476857142857
Recall Mean:  0.44
STD_sum:  0.738255806836

############################################################




Best parameters from parameter grid:
{'max_depth': 5, 'min_samples_split': 2}

Complete set of parameters for best estimator:
{'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_split': 1e-07,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': 49,
 'splitter': 'best'}

F1 score for default decision tree:
0.533333333333

F1 score for best estimator from grid search
0.444444444444



### Parameters

Multiple attempts were made to tune the SVM classifier. Two kernels were tested ('rbf' and 'poly') along with C values ranging from 1 to 1000. Attempts to tune the SVM stopped when it appeared the classifier would never outperform either DecisionTree or GaussianNB.

The DecisionTree parameter 'criterion' was tested with both 'gini' (default) and 'entropy'. Although the scores were close, 'entropy' introduced more deviation (results below) and was not used in the final classifier.
~~~
############################################################
                    Decision Tree (criterion='entropy')

Precision Mean:  0.414058344433
Recall Mean:  0.383863888889
STD_sum:  0.46693555759
############################################################
~~~
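A comparison like this can be sketched with `cross_val_score` over repeated stratified splits, recording the mean and standard deviation for each criterion. The example below uses synthetic data and assumed settings (`n_splits=50`, `test_size=0.3`, F1 scoring), so its numbers say nothing about the Enron results above; it only shows the shape of the comparison.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the Enron dataset).
rng = np.random.RandomState(0)
toy_labels = np.array([0] * 80 + [1] * 20)
toy_features = np.column_stack([
    toy_labels * 2.0 + rng.normal(size=100),  # informative feature
    rng.normal(size=100),                     # noise feature
])

# Seeding the splitter gives both criteria the same 50 splits.
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
results = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=49)
    scores = cross_val_score(tree, toy_features, toy_labels,
                             scoring="f1", cv=splitter)
    results[criterion] = (scores.mean(), scores.std())
print(results)
```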

In [5]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33)

### Validation

Validation means estimating how well the classifier generalizes to data it has not seen, usually by splitting the data into training and testing sets. Scoring a classifier on the same data it was trained on hides overfitting. If too little data is left for testing, the scores become unreliable, whereas too little training data can underfit the classifier and cause low testing results. The Enron data was split into training (67%) and testing (33%) sets via the function train_test_split with 'test_size = .33'.
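Because POIs are rare, a plain random split can leave very few of them in the test set. The sketch below shows the same `train_test_split` call with the `stratify` argument, which keeps the class ratio roughly equal in both parts. The synthetic labels, `random_state=42`, and the `toy_` variable names are assumptions for illustration; the project's own split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 people, 12 of them POIs (NOT the Enron dataset).
toy_labels = np.array([0] * 88 + [1] * 12)
toy_features = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_labels,
    test_size=0.33,        # 33% of the rows go to the test set
    stratify=toy_labels,   # keep the POI ratio similar in both parts
    random_state=42)

print(len(toy_y_train), len(toy_y_test))            # rows in train vs. test
print(int(toy_y_train.sum()), int(toy_y_test.sum()))  # POIs in train vs. test
```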

### Summary

The Enron Corpus is one of the largest publicly available datasets connected to a fraud case. Although the labeled dataset used here is small, a DecisionTree classifier appears to be a promising option for predicting fraud, followed closely by GaussianNB. Additional or different features are needed to reach higher precision and recall; the current scores are mediocre, but the classifier could serve as a starting point for detecting fraud.

#### References

Feature Visualization:
https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard

Forum Postings:
https://discussions.udacity.com/t/getting-started-with-final-project/170846

When to choose which machine learning classifier?
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier

In [None]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)