# Enron Project Fraud Classifier

## Goal: Use Machine Learning to Assess Fraud from Financial and Email Data 

Assessing fraud at any company is a tedious task that requires analysis of a vast amount of data. Fortunately, most companies already have access to the data they need to combat fraud: emails and financial records. However, these datasets are very large, and sifting through all of the data by hand is extremely difficult. Having a machine help identify a potentially fraudulent person of interest (POI) would save both time and money.

The email dataset tested is the Enron Corpus, which can be found at https://www.cs.cmu.edu/~./enron/. Enron was one of the top energy companies in the US in the early 2000s. The company eventually filed for bankruptcy, largely due to fraud involving insider trading and accounting scandals. It is one of the most noteworthy cases of corporate fraud of the early 2000s. Because of the scale of the company and the size of the email database, the Enron Corpus is a prime dataset for finding clues of fraud via financials and email.

In [23]:
#!/usr/bin/python

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from pprint import pprint
import numpy as np

## Explore Data and Find Important Features

In [24]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".

all_features = ['poi', 'salary', 'deferral_payments', 'total_payments',
'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income',
'total_stock_value', 'expenses', 'exercised_stock_options', 'other',
'long_term_incentive', 'restricted_stock', 'director_fees', 'to_messages',
'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi',
'shared_receipt_with_poi', 'email_poi_score']

features_list = ['poi', 'exercised_stock_options',
'deferred_income', 'expenses']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [25]:
# Find number of data points and total POI
poi_list = list()

for name in data_dict:
    if data_dict[name]['poi']:
        poi_list.append(name)

print "Total Data Points:", len(data_dict)
print "Total POIs:", len(poi_list)

Total Data Points: 146
Total POIs: 18


Initial exploration of the dataset revealed 146 data points, 18 of which are POIs. Every data point consists of the features listed above, such as salary, total_payments, to_messages, and expenses. Many of the features contain NaN values; these were either removed or replaced with median/mean values when running calculations.
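To see how widespread the missing values are, the NaN entries can be counted per feature. The following is a minimal sketch on made-up records; it assumes only that missing values are stored as the string 'NaN', as the code later in this notebook does. The names `toy_data` and `count_missing` are hypothetical.

```python
# Toy records shaped like data_dict; the names and values are made up.
toy_data = {
    'PERSON A': {'poi': False, 'salary': 200000, 'bonus': 'NaN'},
    'PERSON B': {'poi': True, 'salary': 'NaN', 'bonus': 'NaN'},
    'PERSON C': {'poi': False, 'salary': 150000, 'bonus': 50000},
}


def count_missing(data):
    """Return {feature: number of records whose value is the string 'NaN'}."""
    missing = {}
    for record in data.values():
        for feature, value in record.items():
            # Make sure every feature appears, even with zero missing values
            missing.setdefault(feature, 0)
            if value == 'NaN':
                missing[feature] += 1
    return missing


missing_counts = count_missing(toy_data)
for feature in sorted(missing_counts):
    print("{}: {} missing".format(feature, missing_counts[feature]))
```

Features with a high share of missing values are weaker candidates for selection, since imputed values carry little information.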

Digging into the dataset revealed that not all entries were people's names. Most keys look like 'SKILLING JEFFREY K' or 'LAY KENNETH L'; however, both 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' were listed as if they were actual people who worked for Enron. Because neither is a person, I removed them from the calculations.

In [26]:
### Task 2: Remove outliers
del data_dict['TOTAL']
del data_dict['THE TRAVEL AGENCY IN THE PARK']
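The notebook does not say how these two entries were spotted. One possible way to surface an aggregate row such as 'TOTAL' is to list the largest values of a financial feature, since a sum of all rows dwarfs any individual. The sketch below uses made-up records, and `largest_entries` is a hypothetical helper rather than part of the project code.

```python
# Made-up records; 'TOTAL' is the sum of the individual salaries.
toy_data = {
    'PERSON A': {'salary': 200000},
    'PERSON B': {'salary': 'NaN'},
    'PERSON C': {'salary': 150000},
    'TOTAL': {'salary': 350000},
}


def largest_entries(data, feature, top_n=1):
    """Return the top_n (name, value) pairs for a feature, skipping 'NaN'."""
    present = [(name, record[feature]) for name, record in data.items()
               if record[feature] != 'NaN']
    return sorted(present, key=lambda pair: pair[1], reverse=True)[:top_n]


print(largest_entries(toy_data, 'salary'))
```

An entry like 'THE TRAVEL AGENCY IN THE PARK' would not stand out this way; reading through the key names is still needed for cases like that.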

After removing the outliers, three additional features were calculated: percent_to_poi, percent_from_poi, and email_poi_score. All three were later found to be of minimal importance.

The idea was to assign each person a score reflecting how much they were in contact with POIs. The score is the sum of two normalized fractions: the share of a person's sent emails that went to a POI, and the share of their received emails that came from a POI. This yielded no significant gain, so the score was discarded as a feature.
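The score described above can be illustrated on a few made-up email counts. This is a simplified sketch, not the notebook's implementation: `min_max` and `toy_counts` are hypothetical, missing values are not handled, and a feature whose values are all equal is mapped to zeros here.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; return zeros if all are equal."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / float(high - low) for v in values]


# Hypothetical counts per person:
# (sent to POI, total sent, received from POI, total received)
toy_counts = {
    'PERSON A': (10, 100, 5, 50),
    'PERSON B': (0, 40, 0, 200),
    'PERSON C': (30, 60, 20, 100),
}

names = sorted(toy_counts)
percent_to = [toy_counts[n][0] / float(toy_counts[n][1]) for n in names]
percent_from = [toy_counts[n][2] / float(toy_counts[n][3]) for n in names]

# email_poi_score = normalized percent_to + normalized percent_from
scores = {}
for name, to_norm, from_norm in zip(names, min_max(percent_to),
                                    min_max(percent_from)):
    scores[name] = to_norm + from_norm

for name in names:
    print("{}: {:.2f}".format(name, scores[name]))
```

Because each normalized fraction lies in [0, 1], the score ranges from 0 (least contact with POIs in the sample) to 2 (most contact in both directions).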

In [27]:
### Task 3: Create new feature(s)

# Create feature: email_poi_score
# email_poi_score is the sum of percent of message data from_poi and to_poi
def normalize_feature(feature, data_dict):
    # initialize high and low value for normalization function
    value_high = None
    value_low = None

    # loop through persons to find high and low values for features
    for person in data_dict:
        value = data_dict[person][feature]
        # skip missing values, stored either as the string 'NaN' or as float nan
        if not np.isnan(float(value)):
            # If first value in feature then assign value to variables
            if value_low is None:
                value_high = value
                value_low = value
            # look to see if value is higher or lower
            if value > value_high:
                value_high = value
            elif value < value_low:
                value_low = value

    # loop to assign normalization value
    for person in data_dict:
        value = float(data_dict[person][feature])
        # if value exists between high and low
        if (value_high >= value) and (value_low <= value):
            # if denominator isn't zero
            if value_high != value_low:
                value_norm = (value - value_low) / (value_high - value_low)
                data_dict[person][feature] = value_norm



# find percent emails sent to poi and percent from poi to this person
for person in data_dict:
    from_messages = data_dict[person]['from_messages']
    to_messages = data_dict[person]['to_messages']
    from_poi = data_dict[person]['from_poi_to_this_person']
    to_poi = data_dict[person]['from_this_person_to_poi']
    
    # Initialize all email_poi_score as 'NaN'
    data_dict[person]['email_poi_score'] = 'NaN'

    percent_to = float(to_poi) / float(from_messages)
    percent_from = float(from_poi) / float(to_messages)

    data_dict[person]['percent_to_poi'] = percent_to
    data_dict[person]['percent_from_poi'] = percent_from

# normalize percent_to_poi and percent_from_poi and add together
normalize_feature('percent_to_poi', data_dict)
normalize_feature('percent_from_poi', data_dict)

# add normalized percent_to_poi and percent_from_poi to create email_poi_score
for person in data_dict:
    percent_to_norm = data_dict[person]['percent_to_poi']
    percent_from_norm = data_dict[person]['percent_from_poi']

    email_poi_score = percent_to_norm + percent_from_norm
    if email_poi_score >= 0:
        data_dict[person]['email_poi_score'] = email_poi_score


# Normalize features, DON'T normalize poi (feature[0] is 'poi')
for feature in features_list[1:]:
    normalize_feature(feature, data_dict)

### Store to my_dataset for easy export below.
my_dataset = data_dict

In [28]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, all_features, sort_keys = True)
labels, features = targetFeatureSplit(data)

##### Hand-picked Feature Selection

Features were first hand-picked with the help of a visual aid at https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard. Features were chosen based on how visibly POIs and non-POIs clustered apart.

Hand-picked features for determining POI:
* exercised_stock_options (high values ~ POI)
* deferred_income (low values ~ POI)
* expenses (low values ~ not POI)

##### SelectKBest Feature Selection

Features were also chosen using SelectKBest, scored with f_classif (the ANOVA F-value) and limited to the top three features (k=3).

In [29]:
from sklearn.feature_selection import SelectKBest, f_classif
# sklearn.cross_validation is deprecated; model_selection matches the later cells
from sklearn.model_selection import train_test_split

# Create features and training labels
labels, features = targetFeatureSplit(data)
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.33)

# Generate SelectKBest
selector = SelectKBest(f_classif, k=3)
selector.fit(features_train, labels_train)

# Generate top features list
selectkbest_features = list()
for i in range(len(selector.get_support())):
    if selector.get_support()[i]:
        selectkbest_features.append(all_features[i+1])

# Display SelectKBest features
print "SelectKBest Features:\n", selectkbest_features

SelectKBest Features:
['total_stock_value', 'expenses', 'exercised_stock_options']
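SelectKBest with f_classif ranks each feature by its one-way ANOVA F-value: the variance between the class means divided by the variance within the classes. The sketch below computes that statistic by hand for a single feature on made-up values, to show why a feature that separates POIs from non-POIs scores high. `anova_f` is a hypothetical helper, not a scikit-learn function.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature across label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))

    n_total = len(values)
    n_groups = len(groups)
    grand_mean = sum(float(v) for v in values) / n_total

    between = 0.0  # spread of the group means around the grand mean
    within = 0.0   # spread of the samples around their own group mean
    for members in groups.values():
        group_mean = sum(members) / len(members)
        between += len(members) * (group_mean - grand_mean) ** 2
        within += sum((m - group_mean) ** 2 for m in members)

    return (between / (n_groups - 1)) / (within / (n_total - n_groups))


toy_labels = [1, 1, 1, 0, 0, 0]
separating = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0]   # classes far apart
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]    # classes share a mean

print("separating feature F = {:.1f}".format(anova_f(separating, toy_labels)))
print("overlapping feature F = {:.1f}".format(anova_f(overlapping, toy_labels)))
```

The F-value only captures differences in class means for one feature at a time, so it can miss features that are useful in combination.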


## Classifiers

Data was tested with multiple classifiers, with individual trials on **raw** and **normalized** data. I noticed very high variation when testing: some classifiers yielded high precision and recall scores, but when retested both scores occasionally dropped to zero. To counteract this variance, I wrote a few functions to generate 1000 classifiers of each type and compare their average scores (the cell below is set to 100 trials).

Because SVM 'poly' took an inordinate amount of time and never finished with raw data, **only** a **normalized** data classifier was tested for SVM 'poly'.
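The averaging idea can be shown in isolation: collect one precision and one recall value per trial, then report their means and the sum of their standard deviations, which is the "STD_sum" figure printed in the results. The sketch below uses made-up per-trial scores, and the helper names are hypothetical. The standard deviation uses divisor N, which matches the default of np.std.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def population_std(values):
    """Standard deviation with divisor N (the np.std default)."""
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / float(len(values)))


def summarize_trials(precisions, recalls):
    """Average per-trial scores and add up their spreads."""
    return {
        'precision_mean': mean(precisions),
        'recall_mean': mean(recalls),
        'std_sum': population_std(precisions) + population_std(recalls),
    }


# Made-up scores from four trials; note how far single trials stray
toy_precisions = [0.0, 1.0, 0.5, 0.5]
toy_recalls = [0.0, 0.5, 0.5, 1.0]

summary = summarize_trials(toy_precisions, toy_recalls)
for key in sorted(summary):
    print("{}: {:.3f}".format(key, summary[key]))
```

With only a handful of POIs in each test set, a single trial can land anywhere between 0 and 1, which is why the mean over many trials is the more trustworthy number.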

In [30]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

from time import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from pprint import pprint



# Create test to find average precision and recall scores
def test_prec_recall(name, clf_choice):
    precision_list = list()
    recall_list = list()
    f1_list = list()
    for i in range(100):
        ### Extract features and labels from dataset for local testing
        data = featureFormat(data_dict, features_list, sort_keys = True)
        # Create labels and features
        labels, features = targetFeatureSplit(data)

        # transform into np.array for StratifiedShuffleSplit
        features = np.array(features)
        labels = np.array(labels)

        # Shuffle and split data into training/testing sets
        # Note: each pass overwrites the previous split, so only the last
        # split produced by sss is used for this trial
        sss = StratifiedShuffleSplit()
        for train_index, test_index in sss.split(features, labels):
            features_train, features_test = features[train_index], features[test_index]
            labels_train, labels_test = labels[train_index], labels[test_index]

        # Create, fit, and predict classifier
        clf = clf_choice
        clf.fit(features_train, labels_train)
        labels_pred = clf.predict(features_test)

        try:
            precision = precision_score(labels_test, labels_pred)
            recall = recall_score(labels_test, labels_pred)
            precision_list.append(precision)
            recall_list.append(recall)
        except:
            pass

    print "\n" + "#" * 60
    print " " * 20 + name + "\n"
    #print confusion_matrix(labels_test, labels_pred)
    print "Precision Mean: ", np.mean(precision_list)
    print "Recall Mean: ", np.mean(recall_list)
    #print "F1 Mean:", np.nanmean(f1_list)
    print "STD_sum: ", np.std(precision_list) + np.std(recall_list)
    print "\n" + "#" * 60

##### Classifiers: Decision Tree

In [31]:
from sklearn.tree import DecisionTreeClassifier
test_prec_recall("Decision Tree", DecisionTreeClassifier())


############################################################
                    Decision Tree

Precision Mean:  0.437333333333
Recall Mean:  0.425
STD_sum:  0.750703283749

############################################################


In [55]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer

kbest = SelectKBest(f_classif)

pipeline_dt = Pipeline([
    ('kbest', kbest),
    ('imp', Imputer()),
    ('std', StandardScaler()),
    #('pca', PCA()),
    ('clf', DecisionTreeClassifier(random_state = 49)),
])

param_grid_dt = {'kbest__k': [2,3,4],
                 'imp__strategy': ['median', 'mean'],
                 'clf__min_samples_split': [2,3,5,7,11],
                 'clf__max_depth': [2,3,5,7],
                 #'pca__n_components': [2,3,5,7]
                }
                 
                 
                 

cross_validator = StratifiedShuffleSplit(random_state = 0)


from sklearn.model_selection import GridSearchCV

gridCV_object_dt = GridSearchCV(estimator = pipeline_dt,
                                param_grid = param_grid_dt,
                                scoring = 'f1',
                                cv=cross_validator)

gridCV_object_dt = gridCV_object_dt.fit(features_train, labels_train)


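# Note: with scoring='f1', score() reports F1 on the test set, not accuracy
print "Decision Tree - Pipeline Accuracy Score:"
print gridCV_object_dt.score(features_test, labels_test)

# Note: get_params() lists the search configuration; the tuned values
# are available in gridCV_object_dt.best_params_
print "\nDecision Tree - Pipeline parameters:"
pprint(gridCV_object_dt.get_params())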

Decision Tree - Pipeline Accuracy Score:
0.181818181818

Decision Tree - Pipeline parameters:
{'cv': StratifiedShuffleSplit(n_splits=10, random_state=0, test_size=0.1,
            train_size=None),
 'error_score': 'raise',
 'estimator': Pipeline(steps=[('kbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f581bb738c0>)), ('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('std', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', DecisionTreeClassifier(class_weight=None, criterio...plit=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=49, splitter='best'))]),
 'estimator__clf': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=49, splitter='best'),
 'estimator__clf__

##### Classifiers: Random Forest

In [33]:
from sklearn.ensemble import RandomForestClassifier
test_prec_recall("Random Forest", RandomForestClassifier())


############################################################
                    Random Forest

Precision Mean:  0.300833333333
Recall Mean:  0.24
STD_sum:  0.660252582287

############################################################


##### Classifiers: Extra Trees

In [34]:
from sklearn.ensemble import ExtraTreesClassifier
test_prec_recall("Extra Tree", ExtraTreesClassifier())


############################################################
                    Extra Tree

Precision Mean:  0.348333333333
Recall Mean:  0.24
STD_sum:  0.730309117483

############################################################


##### Classifiers: SVMs

In [56]:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

kbest = SelectKBest(f_classif)

pipeline_svc = Pipeline([
    ('kbest', kbest),
    ('imp', Imputer()),
    ('minmaxscaler', MinMaxScaler()),
    ('clf', SVC())
])

param_grid_svc = {'kbest__k': [2,3,4],
                  'imp__strategy': ['median', 'mean'],
                  'clf__C': [10,20],
                  'clf__kernel': ['rbf'],
                 }

cross_validator = StratifiedShuffleSplit(random_state = 0)

from sklearn.model_selection import GridSearchCV
gridCV_object_svc = GridSearchCV(estimator = pipeline_svc,
                                 param_grid = param_grid_svc,
                                 scoring = 'f1',
                                 cv = cross_validator)

gridCV_object_svc = gridCV_object_svc.fit(features_train, labels_train)


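# Note: with scoring='f1', score() reports F1 on the test set, not accuracy
print "SVC - Pipeline Accuracy Score:"
print gridCV_object_svc.score(features_test, labels_test)

# Note: get_params() lists the search configuration; the tuned values
# are available in gridCV_object_svc.best_params_
print "\nSVC - Pipeline parameters:"
pprint(gridCV_object_svc.get_params())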

SVC - Pipeline Accuracy Score:
0.5

SVC - Pipeline parameters:
{'cv': StratifiedShuffleSplit(n_splits=10, random_state=0, test_size=0.1,
            train_size=None),
 'error_score': 'raise',
 'estimator': Pipeline(steps=[('kbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f581bb738c0>)), ('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
 'estimator__clf': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
 'estimator__clf__C': 1.0,
 'estimator__clf__cache_size': 200,
 '

##### Classifiers: Naive Bayes

In [None]:
##### Naive Bayes #####
# Naive Bayes never predicts true positive, but can predict true negative.
from sklearn.naive_bayes import GaussianNB
test_prec_recall("Naive Bayes", GaussianNB())

#### Classifiers: Summary

The classifier chosen as best is DecisionTree, followed somewhat closely by GaussianNB. DecisionTree maintains higher combined precision and recall scores while having relatively low standard deviations for both. Precision is the percentage of people the classifier labeled as POIs who actually are POIs. Recall is the percentage of actual POIs that the classifier correctly identified. Although other classifiers have a higher individual precision or recall score (or a lower individual standard deviation), no other classifier appears to be a better all-around choice than DecisionTree.
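The two definitions can be made concrete with a small hand computation. The sketch below counts true positives, false positives, and false negatives on made-up labels and also derives F1, the harmonic mean of precision and recall used as the grid search scoring metric. `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(labels_true, labels_pred):
    """Compute precision, recall, and F1 for the positive class (1 = POI)."""
    tp = fp = fn = 0
    for truth, pred in zip(labels_true, labels_pred):
        if pred == 1 and truth == 1:
            tp += 1   # flagged as POI and really a POI
        elif pred == 1 and truth == 0:
            fp += 1   # flagged as POI but not a POI
        elif pred == 0 and truth == 1:
            fn += 1   # a real POI that was missed

    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 3 real POIs among 8 people, 2 people flagged
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
```

Accuracy would be 5/8 in this example even though two of the three POIs were missed, which is why precision and recall are the more informative metrics for such an imbalanced dataset.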

In [None]:
###### Best clf appears to be Decision Tree;
###### precision and recall mean > .3
###### generally lowest sum of precision and recall standard deviations
clf = DecisionTreeClassifier()

### Parameters

Multiple attempts were made to tune the SVM classifier. Two kernels were tested ('rbf' and 'poly'), along with C values ranging from 1 to 1000. Attempts to improve the SVM ceased when it appeared the classifier would never outperform either DecisionTree or GaussianNB.
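A rough count of how much work such a search involves helps explain why the slow 'poly' kernel was costly. The sketch below enumerates the combinations of the two kernels with a few C values and multiplies by the number of cross-validation splits. The specific C values are an assumption (the text only gives the range 1 to 1000), and `n_splits = 10` is taken from the StratifiedShuffleSplit settings shown in the grid search output.

```python
from itertools import product

kernels = ['rbf', 'poly']
c_values = [1, 10, 100, 1000]  # assumed sample of the 1-1000 range

# Every combination of kernel and C is one candidate model
grid = [{'kernel': kernel, 'C': c} for kernel, c in product(kernels, c_values)]

n_splits = 10  # splits per candidate in the cross-validator
total_fits = len(grid) * n_splits

print("{} candidates x {} splits = {} fits".format(
    len(grid), n_splits, total_fits))
```

Each additional tuned parameter multiplies the number of fits, so a slow kernel quickly dominates the run time.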

The DecisionTree parameter 'criterion' was tested with both 'gini' (default) and 'entropy'. Although the scores were close, 'entropy' introduced more deviation (results below) and was not used in the final calculation.
~~~
############################################################
                    Decision Tree (criterion='entropy')

Precision Mean:  0.414058344433
Recall Mean:  0.383863888889
STD_sum:  0.46693555759
############################################################
~~~
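The two criteria measure the impurity of a tree node in slightly different ways, which is why their results tend to be close. The sketch below computes both for a node described by its class counts; the counts are made up and the function names are hypothetical.

```python
import math


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c)


# Class counts per node, e.g. (POIs, non-POIs)
for counts in ([5, 5], [1, 3], [10, 0]):
    print("{}: gini={:.3f} entropy={:.3f}".format(
        counts, gini(counts), entropy(counts)))
```

Both measures are zero for a pure node and largest for an even split, so they usually favor similar splits.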

In [None]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33)

### Validation

Validation checks how well a classifier generalizes to data it has not seen.

Validation is usually done by splitting the data into training and testing sets. If too much of the data goes to training, too little is left to measure performance reliably, and an overfit classifier can go unnoticed; if too little goes to training, the classifier can underfit. Either way, the testing results suffer.

The dataset was split into training (67%) and testing (33%) sets via train_test_split with test_size=0.33. Because a single split of such a small dataset gives noisy scores, test_prec_recall and the tester.py script use StratifiedShuffleSplit instead, which scores the classifier over repeated random partitions that preserve the ratio of POIs to non-POIs.
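The stratification idea can be sketched without scikit-learn: shuffle the indices of each class separately and take the same fraction of each for the test set. This is a simplified illustration, not the algorithm StratifiedShuffleSplit uses internally; the function name, the rounding rule, and the toy labels are assumptions.

```python
import random


def stratified_shuffle_split(labels, test_size=0.33, seed=0):
    """Return (train_index, test_index) with class ratios roughly preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_index, test_index = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Take the same fraction of every class for testing
        n_test = int(round(len(indices) * test_size))
        test_index.extend(indices[:n_test])
        train_index.extend(indices[n_test:])
    return sorted(train_index), sorted(test_index)


# Made-up labels: 6 POIs among 36 people
toy_labels = [1] * 6 + [0] * 30
train_index, test_index = stratified_shuffle_split(toy_labels, seed=42)

test_pois = sum(toy_labels[i] for i in test_index)
print("test size: {}, POIs in test: {}".format(len(test_index), test_pois))
```

Calling the function with different seeds gives different partitions, and averaging scores over them is what smooths out the trial-to-trial variation.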

### Summary

The Enron Corpus is one of the largest datasets on fraud. Although the subset of labeled people used here is small, a DecisionTree classifier appears to be a strong option for predicting fraud, followed closely by GaussianNB. Additional or different features are needed to reach higher precision and recall scores; the current scores are mediocre, but the classifier has potential as a starting point for detecting fraud.

#### References

Feature Visualization:
https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard

Forum Postings:
https://discussions.udacity.com/t/getting-started-with-final-project/170846

When to choose which machine learning classifier?
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier

In [None]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)