# Machine Learning Implementation to Identify Enron Persons of Interest

## Eric Gordon

This project builds and tests a machine learning classifier against an interesting historical event: the fraud and collapse of the Enron Corporation. Machine learning is well suited to the task. A classifier can be trained and tested to label persons of interest within a dataset of Enron employees, given only certain features about each individual. The dataset is small, with just 146 entries, but a classifier trained on it could be used to flag persons of interest in separate Enron data, or potentially be tested against other cases of corporate fraud if the data were available.

The code below investigates the data and shows the attempts to optimize the machine learning algorithm. This report can be read as an exploratory data analysis.

#### Python Library Importing

In [1]:
import sys
import pickle
sys.path.append("../tools/")
import pandas as pd
import numpy as np
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import seaborn as sns

#### Create Data

In [3]:
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

#### Data investigation

In [4]:
#Create Pandas Dataframe
enron_df = pd.DataFrame.from_records(list(data_dict.values()))
employees=np.array(data_dict.keys())
enron_df.set_index(employees, inplace=True)

#Investigation of Entry Names
print "Number of Entries:", len(enron_df) 

for person in enron_df.iterrows():
    print person[0]


Number of Entries: 146
METTS MARK
BAXTER JOHN C
ELLIOTT STEVEN
CORDES WILLIAM R
HANNON KEVIN P
MORDAUNT KRISTINA M
MEYER ROCKFORD G
MCMAHON JEFFREY
HORTON STANLEY C
PIPER GREGORY F
HUMPHREY GENE E
UMANOFF ADAM S
BLACHMAN JEREMY M
SUNDE MARTIN
GIBBS DANA R
LOWRY CHARLES P
COLWELL WESLEY
MULLER MARK S
JACKSON CHARLENE R
WESTFAHL RICHARD K
WALTERS GARETH W
WALLS JR ROBERT H
KITCHEN LOUISE
CHAN RONNIE
BELFER ROBERT
SHANKMAN JEFFREY A
WODRASKA JOHN
BERGSIEKER RICHARD P
URQUHART JOHN A
BIBI PHILIPPE A
RIEKER PAULA H
WHALEY DAVID A
BECK SALLY W
HAUG DAVID L
ECHOLS JOHN B
MENDELSOHN JOHN
HICKERSON GARY J
CLINE KENNETH W
LEWIS RICHARD
HAYES ROBERT E
MCCARTY DANNY J
KOPPER MICHAEL J
LEFF DANIEL P
LAVORATO JOHN J
BERBERIAN DAVID
DETMERING TIMOTHY J
WAKEHAM JOHN
POWERS WILLIAM
GOLD JOSEPH
BANNANTINE JAMES M
DUNCAN JOHN H
SHAPIRO RICHARD S
SHERRIFF JOHN R
SHELBY REX
LEMAISTRE CHARLES
DEFFNER JOSEPH M
KISHKILL JOSEPH G
WHALLEY LAWRENCE G
MCCONNELL MICHAEL S
PIRO JIM
DELAINEY DAVID W
SULLIVAN-SHAKLOV

Two separate names stand out as problematic:<br>
'THE TRAVEL AGENCY IN THE PARK' <br>
'TOTAL'<br>
Neither is an individual employee, so these entries will be deleted as outliers.

In [4]:
#Number of POIs
count=0
for person in data_dict:
    if data_dict[person]["poi"]==True:
        count+=1      
print "POI's:",count

POI's: 18


In [5]:
# people with too many Nans
for person in data_dict:
    NaN=0
    for feature in data_dict[person]:
        if data_dict[person][feature] == "NaN":
            NaN += 1
    if NaN >= 18:
        print person, NaN

WHALEY DAVID A 18
WROBEL BRUCE 18
LOCKHART EUGENE E 20
THE TRAVEL AGENCY IN THE PARK 18
GRAMM WENDY L 18


In [6]:
#Take a look at Eugene Lockhart
print enron_df.loc['LOCKHART EUGENE E']

bonus                          NaN
deferral_payments              NaN
deferred_income                NaN
director_fees                  NaN
email_address                  NaN
exercised_stock_options        NaN
expenses                       NaN
from_messages                  NaN
from_poi_to_this_person        NaN
from_this_person_to_poi        NaN
loan_advances                  NaN
long_term_incentive            NaN
other                          NaN
poi                          False
restricted_stock               NaN
restricted_stock_deferred      NaN
salary                         NaN
shared_receipt_with_poi        NaN
to_messages                    NaN
total_payments                 NaN
total_stock_value              NaN
Name: LOCKHART EUGENE E, dtype: object


The entry 'LOCKHART EUGENE E' is clearly also an outlier. Its only data entry is False for POI, with no other data. This entry therefore cannot help shape our algorithm and will also be removed.

In [5]:
###Remove outliers
try:
    enron_df.drop('THE TRAVEL AGENCY IN THE PARK', inplace=True)
    enron_df.drop('TOTAL', inplace=True)
    enron_df.drop('LOCKHART EUGENE E', inplace=True)
    print "All Outliers Removed"
    print "Final number of Data Entries:", len(enron_df)
except (KeyError, ValueError):
    # Raised by drop() when a label is no longer in the index
    print "Outliers Already Removed\n"

All Outliers Removed
Final number of Data Entries: 143


Thus our final data set has 143 entries of Enron employees.
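
The same removal can be written against the raw dictionary so that rerunning the cell is harmless. The sketch below is only an illustration: `toy_data` and the `remove_outliers` helper are hypothetical and the values are made up. It relies on `dict.pop` with a default, which does nothing when a key is already gone.

```python
from __future__ import print_function

# Toy stand-in for data_dict; the values are made up for illustration.
toy_data = {
    "METTS MARK": {"poi": False},
    "TOTAL": {"poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"poi": False},
    "LOCKHART EUGENE E": {"poi": False},
}

outliers = ["THE TRAVEL AGENCY IN THE PARK", "TOTAL", "LOCKHART EUGENE E"]


def remove_outliers(data, names):
    """Drop each named entry if present; return how many were removed."""
    removed = 0
    for name in names:
        if data.pop(name, None) is not None:
            removed += 1
    return removed


first_pass = remove_outliers(toy_data, outliers)
second_pass = remove_outliers(toy_data, outliers)  # nothing left to remove
print("Removed:", first_pass, "then", second_pass)
print("Remaining entries:", sorted(toy_data))
```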

#### Feature Investigation and Selection

Now we will examine the features of the data in more detail.

In [8]:
#Number of Features and their Types
features=0
for person in data_dict:
    for feature in data_dict[person]:
        print feature, type(feature)
        features+=1
    break
print "Number of Feautres:", features
print''

#Number of NaN's for Each Feature
for feature in data_dict['METTS MARK'].keys():
    Nans=0    
    for person in data_dict:
        if data_dict[person][feature]=='NaN':
            Nans += 1
    print "Number of", feature, "NaNs: ", Nans

salary <type 'str'>
to_messages <type 'str'>
deferral_payments <type 'str'>
total_payments <type 'str'>
exercised_stock_options <type 'str'>
bonus <type 'str'>
restricted_stock <type 'str'>
shared_receipt_with_poi <type 'str'>
restricted_stock_deferred <type 'str'>
total_stock_value <type 'str'>
expenses <type 'str'>
loan_advances <type 'str'>
from_messages <type 'str'>
other <type 'str'>
from_this_person_to_poi <type 'str'>
poi <type 'str'>
director_fees <type 'str'>
deferred_income <type 'str'>
long_term_incentive <type 'str'>
email_address <type 'str'>
from_poi_to_this_person <type 'str'>
Number of Feautres: 21

Number of salary NaNs:  51
Number of to_messages NaNs:  60
Number of deferral_payments NaNs:  107
Number of total_payments NaNs:  21
Number of exercised_stock_options NaNs:  44
Number of bonus NaNs:  64
Number of restricted_stock NaNs:  36
Number of shared_receipt_with_poi NaNs:  60
Number of restricted_stock_deferred NaNs:  128
Number of total_stock_value NaNs:  20
Number o

In [9]:
### Create new feature(s)
# combined_emails is the sum of the four email counts: messages sent, messages received,
# emails from a poi to this person, and emails from this person to a poi
enron_df['combined_emails'] = (enron_df['from_messages'].astype(float)) + (enron_df['from_poi_to_this_person'].astype(float)) +\
(enron_df['to_messages'].astype(float)) + (enron_df['from_this_person_to_poi'].astype(float)) 
   
enron_df['combined_emails'].fillna(value=0, inplace=True)
        
print "Example Combined Emails Entries: \n", enron_df['combined_emails'].head()
print ""
print "all features now \n", enron_df.keys()

Index([u'bonus', u'deferral_payments', u'deferred_income', u'director_fees',
       u'email_address', u'exercised_stock_options', u'expenses',
       u'from_messages', u'from_poi_to_this_person',
       u'from_this_person_to_poi', u'loan_advances', u'long_term_incentive',
       u'other', u'poi', u'restricted_stock', u'restricted_stock_deferred',
       u'salary', u'shared_receipt_with_poi', u'to_messages',
       u'total_payments', u'total_stock_value', u'combined_emails'],
      dtype='object')


In [45]:
### RETURN to DICTIONARY
data_dict=enron_df.to_dict('index')  

### Store to my_dataset for easy export below.
my_dataset = data_dict

print len(my_dataset)
print my_dataset['METTS MARK'].keys()

143
['to_messages', 'deferral_payments', 'expenses', 'poi', 'long_term_incentive', 'email_address', 'from_poi_to_this_person', 'deferred_income', 'combined_emails', 'restricted_stock_deferred', 'shared_receipt_with_poi', 'loan_advances', 'from_messages', 'other', 'director_fees', 'bonus', 'total_stock_value', 'from_this_person_to_poi', 'restricted_stock', 'salary', 'total_payments', 'exercised_stock_options']


Looking at the data provided, there are 21 features associated with each individual. The number of valid data entries varies considerably from feature to feature, as seen in the number of NaNs listed for each feature above. It is therefore worth investigating whether every feature should be included in our machine learning algorithm. Manually choosing which features to keep for predicting persons of interest would be confusing, so we will select the features programmatically.

Before implementing this, though, I want to note the manually created feature called ‘combined_emails’, in which all the email counts sent and received, both to and from persons of interest and otherwise, were combined into a single data point. The theory behind this is that persons of interest may have been more active in overall email use than others, because of their rank in the company and possibly because of their heavier involvement in company dealings. So I wanted to see whether heavy email use could help predict persons of interest.
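
One detail of this feature is worth spelling out. In the cell that creates it, a missing count becomes NaN after `astype(float)`, NaN propagates through the addition, and `fillna` then turns the total into 0. So a person missing any one of the four counts gets a combined value of 0. The sketch below reproduces that behavior in plain Python on made-up records and contrasts it with a hypothetical alternative (`skip_missing=True`, not used in this project) that adds up whatever counts are present.

```python
from __future__ import print_function

EMAIL_FIELDS = ["from_messages", "to_messages",
                "from_poi_to_this_person", "from_this_person_to_poi"]


def combined_emails(record, skip_missing=False):
    """Sum the four email counts of one person.

    skip_missing=False mirrors the notebook: any "NaN" makes the total 0.
    skip_missing=True is a hypothetical alternative that sums the counts
    that are present.
    """
    present = [record[f] for f in EMAIL_FIELDS if record[f] != "NaN"]
    if not skip_missing and len(present) < len(EMAIL_FIELDS):
        return 0.0
    return float(sum(present))


# Made-up records for illustration; not taken from the dataset.
complete = {"from_messages": 10, "to_messages": 20,
            "from_poi_to_this_person": 3, "from_this_person_to_poi": 2}
partial = dict(complete, from_this_person_to_poi="NaN")

print("complete:", combined_emails(complete))
print("partial, notebook behavior:", combined_emails(partial))
print("partial, skipping missing:", combined_emails(partial, skip_missing=True))
```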

#### Selecting Features to Use

In [12]:
# Create Data
#Note: Email is just a string, so was excluded
temp_features = ['poi','to_messages', 
                 'deferral_payments','combined_emails', 'expenses',
                 'long_term_incentive', 'from_poi_to_this_person',
                 'deferred_income','restricted_stock_deferred', 
                 'shared_receipt_with_poi', 'loan_advances', 
                 'from_messages', 'other','director_fees', 
                 'bonus', 'total_stock_value', 
                 'from_this_person_to_poi', 'restricted_stock',
                 'salary', 'total_payments', 'exercised_stock_options']

print "Starting Features to Test :", len(temp_features) - 1 

#Create Data
data = featureFormat(data_dict, temp_features, sort_keys = True)
labels, features = targetFeatureSplit(data)

Starting Feature Length: 21

 Features List Now is: 
16 ['poi', 'salary', 'total_payments', 'exercised_stock_options', 'bonus', 'director_fees', 'deferred_income', 'shared_receipt_with_poi', 'loan_advances', 'other', 'total_stock_value', 'long_term_incentive', 'expenses', 'restricted_stock', 'from_poi_to_this_person', 'from_this_person_to_poi']


The SelectKBest tool from Python’s scikit-learn library will be used over multiple iterations of randomized training and testing splits to select the best features for our person of interest identifier.
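
SelectKBest ranks features with a scoring function. When none is passed, as in this notebook, scikit-learn falls back to the ANOVA F-value (`f_classif`). The sketch below is a plain-Python illustration of that statistic for a single feature, not the library's implementation; the `anova_f` helper and the toy values are hypothetical, and it does not handle the case where every value equals its class mean.

```python
from __future__ import division, print_function


def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(float(v))
    n_total = len(values)
    grand_mean = sum(values) / n_total
    # Variation of the class means around the overall mean
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Variation of each value around its own class mean
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (between / df_between) / (within / df_within)


# Made-up feature values for six people, three per class.
toy_labels = [0, 0, 0, 1, 1, 1]
separating = [1, 2, 3, 4, 5, 6]    # class means differ
overlapping = [1, 5, 3, 2, 4, 3]   # class means are equal

f_separating = anova_f(separating, toy_labels)
f_overlapping = anova_f(overlapping, toy_labels)
print("F for separating feature: %.2f" % f_separating)
print("F for overlapping feature: %.2f" % f_overlapping)
```

A feature whose class means are far apart relative to the spread within each class gets a high score and is kept first.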

Before selecting the best features, though, all of the features will be scaled to numeric values between 0 and 1. Scaling helps neutralize the difference between the monetary feature values, which can be in the millions, and the much smaller email counts.
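
As a rough illustration of what min-max scaling does, the sketch below maps each value to (x - min) / (max - min) in plain Python. The `min_max_scale` helper and the sample numbers are made up for this example, and returning zeros for a constant column is a choice made for the sketch.

```python
from __future__ import division, print_function


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values on very different scales.
salaries = [200000, 450000, 1000000]
email_counts = [10, 60, 110]

scaled_salaries = min_max_scale(salaries)
scaled_emails = min_max_scale(email_counts)
print("salaries:", scaled_salaries)
print("emails:  ", scaled_emails)
```

After scaling, both columns span the same range, so neither dominates purely because of its units.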

To figure out how many features to keep, I tested both the number of features to keep and which features were selected.

In [None]:
#Selecting Number of Features to Keep
x=[]
f1score=[]
n_features_dictionary={}

#Iterate through number to use for KBest
for N in range(1, 21):
    kf=StratifiedShuffleSplit(labels, test_size=0.1,n_iter=5, random_state=15)
    f1=0
    features_list=['poi']
    feature_counts={}
    for train_indices, test_indices in kf:
        features_train= [features[ii] for ii in train_indices]
        features_test= [features[ii] for ii in test_indices]
        labels_train= [labels[ii] for ii in train_indices]
        labels_test= [labels[ii] for ii in test_indices]

        #Create a Classifier to use with N number of Features
        kbest=SelectKBest(k=N)
        scaler=MinMaxScaler()
        clssf=GaussianNB()
        pipe=Pipeline(steps=[('Scale', scaler), ('KBest', kbest), ('Classifier', clssf)])
        
        #Get K Best Separately (so the selected features can be counted below)
        kbest.fit(features_train, labels_train )
        features_to_use = kbest.get_support()
        
        #Predict with N features
        pipe.fit(features_train,labels_train)
        pred=pipe.predict(features_test)
        
        #Record Score
        f1_sc = f1_score(labels_test, pred)
        f1 += f1_sc
        
        #Create Feature Counts Dictionary for KBest is N
        for i, item in enumerate(features_to_use):
            if item:
                feature=temp_features[i + 1 ]
                if feature in feature_counts:
                    feature_counts[feature]+=1
                else:
                    feature_counts[feature]=1           

    f1 = f1/len(kf)
    #Record Average Score in list For F1 Score
    f1score.append(f1)
    #Create x List for Graph
    x.append(N)
    
    #Create Feature List for N Features
    for feature in feature_counts:
        if feature_counts[feature] > 1:
            features_list.append(feature)
    
    #Save Feature list to Features Dictionary for k=N Features
    n_features_dictionary[N] = features_list

In [None]:
#Plot Results Trials To See Number of Features to keep
%pylab inline

plt.scatter(x, f1score, c="Red") 
plt.xlabel('Number of Features')
plt.ylabel('f1 Score')
plt.title('F1 SCORES',fontsize=18)

print "Number of Features for Best F1:", f1score.index(max(f1score)) + 1

The average F1 score for a basic classifier appears to be highest when k is set to keep 14 features, as seen in the graph above. So we will use those 14 features going forward.

In [None]:
### Save features_list as Final List
### Print Out the Feature list from the n_features_dictionary

features_list=n_features_dictionary[14]

print "# of features is Now:", len(n_features_dictionary[14]), 
print "\n Feature list is now:", n_features_dictionary[14]

#### Classifier Exploration

Before settling on a single algorithm, it is worth trying several different machine learning algorithms to see how some basic classifiers perform on this data. I will test the following classifiers to see if any work well: <br>
AdaBoost, Decision Tree, Gaussian Naïve Bayes, Support Vector, and Random Forest.

We will mainly compare recall and precision scores to decide which classifiers to pursue.
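
The reason for leaning on precision and recall rather than accuracy is the class imbalance. The sketch below takes the 18-of-143 balance reported earlier and scores a made-up "classifier" that never flags anyone. The counts passed to the real classifiers may differ slightly after formatting, so treat the numbers as a rough baseline only.

```python
from __future__ import division, print_function

n_total = 143   # entries after outlier removal
n_poi = 18      # persons of interest

true_labels = [1] * n_poi + [0] * (n_total - n_poi)
predictions = [0] * n_total   # always predict "not a POI"

correct = sum(1 for y, p in zip(true_labels, predictions) if y == p)
true_pos = sum(1 for y, p in zip(true_labels, predictions)
               if y == 1 and p == 1)

baseline_accuracy = correct / n_total
baseline_recall = true_pos / n_poi
print("Baseline accuracy: %.3f" % baseline_accuracy)
print("Baseline recall:   %.3f" % baseline_recall)
```

A model can therefore score well on accuracy while finding no persons of interest at all.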

In [13]:
# Re Create Data With New List
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [56]:
### Try a variety of classifiers
### Definition of Classifier Tester Function
def test_class(clssf):
    kf=StratifiedShuffleSplit(labels, test_size=0.1,n_iter=20)
    accuracy=0
    precision=0
    recall=0
    for train_indices, test_indices in kf:
        features_train= [features[ii] for ii in train_indices]
        features_test= [features[ii] for ii in test_indices]
        labels_train= [labels[ii] for ii in train_indices]
        labels_test= [labels[ii] for ii in test_indices]
        
        # Wrap the classifier in a single-step pipeline
        pipe=Pipeline([
            ("Classifier",clssf)
        ])

        pipe.fit(features_train,labels_train)
        pred=pipe.predict(features_test)

        acc = accuracy_score(labels_test, pred)
        prec = precision_score(labels_test, pred)
        recl = recall_score(labels_test, pred)

        accuracy += acc
        precision += prec
        recall += recl

    accuracy= accuracy/len(kf)
    precision= precision/len(kf)
    recall= recall/len(kf)
    print "Accuracy:", accuracy
    print "Precision:", precision
    print "Recall:", recall
    return

In [52]:
clssf = AdaBoostClassifier()
test_class(clssf)

Accuracy: 0.84
Precision: 0.3
Recall: 0.25


In [53]:
clssf=DecisionTreeClassifier()
test_class(clssf)

Accuracy: 0.813333333333
Precision: 0.224166666667
Recall: 0.25


In [35]:
clssf=GaussianNB()
test_class(clssf)

Accuracy: 0.83
Precision: 0.3125
Recall: 0.275


In [54]:
clssf=SVC()
test_class(clssf)

Accuracy: 0.866666666667
Precision: 0.0
Recall: 0.0


In [37]:
clssf=RandomForestClassifier()
test_class(clssf)

Accuracy: 0.846666666667
Precision: 0.191666666667
Recall: 0.15


From these initial tests, both the Gaussian Naïve Bayes and Decision Tree algorithms show reasonable validation scores, even without being optimized. We will try to optimize these two algorithms below.

#### Parameter Optimization

The GridSearchCV tool in scikit-learn will be used to optimize parameter selection. In particular, parameters such as the minimum number of samples required to split a node in our decision tree will be searched over and evaluated, and GridSearchCV will return the classifier with the best parameters. Hopefully this will improve our algorithm so that we can finalize our person of interest identifier. We will also experiment with principal component analysis (PCA) to possibly reduce the dimensionality of the features. This could help reduce noise from having too many features and help the algorithm perform better. The option of keeping all 14 components will remain available, though.
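
To get a feel for how much work an exhaustive search involves, the sketch below counts the combinations in the decision tree grid used in the next cell and multiplies by the 100 cross-validation splits. The grid is copied from that cell; the counting code itself is only an illustration and does not call scikit-learn.

```python
from __future__ import print_function
from itertools import product

# Copy of the decision tree grid searched in the next cell.
tree_params = {"Classifier__criterion": ['gini', 'entropy'],
               "Classifier__splitter": ["best", "random"],
               "Classifier__min_samples_split": [2, 5, 18],
               'Classifier__max_leaf_nodes': [5, 10, 30],
               'Classifier__max_depth': [3, 4, 5, 6],
               'PCA__n_components': [4, 6, 8, 10, 14]}
n_splits = 100  # n_iter of the StratifiedShuffleSplit

# Every combination of one value per parameter
combinations = list(product(*tree_params.values()))
n_fits = len(combinations) * n_splits

print("Parameter combinations:", len(combinations))
print("Pipeline fits during the search:", n_fits)
```

Each extra value added to one parameter multiplies the total, which is why the grid is kept to a handful of values per parameter.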

In [58]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
tree_params = {"Classifier__criterion":['gini','entropy'], 
               "Classifier__splitter": ["best", "random"],
               "Classifier__min_samples_split":[2,5,18],
               'Classifier__max_leaf_nodes':[5,10,30],
               'Classifier__max_depth':[3,4,5,6],
               'PCA__n_components': [4,6,8,10,14]
              }

pca=PCA()
clssf=DecisionTreeClassifier()
pipe=Pipeline([
            ("PCA", pca),("Classifier",clssf)
        ])
CV = StratifiedShuffleSplit(labels, test_size=0.1, n_iter=100)


gs= GridSearchCV(pipe, tree_params , cv=CV, scoring='recall')

gs.fit(features, labels)
clssf=gs.best_estimator_

test_class(clssf)
print clssf

TypeError: test_classifier() takes at least 3 arguments (1 given)

In [32]:
gaussian_params = {
               'PCA__n_components': range(2,14)
              }

pca=PCA()
clssf=GaussianNB()
pipe=Pipeline([
            ("PCA", pca),("Classifier",clssf)
        ])

CV = StratifiedShuffleSplit(labels, test_size=0.1, n_iter=100)

gs= GridSearchCV(pipe, gaussian_params , cv=CV, scoring='recall')

gs.fit(features, labels)
clssf=gs.best_estimator_

test_class(clssf)
print clssf


Accuracy: 0.863333333333
Precision: 0.308333333333
Recall: 0.2
Pipeline(steps=[('PCA', PCA(copy=True, n_components=2, whiten=False)), ('Classifier', GaussianNB())])


After optimizing the two classifiers, it seems that performing principal component analysis followed by a Decision Tree algorithm returns the best results for this machine learning task. The parameter optimization with GridSearchCV helped increase the overall performance scores. This will therefore be the algorithm we use below.

# Final Person Of Interest Identifier Test and Closing Thoughts

In [49]:
features_list=['poi', 'salary', 'total_payments', 
               'exercised_stock_options', 'bonus', 'deferred_income', 
               'shared_receipt_with_poi', 'loan_advances', 'other', 
               'total_stock_value', 'long_term_incentive', 
               'expenses', 'restricted_stock', 'from_poi_to_this_person', 'from_this_person_to_poi']

clf= Pipeline(steps=[('PCA', PCA(copy=True, n_components=8, whiten=False)), ('Classifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=30, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])

from tester import test_classifier
print "Tester Classification report" 
test_classifier(clf, my_dataset, features_list)

Tester Classification report
Pipeline(steps=[('PCA', PCA(copy=True, n_components=12, whiten=False)), ('Classifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=30, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
	Accuracy: 0.82047	Precision: 0.32348	Recall: 0.31750	F1: 0.32046	F2: 0.31868
	Total predictions: 15000	True positives:  635	False positives: 1328	False negatives: 1365	True negatives: 11672



## Final Thoughts and Evaluation

Using the final classifier (in the output above), the test above gives us several metrics to evaluate. The classifier made 15,000 predictions on entries formatted like the data it was trained on, to see how well it would predict whether an Enron employee, described by the features we selected above, was actually a person of interest. The model correctly labeled 635 persons of interest who were in fact so according to the data (true positives). Additionally, the model correctly declined to flag 11,672 employees who indeed should not have been flagged (true negatives).

The model did, however, falsely flag 1,328 individuals as persons of interest when the data point should not have been flagged. This is what the “precision score” of 0.323 reflects: of everyone the model flagged, only about a third were actual persons of interest. On the other hand, there were also 1,365 individuals that the model did not flag as persons of interest but should have. Similarly, this is what the “recall score” of 0.318 reflects. Having more false flags than correct identifications, and more missed persons of interest than correctly identified ones, shows the limitations of this machine learning algorithm.
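
The scores in the tester report follow directly from its four confusion counts. As a worked check, the sketch below recomputes them from the counts printed above using the standard definitions, with F2 being the F-beta score for beta = 2. The variable names are my own.

```python
from __future__ import division, print_function

# Confusion counts copied from the tester output above.
tp, fp, fn, tn = 635, 1328, 1365, 11672

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of flagged people who are POIs
recall = tp / (tp + fn)      # share of POIs who were flagged
f1 = 2 * precision * recall / (precision + recall)
f2 = 5 * precision * recall / (4 * precision + recall)

print("Total predictions:", total)
print("Accuracy:  %.5f" % accuracy)
print("Precision: %.5f" % precision)
print("Recall:    %.5f" % recall)
print("F1:        %.5f" % f1)
print("F2:        %.5f" % f2)
```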

These mistakes actually help reveal how unique and difficult it is to fully comprehend the Enron corporate collapse. This data set contained only 18 persons of interest in a set of 143, but our algorithm identified many other individuals within the testing sets as having characteristics very similar to those of the persons of interest. Many further efforts could be made to better identify who was involved in the Enron scandal; however, it does seem clear that the scandal was not only hard to fully characterize but also very complicated. This project does, though, offer some insight into how a machine learning algorithm can be implemented and used to identify and flag unusual trends in datasets.

### Resources 

Enron- Email Dataset: <br>
 https://www.cs.cmu.edu/~./enron/

Article to identify POI’s from Enron:<br>
 http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm

General Machine Learning Support and Help:<br>
https://www.udacity.com/course/intro-to-machine-learning--ud120 (And all Forums Included)

Python Coding Support and Help:<br>
http://stackoverflow.com/


