# P5: Identify Fraud from Enron Email
<b>by Daniel J. Lee</b> <br>
<b>July 17, 2017</b>

### Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]

## 1. Introduction

The purpose of this project is to identify Enron employees who may have committed fraud, based on the company's financial and email data. Enron Corporation was an American energy company that went bankrupt amid charges of manipulating its financial data. We define an employee as a <b>person of interest (POI)</b> if he or she:
    
    * was indicted
    * testified in exchange for immunity
    * reached a settlement with the government

An example of a POI is the former CEO, [Jeffrey Skilling](https://en.wikipedia.org/wiki/Jeffrey_Skilling), who attempted to meet Wall Street expectations by modifying the company's balance sheet to indicate "favorable performance".

In this project, we used several machine learning algorithms and techniques to determine <b>how well</b> a POI can be identified from the provided financial and email data.






## 2. Data Overview

In this section, an initial examination of the data is performed:

In [4]:
df.head(3)

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,1729541,4175000.0,126027.0,1407.0,-126027.0,1729541,...,,2195.0,152.0,65.0,False,,-3081055.0,304805.0,phillip.allen@enron.com,47.0
BADUM JAMES P,,,178980.0,182466,257817,,,,,257817,...,,,,,False,,,,,
BANNANTINE JAMES M,477.0,566.0,,916197,4046157,,1757552.0,465.0,-560222.0,5243487,...,,29.0,864523.0,0.0,False,,-5104.0,,james.bannantine@enron.com,39.0


In [102]:
print("There are {} observations in the dataset composed of {} POIs and {} Non-POIs.".format(
    len(df), len(df[df['poi'] == True]), len(df[df['poi'] == False])))

There are 146 observations in the dataset composed of 18 POIs and 128 Non-POIs.


### 2a. Features

In [62]:
print("Count of Overall Features: {}".format(len(features_list)))
print("Count of Financial Features: {}".format(len(financial_features)))
print("Count of Email Features: {}".format(len(email_features)))

Count of Overall Features: 21
Count of Financial Features: 14
Count of Email Features: 6


<b>[Financial Features]</b> <br>
salary<br>
deferral_payments<br>
total_payments<br>
loan_advances<br>
bonus<br>
restricted_stock_deferred<br>
deferred_income<br>
total_stock_value<br>
expenses<br>
exercised_stock_options<br>
other<br>
long_term_incentive<br>
restricted_stock<br>
director_fees<br>

<i>** All units are in US dollars</i>

<b>[Email Features]</b> <br>
to_messages : number of emails received<br>
from_messages : number of emails sent<br>
from_poi_to_this_person : number of emails received from POI<br>
from_this_person_to_poi : number of emails sent to POI<br>
shared_receipt_with_poi : number of emails received on which a POI was also a recipient<br>
email_address: unique string<br>

<i>** All units are integers except email_address</i>

<b>[POI Indicator]</b> <br>
poi : 1 = <b>POI</b> , 0 = <b>non-POI</b>

### 2b. NaN

In [97]:
df.isnull().sum()

salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
dtype: int64

&#x25AA; Every feature except <i>poi</i> has missing values. We decided not to impute them with a statistic such as the mean; in the Python reference, NaN is simply replaced with 0.


## 3. Outlier Investigation

The main files used to investigate outliers were <b>outlier_classification.py</b> and <b>enron61702insiderpay.pdf</b>. We plotted the financial features in <b>outlier_classification.py</b> to look for outliers, which led to the following findings:

#### "TOTAL" and "THE TRAVEL AGENCY IN THE PARK"

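While checking 'salary', we found a single outlier, <i>TOTAL</i>, that exceeded 20 million USD; it is the totals row of the financial spreadsheet rather than an employee. We also found the entry <i>THE TRAVEL AGENCY IN THE PARK</i>, which is not an employee either. Both entries were removed from the dataset.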
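
An entry like this can also be surfaced without a plot by sorting on the feature. The following is a minimal sketch, assuming the data is still in the dictionary-of-dictionaries shape of `data_dict`; the records, their values, and the helper name `top_by_feature` are made up for illustration and are not taken from the dataset.

```python
# Toy records shaped like data_dict; names and values are illustrative only.
toy_data = {
    "EMPLOYEE A": {"salary": 200000, "poi": False},
    "EMPLOYEE B": {"salary": 350000, "poi": True},
    "EMPLOYEE C": {"salary": "NaN", "poi": False},
    "TOTAL": {"salary": 550000, "poi": False},  # aggregate row, not a person
}


def top_by_feature(data, feature, n=3):
    """Return the n (name, value) pairs with the largest value for a feature.

    Entries whose value is the string 'NaN' are skipped.
    """
    present = [(name, record[feature]) for name, record in data.items()
               if record[feature] != "NaN"]
    return sorted(present, key=lambda pair: pair[1], reverse=True)[:n]


largest = top_by_feature(toy_data, "salary")
print(largest)
```

Any name at the top of such a list that is not a person is a candidate for removal.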

#### "BHATNAGAR SANJAY" and "BELFER ROBERT"

While checking <i>restricted_stock_deferred</i>, we found large positive outliers, whereas the .pdf file showed only negative values. On review, the financial values of two employees, <i>BHATNAGAR SANJAY</i> and <i>BELFER ROBERT</i>, turned out to be shifted into the wrong columns. The record below shows <i>BHATNAGAR SANJAY</i> after correcting the shift with the code in the Python reference.

In [60]:
df.loc['BHATNAGAR SANJAY']

salary                                   0
to_messages                            523
deferral_payments                        0
total_payments                      137864
exercised_stock_options           15456290
bonus                                    0
restricted_stock                   2604490
shared_receipt_with_poi                463
restricted_stock_deferred         -2604490
total_stock_value                 15456290
expenses                            137864
loan_advances                            0
from_messages                           29
other                                    0
from_this_person_to_poi                  1
poi                                  False
director_fees                            0
deferred_income                          0
long_term_incentive                      0
from_poi_to_this_person                  0
received_from_poi_ratio                  0
sent_to_poi_ratio                0.0344828
shared_receipt_with_poi_ratio     0.885277
Name: BHATN
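
One way to catch shifted records like these is to check that each total equals the sum of its components. The sketch below assumes that `total_payments` is the sum of the nine payment features and `total_stock_value` the sum of the three stock features; that grouping, the helper name `inconsistent_totals`, and the `shifted` record are assumptions for illustration. The `corrected` record uses the non-zero values printed above.

```python
# Assumed grouping of the financial features into the two totals.
PAYMENT_PARTS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                 "deferral_payments", "loan_advances", "other", "expenses",
                 "director_fees"]
STOCK_PARTS = ["exercised_stock_options", "restricted_stock",
               "restricted_stock_deferred"]


def inconsistent_totals(record):
    """Return the names of the totals that do not equal the sum of their parts.

    Features missing from the record are treated as 0.
    """
    problems = []
    payments = sum(record.get(f, 0) for f in PAYMENT_PARTS)
    stock = sum(record.get(f, 0) for f in STOCK_PARTS)
    if payments != record.get("total_payments", 0):
        problems.append("total_payments")
    if stock != record.get("total_stock_value", 0):
        problems.append("total_stock_value")
    return problems


# Non-zero values of the corrected BHATNAGAR SANJAY record.
corrected = {"expenses": 137864, "total_payments": 137864,
             "exercised_stock_options": 15456290, "restricted_stock": 2604490,
             "restricted_stock_deferred": -2604490,
             "total_stock_value": 15456290}

# Hypothetical record in which a payment value sits in the wrong column.
shifted = {"other": 500, "total_payments": 900,
           "restricted_stock": 100, "total_stock_value": 100}

print(inconsistent_totals(corrected))
print(inconsistent_totals(shifted))
```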

### What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]

## 4. Engineered / Scaled Features

The following features were created to assist in determining the POIs:
<br>
&#x25AA; received_from_poi_ratio = from_poi_to_this_person / to_messages<br>
&#x25AA; sent_to_poi_ratio = from_this_person_to_poi / from_messages<br>
&#x25AA; shared_receipt_with_poi_ratio = shared_receipt_with_poi / to_messages<br>

In [79]:
df[ee].corr()['poi']

poi                              1.000000
to_messages                      0.108730
from_messages                   -0.033982
from_poi_to_this_person          0.190460
from_this_person_to_poi          0.129619
shared_receipt_with_poi          0.240876
shared_receipt_with_poi_ratio    0.247879
sent_to_poi_ratio                0.323885
received_from_poi_ratio          0.148698
Name: poi, dtype: float64

For the new features, we took the ratio of an employee's <b>messages involving a POI</b> to the <b>total messages sent or received</b>. A ratio was computed only for employees with <b>no</b> missing values in the email features it uses; for everyone else it was set to 0.

These features were created to let us examine how closely an employee interacted with POIs. It seems reasonable that a ratio gives a better picture of an employee's behavior than the raw counts alone.

Based on the correlations above, I decided not to include the <b>from_messages</b> feature: it shows almost no correlation with poi (-0.034), while <b>sent_to_poi_ratio</b> (0.324) and <b>shared_receipt_with_poi_ratio</b> (0.248) correlate with poi more strongly than the raw counts they are derived from. <b>received_from_poi_ratio</b> (0.149) is the exception, correlating less strongly than <b>from_poi_to_this_person</b> (0.190).
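
As a small worked example of the ratio features, the sketch below reproduces two of the values shown for BHATNAGAR SANJAY in Section 3 from his raw email counts. The helper name `compute_ratio` is hypothetical; the zero handling mirrors the code in the Python reference.

```python
def compute_ratio(poi_messages, all_messages):
    """Fraction of a person's messages that involve a POI.

    Returns 0.0 when either count is 0, which also covers missing
    values once NaN has been replaced with 0.
    """
    if poi_messages == 0 or all_messages == 0:
        return 0.0
    return float(poi_messages) / float(all_messages)


# from_this_person_to_poi = 1, from_messages = 29
sent_ratio = compute_ratio(1, 29)
# shared_receipt_with_poi = 463, to_messages = 523
shared_ratio = compute_ratio(463, 523)
# An employee with no email data at all
no_email_ratio = compute_ratio(0, 0)

print(sent_ratio, shared_ratio, no_email_ratio)
```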


### Addition: Log Scaling

Since financial data tend to be [skewed](https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/), I was interested in seeing whether a log transformation would affect the correlation between the financial features and the POI label. The transformation takes the log of each value's absolute value and leaves zeros at 0 (see the Python reference). It is not used for the algorithms, but I thought it would be interesting to look at.

In [57]:
df[ff].corr()['poi']

poi                          1.000000
salary                       0.340120
bonus                        0.359381
long_term_incentive          0.257361
deferred_income             -0.274393
deferral_payments           -0.039439
loan_advances                0.220295
other                        0.170687
expenses                     0.193956
director_fees               -0.121080
total_payments               0.249020
exercised_stock_options      0.370618
restricted_stock             0.243607
restricted_stock_deferred    0.073052
total_stock_value            0.371828
Name: poi, dtype: float64

In [56]:
log_df[ff].corr()['poi']

poi                          1.000000
salary                       0.241644
bonus                        0.282985
long_term_incentive          0.185817
deferred_income              0.225283
deferral_payments            0.003921
loan_advances                0.123489
other                        0.337724
expenses                     0.294070
director_fees               -0.128680
total_payments               0.218541
exercised_stock_options      0.026145
restricted_stock             0.199639
restricted_stock_deferred   -0.137413
total_stock_value            0.209197
Name: poi, dtype: float64

<b>Observations:</b>

&#x25AA; <i>salary</i>, <i>bonus</i>, <i>loan_advances</i>, and <i>exercised_stock_options</i> became less relevant <br>
&#x25AA; <i>other</i> and <i>expenses</i> became more relevant <br>
&#x25AA; <i>deferred_income</i> ( - to + ) and <i>restricted_stock_deferred</i> ( + to - ) changed signs, which is consistent with the transformation taking the log of the absolute value of features that hold negative amounts
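
A sign change of this kind is easy to reproduce on toy numbers. The sketch below is illustrative only: the values are invented, `pearson` is a hand-written helper, and `log_abs` follows the log-of-absolute-value rule used for `log_df` in the Python reference.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_abs(value):
    """log(|value|), with 0 left as 0."""
    return math.log(abs(value)) if value != 0 else 0.0


# Invented values: the two POIs have the most negative deferred income.
poi = [1, 1, 0, 0, 0]
deferred_income = [-3000000, -1500000, -20000, -5000, 0]

raw_corr = pearson(deferred_income, poi)
log_corr = pearson([log_abs(v) for v in deferred_income], poi)

print(raw_corr, log_corr)
```

Because the absolute value discards the sign, a feature whose large magnitudes are negative goes from a negative to a positive correlation with the label.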


## 5. Selecting Features

With the given features, we want to determine how important each feature is for deciding whether an employee is a POI. The univariate feature selection method [SelectKBest( )](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) was used with GridSearchCV to select the number of features 'k' that maximizes the precision and recall values.

I performed an initial run with <b>DecisionTreeClassifier()</b>, using the parameters shown below, to determine the optimal number of features. I started with the default value of 10 features and gradually increased it, examining the results with the file <i>tester.py</i>:

In [None]:
# Parameters used to determine the 'k' features
kbest__k=[8, 9, 10, 11, 12, 13, 14, 15, 'all'],
dtc__criterion=['gini', 'entropy'],
dtc__min_samples_leaf=[1, 2, 3],
dtc__max_depth=[2, 3, 4],
dtc__class_weight=['balanced']
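
The `kbest__` and `dtc__` prefixes refer to the steps of a scikit-learn pipeline. The following is a minimal sketch of how such a search could be wired up, not the exact script used for this report: the data is synthetic, the grid is shortened to keep it fast, and the `f1` scoring and the cross-validation settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the Enron features (not the real data).
X, y = make_classification(n_samples=144, n_features=18, n_informative=5,
                           weights=[0.875], random_state=42)

# Step names must match the prefixes used in the parameter grid.
pipeline = Pipeline([
    ("kbest", SelectKBest(f_classif)),
    ("dtc", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "kbest__k": [8, 12, "all"],
    "dtc__criterion": ["gini", "entropy"],
    "dtc__max_depth": [2, 3, 4],
    "dtc__class_weight": ["balanced"],
}

# Assumed validation scheme: stratified random splits, scored by F1.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

Putting SelectKBest inside the pipeline means the features are re-selected on each training split, so the test split never influences which features are kept.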

### 5a. Test 1:  Determining 'k' Features Incl. Engineered Features

    [Trials incl. engineered features]
    
    *k=8
    Accuracy: 0.81907	Precision: 0.38187	Recall: 0.57700	F1: 0.45958	F2: 0.52350
    
    *k=10
    Accuracy: 0.82420	Precision: 0.39111	Recall: 0.57200	F1: 0.46457	F2: 0.52357
    
    *k=12
    Accuracy: 0.82920	Precision: 0.40403	Recall: 0.59150	F1: 0.48011	F2: 0.54127
    
    *k=14
    Accuracy: 0.82027	Precision: 0.39259	Recall: 0.63600	F1: 0.48550	F2: 0.56584
    
    *k='all'
    Accuracy: 0.74487	Precision: 0.29874	Recall: 0.67800	F1: 0.41474	F2: 0.54071

&#x25AA; As 'k' increases, the <b>recall score generally increases</b> (from 0.577 at k=8 to 0.678 at k='all'), while the precision fluctuates between 0.38 and 0.40 and drops to 0.299 at k='all'. Note that these results <i>include the new features</i>.

### 5b. Test 2:  Determining 'k' Features Excl. Engineered Features

Before proceeding, we want to determine whether the engineered features had an effect on the precision and recall scores. The same parameters for DecisionTreeClassifier() were used again <i>excluding the engineered features</i>:
   

    [Trials excl. engineered features]
    
    *k=8
    Accuracy: 0.66600	Precision: 0.21322	Recall: 0.55950	F1: 0.30877	F2: 0.42233
    
    *k=10
    Accuracy: 0.66273	Precision: 0.20376	Recall: 0.52600	F1: 0.29373	F2: 0.39960
    
    *k=12
    Accuracy: 0.70720	Precision: 0.25350	Recall: 0.61500	F1: 0.35902	F2: 0.47852
    
    *k=14
    Accuracy: 0.73793	Precision: 0.29721	Recall: 0.70750	F1: 0.41858	F2: 0.55442
    
    *k='all'
    Accuracy: 0.73613	Precision: 0.31238	Recall: 0.81500	F1: 0.45165	F2: 0.61658

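&#x25AA; Including the engineered features <i>lowers</i> the recall score at the larger values of 'k' (for example, 0.678 versus 0.815 at k='all'), even though it gives a higher precision for k up to 14 and a higher accuracy in every trial. Because recall is the priority here, the engineered features were removed from the features list. From the results above, we decided to continue with <b>k = 'all'</b>.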

### 5c. Selected Features for k='all'

In [None]:
#Using k = 'all' and excluding the engineered features..

#SelectKBest() Scores: 
[('bonus', 30.728774633399713),
 ('salary', 15.858730905995131),
 ('shared_receipt_with_poi', 10.722570813682712),
 ('total_stock_value', 10.632581622204556),
 ('exercised_stock_options', 9.6801736665963212),
 ('total_payments', 8.9611911472233619),
 ('deferred_income', 8.7530370568859475),
 ('restricted_stock', 8.0549780560999462),
 ('long_term_incentive', 7.5551197773202938),
 ('loan_advances', 7.0379327981934612),
 ('from_poi_to_this_person', 4.9586666839661424),
 ('expenses', 4.1738959725629785),
 ('other', 3.2044591402721507),
 ('to_messages', 2.6161830046793662),
 ('director_fees', 1.8180800771349392),
 ('restricted_stock_deferred', 0.75787191672004672),
 ('from_this_person_to_poi', 0.11120823866694469),
 ('deferral_payments', 0.011021863646593114)]


#DecisionTreeClassifier() Feature Importances: 
[('expenses', 0.73121090112497555),
 ('shared_receipt_with_poi', 0.26878909887502445),
 ('salary', 0.0),
 ('to_messages', 0.0),
 ('deferral_payments', 0.0),
 ('total_payments', 0.0),
 ('exercised_stock_options', 0.0),
 ('bonus', 0.0),
 ('director_fees', 0.0),
 ('restricted_stock_deferred', 0.0),
 ('total_stock_value', 0.0),
 ('from_poi_to_this_person', 0.0),
 ('loan_advances', 0.0),
 ('other', 0.0),
 ('from_this_person_to_poi', 0.0),
 ('deferred_income', 0.0),
 ('restricted_stock', 0.0),
 ('long_term_incentive', 0.0)]

<b>Observations</b>:
<br>
&#x25AA; Our DecisionTreeClassifier() used only <i>expenses</i> and <i>shared_receipt_with_poi</i>; every other feature has an importance of 0.
<br>
&#x25AA; <i>bonus</i> was the most important feature according to SelectKBest().

### 5d. Conclusion

In this section, we wanted to find the optimal number of features for our initial algorithm. We ran two separate tests over different values of 'k': one including and one excluding the engineered features. Surprisingly, the engineered features lowered the recall score at the larger values of 'k', so they were removed from the features list.

### What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]


## 6. Algorithms

The following algorithms were tried, with the goal of scoring <b>higher than 0.3</b> on both precision and recall.
    
    * SVM / MinMaxScaler()
    * Decision Tree Classifier / SelectKBest()
    * Random Forest Classifier / SelectKBest()

An initial run was conducted to see how each algorithm performs. The file <b>tester.py</b> was used to obtain the precision and recall values for this project:
    
#### *[SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
    
    *svc__C=[1, 1.5, 1.75, 2, 2.25, 2.5],
	*svc__kernel=['linear', 'sigmoid', 'poly','rbf'],
	*svc__gamma=['auto'],
	*svc__class_weight=[None, 'balanced']
    
    Accuracy: 0.67140	Precision: 0.24685	Recall: 0.71400	F1: 0.36686	F2: 0.51795
    
    Best parameters:  {'svc__gamma': 'auto',
    'svc__class_weight': 'balanced', 
    'svc__kernel': 'sigmoid',
    'svc__C': 2.25}
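
The SVM is the one algorithm here that is paired with MinMaxScaler(). SVMs are sensitive to the scale of their inputs, whereas tree-based models split on thresholds and are unaffected by it. The sketch below shows the pairing on invented data; the feature values are not from the dataset, and the kernel is chosen only for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Two invented features on very different scales (dollars vs. email counts).
X = np.array([[200000.0, 10.0], [950000.0, 40.0], [150000.0, 5.0],
              [1200000.0, 60.0], [300000.0, 12.0], [1000000.0, 55.0]])
y = np.array([0, 1, 0, 1, 0, 1])

# The scaler is fitted on the training data only, then applied before the SVC.
model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svc", SVC(kernel="rbf", gamma="auto", class_weight="balanced", C=2.25)),
])
model.fit(X, y)

# After scaling, every feature lies in the range [0, 1].
scaled = model.named_steps["scaler"].transform(X)
print(scaled.min(axis=0), scaled.max(axis=0))
```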

#### *[Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
    
    *kbest__k=[4,5,6,7,8,9,10,11,12,13,14,15]
    *dtc__criterion=['gini','entropy'],
    *dtc__min_samples_leaf=[1,2,3],
    *dtc__max_depth=[2,3,4],
    *dtc__class_weight=['balanced']
    
    Accuracy: 0.82753	Precision: 0.39897	Recall: 0.57950	F1: 0.47258	F2: 0.53141
    
    Best parameters:  {'dtc__criterion': 'entropy',
    'dtc__max_depth': 2,
    'kbest__k': 11,
    'dtc__min_samples_leaf': 3,
    'dtc__class_weight': 'balanced'} 

#### *[Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    
    *kbest__k=[4,5,6]
    *rfc__n_estimators=[9, 11, 12],
    *rfc__criterion=['gini', 'entropy'],
    *rfc__max_features=['auto', 3, 4],
    *rfc__min_samples_split=[2, 3, 4, 5],
    
    Accuracy: 0.84113	Precision: 0.34265	Recall: 0.20850	F1: 0.25925	F2: 0.22621
    
    Best parameters:  {'rfc__criterion': 'gini', 
    'rfc__min_samples_split': 2, 
    'kbest__k': 4,
    'rfc__max_features': 'auto',
    'rfc__n_estimators': 9} 
    
    

### What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric item: “tune the algorithm”]

For each algorithm, the parameters that were searched are shown in the sections above. Tuning an algorithm means optimizing its parameters so that the model performs well not only on data it has already seen but also on new data. If a model is not tuned appropriately, it can overfit or underfit, and we may accept weak results in the belief that they are optimal.

Although the parameters to tune depend on the algorithm, the approaches to tuning are general. In this project, <b>GridSearchCV</b> was used to tune the algorithms. It exhaustively tries every combination of the supplied parameter values and returns the best-scoring one.

For our selected algorithms, we began with a few parameters and a range around their default values. Once we understood in which direction the model preferred a parameter, we moved steadily in that direction and stopped once the model settled on an optimal value. Afterwards, we added more parameters with ranges and monitored how the results were affected.

--

Based on the initial runs, <b>DecisionTreeClassifier()</b> was the only algorithm to exceed 0.3 in both precision and recall, and it had the highest precision and F1 score (the SVM reached a higher recall of 0.714, but with a precision of only 0.247). I decided to tune its parameters further while fixing the best parameters obtained from the initial run:

    [Trial 1]
    
    *dtc__criterion=['gini'],
    *dtc__min_samples_leaf=[3],
    *dtc__max_depth=[2],
    *dtc__class_weight=['balanced'],
    *dtc__min_samples_split=[2]

    
    Accuracy: 0.73773	Precision: 0.31425	Recall: 0.81800	F1: 0.45407	F2: 0.61942
    
    [Trial 2]
    
    *dtc__criterion=['gini'],
    *dtc__min_samples_leaf=[3],
    *dtc__max_depth=[2],
    *dtc__class_weight=['balanced'],
    *dtc__min_samples_split=[2],
    *dtc__max_leaf_nodes=[2,3,4,5]
    
    Accuracy: 0.63087	Precision: 0.24890	Recall: 0.87650	F1: 0.38770	F2: 0.58266
    
    [Trial 3]
    
    *dtc__criterion=['gini'],
    *dtc__min_samples_leaf=[3],
    *dtc__max_depth=[2],
    *dtc__class_weight=['balanced'],
    *dtc__min_samples_split=[4],
    *dtc__presort=[False],
    *dtc__splitter=['best']
    
    Accuracy: 0.73793	Precision: 0.31458	Recall: 0.81900	F1: 0.45456	F2: 0.62013

### What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric item: “validation strategy”]

Validation is the process of measuring how well an algorithm performs on data it was not trained on. A classic mistake is to measure performance on the <b>same data</b> the model was trained on, which hides overfitting to the training data. By splitting the data into a <i>training set and a testing set</i>, we can verify that the algorithm predicts accurately on new data. For this project, we used <b>30%</b> of the data for testing and <b>70%</b> for training.

The validation strategy used for this project is the [Stratified Shuffle Split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html), which creates random folds while maintaining the proportion of POIs and non-POIs in both the training and testing data.
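
The effect of stratification can be checked on the labels alone. The sketch below is illustrative: it assumes the class balance of the cleaned dataset (18 POIs and 126 non-POIs once the two non-employee rows are dropped), and the number of splits and the random seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels only: 18 POIs and 126 non-POIs.
y = np.array([1] * 18 + [0] * 126)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

test_sizes = []
test_poi_counts = []
for train_idx, test_idx in splitter.split(X, y):
    test_sizes.append(len(test_idx))
    test_poi_counts.append(int(y[test_idx].sum()))

print(test_sizes)
print(test_poi_counts)
```

Every test fold receives roughly the same share of POIs as the full dataset, which matters when only about one employee in eight is a POI.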

### Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

## Overall Results:

    *dtc__criterion=['gini'],
    *dtc__min_samples_leaf=[3],
    *dtc__max_depth=[2],
    *dtc__class_weight=['balanced'],
    *dtc__min_samples_split=[4],
    *dtc__presort=[False],
    *dtc__splitter=['best']
    
    Accuracy: 0.73793	Precision: 0.31458	Recall: 0.81900	F1: 0.45456	F2: 0.62013
    
    Total predictions: 15000
    True positives: 1445
    False positives: 3095
    False negatives:  555
    True negatives: 9905

The goal of this project was to obtain 30% or more in both precision and recall values. We define these terms as follows:

&#x25AA; <b>True positives (TP)</b>: POI employees correctly identified as POIs by our model.
<br>
&#x25AA; <b>True negatives (TN)</b>: non-POI employees correctly identified as non-POIs by our model.
<br>
&#x25AA; <b>False positives (FP)</b>: non-POI employees incorrectly identified as POIs by our model.
<br>
&#x25AA; <b>False negatives (FN)</b>: POI employees that our model failed to identify as POIs.



&#x25AA; <b>Precision</b> is the percentage of the employees flagged as POIs by our model that are actually POIs: TP / (TP + FP).
<br>
&#x25AA; <b>Recall</b> is the percentage of all actual POIs that our model identified: TP / (TP + FN).


In this situation, we ideally want to <b>minimize false negatives</b>, because missing a POI is worse than wrongly flagging an employee as a POI. Thus, in this case, we want our recall score to be high.
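
Precision, recall and the F-scores can be computed directly from confusion-matrix counts. The sketch below uses the counts listed under Overall Results; the helper name `precision_recall_fbeta` is hypothetical.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and the F-beta score from raw counts.

    beta > 1 weights recall more heavily than precision (F2 uses beta=2).
    """
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Counts as listed under Overall Results.
precision, recall, f1 = precision_recall_fbeta(tp=1445, fp=3095, fn=555)
_, _, f2 = precision_recall_fbeta(tp=1445, fp=3095, fn=555, beta=2.0)

print(precision, recall, f1, f2)
```

With these counts the precision is about 0.318 and the recall about 0.723, which differ from the 0.31458 and 0.81900 on the metrics line, so the listed counts may come from a different run than the listed metrics.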


### * URL Reference
https://en.wikipedia.org/wiki/Enron_scandal<br>
https://www.reddit.com/r/explainlikeimfive/comments/1lbvs7/eli5_enron_and_their_scandal/<br>
https://www.r-statistics.com/2013/05/log-transformations-for-skewed-and-wide-distributions-from-practical-data-science-with-r/<br>
https://stats.stackexchange.com/questions/72231/decision-trees-variable-feature-scaling-and-variable-feature-normalization<br>
http://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ <br>
http://www.simafore.com/blog/bid/62333/4-key-advantages-of-using-decision-trees-for-predictive-analytics <br>
https://en.wikipedia.org/wiki/Overfitting<br>
https://www.youtube.com/watch?v=o9A4e7zopu8<br>
https://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe<br>
https://stackoverflow.com/questions/22903267/what-is-tuning-in-machine-learning

### ** Python Reference

The purpose of the following code was to insert data for the report using pandas for a cleaner format.

In [93]:
import sys
import pickle
sys.path.append("../tools/")

import numpy as np
import pandas as pd

from collections import defaultdict

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

with open("final_project_dataset.pkl", "rb") as data_file:  # pickles are binary files
    data_dict = pickle.load(data_file)
    
df = pd.DataFrame.from_dict(data_dict, orient="index")

In [94]:
# Cleaning Financial and Email Data..

features_list = ['poi',
'salary',
'deferral_payments',
'total_payments',
'loan_advances',
'bonus',
'restricted_stock_deferred',
'deferred_income',
'total_stock_value',
'expenses',
'exercised_stock_options',
'other',
'long_term_incentive',
'restricted_stock',
'director_fees',
'to_messages',
'email_address',
'from_poi_to_this_person',
'from_messages',
'from_this_person_to_poi',
'shared_receipt_with_poi',
#'shared_receipt_with_poi_ratio',
#'sent_to_poi_ratio',
#'received_from_poi_ratio'
]

financial_features = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments', 
'loan_advances', 'other', 'expenses', 'director_fees', 'total_payments', 'exercised_stock_options', 
'restricted_stock', 'restricted_stock_deferred', 'total_stock_value']
email_features = ['email_address','to_messages', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi',
'shared_receipt_with_poi']

df = df.replace('NaN', np.nan, regex=True)

In [82]:
# Removes the email_address feature since unique
df.drop('email_address', axis=1, inplace=True)
email_features.remove('email_address')

# Removes outliers
df.drop('TOTAL', axis=0, inplace=True)
df.drop('THE TRAVEL AGENCY IN THE PARK', axis=0, inplace=True)

# Replace NaN with 0 to avoid issues
df[financial_features] = df[financial_features].replace(np.nan, 0)
df[email_features] = df[email_features].replace(np.nan, 0)

In [5]:
# Editing 'BHATNAGAR SANJAY' and 'BELFER ROBERT'
# (.loc assignment replaces the deprecated DataFrame.set_value)
corrections = {
    'BHATNAGAR SANJAY': {
        'total_payments': 137864,
        'expenses': 137864,
        'other': 0,
        'director_fees': 0,
        'exercised_stock_options': 15456290,
        'restricted_stock': 2604490,
        'restricted_stock_deferred': -2604490,
        'total_stock_value': 15456290,
    },
    'BELFER ROBERT': {
        'total_payments': 3285,
        'expenses': 3285,
        'deferred_income': -102500,
        'deferral_payments': 0,
        'director_fees': 102500,
        'exercised_stock_options': 0,
        'restricted_stock': 44093,
        'restricted_stock_deferred': -44093,
        'total_stock_value': 0,
    },
}

for name, values in corrections.items():
    for feature, value in values.items():
        df.loc[name, feature] = value

In [6]:
# Adding new email features...
def poi_ratio(poi_count, total_count):
    # Ratio of two email-count columns; 0 where either count is 0
    # (NaN was replaced with 0 earlier, so this also covers missing values)
    poi_values = df[poi_count].astype(float)
    total_values = df[total_count].astype(float)
    valid = (poi_values != 0) & (total_values != 0)
    return (poi_values / total_values).where(valid, 0.0)

df['received_from_poi_ratio'] = poi_ratio('from_poi_to_this_person', 'to_messages')
df['sent_to_poi_ratio'] = poi_ratio('from_this_person_to_poi', 'from_messages')
df['shared_receipt_with_poi_ratio'] = poi_ratio('shared_receipt_with_poi', 'to_messages')

In [7]:
ee = ['poi','to_messages', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi',
'shared_receipt_with_poi', 'shared_receipt_with_poi_ratio', 'sent_to_poi_ratio','received_from_poi_ratio']

ff = ['poi', 'salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments', 
'loan_advances', 'other', 'expenses', 'director_fees', 'total_payments', 'exercised_stock_options', 
'restricted_stock', 'restricted_stock_deferred', 'total_stock_value']

# What we want to transform from 'ff'
filtered_ff = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments', 
'loan_advances', 'other', 'expenses', 'director_fees', 'total_payments', 'exercised_stock_options', 
'restricted_stock', 'restricted_stock_deferred', 'total_stock_value']

log_df = df[ff].copy()  # copy so that df itself is not modified

In [8]:
# log(|x|) for every non-zero value; zeros stay 0 because log(1) = 0
for f in filtered_ff:
    values = log_df[f].astype(float)
    log_df[f] = np.log(values.abs().replace(0, 1))