# Udacity Data Analyst NanoDegree Project 5
## Bart Leatham  
## May 5, 2017

# Identify Fraud from Enron Email

In late 2001, Enron, one of the largest companies in the world at the time as measured by market cap, quickly fell into bankruptcy amidst the largest financial scandal in US history.  Due to the unscrupulous behavior and business dealings of many top-level executives at the company, Enron collapsed, erasing billions of dollars of shareholder value and employee retirement plans, all while many executives cashed in on hundreds of millions of dollars.  More reading on the Enron Scandal can be found [here](https://en.wikipedia.org/wiki/Enron_scandal).


## Purpose and Goals of this Project  
In this project, I apply sklearn machine learning algorithms in Python to publicly available Enron financial and email datasets to create a classifier that predicts Persons of Interest (POI) in the Enron scandal.  I experiment with a variety of machine learning classifiers and, in the end, define the final classifier and the parameters that yield the best results.  The files of interest for this project are:  
**enron61702insiderpay.pdf** - insider pay information for various Enron employees  
**poi_names.txt** - Persons of Interest in the Enron scandal; these are people who were indicted, settled without admitting guilt, or testified in exchange for immunity.  
**final_project_dataset.pkl** - data set with all features and labels included.  
**tester.py** - Python script for performing cross validation of the classifier.  
**poi.py** - main Python script where features are defined, outliers addressed, and classifiers experimented with and finally settled on.  
**my_feature_list.pkl** - .pkl file of final features I used.  
**my_dataset.pkl** - .pkl file of cleaned up dataset I used.  
**my_classifier.pkl** - .pkl file of my optimized classifier.  
**resources.txt** - list of webpages I used for assistance along the way.  


### Data Overview  
The dataset used in this project can be found here: https://github.com/udacity/ud120-projects/tree/master/final_project/emails_by_address  
The dataset provided includes many features and a POI identifier for a number of Enron employees.  Because every person carries a POI label, this is a supervised classification problem: machine learning algorithms can learn from the feature data to identify POIs, with precision and recall as the measures of success.  

```
Number of People: 146  
Number of Features: 21  
Number of POI is: 18  
```

For a variety of possible reasons, not all feature entries for all persons had values entered.  
**Missing Values per Feature:**
```  
loan_advances                142
director_fees                129
restricted_stock_deferred    128
deferral_payments            107
deferred_income               97
long_term_incentive           80
bonus                         64
from_this_person_to_poi       60
from_poi_to_this_person       60
from_messages                 60
shared_receipt_with_poi       60
to_messages                   60
other                         53
expenses                      51
salary                        51
exercised_stock_options       44
restricted_stock              36
total_payments                21
total_stock_value             20
poi                            0

```
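
Counts like the ones above can be tallied with a few lines of Python. The following is a minimal sketch, assuming the dataset is a dictionary of per-person feature dictionaries in which missing entries are stored as the string 'NaN'. The function name `count_missing` and the two toy records are made up for illustration and are not taken from poi.py.

```python
from collections import Counter

def count_missing(data_dict, missing="NaN"):
    """Return a Counter mapping feature name -> number of people missing it."""
    counts = Counter()
    for features in data_dict.values():
        for name, value in features.items():
            if value == missing:
                counts[name] += 1
    return counts

# Hypothetical toy records, not real Enron values.
toy_data = {
    "PERSON A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
}

missing = count_missing(toy_data)
n_people = len(toy_data)
n_poi = sum(1 for features in toy_data.values() if features["poi"])

print("Number of People:", n_people)
print("Number of POI is:", n_poi)
# Features sorted from most to least missing, as in the table above.
for name, count in missing.most_common():
    print(f"{name:<10}{count:>4}")
```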

### Outlier Identification
The first step in identifying outliers was to make a simple plot of salary vs. bonus, which showed one obvious, significant outlier.  
![Outlier](outlier.png)  

On referencing enron61702insiderpay.pdf and querying the data, I found that a TOTAL row (the sum of each feature across all people) was included, so I removed it. Next I queried the data for any people with a Salary or Bonus of 0.  This yielded 3 persons; referring again to enron61702insiderpay.pdf, I found 1 person with NaN for all entries, LOCKHART EUGENE E, so he was removed from the dataset.  While investigating enron61702insiderpay.pdf, I also noticed 'THE TRAVEL AGENCY IN THE PARK', which does not sound like a person's name, so it was removed from the dataset.  After outlier removal, the dataset contained 144 people.
![Outliers_removed](outliers_removed.png)  
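
The removal step described above could look roughly like the sketch below. It assumes the same dictionary-of-dictionaries layout with 'NaN' marking missing values. The helper names `find_empty_rows` and `remove_entries` and all numeric values are hypothetical; only the three key names come from the text.

```python
def find_empty_rows(data_dict, missing="NaN", ignore=("poi",)):
    """Return names whose every feature (except the ignored ones) is missing."""
    return [
        name for name, features in data_dict.items()
        if all(value == missing
               for key, value in features.items() if key not in ignore)
    ]

def remove_entries(data_dict, keys):
    """Delete the given keys in place; keys that are absent are skipped."""
    for key in keys:
        data_dict.pop(key, None)
    return data_dict

# Hypothetical toy dataset; the numbers are invented.
toy_data = {
    "TOTAL": {"salary": 300000, "bonus": 900000, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": 1000, "poi": False},
    "PERSON A": {"salary": 100000, "bonus": 300000, "poi": True},
}

empty_rows = find_empty_rows(toy_data)
remove_entries(toy_data, ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"] + empty_rows)
print(sorted(toy_data))
```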


## Feature Identification and Selection
### Features Available  
```
LABEL:
poi

FEATURE - KBest_Score 
exercised_stock_options - 24.815
total_stock_value - 24.183
bonus - 20.792
salary - 18.290
email_to_poi_ratio** - 16.410
deferred_income - 11.458
long_term_incentive - 9.922
restricted_stock - 9.212
total_payments - 8.773
shared_receipt_with_poi - 8.589
loan_advances - 7.184
expenses - 6.094
from_poi_to_this_person - 5.243
other - 4.187
email_from_poi_ratio** - 3.128
from_this_person_to_poi - 2.383
director_fees - 2.126 
to_messages - 1.646  
deferral_payments - 0.225
from_messages - 0.170
restricted_stock_deferred - 0.065

**these two features were added by me, explanation is below.
```
### Feature Creation, Scaling and Selection
After reading through the available features and understanding their meaning, I decided that a more relevant metric for emails to/from POIs would be the following 2 features:  
email_to_poi_ratio = from_this_person_to_poi/from_messages  
email_from_poi_ratio = from_poi_to_this_person/to_messages  
It makes more sense to look at the ratio of emails to/from a POI in relation to all to/from messages than to look at the raw number of messages to/from a POI.  If a person sends/receives very few emails, but they are all to/from a POI, then that person is more likely to be a person of interest.
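
A minimal sketch of the two ratio features follows. How missing values are handled is an assumption here: the sketch sets the ratio to 0.0 when either count is 'NaN' or the denominator is zero, which may differ from what poi.py does. The helper names and toy counts are hypothetical.

```python
def ratio(numerator, denominator, missing="NaN"):
    """Return numerator / denominator, or 0.0 if a value is missing or the denominator is 0."""
    if numerator == missing or denominator == missing or denominator == 0:
        return 0.0
    return float(numerator) / denominator

def add_email_ratios(data_dict):
    """Add email_to_poi_ratio and email_from_poi_ratio to every person in place."""
    for features in data_dict.values():
        features["email_to_poi_ratio"] = ratio(
            features["from_this_person_to_poi"], features["from_messages"])
        features["email_from_poi_ratio"] = ratio(
            features["from_poi_to_this_person"], features["to_messages"])
    return data_dict

# Hypothetical toy counts, not real Enron values.
toy_data = {
    "PERSON A": {"from_this_person_to_poi": 5, "from_messages": 10,
                 "from_poi_to_this_person": 20, "to_messages": 200},
    "PERSON B": {"from_this_person_to_poi": "NaN", "from_messages": "NaN",
                 "from_poi_to_this_person": "NaN", "to_messages": "NaN"},
}

add_email_ratios(toy_data)
print(toy_data["PERSON A"]["email_to_poi_ratio"])
```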

With the full feature list in place, I performed feature scaling on all of the features using the sklearn.preprocessing.MinMaxScaler. Since some features are in $USD, and others are counts or ratios, it makes sense to scale them so that they contribute proportionally to the classifier's calculations. 
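
With its default feature range of 0 to 1, MinMaxScaler rescales each feature column as (x - min) / (max - min). The plain-Python sketch below applies that formula to made-up values to show how dollar amounts and ratios end up on the same scale. The function `min_max_scale` is illustrative only, and returning 0.0 for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values for one dollar feature and one ratio feature.
salaries = [50000, 200000, 350000]
email_ratios = [0.0, 0.25, 1.0]

scaled_salaries = min_max_scale(salaries)
scaled_ratios = min_max_scale(email_ratios)
print(scaled_salaries)  # [0.0, 0.5, 1.0]
print(scaled_ratios)    # [0.0, 0.25, 1.0]
```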

Next I used SelectKBest to score each feature (the list of features above is sorted by score). To determine the optimal k value (number of features to keep), I used a pipeline with GridSearch cross validation to assess precision and recall scores for values of k from 5 to 18.  A k value of 10 offered the best performance, with a balance between precision and recall.
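
The search over k can be sketched as below. This is not the code from poi.py: the data is synthetic (generated with make_classification to roughly mimic 144 people with few positives), and the scoring function (f_classif, the SelectKBest default), the F1 scoring metric, the choice of DecisionTreeClassifier, and the 20 shuffle splits are all assumptions made for illustration. Only the range of k from 5 to 18 comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Enron features: 144 rows, 20 columns, imbalanced labels.
X, y = make_classification(n_samples=144, n_features=20, n_informative=5,
                           weights=[0.875], random_state=42)

# Scaling and feature selection live inside the pipeline so they are
# refit on the training portion of every cross-validation split.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

param_grid = {"select__k": list(range(5, 19))}  # k from 5 to 18
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_k = search.best_params_["select__k"]
print("best k:", best_k, "mean F1:", round(search.best_score_, 3))
```

Putting SelectKBest inside the pipeline keeps feature selection from seeing the held-out portion of each split, which would otherwise inflate the scores.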

### Final Features Used:
```
exercised_stock_options
total_stock_value 
bonus
salary
email_to_poi_ratio
deferred_income 
long_term_incentive
restricted_stock 
total_payments
loan_advances 
```

## Classifier Algorithms
In my experimentation I tested the GaussianNB, SVC, DecisionTree, KNeighbors and AdaBoost classifiers, analyzing results and doing some parameter tuning to gauge the potential of each. Note that the DecisionTree classifier does not require scaled data, though scaled data was used in this case.

### Classifier Results  
**GaussianNB**  
Precision: 0.34921  
Recall: 0.286  
F1: 0.31446   
Time: 2.08s  

**SVC**    
Precision: 0.30925  
Recall: 0.6635  
F1: 0.42187   
Time: 2.629s  

**DecisionTree**    
Precision: 0.30577  
Recall: 0.2990  
F1: 0.30235   
Time: 1.066s  

**KNeighbors**    
Precision: 0.25758  
Recall: 0.0765  
F1: 0.11796   
Time: 3.274s  

**AdaBoost**    
Precision: 0.4495  
Recall: 0.2985  
F1: 0.35877   
Time: 59.788s  
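
F1 is the harmonic mean of precision and recall, so it can be recomputed from the two figures reported for each classifier. A small sketch, using the GaussianNB and SVC numbers above (the function name `f1_from` is made up for illustration):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the results above.
reported = {
    "GaussianNB": (0.34921, 0.286),
    "SVC": (0.30925, 0.6635),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.5f}")
```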


### Classifier Tuning
Once each classifier had been tested with the tester.py script, I dove deeper into three of them before deciding on a final solution. For the SVC, DecisionTree and AdaBoost classifiers, I used GridSearch cross validation to tune parameters.

Parameter tuning is an important step in refining a machine learning classifier. Classifiers come with default settings for a variety of parameters that affect performance and outcome, and some classifiers have many tunable parameters with a large effect on the results. The goal of tuning is to fit those settings to the data being worked with. GridSearch automates this: it cross-validates the classifier over every combination of parameters in a grid and reports the best combination.

Tuning has its pitfalls. It is possible to end up overfitting or underfitting the classifier to the training data. An overfit classifier scores well on the training data but predicts the test data poorly in validation; an underfit classifier scores poorly on both. Time spent on tuning and validating is what yields good scores and a robust classifier.    

I found that the DecisionTree classifier yielded the best results from the tuning effort, and decided on using it as my final classifier. 

### Final Classifier Settings and GridSearch Parameters:  
clf = DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=2,  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; max_features=None, max_leaf_nodes=None,  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; min_impurity_split=1e-07, min_samples_leaf=10,  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; min_samples_split=2, min_weight_fraction_leaf=0.0,  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; presort=False, random_state=None, splitter='best')  

### DecisionTree GridSearch  
clf = DecisionTreeClassifier()  
param_grid = {'criterion': ['gini', 'entropy'],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'splitter': ['best', 'random'],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'max_depth': [None, 2, 5, 10],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'min_samples_split': [2, 10, 20],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'min_samples_leaf': [1, 5, 10],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'max_leaf_nodes': [None, 5, 10, 20],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'random_state': [None, 21, 42, 100]}  
clf = GridSearchCV(clf, param_grid, verbose=5, n_jobs = 2)  

## Validation
Validation was performed using the tester.py script, which uses a StratifiedShuffleSplit cross-validator with 1000 iterations. In each iteration, 90% of the data was used for training and the remaining 10% for testing. It is important to validate the classifier on data other than the data used for training.  If training and testing are performed on the same data, overfitting goes undetected: the classifier reports high accuracy but may be unable to correctly identify labels in new data.  Aggregating results over 1000 randomized, stratified 90/10 splits is a thorough way to validate on a dataset this small.

Each classifier was evaluated on accuracy, precision and recall. Due to the limited number of POIs in the dataset, accuracy was not considered in choosing a classifier. With 18 POIs in a dataset of 144 people, a classifier could reach an accuracy of 88% without correctly predicting any POIs.

Precision is the number of correctly identified POIs divided by the total number of people the classifier flagged as POIs. A higher precision value indicates fewer false positives. Recall is the number of correctly identified POIs divided by the total number of actual POIs in the dataset. A higher recall value indicates fewer false negatives, that is, more of the actual POIs are caught. For this case of identifying POIs in the Enron financial scandal, we would rather identify more POIs at the risk of falsely flagging innocent people than miss actual POIs, so recall is the most important measure of classifier validity.  
The final results for the DecisionTree algorithm after tuning were:  

**DecisionTree**    
Precision: 0.46543  
Recall: 0.46650  
F1: 0.4518   
Time: 1.066s    
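
To make the metrics concrete, the sketch below computes precision, recall and accuracy from confusion-matrix counts, where precision = TP / (TP + FP) and recall = TP / (TP + FN). The first call shows why accuracy is misleading here: a classifier that never predicts POI on 144 people with 18 POIs still scores 87.5% accuracy. The second set of counts is hypothetical and not taken from the tester.py output.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Never predicting POI: all 18 POIs are missed, all 126 non-POIs are correct.
baseline = precision_recall_accuracy(tp=0, fp=0, fn=18, tn=126)
print(baseline)  # (0.0, 0.0, 0.875)

# Hypothetical classifier that flags 20 people, 9 of them actual POIs.
example = precision_recall_accuracy(tp=9, fp=11, fn=9, tn=115)
print(example)
```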

## References  
research on gridsearch  
https://www.quora.com/How-do-I-properly-use-SelectKBest-GridSearchCV-and-cross-validation-in-the-sklearn-package-together  

sklearn, numerous searches for help with classifier algorithms  
http://scikit-learn.org  

searches for help with python coding  
https://stackoverflow.com/  

many searches for help with project and coursework explanations  
https://discussions.udacity.com/c/nd002-intro-to-machine-learning  

markdown cheatsheet  
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet  





