# Enron - Identifying POI in the Fraud Case
### Dustin Kopp

## Question 1 - Summary of project goal
__Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question.__

In this project I use machine learning to identify POIs (persons of interest) in the Enron fraud case. The available data includes financial data (such as salaries and bonuses) and email data (how many messages were sent and to whom) on 145 people related to the case. Only 18 of them are persons of interest. Because several of these features plausibly relate to whether someone is a POI, it makes sense to train a classifier on them. The classifier can then process a new person's data to determine whether they fit the profile of a POI.

__Were there any outliers in the data when you got it, and how did you handle those?__

There were a few outliers in the data. The first was the TOTAL line from the spreadsheet. Since this is just a tally of the other rows, I eliminated it. I also eliminated Eugene Lockhart because he had no data (either 0 or NaN) in any of the features. Finally, I eliminated The Travel Agency in the Park. I am looking for persons of interest, and a record showing about 320k in total payments with no other data points did not help me deduce who the POIs were.
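The removal step can be sketched as below. This is a minimal illustration, not the project's actual code: the toy `data_dict` values are made up, the key spellings are assumed to match the three records named above, and `remove_outliers` is a hypothetical helper.

```python
# Toy stand-in for the project's data dictionary; the values are made up.
data_dict = {
    "LAY KENNETH L": {"salary": 1, "poi": True},
    "TOTAL": {"salary": 2, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "poi": False},
}

# Assumed key spellings for the three records described above.
OUTLIER_KEYS = ["TOTAL", "LOCKHART EUGENE E", "THE TRAVEL AGENCY IN THE PARK"]


def remove_outliers(data, keys):
    """Return a copy of `data` without the listed keys; absent keys are ignored."""
    return {name: feats for name, feats in data.items() if name not in keys}


cleaned = remove_outliers(data_dict, OUTLIER_KEYS)
print(sorted(cleaned))
```

Returning a copy rather than popping keys in place keeps the original dictionary available for comparison.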

There were other data points that could be considered outliers, but I kept them in for a complete view of the data. Those were Kenneth Lay and Jeffrey Skilling. I primarily kept them in because they were POIs and very high up in the organization.


## Question 2 - Feature selection and creation
__What features did you end up using in your POI identifier, and what selection process did you use to pick them?__

Most of the time on this project went into selecting features. I knew at the outset that I wanted to eliminate email address, as it only serves as an additional identifier for the person. Then I plotted the data points that I thought, through intuition, would yield interesting results. This was done in a Jupyter notebook, which also helped me spot some outliers.

I drew scatter plots of several feature combinations to look for trends by eye. There were some, but automated tools can help confirm them. By eye, I would have opted to use poi, salary, total payments, bonus, deferred income, expenses, and exercised stock options. I might also have included some ratio of messages sent to POIs vs. total messages sent.

I created a feature_selection.py file to hold some methods for determining how those tools responded to the input. 

First, I ran SelectKBest feature selection to get 6 features. This yielded salary, total payments, bonus, total stock value, exercised stock options, and shared receipt with poi.
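A minimal sketch of this kind of selection follows. It runs on synthetic data rather than the Enron features, and it assumes the default `f_classif` scoring function, since the text does not say which one was used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data; the real project would use the Enron features.
X, y = make_classification(
    n_samples=140, n_features=10, n_informative=4, random_state=0
)
feature_names = ["feature_%d" % i for i in range(X.shape[1])]

# Keep the 6 features with the highest univariate scores.
selector = SelectKBest(score_func=f_classif, k=6)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
print(selected)
```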

Next, I ran tree-based selection using ExtraTreesClassifier and SelectFromModel from sklearn. This yielded salary, total payments, bonus, deferred income, total stock value, expenses, exercised stock options, other, long term incentive, shared receipt with poi, and restricted stock. That is 11 features when I only really wanted 6, so I looked at the importances from the classifier. The best 6 were salary, bonus, deferred income, expenses, exercised stock options, and restricted stock.
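The tree-based step might look roughly like this. The data is synthetic, `n_estimators=100` and `random_state=0` are illustrative choices, and `SelectFromModel` is left at its default threshold (the mean importance), which may differ from what the project used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data; the real project would use the Enron features.
X, y = make_classification(
    n_samples=140, n_features=10, n_informative=4, random_state=0
)

# Fit the trees inside SelectFromModel; features whose importance is at
# least the mean importance are kept by default.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)
X_reduced = selector.transform(X)

# To force exactly 6 features, rank the importances directly.
importances = selector.estimator_.feature_importances_
top_six = np.argsort(importances)[::-1][:6]
print(X_reduced.shape, top_six)
```

Ranking the importances by hand, as in the last step, is one way to cut the automatic selection down to a fixed number of features.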

In the end, I went with my intuition. The final model reported its best accuracy, precision, and recall with the intuition-based feature set.
 
__Did you have to do any scaling? Why or why not?__

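My first classification attempt used KMeans (a clustering algorithm, used here as a classifier). KMeans relies on distances between points, so it required scaling the features. I did not like the results from it, though, so I switched to an ensemble classifier, Random Forest, which splits on one feature at a time and does not require scaling.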
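As an illustration of the scaling step, the sketch below uses `MinMaxScaler`. The choice of scaler is an assumption (the text does not name one), and the numbers are made up to mimic a dollar-valued column next to a message-count column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: column 0 is on a dollar scale, column 1 on a count scale.
X = np.array([
    [200000.0, 10.0],
    [1000000.0, 50.0],
    [600000.0, 30.0],
])

# Rescale each column independently to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```

Without rescaling, the dollar-valued column would dominate any Euclidean distance that KMeans computes.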

__As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.)__

I created a ratio from the email data, specifically all POI messages / all messages. My reasoning was that if someone only had 5 messages with a known POI it may look like nothing, but if those 5 were 50% or more of their total messages then it could be a clue. I didn't end up using this feature.

I also created a value for (restricted_stock + restricted_stock_deferred)/total_stock_value. I thought if someone had a lower ratio of restricted stock lurking in their total stock value, it may make them more or less likely to be a POI. I also did not end up using this value.
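Both engineered features can be sketched as follows. The helper names and the new feature names (`poi_message_ratio`, `restricted_stock_ratio`) are hypothetical. Treating "all POI messages" as messages to plus from POIs, "all messages" as `to_messages` plus `from_messages`, and the string `'NaN'` as zero are assumptions about how the ratio was computed.

```python
def num(value):
    """Treat the 'NaN' placeholder (or None) as zero; assumed convention."""
    return 0.0 if value in ("NaN", None) else float(value)


def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def add_ratio_features(person):
    """Return a copy of `person` with the two engineered ratios added."""
    poi_msgs = num(person.get("from_poi_to_this_person")) + num(
        person.get("from_this_person_to_poi")
    )
    all_msgs = num(person.get("to_messages")) + num(person.get("from_messages"))
    restricted = num(person.get("restricted_stock")) + num(
        person.get("restricted_stock_deferred")
    )
    enriched = dict(person)
    enriched["poi_message_ratio"] = ratio(poi_msgs, all_msgs)
    enriched["restricted_stock_ratio"] = ratio(
        restricted, num(person.get("total_stock_value"))
    )
    return enriched


# Made-up record for illustration only.
example = {
    "from_poi_to_this_person": 3,
    "from_this_person_to_poi": 2,
    "to_messages": 6,
    "from_messages": 4,
    "restricted_stock": 50,
    "restricted_stock_deferred": "NaN",
    "total_stock_value": 200,
}
result = add_ratio_features(example)
print(result["poi_message_ratio"], result["restricted_stock_ratio"])
```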

__In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.__

For the decision tree feature selection my feature importances were: 

* 'salary'                    0.07200787 
* 'to_messages'               0.03959709
* 'deferral_payments'         0.02295979
* 'total_payments'            0.05992697
* 'loan_advances'             0.01046504
* 'bonus'                     0.1202399  
* 'restricted_stock_deferred' 0.00607455
* 'deferred_income'           0.0784688  
* 'total_stock_value'         0.06179893
* 'expenses'                  0.07296664 
* 'from_poi_to_this_person'   0.04306053
* 'exercised_stock_options'   0.06914693 
* 'from_messages'             0.03462905
* 'other'                     0.06480855 
* 'from_this_person_to_poi'   0.03546818
* 'long_term_incentive'       0.06603779 
* 'shared_receipt_with_poi'   0.06460563
* 'restricted_stock'          0.07503367 
* 'director_fees'             0.00270408

I took the 6 features with the highest importances. With many more than that, I would be at risk of overfitting.

For the SelectKBest the feature scores were:

* 'salary' 1.58587309e+01
* 'to_messages' 2.61618300e+00 
* 'deferral_payments' 9.98239959e-03  
* 'total_payments' 8.95913665e+00
* 'loan_advances' 7.03793280e+00 
* 'bonus' 3.07287746e+01
* 'restricted_stock_deferred' 7.27124110e-01 
* 'deferred_income' 8.79220385e+00
* 'total_stock_value' 1.06338520e+01 
* 'expenses' 4.18072148e+00
* 'from_poi_to_this_person' 4.95866668e+00
* 'exercised_stock_options' 9.68004143e+00
* 'from_messages' 4.35374099e-01 
* 'other' 3.20445914e+00
* 'from_this_person_to_poi' 1.11208239e-01 
* 'long_term_incentive' 7.55511978e+00    
* 'shared_receipt_with_poi' 1.07225708e+01
* 'restricted_stock' 8.05830631e+00
* 'director_fees' 1.64109793e+00

Similarly, I took the 6 best-scoring features from SelectKBest. This felt like a reasonable balance between bias and variance.
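As a quick check, the top six can be read off by sorting the SelectKBest scores reported above. The dictionary below simply transcribes those values; only the sorting logic is new.

```python
# SelectKBest scores transcribed from the list above.
kbest_scores = {
    "salary": 1.58587309e+01,
    "to_messages": 2.61618300e+00,
    "deferral_payments": 9.98239959e-03,
    "total_payments": 8.95913665e+00,
    "loan_advances": 7.03793280e+00,
    "bonus": 3.07287746e+01,
    "restricted_stock_deferred": 7.27124110e-01,
    "deferred_income": 8.79220385e+00,
    "total_stock_value": 1.06338520e+01,
    "expenses": 4.18072148e+00,
    "from_poi_to_this_person": 4.95866668e+00,
    "exercised_stock_options": 9.68004143e+00,
    "from_messages": 4.35374099e-01,
    "other": 3.20445914e+00,
    "from_this_person_to_poi": 1.11208239e-01,
    "long_term_incentive": 7.55511978e+00,
    "shared_receipt_with_poi": 1.07225708e+01,
    "restricted_stock": 8.05830631e+00,
    "director_fees": 1.64109793e+00,
}

# Sort feature names by score, highest first, and keep the first six.
top_six = sorted(kbest_scores, key=kbest_scores.get, reverse=True)[:6]
print(top_six)
```

The result matches the six SelectKBest features listed earlier, with bonus and salary scoring highest.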


## Question 3 - Algorithm selection
__What algorithm did you end up using?__

I ended up selecting the Random Forest algorithm. 

__What other one(s) did you try?__

I tried a KMeans classifier and an AdaBoost classifier as well. 

__How did model performance differ between algorithms?__
```
               Intuition Features    KBest Features    Tree Features
KMeans: 
  Accuracy:    .9024                 .8837             .8095
  Precision:   .5                    .0                .3333
  Recall:      .25                   .0                .6
 
Random Forest: 
  Accuracy:    .9268                 .8837             .9048
  Precision:   .6667                 .0                .6667
  Recall:      .5                    .0                .4
 
AdaBoost:  
  Accuracy:    .8537                 .7674             .8571
  Precision:   .25                   .0                .3333
  Recall:      .25                   .0                .2
```

This shows that the best scores I received were with my intuition features using the Random Forest classifier. 
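For reference, the three metrics in the table can be computed with sklearn as sketched below. The labels and predictions are made up for illustration and are not the project's results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 1 marks a POI, 0 a non-POI.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# 2 true positives, 1 false positive, 2 false negatives, 5 true negatives.
accuracy = accuracy_score(y_true, y_pred)     # (2 + 5) / 10
precision = precision_score(y_true, y_pred)   # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)         # 2 / (2 + 2)
print(accuracy, precision, recall)
```

Because POIs are a small minority, accuracy can stay high even when precision and recall are 0, which is why all three are worth reporting side by side.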

## Question 4 - Tuning parameters
__What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?__

Each machine learning algorithm takes parameters that customize how it handles a specific dataset. Changing these parameters can drastically change how the classifier labels a given data point. If they are tuned poorly, the model can underfit, or it can overfit the training data and perform badly on new data.

__How did you tune the parameters of your particular algorithm?__

I ran a GridSearchCV on this classifier and passed in several value options for n_estimators, max_features, max_depth, criterion, and min_samples_split.

__What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).__

I ended up tuning n_estimators, criterion, max_depth, and min_samples_split.
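A grid search over these four parameters could be set up roughly as below. The candidate values, the `f1` scoring choice, the 3-fold setting, and the synthetic data are all assumptions made for illustration; the grid actually searched is not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (roughly 15% positive class).
X, y = make_classification(
    n_samples=140, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)

# Illustrative candidate values only.
param_grid = {
    "n_estimators": [10, 50],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
}

# Try every combination with 3-fold cross-validation, scored by F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring on F1 rather than accuracy is one way to keep the search from favoring models that simply predict the majority class.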



## Question 5 - Validation
__What is validation, and what’s a classic mistake you can make if you do it wrong?__

Validation is the process of checking whether a trained model correctly predicts a testing set that was not used during training. This is why we split the dataset into training and testing sets and train only on the training features and labels. A classic mistake is to train and evaluate on the same data: if we trained on the entire dataset, we would have no data left on which to test the model, and an overfit model would look better than it really is.

__How did you validate your analysis?__

I used the stratified KFold in the tester.py script with 3 folds on my model. Given more data points I likely would have increased the number of folds to 5 or 10 to get a better spread of test vs train data. 
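The effect of stratification with 3 folds can be sketched as follows. The label counts (18 POIs, 120 non-POIs) are illustrative rather than the exact dataset size, and `shuffle=True` with a fixed `random_state` is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative labels: 18 POIs (1) and 120 non-POIs (0).
labels = np.array([1] * 18 + [0] * 120)
features = np.zeros((len(labels), 1))  # placeholder feature matrix

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Count how many POIs land in each test fold.
poi_per_fold = [
    int(labels[test_idx].sum()) for _, test_idx in skf.split(features, labels)
]
print(poi_per_fold)
```

Stratifying keeps the POI share of each test fold close to that of the full dataset, which matters when the positive class is this small.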




## Question 6 - Evaluation
__Give at least 2 evaluation metrics and your average performance for each of them.__

__Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.__

