# Identify Fraud from Enron Email

In [26]:

import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

### Import helper functions
from helper_functions import plotting_salary_expenses, dataset_info,\
    remove_outlier, ratio

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','salary',
                'bonus',
                'long_term_incentive',
                'deferred_income',
                'deferral_payments',
                'loan_advances',
                'other',
                'expenses',
                'director_fees',
                'total_payments',
                'exercised_stock_options',
                'restricted_stock',
                'restricted_stock_deferred',
                'total_stock_value',
                'to_messages',
                'from_messages',
                'from_this_person_to_poi',
                'from_poi_to_this_person']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files must be read in binary mode
    data_dict = pickle.load(data_file)
#my_dataset = data_dict



In [28]:
#plotting_salary_expenses(data_dict, 'salary', 'expenses')
#plotting_salary_expenses(data_dict, 'salary', 'from_poi_to_this_person')
#plotting_salary_expenses(data_dict, 'from_poi_to_this_person', 'from_this_person_to_poi')
#plotting_salary_expenses(data_dict, 'salary', 'from_this_person_to_poi')
#plotting_salary_expenses(data_dict, 'shared_receipt_with_poi', 'to_messages')

**I. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]**

The goal of this project was to identify persons of interest (POIs) in the Enron fraud case using machine learning. Machine learning is useful here because we can feed in many features, such as salary, bonuses, and emails to POIs, and predict from them whether a person is a POI. I worked with an excerpt of the Enron corpus and tried to build a model that predicts persons of interest. Exploring the dataset gave these statistics:

* Number of total datapoints:  146
* Number of features for each datapoint:  21
* Number of persons of interest in this dataset:  18
* Number of other people in this dataset:  128
* Total feature values missing:  1352
* Total feature values:  3190
* Percentage of feature values missing:  42.3824451411

The dataset is very limited, and about 42% of its values are missing. Removing outliers statistically by calculating quartiles did not get me far, so I plotted the features to find the most extreme outliers and found one that needed removal. The key "TOTAL" held the totals of all salaries and skewed the data. I removed it by popping "TOTAL" from the dict.
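
The outlier removal and the missing-value count can be sketched as follows. This is a minimal illustration on a made-up dictionary, assuming the same layout as `data_dict` (person name mapped to a feature dict, with the string `"NaN"` marking a missing value). The helper `count_missing` is hypothetical and is not the `dataset_info` function imported above.

```python
# Toy stand-in for data_dict; names and numbers are made up for illustration.
toy_data = {
    "PERSON A": {"poi": False, "salary": 200000, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": 300000, "bonus": 1000000},
    "TOTAL": {"poi": False, "salary": 500000, "bonus": 1000000},
}

# Drop the summary row; the default argument avoids a KeyError if the key is absent.
toy_data.pop("TOTAL", None)


def count_missing(data):
    """Hypothetical helper: return (missing, total) feature-value counts."""
    missing = sum(1 for feats in data.values() for v in feats.values() if v == "NaN")
    total = sum(len(feats) for feats in data.values())
    return missing, total


missing, total = count_missing(toy_data)
poi_count = sum(1 for feats in toy_data.values() if feats["poi"])

print("people: %d, POIs: %d" % (len(toy_data), poi_count))
print("missing: %d of %d (%.1f%%)" % (missing, total, 100.0 * missing / total))
```

Using `pop` with a default keeps the cell safe to re-run after the key has already been removed.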




**II. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “properly scale features”, “intelligently select feature”]**

I ended up using "SelectKBest" for feature selection. I found these scores for my features: 

- salary score is:  18.575703268
- bonus score is:  21.0600017075
- long_term_incentive score is:  10.0724545294
- deferred_income score is:  11.5955476597
- loan_advances score is:  0.21705893034
- expenses score is:  7.24273039654
- total_payments score is:  4.2049708583
- exercised_stock_options score is:  6.23420114051
- restricted_stock score is:  2.10765594328
- total_stock_value score is:  8.86672153711
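
Scores like the ones above can be read from a fitted selector's `scores_` attribute. The sketch below uses synthetic data and hypothetical feature names, and it assumes the default `f_classif` scoring function and `k=2`; the write-up does not state which scoring function or `k` was actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 60 samples, 4 hypothetical features.
rng = np.random.RandomState(42)
labels = np.array([0] * 45 + [1] * 15)
features = rng.normal(size=(60, 4))
features[:, 0] += 2.0 * labels  # make the first feature informative

feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names

# k=2 is an arbitrary choice for this illustration.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features, labels)

# Pair each feature with its ANOVA F-score and list them from best to worst.
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%s score is: %.3f" % (name, score))

# get_support() returns a boolean mask over the input columns.
selected = [n for n, keep in zip(feature_names, selector.get_support()) if keep]
print("selected: %s" % selected)
```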

The features are on very different scales: for example, the number of emails to POIs can be around 300, while salary can go up to 100 000, and so on for the other features.
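
Differences in scale matter most for distance-based classifiers such as k-nearest neighbors. The write-up does not say which scaler, if any, was applied, so the following is only a sketch of one common option, min-max scaling with `MinMaxScaler`, on made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: [emails to POIs, salary]; values are made up for illustration.
raw = np.array([
    [0.0, 50000.0],
    [150.0, 75000.0],
    [300.0, 100000.0],
])

# Rescale each column independently to the [0, 1] range.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)
print(scaled)
```

After scaling, both columns span the same range, so neither dominates a distance computation.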

I created a new feature called "ratio of poi_to and poi_from". I wanted to see the ratio between emails sent to POIs and emails received from POIs, and maybe find some patterns in this data.
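
One way such a ratio could be computed is sketched below. The function name `poi_email_ratio`, the direction of the ratio (sent over received), and the choice to return 0.0 for missing values are all assumptions for illustration; the `ratio` helper imported above may work differently.

```python
def poi_email_ratio(sent_to_poi, received_from_poi):
    """Hypothetical helper: emails sent to POIs divided by emails received from POIs.

    Returns 0.0 when either count is missing ("NaN") or the denominator is zero.
    """
    if sent_to_poi == "NaN" or received_from_poi == "NaN" or received_from_poi == 0:
        return 0.0
    return float(sent_to_poi) / float(received_from_poi)


# Toy records using the dataset's feature names; the numbers are made up.
toy_data = {
    "PERSON A": {"from_this_person_to_poi": 30, "from_poi_to_this_person": 60},
    "PERSON B": {"from_this_person_to_poi": "NaN", "from_poi_to_this_person": 12},
    "PERSON C": {"from_this_person_to_poi": 5, "from_poi_to_this_person": 0},
}

# Attach the engineered feature to every record.
for person, feats in toy_data.items():
    feats["poi_email_ratio"] = poi_email_ratio(
        feats["from_this_person_to_poi"], feats["from_poi_to_this_person"]
    )
    print("%s: %.2f" % (person, feats["poi_email_ratio"]))
```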

**III. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]**

My data were labeled, so I compared these supervised classifiers: naive Bayes, decision tree, and k-nearest neighbors. I ended up using the naive Bayes (GaussianNB) algorithm because it had the best performance when I tested using tester.py. From my testing I got these results:

- GaussianNB: Accuracy: 0.76800, Precision: 0.19544, Recall: 0.16300, F1: 0.17775, F2: 0.16860
- DecisionTree: Accuracy: 0.75677, Precision: 0.24428, Recall: 0.27750, F1: 0.25983, F2: 0.27015
- KNeighbors: Accuracy: 0.81231, Precision: 0.00673, Recall: 0.00150, F1: 0.00245, F2: 0.00178

I was actually surprised that KNeighbors got such low scores; I was expecting them to be higher.
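
A comparison of this kind can be sketched with `cross_validate`. The example runs on synthetic, imbalanced data rather than the Enron features and uses default hyperparameters and plain 5-fold cross-validation, so it is not a reproduction of the tester.py numbers above. It does show why accuracy alone can mislead when positives are rare: a classifier can score well on accuracy while finding almost no positives.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 15% positives) standing in for the real features.
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

classifiers = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

metrics = ("accuracy", "precision", "recall")
results = {}
for name, clf in classifiers.items():
    # An integer cv with a classifier uses stratified folds.
    scores = cross_validate(clf, X, y, cv=5, scoring=list(metrics))
    results[name] = {m: scores["test_" + m].mean() for m in metrics}
    print("%s: accuracy %.3f, precision %.3f, recall %.3f" % (
        name, results[name]["accuracy"],
        results[name]["precision"], results[name]["recall"]))
```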

**IV. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric item: “tune the algorithm”]**

Tuning the parameters of an algorithm means trying different parameter values for the classifier to get the best result with the least overfitting. I tuned mine automatically using GridSearchCV with a Pipeline. GridSearchCV searched the parameter grid and returned the combination that gave the best result for my algorithm.

Had I skipped tuning and kept the default values, I would have gotten lower precision and recall scores. Tuning was necessary to get the best result possible.
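
A grid search over a pipeline might look like the sketch below. The pipeline steps, the parameter grid, the `f1` scoring choice, and the synthetic data are all assumptions for illustration; the write-up does not list the actual grid. Since GaussianNB has little to tune, the sketch tunes the number of features kept by SelectKBest.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data standing in for the real features.
X, y = make_classification(n_samples=150, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", GaussianNB()),
])

# Hypothetical grid: step name, double underscore, then the parameter name.
param_grid = {"select__k": [2, 4, 6, 8, 10]}

# Stratified random splits keep the class ratio in every train/test split.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print("best parameters: %s" % search.best_params_)
print("best mean F1: %.3f" % search.best_score_)
```

Putting feature selection inside the pipeline means it is refit on each training split, so the test split never influences which features are chosen.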

**V. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric item: “validation strategy”]**


Validation means splitting your data into training and test sets so that performance is measured on data the model has not seen, which exposes overfitting. The classic mistake is to test on your training data, which gives misleadingly good results because the model has already seen those points.

I validated my analysis using StratifiedShuffleSplit and train_test_split.
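
The sketch below shows what stratified shuffle splitting does, using the `sklearn.model_selection` API on placeholder data. The label vector imitates the class balance described earlier (18 POIs among 145 people once "TOTAL" is dropped); the split count and test size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Placeholder labels and features; only the class balance matters here.
labels = np.array([1] * 18 + [0] * 127)
features = np.arange(len(labels)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

poi_counts = []
for train_idx, test_idx in splitter.split(features, labels):
    # Train and test indices never overlap within a split.
    assert len(set(train_idx) & set(test_idx)) == 0
    poi_counts.append(int(labels[test_idx].sum()))
    print("train: %d, test: %d, POIs in test: %d" % (
        len(train_idx), len(test_idx), poi_counts[-1]))
```

With so few POIs, stratification matters: a plain random split could leave a test set with almost no positives, which makes precision and recall unstable.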

**VI. Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]**

My selected evaluation metrics are precision and recall.

- Precision: the fraction of people predicted to be POIs who actually were persons of interest.

- Recall: the fraction of actual POIs that the classifier labeled as POIs. The higher this score, the fewer POIs the classifier misses.

Using tester.py, I got a precision of 0.39855 and a recall of 0.33000. In other words, about 40% of the people the classifier flagged were actual POIs, and it found 33% of the actual POIs in the dataset.
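
Both metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a made-up label vector; the function name `precision_recall` is hypothetical, and in practice `precision_score` and `recall_score` from `sklearn.metrics` give the same values.

```python
def precision_recall(true_labels, predicted_labels):
    """Hypothetical helper: return (precision, recall) for binary labels, 1 = POI."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # POIs correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-POIs wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # POIs that were missed
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 3 actual POIs, 2 people flagged, 1 of them correctly.
true_labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_labels = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(true_labels, predicted_labels)
print("precision: %.3f, recall: %.3f" % (precision, recall))
```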
