# P5 Identify Fraud from Enron Email - Ahmad Takatkah

## Question One:
- Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. 
- As part of your answer, give some background on the dataset and how it can be used to answer the project question.
- Were there any outliers in the data when you got it, and how did you handle those?

**Overview:**


The Enron scandal, publicized in October 2001, eventually led to the largest bankruptcy reorganization in American history at that time, the Enron Corporation bankruptcy. 

A group of Enron executives hid billions of dollars in debt from failed deals and projects, keeping the stock price up through accounting loopholes, special purpose entities, and poor financial reporting. 

**Project Goal:**


Many executives were indicted for a variety of charges and some were later sentenced to prison. This project aims to identify Enron staff who may have been involved in these fraudulent actions. To do this, the project uses a public dataset of Enron employees' financials and emails to identify a person of interest (POI). 

A POI is an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

**Using Machine Learning:**


Supervised machine learning algorithms learn patterns from a labeled dataset of already identified POIs (the training data) and then use those patterns to classify other employees. This saves time and effort and speeds up the investigation. 

The risk, however, is misclassification: false negatives (POIs that the algorithm fails to flag) and false positives (non-POIs that the algorithm mistakenly flags as POIs). 

**Dataset Initial Exploration:**


The dataset includes salaries, bonuses, and other financial incentives given to Enron employees, and the history of emails sent and received by Enron employees. 


I performed a quick exploratory analysis (provided in the submitted code file) to learn more about the provided dataset, and below are the main notes with the percentage of NaN values for each feature:


- Financial Features:

|#| Feature | Missing Values Percentage |
|---|---|---|
| 1 | salary feature | 35.0% |
| 2 | bonus feature | 44.0% |
| 3 | deferral_payments feature | 73.0% |
| 4 | total_payments feature | 14.0% |
| 5 | exercised_stock_options feature | 30.0% |
| 6 | restricted_stock feature | 25.0% |
| 7 | restricted_stock_deferred feature | 88.0% |
| 8 | total_stock_value feature | 14.0% |
| 9 | expenses feature | 35.0% |
| 10 | loan_advances feature | 97.0% |
| 11 | other feature | 36.0% |
| 12 | deferred_income feature | 66.0% |
| 13 | long_term_incentive feature | 55.0% |
| 14 | director_fees feature | 88.0% |

- Emails Features:

|#| Feature | Missing Values Percentage |
|---|---|---|
| 1 | to_messages feature | 41.0% |
| 2 | from_poi_to_this_person feature | 41.0% |
| 3 | from_messages feature | 41.0% |
| 4 | from_this_person_to_poi feature | 41.0% |
| 5 | shared_receipt_with_poi feature | 41.0% |
| 6 | email_address | Ignored | 


- Labels (Classifications):

|#| Feature | Missing Values Percentage |
|---|---|---|
| 1 | POI | 0.0% |

- General Notes:
    - The dataset provides details on 145 employees once the outlier below is removed. 
    - There was one clear **outlier**, which I removed: a record that holds the total of each feature across all employees rather than the values of a real person. To find outliers, I sorted the values of every feature in a separate dictionary (as shown in the code file).
    - All labels are provided (there are no missing labels). 
    - The number of POIs is 18 and the number of non-POIs is 128. These counts add up to 146 because they include the totals record that was later removed.
    - 44.0% of all feature values in the dataset are missing.
    - Every feature has missing values, at the rates detailed in the tables above. As a result, we can't rely on features that are missing most of their values, such as: 
        - loan_advances feature (97.0% missing values)
        - restricted_stock_deferred (88.0% missing values)
        - director_fees feature (88.0% missing values)
        - deferral_payments (73.0% missing values)
        - deferred_income feature (66.0% missing values)
        - long_term_incentive feature (55.0% missing values)
    - Missing feature values are imputed to zero by the `featureFormat` function provided by Udacity. 
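
The exploration steps in the notes above (sorting a feature to spot the outlier, then measuring missing values) can be sketched roughly as follows. This is a minimal illustration, not the submitted code: the tiny `data_dict`, the record names, the `"NaN"` string marker, and both helper functions are assumptions made for the example.

```python
# Hypothetical miniature of the project data: {person: {feature: value}}.
# Missing values are assumed to be stored as the string "NaN".
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 150000, "bonus": 50000, "poi": False},
    "TOTAL": {"salary": 350000, "bonus": 50000, "poi": False},
}


def top_by_feature(data, feature, n=1):
    """Return the n records with the largest non-missing value of a feature."""
    rows = [(name, rec[feature]) for name, rec in data.items()
            if rec[feature] != "NaN"]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def missing_percentages(data):
    """Return {feature: percentage of records whose value is 'NaN'}."""
    counts = {}
    for record in data.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value == "NaN":
                counts[feature] += 1
    return {f: round(100.0 * n / len(data), 1) for f, n in counts.items()}


# Sorting by salary puts the totals record at the top, which exposes it.
largest = top_by_feature(data_dict, "salary")
print(largest)

# Drop the totals record before computing any statistics.
data_dict.pop("TOTAL", None)
percentages = missing_percentages(data_dict)
print(percentages)
```

Removing the totals record first matters because it would otherwise count as a fully populated row and lower every missing-value percentage.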


----

## Question Two:
- What features did you end up using in your POI identifier, and what selection process did you use to pick them? 
- Did you have to do any scaling? Why or why not? 
- As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) 
- In your feature selection step:
    - if you used an algorithm like a decision tree, please also give the feature importances of the features that you use
    - if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  

**Feature Selection:**


I started with a large set of features (all features except `email_address`) and then used `SelectKBest` to select the best ones. 

**Feature Engineering:**


I added two new financial features: 
- salary_to_avg_salary: 
    This feature aims to show how close or far this employee's salary is from the average salary in the company. This might help in identifying a trend for overpaid executives compared to the majority of other employees and it might reduce the effect of outliers in salary that I decided to keep. 
- bonus_to_avg_bonus: 
    This feature aims to show how close or far this employee's bonus is from the average bonus in the company. This might also help in identifying a trend for overpaid executives compared to the majority of other employees and it might reduce the effect of outliers in bonus that I decided to keep.
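
One plausible way to build such ratio features is sketched below. The helper name `add_ratio_feature`, the sample records, and the choice to average only over non-missing values are assumptions for illustration; the submitted code may compute the average differently.

```python
# Hypothetical records in the same {person: {feature: value}} layout,
# with "NaN" assumed to mark a missing value.
data_dict = {
    "PERSON A": {"salary": 100000, "bonus": 20000},
    "PERSON B": {"salary": 300000, "bonus": "NaN"},
    "PERSON C": {"salary": "NaN", "bonus": 60000},
}


def add_ratio_feature(data, source, target):
    """Add target = value / average(source), skipping missing values.

    The average is taken over non-missing values only (an assumption).
    Records with a missing source value get "NaN" for the new feature.
    """
    values = [rec[source] for rec in data.values() if rec[source] != "NaN"]
    average = sum(values) / len(values)
    for rec in data.values():
        if rec[source] == "NaN":
            rec[target] = "NaN"
        else:
            rec[target] = rec[source] / average
    return average


avg_salary = add_ratio_feature(data_dict, "salary", "salary_to_avg_salary")
avg_bonus = add_ratio_feature(data_dict, "bonus", "bonus_to_avg_bonus")
print(avg_salary, avg_bonus)
print(data_dict["PERSON B"])
```

A ratio of 1.0 means the employee is paid exactly the average; values well above 1.0 mark the highly paid employees the feature is meant to surface.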
    
**Feature Scaling:**


Although I experimented with algorithms that need feature scaling, such as SVM, I removed scaling for the final classifier, `GaussianNB`, because it reduced that classifier's accuracy, precision, and recall scores. 


**Final Features Selected:**


After several experiments, my final classifier used only 4 of the original features and 2 of the new features I created:
- Original features used in the final classifier and their scores:

| Feature | Score |
|---|---|
| salary | 25.0975415287 |
| bonus | 24.4676540475 |
| exercised_stock_options | 21.0600017075 |
| total_stock_value | 21.0600017075 |


- Engineered features used in the final classifier and their scores:

| Feature | Score |
|---|---|
| salary_to_avg_salary | 18.575703268 |
| bonus_to_avg_bonus | 18.575703268 |


----

## Question Three:

- What algorithm did you end up using? 
- What other one(s) did you try? 
- How did model performance differ between algorithms?  

I used a pipeline and a grid search to experiment with 3 different algorithms, each combined with SelectKBest for feature selection. I also used the provided `tester.py` to measure the performance of my experiments. 

The table below lists the algorithms and their best performance: 

| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Total predictions | True positives | False positives | False negatives | True negatives |
|---|---|---|---|---|---|---|---|---|---|---|
|GaussianNB()|0.85080|0.42401|0.33200|0.37241|0.34706|15000|664|902|1336|12098|
|DecisionTreeClassifier()| 0.85200|0.37585|0.16650|0.23077|0.18737|15000|333|553|1667|12447|
|SVC()|0.87513|0.65526|0.13400|0.22250|0.15935|15000|268|141|1732|12859|


I ended up choosing GaussianNB() with SelectKBest because it was the only combination whose precision and recall were both above 0.3, as required. SVC() reached a higher precision (0.65526), but its recall was only 0.13400. 



-----

## Question Four:

- What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  
- How did you tune the parameters of your particular algorithm? 
- What parameters did you tune? 
     - (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). 



**Parameter Tuning:**


ML algorithms come with parameters whose values can be changed (tuned) to achieve the best possible performance. Tuning them poorly can lead to poor performance or unexpected results, because the algorithm either fails to learn from the provided dataset or overfits to it. 


**Used Algorithms and Their Parameters:**


I used a pipeline and a grid search to experiment with one feature selector and 3 different classifiers:
- SelectKBest() (Which I ended up using along with GaussianNB())
    - For feature selection, I experimented with the following parameter values:
        - 'selectkbest__k': range(2,22) (based on the number of features in the features_list and excluding the first one which is the Label: POI)


- GaussianNB() (Which I ended up using along with SelectKBest())
    - For Naive Bayes, there were no parameters to tune.
 
    
- DecisionTreeClassifier()
    - For decision trees, I experimented with the following parameter values:
        - 'tree__criterion' : ['gini', 'entropy'],
        - 'tree__max_depth' : [None, 1, 2, 3, 4],
        - 'tree__min_samples_split' : [2, 3, 4, 25],
        
        
- SVC()
    - For support vector machines, I experimented with the following parameter values:
        - 'svm__kernel' : ['rbf'],
        - 'svm__C' : [1, 10, 100, 1000, 10000],
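
A pipeline and grid search of this kind might be wired together as in the sketch below. It is not the submitted code: the data is synthetic, the step name `"nb"` is hypothetical, and the `cv=5` and `scoring="f1"` settings are assumptions. Only the `selectkbest__k` parameter name comes from the list above, with its range shortened to fit the 8 synthetic features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the features and POI labels (not real Enron data):
# 145 rows, 18 positives, 8 random features, one of them made informative.
rng = np.random.RandomState(42)
labels = np.array([1] * 18 + [0] * 127)
features = rng.normal(size=(145, 8))
features[:, 0] += 2.0 * labels

# Feature selection and classification run as one estimator, so the
# selector is re-fitted on the training part of every split.
pipeline = Pipeline([
    ("selectkbest", SelectKBest(score_func=f_classif)),
    ("nb", GaussianNB()),
])

# GaussianNB has nothing to tune here, so only k is searched.
param_grid = {"selectkbest__k": list(range(2, 9))}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(features, labels)

best_k = search.best_params_["selectkbest__k"]
selector = search.best_estimator_.named_steps["selectkbest"]
print("best k:", best_k)
print("feature scores:", selector.scores_.round(2))
```

Swapping the `"nb"` step for a decision tree or SVM step, and adding the `tree__` or `svm__` entries listed above to `param_grid`, would cover the other two experiments in the same way.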

-----

## Question Five:

- What is validation, and what’s a classic mistake you can make if you do it wrong? 
- How did you validate your analysis?  


Validation is the process of testing a trained ML algorithm on data it has not seen during training, to estimate how well it generalizes. Mistakes can vary, but the classic one is using the same dataset for both training and testing, which hides overfitting and inflates the scores. To avoid this, the dataset is usually split into two separate sets: a training set and a testing set. 

I used the `train_test_split` method to split the provided dataset (70% train, 30% test), and then the `GridSearchCV` method, which cross-validates each parameter combination over multiple train/validation splits. 
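
With only 18 POIs among 145 employees, a purely random 70/30 split can leave very few POIs in the test set. The sketch below shows a stratified split that keeps the class ratio the same on both sides. It is a plain-Python illustration of the idea, not the `train_test_split` call used in this project, and the function name `stratified_split` is hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Split row indices so each class keeps the same test fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the rows of one class, then cut off its test share.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Label layout assumed for illustration: 18 POIs (1) and 127 non-POIs (0).
labels = [1] * 18 + [0] * 127
train_idx, test_idx = stratified_split(labels)

test_pois = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_pois)
```

No index appears on both sides of the split, which is exactly the property that guards against the training-on-test-data mistake described above.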


-----

## Question Six:

- Give at least 2 evaluation metrics and your average performance for each of them.  
- Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. 


**Evaluation Metrics:**


Among other evaluation metrics, I can list precision and recall as two main measures of relevance for the performance of a machine learning algorithm.


For my algorithm (GaussianNB()), here are the values for those two measures:
   - Precision: 0.42401
   - Recall: 0.33200


**Interpretation:**


Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. In other words, of all the people the classifier flags as POIs, it is the fraction who truly are POIs. With a precision of 0.42401, about 42% of the flagged people are actual POIs. 

Recall (also known as sensitivity) is the fraction of all relevant instances that were retrieved. In other words, it is the share of all true POIs that the classifier manages to flag. With a recall of 0.33200, the classifier finds about a third of the actual POIs. 
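
These definitions can be checked against the GaussianNB() row of the results table in Question Three. The sketch below recomputes the metrics from that row's confusion counts; the function name `classification_metrics` is hypothetical, and the formulas are the standard definitions rather than code taken from `tester.py`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)  # flagged people who truly are POIs
    recall = tp / (tp + fn)     # true POIs the classifier managed to flag

    def f_beta(beta):
        # Weighted harmonic mean; beta > 1 weights recall more heavily.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


# Counts from the GaussianNB() row of the results table.
metrics = classification_metrics(tp=664, fp=902, fn=1336, tn=12098)
for name, value in metrics.items():
    print(name, round(value, 5))
```

The accuracy is high mostly because non-POIs dominate the dataset, which is why precision and recall are the more informative measures here.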

-----

### References:




- Sorting dictionaries: https://stackoverflow.com/questions/613183/sort-a-python-dictionary-by-value
- Access an arbitrary element in a dictionary in Python: https://stackoverflow.com/questions/3097866/access-an-arbitrary-element-in-a-dictionary-in-python
- Precision and recall: https://en.wikipedia.org/wiki/Precision_and_recall
- SelectKBest Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
- GaussianNB Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html 
- DecisionTreeClassifier Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- SVC Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

