# Enron Submission Free-Response Questions

**Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?**

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.
These data have been combined with a hand-generated list of persons of interest (POIs) in the fraud case, meaning individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution. The result is a dataset of 21 features for 146 employees.

The goal of the project is to build an algorithm that can identify Enron employees who may have committed fraud. To achieve this, Exploratory Data Analysis and Machine Learning were used to clean outliers from the dataset, engineer new features and classify the employees as potential Persons of Interest.

During this process some outliers were revealed, probably caused by the data extraction from the [Payments Schedule](dataset/enron61702insiderpay.pdf). One datapoint, named 'TOTAL', matched the totals row of the Schedule, and two other datapoints had values transposed across features. The 'TOTAL' datapoint was removed from the dataset and the other two were corrected.
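As a minimal sketch of the removal step, assuming the data is held as a dictionary of per-person dictionaries: the names and figures below are hypothetical placeholders, not values from the real dataset.

```python
# Hypothetical miniature of a dictionary-of-dictionaries layout;
# the names and figures are illustrative, not taken from the real dataset.
data_dict = {
    "EMPLOYEE A": {"salary": 200000, "bonus": 500000, "poi": False},
    "EMPLOYEE B": {"salary": 150000, "bonus": 100000, "poi": True},
    "TOTAL": {"salary": 350000, "bonus": 600000, "poi": False},
}

# Drop the spreadsheet totals row; the default avoids a KeyError if it is absent.
removed = data_dict.pop("TOTAL", None)

print(sorted(data_dict))
```

The two transposed records would be fixed by reassigning each value to its correct feature, using the Payments Schedule as the reference.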

**What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.**

There are some cases where the value of a variable might be less important than its proportion to a related aggregated value. As an example from the current dataset, a bonus of 100,000 is less informative than a bonus of 3 times the salary, and "500 emails sent to POIs" is far less informative than "half of the sent emails went to POIs".
For this reason, and since every feature was related to an aggregated value, I computed the proportion of each feature to its respective aggregated value. These new features were added to the dataset, and the 'enhanced' dataset was evaluated with ```SelectPercentile(percentile=100)```.  
![features_importance](Figures/features_importance.png)  
The results showed that the proportions of "*Long Term Incentive*", "*Restricted Stock Deferred*" and "*From This Person to POI*" were more significant than the corresponding original features. I added these three proportions to the dataset and, at the same time, removed the original features to avoid biasing the model towards them.  
The classifier I used is not based on recursive partitioning, so scaling was required. Since the dataset was quite sparse, ```MaxAbsScaler()``` was selected to preserve the sparsity structure of the data. The final features and their importance after the above procedure were:  
![features_importance](Figures/features_importance2.png)
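The proportion features described in this answer could be derived along the following lines. This is only a sketch: the field names, the `"NaN"` string as the missing-value marker, and the helper `proportion` are assumptions made for illustration.

```python
import math


def proportion(part, total):
    """Return part / total, or 0.0 when a value is missing or the total is zero."""
    for value in (part, total):
        if value is None or value == "NaN":
            return 0.0
        if isinstance(value, float) and math.isnan(value):
            return 0.0
    if total == 0:
        return 0.0
    return part / total


# Hypothetical record; field names and values are assumptions for illustration.
record = {
    "bonus": 300000,
    "total_payments": 1000000,
    "from_this_person_to_poi": 50,
    "from_messages": 100,
    "long_term_incentive": "NaN",
}

record["bonus_ratio"] = proportion(record["bonus"], record["total_payments"])
record["to_poi_ratio"] = proportion(
    record["from_this_person_to_poi"], record["from_messages"]
)
record["long_term_incentive_ratio"] = proportion(
    record["long_term_incentive"], record["total_payments"]
)

print(record["bonus_ratio"], record["to_poi_ratio"])
```

Mapping missing values to 0.0 keeps the matrix sparse, which fits the choice of a sparsity-preserving scaler, although other imputation choices are possible.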


Afterward, I evaluated several classifiers with both Univariate Feature Selection and Principal Component Analysis (PCA), and I ended up using PCA with 2 principal components.
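To illustrate the scaling and reduction steps, here is a small sketch on a made-up matrix (the numbers are assumptions, not project data). It shows that `MaxAbsScaler` maps each column into [-1, 1] while leaving zeros as zeros, after which `PCA` projects the data onto 2 components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MaxAbsScaler

# Made-up sparse-ish feature matrix (rows: people, columns: features).
X = np.array([
    [200000.0, 0.0, 0.30],
    [0.0, -50000.0, 0.00],
    [400000.0, 0.0, 0.60],
    [100000.0, -25000.0, 0.10],
])

# Divide every column by its maximum absolute value; zero entries stay zero.
X_scaled = MaxAbsScaler().fit_transform(X)

# Project the scaled features onto the first two principal components.
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

print(X_scaled)
print(X_reduced.shape)
```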

**What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?**

The most appropriate algorithm for this case was Nearest Centroid. Below are all the evaluated algorithms and their performance.

|           Category           |        Algorithm       |   Accuracy  |  Precision  |    Recall   |      F1     |      F2     |
|:----------------------------:|:----------------------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
|    Support Vector Machine    |        LinearSVC       |   0.76887   |   0.25328   |   0.37650   |   0.30284   |   0.34311   |
|    Support Vector Machine    |           SVC          | **0.86933** | **0.77778** |   0.02800   |   0.05405   |   0.03469   |
| Nearest Neighbors            | KNeighborsClassifier   | 0.85747     | 0.45751     | 0.37150     | 0.41004     | 0.38601     |
| **Nearest Neighbors**        | **NearestCentroid**    | 0.73933     | 0.31052     | **0.78250** | **0.44460** | **0.60008** |
| Ensemble Methods (Averaging) | RandomForestClassifier | 0.80033     | 0.26208     | 0.27400     | 0.26791     | 0.27153     |
| Ensemble Methods (Boosting)  | AdaBoostClassifier     | 0.84847     | 0.40087     | 0.27600     | 0.32692     | 0.29434     |
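A comparison like the one in the table can be sketched as follows. This is not the evaluation script behind the reported numbers: the data is synthetic, every classifier uses default settings, and the split parameters are assumptions, so the printed scores will differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic, imbalanced stand-in for the Enron features (assumption).
X, y = make_classification(n_samples=144, n_features=10, n_informative=4,
                           weights=[0.875, 0.125], random_state=0)

classifiers = {
    "LinearSVC": LinearSVC(),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "NearestCentroid": NearestCentroid(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

# Few splits keep the sketch fast; a real comparison would use many more.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, clf in classifiers.items():
    # Scale inside the pipeline so the scaler is fit on training folds only.
    model = make_pipeline(MaxAbsScaler(), clf)
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

A classifier that never predicts the positive class yields an undefined precision, which scikit-learn reports as 0 with a warning.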


**What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm?**

Hyperparameter optimization, or model selection, is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Cross-validation is often used to estimate this generalization performance ([wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization#cite_note-bergstra-1)). If this process is not performed thoroughly, you may end up with an algorithm with degraded performance. If you do not follow the right methodology (dataset splitting or cross-validation), you may end up with an overfitted model that does not generalize well and cannot make good predictions on unseen data.  

For parameter optimization I used Exhaustive Grid Search with the following parameters:

|      Process      |    Algorithm    |     Parameter    |      Evaluated Values      |  Selected Value  |
|:-----------------:|:---------------:|:----------------:|:--------------------------:|:----------------:|
|      Scaling      |   MaxAbsScaler  |       copy       |      *default values*      | *default values* |
| Feature Selection |       PCA       |   n_components   |        [2, 3, 4, 5]        |         2        |
|   Classification  | NearestCentroid |      metric      | ["euclidean", "manhattan"] |    "manhattan"   |
|                   |                 | shrink_threshold |     [None, 0.1, 1, 10]     |       None       |
***
(*Note: Additional Scaling and Feature Selection methods were evaluated, but they did not perform as well as the above.*)
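The grid in the table can be expressed with scikit-learn's `Pipeline` and `GridSearchCV` roughly as below. This is a sketch under assumptions: the data is synthetic, the step names (`scale`, `reduce`, `classify`) are arbitrary, and the `f1` scoring and `test_size` values are illustrative choices, so the selected values need not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic, imbalanced stand-in for the Enron features (assumption).
X, y = make_classification(n_samples=144, n_features=10, n_informative=4,
                           weights=[0.875, 0.125], random_state=42)

pipeline = Pipeline([
    ("scale", MaxAbsScaler()),
    ("reduce", PCA()),
    ("classify", NearestCentroid()),
])

# Parameter values mirror the "Evaluated Values" column of the table.
param_grid = {
    "reduce__n_components": [2, 3, 4, 5],
    "classify__metric": ["euclidean", "manhattan"],
    "classify__shrink_threshold": [None, 0.1, 1, 10],
}

# 10 stratified shuffle splits; the test_size is an assumed value.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

# Scoring by F1 is an assumption; the document does not state the criterion.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
```

Placing the scaler and PCA inside the pipeline means they are refit on each training split, so no information from the held-out split leaks into preprocessing.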

**What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?**

Validation is the process of applying the model to a part of the dataset that was not used during model tuning, in order to evaluate its ability to generalize. When there are too few datapoints to set aside a separate split, Cross Validation can be applied, where several randomized splits are used for both model creation and validation. A classic mistake is to use the same data (without Cross Validation) for both creating and evaluating the model. This leads to overfitted models whose scores look overly optimistic and which perform very poorly on new datapoints.  

For my model I used a Stratified Shuffle Split with 10 splits for training the model and a Stratified Shuffle Split with 1000 splits for model selection and validation.
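The sketch below shows what a stratified shuffle split does with a class balance like this one. The 18 POIs match the count cited in this document, while the 126 non-POIs and the `test_size` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 18 POIs (count cited in this document) and 126 non-POIs (assumed figure).
y = np.array([1] * 18 + [0] * 126)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

poi_counts = []
for train_idx, test_idx in splitter.split(X, y):
    # Each test split keeps roughly the same POI share as the full dataset.
    poi_counts.append(int(y[test_idx].sum()))

print(poi_counts)
```

With so few positive examples, stratification ensures that every test split contains some POIs, which an unstratified random split does not guarantee.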

**Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.**

|           Category           |        Algorithm       |   Accuracy  |  Precision  |    Recall   |      F1     |      F2     |
|:----------------------------:|:----------------------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
|    Support Vector Machine    |        LinearSVC       |   0.76887   |   0.25328   |   0.37650   |   0.30284   |   0.34311   |
|    Support Vector Machine    |           SVC          | **0.86933** | **0.77778** |   0.02800   |   0.05405   |   0.03469   |
| Nearest Neighbors            | KNeighborsClassifier   | 0.85747     | 0.45751     | 0.37150     | 0.41004     | 0.38601     |
| **Nearest Neighbors**        | **NearestCentroid**    | 0.73933     | 0.31052     | **0.78250** | **0.44460** | **0.60008** |
| Ensemble Methods (Averaging) | RandomForestClassifier | 0.80033     | 0.26208     | 0.27400     | 0.26791     | 0.27153     |
| Ensemble Methods (Boosting)  | AdaBoostClassifier     | 0.84847     | 0.40087     | 0.27600     | 0.32692     | 0.29434     |

As can be seen in the table, the Support Vector Classifier performed best in Accuracy and Precision, and Nearest Centroid in Recall and the F scores. I ended up using Nearest Centroid because I wanted more balanced behavior: a high score in one metric can be misleading when it is combined with poor scores in the others. This can be demonstrated graphically.  

|SVC                    |Nearest Centroid                                 |
|:---------------------:|:-----------------------------------------------:|
|![SVC](Figures/svc.png)|![Nearest Centroid](Figures/nearest_centroid.png)|

It is clear that the extremely high Precision of SVC (compared to the rest) comes from its very "conservative" classification of datapoints. It makes two right picks, but it spots only 2 out of 18 POIs.  
On the other hand, Nearest Centroid produces some false positives, but in general it distinguishes POIs from non-POIs better.
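The F scores in the table follow from Precision and Recall through the standard F-beta formula, which weights Recall more heavily as beta grows. A small sketch that recomputes them from the reported values (the helper name `f_beta` is an illustrative choice):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs taken from the table above.
reported = {
    "SVC": (0.77778, 0.02800),
    "NearestCentroid": (0.31052, 0.78250),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1={f_beta(p, r):.5f}, F2={f_beta(p, r, beta=2):.5f}")
```

Because SVC's Recall is so low, its F1 and F2 collapse despite the high Precision, which is the imbalance the figures above illustrate.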