# POI Identifier in Enron Dataset
## Chris Tyndall
### 10/16/16

### Introduction

Enron was a publicly traded American energy company whose stock price fell from \$90 to less than \$1 per share between 2000 and 2001, with many employees selling shares ahead of the worst of the decline. The company filed for bankruptcy shortly after, and several former employees were convicted on fraud and insider trading charges. Investigations revealed accounting practices that misled the public about the health of the company, and the unexpected bankruptcy was at the time the largest Chapter 11 filing ever. The subsequent charges, investigations, and findings revealed that many employees knew fraud was occurring. In addition, the investigations led to the public release of thousands of e-mails and financial records for many employees. These documents make for an interesting dataset to analyze in an attempt to classify fraudulent employees.

The goal of this project is to analyze this publicly available data using machine learning methods to identify persons of interest (POIs) and, in the process, identify useful features, compare classifiers, and tune them to provide the most useful POI classification. Enron was not the only company to lose significant value unexpectedly, and a myriad of similar events, particularly related to the housing market, led to the Great Recession. Detecting company fraud and unusual business practices has gained more attention than ever before, particularly with the emergence of widely available machine learning techniques. By understanding the features and patterns that reflect unusual employee activity, appropriate machine learning techniques can be applied to detect fraud within companies before public damage occurs.

---
___

### Feature Selection

#### Initial dataset

The ready-made dataset included 19 features: 14 financial and 5 derived from the e-mail corpus. There was also an additional feature for e-mail address, which was removed because it has no numerical value. Of the 14 financial features, 10 were related to monetary compensation (e.g. salary, bonus) and the other 4 were related to stock benefits.

The dataset is very limited, and there are many missing values. In fact, 62% $(\frac{1708}{2774})$ of the dataset entries (ignoring 'poi' and 'email_address') are "NaN".
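
The missing-value fraction above can be counted with a short loop. This is a minimal sketch, assuming the dataset is a dictionary that maps each name to a dictionary of features, with the string "NaN" marking a missing value. The helper name `nan_fraction` and the two records in `toy_data` are hypothetical and do not contain real Enron values.

```python
def nan_fraction(data, ignore=("poi", "email_address")):
    """Return (missing, total) over all entries, skipping the ignored keys."""
    missing = 0
    total = 0
    for features in data.values():
        for name, value in features.items():
            if name in ignore:
                continue
            total += 1
            if value == "NaN":
                missing += 1
    return missing, total


# Hypothetical toy records (assumption), only to exercise the function.
toy_data = {
    "PERSON A": {"poi": False, "email_address": "a@example.com",
                 "salary": 100, "bonus": "NaN"},
    "PERSON B": {"poi": True, "email_address": "NaN",
                 "salary": "NaN", "bonus": "NaN"},
}

missing, total = nan_fraction(toy_data)
print(f"{missing}/{total} = {missing / total:.0%} missing")
```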

The following people only have 2 features with values:
* WODRASKA JOHN
* WHALEY DAVID A
* WROBEL BRUCE
* SCRIMSHAW MATTHEW
* GRAMM WENDY L

And the following person has no values listed:
* LOCKHART EUGENE E

Eugene Lockhart is removed from the dataset, as his record provides no information. The other 5 people are not POIs, but removing them would reduce the already limited dataset even further. They are left in for now and will be removed only if their removal drastically affects performance.
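
Records like these can be found by counting the non-"NaN" values per person. The sketch below assumes the same dictionary-of-dictionaries layout with "NaN" strings; `value_counts` and the toy records are hypothetical names and data used only for illustration.

```python
def value_counts(data, ignore=("poi", "email_address")):
    """Map each person to the number of features that are not 'NaN'."""
    return {
        name: sum(1 for key, value in features.items()
                  if key not in ignore and value != "NaN")
        for name, features in data.items()
    }


# Hypothetical toy records (assumption), not real Enron values.
toy_data = {
    "PERSON A": {"poi": False, "salary": 100, "bonus": 50, "expenses": 10},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 5, "expenses": 1},
    "PERSON C": {"poi": False, "salary": "NaN", "bonus": "NaN", "expenses": "NaN"},
}

counts = value_counts(toy_data)
sparse = sorted(name for name, n in counts.items() if 0 < n <= 2)
empty = sorted(name for name, n in counts.items() if n == 0)

# Drop only the records that carry no information at all.
cleaned = {name: f for name, f in toy_data.items() if name not in empty}
print(sparse, empty, len(cleaned))
```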

---

#### Outliers

The dataset used in this analysis included features extracted from a financial document regarding compensation for employees and the e-mail corpus containing thousands of e-mails from and to employees.  As a result of generating the financial data from the financial document, there were two unnecessary entries:
  * TOTAL
  * THE TRAVEL AGENCY IN THE PARK
  
TOTAL was quickly revealed as an outlier when plotting any of the financial features. It was always the extreme value. This line item at the bottom of the financial document is the sum of all the payments and should be removed, as it does not reflect an individual.

THE TRAVEL AGENCY IN THE PARK did not have an outlier value, but it had only 2 values across all financial and e-mail features: the 'Other' and 'Total Payments' features of the financial data. The name suggests this is not an individual, and the footnote on the financial document reveals that this was a travel-related account. Because I am trying to identify persons of interest, this non-person has also been removed from the dataset.

Another potential outlier was the 'Loan Advance' and 'Total Payments' attributed to Kenneth Lay, as both were unusually large numbers. However, only 3 persons in the dataset have a 'Loan Advance' value, and the footnote explains that loan payments were made with stock. This number is likely accurate considering Kenneth Lay was the CEO, so the outlier was not removed. The 'Loan Advance' feature itself was not actually used because only 3 persons have a value for it, so that outlier was irrelevant. The unusually high number did affect the 'Total Payments' value for Lay and pushed it far above the others, to more than \$100 million. Again, this value reflects the actual (if absurdly high) value for Lay and was therefore not removed from the dataset.
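
Dropping the two non-person entries while keeping genuine extreme values is a simple key filter. This is a minimal sketch assuming the dataset is a dictionary keyed by name; `drop_entries` and the toy values are hypothetical.

```python
NON_PERSON_KEYS = ("TOTAL", "THE TRAVEL AGENCY IN THE PARK")


def drop_entries(data, keys):
    """Return a copy of data without the given keys; absent keys are ignored."""
    return {name: features for name, features in data.items()
            if name not in keys}


# Hypothetical toy records (assumption); the numbers are made up.
toy_data = {
    "PERSON A": {"total_payments": 1000},
    "PERSON B": {"total_payments": 90000},   # large but real: kept
    "TOTAL": {"total_payments": 91000},      # spreadsheet sum row: dropped
    "THE TRAVEL AGENCY IN THE PARK": {"total_payments": 300},
}

people_only = drop_entries(toy_data, NON_PERSON_KEYS)
print(sorted(people_only))
```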

---

#### Feature Removal

I took the initial dataset and looked through each feature to get an understanding of its distribution using a nice Tableau dashboard created by Diego<sup>1</sup>.

The two features **restricted_stock_deferred** and **director_fees** had no values for any POIs and therefore are not meaningful in distinguishing POIs from non-POIs. Leaving them in could potentially cause them to be weighted too heavily, since any value other than 'NaN' would indicate a non-POI.

The feature **loan_advances** had only 3 total values and was not used.
 
There are also two additional features that I chose to remove from the dataset: **total_payments**, **total_stock_value**.
These two features represent the sum of several other features. Therefore, there is no new information contained within these features, and they are entirely dependent on the other features. They could arguably replace other individual features since they reflect an underlying total financial compensation. However, I will remove them from the analysis because principal component analysis (PCA) should be more effective in extracting this underlying compensation metric, while the classifier will appropriately scale the weights for each individual component.

I also removed the 2 features **from_poi_to_this_person** and **from_this_person_to_poi** and replaced them with a ratio to total sent/received e-mails as described below.

---

#### Feature Addition

Many of the distributions within the financial features have large ranges and the appearance of outliers.  However, the data is real, and comparing some features against each other in scatterplots reveals some non-linear relationships (e.g. salary vs bonus).  This means that some features (such as bonus) have non-linear distributions and may correlate to other features in non-linear manners.  I have therefore transformed some of the features using both sqrt and log transformations:
 * sqrt(bonus)
 * log(bonus)
 * sqrt(exercised_stock_options)
 * log(exercised_stock_options)
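
The transformed features listed above can be added per record as sketched below. The text does not say how missing or non-positive values were handled or which logarithm base was used, so this sketch assumes the natural logarithm and leaves both transforms as "NaN" whenever the input is missing or not positive. The helper name `add_transforms` is hypothetical; the `_sqrt` and `_log` suffixes follow the feature names used in this report.

```python
import math


def add_transforms(features, name):
    """Add '<name>_sqrt' and '<name>_log' to a feature dictionary.

    Assumption: missing or non-positive inputs yield 'NaN' for both
    transforms; the natural logarithm is used.
    """
    value = features.get(name, "NaN")
    if value == "NaN" or value <= 0:
        features[name + "_sqrt"] = "NaN"
        features[name + "_log"] = "NaN"
    else:
        features[name + "_sqrt"] = math.sqrt(value)
        features[name + "_log"] = math.log(value)
    return features


# Hypothetical values (assumption), only to show the behaviour.
with_bonus = add_transforms({"bonus": 10000.0}, "bonus")
without_bonus = add_transforms({"bonus": "NaN"}, "bonus")
print(with_bonus["bonus_sqrt"], without_bonus["bonus_log"])
```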
 
I also attempted to create some new word features by selecting single common words among POIs that showed up with a higher frequency within e-mails sent from POIs. I made a word count dictionary from e-mails that were sent from POIs and created a frequency count by dividing by the total number of e-mails for each POI. I then looked at the top 200 words from each (ignoring certain stopwords), and found the common words that showed up in the list for multiple POIs. From this list, I chose words that seemed somewhat meaningful and that a person potentially engaged in fraud might use more often. I selected the words with roots 'enron', 'team', 'want', 'let', 'veri', 'issu', 'provid', and 'depreci' and then extracted the word frequencies for all people in the dataset. Each feature is added to the dataset under the name of the word root that it counts.

It seems that these words are likely common among all employees and simply typical of the business itself. By plotting these features and coloring POI/non-POI, I couldn't see any particular divisions among the word frequencies. Obviously, the company name 'enron' should appear quite frequently among all e-mails, and it seems that the other words are simply commonly used across the company. Before deciding not to use these features, I tested my classifier with and without them and got better results without them. This is not too surprising given the common usage of these words among all e-mails. A classifier that attempts to make use of word counts would likely have to use a much larger feature space and include possibly hundreds of different words.
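
The per-person word frequencies described above amount to counting each root and dividing by the number of e-mails. The sketch below assumes each e-mail has already been tokenized and stemmed into a list of word roots (the stemming step itself is not shown); `word_frequencies` and the toy e-mails are hypothetical.

```python
from collections import Counter

ROOTS = ["enron", "team", "want", "let", "veri", "issu", "provid", "depreci"]


def word_frequencies(emails, roots):
    """Occurrences of each root per e-mail for one person's sent e-mails.

    `emails` is a list of token lists that are assumed to be stemmed already.
    """
    counts = Counter(token for email in emails for token in email)
    n_emails = len(emails)
    return {root: (counts[root] / n_emails if n_emails else 0.0)
            for root in roots}


# Hypothetical, already-stemmed toy e-mails (assumption).
toy_emails = [
    ["enron", "team", "meet"],
    ["want", "enron", "issu", "enron"],
]
freqs = word_frequencies(toy_emails, ROOTS)
print(freqs["enron"], freqs["team"], freqs["depreci"])
```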

Lastly, I added 2 features that take the ratio of e-mails from/to POIs relative to the total e-mails from/to the person. These ratios capture the idea that a POI may exchange e-mails with other POIs more frequently, and they do so better than the raw number of messages because some people simply sent more e-mails.
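
A minimal sketch of the two ratio features follows. The exact denominators are not spelled out in the text, so this sketch assumes that e-mails received from POIs are divided by all received e-mails (`to_messages`) and e-mails sent to POIs are divided by all sent e-mails (`from_messages`). The helper names and toy counts are hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 'NaN' if undefined or missing."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return "NaN"
    return numerator / denominator


def add_poi_ratios(features):
    """Add 'from_poi_ratio' and 'to_poi_ratio' (assumed denominators)."""
    features["from_poi_ratio"] = ratio(
        features["from_poi_to_this_person"], features["to_messages"])
    features["to_poi_ratio"] = ratio(
        features["from_this_person_to_poi"], features["from_messages"])
    return features


# Hypothetical counts (assumption), only to show the arithmetic.
person = add_poi_ratios({
    "from_poi_to_this_person": 20, "to_messages": 400,
    "from_this_person_to_poi": 5, "from_messages": 50,
})
print(person["from_poi_ratio"], person["to_poi_ratio"])
```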

The entire feature list, including the features removed above, contains 33 features in total:

['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock','shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances','from_messages', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive' ,'from_poi_to_this_person', **'bonus_log'**, **'bonus_sqrt'**, **'exercised_stock_options_log'**, **'exercised_stock_options_sqrt'**,**'from_poi_ratio'**, **'to_poi_ratio'**, **'enron'**, **'team'**, **'want'**,**'let'**,**'veri'**, **'issu'**, **'provid'**,**'depreci'**]

Engineered features are bolded.  Note that the last 8 features reflect the word roots that frequencies were calculated for.  These features were not used in the final classifier.

---

#### Feature Scaling

I used a MinMaxScaler and then PCA to reduce the feature set.  My hope is to extract the correlations introduced with the added sqrt and log financial features from above, and further reduce the general underlying features that likely correlate separately to financial information (both monetary and stock compensation) and e-mail features (general communication among POIs).  By plotting the PCA explained variance parameter, I see that most of the variance is explained with the first 7 components.  There is a little more gain up to 11 components, and very little with the rest (up to all 18 components).  For my parameter tuning, I elected to use PCA and reduce the features to 11 components.
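
The explained-variance inspection described above might look like the sketch below. It uses scikit-learn's `MinMaxScaler` and `PCA` on a synthetic 144 x 18 matrix rather than the Enron features, so the printed numbers say nothing about the real data; the synthetic construction is an assumption made only to have correlated columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 18 correlated columns built from
# 5 latent factors plus a little noise.
latent = rng.normal(size=(144, 5))
X = latent @ rng.normal(size=(5, 18)) + 0.1 * rng.normal(size=(144, 18))

# Scale every feature to [0, 1] so large-valued columns do not dominate.
X_scaled = MinMaxScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for n in (3, 7, 11, 18):
    print(f"{n:2d} components: {cumulative[n - 1]:.3f} of the variance")
```

Plotting `cumulative` against the component index gives the kind of curve used to choose the number of components.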

---
___

### Algorithm Selection

#### Initial Comparison
I chose 4 algorithms to compare:
   * NaiveBayes
   * DecisionTree
   * KNeighbors
   * LogisticRegression
   
Using the default parameters for these classifiers, all accuracies are near 80-90%. However, this metric is not too useful because of the unequal distribution of labels in the dataset. There are only 18 POIs out of 144 people, which means a classifier that simply guesses non-POI will have an accuracy of 87.5%. Therefore, it is much more useful to consider the precision and recall values when comparing classifiers. The F1 score combines these two scores into a single number (their harmonic mean).

Classifier | Accuracy | Precision | Recall | F1 Score
--- | --- | --- | --- | ---
NaiveBayes | 0.820 | 0.291 | 0.246 | 0.267
DecisionTree | 0.820 | 0.312 | 0.316 | 0.314
KNeighbors | 0.883 | 0.770 | 0.174 | 0.284
LogisticRegression | 0.773 | 0.234 | 0.309 | 0.266

Most of the results are rather poor.  Decision tree does give > 0.3 precision and recall, but I will tune my parameters to achieve better results by tuning the F1 score to the best possible for a range of parameters for DecisionTree, KNeighbors, and LogisticRegression.  NaiveBayes has no parameters to tune and was included in the table as a point of reference.
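
To make the accuracy baseline mentioned above concrete, the short check below scores a "classifier" that always guesses non-POI on 18 POIs out of 144 people. It is an illustration with constructed labels, not the project's evaluation code.

```python
n_people, n_poi = 144, 18

# Constructed labels: 1 marks a POI, 0 a non-POI.
labels = [1] * n_poi + [0] * (n_people - n_poi)
predictions = [0] * n_people  # always guess non-POI

accuracy = sum(p == y for p, y in zip(predictions, labels)) / n_people
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / n_poi

print(accuracy, recall)  # prints: 0.875 0.0
```

High accuracy with zero recall is exactly why precision, recall, and F1 are the more informative metrics here.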

---

#### Tuning
For each classifier, there are typically several parameters that can be set to different values. For example, a logistic classifier can have different values for 'C' (inverse of regularization strength), 'fit_intercept' (True/False), 'tol' (tolerance value used to stop fitting), and several others<sup>2</sup>. These all affect the underlying algorithm used to train the classifier. An appropriately chosen set of parameters can greatly affect the performance of a classifier, but it would take too long to manually select different combinations of each parameter and check the results because the number of combinations grows very quickly. Therefore, I performed a GridSearchCV for several parameters on each of the 3 classifiers chosen above. This iteratively searches through a grid of all combinations and scores each resulting classifier against a selected criterion, which I chose to be the F1 score.
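
How quickly the number of combinations grows can be seen by enumerating a small grid. The parameter names below come from the text, but the candidate values are hypothetical and are not the values searched in this project.

```python
from itertools import product

# Hypothetical candidate values (assumption) for three parameters.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "fit_intercept": [True, False],
    "tol": [1e-2, 1e-4, 1e-6],
}

# Every combination is one classifier configuration to train and score.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_folds = 10
print(len(combinations))            # 4 * 2 * 3 = 24 configurations
print(len(combinations) * n_folds)  # 240 fits with 10-fold cross-validation
```

Adding one more parameter with five candidate values would multiply both numbers by five.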

In my GridSearchCV, I used a Pipeline that included several parameters for SelectKBest and PCA as feature selection and reduction steps in addition to tuning classifier parameters. I also ran a MinMaxScaler on the features before piping into SelectKBest and PCA, to ensure that I did not give more weight in variance to the larger values (i.e. the financial features).

My array of values for SelectKBest was k = [15, 17, 19, 'all'], where k is the number of features to keep.
My array of values for PCA was n_components = [3, 5, 7, 9, 11, 13] where n_components is the number of principal components to reduce the set to.
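
A pipeline of this shape could be assembled as sketched below. This is not the project's actual code: it runs on synthetic data from `make_classification`, uses a reduced grid and 5 folds to stay fast, and assumes `f_classif` as the SelectKBest scoring function. The step names (`scale`, `select`, `pca`, `clf`) are arbitrary labels chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(n_samples=144, n_features=18, n_informative=5,
                           weights=[0.875, 0.125], random_state=0)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),                      # rescale to [0, 1]
    ("select", SelectKBest(score_func=f_classif)),  # keep the k best features
    ("pca", PCA()),                                 # reduce to n components
    ("clf", LogisticRegression(solver="liblinear",
                               class_weight="balanced")),
])

# Reduced grid for illustration; keys are "<step name>__<parameter>".
param_grid = {
    "select__k": [15, 17, "all"],
    "pca__n_components": [3, 5, 7],
    "clf__C": [0.1, 1, 10],
}

# 5 folds keeps the sketch fast; the report used 10.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because scaling, selection, and PCA sit inside the pipeline, they are refit on the training portion of every fold, so no information from the test fold leaks into them.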

Due to the limited dataset, and particularly limited POI labels, I used a 10-fold cross-validation for training/testing.  If I simply set aside a test set and trained on a fixed amount of data (say 80%), my results are likely to overfit to the training data and not perform well on test data. Therefore, I made use of the cross-validation methods that GridSearchCV utilizes, using 10 fold rather than the default 3 for more rigorous cross-validation.

Because of the number of 'NaN' values in the dataset, I did not use more than 10 folds. After removal of restricted_stock_deferred, director_fees, and loan_advances, the feature with the smallest number of non-NaN values was deferral_payments with 39. With 10 folds, this means that each test set would likely contain only 3 or 4 values for this feature and possibly none. With more folds, the likelihood of having no data for this feature would only increase, so I left the number of folds at 10.
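
The fold-size argument above can be made quantitative with a hypergeometric calculation. The sketch below assumes 144 people, 39 of whom have a deferral_payments value, and purely random assignment to test folds of size `144 // n_folds`; these are simplifications for illustration, and `prob_fold_has_no_values` is a hypothetical helper.

```python
from math import comb


def prob_fold_has_no_values(n_people, n_with_value, n_folds):
    """P(a random test fold contains none of the rows that have a value)."""
    fold_size = n_people // n_folds
    without_value = n_people - n_with_value
    return comb(without_value, fold_size) / comb(n_people, fold_size)


fold_options = (5, 10, 20, 50)
probabilities = [prob_fold_has_no_values(144, 39, k) for k in fold_options]

for k, p in zip(fold_options, probabilities):
    print(f"{k:2d} folds: {39 / k:.1f} expected values per fold, "
          f"P(none) = {p:.4f}")
```

The probability of an empty fold rises as the folds get smaller, which is the reason given above for not going beyond 10 folds.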

After tuning the parameters for each, I found the following results for each classifier:

Classifier | F1 Score
--- | ---
DecisionTree | 0.468 (0.233)\*
KNeighbors | 0.375
LogisticRegression | 0.466

\* DecisionTree scored well with the 10-fold splitting performed in the GridSearchCV, but only achieved an F1 score of 0.233 with the more rigorous StratifiedShuffleSplit of 1000 folds during testing.

---

#### Best Classifier

LogisticRegression has the best performance after tuning when evaluated with the supplied tester. The optimized pipeline removed only the exercised_stock_options_log feature, keeping 17 of the 18 features, and then used PCA to reduce these down to 7 components. Therefore, my final classifier used for evaluation is a logistic regression classifier preceded by a PCA that reduces the features to 7 components:

LogisticRegression(C=10, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=10, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
     
It should be noted that I removed several features as stated above (see "Feature Removal") based on some intuition and general review of scatter plots before using the pipeline that involved a SelectKBest feature selector.  The features used for this classifier were: 

['salary', 'to_messages', 'deferral_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'expenses', 'from_messages', 'other', 'deferred_income', 'long_term_incentive', 'bonus_log', 'bonus_sqrt', 'exercised_stock_options_sqrt', 'from_poi_ratio', 'to_poi_ratio']

with ['exercised_stock_options_log'] excluded, as it only received a SelectKBest F-score of 0.0562 (compared to an average score of 9.78 and a maximum score of 22.9 for bonus_sqrt).

---

#### Comparison to all features

My intuition and reasoning led me to remove several features and start with only 18 features (as listed above for the best classifier) before even attempting any sort of tuning. After determining that LogisticRegression gives decent results, I reran the entire pipeline through the grid search with all 33 features available. I added more options for SelectKBest features (k = [23, 27, 29, 31]) since there are more features available. Surprisingly, the best result involved keeping 31 features, including all word frequencies, but reducing down to 5 components using PCA. The 2 removed features were ['from_messages', 'exercised_stock_options_log']. The best classifier was:

LogisticRegression(C=10, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=0.1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

This classifier, when used in a pipeline involving all 33 features resulted in an F1 score of 0.453.  This is slightly worse than the original classifier that involved fewer features to begin the pipeline.  My initial screening of features to immediately reduce to 18 features was useful in providing slightly better results for this particular classifier.

---
___

### Validation
Training a classifier involves using a dataset with known features and labels. After training, it is important to validate the classifier by attempting to predict labels for a separate dataset that also has features and labels, and comparing the predicted labels against the known labels of this holdout set. One of the simplest ways to get two sets of data that include both features and labels is to divide an available dataset into a portion for training and a portion for testing. The testing portion is set aside and training is performed on the rest.

One problem that may occur as a classifier attempts to correctly identify the training labels is that it may end up overfitting and creating rules that are too complex to work for a more general dataset. This will lead to good performance on the training set, but poor performance on a test set. One approach to detect this is cross-validation. This involves dividing the entire dataset into multiple equal-sized groups, selecting one group as the test group, and then training and testing just as before. Then, the next group is selected as the test group, and the classifier is trained again on the remaining data. This is repeated with each group taking a turn as the test group. Then the results are averaged to give a more general validation of the classifier performance. Essentially all data gets trained and tested on, but in different combinations.
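
The rotation of test groups described above can be sketched in plain Python. This is a simplified illustration without shuffling, not the splitter used by GridSearchCV; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In a full cross-validation, a classifier would be fit on each `train` list, scored on the matching `test` list, and the scores averaged.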

My GridSearchCV optimized parameters based on the 'F1' score using cross-validation. Due to the uneven distribution of labels (POI/non-POI) in the dataset, a stratified cross-validation method would have been better to use, as it keeps the proportion of each label roughly equal across groups. During my training there were instances where no POIs were selected for the test set, and F1 values could not be computed. This is a possible explanation for the discrepancy in my DecisionTree classifier, which appeared to do well in the cross-validation of my GridSearch but did poorly in the tester code validation that used a stratified shuffle.
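
The difference between plain and stratified splitting can be illustrated with 18 positive labels out of 144. The sketch below deliberately sorts the labels to show the worst case for contiguous folds, and deals each class out round-robin as a simple stand-in for a stratified splitter; both helpers are hypothetical and are not scikit-learn's implementation.

```python
def contiguous_folds(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds consecutive blocks."""
    bounds = [i * n_samples // n_folds for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]


def stratified_folds(labels, n_folds):
    """Assign samples to folds by dealing out each class round-robin."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds


def positives_per_fold(folds, labels):
    """Count the positive labels that land in each fold."""
    return [sum(labels[i] for i in fold) for fold in folds]


# Constructed labels, sorted on purpose (worst case for contiguous folds).
labels = [1] * 18 + [0] * 126

plain_counts = positives_per_fold(contiguous_folds(len(labels), 10), labels)
strat_counts = positives_per_fold(stratified_folds(labels, 10), labels)
print("plain:     ", plain_counts)
print("stratified:", strat_counts)
```

Folds with zero positives are the ones where precision, recall, and F1 cannot be computed meaningfully.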

---
#### Tester Validation

After finding the best classifier for this score which involved a pipeline including PCA on a limited feature set and a Logistic Regression, I ran the tester code supplied with the final project and evaluated accuracy, precision and recall values.

Accuracy: 0.77300

Precision: 0.34110	

Recall: 0.75400	

F1: 0.46971

The accuracy is actually worse than simply labeling every person a non-POI, which again would yield 87.5% accuracy. Of course, that would be a useless classifier for identifying fraudulent persons since it says that no one is a POI! Therefore, accuracy is less important than precision and recall. I used F1 to optimize my classifier because it weighs both precision and recall.

Precision is defined as: $\frac{\text{truePositives}}{\text{truePositives}+\text{falsePositives}}$

Recall is defined as: $\frac{\text{truePositives}}{\text{truePositives}+\text{falseNegatives}}$

F1 is defined as: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
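
These definitions translate directly into code. The sketch below uses hypothetical confusion counts for the first example and then checks that the reported precision and recall reproduce the reported F1; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def scores_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts (assumption): 3 true positives, 6 false positives,
# 1 false negative.
precision, recall, f1 = scores_from_counts(tp=3, fp=6, fn=1)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# Consistency check on the tester values reported above.
reported_f1 = f1_from(0.34110, 0.75400)
print(round(reported_f1, 5))
```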

The precision of my classifier is a bit low, though still above 0.3, while the recall is considerably better. Given that I am attempting to identify POIs positively, the recall value may be considered more important than the precision. A low precision means that there are several false positives and some innocent people are being flagged. These false positives may be tolerated more than false negatives (missing people who actually were POIs). Considering that POI means 'person of interest' and not an absolutely guilty person, placing higher importance on recall seems reasonable, as a flag could simply prompt further investigation that exonerates the person if they were actually innocent. And who knows, perhaps some guilty Enron people got off the hook without any charges or investigations.

---
[1] https://public.tableau.com/profile/diego2420#!/vizhome/Udacity/UdacityDashboard

[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html