## Project: Identify Fraud from Enron Email
### Enron Submission Free-Response Answers

Student: Richard Smith

---

#### Question 1
> Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?

### Project Goal

Regarding the goal of the project, this excerpt from the Project Overview is the most direct answer:
> In this project, you will play detective, and put your machine learning skills to use by building an algorithm to identify Enron Employees who may have committed fraud based on the public Enron financial and email dataset.

In other words, the goal of the project is to train a machine learning algorithm to correctly predict the value of 'poi' for an individual, given financial data and email metadata derived from the public database of Enron emails.

### Dataset Background

Regarding the public Enron financial and email dataset, this excerpt from the Project Overview is an appropriate description:
> In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.

The specific data students are given is "final_project_dataset.pkl", which contains rows of financial data and email metadata for various people (and one business) involved with the Enron scandal, either by way of their working for Enron, or else being associated with those working for Enron.

In [101]:
import pickle
from pprint import pprint as pp
# loading the pickle file to a Python dict to display length and features
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
print("Number of rows present in the data:", len(data_dict))
features = []
poi_count = 0
for k in data_dict.keys():
    for d in data_dict[k]:
        if d not in features:
            features.append(d)
    if data_dict[k]['poi']:
        poi_count += 1
print("Data features for any given row:")
pp(features)
print("Number of features:", len(features))
print("Number of persons of interest in the set: ", poi_count)
print("Non-PoIs: ", len(data_dict) - poi_count)

Number of rows present in the data: 146
Data features for any given row:
['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'loan_advances',
 'bonus',
 'email_address',
 'restricted_stock_deferred',
 'deferred_income',
 'total_stock_value',
 'expenses',
 'from_poi_to_this_person',
 'exercised_stock_options',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'long_term_incentive',
 'shared_receipt_with_poi',
 'restricted_stock',
 'director_fees']
Number of features: 21
Number of persons of interest in the set:  18
Non-PoIs:  128


Of those features, 'poi' contains Boolean True or False values representing whether a given individual was considered a "person of interest" (PoI) during the investigation into Enron's financial activities. By those values, 18 of the 146 entries in the set are PoIs, leaving 128 non-PoIs.

Financial and email feature values are typically integers representing either dollar amounts or counts; the exception is 'email_address', which holds strings. Since an email address only identifies an individual and has no meaningful relationship to the other features, I decided to remove it as a feature, reducing the total count of features to 20.

Among the remaining initial features, there were a lot of "missing" values for entries in the data, represented by 'NaN':

In [102]:
print("Removing email addresses... ", end='')
for k in data_dict.keys():
  if 'email_address' in data_dict[k].keys():
    data_dict[k].pop('email_address', 0)
print("done.")
print("\nRemaining features in the dataset:", len(features)-1)
missing_values = {'salary' : 0,
                  'to_messages' : 0,
                  'deferral_payments' : 0,
                  'total_payments' : 0,
                  'loan_advances' : 0,
                  'bonus' : 0,
                  'restricted_stock_deferred' : 0,
                  'deferred_income' : 0,
                  'total_stock_value' : 0,
                  'expenses' : 0,
                  'from_poi_to_this_person' : 0,
                  'exercised_stock_options' : 0,
                  'from_messages' : 0,
                  'other' : 0,
                  'from_this_person_to_poi' : 0,
                  'poi' : 0,
                  'long_term_incentive' : 0,
                  'shared_receipt_with_poi' : 0,
                  'restricted_stock' : 0,
                  'director_fees' : 0}
for k in data_dict.keys():
    for d in data_dict[k]:
        if data_dict[k][d] == 'NaN':
            missing_values[d] += 1
print("\n'NaN' (missing) values in the dataset, by feature:")
pp(missing_values)

Removing email addresses... done.

Remaining features in the dataset: 20

'NaN' (missing) values in the dataset, by feature:
{'bonus': 64,
 'deferral_payments': 107,
 'deferred_income': 97,
 'director_fees': 129,
 'exercised_stock_options': 44,
 'expenses': 51,
 'from_messages': 60,
 'from_poi_to_this_person': 60,
 'from_this_person_to_poi': 60,
 'loan_advances': 142,
 'long_term_incentive': 80,
 'other': 53,
 'poi': 0,
 'restricted_stock': 36,
 'restricted_stock_deferred': 128,
 'salary': 51,
 'shared_receipt_with_poi': 60,
 'to_messages': 60,
 'total_payments': 21,
 'total_stock_value': 20}


Only 'poi' has a value for every row. Counts of missing values vary from one financial feature to the next, and most rows have several 'NaN's, but since these can be interpreted as meaning '0' (given the reference provided in enron61702insiderpay.pdf, included with the project files), financial data for each entry can be treated as complete.

For email metadata features, though, exactly 60 values are missing for each of the five features. This is best understood by observing that every row either has integer values for all email features or has 'NaN' for all of them. In the latter case, emails to or from that individual were not present in the underlying data, so email metadata for each entry can also be treated as complete.
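The all-or-nothing pattern described above can be checked directly. The following is a minimal sketch, assuming rows shaped like those in `data_dict`; the helper name `partially_missing_email_rows` and the toy rows are hypothetical and are not part of the project files.

```python
# The five email metadata features named in this report.
EMAIL_FEATURES = ['to_messages',
                  'from_messages',
                  'from_poi_to_this_person',
                  'from_this_person_to_poi',
                  'shared_receipt_with_poi']

def partially_missing_email_rows(data):
    """Return names of rows whose email features are neither all present
    nor all 'NaN' (hypothetical helper, for illustration only)."""
    partial = []
    for name, row in data.items():
        missing = sum(1 for f in EMAIL_FEATURES if row[f] == 'NaN')
        if 0 < missing < len(EMAIL_FEATURES):
            partial.append(name)
    return sorted(partial)

# Toy rows with made-up values, NOT taken from the real dataset.
toy = {
    'COMPLETE': dict.fromkeys(EMAIL_FEATURES, 10),
    'EMPTY': dict.fromkeys(EMAIL_FEATURES, 'NaN'),
    'PARTIAL': {**dict.fromkeys(EMAIL_FEATURES, 10), 'to_messages': 'NaN'},
}
print(partially_missing_email_rows(toy))  # ['PARTIAL']
```

Run against the real `data_dict`, an empty list would confirm that no row has a mix of present and missing email features.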

When looking through `tester.py`'s code in order to understand its internal treatment of the dataset, I noticed the default treatment for 'NaN' values, based on the `feature_format.featureFormat()` call in `tester.py`'s `test_classifier` method:
```
def test_classifier(clf, dataset, feature_list, folds = 1000):
    data = featureFormat(dataset, feature_list, sort_keys = True)
```
Since that call does not pass `remove_NaN`, `feature_format.featureFormat()` applies its default treatment for 'NaN' values:
```
def featureFormat( dictionary, features, remove_NaN=True...
  ...<lots of code omitted>...
  if value=="NaN" and remove_NaN:
    value = 0
```
This means that 'NaN' values will be replaced with integer zeroes upon testing. As an alternative, missing values *could* be imputed, replaced with values based on other features in those rows, statistics from the distribution of that feature across all rows, or some combination of both. [This](https://towardsdatascience.com/why-using-a-mean-for-missing-data-is-a-bad-idea-alternative-imputation-algorithms-837c731c1008) article includes descriptions of methods, limitations, and drawbacks for imputation, and also cites [this](http://www.stat.columbia.edu/~gelman/arm/missing.pdf) paper which goes into greater detail. Besides the points raised in those references, given what I've come to understand about the features in this dataset, imputation may simply be inappropriate: if missing data in this set is "best" represented as zeroes, imputation would effectively result in "false" data, and therefore potentially inaccurate results from classification predictions based on that "false" data.
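To make the concern about "false" data concrete, here is a small sketch contrasting zero filling with median imputation on a single feature column. The function `fill_missing` and the sample values are hypothetical; this is not how `featureFormat()` is implemented, only an illustration of the two strategies.

```python
from statistics import median

def fill_missing(values, strategy="zero"):
    """Replace 'NaN' strings with 0, or with the median of observed values.
    Hypothetical helper, for illustration only."""
    observed = [v for v in values if v != 'NaN']
    if strategy == "zero":
        fill = 0
    elif strategy == "median":
        fill = median(observed) if observed else 0
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return [fill if v == 'NaN' else v for v in values]

# Made-up 'director_fees' column: most people are not directors.
director_fees = ['NaN', 100000, 'NaN', 120000, 'NaN']
print(fill_missing(director_fees, "zero"))    # [0, 100000, 0, 120000, 0]
print(fill_missing(director_fees, "median"))  # [110000.0, 100000, 110000.0, 120000, 110000.0]
```

With median imputation, every non-director in this toy column is credited with fees they never received, which is the kind of distortion described above.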

Regardless, since rows with many 'NaN' values will result in zeroes on the distributions for related features, I decided to examine rows with more than 15 features containing 'NaN' values, since those would be rows which not only lacked email metadata (five features, 'to_messages', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', and 'shared_receipt_with_poi') but would also have few nonmissing values for financial features, meaning relatively little contribution to predictive classification:

In [103]:
# building a dictionary with one key per entry in data_dict,
#   each mapped to the number of 'NaN'-valued features for that entry
empty_features = {}
for k in data_dict.keys():
  empty_features[k] = []
  for d in data_dict[k]:
    if data_dict[k][d] == 'NaN':
      empty_features[k].append(1)
  empty_features[k] = sum(empty_features[k])
count = 0
print("Entries with less than 5 nonmissing values:")
for k in sorted(empty_features):
  if empty_features[k] > 15:
    print(k)
    for d in data_dict[k]:
      if data_dict[k][d] != 'NaN':
        print("  %s : %s" % (d, data_dict[k][d]))
    count += 1
print("Number of entries with less than 5 nonmissing values:")
print(" ",count)

Entries with less than 5 nonmissing values:
CHRISTODOULOU DIOMEDES
  total_stock_value : 6077885
  exercised_stock_options : 5127155
  poi : False
  restricted_stock : 950730
CLINE KENNETH W
  restricted_stock_deferred : -472568
  total_stock_value : 189518
  poi : False
  restricted_stock : 662086
GILLIS JOHN
  total_stock_value : 85641
  exercised_stock_options : 9803
  poi : False
  restricted_stock : 75838
GRAMM WENDY L
  total_payments : 119292
  poi : False
  director_fees : 119292
LOCKHART EUGENE E
  poi : False
SAVAGE FRANK
  total_payments : 3750
  deferred_income : -121284
  poi : False
  director_fees : 125034
SCRIMSHAW MATTHEW
  total_stock_value : 759557
  exercised_stock_options : 759557
  poi : False
THE TRAVEL AGENCY IN THE PARK
  total_payments : 362096
  other : 362096
  poi : False
WAKEHAM JOHN
  total_payments : 213071
  expenses : 103773
  poi : False
  director_fees : 109298
WHALEY DAVID A
  total_stock_value : 98718
  exercised_stock_options : 98718
  poi : False

What first struck me was that there were relatively few such rows, and that none of these rows were for persons of interest. Next, I noticed that 'LOCKHART EUGENE E' had only one nonmissing value, 'poi', meaning its values for all other features would be set to zero upon testing. Also, 'THE TRAVEL AGENCY IN THE PARK' appears to represent a business, rather than an individual. A description for this entry is footnoted in enron61702insiderpay.pdf:
>Payments were made by Enron employees on account of business-related travel to The Travel Agency in the Park (later Alliance Worldwide), which was co-owned by the sister of Enron's former Chairman. Payments made by the Debtor to reimburse employees for these expenses have not been included.

The other rows shown above may have relatively little data that would be non-zero upon testing, but given that they are all non-PoIs, the data they *do* hold may well contribute to classification.

### Data Cleaning

As previously mentioned, 'email_address' holds no correlatable values, so I've removed that feature entirely, popping it from every entry in the dataset.

Besides that, inspection of the dataset resulted in these rows standing out:  
  - 'TOTAL'
  - 'THE TRAVEL AGENCY IN THE PARK'
  - 'LOCKHART EUGENE E'

Having already shown the contents of 'THE TRAVEL AGENCY IN THE PARK' and 'LOCKHART EUGENE E', here's 'TOTAL':

In [104]:
print("'TOTAL':")
pp(data_dict['TOTAL'])

'TOTAL':
{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}


This is an aggregate row, and an incomplete one in that it applies only to financial data--it's clearly a carryover from the financial data represented in enron61702insiderpay.pdf. Being an aggregate, it could only be considered a non-person, and its upper outlier status for all non-missing values would only introduce inaccuracy to a classification model, and so should be removed.

'THE TRAVEL AGENCY IN THE PARK' is cited (see above) as "owned by the sister of Enron's former Chairman", but her name is not immediately apparent, and after spending some time in search of her name I was not able to confirm whether or not she was included in this dataset. Regardless, like 'TOTAL', this entry does not represent an individual and should be removed.

The entry for 'LOCKHART EUGENE E' is composed of only 'NaN' values and `'poi' : False`, so its data for all features used in predictive classification will be filled with zeroes. Since my intuition is that an individual with no financial or email interaction with Enron is *appropriately* labeled a non-PoI, and since there are relatively few entries in the dataset as a whole, I've opted to leave this entry in place.

Given the above rationale, I removed 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK':

In [105]:
# removing aggregate row
data_dict.pop('TOTAL', 0)
# removing non-person row
data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0)
print("Rows remaining in the dataset:",len(data_dict))

Rows remaining in the dataset: 144


In thinking about checking the data for any other issues that required cleaning, I figured out pretty quickly that it would be difficult to check the dataset's email metadata, as it's been compiled via "to" and "from" fields across all ~500,000 emails. I *can* check financial data in the set with basic calculations, though, which I did by way of adding together financial payment features and checking them against 'total_payments':

In [106]:
print("Checking data for issues related to 'total_payments'.........", end='')
payment_financial_features = ['salary',
                              'bonus',
                              'long_term_incentive',
                              'expenses',
                              'director_fees',
                              'other',
                              'loan_advances',
                              'deferred_income',
                              'deferral_payments']
problem_entries = {}
# Iterate over each row, check sum of above features against total_payments,
#   adding rows with mismatch to problem_entries
for k in data_dict.keys():
  total_payments_check = 0
  for d in data_dict[k]:
    if d in payment_financial_features and data_dict[k][d] != 'NaN':
      total_payments_check += data_dict[k][d]
  if data_dict[k]['total_payments'] != 'NaN' and \
                        total_payments_check != data_dict[k]['total_payments']:
    problem_entries[k] = data_dict[k]
from pprint import pprint as pp
if len(problem_entries):
  print("found!")
  print("  Rows with issues related to 'total_payments' found:")
  pp(problem_entries)
else:
  print("none.")

Checking data for issues related to 'total_payments'.........found!
  Rows with issues related to 'total_payments' found:
{'BELFER ROBERT': {'bonus': 'NaN',
                   'deferral_payments': -102500,
                   'deferred_income': 'NaN',
                   'director_fees': 3285,
                   'exercised_stock_options': 3285,
                   'expenses': 'NaN',
                   'from_messages': 'NaN',
                   'from_poi_to_this_person': 'NaN',
                   'from_this_person_to_poi': 'NaN',
                   'loan_advances': 'NaN',
                   'long_term_incentive': 'NaN',
                   'other': 'NaN',
                   'poi': False,
                   'restricted_stock': 'NaN',
                   'restricted_stock_deferred': 44093,
                   'salary': 'NaN',
                   'shared_receipt_with_poi': 'NaN',
                   'to_messages': 'NaN',
                   'total_payments': 102500,
                   'total_stock_

I checked these rows' data against their entries in enron61702insiderpay.pdf, and it was readily apparent that their financial data values were "shifted", such that Belfer's were shifted to the right, and Bhatnagar's to the left. Given the organization of these entries' features in the dictionary, the simplest way to correct these problems was to create new entries for each with the correct financial feature values from the pdf. I left the email metadata features' values "as-found", since verifying those would require an inordinate amount of effort, and there wasn't any apparent reason to distrust them.

In [107]:
# For 'BELFER ROBERT', lines marked with '#' were affected by an apparent shift
#   in values, corrected here with values from reference.
# Email data left as-found.
belfer_corrected = {'bonus': 'NaN',
                    'deferral_payments': 0,                  #
                    'deferred_income': -102500,              #
                    'director_fees': 102500,                 #
                    'exercised_stock_options': 0,            #
                    'expenses': 3285,                        #
                    'from_messages': 'NaN',
                    'from_poi_to_this_person': 'NaN',
                    'from_this_person_to_poi': 'NaN',
                    'loan_advances': 'NaN',
                    'long_term_incentive': 'NaN',
                    'other': 'NaN',
                    'poi': False,
                    'restricted_stock': 44093,                #
                    'restricted_stock_deferred': -44093,      #
                    'salary': 'NaN',
                    'shared_receipt_with_poi': 'NaN',
                    'to_messages': 'NaN',
                    'total_payments': 3285,                   #
                    'total_stock_value': 0}                   #

# Likewise, for 'BHATNAGAR SANJAY', lines marked with '#' were affected by an
#   apparent shift in data, corrected here with values from reference.
# Email data left as-found.
bhatnagar_corrected = {'bonus': 'NaN',
                       'deferral_payments': 'NaN',
                       'deferred_income': 'NaN',
                       'director_fees': 0,                    #
                       'exercised_stock_options': 15456290,   #
                       'expenses': 137864,                    #
                       'from_messages': 29,
                       'from_poi_to_this_person': 0,
                       'from_this_person_to_poi': 1,
                       'loan_advances': 'NaN',
                       'long_term_incentive': 'NaN',
                       'other': 0,                            #
                       'poi': False,
                       'restricted_stock': 2604490,           #
                       'restricted_stock_deferred': -2604490, #
                       'salary': 'NaN',
                       'shared_receipt_with_poi': 463,
                       'to_messages': 523,
                       'total_payments': 137864,              #
                       'total_stock_value': 15456290}         #

data_dict['BELFER ROBERT'] = belfer_corrected
data_dict['BHATNAGAR SANJAY'] = bhatnagar_corrected

Having overwritten the faulty entries, I checked my payment financial data totals again to verify the fix:

In [108]:
# Repeating check to verify changes
print("Re-checking data for issues related to 'total_payments'.....", end='')
problem_entries = {}
for k in data_dict.keys():
  total_payments_check = 0
  for d in data_dict[k]:
    if d in payment_financial_features and data_dict[k][d] != 'NaN':
      total_payments_check += data_dict[k][d]
  if data_dict[k]['total_payments'] != 'NaN' and \
    total_payments_check != data_dict[k]['total_payments']:
    problem_entries[k] = data_dict[k]

if len(problem_entries):
  print("found!")
  print("  Rows with issues related to 'total_payments' found:")
  pp(problem_entries)
else:
  print("none.")

Re-checking data for issues related to 'total_payments'.....none.
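A similar consistency check could be applied to the stock features. The sketch below assumes that 'total_stock_value' is the sum of 'exercised_stock_options', 'restricted_stock', and 'restricted_stock_deferred'; that relationship holds for the two corrected rows above, but it is an assumption for the rest of the data. The helper name `stock_total_mismatches` and the 'SHIFTED EXAMPLE' row are hypothetical.

```python
# Assumed components of 'total_stock_value' (an assumption, see lead-in).
STOCK_FEATURES = ['exercised_stock_options',
                  'restricted_stock',
                  'restricted_stock_deferred']

def stock_total_mismatches(data):
    """Return names of rows whose stock components do not sum to
    'total_stock_value', treating 'NaN' components as zero."""
    mismatched = []
    for name, row in data.items():
        if row['total_stock_value'] == 'NaN':
            continue
        total = sum(row[f] for f in STOCK_FEATURES if row[f] != 'NaN')
        if total != row['total_stock_value']:
            mismatched.append(name)
    return sorted(mismatched)

rows = {
    # Stock values from the corrected entries in this report.
    'BELFER ROBERT': {'exercised_stock_options': 0,
                      'restricted_stock': 44093,
                      'restricted_stock_deferred': -44093,
                      'total_stock_value': 0},
    'BHATNAGAR SANJAY': {'exercised_stock_options': 15456290,
                         'restricted_stock': 2604490,
                         'restricted_stock_deferred': -2604490,
                         'total_stock_value': 15456290},
    # Made-up row imitating a column shift, NOT from the real dataset.
    'SHIFTED EXAMPLE': {'exercised_stock_options': 'NaN',
                        'restricted_stock': 5000,
                        'restricted_stock_deferred': 'NaN',
                        'total_stock_value': 'NaN' if False else 1000},
}
print(stock_total_mismatches(rows))  # ['SHIFTED EXAMPLE']
```

Called as `stock_total_mismatches(data_dict)`, an empty result would suggest the stock columns are internally consistent under that assumption.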


### Outliers and Data Cleaning Decisions

It was a little tricky thinking about what might be considered outliers among the data in this set. Given the nature of what made any given individual a PoI or not, extremely high or low values among the entries could be considered appropriate to their true or false 'poi' status. Any "conflicting" combinations of values among an entry's features could potentially be indicative of feature importances, too. Compounding both those prospects, the dataset is relatively sparse in both its number of rows and nonzero values for features. After considering the possible impact of further reduction of the dataset, it seemed prudent to trim and clean as little as possible: I opted to only remove feature 'email_address', and rows 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'.

#### From Question 2
>... As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) ...

### Feature Creation

Since my exploration of the data revealed that 'NaN' values for financial features varied widely from row to row, I realized that I could not rely on any specific set of financial features being present for creation of a calculated feature. The only reliably-present set of features was the five email metadata features: as previously mentioned, rows either have integer values for all of them, or else 'NaN' for all of them. So, the features I introduced to the dataset were all fractions based on email metadata:
 - ratio of emails sent to PoIs to emails sent generally:  
   `to_poi_from_messages_ratio = from_this_person_to_poi / from_messages`
 - ratio of emails received from PoIs to emails received generally:  
   `from_poi_to_messages_ratio = from_poi_to_this_person / to_messages`
 - ratio of emails having shared receipt with PoI to emails received generally:  
   `shared_receipt_to_messages_ratio = shared_receipt_with_poi / to_messages`

In [109]:
print("Creating features... ", end='')
for k in data_dict.keys():
  # flags marking which email features are present (not 'NaN') for this row
  from_messages  = data_dict[k]['from_messages'] != 'NaN'
  to_messages    = data_dict[k]['to_messages'] != 'NaN'
  to_poi         = data_dict[k]['from_this_person_to_poi'] != 'NaN'
  from_poi       = data_dict[k]['from_poi_to_this_person'] != 'NaN'
  shared_receipt = data_dict[k]['shared_receipt_with_poi'] != 'NaN'

  # ratio of emails sent to PoIs to emails sent generally:
  # to_poi_from_messages_ratio = from_this_person_to_poi / from_messages
  if to_poi and from_messages:
    data_dict[k]['to_poi_from_messages_ratio'] = \
       data_dict[k]['from_this_person_to_poi'] / data_dict[k]['from_messages']
  else:
    data_dict[k]['to_poi_from_messages_ratio'] = 'NaN'

  # ratio of emails received from PoIs to emails received generally:
  # from_poi_to_messages_ratio = from_poi_to_this_person / to_messages
  if from_poi and to_messages:
    data_dict[k]['from_poi_to_messages_ratio'] = \
          data_dict[k]['from_poi_to_this_person'] / data_dict[k]['to_messages']
  else:
    data_dict[k]['from_poi_to_messages_ratio'] = 'NaN'
  
  # ratio of emails having shared receipt with PoIs to emails received generally:
  # shared_receipt_to_messages_ratio = shared_receipt_with_poi / to_messages
  if shared_receipt and to_messages:
    data_dict[k]['shared_receipt_to_messages_ratio'] = \
       data_dict[k]['shared_receipt_with_poi'] / data_dict[k]['to_messages']
  else:
    data_dict[k]['shared_receipt_to_messages_ratio'] = 'NaN'
print("done.\n")

Creating features... done.



I checked these created features via this code, observing the values created for each entry (results omitted due to length):
```
for k in data_dict.keys():
  print(k)
  print(" to", data_dict[k]['to_messages'])
  print(" from", data_dict[k]['from_messages'])
  print(" to_poi", data_dict[k]['from_this_person_to_poi'])
  print(" to poi/from",data_dict[k]['to_poi_from_messages_ratio'])
  print(" from_poi", data_dict[k]['from_poi_to_this_person'])
  print(" from poi/to",data_dict[k]['from_poi_to_messages_ratio'])
  print(" shared",data_dict[k]['shared_receipt_with_poi'])
  print(" shared/to ",data_dict[k]['shared_receipt_to_messages_ratio'])
```
Excepting a value of 1.0011 for 'GLISAN JR BEN F', caused by its 'shared_receipt_with_poi' value exceeding its 'to_messages' value by 1, all values for these created features were either decimal values between 0 and 1, or 'NaN', as intended.
With the removal of 'email_address' and inclusion of 'to_poi_from_messages_ratio', 'from_poi_to_messages_ratio', and 'shared_receipt_to_messages_ratio', non-target features in the dataset number 22, with 'poi' remaining as a target feature.
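The three ratios share the same guard logic, so they could be computed through one helper. This is a minimal sketch, assuming the 'NaN' string convention used throughout the dataset; `safe_ratio` is a hypothetical name. Unlike the cell above, it also guards against a zero denominator, although whether any row actually has a zero message count was not checked here.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 'NaN' when either value is
    missing or the denominator is zero (hypothetical helper)."""
    if numerator == 'NaN' or denominator == 'NaN' or denominator == 0:
        return 'NaN'
    return numerator / denominator

# 'BHATNAGAR SANJAY' values from this report:
#   from_this_person_to_poi = 1, from_messages = 29
print("%.4f" % safe_ratio(1, 29))   # 0.0345
print(safe_ratio('NaN', 'NaN'))     # NaN
print(safe_ratio(5, 0))             # NaN
```

Each created feature would then be a single assignment, for example `data_dict[k]['to_poi_from_messages_ratio'] = safe_ratio(data_dict[k]['from_this_person_to_poi'], data_dict[k]['from_messages'])`.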

#### From Question 3
>...What [algorithms] did you try? How did model performance differ between algorithms?

### Algorithms Tried

I evaluated performance differences between DecisionTreeClassifier, KNeighborsClassifier (K Nearest Neighbors), and GaussianNB (Gaussian Naive Bayes). Since I'd reviewed `tester.py` and seen that its `test_classifier()` function applied stratified shuffle-split cross-validation (via `StratifiedShuffleSplit`) before iterative fitting and prediction, I decided to mimic that function's approach wholesale, so that my algorithm choice would rest on the same metrics used in grading: accuracy, precision, recall, F1, and F2. Each is a ratio calculated from counts of true and false positives and negatives, summed across 1000 training-testing splits of the dataset, equivalent to a pooled confusion matrix:

In [110]:
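from sklearn.neighbors         import KNeighborsClassifier
from sklearn.tree              import DecisionTreeClassifier
from sklearn.naive_bayes       import GaussianNB
from sklearn.ensemble          import AdaBoostClassifier
from sklearn.model_selection   import StratifiedShuffleSplit
# project-provided helpers (feature_format.py, included with the project files)
from feature_format            import featureFormat, targetFeatureSplit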

# 4-A Function definition for classifier testing, validation, evaluation
def classifier_test(clf, dataset, feature_list, folds = 1000):
  '''
  Based on code used in tester.py, with equivalent functionality, this function
evaluates classifier performance through cross-validation via StratifiedShuffleSplit(),
default 1000 splits for training and testing sets.
  Written primarily for personal comprehension of the testing method used in grading
results, and to apply the same metrics used in grading to validation and evaluation of
classifiers.

parameters:

clf:
  sklearn classifier, must support *.fit, *.predict.
  
dataset:
  object compatible with Python dict, must have key entries containing features and
values compatible with feature_list.

feature_list:
  Python list, must contain strings matching features present in dict passed to
'dataset'.
  
folds:
  integer, default 1000, controls splits applied for cross validation via
StratifiedShuffleSplit().

output:
  Displays predictions made and performance results:
    Accuracy, Precision, Recall, F1, F2.
  '''
  data = featureFormat(dataset, feature_list, sort_keys = True)
  labels, features = targetFeatureSplit(data)
  cv = StratifiedShuffleSplit(n_splits=folds, random_state = 42)
  true_neg  = 0
  false_neg = 0
  true_pos  = 0
  false_pos = 0
  for train_idx, test_idx in cv.split(features, labels):
    features_train = []
    labels_train   = []
    features_test  = []
    labels_test    = []
    for ii in train_idx:
      features_train.append(features[ii])
      labels_train.append(labels[ii])
    for jj in test_idx:
      features_test.append(features[jj])
      labels_test.append(labels[jj])

    # fit the classifier using training set, and test on test set
    clf.fit(features_train, labels_train)
    predictions = clf.predict(features_test)
    for prediction, truth in zip(predictions, labels_test):
      if prediction == 0 and truth == 0:
        true_neg += 1
      elif prediction == 0 and truth == 1:
        false_neg += 1
      elif prediction == 1 and truth == 0:
        false_pos += 1
      elif prediction == 1 and truth == 1:
        true_pos += 1
      else:
        print("Warning: Found a predicted label not == 0 or 1.")
        print("All predictions should take value 0 or 1.")
        print("Evaluating performance for processed predictions:")
        break
  try:
    total_pred = true_neg + false_neg + false_pos + true_pos
    accuracy = 1.0 * (true_pos + true_neg) / total_pred
    precision = 1.0 * true_pos / (true_pos + false_pos)
    recall = 1.0 * true_pos / (true_pos + false_neg)
    f1 = 2.0 * true_pos / (2 * true_pos + false_pos + false_neg)
    f2 = (1 + 2.0 * 2.0) * precision * recall / (4 * precision + recall)
    print(clf)
    print("  Predictions: %d" % total_pred)
    print("  Accuracy: %.5f\n  Precision: %.5f  Recall: %.5f" % \
          (accuracy, precision, recall))
    print("  F1: %.5f  F2: %.5f" % (f1, f2), "\n")
  except ZeroDivisionError:
    print("Performance calculations failed.")
    print("Precision or recall may be undefined (no true positives).")

# 4-B  Iteration over a list of classifiers
# (see references.txt for code example source)
classifiers = [KNeighborsClassifier(),
               DecisionTreeClassifier(),
               GaussianNB()]

print("Trying several classifiers with default settings for comparison...\n")
for classifier in classifiers:
  classifier_test(classifier, data_dict, features_list)

Trying several classifiers with default settings for comparison...

KNeighborsClassifier()
  Predictions: 15000
  Accuracy: 0.87513
  Precision: 0.61609  Recall: 0.16850
  F1: 0.26463  F2: 0.19715 

DecisionTreeClassifier()
  Predictions: 15000
  Accuracy: 0.80460
  Precision: 0.26383  Recall: 0.26000
  F1: 0.26190  F2: 0.26076 

GaussianNB()
  Predictions: 15000
  Accuracy: 0.76353
  Precision: 0.24564  Recall: 0.37350
  F1: 0.29637  F2: 0.33828 



With default settings and all features in use, KNeighborsClassifier ranked highest in accuracy and precision, but lowest in recall. DecisionTreeClassifier slightly outperformed GaussianNB in accuracy and precision, but trailed it in recall. GaussianNB had the highest recall, but the lowest accuracy and precision.
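As a sanity check on the figures above, the F1 and F2 scores can be recomputed from the reported precision and recall using the general F-beta formula, and the prediction count can be reproduced from the split size. This is a sketch under stated assumptions: `f_beta` is a hypothetical helper, and the split arithmetic assumes `StratifiedShuffleSplit`'s default test size of 10% applied to the 144 rows remaining after cleaning (143 rows would give the same result).

```python
import math

def f_beta(precision, recall, beta):
    """General F-beta score: beta > 1 weights recall more heavily."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (precision, recall, F1, F2) as reported in the output above
reported = {
    'KNeighborsClassifier':   (0.61609, 0.16850, 0.26463, 0.19715),
    'DecisionTreeClassifier': (0.26383, 0.26000, 0.26190, 0.26076),
    'GaussianNB':             (0.24564, 0.37350, 0.29637, 0.33828),
}
for name, (p, r, f1, f2) in reported.items():
    print("%-24s F1: %.4f  F2: %.4f" % (name, f_beta(p, r, 1), f_beta(p, r, 2)))

# Assumed split arithmetic: 10% of the rows, rounded up, per split.
test_rows_per_split = math.ceil(0.1 * 144)
print(test_rows_per_split * 1000)  # 15000
```

Because F2 weights recall more heavily than precision, GaussianNB's F2 score is the highest of the three even though its accuracy is the lowest.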

Given those results, I decided to perform more extensive testing and parameter tuning for KNeighborsClassifier and DecisionTreeClassifier, and ultimately discarded KNeighborsClassifier when neither feature selection nor parameter tuning produced a sufficient recall score. GaussianNB offers few parameters to tune, so I ruled it out early on, despite its relatively high recall score.

#### From Question 2
> What features did you end up using in your POI identifier, and what selection process did you use to pick them?...

### Feature Importance Inspection

Having tested a few classifiers, I examined feature importances by a number of different methods (DecisionTreeClassifier's built-in 'Gini' impurity metric, f_classif's 'ANOVA' f-values and p-values, various settings for both via GridSearchCV) before finding meaningful increases in my algorithm's performance by using 'mutual information' values derived from `sklearn.feature_selection.mutual_info_classif()`:

In [112]:
# for feature, label extraction
features_list = ['poi',
                 'salary',
                 'bonus',
                 'long_term_incentive',
                 'expenses',
                 'director_fees',
                 'other',
                 'loan_advances',
                 'deferred_income',
                 'deferral_payments',
                 'total_payments',
                 'restricted_stock_deferred',
                 'exercised_stock_options',
                 'restricted_stock',
                 'total_stock_value',
                 'from_messages',
                 'to_messages',
                 'from_poi_to_this_person',
                 'from_this_person_to_poi',
                 'shared_receipt_with_poi',
                 'from_poi_to_messages_ratio',
                 'to_poi_from_messages_ratio',
                 'shared_receipt_to_messages_ratio']

# Extracting features and labels from dataset for local testing
print("Extracting features and labels... ", end='')
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
print("done.\n")

from sklearn.feature_selection import mutual_info_classif
print("\nFeature importance by mutual_info_classif:")
print(" ('mutual info' with regard to 'poi' target)")
# sorting feature names by magnitude of mutual information with 'poi'
# (see references.txt for code example used with zip() sorting of two lists)
mutual_info = sorted(zip(list(mutual_info_classif(features, labels)),
                         features_list[1:]), reverse = True)
for i in range(len(mutual_info)):
  print(" ", i+1, "- '%s'" % mutual_info[i][1],
        "\n        %.5f"   % mutual_info[i][0])

Extracting features and labels... done.


Feature importance by mutual_info_classif:
 ('mutual info' with regard to 'poi' target)
  1 - 'expenses' 
        0.08268
  2 - 'bonus' 
        0.07840
  3 - 'other' 
        0.06814
  4 - 'shared_receipt_with_poi' 
        0.06151
  5 - 'to_poi_from_messages_ratio' 
        0.05606
  6 - 'total_stock_value' 
        0.03748
  7 - 'restricted_stock' 
        0.03558
  8 - 'from_poi_to_messages_ratio' 
        0.02978
  9 - 'shared_receipt_to_messages_ratio' 
        0.02889
  10 - 'from_this_person_to_poi' 
        0.02545
  11 - 'director_fees' 
        0.02352
  12 - 'salary' 
        0.02239
  13 - 'total_payments' 
        0.01453
  14 - 'from_poi_to_this_person' 
        0.01436
  15 - 'from_messages' 
        0.01259
  16 - 'exercised_stock_options' 
        0.01177
  17 - 'to_messages' 
        0.00828
  18 - 'long_term_incentive' 
        0.00677
  19 - 'restricted_stock_deferred' 
        0.00313
  20 - 'loan_advances' 
        0.0016

The exact meaning of these scores takes some background to unpack, but this quote from the [Wikipedia entry for mutual information](https://en.wikipedia.org/wiki/Mutual_information) is relatively concise:

>In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

>Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair ( X , Y ) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).
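The definition quoted above can be made concrete with a small discrete calculation. The sketch below is only an illustration: the function name `mutual_information` and the toy label lists are made up for this example, and it uses plain counting on discrete values, which is not the estimator that `mutual_info_classif` applies to continuous features.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) of two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint probability with the product of the marginals
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy data: a binary target and three binary "features"
target      = [0, 0, 0, 0, 1, 1, 1, 1]
identical   = [0, 0, 0, 0, 1, 1, 1, 1]   # fully determines the target
informative = [0, 0, 0, 1, 1, 1, 1, 0]   # agrees with the target 6 times out of 8
independent = [0, 1, 0, 1, 0, 1, 0, 1]   # tells us nothing about the target

mi_identical = mutual_information(identical, target)
mi_informative = mutual_information(informative, target)
mi_independent = mutual_information(independent, target)

print("identical:   %.5f" % mi_identical)
print("informative: %.5f" % mi_informative)
print("independent: %.5f" % mi_independent)
```

A feature that fully determines the target scores one full bit here, an unrelated feature scores zero, and a partly related feature lands in between. That ordering is the sense in which higher scores in the ranking above suggest more useful features.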

Repeated executions of the above script return slightly varying results because of the estimation method described in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html?highlight=mutual_info_classif#sklearn.feature_selection.mutual_info_classif) for `mutual_info_classif`:

>The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.

Because of that slight variation, I ran a GridSearchCV (described later) several times with this metric for feature selection, and found that a particular subset of features ranked consistently high.
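One way to make "consistently high ranking" checkable is to tally how often each feature lands in the top k across runs. The sketch below is hypothetical: `stable_top_k` is a helper invented for this illustration, the first list copies the top of the ordering printed above, and the other two lists are made-up reorderings rather than recorded outputs.

```python
from collections import Counter


def stable_top_k(rankings, k, min_fraction=1.0):
    """Return the features found in the top k of at least min_fraction of runs."""
    counts = Counter(name for ranking in rankings for name in ranking[:k])
    needed = min_fraction * len(rankings)
    return sorted(name for name, n in counts.items() if n >= needed)


# Run 1 is the top of the ranking printed above; runs 2 and 3 are
# hypothetical reorderings invented for illustration only.
runs = [
    ['expenses', 'bonus', 'other', 'shared_receipt_with_poi',
     'to_poi_from_messages_ratio', 'total_stock_value', 'restricted_stock'],
    ['bonus', 'expenses', 'shared_receipt_with_poi', 'other',
     'to_poi_from_messages_ratio', 'restricted_stock', 'total_stock_value'],
    ['other', 'expenses', 'bonus', 'to_poi_from_messages_ratio',
     'total_stock_value', 'shared_receipt_with_poi', 'restricted_stock'],
]

always_top_5 = stable_top_k(runs, k=5, min_fraction=1.0)
mostly_top_5 = stable_top_k(runs, k=5, min_fraction=0.6)

print("In the top 5 of every run:     ", always_top_5)
print("In the top 5 of most runs (60%):", mostly_top_5)
```

With real rankings collected from repeated runs, lowering `min_fraction` trades strictness for tolerance of the run-to-run noise described above.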

### Features Used in Final Algorithm

My final algorithm used only these features:

 - 'expenses'
 - 'other'
 - 'bonus'
 - 'to_poi_from_messages_ratio'
 - 'shared_receipt_with_poi'

Using only these features in my algorithm testing (with the same `classifier_test` function used above and described below) increased precision and recall by roughly 0.2 each. The exact figures varied by 0.01 to 0.02 between runs because of the slight randomness inherent in Decision Tree classification.

#### From Question 2

>Did you have to do any scaling? Why or why not?

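I did not scale my features, because DecisionTreeClassifier does not require scaling: each split compares a single feature against a threshold, so the relative magnitudes of different features do not affect which splits are chosen.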
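The reasoning can be sketched with a single-feature split search. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: `entropy` and `best_split` are hypothetical helpers, and the `expenses` values and `poi` labels are made up. It picks the threshold split with the largest information gain, then repeats the search after min-max scaling.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def best_split(values, labels):
    """Return (information_gain, left_indices) for the best threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = entropy(labels)
    best_gain, best_left = 0.0, frozenset()
    for cut in range(1, len(order)):
        # A threshold can only fall between two distinct values
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left = [labels[i] for i in order[:cut]]
        right = [labels[i] for i in order[cut:]]
        children = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(order)
        gain = parent - children
        if gain > best_gain:
            best_gain, best_left = gain, frozenset(order[:cut])
    return best_gain, best_left


# Made-up feature values and labels, for illustration only
expenses = [1200.0, 56000.0, 3400.0, 98000.0, 15000.0, 72000.0]
poi      = [0, 1, 0, 1, 0, 1]

# Min-max scaling to the range [0, 1]
lowest, highest = min(expenses), max(expenses)
scaled = [(v - lowest) / (highest - lowest) for v in expenses]

gain_raw, left_raw = best_split(expenses, poi)
gain_scaled, left_scaled = best_split(scaled, poi)

print("Same rows on the left side:", left_raw == left_scaled)
print("Same information gain:     ", gain_raw == gain_scaled)
```

Because the gain depends only on which rows fall on each side of the threshold, and scaling preserves the sort order of the values, the same rows are separated before and after scaling.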


#### From Question 4
> What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune?...

Tuning the parameters of a classification algorithm means iteratively adjusting the settings that control a classifier's internal behavior, with the aim of reducing runtime and/or improving evaluation metrics such as accuracy, precision, recall, or composites like F1 and F2. Tuning poorly can mean spending far more time on optimization than necessary, or ending up with results that fall well short of what the classifier can achieve. Parameter tuning is therefore crucial to making full use of a classifier's potential.

I tuned my algorithm iteratively, starting with GridSearchCV over SelectKBest and DecisionTreeClassifier and optimizing parameters for maximized 'F1', which tended to maximize both precision and recall.

I tried various combinations of feature selection metrics and splitting criteria, eventually finding that the best performance could be achieved by using mutual information as the feature selection metric (`SelectKBest(score_func=mutual_info_classif)`) and information gain as the splitting criterion (`DecisionTreeClassifier(criterion='entropy')`). Because GridSearchCV is expensive to run, I minimized runtime by keeping those parameters fixed and then optimizing for 'k' in SelectKBest (number of features selected) and 'min_samples_split' in DecisionTreeClassifier (minimum samples required for internal node splitting):

In [None]:
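from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection   import GridSearchCV
from sklearn.pipeline          import Pipeline
from sklearn.tree              import DecisionTreeClassifier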

# Using mutual information as feature selection metric
selector = SelectKBest(mutual_info_classif)
# Using information gain as splitting criterion
classifier = DecisionTreeClassifier(criterion = 'entropy')

tune_pipe = Pipeline(steps=[('skb', selector),
                            ('clf', classifier)])

# Optimizing number of features and minimum number of samples for splitting
grid_params = {'skb__k' : (3, 4, 5, 6, 7, 8, 9),
               'clf__min_samples_split' : (3, 4, 5, 6, 7, 8, 9)}

print("Trying GridSearchCV with")
pp(tune_pipe)
print("over parameters:")
pp(grid_params)

# Optimizing for maximized F1 in order to maximize precision and recall
grid = GridSearchCV(tune_pipe, grid_params, scoring = 'f1', cv = 10,
                    n_jobs = -1)
grid.fit(features, labels)

print("\nResulting 'best' parameters for maximizing 'f1':")
pp(grid.best_params_)

# sorting features by their paired mutual information scores from SelectKBest
grid_ftrs = sorted(zip(list(grid.best_estimator_.named_steps['skb'].scores_),
                             features_list[1:]), reverse = True)
# creating list to pass to k-fold testing method
best_features = ['poi']
print("\nFeatures used:")
for i in range(grid.best_params_['skb__k']):
  best_features.append(grid_ftrs[i][1])
  print(" ", i+1, "- '%s'" % grid_ftrs[i][1],
        "\n        %.5f"   % grid_ftrs[i][0])
print('')

# 5-B - Testing tuned parameters over 1000 stratified shuffle splits
classifier_test(grid.best_estimator_.named_steps['clf'],data_dict,
                best_features)

As shown above, I passed the best-parameter classifier and best-performing features from GridSearchCV to my previously used testing function, which evaluates over 1000 stratified shuffle splits, in order to observe performance results.

Across multiple executions of this code, the resulting precision and recall were always good enough for project requirements, but they varied because GridSearchCV returned slightly different values for 'k' and 'min_samples_split' and slightly different selected features. This is likely normal, given the variation possible in mutual information scores (described above under 'Feature Importance Inspection') and the randomness in how DecisionTreeClassifier chooses its node splits.

Across these runs, performance appeared to be greatest when 'min_samples_split' was 5, 'k' was 5, and the features used were as described above. Because of that, I switched to testing those values manually and observing their repeated results from the testing function:

In [100]:
# features (apart from 'poi') tending to be top-ranked in mutual information
manual_features = ['poi',
                   'expenses',
                   'bonus',
                   'other',
                   'to_poi_from_messages_ratio',
                   'shared_receipt_with_poi']

# parameter settings tending to result in highest precision and recall
clf = DecisionTreeClassifier(criterion = 'entropy',
                             min_samples_split = 5)

print("Trying DecisionTreeClassifier with parameter settings and feature")
print("  selection based on 'best' of varying results from optimization...")
print("  (features *reliably* top-ranked by 'mutual information' with 'poi')")
print("Features used:")
pp(manual_features[1:])
classifier_test(clf, data_dict, manual_features)

# Dumping classifier, dataset, features list, and running tester.py for final test
import tester
print("Testing final classifier via tester.py...")
tester.dump_classifier_and_data(clf, data_dict, manual_features)
tester.main()

Trying DecisionTreeClassifier with parameter settings and feature
  selection based on 'best' of varying results from optimization...
  (features *reliably* top-ranked by 'mutual information' with 'poi')
Features used:
['expenses',
 'bonus',
 'other',
 'to_poi_from_messages_ratio',
 'shared_receipt_with_poi']
DecisionTreeClassifier(criterion='entropy', min_samples_split=5)
  Predictions: 13000
  Accuracy: 0.84300
  Precision: 0.48905  Recall: 0.45800
  F1: 0.47302  F2: 0.46389 

Testing final classifier via tester.py...
DecisionTreeClassifier(criterion='entropy', min_samples_split=5)
Accuracy: 0.84492  Precision: 0.49570  Recall: 0.46150
    F1: 0.47799  F2: 0.46796
Total predictions: 13000
  True positives:  923  False positives:  939
  False negatives: 1077  True negatives: 10061



#### From Question 3
> What algorithm did you end up using?

As shown above, my final algorithm was a DecisionTreeClassifier set to use entropy (information gain) as its splitting criterion and a minimum of 5 samples for splitting internal nodes. It was applied to the subset of features that most consistently ranked highest in mutual information with 'poi': 'expenses', 'bonus', 'other', 'to_poi_from_messages_ratio', and 'shared_receipt_with_poi'.

#### Question 5
> What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?

Generally, validation means checking the results of a predictive method on data kept independent of the data used to train it. The classic mistake is letting training and testing data mix. This causes a form of "overfitting" in which accuracy and other metrics come out higher than would be possible on truly independent data, because the classifier recognizes feature values it already saw during training.

I validated my analysis by using repeated stratified shuffle-split cross validation within the function shown above, `classifier_test`, in a manner equivalent to that applied by `tester.py`: the dataset was split by `StratifiedShuffleSplit` with a default of `1000` for its `n_splits` parameter, and performance metrics were calculated from the pooled totals of all predictions, true positives, false positives, true negatives, and false negatives across every split.
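The pooling of counts across splits can be sketched without scikit-learn. Everything in this block is a stand-in chosen for illustration: `stratified_shuffle_splits` is a simplified pure-Python substitute for `StratifiedShuffleSplit`, `fit_threshold` is a toy one-feature classifier rather than a decision tree, and the data is randomly generated. It is not the `classifier_test` function itself.

```python
import random


def stratified_shuffle_splits(labels, n_splits, test_fraction, seed):
    """Yield (train_idx, test_idx) pairs that preserve the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test_idx.extend(shuffled[:n_test])
            train_idx.extend(shuffled[n_test:])
        yield train_idx, test_idx


def fit_threshold(xs, ys):
    """Toy classifier: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


# Made-up data: 16 negative rows and 4 positive rows with one feature
data_rng = random.Random(0)
labels = [0] * 16 + [1] * 4
values = ([data_rng.gauss(1.0, 1.0) for _ in range(16)]
          + [data_rng.gauss(3.0, 1.0) for _ in range(4)])

# Counts are pooled over every split before any metric is computed
tp = fp = fn = tn = 0
for train_idx, test_idx in stratified_shuffle_splits(labels, 100, 0.25, seed=42):
    # Fit on training rows only, so test rows stay unseen
    threshold = fit_threshold([values[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    for i in test_idx:
        predicted = 1 if values[i] > threshold else 0
        if predicted == 1 and labels[i] == 1:
            tp += 1
        elif predicted == 1 and labels[i] == 0:
            fp += 1
        elif predicted == 0 and labels[i] == 1:
            fn += 1
        else:
            tn += 1

print("Total predictions:", tp + fp + fn + tn)
print("TP:", tp, " FP:", fp, " FN:", fn, " TN:", tn)
```

Each split draws its test rows independently, so a row can appear in many test sets. That is why the total number of predictions can be far larger than the number of rows in the dataset.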

#### Question 6
> Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

Across multiple executions of the testing shown above, accuracy tended to hover around 0.84, precision around 0.49, and recall around 0.46. This means that my algorithm correctly predicted a row's person-of-interest status around 84% of the time. About 49% of the individuals it flagged as POIs actually were POIs, so correct positive predictions were only slightly fewer than incorrect positive predictions. It also identified about 46% of the actual POIs in the test data.
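These figures follow directly from the confusion counts reported by `tester.py` above. The short calculation below uses those four counts with the standard definitions of each metric; the helper name `f_beta` is chosen for this illustration.

```python
# Confusion counts copied from the tester.py output above
tp, fp, fn, tn = 923, 939, 1077, 10061

total = tp + fp + fn + tn

accuracy = (tp + tn) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # share of flagged individuals who were POIs
recall = tp / (tp + fn)         # share of actual POIs that were flagged


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (beta > 1 favors recall)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)

print("Total predictions: %d" % total)
print("Accuracy: %.5f  Precision: %.5f  Recall: %.5f" % (accuracy, precision, recall))
print("F1: %.5f  F2: %.5f" % (f1, f2))
```

Accuracy is dominated by the 10061 true negatives because non-POIs far outnumber POIs, which is why precision and recall say more about how well the classifier finds POIs.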