# Enron Scandal: Identifying Persons of Interest

**Identification of Enron employees who may have committed fraud**

**Supervised Learning. Classification**

Data: [Enron financial dataset from Udacity](https://github.com/udacity/ud120-projects/tree/master/final_project)

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras

helper.info_gpu()
# sns.set_palette("Reds")
helper.reproducible(seed=0)  # setup reproducible results from run to run using Keras

%matplotlib inline
%load_ext autoreload
%autoreload

## 1. Data Processing and Exploratory Data Analysis

### Load the Data

In [None]:
data_path = "data/enron_financial_data.pkl"
target = ["poi"]

df = pd.read_pickle(data_path)
df = pd.DataFrame.from_dict(df, orient="index")

### Explore the Data

In [None]:
helper.info_data(df, target)

**Imbalanced target: accuracy alone is misleading here, so the evaluation metric used in this problem is the Area Under the ROC Curve** <br>
**poi** =  person of interest (boolean) <br>
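
To make the metric concrete, here is a minimal sketch of ROC AUC computed as the fraction of (positive, negative) pairs that the scores rank correctly. The function `roc_auc` and the toy labels and scores are made up for illustration; the notebook itself relies on `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute ROC AUC")
    # Count pairwise wins; a tie counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 positives out of 10 samples
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
constant = [0.0] * 10  # a model that always answers "not POI"
ranked = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.35]

auc_constant = roc_auc(y_true, constant)
auc_ranked = roc_auc(y_true, ranked)
print(auc_constant, auc_ranked)
```

The constant model would be 80% accurate on these labels, yet its AUC stays at chance level, which is why AUC is the more informative metric for an imbalanced target.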

In [None]:
df.head(3)

### Transform the data

In [None]:
# delete 'TOTAL' row (at the bottom)
if "TOTAL" in df.index:
    df.drop("TOTAL", axis="index", inplace=True)

# convert dataframe values (objects) to numerical. There are no categorical features
df = df.apply(pd.to_numeric, errors="coerce")
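
As a rough illustration of what `errors="coerce"` does, the sketch below converts each value to a float and falls back to NaN when parsing fails. It is plain Python rather than pandas, `to_number` is a hypothetical helper, and the record is made up; it assumes missing entries are stored as non-numeric placeholders.

```python
import math


def to_number(value):
    """Return value as a float, or NaN when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Made-up record mixing numbers, numeric strings, and text
raw = {
    "salary": 1000,
    "bonus": "2500",
    "email_address": "someone@example.com",
    "loan_advances": "NaN",
}
converted = {key: to_number(value) for key, value in raw.items()}
print(converted)
```

Under this behaviour a free-text column such as `email_address` ends up entirely NaN, so it carries no information for the model.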

#### Missing features

In [None]:
helper.missing(df)

Features with many missing values, such as 'loan_advances', are kept because they are needed to obtain better models

#### Remove irrelevant features

In [None]:
df.drop("email_address", axis="columns", inplace=True)

#### Classify variables

In [None]:
num = list(df.select_dtypes(include=[np.number]))

df = helper.classify_data(df, target, numerical=num)

helper.get_types(df)

#### Fill missing values

In [None]:
# Replace NaN values with the median of each column
df.fillna(df.median(), inplace=True)
# helper.fill_simple(df, target, inplace=True) # same result

### Visualize the data

In [None]:
df.describe(percentiles=[0.5]).astype(int)

#### Numerical features

In [None]:
helper.show_numerical(df, kde=True, ncols=5)

#### Target vs Numerical features

In [None]:
helper.show_target_vs_numerical(df, target, jitter=0.05, point_size=50, ncols=5)

#### Total stock value vs some features 

In [None]:
# df.plot.scatter(x='salary', y='total_stock_value')
# df.plot.scatter(x='long_term_incentive', y='total_stock_value')

# sns.lmplot(x="salary", y="total_stock_value", hue='poi', data=df)
# sns.lmplot(x="long_term_incentive", y="total_stock_value", hue='poi', data=df)

g = sns.PairGrid(
    df,
    y_vars=["total_stock_value"],
    x_vars=["salary", "long_term_incentive", "from_this_person_to_poi"],
    hue="poi",
    height=4,  # `size` was renamed to `height` in seaborn 0.9
)
g.map(sns.regplot).add_legend()
plt.ylim(0, 0.5e8)  # positional limits; the ymin/ymax keywords are gone from recent matplotlib

# sns.pairplot(df, hue='poi', vars=['long_term_incentive', 'total_stock_value', 'from_poi_to_this_person'], kind='reg', height=3)

Persons of interest seem to hold a higher total stock value relative to their salary and long-term incentive, especially when the stock value is high. POI status shows no clear dependency on the number of emails sent to or received from other persons of interest.

#### Correlation between numerical features and target

In [None]:
helper.correlation(df, target)

## 2. Neural Network model

### Select the features

In [None]:
droplist = []  # features to drop from the model

# The model uses a copy, 'data', so that 'df' stays untouched
data = df.copy()
data.drop(droplist, axis="columns", inplace=True)
data.head(3)

### Scale numerical features
Shift and scale numerical variables to zero mean and unit variance (standardization). The scaling factors are saved so the same transformation can be applied when making predictions.

In [None]:
data, scale_param = helper.scale(data)
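
The internals of `helper.scale` are not shown in this notebook. A minimal sketch of the idea, assuming plain standardization, could look like this; `fit_scaler`, `transform`, and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, std) pair needed to standardize one feature."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid dividing by zero
    return mu, sigma


def transform(values, params):
    """Apply previously saved scaling parameters to new values."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Toy feature (hypothetical values)
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
params = fit_scaler(feature)         # saved for later use
scaled = transform(feature, params)  # zero mean, unit variance
new_scaled = transform([7.0], params)  # a new sample reuses the same parameters
print(params, new_scaled)
```

Keeping `params` around is the point: a sample seen at prediction time must be scaled with the parameters fitted earlier, not with its own statistics.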

There are no categorical variables

### Split the data into training and test sets
Data leakage: the test set is hidden when training the model, but it was seen when preprocessing the dataset (the median fill and the scaling above were computed on all rows)

No validation set (small dataset)
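
One way to avoid this kind of leakage, sketched here under the assumption that the split happens before preprocessing, is to compute the fill value on the training rows only and reuse it on the test rows. The helpers `fit_median` and `fill`, and the toy columns, are hypothetical and not part of this notebook.

```python
import math
from statistics import median


def fit_median(column):
    """Median of the observed (non-NaN) values of a training column."""
    return median(v for v in column if not math.isnan(v))


def fill(column, value):
    """Replace NaN entries with a value learned elsewhere."""
    return [value if math.isnan(v) else v for v in column]


# Toy feature already split into train and test (hypothetical values)
train_col = [1.0, math.nan, 3.0, 100.0]
test_col = [math.nan, 2.0]

train_median = fit_median(train_col)        # learned from the training rows only
train_filled = fill(train_col, train_median)
test_filled = fill(test_col, train_median)  # the test rows never influence the median
print(train_median, train_filled, test_filled)
```

With a dataset this small the practical effect may be minor, but the reported test score is then an honest estimate.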

In [None]:
test_size = 0.4
random_state = 9

x_train, y_train, x_test, y_test = helper.simple_split(data, target, True, test_size, random_state)

### Encode the output

In [None]:
y_train, y_test = helper.one_hot_output(y_train, y_test)
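
For reference, a minimal sketch of one-hot encoding a boolean target into two columns. It assumes column 0 stands for "not POI" and column 1 for "POI", which is consistent with the later use of `y_train[:, 1]`; the function `one_hot` and the toy labels are made up for illustration.

```python
def one_hot(labels, n_classes=2):
    """Encode integer or boolean labels as rows with a single 1.0 per row."""
    return [
        [1.0 if int(label) == k else 0.0 for k in range(n_classes)]
        for label in labels
    ]


poi = [False, True, False]  # toy labels
encoded = one_hot(poi)
positive_column = [row[1] for row in encoded]  # recovers the 0/1 POI indicator
print(encoded, positive_column)
```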

In [None]:
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("test size  \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))

### Build a dummy classifier

In [None]:
helper.dummy_clf(x_train, y_train, x_test, y_test)
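
What `helper.dummy_clf` reports is not shown here. As a point of comparison, a majority-class baseline can be sketched as follows; `majority_baseline` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Always predict the most frequent training class and score the result."""
    majority = Counter(train_labels).most_common(1)[0][0]
    predictions = [majority] * len(test_labels)
    accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
    positives = sum(1 for t in test_labels if t == 1)
    hits = sum(1 for p, t in zip(predictions, test_labels) if p == 1 and t == 1)
    recall = hits / positives if positives else 0.0
    return majority, accuracy, recall


toy_train = [0] * 9 + [1]  # 10% positives
toy_test = [0, 0, 0, 0, 1]
majority, accuracy, recall = majority_baseline(toy_train, toy_test)
print(majority, accuracy, recall)
```

A high accuracy with zero recall is the signature of an imbalanced target: any real model has to beat this baseline on a metric that rewards finding the positives.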

### Build the Neural Network for Binary Classification

In [None]:
# class weights to compensate for the imbalanced target

cw = helper.get_class_weight(y_train[:, 1])
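
How `helper.get_class_weight` derives the weights is not shown. A common heuristic, assumed here purely for illustration, weights each class by `n_samples / (n_classes * class_count)`, so the rare class counts more in the loss; `balanced_class_weight` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weight(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


toy_labels = [0] * 8 + [1] * 2  # 20% positives
weights = balanced_class_weight(toy_labels)
print(weights)
```

A dictionary of this shape (class index to weight) is what the `class_weight` argument of Keras `fit` accepts.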

In [None]:
model_path = os.path.join("models", "enron_scandal.h5")

model = None
model = helper.build_nn_clf(x_train.shape[1], y_train.shape[1], dropout=0.3, summary=True)

helper.train_nn(model, x_train, y_train, class_weight=cw, path=model_path)

from sklearn.metrics import roc_auc_score

y_pred_train = model.predict(x_train, verbose=0)
print("\nROC_AUC train:\t{:.2f} \n".format(roc_auc_score(y_train, y_pred_train)))


### Evaluate the model

In [None]:
# Dataset too small for separate train, validation, and test sets; more data is needed
y_pred = model.predict(x_test, verbose=0)

helper.binary_classification_scores(y_test[:, 1], y_pred[:, 1], return_dataframe=True, index="DNN")
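
The exact scores returned by `helper.binary_classification_scores` are not listed in this notebook. A minimal sketch of the usual threshold-based scores, assuming a 0.5 cut-off on the predicted POI probability, could look like this; `binary_scores` and the toy values are hypothetical.

```python
def binary_scores(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and F1 from predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels and predicted probabilities (hypothetical)
toy_true = [0, 0, 0, 1, 1, 0]
toy_prob = [0.1, 0.6, 0.2, 0.8, 0.4, 0.3]
scores = binary_scores(toy_true, toy_prob)
print(scores)
```

Unlike ROC AUC, these scores depend on the chosen threshold, so they are best read alongside it rather than instead of it.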

### Compare with non-neural network models

In [None]:
helper.ml_classification(x_train, y_train[:, 1], x_test, y_test[:, 1])