# Using an Autoencoder to Identify Credit Card Fraud
*Author: A. Trahan* <br>
*Date:   August 2019*

### Introduction

An autoencoder is a neural network shaped like an hourglass (narrow in the center) that is trained to reproduce its own inputs. The goal is to build a compact latent representation of the inputs (the narrow part) that can still be reliably expanded back to the original data. The model is trained on "good" data only; anomalous data submitted to the model cannot be distilled to a latent representation in the same way, which makes it easier to separate.
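
The idea can be illustrated without a neural network. The sketch below swaps in a linear autoencoder (PCA computed with an SVD) and scores points by reconstruction error rather than by their latent representation, so it is a simplified stand-in and not the method used later in this notebook. All data and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Good" data: 2-D points lying close to the line y = 2x (made-up data)
t = rng.normal(size=200)
good = np.column_stack([t, 2 * t]) + rng.normal(scale=0.05, size=(200, 2))

# Anomalies: points placed well away from that line
anomalies = np.array([[1.0, -2.0], [-1.5, 3.0], [2.0, 0.0]])

# "Train" on good data only: the first principal direction is the 1-D bottleneck
mean = good.mean(axis=0)
_, _, vt = np.linalg.svd(good - mean, full_matrices=False)
w = vt[:1]  # shape (1, 2): maps 2-D inputs to a 1-D latent value

def encode(x):
    """Compress 2-D points to the 1-D latent representation."""
    return (x - mean) @ w.T

def decode(z):
    """Expand latent values back to 2-D."""
    return z @ w + mean

def reconstruction_error(x):
    """Squared distance between each point and its reconstruction."""
    return np.sum((x - decode(encode(x))) ** 2, axis=1)

good_err = reconstruction_error(good)
anom_err = reconstruction_error(anomalies)
print(f'Largest error on good data:  {good_err.max():.4f}')
print(f'Smallest error on anomalies: {anom_err.min():.4f}')
```

Because the bottleneck was fit only to the good data, the anomalies cannot be compressed and restored as faithfully, which is the property the rest of the notebook relies on.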

This notebook uses classic and neural network methods to attempt to detect credit card fraud. The dataset contains 28 features generated with PCA from anonymized transactions by European cardholders (plus Time and Amount), so feature engineering options are limited.

The process consists of:

* Resample and split the unbalanced dataset
* Train an SVM as a base model, aiming for ROC_AUC and Recall
* Train an autoencoder
    * Visual inspection of autoencoder
      * Use t-SNE to plot initial data in $R^2$
      * Use t-SNE to plot latent representation in $R^2$
    * Train LogReg on latent representation, aiming for ROC_AUC and Recall

### Module and Data Imports

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.layers import Dense, Input
from keras.models import Model, Sequential
from keras import regularizers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score, recall_score

In [None]:
# I/O Directories and Files
din = './input'
dout = './output'
fin = 'creditcard.csv'
save_figs = True

# Read data
df = pd.read_csv(f'{din}/{fin}')

### Resampling and Train/Test Split

As with many fraud datasets, this one is heavily unbalanced: there are many more non-fraud cases than fraud cases. This can lead a model to be overzealous in marking cases as non-fraud, since that answer is correct so often. Along with metric selection, undersampling the majority class can help offset this issue. The resampled dataset is then split into training and test sets.
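
To make the problem concrete, here is a minimal sketch with hypothetical class counts (10,000 non-fraud and 20 fraud cases; these numbers are illustrative and not taken from the dataset). A "model" that always predicts non-fraud scores very high accuracy while catching no fraud at all, which is why Recall and ROC_AUC are the metrics of interest here.

```python
# Hypothetical label counts, chosen only for illustration
y_true = [0] * 10_000 + [1] * 20

# A degenerate "model" that always predicts the majority class (non-fraud)
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: fraction of actual fraud cases that were caught
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f'Accuracy: {accuracy:.4f}')
print(f'Recall:   {recall:.4f}')
```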

In [None]:
# Undersample the majority class
nfraud = df.loc[df['Class']==0].sample(1000) # Select only 1000 non-fraud cases (at random)
fraud = df.loc[df['Class']==1]               # Select all the fraud cases
df_resamp = (pd.concat([nfraud, fraud])      # Concatenate the two datasets (DataFrame.append was removed in pandas 2.0)
                   .sample(frac=1)           # Resample to randomize order
                   .reset_index(drop=True))  # Reset the index because it's irrelevant

# Train/test split
X = df_resamp.drop(['Class', 'Time'], axis=1)
y = df_resamp['Class']
X_train_usamp, X_test_usamp, y_train_usamp, y_test_usamp = train_test_split(X.values, y.values, test_size=0.3, random_state=0)

## Train the SVM Model (SVC)

SVM models are a common first step in ML projects (and sometimes the only one needed), so this one will be used as a baseline against which to compare the quality of predictions from the autoencoder.

In [None]:
# Scale data to (0,1)
scaler = MinMaxScaler().fit(X_train_usamp)
X_train_usamp_sc = scaler.transform(X_train_usamp)
X_test_usamp_sc = scaler.transform(X_test_usamp)

# Train the model and generate predictions
model = SVC(probability=True).fit(X_train_usamp_sc, y_train_usamp)
y_tr_pred_usamp = model.predict(X_train_usamp_sc)
y_te_pred_usamp = model.predict(X_test_usamp_sc)
y_tr_prob_usamp = model.predict_proba(X_train_usamp_sc)
y_te_prob_usamp = model.predict_proba(X_test_usamp_sc)
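
The scaler above is fit on the training set only and then applied to both sets, so test values outside the training range can land outside [0, 1]. The following is a minimal NumPy sketch of what `MinMaxScaler` computes with its default settings, using small made-up arrays; it does not handle constant columns, which the real scaler does.

```python
import numpy as np

# Made-up data for illustration: 3 training rows, 2 test rows, 2 features
X_tr = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_te = np.array([[2.5, 15.0], [12.0, 5.0]])

# "Fit": learn the per-feature minimum and range from the training set only
col_min = X_tr.min(axis=0)
col_rng = X_tr.max(axis=0) - col_min

# "Transform": apply the same training statistics to both sets
X_tr_sc = (X_tr - col_min) / col_rng
X_te_sc = (X_te - col_min) / col_rng

print(X_tr_sc)
print(X_te_sc)  # the second test row falls outside [0, 1]
```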

In [None]:
# Print metrics
print('Training ROC_AUC: {:.04f}'.format(roc_auc_score(y_train_usamp, y_tr_prob_usamp[:,1])))
print('Test ROC_AUC: {:.04f}'.format(roc_auc_score(y_test_usamp, y_te_prob_usamp[:,1])))
print('')
print('Training Recall: {:.04f}'.format(recall_score(y_train_usamp, y_tr_pred_usamp)))
print('Test Recall: {:.04f}'.format(recall_score(y_test_usamp, y_te_pred_usamp)))
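
For reference, both metrics can be computed by hand on a toy example (the labels and scores below are made up). Recall is the share of actual fraud cases that were flagged. ROC_AUC equals the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case, counting ties as one half.

```python
# Made-up labels, hard predictions, and fraud-class scores
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]

# Recall = TP / (TP + FN)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

# ROC_AUC via pairwise comparison of fraud scores against non-fraud scores
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
roc_auc = wins / (len(pos) * len(neg))

print(f'Recall:  {recall:.4f}')
print(f'ROC_AUC: {roc_auc:.4f}')
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show what `roc_auc_score` measures, not to replace it.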

## Build the Autoencoder

### Visual Justification

Projecting the training data into two dimensions with t-SNE shows that the fraud and non-fraud cases are not cleanly separated, which suggests that the raw features are not linearly separable.

In [None]:
# Perform t-SNE and plot training data
tsne = TSNE(n_components=2, random_state=0)
X_t = tsne.fit_transform(X_train_usamp)

fh = plt.figure(figsize=(10,10))
plt.scatter(X_t[y_train_usamp==0,0], X_t[y_train_usamp==0,1], marker='o', color='g', label='Non Fraud')
plt.scatter(X_t[y_train_usamp==1,0], X_t[y_train_usamp==1,1], marker='o', color='r', label='Fraud')
plt.title('Fraud/Non-Fraud t-SNE Plot')
plt.legend()

if save_figs: fh.savefig(f'{dout}/tsne_raw_data.png')

### Build and Compile the Model

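The autoencoder is a stack of dense layers that narrows to a 50-unit bottleneck and then widens again. The first 100-unit layer carries an L1 activity regularizer, which encourages sparse activations.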
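
As a rough check on the size of this network, the trainable parameter count of each `Dense` layer is `inputs * units + units`. The sketch below assumes 29 input features (28 PCA features plus Amount, after dropping Class and Time); that figure is an assumption about the CSV, not something stated in this notebook.

```python
# Assumed input width: 28 PCA features + Amount (not confirmed by the notebook)
n_features = 29

# Layer widths in order: input, encoder (100, 50), decoder (50, 100), output
widths = [n_features, 100, 50, 50, 100, n_features]

total = 0
for n_in, n_out in zip(widths[:-1], widths[1:]):
    params = n_in * n_out + n_out  # weights plus biases of one Dense layer
    total += params
    print(f'{n_in:>3} -> {n_out:<3}: {params} parameters')

print(f'Total: {total} parameters')
```

If the assumption holds, the total should match what `autoencoder.summary()` reports once the model is constructed.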

In [None]:
# Build the autoencoder
X_train = X_train_usamp
inp_lyr = Input(shape=(X_train.shape[1],))
enc = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(inp_lyr)
enc = Dense(50, activation='relu')(enc)
dec = Dense(50, activation='relu')(enc)
dec = Dense(100, activation='relu')(dec)
out_lyr = Dense(X_train.shape[1], activation='relu')(dec)

In [None]:
# Construct model and compile
autoencoder = Model(inp_lyr, out_lyr)
autoencoder.compile(optimizer="adadelta", loss="mse")

### Fit the Model

The model is fit only to non-fraud cases, so fraud cases should produce anomalous latent representations.

In [None]:
# Build the model training set, min-max scaled non-fraud records
X_scale = MinMaxScaler().fit_transform( df.drop(['Class', 'Time'], axis=1).values )
x_sc_norm, x_sc_fraud = X_scale[df['Class'].values==0], X_scale[df['Class'].values==1]

# Random sample for fitting
n_fit_samp = 2000
fit_samp = x_sc_norm[np.random.choice(x_sc_norm.shape[0], n_fit_samp, replace=False),:]

In [None]:
# Train the model
autoencoder.fit(x=fit_samp, y=fit_samp,
               batch_size=256, epochs=15,
               shuffle=True, validation_split=0.2)

### Generate Latent Representations

The trained model can be used to create latent representations (the activations at the narrow point in the network). Ideally these are the smallest number of parameters that can still completely define the system, but that's a question for a mathematician. Here we settle for "small enough, while providing sufficient definition of the system."
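
Concretely, the latent representation is the output of the encoder half of the network: the input passed through the 100-unit `tanh` layer and then the 50-unit `relu` layer. The NumPy sketch below reproduces that forward pass with random stand-in weights (not trained ones) and an assumed input width of 29 features, just to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 29                  # assumed input width (not confirmed by the notebook)
X = rng.random((4, n_features))  # 4 made-up rows in [0, 1), like min-max scaled data

# Random stand-ins for the trained weights and biases of the two encoder layers
W1, b1 = rng.normal(scale=0.1, size=(n_features, 100)), np.zeros(100)
W2, b2 = rng.normal(scale=0.1, size=(100, 50)), np.zeros(50)

h = np.tanh(X @ W1 + b1)               # first encoder layer: 100 units, tanh
latent = np.maximum(h @ W2 + b2, 0.0)  # bottleneck layer: 50 units, relu

print(latent.shape)  # (4, 50)
```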

In [None]:
# Obtain latent representations
hidden_rep = Sequential(autoencoder.layers[:3])
norm_hid_rep = hidden_rep.predict(x_sc_norm[np.random.choice(x_sc_norm.shape[0], n_fit_samp*2, replace=False),:])
fraud_hid_rep = hidden_rep.predict(x_sc_fraud)

X_latent = np.append(norm_hid_rep, fraud_hid_rep, axis=0)
y_latent = np.append(np.zeros(norm_hid_rep.shape[0]), np.ones(fraud_hid_rep.shape[0]))

### Visual Comparison

Projecting the latent representations into two dimensions with t-SNE suggests that they are visibly more separable than the initial dataset (compare to the plot above).

In [None]:
# Visualize latent representations (with t-SNE)
tsne = TSNE(n_components=2, random_state=0)
X_t = tsne.fit_transform(X_latent)

fh = plt.figure(figsize=(10,10))
plt.scatter(X_t[y_latent==0,0], X_t[y_latent==0,1], marker='o', color='g', alpha=0.7, label='Non Fraud')
plt.scatter(X_t[y_latent==1,0], X_t[y_latent==1,1], marker='o', color='r', alpha=0.7, label='Fraud')
plt.title('Fraud/Non-Fraud t-SNE Plot')
plt.legend()

if save_figs: fh.savefig(f'{dout}/tsne_latent_rep.png')

### Fit a Logistic Regression to the Latent Representation

Now that the classes appear more separable, a logistic regression should provide reasonable prediction quality at low computational cost. The train-test split below is made on `X_t`, the two-dimensional t-SNE embedding of the latent representations, which also allows the predictions to be plotted directly in t-SNE space. ROC_AUC and Recall are then checked for comparison with the SVM above.

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_t, y_latent, test_size=0.3, random_state=0)

# Create and train model
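# The liblinear solver supports the L1 penalty; the current default solver (lbfgs) does not
model = LogisticRegression(penalty='l1', solver='liblinear').fit(X_train, y_train)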
y_tr_pred = model.predict(X_train)
y_te_pred = model.predict(X_test)
y_tr_prob = model.predict_proba(X_train)
y_te_prob = model.predict_proba(X_test)

In [None]:
# Print metrics
print('Training ROC_AUC: {:.04f}'.format(roc_auc_score(y_train, y_tr_prob[:,1])))
print('Test ROC_AUC: {:.04f}'.format(roc_auc_score(y_test, y_te_prob[:,1])))
print('')
print('Training Recall: {:.04f}'.format(recall_score(y_train, y_tr_pred)))
print('Test Recall: {:.04f}'.format(recall_score(y_test, y_te_pred)))

### Plot the Results

This is the same t-SNE plot as above, but points are color-coded based on how they were predicted (TN, FP, FN, TP).
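
The color-coding relies on packing each (true, predicted) pair into a single number: the true label goes in the tens digit and the predicted label in the ones digit, giving 0, 1, 10, or 11. A small sketch with made-up labels shows the mapping:

```python
# Made-up labels and predictions for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0]

# True label in the tens digit, predicted label in the ones digit
codes = [t * 10 + p for t, p in zip(y_true, y_pred)]

# Each code corresponds to one cell of the confusion matrix
names = {0: 'TN', 1: 'FP', 10: 'FN', 11: 'TP'}
outcomes = [names[c] for c in codes]

print(codes)
print(outcomes)
```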

In [None]:
# Plot correct/incorrect in tSNE space

set_name = 'Undersampling LogReg Test Set'
X = X_test
y_true = y_test
y_pred = y_te_pred

def gen_pred_res(y_true, y_pred):
    return y_true*10 + y_pred

pred_res = gen_pred_res(y_true, y_pred)
cases = { 0:{'label':'TN','color':'b', 'alpha':0.5}, 
          1:{'label':'FP','color':'r', 'alpha':0.8},
         10:{'label':'FN','color':'k', 'alpha':0.8},
         11:{'label':'TP','color':'g', 'alpha':0.5}}

fh = plt.figure(figsize=(10,10))

for case, params in cases.items():
    plt.scatter(X[pred_res==case,0], X[pred_res==case,1], marker='o', **params)

plt.title(f'{set_name} Prediction Errors in t-SNE Space')
plt.legend()

if save_figs: fh.savefig(f'{dout}/tsne_latent_rep_confusion.png')

## References
*Data and inspiration from Kaggle kernel: https://www.kaggle.com/shivamb/semi-supervised-classification-using-autoencoders*