# Credit Card Fraud Detection

The following analysis has been done on Kaggle's [Credit Card Fraud Detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud). A particularly helpful resource is TensorFlow's [tutorial](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data) on classification with imbalanced data.

All theoretical concepts are covered in my notes on ML/deep learning, which can be found [here](https://github.com/brownc1995/machine-learning-notes/blob/master/an_introduction_to_machine_learning.pdf).

## Context
It is important that credit card companies are able to recognize 
fraudulent credit card transactions so that customers are not 
charged for items that they did not purchase.

## Content
The [dataset](sample/data) contains transactions made by credit cards in 
September 2013 by European cardholders. This dataset presents 
transactions that occurred in two days, where we have 492 frauds 
out of 284,807 transactions. The dataset is highly unbalanced: 
the positive class (frauds) accounts for 0.172% of all transactions.

The data contains only numerical input variables which are the result
of a PCA transformation. Unfortunately, due to confidentiality 
issues, we cannot provide the original features and more 
background information about the data. Features `V1`, `V2`,..., 
`V28` are the principal components obtained with PCA; the only 
features which have not been transformed with PCA are `Time` 
and `Amount`. Feature `Time` contains the seconds elapsed between 
each transaction and the first transaction in the dataset. The 
feature `Amount` is the transaction amount. This feature can be 
used for example-dependent cost-sensitive learning. Feature `Class`
is the response variable and it takes value `1` in case of __fraud__ 
and `0` otherwise.

In [None]:
%load_ext autoreload
%autoreload 2
%load_ext tensorboard

In [None]:
import logging
import os
import warnings
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from ccfd import *
from ccfd.data import *
from ccfd.model import *
from ccfd.plot import *

In [None]:
warnings.filterwarnings("ignore")

In [None]:
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
EPOCHS = 1

## Import data and clean

First let's import the dataset and do some initial exploration. We remove the `time` column and split the data into features and target data.

In [None]:
ccfd_data = get_data()

We have highly imbalanced data! __Accuracy__ will therefore not be a good measure of success for our models. Instead, we shall look mostly at [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall).

So, what does our data look like?

In [None]:
ccfd_data.head()

In [None]:
ccfd_data.describe()

It looks like the `Amount` column covers a __very__ large range. As is common practice with monetary features in financial data, let's transform this column using a log transformation. We add a very small value before the transform to avoid undefined values where `Amount` equals 0. This column has already been added to the `.pkl` file.

In [None]:
ccfd_data['log_amount'] = np.log(ccfd_data.amount + 0.0001)
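
To see why the small offset matters, here is a minimal sketch on a handful of made-up amounts (they are not taken from the dataset). Only the offset `0.0001` comes from the cell above; the helper `log_amount` and the sample values are illustrative.

```python
import math

OFFSET = 0.0001  # same offset as in the cell above

# Hypothetical transaction amounts, including a zero-value one
amounts = [0.0, 1.0, 25.0, 2500.0]


def log_amount(amount: float, offset: float = OFFSET) -> float:
    """Natural log of the amount after adding a small positive offset."""
    return math.log(amount + offset)


log_amounts = [log_amount(a) for a in amounts]

for a, la in zip(amounts, log_amounts):
    print(f"{a:>8.2f} -> {la:.3f}")
```

Without the offset, the zero amount would raise an error in `math.log` (or give `-inf` with `np.log`); with it, the value maps to roughly -9.2 and the whole column stays finite.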

Let's split the data into training, validation and testing data.

In [None]:
train_data, train_target, val_data, val_target, test_data, test_target = train_val_test_split(ccfd_data)

We should also normalise each of the features as they are each in vastly different ranges with different mean/variance. We normalise the input features using the sklearn `StandardScaler`.

Note: We fit the `StandardScaler` using only `train_data` to be sure the model is not peeking at the validation or test sets. The validation and test sets are normalised using the same transformation as the training set.

In [None]:
train_data, val_data, test_data = scale_data(train_data, val_data, test_data)
train_len, val_len, test_len = len(train_data), len(val_data), len(test_data)
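
The internals of `scale_data` are not shown here, so the following is only a pure-Python sketch of the idea: the mean and standard deviation are estimated on the training split and then reused, unchanged, on the other splits. The sample values and the `scale` helper are made up for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical single-feature splits; the real data has many columns
train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [2.0, 6.0]

# Fit: the statistics come from the training split only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses


def scale(values, mu=mu, sigma=sigma):
    """Apply the training-set transformation to any split."""
    return [(v - mu) / sigma for v in values]


train_scaled = scale(train)
val_scaled = scale(val)
```

The scaled training data has zero mean and unit variance by construction. The validation data generally does not, which is expected: it is treated exactly as unseen data would be.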

We're pretty much ready for analysis. Our data is in the following state:

In [None]:
log_shapes(train_data, train_target, val_data, val_target, test_data, test_target)

Finally, we convert our dataframes into `tf.data.Dataset` types.

In [None]:
train_dataset, val_dataset, test_dataset = make_all_datasets(
    train_data,
    train_target,
    val_data,
    val_target,
    test_data,
    test_target
)

## Plots

Below are two particularly interesting plots that can be found in the TensorFlow [tutorial](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#look_at_the_data_distribution).

We compare the distributions of the positive (fraudulent) and negative examples over a few features.

In [None]:
plot_pos_neg(train_data, train_target)

Good questions to ask yourself at this point are:
- Do these distributions make sense?
    - __Yes__ - you've normalized the input and these are mostly concentrated in the $\pm$2 range.
- Can you see the difference between the distributions?
    - __Yes__ - the positive examples contain a much higher rate of extreme values.

## Deep Neural Network

We define a function that creates a simple neural network with a series of densely connected hidden layers followed by dropout layers to reduce overfitting, and an output sigmoid layer that returns the probability of a transaction being fraudulent. We use a dropout rate of 0.5 as recommended in [Hinton (2012)](https://arxiv.org/pdf/1207.0580.pdf). We also use [batch normalisation](https://en.wikipedia.org/wiki/Batch_normalization) between each dense layer to improve efficiency of learning.

We use the [Adam](https://arxiv.org/pdf/1412.6980.pdf) optimiser, and binary cross-entropy loss.

### Model

First, we build the model and show a summary of its architecture.

We know the dataset is imbalanced and so we set the output layer's bias to reflect that (See [A Recipe for Training Neural Networks](http://karpathy.github.io/2019/04/25/recipe/): "init well"). This can help with initial convergence. See [here](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#optional_set_the_correct_initial_bias) for details on the calculation involved. 

It turns out the initial loss is about __50 times less__ than if we used a naive initialization. This way the model doesn't need to spend the first few epochs just learning that positive examples are unlikely. This also makes it easier to read plots of the loss during training.
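
The arithmetic behind that factor can be sketched as follows, assuming the bias is set to $\log(\text{pos}/\text{neg})$ as in the TensorFlow tutorial and using the class counts quoted in the Content section. Whether `calc_initial_bias` does exactly this is an assumption, and `bce_constant` is a helper made up for this sketch.

```python
import math

# Class counts quoted in the Content section
pos, total = 492, 284_807
neg = total - pos

# Bias such that sigmoid(bias) equals the positive-class frequency
initial_bias = math.log(pos / neg)
p0 = 1 / (1 + math.exp(-initial_bias))


def bce_constant(p: float, prior: float) -> float:
    """Expected binary cross-entropy when every example is predicted as p."""
    return -(prior * math.log(p) + (1 - prior) * math.log(1 - p))


prior = pos / total
loss_naive = bce_constant(0.5, prior)  # zero bias: sigmoid(0) = 0.5
loss_biased = bce_constant(p0, prior)  # bias set from the class ratio

print(f"bias={initial_bias:.3f}, naive={loss_naive:.4f}, biased={loss_biased:.4f}")
```

With these counts the naive loss is $\ln 2 \approx 0.693$ and the biased loss is roughly 0.013, a ratio a little above 50.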

In [None]:
initial_bias = calc_initial_bias(ccfd_data)
input_shape = train_data.shape[-1]

In [None]:
model = build_model(
    input_shape=input_shape, 
    output_bias=initial_bias
)

In [None]:
model.summary()

We are now ready to fit our model. The metrics used are listed below.

- __false negatives__ (FN) and __false positives__ (FP) are samples that were incorrectly classified;
- __true negatives__ (TN) and __true positives__ (TP) are samples that were correctly classified;
- __accuracy__ is the percentage of examples correctly classified;
- __precision__ is the percentage of __predicted__ positives that were correctly classified, 
$$\frac{TP}{TP+FP};$$ 
- __recall__ is the percentage of __actual__ positives that were correctly classified,
$$\frac{TP}{TP+FN};$$
- __AUC__ refers to the __Area Under the Curve__ of a __Receiver Operating Characteristic__ [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve. This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.
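
The definitions above can be sketched in a few lines of plain Python. The labels, scores and the 0.5 threshold are made up for illustration; in the notebook these metrics are computed by Keras during training and evaluation.

```python
from itertools import product

# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.05, 0.9]
threshold = 0.5

y_pred = [int(p >= threshold) for p in y_prob]
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)

# AUC as the fraction of (positive, negative) pairs ranked correctly;
# ties count as half
pos_scores = [p for t, p in zip(y_true, y_prob) if t == 1]
neg_scores = [p for t, p in zip(y_true, y_prob) if t == 0]
auc = sum(
    1.0 if ps > ns else 0.5 if ps == ns else 0.0
    for ps, ns in product(pos_scores, neg_scores)
) / (len(pos_scores) * len(neg_scores))
```

Note that precision, recall and accuracy depend on the chosen threshold, whereas the pairwise AUC only depends on how the scores are ordered.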

Note: Accuracy is not a helpful metric for this task. __You can attain 99.827% accuracy on this task by just predicting False all the time__.
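
That figure follows directly from the class counts quoted in the Content section, as this short check shows (the variable names are chosen only for this sketch):

```python
# Class counts quoted in the Content section
total, frauds = 284_807, 492

# A classifier that always predicts "not fraud" is right on every negative
always_negative_accuracy = (total - frauds) / total
always_negative_recall = 0 / frauds  # it never catches a fraud

print(f"accuracy: {always_negative_accuracy:.3%}")
print(f"recall:   {always_negative_recall:.3%}")
```

The same classifier has a recall of zero, which is why precision and recall are the more informative metrics here.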

In [None]:
history = fit_model(
    model,
    train_dataset,
    val_dataset,
    epochs=EPOCHS,
)

In [None]:
baseline_results = model.evaluate(
    test_dataset,
    verbose=0
)

In [None]:
train_predictions_baseline = model.predict(train_dataset, steps=train_len/BATCH_SIZE)
test_predictions_baseline = model.predict(test_dataset)

In [None]:
log_model_performance(model, baseline_results, test_target, test_predictions_baseline)

It looks like the precision is relatively high, but the recall and the area under the ROC curve (AUC) aren't as high as you might like. Classifiers often face challenges when trying to maximize both precision and recall, which is especially true when working with imbalanced datasets. It is important to consider the costs of different types of errors in the context of the problem you care about. In this example, a false negative (a fraudulent transaction is missed) may have a financial cost, while a false positive (a transaction is incorrectly flagged as fraudulent) may decrease user happiness.

### Class weights

The goal is to identify fraudulent transactions, but you don't have very many of those positive samples to work with, so you would want to have the classifier heavily weight the few examples that are available. You can do this by passing Keras weights for each class through a parameter. These will cause the model to "pay more attention" to examples from an under-represented class.

In [None]:
class_weight = set_class_weights(ccfd_data)
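
The implementation of `set_class_weights` is not shown, so the following is a sketch of one common heuristic (the one used in the TensorFlow tutorial), not necessarily what the helper does: weight each class inversely to its frequency, scaled so that the total weight over the dataset is unchanged. The name `class_weight_sketch` is made up here.

```python
# Class counts quoted in the Content section
total, pos = 284_807, 492
neg = total - pos

# Scale by total / 2 so the summed weight over all examples stays equal to total
class_weight_sketch = {
    0: (1 / neg) * (total / 2),
    1: (1 / pos) * (total / 2),
}

print({k: round(v, 2) for k, v in class_weight_sketch.items()})
```

A dictionary of this shape (class index to weight) is what Keras accepts through the `class_weight` argument of `Model.fit`. With these counts a single fraud example weighs several hundred times more than a single legitimate one.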

Now try re-training and evaluating the model with class weights to see how that affects the predictions.

Note: Using `class_weight` changes the range of the loss. Because of the weighting, the total losses are not comparable between the below model and the one seen above.

In [None]:
model_cw = build_model(
    input_shape=input_shape, 
    output_bias=initial_bias
)

In [None]:
history_cw = fit_model(
    model_cw,
    train_dataset,
    val_dataset,
    epochs=EPOCHS,
    class_weight=class_weight
)

In [None]:
cw_results = model_cw.evaluate(
    test_dataset,
    verbose=0
)

In [None]:
train_predictions_cw = model_cw.predict(train_dataset, steps=train_len/BATCH_SIZE)
test_predictions_cw = model_cw.predict(test_dataset)

In [None]:
log_model_performance(model_cw, cw_results, test_target, test_predictions_cw)

### Oversampling
A related approach would be to resample the dataset by oversampling the minority class. The easiest way to produce balanced examples is to start with a positive and a negative dataset, and merge them.

In [None]:
pos_data, pos_target, neg_data, neg_target = pos_neg_data(train_data, train_target)
pos_dataset, neg_dataset = make_datasets_pos_neg(pos_data, pos_target, neg_data, neg_target)
resampled_dataset = resample_dataset(pos_dataset, neg_dataset)
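
What `resample_dataset` does internally is not shown. As a rough pure-Python illustration of the merging idea, assuming each example is drawn from the positive or the negative pool with equal probability, consider the sketch below. The data, the `balanced_batch` helper and the pool sizes are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins: 1,000 negative rows and 5 positive rows
negatives = [("neg", i) for i in range(1000)]
positives = [("pos", i) for i in range(5)]


def balanced_batch(pos, neg, batch_size):
    """Draw a batch in which each example is positive with probability 0.5."""
    return [
        random.choice(pos) if random.random() < 0.5 else random.choice(neg)
        for _ in range(batch_size)
    ]


batch = balanced_batch(positives, negatives, batch_size=2000)
pos_fraction = sum(label == "pos" for label, _ in batch) / len(batch)
```

Because the positive pool is tiny, the same positive rows are drawn again and again, which is exactly what oversampling the minority class means.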

To use this dataset, we'll need the number of steps per epoch.

The definition of "epoch" in this case is less clear. Say it's the number of batches required to see each negative example once:

In [None]:
resampled_steps_per_epoch = resample_steps_per_epoch(ccfd_data)
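
Under that definition the number of steps can be estimated as below. This is a sketch only: the batch size of 2048 is an assumption (the notebook's `BATCH_SIZE` is defined elsewhere), and `resample_steps_per_epoch` may work from a different split of the data.

```python
import math

# Class counts quoted in the Content section
total, pos = 284_807, 492
neg = total - pos

BATCH_SIZE_SKETCH = 2048  # assumed value; the notebook's BATCH_SIZE is not shown

# In a balanced stream about half of each batch is negative, so seeing
# every negative once takes roughly 2 * neg examples
steps_sketch = math.ceil(2.0 * neg / BATCH_SIZE_SKETCH)

print(steps_sketch)
```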

Now try training the model with the resampled data set instead of using class weights to see how these methods compare.

Note: Because the data was balanced by replicating the positive examples, the total dataset size is larger, and each epoch runs for more training steps.

In [None]:
model_rs = build_model(
    input_shape=input_shape, 
    output_bias=initial_bias
)

In [None]:
history_rs = fit_model(
    model_rs,
    resampled_dataset,
    val_dataset,
    steps_per_epoch=resampled_steps_per_epoch,
    epochs=EPOCHS,
    resampled=True
)

If the training process were considering the whole dataset on each gradient update, this oversampling would be basically identical to the class weighting.

But when training the model batch-wise, as you did here, the oversampled data provides a smoother gradient signal: Instead of each positive example being shown in one batch with a large weight, they're shown in many different batches each time with a small weight.

This smoother gradient signal makes it easier to train the model.
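
A toy calculation illustrates both points. Here per-example losses stand in for gradient contributions, all numbers are made up, and the batches are laid out by hand rather than sampled.

```python
# Hypothetical per-example losses: three negatives and one positive
neg = [0.10, 0.20, 0.05]
pos = 1.50
k = 3  # class weight for the positive, or number of copies when oversampling

# Class weighting: the positive appears once, scaled by k
weighted_batches = [[neg[0], neg[1]], [neg[2], k * pos]]

# Oversampling: the positive appears k times with unit weight
oversampled_batches = [[n, pos] for n in neg]

weighted_sums = [sum(b) for b in weighted_batches]
oversampled_sums = [sum(b) for b in oversampled_batches]

# Over the whole dataset the two totals agree
weighted_total = sum(weighted_sums)
oversampled_total = sum(oversampled_sums)

# Batch by batch, weighting concentrates the signal in a single batch
print(weighted_sums, oversampled_sums)
```

Summed over the full dataset the two schemes give the same total, but the weighted scheme puts almost all of it into one batch, while oversampling spreads it evenly across batches.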

In [None]:
rs_results = model_rs.evaluate(
    test_dataset,
    verbose=0
)

In [None]:
train_predictions_rs = model_rs.predict(train_dataset, steps=train_len/BATCH_SIZE)
test_predictions_rs = model_rs.predict(test_dataset)

In [None]:
log_model_performance(model_rs, rs_results, test_target, test_predictions_rs)

## ROC Curves

We now plot some ROC curves to illustrate the success of each of the above three networks.

In [None]:
plot_roc_all(
    train_target,
    test_target,
    train_predictions_baseline,
    test_predictions_baseline,
    train_predictions_cw,
    test_predictions_cw,
    train_predictions_rs,
    test_predictions_rs
)