# Credit Card Fraud Analysis

Script to explore prediction of fraud with the following dataset:
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi.
Calibrating Probability with Undersampling for Unbalanced Classification.
In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

The dataset can be downloaded from:
https://www.kaggle.com/dalpozz/creditcardfraud

The approach is to train an autoencoder to learn the structure of the dataset
and then to identify the transactions with the largest autoencoder reconstruction error.

The autoencoder will then be used as pre-training for a neural network classifier.
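
Ranking transactions by reconstruction error can be sketched as follows. This is a minimal NumPy illustration, not the notebook's own code: `reconstruction_errors` and `top_k_anomalies` are hypothetical helper names, and the toy arrays stand in for the autoencoder's inputs and outputs.

```python
import numpy as np

def reconstruction_errors(inputs, reconstructions):
    """Mean squared error per row between inputs and their reconstructions."""
    inputs = np.asarray(inputs, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    return ((inputs - reconstructions) ** 2).mean(axis=1)

def top_k_anomalies(errors, k):
    """Indices of the k rows with the largest error, largest first."""
    order = np.argsort(errors)[::-1]
    return order[:k]

# Toy data (hypothetical): three rows, the second is reconstructed badly
x = np.array([[0.0, 1.0], [2.0, 2.0], [1.0, 1.0]])
x_hat = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.5]])
errors = reconstruction_errors(x, x_hat)
suspects = top_k_anomalies(errors, k=2)
```

Rows with a high error are the ones the autoencoder, trained on normal transactions, fails to reproduce, which is why they are candidates for closer inspection.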

In [44]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [20]:
%qtconsole

In [45]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.cuda
import sklearn as skl
import matplotlib.pyplot as plt
import os

In [66]:
# Define analysis hyper-parameters
batch_size = 64
lr = 0.002
no_epochs = 20
hidden_nodes = [14, 7]  # hidden layer sizes passed to the autoencoder


In [67]:
# Check for GPU and CUDA libraries
HAS_CUDA=torch.cuda.is_available()

In [68]:
# Attempt to get consistent results during model development by fixing random seeds
np.random.seed(39)
torch.manual_seed(10)
if HAS_CUDA:
    torch.cuda.manual_seed(10)

In [69]:
# Import data
dataset=pd.read_csv('creditcard.csv')

In [70]:

# Process the time column to categorize it into time-of-day quarters, and normalize the data
from utils import add_quarter_and_normalize
dataset = add_quarter_and_normalize(dataset)
data = dataset.drop(['Time','qtr_num'], axis=1)

print(' Data imported and normalized')

 Data imported and normalized


In [71]:
#Split the data into training and test sets

from sklearn.model_selection import train_test_split

# Note this shuffles the data by default before splitting
data_tr, data_val=train_test_split(data,test_size=0.2,random_state=0)

### Create a version of the data for training the autoencoder. For training, use only non-fraudulent cases, and drop the "Class" field since it is not needed


In [72]:
# Create new dataset and drop the class field
AE_input_tr = data_tr[data_tr.Class==0]
AE_input_tr = AE_input_tr.drop(['Class', 'Amt_To_Keep'], axis=1)
AE_input_val = data_val.drop(['Class', 'Amt_To_Keep'], axis=1)

# Convert arrays to float32 PyTorch tensors
AE_input_tr = torch.from_numpy(AE_input_tr.values).float()
AE_input_val = torch.from_numpy(AE_input_val.values).float()


In [73]:
# Create instance of Autoencoder
from AutoEncoders import AutoEncoder_Multi_Layer
num_features=AE_input_tr.shape[1]
AE_model=AutoEncoder_Multi_Layer(num_features, hidden_sizes=hidden_nodes)

# If the computer has a GPU, move the model to the GPU
if HAS_CUDA:
    AE_model.cuda()

# Define optimiser algorithm to use (using Adam in this case)
import torch.optim as optim
AE_opt=optim.Adam(AE_model.parameters(), lr=lr)

# Define how to adjust the learning rate: multiply it by `factor` once the
# monitored loss has stopped improving for longer than `patience` epochs
from torch.optim.lr_scheduler import ReduceLROnPlateau
AE_sched = ReduceLROnPlateau(AE_opt, mode='min', verbose=True,
                             patience=5, factor=0.2)

# Define cost function to use
AE_loss=nn.MSELoss()
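
The learning-rate rule configured in this cell (`patience=5`, `factor=0.2`) can be illustrated with a small pure-Python sketch. `PlateauSchedule` is a hypothetical, simplified stand-in and not the PyTorch class: it ignores the improvement threshold and cooldown options and only shows the core idea of cutting the rate after the loss stalls.

```python
class PlateauSchedule:
    """Simplified stand-in for a reduce-on-plateau learning-rate rule."""

    def __init__(self, lr, factor=0.2, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        # An epoch counts as "bad" when the loss does not beat the best seen so far
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Once the run of bad epochs exceeds the patience, shrink the rate and start over
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

sched = PlateauSchedule(lr=0.002, factor=0.2, patience=5)
# Two improving epochs followed by six epochs with no improvement
losses = [0.30, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29]
history = [sched.step(loss) for loss in losses]
```

In this toy run the rate stays at 0.002 for the first seven epochs and is multiplied by 0.2 on the eighth, when the sixth consecutive epoch without improvement exceeds the patience of 5.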


In [74]:
# Create model manager instance to control training and validation
from nnModels import NN_Manage
AE_manager=NN_Manage(model = AE_model,
                     loss_function = AE_loss, 
                     optimizer = AE_opt, 
                     scheduler=AE_sched,
                     HAS_CUDA=HAS_CUDA, 
                     my_logger=True)

In [77]:
# Train Autoencoder
AE_test_loss=0
for epoch in range(no_epochs):
    # Train over training data
    train_loss=AE_manager.train(epoch, shuffle = True, inputs = AE_input_tr,
                                targets = AE_input_tr, batch_size = batch_size)
    
    # Assess performance against validation data
    AE_test_loss=AE_manager.test(epoch,inputs = AE_input_val, targets = AE_input_val,
                                 batch_size=batch_size)
    
    # Adjust learning rate if required
    AE_manager.scheduler.step(AE_test_loss)

Train Epoch: 0  Batch: 0 Loss: 0.229753
Train Epoch: 0  Batch: 100 Loss: 0.519741
Train Epoch: 0  Batch: 200 Loss: 0.270257
Train Epoch: 0  Batch: 300 Loss: 0.254409
Train Epoch: 0  Batch: 400 Loss: 0.201296
Train Epoch: 0  Batch: 500 Loss: 0.305808
Train Epoch: 0  Batch: 600 Loss: 0.224236
Train Epoch: 0  Batch: 700 Loss: 0.237990
Train Epoch: 0  Batch: 800 Loss: 0.275711
Train Epoch: 0  Batch: 900 Loss: 0.222909
Train Epoch: 0  Batch: 1000 Loss: 0.260364
Train Epoch: 0  Batch: 1100 Loss: 0.247218
Train Epoch: 0  Batch: 1200 Loss: 0.223768
Train Epoch: 0  Batch: 1300 Loss: 0.203623
Train Epoch: 0  Batch: 1400 Loss: 0.217012
Train Epoch: 0  Batch: 1500 Loss: 0.215013
Train Epoch: 0  Batch: 1600 Loss: 0.229555
Train Epoch: 0  Batch: 1700 Loss: 0.549973
Train Epoch: 0  Batch: 1800 Loss: 0.296072
Train Epoch: 0  Batch: 1900 Loss: 0.238299
Train Epoch: 0  Batch: 2000 Loss: 0.222937
Train Epoch: 0  Batch: 2100 Loss: 0.277053
Train Epoch: 0  Batch: 2200 Loss: 0.243090
Train Epoch: 0  Batch: 
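
The `Batch:` counter in the log reflects the training data being consumed in chunks of `batch_size`. The code of `NN_Manage.train` is not shown here, so the following is only a sketch of one common way to handle `shuffle` and `batch_size`; `iterate_minibatches` is a hypothetical helper and the toy array replaces the real tensors.

```python
import numpy as np

def iterate_minibatches(inputs, targets, batch_size, shuffle=True, seed=None):
    """Yield (input_batch, target_batch) pairs that cover the data exactly once."""
    n = len(inputs)
    indices = np.arange(n)
    if shuffle:
        # Shuffle row order so each epoch sees the batches in a different arrangement
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield inputs[batch_idx], targets[batch_idx]

# Toy data (hypothetical): 150 rows of 4 features.
# For an autoencoder the target is the input itself.
data = np.arange(600, dtype=float).reshape(150, 4)
batches = list(iterate_minibatches(data, data, batch_size=64, shuffle=True, seed=0))
batch_sizes = [len(x) for x, _ in batches]
```

The last batch is smaller whenever the number of rows is not a multiple of `batch_size`.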

In [78]:
# Check model losses for training and validation
print()
print('Train Loss:      {:8.5f}'.format(train_loss))
print('Validation Loss: {:8.5f}'.format(AE_test_loss))



Train Loss:       0.25875
Validation Loss:  0.28962


In [99]:
# Get case name to save files under
import pickle
import feather
if input('Save case (y or ret): ') == 'y':
    # Keep asking until the user supplies a directory name that is not already in use
    while True:
        case_name = input('Case name to use to save model :')
        if not os.path.exists(case_name):
            os.mkdir(case_name)
            break
        else:
            print('Directory exists already!')
    # Save files - model state dictionary and other key parameters
    # Move the model to the CPU first so the saved weights load on machines without a GPU
    if HAS_CUDA:
        AE_model.cpu()
    torch.save(AE_model.state_dict(), case_name + "/demo_state_dict_cpu.pt")
    # Save critical model parameters (the with-block closes the file on exit)
    with open(case_name + '/analysis_params.pkl', 'wb') as f:
        pickle.dump([batch_size, lr, no_epochs, hidden_nodes], f)

    # Save training and validation datasets to ensure consistency with classifier
    feather.write_dataframe(data_tr, f'{case_name}/data_tr.feather')
    feather.write_dataframe(data_val, f'{case_name}/data_val.feather')

    print("Autoencoder model saved")
        

Save case (y or ret): y
Case name to use to save model :base_case
Directory exists already!
Case name to use to save model :base_case
Autoencoder model saved
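
A later stage, such as the classifier, would need to read the saved hyper-parameters back. The sketch below round-trips them through `pickle` in a temporary directory. The file name `analysis_params.pkl` and the list order come from the save cell; `save_params` and `load_params` are hypothetical helper names, and the temporary directory replaces the real case directory.

```python
import os
import pickle
import tempfile

PARAMS_FILE = "analysis_params.pkl"

def save_params(case_dir, params):
    """Write the hyper-parameter list in the same order the save cell uses."""
    with open(os.path.join(case_dir, PARAMS_FILE), "wb") as f:
        pickle.dump(params, f)

def load_params(case_dir):
    """Read the list back and name each entry, relying on the saved order."""
    with open(os.path.join(case_dir, PARAMS_FILE), "rb") as f:
        batch_size, lr, no_epochs, hidden_nodes = pickle.load(f)
    return {
        "batch_size": batch_size,
        "lr": lr,
        "no_epochs": no_epochs,
        "hidden_nodes": hidden_nodes,
    }

# Round trip in a throwaway directory (stands in for the case directory)
with tempfile.TemporaryDirectory() as case_dir:
    save_params(case_dir, [64, 0.002, 20, [14, 7]])
    params = load_params(case_dir)
```

Because the values are stored as a plain list, the loader depends on the order staying fixed. Saving a dictionary instead would remove that coupling.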
