# Credit Card Fraud Detection

[This](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset contains a PCA-transformed
record of credit card transactions, of which 0.172% are fraudulent. The overall goal is to detect the fraudulent transactions.

Because the fraudulent transactions make up a tiny portion of the data, the data set is highly imbalanced. One must be careful using the usual classification metrics on such data sets. The purpose of this notebook is to explore custom training metrics in LightGBM and Keras that are well suited to imbalanced data.
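
To make that caution concrete, here is a minimal sketch on synthetic labels (not the Kaggle data), assuming only the 0.172% fraud rate quoted above. A classifier that never flags fraud still gets a very high accuracy while catching nothing.

```python
# Synthetic illustration: 100,000 made-up transactions at a 0.172% fraud rate.
n = 100_000
n_fraud = round(n * 0.00172)            # number of fraudulent rows
y_true = [1] * n_fraud + [0] * (n - n_fraud)
y_pred = [0] * n                        # always predict "not fraud"

# Fraction of all rows labelled correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
# Fraction of the fraudulent rows that were flagged
recall = sum(p == 1 for t, p in zip(y_true, y_pred) if t == 1) / n_fraud

print(f"accuracy = {accuracy:.5f}, fraud recall = {recall:.1f}")
```

Accuracy alone would make this useless model look excellent, which is why the rest of the notebook relies on a metric that accounts for all four confusion-matrix cells.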

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_palette("Set2")
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import lightgbm as lgb

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils

np.random.seed(149)

data = pd.read_csv('creditcard.csv')

Using TensorFlow backend.


Sometimes in classification problems I like to view the variance inflation factor (VIF) to check for multicollinearity. Though a high VIF is [not always a problem](https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning) in machine learning, it can be useful to know in case our model needs to be tweaked.

In [2]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [3]:
VIF = pd.DataFrame()
VIF["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
VIF["features"] = data.columns

In [4]:
VIF

| | VIF | features |
|---|---|---|
| 0 | 2.339858 | Time |
| 1 | 1.638237 | V1 |
| 2 | 3.900804 | V2 |
| 3 | 1.321018 | V3 |
| 4 | 1.172479 | V4 |
| 5 | 2.764441 | V5 |
| 6 | 1.528629 | V6 |
| 7 | 2.603517 | V7 |
| 8 | 1.098591 | V8 |
| 9 | 1.037715 | V9 |


Interestingly, `Amount` has a high VIF (its row is not among the ten shown above), suggesting multicollinearity. For now, we will ignore this.
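
As a rough sketch of what the number means: the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the others. The NumPy version below is an illustration on synthetic columns, not the statsmodels implementation. The helper name `vif` and the data are made up, and it adds an intercept to each regression, so its values need not match the table above.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing the column on the others."""
    X = np.asarray(X, dtype=float)
    result = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        result.append(1.0 / (1.0 - r2))
    return np.array(result)

# Synthetic data: column c is almost a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly duplicated columns get a large VIF, while the independent column stays close to 1.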

We will try two different approaches to detecting fraud. In the first, we will use LightGBM's gradient boosting classifier on the _full_ data set, carefully managing the imbalance with scaling and a custom metric. In the second, we will use a multilayer perceptron (MLP) in Keras on a _subset_ of the data. In both cases, we will evaluate the performance using the [Matthews correlation coefficient (MCC)](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient).
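
For reference, writing $TP$, $TN$, $FP$, and $FN$ for the four cells of the binary confusion matrix, the MCC is

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

It ranges from -1 to 1, with 0 corresponding to a prediction no better than chance. Because every cell enters the formula, a model that ignores the rare class cannot score well.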

## Custom evaluation metric in LightGBM

LightGBM does not include MCC as a metric, but fortunately LightGBM makes it very easy to implement it yourself.

In [5]:
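from sklearn.metrics import matthews_corrcoef as mcc

def mcc_error(preds, train_data):
    # Custom-metric signature: (predictions, Dataset) -> (name, value, is_higher_better)
    labels = train_data.get_label()
    # Round the predicted probabilities to hard 0/1 labels before scoring
    return 'mcc_error', mcc(labels, preds.round()), True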

In the last line of `mcc_error`, it is very important to use `preds.round()` instead of `preds`: the predictions arrive as probabilities, while MCC is defined on class labels.
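
A small sketch of what the rounding does, using made-up probabilities. For values in [0, 1] it is the same as thresholding at 0.5, with the one quirk that NumPy rounds an exact 0.5 down to 0.

```python
import numpy as np

# Hypothetical predicted probabilities for five transactions
preds = np.array([0.03, 0.49, 0.5, 0.51, 0.98])

hard = preds.round()                        # 0/1 values a label-based metric can use
thresholded = (preds > 0.5).astype(float)   # the same decision written as a threshold

print(hard)
print(np.array_equal(hard, thresholded))
```

If a different operating point is wanted, replacing the rounding with an explicit threshold such as `preds > 0.3` is one option; that threshold is an illustration, not something tuned in this notebook.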

Divide the data into training, validation, and test sets.

In [6]:
X = data.drop(columns=['Class','Time'])
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
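
Since the second split is applied to what is left after the first, the resulting shares of the full data set are not 60/20/20. A quick arithmetic sketch:

```python
test_size = 0.2   # same value passed to both train_test_split calls above

test_frac = test_size                            # held out first
val_frac = (1 - test_size) * test_size           # 20% of the remaining 80%
train_frac = (1 - test_size) * (1 - test_size)   # what is left for training

print(f"train {train_frac:.2f}, val {val_frac:.2f}, test {test_frac:.2f}")
```

So roughly 64% of the rows are used for training, 16% for validation, and 20% for testing.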

In [7]:
lgb_data = lgb.Dataset(X_train, label=y_train)
lgb_val = lgb.Dataset(X_val, label=y_val)

The LightGBM hyperparameters below were obtained with [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization).

In [8]:
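lg = lgb.LGBMClassifier(task='train',
    boosting_type='gbdt',
    objective='binary',
    # No `feval` here: the constructor does not take a custom metric. With this
    # wrapper a callable goes to `fit(..., eval_metric=...)`; with `lgb.train`
    # it goes to `feval`.
    num_leaves=20,
    learning_rate=0.05,
    feature_fraction=0.8899,
    bagging_fraction=0.8688,
    bagging_freq=20,
    scale_pos_weight=0.00173,
    verbose=0,
    min_data_in_leaf=3,
    n_estimators=1000)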

In [9]:
lg.fit(X_train, y_train)
pred_val = lg.predict(X_val)
print(classification_report(y_val, pred_val))
print(mcc(y_val, pred_val))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     45492
          1       1.00      0.78      0.88        77

avg / total       1.00      1.00      1.00     45569

0.8825699402104176
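
As a sanity check, the MCC can be recomputed by hand from confusion-matrix counts. The counts below are my inference from the rounded report above (60 of the 77 frauds caught, no false alarms), not values read from the model, and `mcc_from_counts` is a helper written for this sketch.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Assumed counts, inferred from the rounded classification report:
tp, fn = 60, 17       # 60 / 77 rounds to the reported recall of 0.78
fp, tn = 0, 45492     # precision of 1.00 on class 1 leaves no room for false positives

val_mcc = mcc_from_counts(tp, tn, fp, fn)
print(round(val_mcc, 4))
```

The result agrees with the value printed by `matthews_corrcoef` above to several decimal places, so the MCC here is driven almost entirely by the 17 missed frauds.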




This is a solid score. Let's see how it performs on the test data.

In [10]:
pred_test = lg.predict(X_test)
score = mcc(y_test, pred_test)
print(score)

0.8463345814534691




## Keras MLP

In our second approach, we train an MLP (i.e. all dense layers) in Keras on a subset of the data. Since we are not using `Time` as one of the features, the order of the transactions is irrelevant. Because of this, we'll first sort the data set by `Class`, then take the first 10,000 rows, which gives a far less imbalanced subset. The idea is borrowed from [this](https://www.kaggle.com/randyrose2017/using-scikit-learn-and-keras-for-fraud-detection/notebook) notebook. Unlike that notebook, we will remove `Amount`, the feature with very high VIF, for a performance boost.

In [11]:
data_sorted = data.sort_values(by='Class', ascending=False, inplace=False)
data_sorted = data_sorted.drop(columns=['Time','Amount'])

In [12]:
df_sample = data_sorted.iloc[:10000,:]
df_sample.Class.value_counts()

0    9508
1     492
Name: Class, dtype: int64
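
The same sort-then-slice idea, sketched in NumPy on synthetic labels (the sizes and the positive rate here are made up): sorting by label in descending order puts every positive first, so a slice from the top keeps all of them and pads the rest with negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = (rng.random(50_000) < 0.002).astype(int)   # synthetic, roughly 0.2% positives

order = np.argsort(-labels, kind="stable")   # positives first, original order otherwise
subset = order[:2_000]                       # top slice, analogous to iloc[:10000]

print(labels.mean(), labels[subset].mean())
```

The positive share of the subset is much higher than in the full array because every positive survives the slice.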

Now we divide into training, validation, and test sets, being careful to shuffle the data first.

In [13]:
from sklearn.utils import shuffle

shuffle_df = shuffle(df_sample, random_state=42)

df_train = shuffle_df[0:6400]
df_val = shuffle_df[6400:8000]
df_test = shuffle_df[8000:]

In [14]:
train_feature = np.array(df_train.values[:,0:28])
train_label = np.array(df_train.values[:,-1])
val_feature = np.array(df_val.values[:,0:28])
val_label = np.array(df_val.values[:,-1])
test_feature = np.array(df_test.values[:,0:28])
test_label = np.array(df_test.values[:,-1])

Converting the labels to categorical and scaling the features are both standard techniques in deep learning.

In [16]:
Y_train = np_utils.to_categorical(train_label, 2)
Y_test = np_utils.to_categorical(test_label, 2)
Y_val = np_utils.to_categorical(val_label, 2)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_feature)
X_tr = scaler.transform(train_feature)
X_va = scaler.transform(val_feature)
X_te = scaler.transform(test_feature)
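
For readers who want to see what these two steps do to the arrays, here is a NumPy sketch on toy values (not the notebook's data). It mimics one-hot encoding and min-max scaling by hand. The scaling statistics come from the training rows only, as in the cell above.

```python
import numpy as np

# One-hot encoding: label k becomes a row with a 1 in column k
labels = np.array([0, 1, 0, 0])
one_hot = np.eye(2)[labels]

# Min-max scaling fitted on the training rows only
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])
lo, hi = train.min(axis=0), train.max(axis=0)

train_scaled = (train - lo) / (hi - lo)   # every training column now spans [0, 1]
test_scaled = (test - lo) / (hi - lo)     # unseen rows may fall outside [0, 1]

print(one_hot)
print(test_scaled)
```

Fitting on the training rows alone keeps information from the validation and test sets out of the preprocessing.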

Keras no longer seems to have the Matthews correlation coefficient as an evaluation metric, but you can still find the code on GitHub to make your own. Here it is.

In [17]:
import keras.backend as K
def matthews_correlation(y_true, y_pred):
    '''Matthews correlation coefficient computed from rounded predictions.'''
    # Hard 0/1 predictions and their complements
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos

    # Hard 0/1 targets and their complements
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos

    # Confusion-matrix counts
    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)

    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # K.epsilon() guards against division by zero
    return numerator / (denominator + K.epsilon())

Here is our MLP implemented in Keras.  We use the `TruncatedNormal` initializer in our layers to improve performance.

In [18]:
model = Sequential() 
model.add(Dense(units=256, 
                input_dim=28, 
                kernel_initializer='TruncatedNormal', 
                activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=256,  
                kernel_initializer='TruncatedNormal', 
                activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=256,  
                kernel_initializer='TruncatedNormal', 
                activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=2,
                kernel_initializer='TruncatedNormal',
                activation='sigmoid'))

print(model.summary()) 

model.compile(loss='binary_crossentropy',  
              optimizer='adam', metrics=[matthews_correlation])

model.fit(x=X_tr, y=Y_train,
          validation_data=(X_va, Y_val), epochs=100,
          batch_size=300, verbose=1)


score = model.evaluate(X_te, Y_test, verbose=0)
print(score)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               7424      
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 514       
Total para

Epoch 100/100
[0.043402254733257, 0.983]
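
The per-layer parameter counts in the summary above can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name below is made up for this sketch.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# (inputs, units) for the four Dense layers of the model above
layers = [(28, 256), (256, 256), (256, 256), (256, 2)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]

print(counts)   # [7424, 65792, 65792, 514]
```

Dropout layers add no parameters, which matches the zeros in the summary.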


In [19]:
pred_te = model.predict_classes(X_te)

In [20]:
print(classification_report(test_label, pred_te))

             precision    recall  f1-score   support

        0.0       0.99      1.00      1.00      1910
        1.0       0.96      0.84      0.90        90

avg / total       0.99      0.99      0.99      2000



In [21]:
score = mcc(test_label, pred_te)
print(score)

0.8970636232683191
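
The Keras metric reported by `model.evaluate` (0.983) and the scikit-learn MCC here (0.897) differ noticeably. One possible explanation is that the Keras function sums over both one-hot columns at once. The sketch below is my own NumPy transcription of that function on toy labels, and it assumes the two rounded outputs are complementary. Under that assumption the pooled value reduces to `2 * accuracy - 1`, not the MCC of the fraud class.

```python
import numpy as np

def mcc_like_keras(y_true, y_pred, eps=1e-7):
    """NumPy transcription of the Keras metric: sums over every entry of the arrays."""
    y_pred_pos = np.round(np.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos
    y_pos = np.round(np.clip(y_true, 0, 1))
    y_neg = 1 - y_pos
    tp = np.sum(y_pos * y_pred_pos)
    tn = np.sum(y_neg * y_pred_neg)
    fp = np.sum(y_neg * y_pred_pos)
    fn = np.sum(y_pos * y_pred_neg)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / (denom + eps)

# Toy data: 95 legitimate and 5 fraudulent rows, with 1 false alarm and 3 misses.
labels = np.array([0] * 95 + [1] * 5)
preds = labels.copy()
preds[0] = 1          # one false positive
preds[95:98] = 0      # three false negatives

accuracy = (labels == preds).mean()
pooled = mcc_like_keras(np.eye(2)[labels], np.eye(2)[preds])   # both one-hot columns
fraud_only = mcc_like_keras(labels, preds)                      # fraud column only

print(accuracy, pooled, fraud_only)
```

The rounded test report above is consistent with 17 errors in 2,000 rows, i.e. an accuracy of 0.9915, and 2 * 0.9915 - 1 = 0.983. For that reason I would treat the scikit-learn value as the MCC of the fraud class and the Keras number as a rescaled accuracy.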
