## Supervised Learning: Challenge

In this challenge, we will work on credit card fraud prediction. The dataset is available to download [here](https://drive.google.com/file/d/102F1yO4uhUZ-TONJheSiXYWUgBDCoIjA/view?usp=sharing). The data originally comes from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
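
The imbalance figure can be checked directly from the counts quoted above. The short sketch below also shows why plain accuracy is misleading on this data: a hypothetical classifier that labels every transaction as legitimate is right almost all of the time while catching no fraud at all.

```python
# Counts quoted in the dataset description.
n_total = 284_807
n_fraud = 492

# Share of the positive (fraud) class, in percent.
fraud_pct = 100 * n_fraud / n_total

# Accuracy of a hypothetical model that always predicts "not fraud".
baseline_accuracy_pct = 100 * (n_total - n_fraud) / n_total

print(f"fraud share:       {fraud_pct:.4f}%")
print(f"baseline accuracy: {baseline_accuracy_pct:.4f}%")
```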

**Challenge:** Identify fraudulent credit card transactions.

Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
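
To make the idea of example-dependent cost-sensitive learning more concrete, the sketch below scores predictions by money lost rather than by error count. The cost rule is an assumption made purely for illustration (a missed fraud costs the transaction amount, every alert costs a fixed review fee `ADMIN_COST`); neither the rule, the helper `total_cost`, nor the numbers come from the dataset.

```python
import numpy as np

ADMIN_COST = 5.0  # hypothetical fixed cost of reviewing one flagged transaction

def total_cost(y_true, y_pred, amount, admin_cost=ADMIN_COST):
    """Sum per-transaction costs under the assumed cost rule."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amount = np.asarray(amount, dtype=float)

    missed_fraud = (y_true == 1) & (y_pred == 0)  # false negatives lose the amount
    flagged = y_pred == 1                         # every alert costs a review
    return float(amount[missed_fraud].sum() + admin_cost * flagged.sum())

# Toy data: two frauds (one large, one small) among five transactions.
y_true = [0, 1, 0, 1, 0]
amount = [20.0, 900.0, 35.0, 10.0, 50.0]

pred_a = [0, 0, 0, 1, 0]  # misses the 900.00 fraud
pred_b = [0, 1, 0, 0, 0]  # misses the 10.00 fraud

cost_a = total_cost(y_true, pred_a, amount)  # 900 + 5 = 905.0
cost_b = total_cost(y_true, pred_b, amount)  # 10 + 5 = 15.0
print(cost_a, cost_b)
```

Both toy predictions make exactly one mistake, so an error-count metric cannot tell them apart, while the amount-weighted cost does.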

> #### Warning
> The classes are heavily imbalanced, so we need to be careful when evaluating: plain accuracy can look high even when few frauds are caught. It may be better to use the `.predict_proba()` method with a custom cut-off to search for fraudulent transactions.
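
A minimal sketch of the custom cut-off idea, under stated assumptions: it uses synthetic data from `make_classification` and a `LogisticRegression` model as stand-ins (neither is used in this notebook), and the cut-off values are arbitrary. The point is only the mechanics of thresholding the class-1 column of `.predict_proba()` instead of calling `.predict()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: roughly 2% positives.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_pos = clf.predict_proba(X_te)[:, 1]

recalls = {}
for cutoff in (0.5, 0.2, 0.05):
    # Flag a sample as positive when its probability reaches the cut-off.
    pred = (proba_pos >= cutoff).astype(int)
    recalls[cutoff] = recall_score(y_te, pred)
    print(f"cutoff={cutoff:.2f}  "
          f"recall={recalls[cutoff]:.3f}  "
          f"precision={precision_score(y_te, pred, zero_division=0):.3f}")
```

Lowering the cut-off can only keep or increase recall, usually at the price of more false alarms, so the choice depends on how costly each kind of error is.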

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics         import accuracy_score, balanced_accuracy_score
from sklearn.preprocessing   import StandardScaler

from sklearn.svm           import SVC
from sklearn.ensemble      import RandomForestClassifier
from sklearn.ensemble      import GradientBoostingClassifier
from lightgbm              import LGBMClassifier

from sklearn.metrics        import recall_score
from sklearn.neural_network import MLPClassifier

from time import time

In [2]:
df = pd.read_csv('./data/creditcard.csv')
# standardise 'Amount'; ravel() flattens the (n, 1) scaler output to a 1-D column
df['norm_amount'] = StandardScaler().fit_transform(df[['Amount']]).ravel()
df = df.drop(['Time', 'Amount'], axis=1)

In [3]:
y = df['Class']
x = df.drop('Class', axis=1)

### Resampling for imbalanced classes

In [4]:
# number of fraud cases
no_of_frauds = len(df[df['Class'] == 1]) 

# extract indices for fraud and non-fraud cases
frauds_index = np.array(df[df['Class'] == 1].index)
not_frauds_index = np.array(df[df['Class'] == 0].index)

# undersampling to eliminate majority type cases
# pick random cases of non-fraud equal to fraud cases
random_not_frauds_index = np.random.choice(not_frauds_index, no_of_frauds, replace=False)

# add together all randomly chosen fraud and non-fraud cases
undersample_index = np.concatenate([frauds_index, random_not_frauds_index])

# get rows for the undersampled indices (these are index labels, hence .loc)
undersample_data = df.loc[undersample_index, :]

# split x and y again
x_undersample = undersample_data.drop('Class', axis=1)
y_undersample = undersample_data[['Class']]

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.3)
x_train_under, x_test_under, y_train_under, y_test_under = train_test_split(x_undersample, y_undersample, random_state=0, test_size=0.3)
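
One thing to watch in the split above: `x_test` is drawn from the full data, so it can contain rows that were also picked for `x_train_under`. A possible alternative, sketched here on toy labels with NumPy only, is to split first and undersample just the training indices. The helper name `undersample_indices` and the toy data are illustrative and not part of the notebook.

```python
import numpy as np

def undersample_indices(y, rng):
    """Return positions keeping all positives and an equal number of random negatives."""
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    return np.concatenate([pos, keep_neg])

rng = np.random.default_rng(0)

# Toy labels: 1000 rows, 20 positives.
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=20, replace=False)] = 1

# 1) Hold out a test set first (simple random 70/30 split of row indices).
perm = rng.permutation(len(y))
n_test = int(0.3 * len(y))
test_idx, train_idx = perm[:n_test], perm[n_test:]

# 2) Undersample only inside the training portion.
balanced_train_idx = train_idx[undersample_indices(y[train_idx], rng)]

train_counts = np.bincount(y[balanced_train_idx], minlength=2)
shared_rows = np.intersect1d(balanced_train_idx, test_idx).size

print("train class counts:", train_counts)
print("rows shared with test set:", shared_rows)
```

With this ordering the test set keeps the original class ratio and shares no rows with the balanced training set.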

In [6]:
tree_classifiers = {
  "SVC": SVC(),
  "Random Forest": RandomForestClassifier(verbose=False),
  "Skl GBM": GradientBoostingClassifier(verbose=False),
  "LightGBM": LGBMClassifier(),
}

In [12]:
rows = []

for model_name, model in tree_classifiers.items():

    start_time = time()

    # train on the balanced (undersampled) data, predict on the full test split
    model.fit(x_train_under, np.ravel(y_train_under))
    pred = model.predict(x_test)

    total_time = time() - start_time

    print(f'Finished {model_name}')

    # DataFrame.append was removed in pandas 2.0, so collect rows in a list
    rows.append({"Model":    model_name,
                 "Accuracy": round(accuracy_score(y_test, pred)*100, 3),
                 "Bal Acc.": round(balanced_accuracy_score(y_test, pred)*100, 3),
                 "Time":     round(total_time, 2)})

results = pd.DataFrame(rows, columns=['Model', 'Accuracy', 'Bal Acc.', 'Time'])
results_ord = results.sort_values(by=['Accuracy'], ascending=False, ignore_index=True)
results_ord

Finished SVC
Finished Random Forest
Finished Skl GBM
Finished LightGBM


|   | Model | Accuracy | Bal Acc. | Time |
|---|---|---|---|---|
| 0 | SVC | 98.462 | 93.457 | 1.92 |
| 1 | Random Forest | 97.856 | 97.228 | 1.16 |
| 2 | LightGBM | 96.6 | 96.599 | 0.37 |
| 3 | Skl GBM | 96.283 | 96.44 | 0.89 |
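
The gap between the two score columns follows from how they are defined: accuracy is the share of all correct predictions, while balanced accuracy is the mean of the per-class recalls, so the rare fraud class counts as much as the majority class. A small sketch with made-up confusion-matrix counts (not taken from the run above):

```python
# Hypothetical confusion-matrix counts for an imbalanced test set.
tn, fp = 9_800, 100   # legitimate transactions: correctly passed / wrongly flagged
fn, tp = 20, 80       # fraudulent transactions: missed / caught

# Accuracy: all correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Balanced accuracy: average of the recall on each class.
recall_fraud = tp / (tp + fn)   # sensitivity
recall_legit = tn / (tn + fp)   # specificity
balanced_accuracy = (recall_fraud + recall_legit) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

This is also why SVC tops the table when sorted by accuracy even though it has the lowest balanced accuracy of the four models.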


In [8]:
mlpc = MLPClassifier(hidden_layer_sizes=(200, ), max_iter=10000)
mlpc.fit(x_train_under, np.ravel(y_train_under))
mlpc_pred = mlpc.predict(x_test)
recall_acc = round(recall_score(y_test, mlpc_pred)*100, 3)
recall_acc

95.918
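
Recall on its own says nothing about false alarms. As a rough illustration, the sketch below combines the recall printed above with the fraud prevalence of the full dataset and an assumed false-positive rate to estimate precision. The 2% false-positive rate is a hypothetical value chosen for the example; it was not measured in this notebook.

```python
# Prevalence from the dataset description.
prevalence = 492 / 284_807

recall = 0.95918             # recall reported for the MLP above
false_positive_rate = 0.02   # hypothetical, NOT measured in this notebook

# Expected fractions of all transactions that end up flagged.
true_alerts = recall * prevalence
false_alerts = false_positive_rate * (1 - prevalence)

# Precision: share of flagged transactions that are actually fraud.
precision = true_alerts / (true_alerts + false_alerts)

print(f"estimated precision: {precision:.3f}")
```

Under these assumptions only a small fraction of flagged transactions would be real fraud, which is one reason to report precision (or a precision-recall curve) alongside recall.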