In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("card_transdata.csv")
df = df.rename(columns={'distance_from_home': 'home', 
                        'distance_from_last_transaction': 'trans', 
                        'ratio_to_median_purchase_price': "ratio",
                        'repeat_retailer': "repeat",
                        'used_chip': 'chip',
                        'used_pin_number': 'pin',
                        'online_order': 'online'})

df.head(2)

Unnamed: 0,home,trans,ratio,repeat,chip,pin,online,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0


The classes are imbalanced at roughly 9/1, so any split of the data will carry that imbalance into the training set, which will cause issues when training the models.

I'm going to test oversampling and undersampling on the training data and see how different models behave in terms of recall rather than accuracy alone. I will also use a random forest ensemble and test cost-sensitive versions of these models with different parameters.
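As a rough sketch of what oversampling and undersampling mean here, the cell below rebalances a small synthetic training frame with `sklearn.utils.resample`. The frame, its size, and the roughly 10% positive rate are assumptions for illustration only, not the real `card_transdata.csv`. Resampling should only ever be applied to the training split, never to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic stand-in for the training split (assumption: about 10% fraud)
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "ratio": rng.normal(size=1000),
    "fraud": (rng.random(1000) < 0.1).astype(float),
})

majority = train[train["fraud"] == 0.0]
minority = train[train["fraud"] == 1.0]

# Oversampling: draw minority rows with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Undersampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled["fraud"].value_counts())
print(undersampled["fraud"].value_counts())
```

Oversampling keeps every original row but repeats fraud cases, while undersampling throws away most of the non-fraud rows, so the two can behave quite differently on a dataset this size.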

In [None]:
# I will use sklearn, as it provides easy and basic model interpretability.

from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.svm import SVC as svc

In [3]:
# Separating the predictors from the response
X = df.drop(["fraud"], axis=1)
y = df["fraud"]

X_train, X_test, y_train, y_test = tts(X, y, random_state=42)
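One optional variation, not what the cell above does: passing `stratify=y` to `train_test_split` keeps the fraud rate the same in both splits. The sketch below uses a small synthetic frame (`X_demo` and `y_demo` are hypothetical names, with exactly 10% positives by construction) to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 positives and 900 negatives (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"home": rng.normal(size=1000)})
y_demo = pd.Series(np.r_[np.ones(100), np.zeros(900)], name="fraud")

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# The positive rate should be close to 10% in both splits
print(y_tr.mean(), y_te.mean())
```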

In [4]:
# I will first scale the non-binary features so that they are on a comparable scale for the models

non_binary = ["home", "trans", "ratio"]
binary = ["repeat", "chip", "pin", "online"]

scaler = StandardScaler()
scaler.fit(X_train[non_binary])

# the scaler is fit on the training split only; transforming train and test separately prevents data leakage
X_train[non_binary] = scaler.transform(X_train[non_binary])
X_test[non_binary] = scaler.transform(X_test[non_binary])
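A possible alternative to scaling by hand is to put the scaler inside a `Pipeline`, so that during cross-validation it is refit on the training part of each fold. This is only a sketch on synthetic data: the column names match the ones above, but `X_demo`, `y_demo`, and the rule that generates the labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

non_binary = ["home", "trans", "ratio"]
binary = ["repeat", "chip", "pin", "online"]

# Synthetic stand-in for the real features (assumption for illustration)
rng = np.random.default_rng(42)
n = 2000
X_demo = pd.DataFrame(rng.exponential(size=(n, 3)), columns=non_binary)
for col in binary:
    X_demo[col] = rng.integers(0, 2, size=n).astype(float)
# Hypothetical label rule: flag rows with an unusually high ratio
y_demo = (X_demo["ratio"] > 2.0).astype(float)

# Scale only the non-binary columns; binary columns pass through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), non_binary)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler is refit inside each of the 5 default folds
scores = cross_val_score(model, X_demo, y_demo, scoring="f1")
print(scores)
```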


In [7]:
# First I will apply cross-validation to get an idea of
# which model might be the best to use in this case:
# logistic regression, random forest, or SVM

log = LogisticRegression()

In [6]:
# I will use the default n_estimators, which is 100 trees.
# This takes a bit of time on a regular computer,
# since it fits 1M observations. However, it is needed
# for the purpose of this project.

forest = RandomForest()

In [8]:
# I will use the default RBF kernel for the SVM as well.
# I will reduce the amount of fitting data, since SVMs
# are more computationally expensive than the other models.
# This will be costly for the model's performance, but for the sake
# of comparison it will do.

svm = svc()

In [9]:
log_scores = cross_val_score(log, X_train[:100000], y_train[:100000], scoring="f1")
log_scores

array([0.69249395, 0.69969252, 0.70625   , 0.70652545, 0.71840659])

In [12]:
forest_scores = cross_val_score(forest, X_train[:100000], y_train[:100000], scoring="f1")
forest_scores

array([0.9994315 , 0.99886234, 0.99971599, 0.99914845, 1.        ])

In [11]:
svm_scores = cross_val_score(svm, X_train[:100000], y_train[:100000], scoring="f1")
svm_scores

array([0.9583815 , 0.95888014, 0.96309212, 0.97677086, 0.97088498])

Random forests seem to perform the best across the different cross-validation splits. Given the timeline, we will not go further on model selection, and I will settle on random forests.
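To make the comparison easier to read, the fold scores printed above can be reduced to a mean and a standard deviation per model. The arrays below are copied from the cell outputs; the `summary` dictionary and its keys are just names chosen for this sketch.

```python
import numpy as np

# F1 scores per fold, copied from the cross_val_score outputs above
log_scores = np.array([0.69249395, 0.69969252, 0.70625, 0.70652545, 0.71840659])
forest_scores = np.array([0.9994315, 0.99886234, 0.99971599, 0.99914845, 1.0])
svm_scores = np.array([0.9583815, 0.95888014, 0.96309212, 0.97677086, 0.97088498])

# Mean and standard deviation of F1 across the 5 folds
summary = {
    name: (scores.mean(), scores.std())
    for name, scores in [
        ("logistic", log_scores),
        ("forest", forest_scores),
        ("svm", svm_scores),
    ]
}

for name, (mean, std) in summary.items():
    print(f"{name:>8}: mean F1 = {mean:.4f}, std = {std:.4f}")

best = max(summary, key=lambda name: summary[name][0])
print("best by mean F1:", best)
```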

However, it will take a lot of time to fit a random forest model on 1M observations on a regular computer, but for the purpose of this project I will have to perform this computation.

I will also have to use methods for setting good parameters to maximize model "security".
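One way this parameter search could look is a small grid search scored on recall, so that missed fraud cases are what gets penalized. This is a sketch under assumptions, not the final setup: the data comes from `make_classification` with about 10% positives, and the grid values (`max_depth`, `class_weight`), the 25 trees, and the 3 folds are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (assumption: 7 features, roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, weights=[0.9, 0.1], random_state=42
)

# Small illustrative grid; class_weight="balanced" is the cost-sensitive option
param_grid = {
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid,
    scoring="recall",  # optimize for catching positives rather than accuracy
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Scoring on recall alone can favor models that flag too many transactions, so it may be worth checking precision or F1 for the selected parameters as well.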