# Preprocessing, Training, and Modeling

In [1]:
import numpy as np
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

## Preprocessing and Understanding

In [80]:
df = pd.read_csv('fraudTrain.csv')

In [81]:
# Drop the first column, which duplicates the index
df = df.iloc[: , 1:]
# Rename the timestamp column for convenience
df = df.rename(columns={'trans_date_trans_time':'date_time'})
# Drop any duplicate rows
df.drop_duplicates(inplace=True)

In [82]:
df.shape

(1296675, 22)

In [83]:
# Downsample to a random 1% of the rows to speed up experimentation
df = df.sample(frac=0.01)
df.shape

(12967, 22)

In [84]:
df.head()

Unnamed: 0,date_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
563903,2019-08-28 02:49:02,180011453250192,fraud_Jast-McDermott,shopping_pos,146.54,Craig,Dunn,M,721 Jacqueline Brooks,New Boston,...,41.2153,-90.9879,1504,Manufacturing engineer,1993-10-05,b3675f50a883f2c60e701dfc326bc244,1346122142,40.8768,-91.929266,0
805864,2019-12-05 22:03:29,4181833256558613886,"fraud_Witting, Beer and Ernser",home,87.06,Jessica,Potter,F,7600 Stephen Course Suite 031,Red River,...,36.6659,-105.4694,606,"Surveyor, land/geomatics",1988-09-06,c9cf73ed101b58ff342015dfa0496cd4,1354745009,36.455923,-104.520076,0
68445,2019-02-10 12:58:17,4935858973307492,fraud_Wilkinson PLC,kids_pets,44.04,Lance,Wagner,M,6003 Brady Shoal Apt. 449,Irwinton,...,32.8088,-83.174,1841,Film/video editor,1975-06-01,b12086115b4076e7d0c2e7c23d966458,1328878697,31.897108,-82.556987,0
83788,2019-02-18 22:49:50,3545109339866548,fraud_Dare-Gibson,health_fitness,35.57,Keith,Sanders,M,8030 Beck Motorway,Moorhead,...,33.4783,-90.5142,2870,Chartered public finance accountant,1999-03-05,1e2ccb8023756dadaab2a33e7dafa6d1,1329605390,33.296778,-89.531965,0
39058,2019-01-23 16:48:10,4344955088481397,fraud_Eichmann-Kilback,home,68.65,Christopher,Rodgers,M,30587 Fox Shores Apt. 627,Fenelton,...,40.8555,-79.7372,2054,Communications engineer,1976-09-29,9cadac697a133719106faf584bc4b10f,1327337290,41.313525,-80.603195,0


In [85]:
df.columns

Index(['date_time', 'cc_num', 'merchant', 'category', 'amt', 'first', 'last',
       'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop',
       'job', 'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long',
       'is_fraud'],
      dtype='object')

In [86]:
# No null data
df.isnull().sum()

date_time     0
cc_num        0
merchant      0
category      0
amt           0
first         0
last          0
gender        0
street        0
city          0
state         0
zip           0
lat           0
long          0
city_pop      0
job           0
dob           0
trans_num     0
unix_time     0
merch_lat     0
merch_long    0
is_fraud      0
dtype: int64

In [87]:
print(df['amt'].max())
print(df['amt'].min())

12025.3
1.0


The transaction amount spans several orders of magnitude (from 1.0 to 12025.3), so I standardize this column to zero mean and unit variance before modeling.

In [10]:
# We want to standardize the 'amt' column
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
amount = df['amt'].values
df['amt'] = sc.fit_transform(amount.reshape(-1, 1))
df['amt']

500767    -0.376352
784441     0.133939
519637     0.168301
1150334    1.380829
348021    -0.273268
             ...   
8395      -0.093539
1249845   -0.181562
848046    -0.055775
260209     0.451003
1200601   -0.128068
Name: amt, Length: 12967, dtype: float64
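The cell above fits the scaler on the full sample before any train/test split. A common alternative, sketched here under the assumption that only training rows should inform the scaling statistics, is to compute the mean and standard deviation on the training portion and reuse them for the test portion. The arrays and the helper name `standardize_with_train_stats` are hypothetical. The arithmetic mirrors the z-score that `StandardScaler` computes, which uses the population standard deviation.

```python
import numpy as np

def standardize_with_train_stats(train, test):
    """Z-score both arrays using statistics computed on `train` only."""
    mean = train.mean()
    std = train.std()  # population std (ddof=0), as StandardScaler uses
    return (train - mean) / std, (test - mean) / std

# Hypothetical transaction amounts, not taken from fraudTrain.csv
train_amt = np.array([1.0, 5.0, 9.0, 25.0, 60.0])
test_amt = np.array([3.0, 120.0])

train_scaled, test_scaled = standardize_with_train_stats(train_amt, test_amt)
print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The training values end up with mean 0 and standard deviation 1, while the test values are shifted and scaled by the same training statistics, so no information from the test rows leaks into preprocessing.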

The 'amt' (amount) column is now standardized, so we can move on to the train/test split. During EDA, we found that most of the columns are not correlated with whether or not the transaction is fraudulent. Therefore, we will create smaller dataframes that contain only the features we need, which also reduces running time.

In [88]:
dfa = df[['amt', 'is_fraud']]
dft = df[['date_time', 'is_fraud']]

## Train Test Split - Amount

In [12]:
X = dfa.drop('is_fraud', axis = 1).values
y = dfa['is_fraud'].values

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

### Decision Tree Classifier

In [14]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
DT.fit(X_train, y_train)
dt_yhat = DT.predict(X_test)

In [15]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

print('Accuracy score of the Decision Tree model is {}'.format(accuracy_score(y_test, dt_yhat)))
print('F1 score of the Decision Tree model is {}'.format(f1_score(y_test, dt_yhat)))
confusion_matrix(y_test, dt_yhat, labels = [0, 1])

Accuracy score of the Decision Tree model is 0.9938309685379395
F1 score of the Decision Tree model is 0.23076923076923078


array([[3219,    0],
       [  20,    3]], dtype=int64)

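With `labels = [0, 1]`, the rows of the confusion matrix are the true classes and the columns are the predicted classes, so the first row is non-fraudulent (0) and the second row is fraudulent (1). Treating fraud as the positive class, we have 3219 true negatives (non-fraudulent transactions correctly classified) and 0 false positives (non-fraudulent transactions flagged as fraud). In the second row, 20 fraudulent transactions were missed (false negatives) and only 3 were caught (true positives). For both accuracy and F1, we want a score close to 1. The accuracy of the Decision Tree model is very high, but the F1 score of about 0.23 is poor, because the model misses most of the fraudulent transactions.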
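As a check on how the scores relate to the matrix, the following sketch recomputes precision, recall, F1, and accuracy by hand from the Decision Tree confusion matrix printed above. It assumes fraud (label 1) is the positive class, which is the default for `f1_score`; the variable names are illustrative.

```python
# Confusion matrix reported above for the Decision Tree (labels = [0, 1]):
# rows are true classes, columns are predicted classes.
cm = [[3219, 0],
      [20, 3]]

tn, fp = cm[0]  # true non-fraud: predicted non-fraud, predicted fraud
fn, tp = cm[1]  # true fraud: predicted non-fraud, predicted fraud

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraudulent transactions that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)
```

Precision is perfect because nothing was wrongly flagged, but recall is only 3 out of 23, which is what pulls the F1 score down while accuracy stays above 99%.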

### K-Nearest Neighbors

In [21]:
from sklearn.neighbors import KNeighborsClassifier

n = 7
KNN = KNeighborsClassifier(n_neighbors = n)
KNN.fit(X_train, y_train)
knn_yhat = KNN.predict(X_test)

In [22]:
print('Accuracy score of the K-Nearest Neighbors model is {}'.format(accuracy_score(y_test, knn_yhat)))
print('F1 score of the K-Nearest Neighbors model is {}'.format(f1_score(y_test, knn_yhat)))
confusion_matrix(y_test, knn_yhat, labels = [0, 1])

Accuracy score of the K-Nearest Neighbors model is 0.9935225169648365
F1 score of the K-Nearest Neighbors model is 0.16


array([[3219,    0],
       [  21,    2]], dtype=int64)

### Logistic Regression

In [24]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_yhat = lr.predict(X_test)

In [25]:
print('Accuracy score of the Logistic Regression model is {}'.format(accuracy_score(y_test, lr_yhat)))
print('F1 score of the Logistic Regression model is {}'.format(f1_score(y_test, lr_yhat)))
confusion_matrix(y_test, lr_yhat, labels = [0, 1])

Accuracy score of the Logistic Regression model is 0.9919802590993214
F1 score of the Logistic Regression model is 0.0


array([[3216,    3],
       [  23,    0]], dtype=int64)

### Random Forest

In [26]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth = 4)
rf.fit(X_train, y_train)
rf_yhat = rf.predict(X_test)

In [27]:
print('Accuracy score of the Random Forest model is {}'.format(accuracy_score(y_test, rf_yhat)))
print('F1 score of the Random Forest model is {}'.format(f1_score(y_test, rf_yhat)))
confusion_matrix(y_test, rf_yhat, labels = [0, 1])

Accuracy score of the Random Forest model is 0.9935225169648365
F1 score of the Random Forest model is 0.22222222222222218


array([[3218,    1],
       [  20,    3]], dtype=int64)

### XGBoost

In [28]:
from xgboost import XGBClassifier

xgb = XGBClassifier(max_depth = 4)
xgb.fit(X_train, y_train)
xgb_yhat = xgb.predict(X_test)

In [29]:
print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))
print('F1 score of the XGBoost model is {}'.format(f1_score(y_test, xgb_yhat)))
confusion_matrix(y_test, xgb_yhat, labels = [0, 1])

Accuracy score of the XGBoost model is 0.9935225169648365
F1 score of the XGBoost model is 0.3225806451612903


array([[3216,    3],
       [  18,    5]], dtype=int64)

### Hybrid XGBoost Model - XGBoost and Random Forest Classifier

In [30]:
from xgboost import XGBRFClassifier

xgbrf = XGBRFClassifier()
xgbrf.fit(X_train, y_train)
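# Predict with the fitted XGBRF model. Note: the output recorded below came from
# xgb.predict(X_test), so it duplicates the XGBoost scores.
xgbrf_yhat = xgbrf.predict(X_test)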

In [31]:
print('Accuracy score of the XGBRF model is {}'.format(accuracy_score(y_test, xgbrf_yhat)))
print('F1 score of the XGBRF model is {}'.format(f1_score(y_test, xgbrf_yhat)))
confusion_matrix(y_test, xgbrf_yhat, labels = [0, 1])

Accuracy score of the XGBRF model is 0.9935225169648365
F1 score of the XGBRF model is 0.3225806451612903


array([[3216,    3],
       [  18,    5]], dtype=int64)

## Overall Scores

In [34]:
accuracy = [accuracy_score(y_test, dt_yhat), accuracy_score(y_test, knn_yhat), accuracy_score(y_test, lr_yhat),
            accuracy_score(y_test, rf_yhat), accuracy_score(y_test, xgb_yhat), accuracy_score(y_test, xgbrf_yhat)]
f1 = [f1_score(y_test, dt_yhat), f1_score(y_test, knn_yhat), f1_score(y_test, lr_yhat),
      f1_score(y_test, rf_yhat), f1_score(y_test, xgb_yhat), f1_score(y_test, xgbrf_yhat)]
names = ['DT', 'KNN', 'LR', 'RF', 'XGB', 'XGBRF']

In [64]:
def scores(accuracy, f1, names):
    # Iterate over every model rather than a hard-coded count
    for name, acc, f1_value in zip(names, accuracy, f1):
        print('For', name)
        print('It has an accuracy score of', acc)
        print('And an F1 score of', f1_value)

In [65]:
scores(accuracy, f1, names)

For DT
It has an accuracy score of 0.9938309685379395
And an F1 score of 0.23076923076923078
For KNN
It has an accuracy score of 0.9935225169648365
And an F1 score of 0.16
For LR
It has an accuracy score of 0.9919802590993214
And an F1 score of 0.0
For RF
It has an accuracy score of 0.9935225169648365
And an F1 score of 0.22222222222222218
For XGB
It has an accuracy score of 0.9935225169648365
And an F1 score of 0.3225806451612903
For XGBRF
It has an accuracy score of 0.9935225169648365
And an F1 score of 0.3225806451612903


In [66]:
max(accuracy)

0.9938309685379395

In [67]:
max(f1)

0.3225806451612903
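`max(accuracy)` and `max(f1)` return only the score, not the model it belongs to. A minimal sketch of recovering the model names, using the F1 scores printed in the sections above (the dictionary name `f1_by_model` is illustrative):

```python
# F1 scores copied from the printed results of each model above
f1_by_model = {
    'DT': 0.23076923076923078,
    'KNN': 0.16,
    'LR': 0.0,
    'RF': 0.22222222222222218,
    'XGB': 0.3225806451612903,
    'XGBRF': 0.3225806451612903,
}

best_f1 = max(f1_by_model.values())
# Collect every model that reaches the best score, so ties stay visible
best_models = [name for name, score in f1_by_model.items() if score == best_f1]
print(best_models, best_f1)
```

Keeping names and scores together in one mapping avoids relying on two parallel lists staying in the same order.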

The Decision Tree classifier has the highest accuracy score, but XGBoost and the XGBoost/Random Forest hybrid (XGBRF) have the highest F1 score. So how do we choose between the two metrics?

## Conclusion

Remember that the F1 score balances precision and recall on the positive class, while accuracy counts correctly classified observations from both the positive and negative classes. This makes a big difference for imbalanced problems, where a model is good at predicting true negatives by default and accuracy is therefore high. However, if you care equally about true negatives and true positives, then accuracy is the metric you should choose.

Since our dataset is naturally imbalanced, the accuracy score will be very high by default. That is why we care more about the F1 score in this case.
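To make the "high by default" point concrete, here is a small sketch of a baseline that labels every test transaction as non-fraudulent. The class counts are the ones implied by the confusion matrices above (3219 non-fraudulent and 23 fraudulent test rows). Treating F1 as 0 when there are no true positives is an assumption of this sketch.

```python
# Test-set class counts implied by the confusion matrices above
n_nonfraud = 3219
n_fraud = 23

# Baseline: predict "non-fraudulent" for every transaction
tp, fp, fn, tn = 0, 0, n_fraud, n_nonfraud

baseline_accuracy = (tp + tn) / (n_nonfraud + n_fraud)
# With no true positives, F1 is treated as 0 in this sketch
baseline_f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(baseline_accuracy, baseline_f1)
```

This do-nothing baseline already reaches an accuracy above 99%, within a fraction of a percentage point of every model above, while its F1 score is 0. That gap is why F1 is the more informative metric here.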

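Therefore, the best models by F1 score are XGBoost and XGBRF, which report the same accuracy (99.35%) and the same F1 score (32.26%). The XGBRF figures recorded above are identical to XGBoost's because they were computed from the XGBoost model's predictions, so they should not be read as an independent result. Even the best F1 score here is low: the model catches only 5 of the 23 fraudulent transactions in the test set.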