<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Capstone Project: Credit Card Fraud Detection Model
Fight Against Financial Crime with Machine Learning

## Introduction

With the growth of the internet and the popularity of credit card payments, the number of credit card fraud cases has increased compared with the past. Although the crime rate in Singapore is very low, credit card fraud still occurs, and it affects both the credit card users and the banks that provide the credit card facilities.

News about fraudulent credit card transactions is becoming more frequent, and both regulators and the public are becoming more cautious and are looking into the issue seriously. According to [Straits Times](https://www.straitstimes.com/singapore/460-jump-in-unauthorised-online-banking-and-card-transactions-in-2020), there were 1,848 police reports of such transactions involving criminals phishing for banking and card details from victims, up 462 per cent from 2019's 329 cases.

Banks have taken precautionary actions by implementing stringent SMS one-time password (OTP) requirements and by sending transaction alerts to users in order to prevent credit card fraud. However, as technology evolves, fraudsters will keep coming up with new ways to commit credit card fraud, so we have to keep modifying and improving our detection model in order to detect credit card fraud as accurately as possible.

<img src="../images/news1.png" width="800" height="800" />
<img src="../images/news2.png" width="800" height="800" />
<img src="../images/news3.png" width="800" height="800" />

## Problem Statements

One of the bank's main businesses is providing credit card facilities to its clients. However, credit card fraud cases do not seem to have decreased as technology has evolved in recent years. Fraud has caused significant losses and negative impact to the clients as well as the bank. On the client side, not every fraudulent transaction is noticed in time, and some clients are unaware of fraudulent transactions because the amounts are too small to be noticed. These clients may therefore not report the discrepancies in time for the bank to investigate and reimburse the losses. In most cases, if the bank's investigation concludes that a transaction was fraudulent, the bank has to reimburse the client and bears the loss. On top of that, clients may lose confidence in the bank's products, which causes further reputational loss to the bank.

Given the above, the Risk and Compliance department has approached me, a data scientist at the bank, to build an effective __Fraud Detection Model__ that is able to classify fraudulent transactions accurately. I have been provided with a training dataset of about 1.3 million transactions and a test dataset of about 550k transactions, covering the period 1st Jan 2019 - 31st Dec 2020. The data covers the credit cards of 1000 customers transacting with a pool of 800 merchants. I will also need to __identify the top predictors and features__ that are used to detect fraudulent credit card transactions.

With the fraud detection model, the team will be able to identify a fraudulent transaction as soon as it is detected. This will reduce the bank's losses from credit card fraud, strengthen its reputation for uncompromising security, and keep clients from becoming victims of fraudulent transactions.

I decided to build the fraud detection classification model by exploring several classification algorithms, for example `LogisticRegression`, `GaussianNB`, `RandomForestClassifier`, `LGBMClassifier`, `XGBClassifier` and `CatBoostClassifier`. I will also conduct a __*Recency, Frequency, Monetary Value (RFM) Analysis for customer segmentation*__ and __*build a Tableau Dashboard to visualize the predicted outcome from the test dataset*__.

The dataset provided is highly imbalanced: only 0.58% of the transactions are labelled as fraud, which means that accuracy is not a suitable metric for evaluating models on this dataset. I need a high number of True Positives with minimal False Negatives (predicted not fraud, but actually fraud) and False Positives (predicted fraud, but actually not fraud). My priority metrics for evaluating the success of my model are therefore a __High Recall score__ and a __High F1 Score__, which balances recall with the __Precision Score__; specifically, I am targeting scores above 90%. The model has to capture as many fraudulent transactions as possible while keeping Type I errors (False Positives) to a minimum, because too many Type I errors would trigger unnecessary alerts and inconvenience clients.
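
To make the point about accuracy concrete, here is a minimal sketch that derives the metrics from confusion-matrix counts. The counts are hypothetical, chosen only to mirror a 0.58% fraud rate in 100,000 transactions; they are not results from this project, and `classification_metrics` is an illustrative helper rather than part of the notebooks.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 100,000 transactions, 580 of them fraud (0.58%).
# A "model" that always predicts not-fraud never catches a fraud case:
always_negative = classification_metrics(tp=0, fp=0, fn=580, tn=99_420)

# A hypothetical model that catches 522 of the 580 frauds with 58 false alerts:
useful_model = classification_metrics(tp=522, fp=58, fn=58, tn=99_362)

print(always_negative)
print(useful_model)
```

The always-negative baseline still reaches an accuracy above 99% while its recall is zero, which is why recall, precision and F1 are the metrics tracked here.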

## Datasets

If you would like to download the codebook and run it on your machine, you will need to download the full datasets from Kaggle at https://www.kaggle.com/kartik2112/fraud-detection and __save both csv files into the `datasets` folder.__
- fraudTest.csv (150.35 MB)
- fraudTrain.csv (351.24 MB)

# Notebook 01: Data Wrangling

## Loading the Datasets

In [1]:
import pandas as pd
import numpy as np
import datetime
from geopy import distance

In [2]:
# load datasets
train = pd.read_csv("../datasets/fraudTrain.csv")
test = pd.read_csv("../datasets/fraudTest.csv")  # file name is case-sensitive on Linux/macOS

In [3]:
# check train datasets shape
print(train.shape)
train.head(3)

(1296675, 23)


Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0


In [4]:
# check test datasets shape
print(test.shape)
test.head(3)

(555719, 23)


Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0


In [5]:
# check summary statistics of the numeric columns in the train dataset
train.describe()

Unnamed: 0.1,Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud
count,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0
mean,648337.0,4.17192e+17,70.35104,48800.67,38.53762,-90.22634,88824.44,1349244000.0,38.53734,-90.22646,0.005788652
std,374318.0,1.308806e+18,160.316,26893.22,5.075808,13.75908,301956.4,12841280.0,5.109788,13.77109,0.07586269
min,0.0,60416210000.0,1.0,1257.0,20.0271,-165.6723,23.0,1325376000.0,19.02779,-166.6712,0.0
25%,324168.5,180042900000000.0,9.65,26237.0,34.6205,-96.798,743.0,1338751000.0,34.73357,-96.89728,0.0
50%,648337.0,3521417000000000.0,47.52,48174.0,39.3543,-87.4769,2456.0,1349250000.0,39.36568,-87.43839,0.0
75%,972505.5,4642255000000000.0,83.14,72042.0,41.9404,-80.158,20328.0,1359385000.0,41.95716,-80.2368,0.0
max,1296674.0,4.992346e+18,28948.9,99783.0,66.6933,-67.9503,2906700.0,1371817000.0,67.51027,-66.9509,1.0


In [6]:
# check the object statistics
train.describe(include='object').T

Unnamed: 0,count,unique,top,freq
trans_date_trans_time,1296675,1274791,2019-04-22 16:02:01,4
merchant,1296675,693,fraud_Kilback LLC,4403
category,1296675,14,gas_transport,131659
first,1296675,352,Christopher,26669
last,1296675,481,Smith,28794
gender,1296675,2,F,709863
street,1296675,983,0069 Robin Brooks Apt. 695,3123
city,1296675,894,Birmingham,5617
state,1296675,51,TX,94876
job,1296675,494,Film/video editor,9779


The dataframe statistics show that the __transaction numbers__ are not repeated; each transaction number is unique. However, because `cc_num` has an integer data type, the table above does not show the number of unique credit card numbers.
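
One way to get the unique count for an integer column is `Series.nunique()`, or casting the column to `object` before calling `describe`. The sketch below uses a small toy frame (the `trans_num` values are made up) rather than the real dataset.

```python
import pandas as pd

# Toy data: two distinct card numbers, three distinct transaction numbers
toy = pd.DataFrame({
    "cc_num": [2703186189652095, 630423337322, 2703186189652095],
    "trans_num": ["t1", "t2", "t3"],  # hypothetical IDs
})

# Direct count of distinct card numbers, regardless of dtype
n_cards = toy["cc_num"].nunique()

# After casting to object, cc_num appears in the object summary
# together with its `unique` count
object_stats = toy.astype({"cc_num": "object"}).describe(include="object").T

print(n_cards)
print(object_stats)
```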

In [7]:
# check train dataset dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [8]:
train.isnull().sum() #check null values

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

In [9]:
# check test dataset dtypes
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  float64
 14  long                   555719 non-nu

In [10]:
test.isnull().sum() #check null values

Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

Neither the train nor the test dataset has any missing values, and both come with our target feature `is_fraud`. Hence, we can build the model on the train dataset by splitting it into train and validation sets for model evaluation. We can then use the trained model to predict on the test dataset and further evaluate the outcome.
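
With a class as rare as fraud, a plain random split can leave the validation set with too few positive cases. The following is a minimal sketch of a stratified split using scikit-learn's `train_test_split`; the toy arrays, the 80/20 ratio and the random seed are assumptions for illustration, not the settings used later in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with a 1% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1

# stratify=y keeps the class ratio the same in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_val.mean())
```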

Although the datasets seem to be clean, we will still have to conduct data cleaning before our EDA.

## Data Dictionary

|Feature|Dataset|Type|Description|
|---|---|---|---|
|trans_datetime|train_cleaned/test_cleaned|datetime|Credit card transaction date and time|
|cc_num|train_cleaned/test_cleaned|object|Credit card number|
|merchant|train_cleaned/test_cleaned|object|Merchant name of the credit card transaction|
|category|train_cleaned/test_cleaned|object|Merchant's category|
|amt|train_cleaned/test_cleaned|float|Credit card transaction amount|
|gender|train_cleaned/test_cleaned|object|Gender (Male/Female)|
|street|train_cleaned/test_cleaned|object|Street of credit cardholder's location|
|city|train_cleaned/test_cleaned|object|City of credit cardholder's location|
|state|train_cleaned/test_cleaned|object|State of credit cardholder's location|
|zip|train_cleaned/test_cleaned|object|Zip code of credit cardholder's location|
|lat|train_cleaned/test_cleaned|float|Latitude of credit cardholder's location|
|long|train_cleaned/test_cleaned|float|Longitude of credit cardholder's location|
|city_pop|train_cleaned/test_cleaned|int|City population|
|job|train_cleaned/test_cleaned|object|Job or profession of credit cardholder|
|trans_num|train_cleaned/test_cleaned|object|Credit card transaction number|
|merch_lat|train_cleaned/test_cleaned|float|Latitude of merchant's location|
|merch_long|train_cleaned/test_cleaned|float|Longitude of merchant's location|
|is_fraud|train_cleaned/test_cleaned|int|Target outcome (0: non-fraud, 1: fraud)|
|name|train_cleaned/test_cleaned|object|Full name of credit cardholder|
|coords_ori|train_cleaned/test_cleaned|object|lat and long combined as credit cardholder's coordinates|
|coords_merch|train_cleaned/test_cleaned|object|merch_lat and merch_long combined as merchant's coordinates|
|trans_year|train_cleaned/test_cleaned|int|Year in which the transaction was performed|
|trans_month|train_cleaned/test_cleaned|int|Month in which the transaction was performed|
|trans_week|train_cleaned/test_cleaned|int|ISO week in which the transaction was performed|
|trans_day|train_cleaned/test_cleaned|int|Day of month on which the transaction was performed|
|trans_hour|train_cleaned/test_cleaned|int|Hour at which the transaction was performed|
|trans_minute|train_cleaned/test_cleaned|int|Minute at which the transaction was performed|
|trans_dayofweek|train_cleaned/test_cleaned|int|Day of week on which the transaction was performed (0: Monday, 6: Sunday)|
|age|train_cleaned/test_cleaned|int|Age of credit cardholder at the time of the transaction (transaction year minus birth year)|
|distance|train_cleaned/test_cleaned|float|Distance in km between credit cardholder's and merchant's coordinates|

## Data Cleaning

In [12]:
# remove unused columns: the leftover index column and unix_time
train.drop(['Unnamed: 0','unix_time'], axis=1, inplace=True)
test.drop(['Unnamed: 0', 'unix_time'], axis=1, inplace=True)

In [13]:
# change type to datetime for 'trans_date_trans_time' and 'dob'
train['trans_date_trans_time'] = pd.to_datetime(train['trans_date_trans_time'])
train['dob'] = pd.to_datetime(train['dob'])

test['trans_date_trans_time'] = pd.to_datetime(test['trans_date_trans_time'])
test['dob'] = pd.to_datetime(test['dob'])

In [14]:
# change type of credit card number and zip code to object
train['cc_num'] = train['cc_num'].astype('object')
train['zip'] = train['zip'].astype('object')

test['cc_num'] = test['cc_num'].astype('object')
test['zip'] = test['zip'].astype('object')

In [15]:
# rename columns
train.rename(columns={'trans_date_trans_time':'trans_datetime'}, inplace=True)

test.rename(columns={'trans_date_trans_time':'trans_datetime'}, inplace=True)

In [16]:
# combine first and last name
train['name'] = train['first'] + ' ' + train['last']
train.drop(columns=['first','last'], inplace=True)

test['name'] = test['first'] + ' ' + test['last']
test.drop(columns=['first','last'], inplace=True)

In [17]:
# combine lat and long into (lat, long) coordinate tuples
train['coords_ori'] = list(zip(train['lat'], train['long']))
train['coords_merch'] = list(zip(train['merch_lat'], train['merch_long']))

test['coords_ori'] = list(zip(test['lat'], test['long']))
test['coords_merch'] = list(zip(test['merch_lat'], test['merch_long']))

In [18]:
# Create new features from dob and trans_datetime
train['dob_year'] = train['dob'].dt.year
train['trans_year'] = train['trans_datetime'].dt.year
train['trans_month'] = train['trans_datetime'].dt.month
train['trans_week'] = train['trans_datetime'].dt.isocalendar().week.astype('int') 
train['trans_day'] = train['trans_datetime'].dt.day
train['trans_hour'] = train['trans_datetime'].dt.hour
train['trans_minute'] = train['trans_datetime'].dt.minute
train['trans_dayofweek'] = train['trans_datetime'].dt.dayofweek

test['dob_year'] = test['dob'].dt.year
test['trans_year'] = test['trans_datetime'].dt.year
test['trans_month'] = test['trans_datetime'].dt.month
test['trans_week'] = test['trans_datetime'].dt.isocalendar().week.astype('int') 
test['trans_day'] = test['trans_datetime'].dt.day
test['trans_hour'] = test['trans_datetime'].dt.hour
test['trans_minute'] = test['trans_datetime'].dt.minute
test['trans_dayofweek'] = test['trans_datetime'].dt.dayofweek
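
Because the same columns are derived twice, once for `train` and once for `test`, the step could also be wrapped in a helper so both frames stay in sync. This is a sketch only: `add_time_features` is a hypothetical name, and the toy frame reuses the first timestamps shown in the train and test previews above.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, col: str = "trans_datetime") -> pd.DataFrame:
    """Return a copy of df with calendar features derived from df[col]."""
    out = df.copy()
    dt = out[col].dt
    out["trans_year"] = dt.year
    out["trans_month"] = dt.month
    out["trans_week"] = dt.isocalendar().week.astype("int")  # ISO week number
    out["trans_day"] = dt.day
    out["trans_hour"] = dt.hour
    out["trans_minute"] = dt.minute
    out["trans_dayofweek"] = dt.dayofweek  # 0 = Monday, 6 = Sunday
    return out


toy = pd.DataFrame({
    "trans_datetime": pd.to_datetime(["2019-01-01 00:00:18", "2020-06-21 12:14:25"])
})
features = add_time_features(toy)
print(features.T)
```

With such a helper, the two blocks above would reduce to `train = add_time_features(train)` and `test = add_time_features(test)`.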

In [19]:
# create `age` column for the user at the point of transaction
train['age'] = train['trans_year'] - train['dob_year']

test['age'] = test['trans_year'] - test['dob_year']
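
Note that subtracting years gives the age the cardholder reaches during the transaction year, which can be one year higher than the age on the transaction date. If an exact age were needed, a birthday-aware calculation could look like the sketch below; `age_on` is a hypothetical helper and is not used in the rest of this notebook.

```python
from datetime import date


def age_on(dob: date, on: date) -> int:
    """Age in completed years on the date `on`."""
    # True once the birthday has been reached in the year of `on`
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


# First row of the train preview: dob 1988-03-09, transaction on 2019-01-01
exact_age = age_on(date(1988, 3, 9), date(2019, 1, 1))
year_difference = 2019 - 1988

print(exact_age, year_difference)  # 30 31
```

For a model feature the one-year difference is unlikely to matter much, so the simpler year subtraction is kept here.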

In [20]:
# drop `dob` and `dob_year`
train.drop(['dob','dob_year'], axis=1, inplace=True)

test.drop(['dob','dob_year'], axis=1, inplace=True)

In [21]:
# create function to calculate the distance in km between the cardholder's location and the merchant's location
def calculate_distance(ori, merch):
    return distance.distance(ori, merch).km

In [22]:
# create a distance column
train['distance'] = train.apply(lambda row: calculate_distance(row['coords_ori'], row['coords_merch']), axis=1)

test['distance'] = test.apply(lambda row: calculate_distance(row['coords_ori'], row['coords_merch']), axis=1)
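
Applying `geopy` row by row over roughly 1.85 million rows can take a long time. If an approximation is acceptable, a vectorized haversine formula in NumPy is a common alternative. Note that this is a different technique from the one used above: haversine assumes a spherical Earth, whereas `geopy.distance.distance` computes a geodesic distance on an ellipsoid, so the results differ slightly. `haversine_km` and the radius constant are assumptions for this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# One degree of longitude along the equator
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)

# Array input: the first two cardholder/merchant pairs from the train preview
lat = np.array([36.0788, 48.8878])
lon = np.array([-81.1781, -118.2105])
merch_lat = np.array([36.011293, 49.159047])
merch_lon = np.array([-82.048315, -118.186462])
distances = haversine_km(lat, lon, merch_lat, merch_lon)

print(one_degree)
print(distances)
```

On the real frames this would be called as `haversine_km(train['lat'], train['long'], train['merch_lat'], train['merch_long'])`, computing the whole column in one pass.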

In [23]:
# replace white space inside the string values with underscores
# (assign the result back instead of using inplace=True on a selected column,
#  which is not reliable in recent pandas versions)
for col in ['merchant','category','city','state']:
    train[col] = train[col].str.replace(' ', '_', regex=False)
    test[col] = test[col].str.replace(' ', '_', regex=False)


Unnamed: 0,trans_datetime,cc_num,merchant,category,amt,gender,street,city,state,zip,...,coords_merch,trans_year,trans_month,trans_week,trans_day,trans_hour,trans_minute,trans_dayofweek,age,distance
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin,_Kub_and_Mann",misc_net,4.97,F,561 Perry Cove,Moravian_Falls,NC,28654,...,"(36.011293, -82.048315)",2019,1,1,1,0,0,1,31,78.773821
1,2019-01-01 00:00:44,630423337322,"fraud_Heller,_Gutmann_and_Zieme",grocery_pos,107.23,F,43039 Riley Greens Suite 393,Orient,WA,99160,...,"(49.159047, -118.186462)",2019,1,1,1,0,0,1,41,30.216618


### Export Data

In [27]:
# save to pickle file
train.to_pickle('../datasets/train_cleaned.pkl')
test.to_pickle('../datasets/test_cleaned.pkl')