# Fraud Detection

## Background

With the widespread use of credit cards, fraud has become a major problem for the banking industry, and card misuse is known to cause considerable losses. According to this [Journal Article](https://ieeexplore.ieee.org/abstract/document/9000549), credit card fraud detection is an important area of study in the current era of mobile payments. Because fraud patterns change rapidly, they need to be evaluated continuously so that prevention can be more proactive. In this setting, an efficient fraud detection algorithm built with machine learning techniques is a reliable way to help detect fraud and reduce the resulting losses.

## Objectives

Overall, this project has the following objectives and contributions:

   1. Build a machine learning model to classify each credit card transaction as fraud or not fraud.
   2. Deploy the model in a web application that scores new data and reports the probability that a transaction is fraudulent.

## Dataset

This is a simulated credit card transaction dataset containing both legitimate and fraudulent transactions from 1 Jan 2019 to 31 Dec 2020. It covers the credit cards of 1000 customers who transacted with 800 merchants.

Source : https://github.com/namebrandon/Sparkov_Data_Generation

## Import Data

The file `fraudTrain.csv` is used to build the machine learning model (**training**) and `fraud_test.csv` is used to evaluate it (**testing**). The following code reads both files with pandas and previews the training set.

In [1]:
import pandas as pd

In [2]:
fraud_train = pd.read_csv("fraudTrain.csv")
fraud_test = pd.read_csv("fraud_test.csv")
fraud_train.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


Description of the Dataset:

- `trans_date_trans_time` : transaction time (yyyy-mm-dd hh:mm:ss)
- `cc_num` : credit card number
- `merchant` : name of merchant
- `category` : category of the merchant
- `amt` : transaction amount
- `first` : first name of credit card holder
- `last` : last name of credit card holder
- `gender` : F = female, M = male
- `street` : street address of the credit card holder
- `city` : city address of the credit card holder
- `state` : state address of the credit card holder
- `zip` : zip code address of the credit card holder
- `lat` : latitude location of the credit card holder
- `long` : longitude location of the credit card holder
- `city_pop` : population of the credit card holder's city
- `job` : job of credit card holder
- `dob` : date of birth of credit card holder
- `trans_num` : transaction number
- `unix_time` : transaction time as a Unix timestamp
- `merch_lat` : latitude location of the merchant
- `merch_long` : longitude location of the merchant
- `is_fraud` : 0 = Not Fraud, 1 = Fraud

Next, inspect `fraud_train` and `fraud_test` in more detail.

In [3]:
fraud_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [4]:
fraud_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4290 entries, 0 to 4289
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             4290 non-null   int64  
 1   trans_date_trans_time  4290 non-null   object 
 2   cc_num                 4290 non-null   int64  
 3   merchant               4290 non-null   object 
 4   category               4290 non-null   object 
 5   amt                    4290 non-null   float64
 6   first                  4290 non-null   object 
 7   last                   4290 non-null   object 
 8   gender                 4290 non-null   object 
 9   street                 4290 non-null   object 
 10  city                   4290 non-null   object 
 11  state                  4290 non-null   object 
 12  zip                    4290 non-null   int64  
 13  lat                    4290 non-null   float64
 14  long                   4290 non-null   float64
 15  city

In [5]:
print(fraud_test.shape)
print(fraud_train.shape)

(4290, 23)
(1296675, 23)

In [6]:
fraud_train.isnull().sum().sum()

0

In [7]:
fraud_test.isnull().sum().sum()

0

Based on the inspection above:
- Some columns do not have an appropriate data type (for example, `trans_date_trans_time` is stored as `object`).
- Each row represents one transaction, so the train dataset has 1,296,675 transactions for model learning and the test dataset has 4,290 transactions to predict.
- Neither dataset has missing values.

## Data Cleaning & Data Wrangling

The first step in data preparation is to remove the uninformative column `Unnamed: 0`.

In [8]:
fraud_train.drop("Unnamed: 0",axis=1,inplace=True)
fraud_test.drop("Unnamed: 0",axis=1,inplace=True)

Then recode the `is_fraud` column to make it more informative:
- A value of 0 becomes **No**, meaning **Not Fraud**
- A value of 1 becomes **Yes**, meaning **Fraud**

In [9]:
# recode the target: 1 -> "Yes" (fraud), anything else -> "No"
fraud_train["is_fraud"] = fraud_train["is_fraud"].apply(lambda x: "Yes" if x == 1 else "No")
fraud_test["is_fraud"] = fraud_test["is_fraud"].apply(lambda x: "Yes" if x == 1 else "No")

# the recoded column already has dtype object; display the test labels to check
fraud_test["is_fraud"]

0       Yes
1        No
2       Yes
3        No
4       Yes
       ... 
4285     No
4286    Yes
4287     No
4288     No
4289     No
Name: is_fraud, Length: 4290, dtype: object

The next step is to convert the date columns `trans_date_trans_time` and `dob` to datetime types so that two new columns, `trans_date` and `age`, can be derived from them. These changes are applied to both the train data and the test data.

In [10]:
# create trans_date and age columns on dataset train
fraud_train['trans_date_trans_time'] = pd.to_datetime(fraud_train['trans_date_trans_time'])
# keep only the date part (time set to midnight)
fraud_train['trans_date'] = fraud_train['trans_date_trans_time'].dt.normalize()

fraud_train['dob'] = pd.to_datetime(fraud_train['dob'])
# age in whole years; 365.2425 is the mean Gregorian year length in days.
# This avoids astype('timedelta64[Y]'), which pandas 2.x no longer accepts.
fraud_train['age'] = (fraud_train['trans_date'] - fraud_train['dob']).dt.days // 365.2425

# create trans_date and age columns on dataset test
fraud_test['trans_date_trans_time'] = pd.to_datetime(fraud_test['trans_date_trans_time'])
fraud_test['trans_date'] = fraud_test['trans_date_trans_time'].dt.normalize()

fraud_test['dob'] = pd.to_datetime(fraud_test['dob'])
fraud_test['age'] = (fraud_test['trans_date'] - fraud_test['dob']).dt.days // 365.2425
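
As a cross-check on the `age` column, the same quantity can be computed exactly for a single record with the standard library. This is a minimal sketch and not part of the original notebook: `age_in_years` is a hypothetical helper, and the sample dates are taken from the first row of the preview above.

```python
from datetime import date


def age_in_years(dob: date, on_date: date) -> int:
    """Whole years elapsed between dob and on_date (calendar-exact)."""
    years = on_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in on_date's year
    if (on_date.month, on_date.day) < (dob.month, dob.day):
        years -= 1
    return years


# First row of the preview: born 1988-03-09, transaction on 2019-01-01
print(age_in_years(date(1988, 3, 9), date(2019, 1, 1)))  # 30
```

The vectorized `age` column is derived from an elapsed-time difference rather than a calendar comparison, so the two can differ by one year for transactions that fall very close to a birthday.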

The month and year of each transaction are then extracted from `trans_date`, since they may help in classifying the status of a transaction.

In [11]:
fraud_train['trans_month'] = fraud_train['trans_date'].dt.month
fraud_train['trans_year'] = fraud_train['trans_date'].dt.year

fraud_test['trans_month'] = fraud_test['trans_date'].dt.month
fraud_test['trans_year'] = fraud_test['trans_date'].dt.year

Next, compute the distance between the card holder's location and the merchant's location, in degrees of latitude and degrees of longitude.

In [12]:
fraud_train['latitudinal_distance'] = abs(round(fraud_train['merch_lat']-fraud_train['lat'],3))
fraud_train['longitudinal_distance'] = abs(round(fraud_train['merch_long']-fraud_train['long'],3))

fraud_test['latitudinal_distance'] = abs(round(fraud_test['merch_lat']-fraud_test['lat'],3))
fraud_test['longitudinal_distance'] = abs(round(fraud_test['merch_long']-fraud_test['long'],3))
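
One degree of longitude spans fewer kilometres the further a point is from the equator, so degree differences are only a rough proxy for distance on the ground. A possible alternative, shown here purely as a sketch and not used in this notebook, is the haversine great-circle distance. `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is an approximation.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Card holder and merchant coordinates from the first row of the preview
d = haversine_km(36.0788, -81.1781, 36.011293, -82.048315)
print(round(d, 1))
```

A single distance feature of this kind could stand in for the two degree-based columns, at the cost of losing the direction of the offset.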

The useful information has now been extracted from several columns, and other columns are not needed for modeling, so both groups can be dropped.

In [13]:
drop_cols = ['cc_num','trans_date_trans_time','trans_num','city','lat','long','job','dob','merch_lat','merch_long','trans_date','state', 'merchant','first','last','street','zip','unix_time']
fraud_train=fraud_train.drop(drop_cols,axis=1)
fraud_test=fraud_test.drop(drop_cols,axis=1)

In [14]:
fraud_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4290 entries, 0 to 4289
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   category               4290 non-null   object 
 1   amt                    4290 non-null   float64
 2   gender                 4290 non-null   object 
 3   city_pop               4290 non-null   int64  
 4   is_fraud               4290 non-null   object 
 5   age                    4290 non-null   float64
 6   trans_month            4290 non-null   int64  
 7   trans_year             4290 non-null   int64  
 8   latitudinal_distance   4290 non-null   float64
 9   longitudinal_distance  4290 non-null   float64
dtypes: float64(4), int64(3), object(3)
memory usage: 335.3+ KB


The last step is a final adjustment of data types: several columns should be of type `category`.

In [15]:
cat_cols = ['category','gender']

In [16]:
fraud_train[cat_cols] = fraud_train[cat_cols].astype("category")
fraud_train['is_fraud'] = fraud_train['is_fraud'].astype("category")
fraud_test[cat_cols] = fraud_test[cat_cols].astype("category")
fraud_test['is_fraud'] = fraud_test['is_fraud'].astype("category")

## Train Test Initialization

The first step in the modeling section is to define the predictor variables and the target variable for both the train and test data.

In [17]:
X = fraud_train.drop('is_fraud', axis=1)
y = fraud_train['is_fraud']

X_unseen = fraud_test.drop('is_fraud', axis=1)
y_unseen = fraud_test['is_fraud']

Split the data into training and validation sets, with 80% of the rows used for training.

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1037340, 9)
(259335, 9)
(1037340,)
(259335,)


In [19]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 9 columns):
 #   Column                 Non-Null Count    Dtype   
---  ------                 --------------    -----   
 0   category               1296675 non-null  category
 1   amt                    1296675 non-null  float64 
 2   gender                 1296675 non-null  category
 3   city_pop               1296675 non-null  int64   
 4   age                    1296675 non-null  float64 
 5   trans_month            1296675 non-null  int64   
 6   trans_year             1296675 non-null  int64   
 7   latitudinal_distance   1296675 non-null  float64 
 8   longitudinal_distance  1296675 non-null  float64 
dtypes: category(2), float64(4), int64(3)
memory usage: 71.7 MB


Apply one-hot encoding to the categorical columns.

In [20]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first', sparse=False)
encoder.fit(X_train[cat_cols])

X_train_encoded = pd.DataFrame(encoder.transform(X_train[cat_cols]),
                               index=X_train.index,
                               columns=encoder.get_feature_names(X_train[cat_cols].columns))
X_train_dummy = pd.concat([X_train.select_dtypes(exclude='category'),
                           X_train_encoded], axis=1)
X_train_dummy.head()




Unnamed: 0,amt,city_pop,age,trans_month,trans_year,latitudinal_distance,longitudinal_distance,category_food_dining,category_gas_transport,category_grocery_net,...,category_health_fitness,category_home,category_kids_pets,category_misc_net,category_misc_pos,category_personal_care,category_shopping_net,category_shopping_pos,category_travel,gender_M
1158821,99.77,4005,75.0,4,2020,0.95,0.804,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
181052,132.98,43102,68.0,4,2019,0.775,0.571,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
874052,9.03,1461,84.0,12,2019,0.599,0.016,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
839608,1.29,1423,21.0,12,2019,0.786,0.56,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1162492,6.15,3224,22.0,4,2020,0.178,0.538,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


In [21]:
X_test_encoded = pd.DataFrame(encoder.transform(X_test[cat_cols]),
                               index=X_test.index,
                               columns=encoder.get_feature_names(X_test[cat_cols].columns))
X_test_dummy = pd.concat([X_test.select_dtypes(exclude='category'),
                           X_test_encoded], axis=1)
X_test_dummy.shape



(259335, 21)

In [22]:
X_unseen_encoded = pd.DataFrame(encoder.transform(X_unseen[cat_cols]),
                               index=X_unseen.index,
                               columns=encoder.get_feature_names(X_unseen[cat_cols].columns))
X_unseen_dummy = pd.concat([X_unseen.select_dtypes(exclude='category'),
                           X_unseen_encoded], axis=1)
X_unseen_dummy.shape



(4290, 21)
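
To make the `drop='first'` encoding used above concrete, the idea can be reproduced on a toy column with plain Python. This is an illustrative sketch only: `one_hot_drop_first` is a hypothetical helper, not the scikit-learn implementation, and the four sample values are made up.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of labels, dropping the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is encoded as all zeros
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return columns, rows


columns, rows = one_hot_drop_first(["F", "M", "M", "F"], prefix="gender")
print(columns)  # ['gender_M']
print(rows)     # [[0.0], [1.0], [1.0], [0.0]]
```

This matches the single `gender_M` column in the encoded preview, and it accounts for the 21 columns reported above: 7 numeric columns, 13 `category_*` dummies, and 1 gender dummy.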

## Decision Tree 

The first model applied for classification and prediction is a Decision Tree.

In [23]:
from sklearn.tree import DecisionTreeClassifier

dt_fraud = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=100)
dt_fraud.fit(X_train_dummy, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=100)

In [24]:
y_pred_dt = dt_fraud.predict(X_test_dummy)
y_pred_prob_dt = dt_fraud.predict_proba(X_test_dummy)

In [25]:
dt_fraud.score(X_train_dummy, y_train)

0.995473036805676

In [26]:
dt_fraud.score(X_test_dummy, y_test)

0.9955771492471128

In [27]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# training performance
y_pred_train = dt_fraud.predict(X_train_dummy)
print("Recall :", recall_score(y_true=y_train, y_pred=y_pred_train, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_train, y_pred=y_pred_train, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_train, y_pred=y_pred_train))

Recall : 0.2307819748177601
Precision : 0.9633471645919779
Accuracy : 0.995473036805676


In [28]:
# test performance
y_pred_test = dt_fraud.predict(X_test_dummy)
print("Recall :", recall_score(y_true=y_test, y_pred=y_pred_test, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_test, y_pred=y_pred_test, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_test, y_pred=y_pred_test))

Recall : 0.22789115646258504
Precision : 0.9654178674351584
Accuracy : 0.9955771492471128


The model's precision and accuracy are very good, at about 96% and 99% respectively, but recall is very low at about 23%. Recall is the key metric in this case because a fraudulent transaction classified as Not Fraud is the error we most want to avoid.
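
To see why accuracy alone can mislead when fraud is rare, the three metrics can be computed by hand from the prediction counts. The sketch below is illustrative only: `binary_metrics` is a hypothetical helper and the toy labels are made up, not taken from the dataset.

```python
def binary_metrics(y_true, y_pred, pos_label="Yes"):
    """Return (recall, precision, accuracy) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision, correct / len(pairs)


# Toy data: 2 fraud cases among 1000 transactions, model predicts "No" for everything
y_true = ["Yes"] * 2 + ["No"] * 998
y_pred = ["No"] * 1000
print(binary_metrics(y_true, y_pred))  # (0.0, 0.0, 0.998)
```

A model that never flags fraud still reaches 99.8% accuracy on this toy data while catching no fraud at all, which is why recall on the fraud class is tracked separately.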

## Random Forest

The next model tried for classification is a Random Forest.

In [29]:
from sklearn.ensemble import RandomForestClassifier

rf_fraud = RandomForestClassifier(random_state=100, oob_score=True)
rf_fraud.fit(X_train_dummy, y_train)

RandomForestClassifier(oob_score=True, random_state=100)

In [30]:
# training performance
y_pred_rf_train = rf_fraud.predict(X_train_dummy)

print("Recall :", recall_score(y_true=y_train, y_pred=y_pred_rf_train, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_train, y_pred=y_pred_rf_train, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_train, y_pred=y_pred_rf_train))

Recall : 0.9998343273691186
Precision : 1.0
Accuracy : 0.9999990359959127


In [31]:
# test performance

y_pred_rf_test = rf_fraud.predict(X_test_dummy)

print("Recall :", recall_score(y_true=y_test, y_pred=y_pred_rf_test, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_test, y_pred=y_pred_rf_test, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_test, y_pred=y_pred_rf_test))

Recall : 0.6591836734693878
Precision : 0.8367875647668394
Accuracy : 0.9973393487188386


The Random Forest scores very well on recall, precision, and accuracy on the train data. Its recall on the test data (about 66%) is also better than the Decision Tree's, but the gap between train and test recall suggests overfitting, and the test recall still needs to be improved.
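
One common way to trade precision for recall, not explored in this notebook, is to lower the probability cutoff used to flag a transaction. The following is a minimal sketch, assuming the fraud-class probabilities have already been extracted (for example, the column of the `predict_proba` output that corresponds to `'Yes'` in the model's `classes_`). `predict_with_threshold` and the probabilities are hypothetical.

```python
def predict_with_threshold(fraud_probabilities, threshold=0.5):
    """Label a transaction "Yes" when its fraud probability reaches the threshold."""
    return ["Yes" if p >= threshold else "No" for p in fraud_probabilities]


# Made-up fraud probabilities for five transactions
probs = [0.05, 0.35, 0.48, 0.62, 0.91]
print(predict_with_threshold(probs))                 # ['No', 'No', 'No', 'Yes', 'Yes']
print(predict_with_threshold(probs, threshold=0.3))  # ['No', 'Yes', 'Yes', 'Yes', 'Yes']
```

Lowering the cutoff flags more transactions, which can only keep or raise recall but usually lowers precision, so the cutoff would need to be chosen on validation data.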

## Improve Model with Balanced Data

Inspect the proportion of each target value.

In [32]:
y_train.value_counts(normalize=True)

No     0.994181
Yes    0.005819
Name: is_fraud, dtype: float64

The result shows that the data is imbalanced: about 99.4% of the training set is Not Fraud. `RandomOverSampler` is therefore used to equalize the two target classes in the training data.

In [33]:
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=100)
X_train_up, y_train_up = oversampler.fit_resample(X_train_dummy, y_train)
y_train_up.value_counts()

No     1031304
Yes    1031304
Name: is_fraud, dtype: int64
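
The idea behind random oversampling can be sketched with the standard library: minority rows are duplicated at random until both classes have the same count. This is an illustration under stated assumptions, not the imbalanced-learn implementation. `random_oversample` is a hypothetical helper and the five toy rows are made up.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=100):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Sample with replacement to fill the gap to the majority count
        extra = rng.choices(pool, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


rows = [[1.0], [2.0], [3.0], [4.0], [250.0]]
labels = ["No", "No", "No", "No", "Yes"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # Counter({'No': 4, 'Yes': 4})
```

As in the cell above, only the training split is resampled; the validation and unseen data keep their original class mix.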

## Decision Tree

Fit a Decision Tree on the balanced data.

In [34]:
dt_fraud_up = DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=100)
dt_fraud_up.fit(X_train_up, y_train_up)

DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=100)

In [35]:
y_pred_train_up = dt_fraud_up.predict(X_train_up)
print("Recall :", recall_score(y_true=y_train_up, y_pred=y_pred_train_up, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_train_up, y_pred=y_pred_train_up, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_train_up, y_pred=y_pred_train_up))

Recall : 0.960697330757953
Precision : 0.9359531594067085
Accuracy : 0.9474786289978513


In [36]:
y_pred_dt_test_up_unseen = dt_fraud_up.predict(X_unseen_dummy)

print("Recall :", recall_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen))

Recall : 0.9505827505827505
Precision : 0.9374712643678161
Accuracy : 0.9435897435897436


The Decision Tree trained on balanced data performs better, with recall of about 95% on the unseen data.

## Random Forest


Fit a Random Forest on the balanced data.

In [37]:
rf_fraud_up = RandomForestClassifier(random_state=100, oob_score=True)
rf_fraud_up.fit(X_train_up, y_train_up)

RandomForestClassifier(oob_score=True, random_state=100)

In [38]:
y_pred_train_up_rf = rf_fraud_up.predict(X_train_up)
print("Recall :", recall_score(y_true=y_train_up, y_pred=y_pred_train_up_rf, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_train_up, y_pred=y_pred_train_up_rf, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_train_up, y_pred=y_pred_train_up_rf))

Recall : 1.0
Precision : 1.0
Accuracy : 1.0


In [39]:
# test performance

y_pred_rf_test_up = rf_fraud_up.predict(X_test_dummy)

print("Recall :", recall_score(y_true=y_test, y_pred=y_pred_rf_test_up, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_test, y_pred=y_pred_rf_test_up, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_test, y_pred=y_pred_rf_test_up))

Recall : 0.719047619047619
Precision : 0.8181114551083591
Accuracy : 0.997501301405518


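In line with the Decision Tree result, the Random Forest improves with balanced data, and it fits the train data perfectly. On held-out data, however, the Decision Tree performs better: recall, the preferred metric in this case, only reaches 71.90% for the Random Forest. Note that the Decision Tree was scored on the unseen data (`X_unseen_dummy`) while the Random Forest was scored on the validation split (`X_test_dummy`), so the two figures are not directly comparable.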

## Random Forest with Best Parameters 

To optimize the model further, a search for the best hyperparameters is carried out using randomized search with cross-validation (`RandomizedSearchCV`).

In [40]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# hyperparameter distributions
param_random_rf = {
    'criterion' : ['gini', 'entropy'],
    'max_depth': range(10, 50),
    'min_samples_split': range(2, 50),
    'min_samples_leaf': range(2, 50),
    'max_features': ['sqrt', 'log2']
    }

rfc = RandomForestClassifier(random_state=123, oob_score=True)

model_rf_random_up = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_random_rf,
    n_iter=4,
    cv=3,
    verbose=3,
    random_state=123)

model_rf_random_up.fit(X_train_up, y_train_up)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END criterion=entropy, max_depth=24, max_features=sqrt, min_samples_leaf=12, min_samples_split=32;, score=0.997 total time= 8.3min
[CV 2/3] END criterion=entropy, max_depth=24, max_features=sqrt, min_samples_leaf=12, min_samples_split=32;, score=0.997 total time= 7.9min
[CV 3/3] END criterion=entropy, max_depth=24, max_features=sqrt, min_samples_leaf=12, min_samples_split=32;, score=0.997 total time= 7.6min
[CV 1/3] END criterion=entropy, max_depth=30, max_features=sqrt, min_samples_leaf=30, min_samples_split=47;, score=0.996 total time= 6.8min
[CV 2/3] END criterion=entropy, max_depth=30, max_features=sqrt, min_samples_leaf=30, min_samples_split=47;, score=0.996 total time= 7.3min
[CV 3/3] END criterion=entropy, max_depth=30, max_features=sqrt, min_samples_leaf=30, min_samples_split=47;, score=0.996 total time= 6.8min
[CV 1/3] END criterion=gini, max_depth=16, max_features=sqrt, min_samples_leaf=9, min_samples_split=

RandomizedSearchCV(cv=3,
                   estimator=RandomForestClassifier(oob_score=True,
                                                    random_state=123),
                   n_iter=4,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': range(10, 50),
                                        'max_features': ['sqrt', 'log2'],
                                        'min_samples_leaf': range(2, 50),
                                        'min_samples_split': range(2, 50)},
                   random_state=123, verbose=3)
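
To make the search concrete: a randomized search draws `n_iter` random combinations from the parameter distributions and cross-validates each one, so 4 candidates with 3 folds give the 12 fits reported in the log. The sketch below only illustrates the sampling step with the standard library. `sample_candidates` is a hypothetical helper, and it does not reproduce scikit-learn's own sampler or its exact draws.

```python
import random


def sample_candidates(param_distributions, n_iter, seed=123):
    """Draw n_iter random parameter settings, one value per parameter each."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(list(values)) for name, values in param_distributions.items()}
        for _ in range(n_iter)
    ]


# Same search space as the Random Forest search above
param_random_rf = {
    'criterion': ['gini', 'entropy'],
    'max_depth': range(10, 50),
    'min_samples_split': range(2, 50),
    'min_samples_leaf': range(2, 50),
    'max_features': ['sqrt', 'log2'],
}

candidates = sample_candidates(param_random_rf, n_iter=4)
cv = 3
print(len(candidates) * cv)  # 12
```

With only 4 draws from a space of this size, the search covers a very small part of the grid, so a larger `n_iter` may find better settings at the cost of longer run time.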

In [41]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

y_pred_rf_test_up_unseen = model_rf_random_up.predict(X_unseen_dummy)

print("Recall :", recall_score(y_true=y_unseen, y_pred=y_pred_rf_test_up_unseen, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_unseen, y_pred=y_pred_rf_test_up_unseen, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_unseen, y_pred=y_pred_rf_test_up_unseen))

Recall : 0.8321678321678322
Precision : 0.9988808058198098
Accuracy : 0.9156177156177157


The results are fairly good, but a higher recall is still needed.
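
Because recall on the fraud class is the metric of interest, the search itself can be scored on recall rather than on the default accuracy. With string labels such as `'Yes'`/`'No'`, the plain string `scoring='recall'` is not enough: `recall_score` defaults to `pos_label=1`, which does not exist among the labels, so scoring fails or yields `nan`. The following is a minimal sketch of one way around this, using synthetic data; `X_demo`, `y_demo`, `recall_yes`, and the small parameter ranges are made up for illustration.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic data with string labels, standing in for the real training set
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = np.where(X_demo[:, 0] > 0.5, "Yes", "No")

# Tell the scorer which label is the positive (fraud) class
recall_yes = make_scorer(recall_score, pos_label="Yes")

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_distributions={"max_depth": range(2, 6), "min_samples_leaf": range(1, 5)},
    n_iter=4,
    scoring=recall_yes,
    cv=3,
    random_state=123,
)
search.fit(X_demo, y_demo)
print(search.best_score_)  # mean cross-validated recall of the best candidate
```

An alternative with the same effect is to keep the target as 0/1 integers, in which case the string `scoring='recall'` works as is.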

## Decision Tree with Best Parameters

In [42]:
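# hyperparameter distributions
# NOTE: the encoded training data has 21 columns, so integer max_features
# values above 21 are invalid and make the corresponding fits fail
param_random_dt = {
    'max_depth': range(10, 100),
    'min_samples_split': range(2, 100),
    'min_samples_leaf': range(2, 100),
    'max_features': list(range(20, 30)) + ['sqrt', 'log2', None]
    }

# NOTE: scoring='recall' assumes the positive label is 1; with 'Yes'/'No'
# labels the scorer raises and every fold is scored as nan (see the output)
model_dt_random_up = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_distributions=param_random_dt,
    n_iter=4,
    scoring='recall',
    cv=3,
    verbose=3,
    random_state=123)

model_dt_random_up.fit(X_train_up, y_train_up)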

Fitting 3 folds for each of 4 candidates, totalling 12 fits


Traceback (most recent call last):
  File "C:\Users\asus\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\asus\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\asus\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\asus\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\asus\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, la

[CV 1/3] END max_depth=49, max_features=sqrt, min_samples_leaf=29, min_samples_split=22;, score=nan total time=  13.4s


[CV 2/3] END max_depth=49, max_features=sqrt, min_samples_leaf=29, min_samples_split=22;, score=nan total time=  13.5s


[CV 3/3] END max_depth=49, max_features=sqrt, min_samples_leaf=29, min_samples_split=22;, score=nan total time=  13.8s
[CV 1/3] END max_depth=79, max_features=25, min_samples_leaf=39, min_samples_split=45;, score=nan total time=   3.8s
[CV 2/3] END max_depth=79, max_features=25, min_samples_leaf=39, min_samples_split=45;, score=nan total time=   3.5s
[CV 3/3] END max_depth=79, max_features=25, min_samples_leaf=39, min_samples_split=45;, score=nan total time=   3.7s


[CV 1/3] END max_depth=43, max_features=sqrt, min_samples_leaf=65, min_samples_split=6;, score=nan total time=  13.9s


[CV 2/3] END max_depth=43, max_features=sqrt, min_samples_leaf=65, min_samples_split=6;, score=nan total time=  14.3s


[CV 3/3] END max_depth=43, max_features=sqrt, min_samples_leaf=65, min_samples_split=6;, score=nan total time=  11.9s



[CV 1/3] END max_depth=19, max_features=None, min_samples_leaf=23, min_samples_split=80;, score=nan total time=  25.4s



[CV 2/3] END max_depth=19, max_features=None, min_samples_leaf=23, min_samples_split=80;, score=nan total time=  29.1s



[CV 3/3] END max_depth=19, max_features=None, min_samples_leaf=23, min_samples_split=80;, score=nan total time=  31.0s


RandomizedSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=123),
                   n_iter=4,
                   param_distributions={'max_depth': range(10, 100),
                                        'max_features': [20, 21, 22, 23, 24, 25,
                                                         26, 27, 28, 29, 'sqrt',
                                                         'log2', None],
                                        'min_samples_leaf': range(2, 100),
                                        'min_samples_split': range(2, 100)},
                   random_state=123, scoring='recall', verbose=3)
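
Every candidate in the search log above ends with `score=nan`. The tracebacks are truncated, so the exact error message is not visible. A likely cause, assuming the target holds the strings `'Yes'` and `'No'` (as the `pos_label='Yes'` argument in the evaluation cell suggests), is that `scoring='recall'` uses the default `pos_label=1`, which is not one of the labels. The sketch below shows one possible way around this: building the scorer with `make_scorer` and an explicit positive label. It runs on toy data, and the names `X_toy`, `y_toy`, and `recall_yes` are hypothetical, not the notebook's own variables.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data (hypothetical): 200 rows, 5 features, string labels.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 5))
y_toy = np.where(X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0, "Yes", "No")

# Tell the recall scorer which string label is the positive (fraud) class.
recall_yes = make_scorer(recall_score, pos_label="Yes")

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_distributions={
        "max_depth": list(range(2, 10)),
        "min_samples_leaf": list(range(2, 20)),
        "min_samples_split": list(range(2, 20)),
    },
    n_iter=4,
    cv=3,
    scoring=recall_yes,
    random_state=123,
)
search.fit(X_toy, y_toy)

# With an explicit pos_label, each candidate gets a numeric recall score.
cv_scores = search.cv_results_["mean_test_score"]
print("Mean CV recall per candidate:", cv_scores)
print("Best parameters:", search.best_params_)
```

If the scores are `nan`, the search cannot rank the candidates, so the selected parameters would not reflect recall at all. Checking `cv_results_["mean_test_score"]` after fitting is a quick way to confirm that scoring worked.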

In [43]:
y_pred_dt_test_up_unseen = model_dt_random_up.predict(X_unseen_dummy)

print("Recall :", recall_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen, pos_label='Yes'))
print("Precision :", precision_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen, pos_label='Yes'))
print("Accuracy :", accuracy_score(y_true=y_unseen, y_pred=y_pred_dt_test_up_unseen))

Recall : 0.8522144522144522
Precision : 0.9827956989247312
Accuracy : 0.9186480186480186


The decision tree with tuned parameters improves precision. However, the priority in this case is to avoid false negatives, that is, fraudulent transactions that are predicted as not fraudulent, so the metric that matters most is recall. Comparing the evaluation values, especially recall, the decision tree trained on balanced data comes out as the best classifier. This best model is saved under the name `dt_fraud_up`.
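
To make the recall versus precision trade-off concrete, the sketch below computes both metrics from confusion-matrix counts. The counts are hypothetical, chosen only for illustration, and are not taken from this project's results.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual fraud cases that the model catches."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of fraud predictions that are actually fraud."""
    return tp / (tp + fp)


# Hypothetical counts: 100 real frauds, 85 caught, 15 missed, 2 false alarms.
tp, fn, fp = 85, 15, 2

rec = recall(tp, fn)
prec = precision(tp, fp)
print(f"Recall: {rec:.3f}")
print(f"Precision: {prec:.3f}")
```

In this illustration precision is high because there are few false alarms, yet 15 of the 100 frauds still go undetected. Those missed cases are what recall measures, which is why it is the deciding metric here.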

In [44]:
import joblib
joblib.dump(dt_fraud_up,'dt_fraud_up')

['dt_fraud_up']

In [45]:
joblib.dump(model_rf_random_up,'model_rf_random_up')

['model_rf_random_up']
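
A model saved with `joblib.dump` can later be restored with `joblib.load`, for example inside a web application that reports the probability of fraud. The sketch below shows the round trip on a toy model. The data, the file name `toy_model.joblib`, and the variable names are hypothetical, and it assumes the positive class is labelled `'Yes'`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data (hypothetical): 100 rows, 3 features, string labels.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(100, 3))
y_toy = np.where(X_toy[:, 0] > 0, "Yes", "No")

model = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_toy, y_toy)

# Save the fitted model to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# predict_proba returns one column per class, ordered as in classes_.
yes_col = list(loaded.classes_).index("Yes")
fraud_proba = loaded.predict_proba(X_toy[:5])[:, yes_col]
print("Probability of fraud for the first 5 rows:", fraud_proba)
```

Looking up the column through `classes_` avoids relying on the order of the labels. Loading a saved model generally works best with the same scikit-learn version that was used to save it.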