# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Training Pipeline Notebook</div></center></span>

I separated the machine learning training pipeline from the EDA for a cleaner approach.

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Import Necessary Libraries</div></center></span>

In [3]:
# Exploration purpose
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML Model related
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, f1_score, roc_auc_score, RocCurveDisplay
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from lazypredict.Supervised import LazyClassifier
import xgboost as xgb
from joblib import dump
from joblib import load

# Data balancing
from collections import Counter
from imblearn.over_sampling import SMOTE

# Other
import joblib
import pickle
import warnings
warnings.filterwarnings('ignore')

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Load Dataset</div></center></span>

In [4]:
dataset_dir = '../dataset/cleaned-fraud-payments.csv'
payment_fraud_df = pd.read_csv(dataset_dir)

In [5]:
payment_fraud_df

Unnamed: 0,step,type,amount,isFraud,isFlaggedFraud,diffOrig,diffDest
0,1,PAYMENT,9839.64,0,0,9839.64,0.00
1,1,PAYMENT,1864.28,0,0,1864.28,0.00
2,1,TRANSFER,181.00,1,0,181.00,0.00
3,1,CASH_OUT,181.00,1,0,181.00,21182.00
4,1,PAYMENT,11668.14,0,0,11668.14,0.00
...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,1,0,339682.13,-339682.13
6362616,743,TRANSFER,6311409.28,1,0,6311409.28,0.00
6362617,743,CASH_OUT,6311409.28,1,0,6311409.28,-6311409.27
6362618,743,TRANSFER,850002.52,1,0,850002.52,0.00


# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Data Preprocessing</div></center></span>

## Train Test Split

In [6]:
X = payment_fraud_df.drop(columns='isFraud')
y = payment_fraud_df['isFraud']

In [7]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
# Check the number of samples in each split
print(f'Total # of sample in whole dataset: {len(X)}')
print('=========================================')
print(f'Total # of sample in train dataset: {len(X_train)}')
print('=========================================')
print(f'Total # of sample in test dataset: {len(X_test)}')


Total # of sample in whole dataset: 6362620
Total # of sample in train dataset: 4453834
Total # of sample in test dataset: 1908786


## Label Encoding
LabelEncoder is used to convert the categorical column (`type`) into numeric codes.

In [9]:
encoder = LabelEncoder()

## Scaling with Robust Scaler

From our exploration phase, we concluded that the dataset is heavily affected by outliers, which is why we use RobustScaler to scale it. RobustScaler is a preprocessing technique for scaling numeric features. It centers each feature on its median and scales it by the interquartile range (IQR) rather than by the mean and standard deviation, which makes it robust to outliers.

In [10]:
scaler = RobustScaler()
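
To make the median/IQR description concrete, here is a minimal sketch on hypothetical toy amounts (not taken from the dataset). It compares `RobustScaler` with a manual `(x - median) / IQR` computation; the names `amounts`, `scaler_demo` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy amounts with one extreme outlier
amounts = np.array([[100.0], [200.0], [300.0], [400.0], [1_000_000.0]])

scaler_demo = RobustScaler()  # default quantile_range is (25.0, 75.0)
scaled = scaler_demo.fit_transform(amounts)

# Manual computation: (x - median) / (Q3 - Q1)
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier still ends up with a large scaled value, but it does not distort the center or spread used to scale the other points, which is the property we rely on here.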

## Column Transformer

In [23]:
# Define numeric and categorical features
numerical_features = payment_fraud_df.select_dtypes(exclude=object).columns.tolist()
categorical_features = payment_fraud_df['type'].name

In [24]:
categorical_features

'type'

In [12]:
# Remove label from feature
numerical_features.remove('isFraud')

In [19]:
# Define column transformer for pipeline use later
preprocessor = ColumnTransformer([
    ('encoder', encoder, categorical_features)
])

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Model Building</div></center></span>

## Base Model Testing

Before building the pipeline, selecting a model, or tuning hyperparameters, we first compare a range of ML algorithms to choose the best base for our model.

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

100%|██████████| 29/29 [4:29:14<00:00, 557.05s/it]   


In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,1.0,0.89,0.89,1.0,101.68
DecisionTreeClassifier,1.0,0.89,0.89,1.0,57.05
BaggingClassifier,1.0,0.88,0.88,1.0,366.75
RandomForestClassifier,1.0,0.87,0.87,1.0,1235.37
ExtraTreesClassifier,1.0,0.85,0.85,1.0,312.81
KNeighborsClassifier,1.0,0.84,0.84,1.0,1221.42
ExtraTreeClassifier,1.0,0.84,0.84,1.0,6.41
PassiveAggressiveClassifier,1.0,0.82,0.82,1.0,8.65
GaussianNB,0.64,0.81,0.81,0.78,5.93
LGBMClassifier,1.0,0.8,0.8,1.0,11.36


After testing the dataset against several base models, we found that our earlier hypothesis held: tree-based and bagging-based algorithms work well on heavily imbalanced data, and XGBoost performs strongly as usual. Next, we take the top algorithms from the table above and tune their hyperparameters to build our models. We are going to tune Decision Tree and XGBoost, the two best models.
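
The gap in the table between Accuracy (shown as 1.0) and Balanced Accuracy (0.8 to 0.89) is typical of imbalanced data. The sketch below uses hypothetical labels, not the fraud dataset, to show how a model that never predicts the minority class can still score a high plain accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that never predicts fraud
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
# → accuracy=0.99, balanced accuracy=0.50
```

This is why we compare the base models on Balanced Accuracy and ROC AUC rather than on Accuracy alone.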


## Hyperparameter Tuning

In [18]:
# Temporarily fit the preprocessor for grid search
X_train_processed = preprocessor.fit_transform(X_train)

TypeError: fit_transform() takes 2 positional arguments but 3 were given
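
The `TypeError` above is consistent with how `LabelEncoder` is designed: its `fit_transform` accepts a single 1-D array of labels, while `ColumnTransformer` calls each transformer with both `X` and `y`. Below is a minimal sketch of one possible fix. It assumes that `OrdinalEncoder` (a swapped-in encoder, not the one used in this notebook) is acceptable for `type`, and that the numeric columns should go through `RobustScaler`. The toy frame and the `toy_*` names are hypothetical and only mimic the dataset's columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Hypothetical toy frame with the same column names as the dataset
toy_X = pd.DataFrame({
    "step": [1, 1, 2, 3],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 181.00, 181.00, 11668.14],
    "diffOrig": [9839.64, 181.00, 181.00, 11668.14],
    "diffDest": [0.00, 0.00, 21182.00, 0.00],
})
toy_y = pd.Series([0, 1, 1, 0])

categorical_cols = ["type"]  # a list, so the encoder receives a 2-D input
numerical_cols = ["step", "amount", "diffOrig", "diffDest"]

toy_preprocessor = ColumnTransformer([
    ("encoder", OrdinalEncoder(), categorical_cols),
    ("scaler", RobustScaler(), numerical_cols),
])

# Passing y is fine here: both transformers accept (X, y)
toy_processed = toy_preprocessor.fit_transform(toy_X, toy_y)
print(toy_processed.shape)
```

Note that `ColumnTransformer` uses `remainder='drop'` by default, so any column not listed in a transformer is dropped from the output. `OneHotEncoder` would be another option if an ordinal code for `type` is not desirable.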

### Decision Tree

In [None]:
param_grid_dt = {"criterion":["gini", "entropy"],
                 "splitter":["best", "random"],
                 "max_depth": [None, 3, 6, 9, 12],
                 "max_features":[None, 3, 5, 7],
                 "min_samples_leaf": [2, 3, 4],
                 "min_samples_split": [2, 3, 5, 7, 9, 12]}

In [None]:
dt_base = DecisionTreeClassifier(class_weight="balanced", random_state=42)
dt_grid = GridSearchCV(estimator=dt_base,
                       param_grid=param_grid_dt,
                       scoring='f1',
                       n_jobs=-1,
                       cv=10,
                       verbose=1)

### XGBoost

In [None]:
param_grid_xgb = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.5, 0.7, 0.9],
    'colsample_bytree': [0.5, 0.7, 0.9],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0, 0.1, 0.5, 1],
    'gamma': [0, 0.1, 0.5, 1]
}


In [None]:
xgb_base = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc', random_state=42)
xgb_grid = GridSearchCV(estimator=xgb_base,
                        param_grid=param_grid_xgb,
                        scoring='f1',
                        n_jobs=-1,
                        cv=10,
                        verbose=1)
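
Before launching either search, it helps to estimate how many fits the grids imply. The sketch below repeats the two grids from this section and multiplies the number of values per parameter by the number of CV folds. `count_fits` is a hypothetical helper written for this illustration.

```python
from math import prod

param_grid_dt = {"criterion": ["gini", "entropy"],
                 "splitter": ["best", "random"],
                 "max_depth": [None, 3, 6, 9, 12],
                 "max_features": [None, 3, 5, 7],
                 "min_samples_leaf": [2, 3, 4],
                 "min_samples_split": [2, 3, 5, 7, 9, 12]}

param_grid_xgb = {"learning_rate": [0.1, 0.01, 0.001],
                  "n_estimators": [100, 500, 1000],
                  "max_depth": [3, 5, 7, 9],
                  "subsample": [0.5, 0.7, 0.9],
                  "colsample_bytree": [0.5, 0.7, 0.9],
                  "reg_alpha": [0, 0.1, 0.5, 1],
                  "reg_lambda": [0, 0.1, 0.5, 1],
                  "gamma": [0, 0.1, 0.5, 1]}

def count_fits(param_grid, cv):
    """Return (candidates, total fits) for an exhaustive grid search, excluding the final refit."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

print(count_fits(param_grid_dt, cv=10))   # → (1440, 14400)
print(count_fits(param_grid_xgb, cv=10))  # → (20736, 207360)
```

With roughly 4.4 million training rows, the XGBoost grid in particular is likely too expensive to search exhaustively. Sampling a fixed number of candidates with `RandomizedSearchCV`, or trimming the grid, may be a more practical starting point.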

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Building Training Pipeline</div></center></span>

In [None]:
# Define list of models
classifier = [('decision_tree', dt_grid), ('xgboost', xgb_grid)]
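
One way a list like the one above might be consumed is to wrap each entry in a `Pipeline` behind a preprocessor and fit them in a loop. This is only a sketch under stated assumptions: the toy data, the `toy_*` names, and the use of `OrdinalEncoder` are hypothetical, and a plain untuned `DecisionTreeClassifier` stands in for the grid-search objects to keep the example fast.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data mimicking two of the dataset's columns
toy_X = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.00, 181.00, 11668.14, 850002.52, 339682.13],
})
toy_y = pd.Series([0, 1, 1, 0, 1, 1])

toy_preprocessor = ColumnTransformer([
    ("encoder", OrdinalEncoder(), ["type"]),
    ("scaler", RobustScaler(), ["amount"]),
])

# Stand-in for the (name, estimator) list defined above
toy_classifiers = [
    ("decision_tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
]

fitted_pipelines = {}
for name, model in toy_classifiers:
    # clone() gives each pipeline its own unfitted copy of the preprocessor
    pipe = Pipeline([("preprocessor", clone(toy_preprocessor)), (name, model)])
    fitted_pipelines[name] = pipe.fit(toy_X, toy_y)

preds = fitted_pipelines["decision_tree"].predict(toy_X)
print(preds)
```

Keeping the preprocessor inside the pipeline means it is refit on each training fold when the pipeline is cross-validated, which avoids leaking statistics from the validation fold.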