# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Training Pipeline Notebook</div></center></span>

I separated the machine learning training pipeline from the EDA for a cleaner workflow.

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Import Necessary Libraries</div></center></span>

In [39]:
# Exploration purpose
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML Model related
from sklearn.preprocessing import LabelEncoder, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, f1_score, roc_auc_score, RocCurveDisplay
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from lazypredict.Supervised import LazyClassifier
from joblib import dump, load

# Data balancing
from collections import Counter
from imblearn.over_sampling import SMOTE

# Other
import joblib
import pickle
import warnings
warnings.filterwarnings('ignore')

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Load Dataset</div></center></span>

In [12]:
dataset_dir = '../dataset/cleaned-fraud-payments.csv'
payment_fraud_df = pd.read_csv(dataset_dir)

In [13]:
payment_fraud_df

| Unnamed: 0 | step | type | amount | isFraud | isFlaggedFraud | diffOrig | diffDest |
|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | 0 | 0 | 9839.64 | 0.00 |
| 1 | 1 | PAYMENT | 1864.28 | 0 | 0 | 1864.28 | 0.00 |
| 2 | 1 | TRANSFER | 181.00 | 1 | 0 | 181.00 | 0.00 |
| 3 | 1 | CASH_OUT | 181.00 | 1 | 0 | 181.00 | 21182.00 |
| 4 | 1 | PAYMENT | 11668.14 | 0 | 0 | 11668.14 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6362615 | 743 | CASH_OUT | 339682.13 | 1 | 0 | 339682.13 | -339682.13 |
| 6362616 | 743 | TRANSFER | 6311409.28 | 1 | 0 | 6311409.28 | 0.00 |
| 6362617 | 743 | CASH_OUT | 6311409.28 | 1 | 0 | 6311409.28 | -6311409.27 |
| 6362618 | 743 | TRANSFER | 850002.52 | 1 | 0 | 850002.52 | 0.00 |


# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Data Preprocessing</div></center></span>

## Train Test Split

In [14]:
X = payment_fraud_df.drop(columns='isFraud')
y = payment_fraud_df['isFraud']

In [18]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [20]:
# Check the number of samples in each split
print(f'Total # of sample in whole dataset: {len(X)}')
print('=========================================')
print(f'Total # of sample in train dataset: {len(X_train)}')
print('=========================================')
print(f'Total # of sample in test dataset: {len(X_test)}')


Total # of sample in whole dataset: 6362620
Total # of sample in train dataset: 4453834
Total # of sample in test dataset: 1908786
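Because fraud cases are a small minority, a plain random split can leave the two sets with slightly different fraud rates. The sketch below shows how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits. It runs on synthetic data; the 1% fraud rate and the `*_demo` names are assumptions made for illustration, not values taken from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives (assumed ratio)
rng = np.random.default_rng(42)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo preserves the class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Fraud rate in train: {train_rate:.4f}")
print(f"Fraud rate in test:  {test_rate:.4f}")
```

Applied to this notebook, the same idea would amount to passing `stratify=y` to the existing `train_test_split` call.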


## Label Encoding
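A label encoder turns a categorical column into numeric codes. Note that scikit-learn's `LabelEncoder` is designed for the target `y` and cannot be used as a step inside a `ColumnTransformer`, so its feature-side equivalent, `OrdinalEncoder`, is used for the categorical column here.

In [21]:
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder maps each category to an integer code and accepts 2-D feature input
encoder = OrdinalEncoder()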

## Scaling with Robust Scaler

During the exploration phase, we found that the dataset contains many outliers, which is why we scale it with `RobustScaler`. `RobustScaler` centers each numeric feature on its median and scales it by the interquartile range (IQR) rather than by the mean and standard deviation, which makes it robust to outliers.

In [24]:
scaler = RobustScaler()
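To make the median/IQR scaling concrete, here is a small sketch that computes `(x - median) / IQR` by hand and compares it with `RobustScaler` using its default settings. The `amounts` values are made up for illustration, with one deliberately extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (illustrative numbers, not from the dataset)
amounts = np.array([[10.0], [20.0], [30.0], [40.0], [1_000_000.0]])

# Manual computation: subtract the median, divide by the interquartile range
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

# RobustScaler with default settings (centering on, quantile_range=(25, 75))
scaled = RobustScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The outlier still maps to a large value, but it does not distort the scale of the other points the way a mean and standard deviation would.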

## Column Transformer

In [33]:
# Define numeric and categorical features from X so the target (isFraud) is not included
numerical_features = X.select_dtypes(exclude=object).columns.tolist()
categorical_features = X.select_dtypes(include=object).columns.tolist()

In [35]:
# Define column transformer for pipeline use later
preprocessor = ColumnTransformer([
    ('encoder', encoder, categorical_features),
    ('scaler', scaler, numerical_features)
])
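As a quick sanity check of how a `ColumnTransformer` routes columns, the sketch below fits one on a four-row toy frame. It uses `OrdinalEncoder` rather than `LabelEncoder` for the categorical column, since `LabelEncoder` only accepts a 1-D target and raises an error inside a `ColumnTransformer`. The `toy` frame, `cat_cols`, `num_cols`, and `toy_preprocessor` are hypothetical names introduced for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Tiny stand-in frame with one categorical and two numeric features
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 181.00, 181.00, 11668.14],
    "diffOrig": [9839.64, 181.00, 181.00, 11668.14],
})

cat_cols = ["type"]
num_cols = ["amount", "diffOrig"]

# Each transformer only sees the columns it is assigned
toy_preprocessor = ColumnTransformer([
    ("encoder", OrdinalEncoder(), cat_cols),
    ("scaler", RobustScaler(), num_cols),
])

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
print(transformed[:, 0])  # integer codes for the 'type' column
```

The output columns follow the order of the transformers, so the encoded `type` comes first, followed by the scaled numeric features.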

# <span><center><div style="font-family: Trebuchet MS; background-color: #1e81b0; color: #eeeee4; padding: 12px; line-height: 1;">Model Building</div></center></span>

Before building the pipeline, selecting a model, or tuning hyperparameters, we first compare a broad set of ML algorithms to pick the best base for our model.

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
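Fitting many models on several million rows can take a long time, so one option is to run the comparison on a stratified subsample first. The sketch below shows one way to draw such a subsample with `train_test_split`. The helper `stratified_subsample`, the synthetic data, and the sample sizes are all assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_subsample(X, y, n_samples, seed=42):
    """Return n_samples rows of (X, y) with the class ratio of y preserved."""
    X_small, _, y_small, _ = train_test_split(
        X, y, train_size=n_samples, random_state=seed, stratify=y
    )
    return X_small, y_small


# Synthetic stand-in data with roughly 5% positives (assumed ratio)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"amount": rng.exponential(1000.0, size=20_000)})
y_demo = pd.Series((rng.random(20_000) < 0.05).astype(int), name="isFraud")

X_small, y_small = stratified_subsample(X_demo, y_demo, n_samples=2_000)

print(len(X_small))
print(f"Positive rate, full data: {y_demo.mean():.4f}")
print(f"Positive rate, subsample: {y_small.mean():.4f}")
```

In this notebook, the same helper could be applied to `X_train` and `y_train` before calling `clf.fit`, keeping the full training set for the final model.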