## Project goal

This notebook explores several ways of detecting whether a transaction is fraudulent. The goal is to build a machine learning model that detects fraud accurately while minimizing false positives. To achieve that, it is crucial to handle class imbalance, which is common in fraud detection problems.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import xgboost as xgb

from scipy.stats import randint, uniform

from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
    StratifiedKFold,
)
from sklearn.metrics import (
    f1_score,
    recall_score,
    precision_score,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

import shap

from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz



Data was sourced from Kaggle: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data

This creditcard.csv is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/


In [3]:
credit_card_scaled = pd.read_parquet("input/credit_card_scaled.parquet")

Why do we create a subsample?

At the beginning of this notebook we saw that the original dataframe is heavily imbalanced. Using it as-is causes the following issues:

- Bias toward the majority class: the classification models will assume that in most cases there is no fraud. What we want is a model that is confident when a fraud does occur.
- Misleading correlations: although we don't know what the "V" features stand for, it is useful to understand how each of these features influences the result (fraud or no fraud). With an imbalanced dataframe we cannot see the true correlations between the class and the features.

Subsampling a training set, either by undersampling the majority class or oversampling the minority class, can be a helpful approach for classification data where one or more classes occur very infrequently. Without such compensation, most models fit the majority class almost exclusively: they produce very good statistics for the frequently occurring class while performing poorly on the minority classes.
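To make the last point concrete, here is a minimal sketch of a classifier that always predicts the majority class. The labels are synthetic (an assumed positive rate of about 0.17%, comparable to this dataset's imbalance), and the names `X_demo`, `y_demo`, and `baseline` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly 0.17% positives (assumed for illustration);
# the features are irrelevant for a majority-class baseline.
rng = np.random.default_rng(11)
y_demo = (rng.random(100_000) < 0.0017).astype(int)
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class ("no fraud")
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred_demo = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, pred_demo)
baseline_recall = recall_score(y_demo, pred_demo, zero_division=0)
print(f"accuracy={baseline_accuracy:.4f}, fraud recall={baseline_recall:.4f}")
```

The accuracy is close to 1 even though no fraud is ever detected, which is why accuracy alone is a poor guide on data like this.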

The techniques considered here are random undersampling, SMOTE, and cost-sensitive learning.

Random Forest (RF) and decision trees can cope with imbalanced data to some extent, because their splits can carve out small regions dominated by the minority class. They usually still benefit from resampling or class weighting.
Ensemble methods such as AdaBoost, XGBoost, LightGBM, CatBoost, bagging, or stacking with base learners that handle imbalanced data well can be quite effective.

Anomaly detection is an alternative way of approaching data imbalance in machine learning, particularly when one class (the anomaly or rare event) is vastly outnumbered by the other class (the normal or majority class).
Common choices include K-Means clustering and Isolation Forest. Isolation Forest is a tree-based method that isolates anomalies by building an ensemble of randomly split trees. Anomalies are expected to require fewer splits to be isolated.
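As a rough illustration of the Isolation Forest idea, the sketch below scores synthetic two-dimensional points with scikit-learn's `IsolationForest`. The data, the cluster positions, and all variable names are assumptions made for the example; this is not part of the analysis that follows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score

# Synthetic data: 2000 "normal" points and 20 far-away "anomalies" (illustrative only)
rng = np.random.default_rng(11)
X_normal = rng.normal(0.0, 1.0, size=(2000, 2))
X_outlier = rng.normal(6.0, 1.0, size=(20, 2))
X_demo = np.vstack([X_normal, X_outlier])
y_demo = np.r_[np.zeros(2000, dtype=int), np.ones(20, dtype=int)]

# The forest is fit without labels; it only looks at how easily points are isolated
iso = IsolationForest(n_estimators=100, random_state=11).fit(X_demo)

# score_samples is higher for normal points, so negate it to get an anomaly score
anomaly_score = -iso.score_samples(X_demo)
iso_ap = average_precision_score(y_demo, anomaly_score)
print(f"average precision of the anomaly score: {iso_ap:.3f}")
```

The labels are used only to evaluate the ranking afterwards, which is the main practical difference from the supervised models used in the rest of the notebook.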

## Random undersampling

In [4]:
X = credit_card_scaled.drop(columns=["Class"])
y = credit_card_scaled["Class"]

# Train (80%) and Test (20%) from original data set before undersampling 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y
)


In [5]:
print(X_train.shape, X_test.shape)
print(y_train.value_counts())
print(y_train.value_counts()/y_train.count())
print(y_test.value_counts())
print(y_test.value_counts()/y_test.count())

(227845, 30) (56962, 30)
Class
0    227451
1       394
Name: count, dtype: int64
Class
0    0.998271
1    0.001729
Name: count, dtype: float64
Class
0    56864
1       98
Name: count, dtype: int64
Class
0    0.99828
1    0.00172
Name: count, dtype: float64


In [6]:
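# Note: these are drawn from the full dataset, so some rows of X_test can also end up
# in the undersampled training data (leakage when evaluating on X_test).
frauds = credit_card_scaled[credit_card_scaled.Class == 1]
no_frauds = credit_card_scaled[credit_card_scaled.Class == 0]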

In [7]:
undersample_df = pd.concat([frauds, no_frauds.sample(n=len(frauds), random_state=11)])
undersample_df["Class"].value_counts()

Class
1    492
0    492
Name: count, dtype: int64

In [None]:
# shuffling data
undersample_df = undersample_df.sample(frac=1, random_state=11)
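If the undersampled data is meant to be evaluated against a held-out test set, one option is to undersample only the training split so that no test row can appear in training. The following is a minimal sketch on a synthetic dataframe; `demo_df` and the other names are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card data (illustrative column names and rates)
rng = np.random.default_rng(11)
demo_df = pd.DataFrame({
    "V1": rng.normal(size=5000),
    "Amount": rng.normal(size=5000),
    "Class": (rng.random(5000) < 0.02).astype(int),
})

X_demo = demo_df.drop(columns=["Class"])
y_demo = demo_df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

# Undersample using the training rows only
train_df = X_tr.assign(Class=y_tr)
frauds_tr = train_df[train_df["Class"] == 1]
no_frauds_tr = train_df[train_df["Class"] == 0].sample(n=len(frauds_tr), random_state=11)
undersample_tr = pd.concat([frauds_tr, no_frauds_tr]).sample(frac=1, random_state=11)

# No index of the undersampled training set should appear in the test set
overlap = undersample_tr.index.intersection(X_te.index)
print(undersample_tr["Class"].value_counts(), len(overlap))
```

The trade-off is that fewer fraud cases are available for training, since the test-set frauds are held back.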

The "average_precision" scoring metric (a summary of the precision-recall curve, closely related to the area under it) is particularly well suited to fraud detection because:

- It focuses specifically on performance on the positive class (fraud).
- It evaluates the model across all classification thresholds.
- It is not influenced by the large number of true negatives that dominate highly imbalanced datasets.

This metric helps select model parameters that maximize the ability to detect fraud cases while minimizing false positives, which is exactly the goal of this project.
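Because average precision looks at all thresholds, it is normally computed from continuous scores such as predicted probabilities. scikit-learn's `scoring="average_precision"` does this internally, while passing hard 0/1 predictions to `average_precision_score` collapses the curve to a single operating point. The sketch below shows both calls on synthetic data; the logistic regression model and all names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (about 3% positives, illustrative only)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # continuous score for the positive class
labels = clf.predict(X_te)             # hard 0/1 predictions at the 0.5 threshold

ap_from_scores = average_precision_score(y_te, proba)
ap_from_labels = average_precision_score(y_te, labels)
print(f"AP from probabilities: {ap_from_scores:.3f}")
print(f"AP from hard labels:   {ap_from_labels:.3f}")
```

The two numbers generally differ, so values reported from hard labels are not directly comparable with the cross-validated "average_precision" scores from the search.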

In [None]:
def best_model_randomized_search_cv(
    n_iter, estimator, search_space, cv, n_jobs, scoring, X_train, y_train, X_test
):
    """Run a randomized hyperparameter search and return the refit best model
    together with its hard-label predictions on the train and test sets."""
    search = RandomizedSearchCV(
        estimator=estimator,
        param_distributions=search_space,
        n_iter=n_iter,
        cv=cv,
        scoring=scoring,
        n_jobs=n_jobs,
        verbose=2,
    )
    search.fit(X_train, y_train)
    # refit=True (the default) retrains the best configuration on all of X_train
    best_model = search.best_estimator_

    preds_train = best_model.predict(X_train)
    preds_test = best_model.predict(X_test)

    return best_model, preds_train, preds_test



In [None]:

skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
# X and y for undersample: 
X_us = undersample_df.drop(columns=["Class"]) 
y_us = undersample_df["Class"]

X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X_us, y_us, test_size=0.3, random_state= 11, stratify=y_us)

rf_search_space = {
    'n_estimators': range(10, 101),
    'criterion': ['gini', 'entropy'],
    'max_depth': range(2, 51),
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 11),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

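xgb_search_space = {
    "max_depth": randint(3, 10),  # integers 3..9 (upper bound is exclusive)
    "learning_rate": uniform(0.01, 0.1),  # uniform(loc, scale): range [0.01, 0.11]
    "n_estimators": randint(100, 1000),  # integers 100..999
    "subsample": uniform(0.5, 0.5),  # range [0.5, 1.0]
    "colsample_bytree": uniform(0.5, 0.5),  # range [0.5, 1.0]
}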
    
# We will be testing on original X_test, not undersampled X_test_us

In [None]:

best_model_rf_us, preds_train_rf_us, preds_test_rf_us = best_model_randomized_search_cv(n_iter=50, estimator=RandomForestClassifier(), search_space=rf_search_space,
                              cv=skf, n_jobs=-1, scoring="average_precision", X_train=X_train_us, y_train=y_train_us, X_test=X_test)


In [None]:
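print("Results for Random Forest model with undersampling on original test: ")
print(classification_report(y_test, preds_test_rf_us))
# Use a separate name so the matplotlib `plt` import is not overwritten
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, preds_test_rf_us))
disp.plot()
# Average precision ranks by score, so use predicted probabilities rather than hard labels
print(average_precision_score(y_train_us, best_model_rf_us.predict_proba(X_train_us)[:, 1]))
print(average_precision_score(y_test, best_model_rf_us.predict_proba(X_test)[:, 1]))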


In [None]:
best_model_xgb_us, preds_train_xgb_us, preds_test_xgb_us = best_model_randomized_search_cv(n_iter=50, estimator=xgb.XGBClassifier(), search_space=xgb_search_space,
                              cv=skf, n_jobs=-1, scoring="average_precision", X_train=X_train_us, y_train=y_train_us, X_test=X_test)

In [None]:
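print("Results for XGBoost model with undersampling on original test: ")
print(classification_report(y_test, preds_test_xgb_us))
# Use a separate name so the matplotlib `plt` import is not overwritten
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, preds_test_xgb_us))
disp.plot()
# Average precision ranks by score, so use predicted probabilities rather than hard labels
print(average_precision_score(y_train_us, best_model_xgb_us.predict_proba(X_train_us)[:, 1]))
print(average_precision_score(y_test, best_model_xgb_us.predict_proba(X_test)[:, 1]))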

In [None]:
# I will be using X, y from original data and only applying SMOTE on training data
xgb_search_space_smote = {
    "model__max_depth": randint(3, 10),
    "model__learning_rate": uniform(0.01, 0.1),
    "model__n_estimators": randint(100, 1000),
    "model__subsample": uniform(0.5, 0.5),
    "model__colsample_bytree": uniform(0.5, 0.5),
}
   
skf = StratifiedKFold(n_splits=5)
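# Note: `tree_method='gpu_hist'` and `gpu_id` are deprecated since XGBoost 2.0,
# where `tree_method='hist', device='cuda'` is the equivalent setting.
pipeline_xgb = Pipeline(
    [
        ("over", SMOTE()),
        ("model", xgb.XGBClassifier(tree_method="gpu_hist", gpu_id=0, n_jobs=1)),
    ]
)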


In [None]:
best_model_xgb_sm, preds_train_xgb_sm, preds_test_xgb_sm = best_model_randomized_search_cv(n_iter=20, estimator=pipeline_xgb, search_space=xgb_search_space_smote,
                              cv=skf, n_jobs=-1, scoring="average_precision", X_train=X_train, y_train=y_train, X_test=X_test)
# With n_jobs=-1 this search took about 14 minutes


In [None]:
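print("Results for XGBoost model with SMOTE on original test: ")
print(classification_report(y_test, preds_test_xgb_sm))
# Use a separate name so the matplotlib `plt` import is not overwritten
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, preds_test_xgb_sm))
disp.plot()
# Average precision ranks by score, so use predicted probabilities rather than hard labels
print(average_precision_score(y_train, best_model_xgb_sm.predict_proba(X_train)[:, 1]))
print(average_precision_score(y_test, best_model_xgb_sm.predict_proba(X_test)[:, 1]))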

In [None]:
for i in range(3):
    tree = best_model_rf_us.estimators_[i]
    dot_data = export_graphviz(
        tree,
        feature_names=X_train.columns,
        filled=True,
        max_depth=2,
        impurity=False,
        proportion=True,
    )
    graph = graphviz.Source(dot_data)
    display(graph)

## SMOTE

In [None]:
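# TODO: try RAPIDS for a GPU-accelerated random forest

skf = StratifiedKFold(n_splits=5)
sm = SMOTE()

# pipeline_rf = Pipeline([('over', SMOTE()), ('model', RandomForestClassifier())])
# The pipeline version is too heavy for my computer :(
# Note: resampling before the search means the validation folds also contain
# synthetic samples, so the cross-validated scores are optimistic.
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)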


rf_search_space = {
    'n_estimators': range(10, 101),
    'criterion': ['gini', 'entropy'],
    'max_depth': range(2, 51),
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 11),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}
# rf_search_space = {
#     'model__n_estimators': range(10, 101),
#     'model__criterion': ['gini', 'entropy'],
#     'model__max_depth': range(2, 51),
#     'model__min_samples_split': range(2, 11),
#     'model__min_samples_leaf': range(1, 11),
#     'model__max_features': ['sqrt', 'log2', None],
#     'model__bootstrap': [True, False]
# }


rf_model = RandomForestClassifier()

random_search_smote_rf = RandomizedSearchCV(
    # estimator=pipeline_rf,
    estimator=rf_model,
    param_distributions=rf_search_space,
    n_iter=10,
    cv=skf,
    scoring="f1",
    n_jobs=-1,
    verbose=3,
)

random_search_smote_rf.fit(X=X_train_smote, y=y_train_smote)


In [None]:
params_smote_rf = random_search_smote_rf.best_params_
best_score_train_smote = random_search_smote_rf.best_score_

best_rf_smote = RandomForestClassifier(**params_smote_rf)
best_rf_smote.fit(X_train_smote, y_train_smote)
pred_rf_smote = best_rf_smote.predict(X_test)

## Cost-sensitive learning
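Cost-sensitive learning keeps all the training rows and instead makes errors on the minority class more expensive. A minimal sketch, assuming scikit-learn's `class_weight="balanced"` option on a random forest and synthetic data; the variable names are illustrative, and this is only one possible way to set up this step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (about 3% positives, illustrative only)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare class count more during training
weighted_rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=11
).fit(X_tr, y_tr)
weighted_ap = average_precision_score(y_te, weighted_rf.predict_proba(X_te)[:, 1])

# For XGBoost the analogous knob is `scale_pos_weight`; a common starting
# value is the ratio of negative to positive training samples
neg, pos = np.bincount(y_tr)
pos_weight = neg / pos
print(f"AP with class weights: {weighted_ap:.3f}, negative/positive ratio: {pos_weight:.1f}")
```

Unlike undersampling or SMOTE, this changes only the loss weighting, so no rows are discarded and no synthetic samples are created.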