# Performance Objective:
The fraud incidence rate in this dataset is ~1%. For rare-event detection, accuracy is far less meaningful than precision and recall: a model that never flags fraud would already be about 99% accurate. For bank account fraud in particular, missing a fraud instance is far more financially costly than falsely flagging a non-fraud instance. Therefore, we weight recall significantly more than precision. Given this, we chose the F2 metric as the optimization goal for our models, reflecting real-world business objectives.
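
For reference, F2 is the beta = 2 case of the F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below is only an illustration of that formula on made-up inputs; the helper name `f_beta` is ours and is not part of any library.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# With beta = 2, a high-recall model scores better than a high-precision
# model that has the same two numbers swapped.
high_recall = f_beta(precision=0.2, recall=0.8)
high_precision = f_beta(precision=0.8, recall=0.2)
print(high_recall, high_precision)
```

Swapping precision and recall changes the score, which is the asymmetry we want when missed fraud is the costlier error.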

# Data Preprocessing:
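- Train/test split: first 6 months (months 0 to 5) for training, last 2 months (months 6 and 7) for testing.
- Drop the 'month' column.
- Drop the 'device_fraud_count' column, which is constant-valued and therefore useless for prediction.
- Replace the -1 missing-value sentinel with NaN in 'prev_address_months_count', 'current_address_months_count', 'bank_months_count', and 'session_length_in_minutes'.
- One-hot encode the categorical columns.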

In [None]:
import pandas as pd

df = pd.read_csv('Base.csv')

In [None]:
# data processing
mask = df["month"] <= 5
full_training_data = df[mask].sample(frac=1).reset_index(drop=True).drop('month',axis=1) # train on months 0 to 5. drop month as a feature
full_test_data = df[~mask].sample(frac=1).reset_index(drop=True).drop('month',axis=1) # test on months 6 and 7. drop month as a feature

# 'device_fraud_count' is literally a constant column. get rid of it.
full_training_data = full_training_data.drop('device_fraud_count',axis=1)
full_test_data = full_test_data.drop('device_fraud_count',axis=1)

print("Full training data # rows: " + str(full_training_data.shape[0]))
print("Full test data # rows: " + str(full_test_data.shape[0]))
print("Column list: " + str(full_training_data.columns))

Full training data # rows: 794989
Full test data # rows: 205011
Column list: Index(['fraud_bool', 'income', 'name_email_similarity',
       'prev_address_months_count', 'current_address_months_count',
       'customer_age', 'days_since_request', 'intended_balcon_amount',
       'payment_type', 'zip_count_4w', 'velocity_6h', 'velocity_24h',
       'velocity_4w', 'bank_branch_count_8w',
       'date_of_birth_distinct_emails_4w', 'employment_status',
       'credit_risk_score', 'email_is_free', 'housing_status',
       'phone_home_valid', 'phone_mobile_valid', 'bank_months_count',
       'has_other_cards', 'proposed_credit_limit', 'foreign_request', 'source',
       'session_length_in_minutes', 'device_os', 'keep_alive_session',
       'device_distinct_emails_8w'],
      dtype='object')


In [None]:
# more data processing
y_train_full = full_training_data["fraud_bool"]
X_train_full = full_training_data.drop("fraud_bool",axis=1)
y_test_full = full_test_data["fraud_bool"]
X_test_full = full_test_data.drop("fraud_bool",axis=1)

# make sure all numerical columns are actually stored numerically
y_train_full = y_train_full.astype(float)
y_test_full = y_test_full.astype(float)
for col in X_train_full.columns:
    converted = pd.to_numeric(X_train_full[col], errors='coerce')
    if converted.notna().sum() == X_train_full[col].notna().sum():
        X_train_full[col] = converted
for col in X_test_full.columns:
    converted = pd.to_numeric(X_test_full[col], errors='coerce')
    if converted.notna().sum() == X_test_full[col].notna().sum():
        X_test_full[col] = converted

categorical_cols = X_train_full.select_dtypes(exclude='number').columns.tolist()
print("Categorical columns: " + str(categorical_cols))

Categorical columns: ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']


In [None]:
import numpy as np
import pandas as pd

# Dealing with missing values
# the following columns have "-1" to indicate missing values: prev_address_months_count, current_address_months_count, bank_months_count, session_length_in_minutes

missing_cols = [
    "prev_address_months_count",
    "current_address_months_count",
    "bank_months_count",
    "session_length_in_minutes"
]

# Convert -1 to NaN in both train and test
X_train_full[missing_cols] = X_train_full[missing_cols].replace(-1, np.nan)
X_test_full[missing_cols] = X_test_full[missing_cols].replace(-1, np.nan)

In [None]:
# one hot encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

ohe = OneHotEncoder(
    drop="first",
    handle_unknown="ignore",
    sparse_output=False
)

preprocessor = ColumnTransformer(
    transformers=[("cat", ohe, categorical_cols)],
    remainder="passthrough"
)

preprocessor.set_output(transform="pandas")

X_train_full = preprocessor.fit_transform(X_train_full)
X_test_full  = preprocessor.transform(X_test_full)

# transform into numpy arrays
X_train_full = X_train_full.to_numpy()
X_test_full = X_test_full.to_numpy()

print(X_train_full.shape)
print(X_test_full.shape)

print(type(X_train_full))
print(type(X_test_full))

(794989, 45)
(205011, 45)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


# Methodology:
Our final model will have the following format:
```
import xgboost as xgb

params = {
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "min_child_weight": min_child_weight,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "scale_pos_weight": scale_pos_weight,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "tree_method": "hist",
    "verbosity": 0,
}

model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=best_iteration_int,
)

y_test_prob = model.predict(dtest)
y_test_pred = (y_test_prob >= best_threshold).astype(int)
```
We first tune the following hyperparameters using stratified 5-fold CV, with F2 as the metric to optimize:

- learning_rate/eta (range: log-uniform on [0.03, 0.2])
- max_depth (range: integer uniform on [3, 8])
- min_child_weight (range: uniform on [1, 10])
- subsample (range: uniform on [0.7, 1.0])
- colsample_bytree (range: uniform on [0.6, 1.0])
- scale_pos_weight (range: log-uniform on [10, 200])
- best_threshold (range: 25 evenly spaced values on [0, 1])

Note that best_iteration_int is tuned implicitly: we average the early stopping iterations of the best hyperparameter combination across all folds, then round the result to an integer.
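
A minimal sketch of that averaging step, using invented per-fold values (they are not results from this notebook). XGBoost reports `best_iteration` as a zero-based index, so adding 1 before using it as `num_boost_round` is a common adjustment; the sketch mirrors the notebook and uses the rounded mean directly.

```python
# Hypothetical early stopping iterations for one hyperparameter combination,
# one value per CV fold (illustrative numbers only).
fold_best_iterations = [520, 560, 531, 549, 541]

# Average across folds, then round to get an integer number of boosting rounds.
mean_best_iteration = sum(fold_best_iterations) / len(fold_best_iterations)
best_iteration_int = int(round(mean_best_iteration))

print(mean_best_iteration)
print(best_iteration_int)
```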

Because XGBoost does not provide F2 as a built-in evaluation metric, and F2 additionally depends on a decision threshold that we tune separately, we instead use PR AUC (`aucpr`) as the early stopping metric. It serves as a reasonable surrogate for F2 because it also rewards good precision and recall, summarized across all thresholds.
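
To make the surrogate argument concrete, here is a small pure-Python sketch on toy data (the labels and scores are invented, and both helper names are ours). `average_precision` is a step-wise PR AUC that needs no threshold, assuming no tied scores, whereas `f2_at_threshold` changes with the chosen cut-off. It is only meant to illustrate the distinction, not to reproduce XGBoost's internal `aucpr` computation.

```python
def average_precision(y_true, scores):
    """Step-wise PR AUC: mean of the precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

def f2_at_threshold(y_true, scores, threshold):
    """F2 of the hard predictions obtained by thresholding the scores."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

# Toy data: three positives among eight samples, already sorted by score.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print("PR AUC:", average_precision(y_true, scores))
for t in (0.3, 0.65, 0.85):
    print("threshold", t, "F2:", f2_at_threshold(y_true, scores, t))
```

The same scores yield one PR AUC value but a different F2 at each threshold, which is why the threshold is searched separately after training.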

Because the number of grid points grows exponentially with the number of hyperparameters, an exhaustive grid search is infeasible, so we use random search instead. We randomly sample 20 different (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight) hyperparameter combinations from their corresponding ranges, then pair each combination with each of the 25 threshold values on [0,1], producing a list of 20 × 25 = 500 (learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight, threshold) hyperparameter candidates. The threshold does not affect training, so each sampled combination is trained once per fold and all thresholds are scored from the same predicted probabilities.

Our final model will then be trained on the full training data with the hyperparameters that we have found from the CV.

In [None]:
import numpy as np

def sample_hyperparams(n_samples=20):

    combos = []

    for _ in range(n_samples):

        # learning_rate: logspace from 0.03 to 0.2 (base 10)
        lr_exp = np.random.uniform(np.log10(0.03), np.log10(0.2))
        learning_rate = 10 ** lr_exp

        # max_depth: integer uniform in [3,8]
        max_depth = np.random.randint(3, 9)

        # min_child_weight: uniform in [1,10]
        min_child_weight = np.random.uniform(1, 10)

        # subsample: uniform in [0.7,1.0]
        subsample = np.random.uniform(0.7, 1.0)

        # colsample_bytree: uniform in [0.6,1.0]
        colsample_bytree = np.random.uniform(0.6, 1.0)

        # scale_pos_weight: logspace from 10 to 200 (base 10)
        spw_exp = np.random.uniform(np.log10(10), np.log10(200))
        scale_pos_weight = 10 ** spw_exp

        combos.append((learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight))

    return combos

n_hypers = 20
hyperparameter_combinations = sample_hyperparams(n_hypers)
print(hyperparameter_combinations)

# threshold candidates: 25 values from 0 to 1
thresholds = np.linspace(0.0, 1.0, 25)

full_hyperparameter_combinations = [hyper + (threshold,) for hyper in hyperparameter_combinations for threshold in thresholds]

[(0.04370752583739313, 5, 8.790614226800376, 0.9923663120961845, 0.8436168227869654, 18.230416205212666), (0.03981751334779519, 3, 2.9573939461146663, 0.7885565412597443, 0.6145693355755574, 101.13212968988198), (0.1104677673822609, 6, 5.903898250808235, 0.7103998703434149, 0.836721774162202, 29.67350531839289), (0.14997910131165385, 6, 8.482588582187637, 0.73307060844429, 0.6216427912740435, 108.8590375047525), (0.0648764138658463, 5, 6.142196878076124, 0.7624978333041824, 0.7734773068432919, 56.59179326118033), (0.1405467118239003, 5, 5.913580631784271, 0.9444077551534324, 0.7005513034157849, 119.03127811045933), (0.16877813037694922, 4, 8.925293229164936, 0.7899119564484125, 0.7550660115789275, 73.55630382239742), (0.1769910266779769, 7, 6.193380236997042, 0.7279214279537065, 0.9345642246845189, 10.667846811683878), (0.050247437161997976, 5, 9.025677620610027, 0.977511594016963, 0.8633758674842724, 74.84121479891142), (0.03841768852534073, 4, 7.411261235149855, 0.8108707388774795, 0

In [None]:
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
from sklearn.metrics import average_precision_score
from sklearn.metrics import fbeta_score
from tqdm import tqdm

running_F2_scores = dict.fromkeys(full_hyperparameter_combinations, 0.0) # stores running total of F2 scores across folds to average later
running_best_iteration = dict.fromkeys(full_hyperparameter_combinations, 0.0) # stores running total of best iteration across folds to average later
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state = 23)
total_trains = n_splits * n_hypers
progress_bar = tqdm(total=total_trains, desc="models fitted", ncols=100)

for train_ids, val_ids in skf.split(X_train_full, y_train_full):
    X_train_inner, X_val_inner = X_train_full[train_ids], X_train_full[val_ids]
    y_train_inner, y_val_inner = y_train_full.iloc[train_ids], y_train_full.iloc[val_ids]  # positional indexing on the pandas Series
    dtrain = xgb.DMatrix(X_train_inner, label=y_train_inner)
    dval = xgb.DMatrix(X_val_inner, label=y_val_inner)

    for hyper in hyperparameter_combinations:
        learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight = hyper

        params = {
            "learning_rate": learning_rate,
            "max_depth": max_depth,
            "min_child_weight": min_child_weight,
            "subsample": subsample,
            "colsample_bytree": colsample_bytree,
            "scale_pos_weight": scale_pos_weight,
            "objective": "binary:logistic",
            "eval_metric": "aucpr",
            "tree_method": "hist",
        }
        model = xgb.train(
            params,
            dtrain,
            num_boost_round=1000,
            evals=[(dval, "val")],
            early_stopping_rounds=50,
            verbose_eval=False
        )
        progress_bar.update(1)

        best_iteration = model.best_iteration

        y_val_prob = model.predict(dval) # P(y=1)

        # Evaluate F2 for each threshold
        for t in thresholds:
            y_val_pred = (y_val_prob >= t).astype(int)
            f2 = fbeta_score(y_val_inner, y_val_pred, beta=2, zero_division=0)
            running_F2_scores[hyper + (t,)] += f2
            running_best_iteration[hyper + (t,)] += best_iteration

progress_bar.close()

# Average over folds
for full_hyper in full_hyperparameter_combinations:
    running_F2_scores[full_hyper] /= n_splits
    running_best_iteration[full_hyper] /= n_splits

# Select the best hyperparameter set by F2 score
best_hyper = max(running_F2_scores, key=running_F2_scores.get)
best_iteration = running_best_iteration[best_hyper]
best_f2 = running_F2_scores[best_hyper]

print("Best hyperparameters:", best_hyper)
print("Best iteration:", best_iteration)
print("Mean F2 across folds:", best_f2)

models fitted: 100%|████████████████████████████████████████████| 100/100 [1:18:47<00:00, 47.27s/it]

Best hyperparameters: (0.06171184631285087, 3, 9.400860129136118, 0.7741235778036006, 0.7017163522086355, 10.92551522008694, np.float64(0.375))
Best iteration: 540.2
Mean F2 across folds: 0.3076254782687504





In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import fbeta_score
import xgboost as xgb

# train the final model

learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight, best_threshold = (0.06171184631285087, 3, 9.400860129136118, 0.7741235778036006, 0.7017163522086355, 10.92551522008694, 0.375)
best_iteration_int = 540

dtrain = xgb.DMatrix(X_train_full, label=y_train_full)
dtest = xgb.DMatrix(X_test_full)

params = {
    "learning_rate": learning_rate,
    "max_depth": max_depth,
    "min_child_weight": min_child_weight,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "scale_pos_weight": scale_pos_weight,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "tree_method": "hist",
    "verbosity": 0,
}

model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=best_iteration_int,
)

y_test_prob = model.predict(dtest)
y_test_pred = (y_test_prob >= best_threshold).astype(int)

In [None]:
# F2 Score:
f2 = fbeta_score(y_test_full, y_test_pred, beta=2, zero_division=0)
print("Final model F2 Score: " + str(f2))

# Recall:
from sklearn.metrics import recall_score
recall = recall_score(y_test_full, y_test_pred)
print("Final model Recall: " + str(recall))

# Precision:
from sklearn.metrics import precision_score
precision = precision_score(y_test_full, y_test_pred)
print("Final model Precision: " + str(precision))

# Accuracy:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test_full, y_test_pred)
print("Final model Accuracy: " + str(accuracy))

# PR AUC:
from sklearn.metrics import average_precision_score
pr_auc = average_precision_score(y_test_full, y_test_prob)
print("Final model PR AUC: " + str(pr_auc))

# ROC AUC:
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test_full, y_test_prob)
print("Final model ROC AUC: " + str(roc_auc))

Final model F2 Score: 0.34628858170800986
Final model Recall: 0.5083391243919388
Final model Precision: 0.15220557636287974
Final model Accuracy: 0.9533488446961382
Final model PR AUC: 0.20025156289995072
Final model ROC AUC: 0.895888985388483


Our XGBoost method achieved the following results on the test data:
```
Final model F2 Score: 0.34628858170800986
Final model Recall: 0.5083391243919388
Final model Precision: 0.15220557636287974
Final model Accuracy: 0.9533488446961382
```
Compare this to our RBF kernel SVM ensemble method's results:
```
Final model F2 Score: 0.31284385315594676
Final model Recall: 0.4346768589298124
Final model Precision: 0.14748879981136526
Final model Accuracy: 0.956792562350313
```
Accuracy and precision remain roughly the same, but recall improves by a relative (0.5083/0.4347) - 1, or approximately 17%, and F2 score improves by a relative (0.3463/0.3128) - 1, or approximately 11%.

## Comparison to public baselines:
We compare our model to the strong baseline models provided in the most popular publicly available notebook for this dataset: https://www.kaggle.com/code/lennart4711/baselinemodels-roc. That notebook uses industry-standard models with balanced class weights, which modify the objective functions to account for the heavy class imbalance (~1% fraud rate), and performs no hyperparameter search. Their models are directly comparable to ours since they use an identical dataset (Base.csv) and an identical train/test split.

After modifying their code to compute F2 scores for each model, we found that they achieved:
- Logistic Regression: 0.3102
- XGBoost: 0.2925
- Random Forest: 0.2475
- Neural Network: 0.3204

Observe that our XGBoost model achieves meaningful relative performance gains of (0.34628858170800986/0.3204) - 1 = \~8% over their best baseline model and (0.34628858170800986/0.2925) - 1 = \~18% over their XGBoost model. Given that these baselines already use strong, well-established models which account for the heavy class imbalance (~1% fraud rate), and are therefore representative of realistic industry prototypes, this comparison reveals that our modeling choices offer meaningful, nontrivial performance improvements on this task.
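
The relative gains quoted in this section can be checked with a few lines of arithmetic. The sketch below only restates the F2 scores reported above; the helper name `relative_gain` is ours.

```python
our_f2 = 0.34628858170800986  # our XGBoost F2 on the test set, as reported above

# F2 scores of the reference models quoted in this section
reference_f2 = {
    "RBF kernel SVM ensemble": 0.31284385315594676,
    "Baseline Logistic Regression": 0.3102,
    "Baseline XGBoost": 0.2925,
    "Baseline Random Forest": 0.2475,
    "Baseline Neural Network": 0.3204,
}

def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, e.g. 0.08 means 8%."""
    return new / old - 1.0

for name, f2 in reference_f2.items():
    print(f"{name}: {relative_gain(our_f2, f2):+.1%}")
```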

## Reflection and next steps:
The fact remains, however, that XGBoost, linear kernel SVM, and RBF kernel SVM ensemble all achieved similar F2 scores in absolute terms (in the 0.31-0.35 range). Multiple distinct modelling techniques all achieving scores in this narrow range suggests that there is limited predictive signal left to exploit in the current feature set. In other words, we are likely near the performance ceiling for these features.

Significant improvements in our model performance will likely come not from the choice of modeling method, but from further feature engineering (e.g. deriving new features or exploring interactions). Therefore, as a next step for improving performance, the primary exploratory focus should be on feature engineering.

Additionally, one may consider trying undersampling/oversampling techniques to boost the proportion of fraud data in the training set, giving fraud data more "attention" during training. In our SVM and XGBoost code, we instead relied primarily on re-weighting the objective function (e.g. scale_pos_weight in XGBoost) to heavily penalize fraud misclassification.
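
As a starting point, random undersampling of the majority class could look roughly like the sketch below. It is a hypothetical helper, not something used in this notebook: the function name `undersample_indices` and the `neg_per_pos` knob (the number of non-fraud rows kept per fraud row) are assumptions made for illustration.

```python
import random

def undersample_indices(labels, neg_per_pos=10, seed=0):
    """Return shuffled row indices keeping every positive and a random subset of negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep at most neg_per_pos negatives for each positive.
    n_keep = min(len(neg), neg_per_pos * len(pos))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)
    return kept

# Toy labels with a 1% positive rate, mirroring the imbalance in this dataset.
labels = [1] * 10 + [0] * 990
idx = undersample_indices(labels, neg_per_pos=10)
fraud_rate = sum(labels[i] for i in idx) / len(idx)
print(len(idx), fraud_rate)
```

The returned indices would be applied to the training rows only (for example `X_train_inner[idx]`), leaving validation and test data at their natural class balance. Because resampling shifts the predicted probabilities, the decision threshold would need to be re-tuned afterwards.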