# Performance Objective:
The fraud incidence rate in this dataset is ~1%. For rare-event detection, accuracy becomes a far less meaningful metric than precision and recall. For bank account fraud in particular, missing a fraud instance is far more financially costly than falsely flagging a non-fraud instance. Therefore, we weigh recall significantly more than precision. Given this, we chose the F2 metric as the optimization goal for our models, reflecting real-world business objectives.
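For reference, the F-beta score combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R), and beta = 2 treats recall as twice as important as precision. The sketch below illustrates this with made-up precision and recall values; the helper name `fbeta_from_pr` is hypothetical and not part of any library.

```python
def fbeta_from_pr(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Two hypothetical models with swapped precision and recall.
high_recall = fbeta_from_pr(precision=0.2, recall=0.4)
high_precision = fbeta_from_pr(precision=0.4, recall=0.2)
print(high_recall, high_precision)
```

Under F2, the high-recall model scores higher even though both models have the same F1.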

# Data preprocessing:
- Train/test split: first 6 months (months 0 to 5) for training, last 2 months (months 6 and 7) for testing
- Drop the 'month' column
- Drop the 'device_fraud_count' column, which is constant and therefore useless for prediction
- One-hot encode the categorical columns
- Create indicator columns for columns with missing values (encoded as -1 in the raw data)
- Impute missing values with the training-set mean
- Standardize columns using training-set means and standard deviations
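The missing-value steps above can be sketched on a single toy column. This is a minimal illustration, not the notebook's actual pipeline: the values are made up and the helper `prepare` is hypothetical. It shows that when the mean and standard deviation are taken from the observed training values, mean-imputed entries standardize to exactly 0, and that the test column reuses the training statistics.

```python
import numpy as np

# Made-up toy column that uses -1 as the missing-value sentinel.
train_col = np.array([12.0, -1.0, 30.0, 18.0, -1.0])
test_col = np.array([-1.0, 24.0])

def prepare(col, mean=None, std=None):
    """Sentinel -> NaN, missing indicator, mean imputation, standardization."""
    col = np.where(col == -1, np.nan, col)
    is_missing = np.isnan(col).astype(int)
    if mean is None:
        # Fit statistics on observed (non-NaN) training values only.
        mean = np.nanmean(col)
        std = np.nanstd(col, ddof=1)  # ddof=1 matches the pandas default
    filled = np.where(is_missing == 1, mean, col)
    scaled = (filled - mean) / max(std, 1e-8)
    return scaled, is_missing, mean, std

train_scaled, train_flag, mu, sigma = prepare(train_col)
test_scaled, test_flag, _, _ = prepare(test_col, mu, sigma)  # reuse training statistics
print(train_flag, train_scaled.round(3))
print(test_flag, test_scaled.round(3))
```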


In [None]:
import pandas as pd

df = pd.read_csv('Base.csv')

In [None]:
# data processing
mask = df["month"] <= 5
full_training_data = df[mask].sample(frac=1).reset_index(drop=True).drop('month',axis=1) # train on months 0 to 5. drop month as a feature
full_test_data = df[~mask].sample(frac=1).reset_index(drop=True).drop('month',axis=1) # test on months 6 and 7. drop month as a feature

# 'device_fraud_count' is literally a constant column. get rid of it.
full_training_data = full_training_data.drop('device_fraud_count',axis=1)
full_test_data = full_test_data.drop('device_fraud_count',axis=1)

print("Full training data # rows: " + str(full_training_data.shape[0]))
print("Full test data # rows: " + str(full_test_data.shape[0]))
print("Column list: " + str(full_training_data.columns))

Full training data # rows: 794989
Full test data # rows: 205011
Column list: Index(['fraud_bool', 'income', 'name_email_similarity',
       'prev_address_months_count', 'current_address_months_count',
       'customer_age', 'days_since_request', 'intended_balcon_amount',
       'payment_type', 'zip_count_4w', 'velocity_6h', 'velocity_24h',
       'velocity_4w', 'bank_branch_count_8w',
       'date_of_birth_distinct_emails_4w', 'employment_status',
       'credit_risk_score', 'email_is_free', 'housing_status',
       'phone_home_valid', 'phone_mobile_valid', 'bank_months_count',
       'has_other_cards', 'proposed_credit_limit', 'foreign_request', 'source',
       'session_length_in_minutes', 'device_os', 'keep_alive_session',
       'device_distinct_emails_8w'],
      dtype='object')


In [None]:
# more data processing
y_train_full = full_training_data["fraud_bool"]
X_train_full = full_training_data.drop("fraud_bool",axis=1)
y_test_full = full_test_data["fraud_bool"]
X_test_full = full_test_data.drop("fraud_bool",axis=1)

# percentage of 1's in y_train_full:
print("Fraction of 1's in y_train_full: " + str(y_train_full.mean()))

# make sure all numerical columns are actually stored numerically
y_train_full = y_train_full.astype(float)
y_test_full = y_test_full.astype(float)
for col in X_train_full.columns:
    converted = pd.to_numeric(X_train_full[col], errors='coerce')
    if converted.notna().sum() == X_train_full[col].notna().sum():
        X_train_full[col] = converted
for col in X_test_full.columns:
    converted = pd.to_numeric(X_test_full[col], errors='coerce')
    if converted.notna().sum() == X_test_full[col].notna().sum():
        X_test_full[col] = converted

categorical_cols = X_train_full.select_dtypes(exclude='number').columns.tolist()
print("Categorical columns: " + str(categorical_cols))

Fraction of 1's in y_train_full: 0.010252972053701372
Categorical columns: ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']


In [None]:
# one hot encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

ohe = OneHotEncoder(
    drop="first",
    handle_unknown="ignore",
    sparse_output=False
)

preprocessor = ColumnTransformer(
    transformers=[("cat", ohe, categorical_cols)],
    remainder="passthrough"
)

preprocessor.set_output(transform="pandas")

X_train_full = preprocessor.fit_transform(X_train_full)
X_test_full  = preprocessor.transform(X_test_full)

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Dealing with missing values
# the following columns have "-1" to indicate missing values: prev_address_months_count, current_address_months_count, bank_months_count, session_length_in_minutes

missing_cols = [
    "remainder__prev_address_months_count",
    "remainder__current_address_months_count",
    "remainder__bank_months_count",
    "remainder__session_length_in_minutes"
]

# 1. Convert -1 to NaN in both train and test
X_train_full[missing_cols] = X_train_full[missing_cols].replace(-1, np.nan)
X_test_full[missing_cols] = X_test_full[missing_cols].replace(-1, np.nan)

# 1b. Create missing-value indicator columns (before imputation)
for col in missing_cols:
    indicator_name = f"{col}_is_missing"
    X_train_full[indicator_name] = X_train_full[col].isna().astype(int)
    X_test_full[indicator_name] = X_test_full[col].isna().astype(int)

# store the mean and sd of each training column as numpy arrays (pandas skips NaN values).
# these are computed before imputation, so mean-imputed entries standardize to exactly 0.
col_means = X_train_full.mean().to_numpy()
col_stds  = X_train_full.std().to_numpy()

# 2. Mean imputer (fit only on training data)
imputer = SimpleImputer(strategy="mean")

# Fit on training subset
imputer.fit(X_train_full[missing_cols])

# 3. Transform both train and test
X_train_full[missing_cols] = imputer.transform(X_train_full[missing_cols])
X_test_full[missing_cols] = imputer.transform(X_test_full[missing_cols])

column_names = list(X_train_full.columns) # store column names for later

# convert to numpy arrays
X_train_full = X_train_full.to_numpy()
X_test_full = X_test_full.to_numpy()


In [None]:
# standardization of columns
X_train_full = (X_train_full - col_means) / np.maximum(col_stds, 1e-8) # numerical stability
X_test_full = (X_test_full - col_means) / np.maximum(col_stds, 1e-8)

print(X_train_full.shape[0])
print(X_test_full.shape[0])

794989
205011


In [None]:
print(X_train_full.shape)
print(X_test_full.shape)

(794989, 49)
(205011, 49)


# STAGE 1: LINEAR KERNEL SVM

## Methodology:
- The format of our model is LinearSVC(C=C,dual=False,class_weight={0: 1, 1: R}) (imported from sklearn.svm in Python).
- Hyperparameters: C, R
- First, we do a coarse-grained, stratified 3-fold logspace search for hyperparameters C and R in order to find the “magnitude neighborhoods” that the best hyperparameters live in. The evaluation metric used for this 3-fold search is F2. (C candidates: [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], R candidates: [1.0, 10.0, 100.0, 1000.0])
- After the coarse-grained search, we do a fine-grained, stratified 3-fold linspace search in the neighborhood of the best candidates, again using F2 as the evaluation metric. This gives us our final hyperparameters.
- We then train the final model on all of the training data using the C, R pair that we find from hyperparameter tuning.
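To make the roles of C and R concrete: with `dual=False`, `LinearSVC` uses its default L2 penalty and squared hinge loss, and `class_weight` scales C per class. The sketch below evaluates that weighted objective for a given weight vector on two made-up points. It is a simplification (it ignores how liblinear handles the intercept), and the function name is hypothetical.

```python
def weighted_squared_hinge_objective(w, b, X, y, C, R):
    """0.5*||w||^2 + C * sum_i weight_i * max(0, 1 - s_i * f(x_i))^2, labels y in {0, 1}."""
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        sign = 1.0 if y_i == 1 else -1.0
        margin = sign * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        class_weight = R if y_i == 1 else 1.0  # fraud class is weighted by R
        loss += class_weight * max(0.0, 1.0 - margin) ** 2
    return regularizer + C * loss

# One fraud point inside the margin, one non-fraud point well outside it.
X = [[0.5], [-2.0]]
y = [1, 0]
for R in (1.0, 10.0):
    print(R, weighted_squared_hinge_objective([1.0], 0.0, X, y, C=1.0, R=R))
```

Raising R increases only the penalty on margin violations by fraud examples, which pushes the decision boundary toward higher recall.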

We first do a coarse-grained logspace search to find the magnitude(s) that the best margin loss penalty hyperparameters live in.

In [None]:
import numpy as np
from itertools import product

# C hyperparameters
C_values = np.logspace(-3, 3, 7).tolist() #1e-3 to 1e3
print(C_values)

# R weight hyperparameters:
R_values = np.logspace(0,3,4).tolist()
print(R_values)

hyperparameter_combinations = list(product(C_values, R_values))

[0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
[1.0, 10.0, 100.0, 1000.0]


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import fbeta_score

running_F2_scores = dict.fromkeys(hyperparameter_combinations, 0) # stores running total of F2 scores across folds
n_splits = 3
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state = 23)

for train_ids, val_ids in skf.split(X_train_full, y_train_full):
    X_train_inner, X_val_inner = X_train_full[train_ids], X_train_full[val_ids]
    y_train_inner, y_val_inner = y_train_full[train_ids], y_train_full[val_ids]

    for hyper in hyperparameter_combinations:
        C = hyper[0]
        R = hyper[1]
        model = LinearSVC(C=C,dual=False,class_weight={0: 1, 1: R})
        model.fit(X_train_inner, y_train_inner)
        preds = model.predict(X_val_inner)

        # F2 score
        f2 = fbeta_score(y_val_inner, preds, beta=2, zero_division=0)
        running_F2_scores[hyper] += f2

average_F2_scores = {hyper: running_F2_scores[hyper] / n_splits for hyper in running_F2_scores}
print("Visual inspection of coarse-grained hyperparameter performances: ")
print("Average F2 score by hyperparameter combination: " + str(average_F2_scores))

Visual inspection of coarse-grained hyperparameter performances: 
Average F2 score by hyperparameter combination: {(0.001, 1.0): 0.0, (0.001, 10.0): 0.1975937397846259, (0.001, 100.0): 0.16711202004468328, (0.001, 1000.0): 0.060256476857456394, (0.01, 1.0): 0.0, (0.01, 10.0): 0.19985552658267544, (0.01, 100.0): 0.1671428437968343, (0.01, 1000.0): 0.06025872031443463, (0.1, 1.0): 0.0, (0.1, 10.0): 0.20034146307836684, (0.1, 100.0): 0.16714542494311724, (0.1, 1000.0): 0.06025907831952048, (1.0, 1.0): 0.0, (1.0, 10.0): 0.20033640339053524, (1.0, 100.0): 0.16714798355157975, (1.0, 1000.0): 0.06025907831952048, (10.0, 1.0): 0.0, (10.0, 10.0): 0.20033163918169547, (10.0, 100.0): 0.16714714883221418, (10.0, 1000.0): 0.06025907831952048, (100.0, 1.0): 0.0, (100.0, 10.0): 0.20033163918169547, (100.0, 100.0): 0.16714714883221418, (100.0, 1000.0): 0.06025907831952048, (1000.0, 1.0): 0.0, (1000.0, 10.0): 0.20033163918169547, (1000.0, 100.0): 0.16714714883221418, (1000.0, 1000.0): 0.060259078319520

In [None]:
# Sort the hyperparameters by F2 score:
for hyper, score in sorted(average_F2_scores.items(), key=lambda x: x[1], reverse=True):
    print(hyper, score)

(0.1, 10.0) 0.20034146307836684
(1.0, 10.0) 0.20033640339053524
(10.0, 10.0) 0.20033163918169547
(100.0, 10.0) 0.20033163918169547
(1000.0, 10.0) 0.20033163918169547
(0.01, 10.0) 0.19985552658267544
(0.001, 10.0) 0.1975937397846259
(1.0, 100.0) 0.16714798355157975
(10.0, 100.0) 0.16714714883221418
(100.0, 100.0) 0.16714714883221418
(1000.0, 100.0) 0.16714714883221418
(0.1, 100.0) 0.16714542494311724
(0.01, 100.0) 0.1671428437968343
(0.001, 100.0) 0.16711202004468328
(0.1, 1000.0) 0.06025907831952048
(1.0, 1000.0) 0.06025907831952048
(10.0, 1000.0) 0.06025907831952048
(100.0, 1000.0) 0.06025907831952048
(1000.0, 1000.0) 0.06025907831952048
(0.01, 1000.0) 0.06025872031443463
(0.001, 1000.0) 0.060256476857456394
(0.001, 1.0) 0.0
(0.01, 1.0) 0.0
(0.1, 1.0) 0.0
(1.0, 1.0) 0.0
(10.0, 1.0) 0.0
(100.0, 1.0) 0.0
(1000.0, 1.0) 0.0


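Notes:
- C has almost no effect on F2 for a given R; only the smallest values (C <= 0.01) score marginally lower.
- R = 10 scores best (F2 of about 0.20), R = 100 is second (about 0.17), and R = 1 yields an F2 of 0. We take the optimal R to lie in the range [10, 100]; the interval between 1 and 10 is not explored here.

For our fine-grained hyperparameter search, we fix C = 1 and do a linear search over R from 10 to 100.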

In [None]:
import numpy as np

# R weight hyperparameters:
R_values = np.linspace(10, 100, 10).tolist()
print(R_values)

[10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import fbeta_score

running_F2_scores = dict.fromkeys(R_values, 0) # stores running total of F2 scores across folds
n_splits = 3
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state = 23)

for train_ids, val_ids in skf.split(X_train_full, y_train_full):
    X_train_inner, X_val_inner = X_train_full[train_ids], X_train_full[val_ids]
    y_train_inner, y_val_inner = y_train_full[train_ids], y_train_full[val_ids]

    for R in R_values:
        model = LinearSVC(C=1.0,dual=False,class_weight={0: 1, 1: R})
        model.fit(X_train_inner, y_train_inner)
        preds = model.predict(X_val_inner)

        # F2 score
        f2 = fbeta_score(y_val_inner, preds, beta=2, zero_division=0)
        running_F2_scores[R] += f2

average_F2_scores = {R: running_F2_scores[R] / n_splits for R in running_F2_scores}
best_R = max(average_F2_scores, key=average_F2_scores.get)
print("Best R: " + str(best_R))
print("Best R's F2 score: " + str(average_F2_scores[best_R]))

Best R: 20.0
Best R's F2 score: 0.2795081464328278


In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import fbeta_score
# Our final hyperparameters are C = 1.0, R = 20.0
# train our final model
final_model = LinearSVC(C=1.0,dual=False,class_weight={0: 1, 1: 20.0})
final_model.fit(X_train_full, y_train_full)
preds = final_model.predict(X_test_full)

# F2 Score:
f2 = fbeta_score(y_test_full, preds, beta=2, zero_division=0)
print("Final model F2 Score: " + str(f2))

# Recall:
from sklearn.metrics import recall_score
recall = recall_score(y_test_full, preds)
print("Final model Recall: " + str(recall))

# Precision:
from sklearn.metrics import precision_score
precision = precision_score(y_test_full, preds)
print("Final model Precision: " + str(precision))

# Accuracy:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test_full, preds)
print("Final model Accuracy: " + str(accuracy))

from sklearn.metrics import roc_auc_score, average_precision_score

# Decision scores for AUCs
scores = final_model.decision_function(X_test_full)

# ROC AUC:
roc_auc = roc_auc_score(y_test_full, scores)
print("Final model ROC AUC:", roc_auc)

# PR AUC:
pr_auc = average_precision_score(y_test_full, scores)
print("Final model PR AUC:", pr_auc)

Final model F2 Score: 0.31082471435533654
Final model Recall: 0.3988881167477415
Final model Precision: 0.1650611071171819
Final model Accuracy: 0.963236119037515
Final model ROC AUC: 0.8838175380071882
Final model PR AUC: 0.1662443966266518


In [None]:
# Get coefficient vector
coef = final_model.coef_.ravel() # shape: (n_features,)

# Pair each feature name with its coefficient
feature_coefs = list(zip(column_names, coef))

# Sort by coefficient in descending order
feature_coefs_sorted = sorted(feature_coefs, key=lambda x: x[1], reverse=True)

for name, weight in feature_coefs_sorted:
    print(name, weight)

remainder__prev_address_months_count_is_missing 0.1840775408693057
cat__device_os_windows 0.1726606370235882
remainder__email_is_free 0.0902276679731933
remainder__income 0.08539373697266404
remainder__customer_age 0.07160413475476544
cat__payment_type_AC 0.06656433909167216
remainder__bank_months_count 0.06086480434083718
remainder__credit_risk_score 0.05524611761741483
remainder__device_distinct_emails_8w 0.05475204580850315
cat__device_os_macintosh 0.05473006335153109
remainder__proposed_credit_limit 0.041235656112572835
remainder__prev_address_months_count 0.03794680122176087
cat__device_os_other 0.03074820468014643
remainder__zip_count_4w 0.02999213764554963
remainder__foreign_request 0.0298911560350147
remainder__velocity_4w 0.029156638983872867
remainder__bank_months_count_is_missing 0.02593410624518672
remainder__days_since_request 0.020579034998510815
cat__payment_type_AB 0.018122089365314887
cat__device_os_x11 0.016674212705907326
cat__employment_status_CC 0.01309661249245701

In [None]:
# Find the most significant features:
# keep coefficients whose absolute value is at least 0.1
strong_features = [
    (name, weight)
    for name, weight in feature_coefs_sorted
    if abs(weight) >= 0.1
]

for name, weight in strong_features:
    print(name, weight)

remainder__prev_address_months_count_is_missing 0.1840775408693057
cat__device_os_windows 0.1726606370235882
remainder__keep_alive_session -0.10241048431619434
remainder__name_email_similarity -0.10825898209539894
remainder__phone_home_valid -0.14472009825401563
remainder__has_other_cards -0.14719090368602533
cat__housing_status_BE -0.16965133844394345
cat__housing_status_BC -0.17068864195033798
cat__housing_status_BB -0.1738615613880953


## Interpretable feature analysis:
We filtered for the features with the strongest coefficient signals (absolute value >= 0.1).
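Because the features were standardized, coefficient magnitudes are roughly comparable across features. As a small sketch, the filtered coefficients printed above (rounded to three decimals here, with the `remainder__` and `cat__` prefixes dropped) can be ranked by absolute value and split by sign:

```python
# Coefficients copied (rounded) from the filtered output above.
strong_coefs = {
    "prev_address_months_count_is_missing": 0.184,
    "device_os_windows": 0.173,
    "keep_alive_session": -0.102,
    "name_email_similarity": -0.108,
    "phone_home_valid": -0.145,
    "has_other_cards": -0.147,
    "housing_status_BE": -0.170,
    "housing_status_BC": -0.171,
    "housing_status_BB": -0.174,
}

# Rank by strength of signal regardless of direction.
ranked = sorted(strong_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
positive = [name for name, weight in ranked if weight > 0]
negative = [name for name, weight in ranked if weight < 0]

for name, weight in ranked:
    print(f"{name:40s} {weight:+.3f}")
```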

### Positive signals for fraud:
- According to this SVM model, if information about the "number of months in previous registered address of the applicant" is missing, this is a relatively strong signal for fraud. One potential reason is that fraudsters tend to avoid providing a traceable address history.
- According to this SVM model, the device OS being Windows is a relatively strong signal for fraud. One possible explanation is that fraudsters often operate from cheap and widely available Windows environments.

### Negative signals for fraud:
- According to this SVM, the user keeping the session alive on session logout is a negative indicator of fraud. A potential reason for this is that fraudsters tend to avoid being connected to the system longer than they have to be.
- According to this SVM model, the similarity in the applicant's name to the email name is a negative indicator for fraud. One possible explanation is that fraudsters often use randomly generated or non-identifying email handles.
- According to this SVM model, the validity of the home phone number is a negative indicator of fraud. One possible explanation is that fraudsters often supply fake or burner phone numbers.
- According to this SVM model, the applicant having other cards with the same banking company is a negative signal of fraud. This may be because having multiple cards with the same company under the same identity increases the fraudster's detectability, so fraudsters generally avoid this.
- Housing statuses BE, BB, and BC are negative indicators of fraud according to this model, relative to the reference category dropped by the one-hot encoder. However, since these are anonymized values, we can't say much more beyond this.

# STAGE 2: RBF KERNEL SVM

## Methodology:
Due to the high memory and time cost of training kernel SVMs on large datasets (the Gram matrix grows quadratically with the number of rows), we will split the training data into manageably sized chunks, train an SVM model on each of them, then predict 1 if at least k out of the N SVM models predict 1.
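A rough back-of-the-envelope sketch of the quadratic growth follows. It assumes a dense float64 kernel matrix were held in memory; scikit-learn's `SVC` instead works with a bounded kernel cache, so these numbers illustrate the scaling rather than measure actual memory use. The helper name is hypothetical.

```python
def dense_gram_bytes(n_rows: int, bytes_per_entry: int = 8) -> int:
    """Size of a dense n x n kernel matrix with float64 entries."""
    return n_rows * n_rows * bytes_per_entry

# Chunk sizes used in this stage versus the full training set.
for n in (10_000, 20_000, 794_989):
    print(f"{n:>7} rows -> {dense_gram_bytes(n) / 1e9:,.1f} GB")
```

Doubling the chunk size quadruples the matrix, which is why the chunks are kept in the 10,000 to 20,000 row range.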

Fix N = 7.

We will divide the training dataset into stratified folds of size ~10000. We will then create 3 splits of 8 stratified folds each. For each split, 7 of the folds will be used to train 7 SVM models while the remaining fold will be used as a validation set. This means a total of 24 folds are required. Luckily, we have around 794989/10000 (approx 79) folds available to us. We will do hyperparameter search by finding the hyperparameters that maximize the average F2 score across all 3 splits.
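The fold bookkeeping described above works out as follows. This sketch assumes an even partition; stratification can shift individual fold sizes by a row or so, and all variable names are illustrative.

```python
n_rows = 794_989           # training rows reported earlier
target_fold_size = 10_000
n_models, n_splits = 7, 3

n_folds = n_rows // target_fold_size             # folds available
base_size, n_larger = divmod(n_rows, n_folds)    # n_larger folds get one extra row
folds_per_split = n_models + 1                   # 7 training folds + 1 validation fold
folds_needed = n_splits * folds_per_split

# Fold ids assigned to each split: consecutive groups of 8.
split_fold_ids = [
    list(range(s * folds_per_split, (s + 1) * folds_per_split))
    for s in range(n_splits)
]
print(n_folds, base_size, n_larger, folds_needed)
print(split_fold_ids)
```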

Our hyperparameters will be:
- C (as in linear kernel SVM)
- R (as in linear kernel SVM)
- Gamma (for the RBF kernel)
- K (the number of learners needed to make a positive prediction)

Search ranges:
- C: [10^-2,10]
- R: [10,1000]
- Gamma: [10^-3,1]
- K: [1,7]

Because an exhaustive grid over four hyperparameters is combinatorially expensive, we will use random search instead. We randomly sample 10 (C, R, Gamma) combinations, each drawn log-uniformly from its range above, then pair each of them with every K in 1 to 7. This gives a list of 70 four-dimensional candidates (C, R, Gamma, K). Since K only changes how the votes are thresholded, all 7 values of K can be evaluated from the same set of trained models.
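A minimal sketch of this sampling scheme: drawing an exponent uniformly and raising 10 to it gives a log-uniform sample. The fixed seed is an assumption added here for reproducibility, and the helper name is hypothetical.

```python
import random

rng = random.Random(23)  # assumption: a fixed seed so the candidates can be reproduced

def sample_log_uniform(low_exp: float, high_exp: float) -> float:
    """Draw 10**u with u uniform in [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)

# 10 random (C, R, gamma) triples from the ranges listed above.
svc_candidates = [
    (sample_log_uniform(-2, 1), sample_log_uniform(1, 3), sample_log_uniform(-3, 0))
    for _ in range(10)
]

# Pair every triple with every vote threshold k = 1..7.
candidates = [
    (C, R, gamma, k)
    for (C, R, gamma) in svc_candidates
    for k in range(1, 8)
]
print(len(candidates))
```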

After finding our optimal (C, R, Gamma, K) combination, we will create 7 stratified chunks of size ~20000 and train one RBF-kernel SVM model on each using the chosen (C, R, Gamma) hyperparameters. Our final model will predict 1 iff at least K out of the 7 models predict 1.
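The k-of-N voting rule can be sketched in a few lines. The predictions below are made up (three models, four samples), and the function name is hypothetical.

```python
def k_of_n_vote(predictions, k):
    """predictions: one 0/1 prediction list per model, all of equal length."""
    vote_counts = [sum(votes) for votes in zip(*predictions)]
    return [1 if count >= k else 0 for count in vote_counts]

# Made-up predictions from three models on four samples.
model_preds = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
for k in (1, 2, 3):
    print(k, k_of_n_vote(model_preds, k))
```

Lowering k can only add positive predictions, so recall never decreases as k shrinks, typically at the expense of precision.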


In [None]:
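import numpy as np
import random

# Draw 10 (C, R, Gamma) candidates log-uniformly from their ranges.
# No seed is set, so rerunning this cell draws different candidates.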
SVC_hyperparameter_combinations = []
for i in range(10):
  C_exp = random.uniform(-2,1)
  R_exp = random.uniform(1,3)
  Gamma_exp = random.uniform(-3,0)
  SVC_hyperparameter_combinations.append((10**C_exp,10**R_exp,10**Gamma_exp))

Ks = [1,2,3,4,5,6,7]

hyperparameter_combinations = [
    (C, R, Gamma, k)
    for (C, R, Gamma) in SVC_hyperparameter_combinations
    for k in Ks
]

In [None]:
from sklearn.model_selection import StratifiedKFold

n_splits = 79
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=23)

# Collect the "folds" = val indices from each split
fold_indices = []

for _, val_idx in skf.split(X_train_full, y_train_full):
    fold_indices.append(val_idx)

print("Number of folds:", len(fold_indices))  # should be 79
print("Size of first fold:", len(fold_indices[0]))

# We only need the first 24 folds
folds_for_splits = fold_indices[:24]

# Now group them into 3 sublists of 8 folds each
splits = [folds_for_splits[i:i + 8] for i in range(0, 24, 8)]

Number of folds: 79
Size of first fold: 10064


In [None]:
from sklearn.metrics import fbeta_score
from sklearn.svm import SVC
from tqdm import tqdm

total_trains = 3 * 10 * 7
progress_bar = tqdm(total=total_trains, desc="SVMs trained", ncols=100)

running_F2_scores = dict.fromkeys(hyperparameter_combinations, 0) # stores running total of F2 scores across folds

for split in splits:
  X_training_sets = []
  y_training_sets = []
  for i in range(7):
    X_training_sets.append(X_train_full[split[i]])
    y_training_sets.append(y_train_full[split[i]])
  X_val = X_train_full[split[7]]
  y_val = y_train_full[split[7]]

  for SVC_hyper in SVC_hyperparameter_combinations:
      C = SVC_hyper[0]
      R = SVC_hyper[1]
      Gamma = SVC_hyper[2]

      model_list = []
      for i in range(7):
        model = SVC(
              kernel="rbf",
              C=C,
              gamma=Gamma,
              class_weight={0: 1, 1: R},
              probability=False
          )
        model.fit(X_training_sets[i], y_training_sets[i])
        model_list.append(model)
        progress_bar.update(1)

      pred_list = []
      for model in model_list:
        preds = model.predict(X_val)
        pred_list.append(preds)
      pred_list = np.array(pred_list)

      total_preds = np.sum(pred_list, axis=0)

      for k in range(1,8):
        f2 = fbeta_score(y_val, (total_preds >= k).astype(int), beta=2, zero_division=0)
        running_F2_scores[(C,R,Gamma,k)] += f2

progress_bar.close()
average_F2_scores = {hyper: running_F2_scores[hyper] / 3 for hyper in running_F2_scores}

# Sort the hyperparameters by F2 score:
for hyper, score in sorted(average_F2_scores.items(), key=lambda x: x[1], reverse=True):
    print(hyper, score)

SVMs trained: 100%|███████████████████████████████████████████████| 210/210 [32:08<00:00,  9.18s/it]

(0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 2) 0.261757771381254
(0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 3) 0.2547957456786937
(0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 4) 0.2547457143399014
(0.04795099213571773, 22.391279582061784, 0.005481646677430149, 1) 0.2476590001832328
(0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 5) 0.2439292907827676
(0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 1) 0.24223896517157362
(0.6932901565519473, 649.2000733242538, 0.00615594122894976, 5) 0.23515579071134626
(1.5932984587626728, 115.51725592971276, 0.0010363448056037567, 7) 0.23313009244004737
(0.6932901565519473, 649.2000733242538, 0.00615594122894976, 4) 0.2314636345887424
(0.09160070752777136, 484.83921777028655, 0.022120835730685556, 4) 0.22786097739567143
(0.09160070752777136, 484.83921777028655, 0.022120835730685556, 5) 0.22184366357265842
(0.40852364882186243, 22.47084293386643, 0.0012593616869




In [None]:
# We set (C, R, Gamma, k) = (0.40852364882186243, 22.47084293386643, 0.0012593616869968059, 2)
# for our final model, we will train each of the 7 SVM models on ~20000 points instead of ~10000

from sklearn.model_selection import StratifiedKFold
import numpy as np

n_splits = 79
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=23)

# Collect the "folds" = validation indices from each split
fold_indices = []
for _, val_idx in skf.split(X_train_full, y_train_full):
    fold_indices.append(val_idx)

print("Number of folds:", len(fold_indices))  # should be 79

# Create the 7 final index arrays (each built from 2 folds)
# We use the first 14 folds -> 7 groups × 2 folds per group
final_train_chunks = []

for i in range(7):
    fold1 = fold_indices[2*i]
    fold2 = fold_indices[2*i + 1]

    merged = np.concatenate([fold1, fold2])
    final_train_chunks.append(merged)

# Inspect sizes:
for i, arr in enumerate(final_train_chunks):
    print(f"Chunk {i+1}: {len(arr)} samples")

Number of folds: 79
Chunk 1: 20128 samples
Chunk 2: 20128 samples
Chunk 3: 20128 samples
Chunk 4: 20128 samples
Chunk 5: 20128 samples
Chunk 6: 20128 samples
Chunk 7: 20126 samples


In [None]:
from sklearn.metrics import fbeta_score
from sklearn.svm import SVC

C, R, Gamma = (0.40852364882186243, 22.47084293386643, 0.0012593616869968059)

model_list = []
for i in range(7):
  model = SVC(
        kernel="rbf",
        C=C,
        gamma=Gamma,
        class_weight={0: 1, 1: R},
        probability=False
    )
  model.fit(X_train_full[final_train_chunks[i]], y_train_full[final_train_chunks[i]])
  model_list.append(model)

pred_list = []
for model in model_list:
  preds = model.predict(X_test_full)
  pred_list.append(preds)
pred_list = np.array(pred_list)

total_preds = np.sum(pred_list, axis=0)
final_preds = (total_preds >= 2).astype(int)

In [None]:
# F2 Score:
f2 = fbeta_score(y_test_full, final_preds, beta=2, zero_division=0)
print("Final model F2 Score: " + str(f2))

# Recall:
from sklearn.metrics import recall_score
recall = recall_score(y_test_full, final_preds)
print("Final model Recall: " + str(recall))

# Precision:
from sklearn.metrics import precision_score
precision = precision_score(y_test_full, final_preds)
print("Final model Precision: " + str(precision))

# Accuracy:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test_full, final_preds)
print("Final model Accuracy: " + str(accuracy))

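# ROC AUC and PR AUC
# Note: the ensemble's score is the number of positive votes (0 to 7), so these AUCs
# are computed from only 8 distinct score levels, unlike the linear model's
# continuous decision_function scores.
from sklearn.metrics import roc_auc_score, average_precision_score

roc_auc = roc_auc_score(y_test_full, total_preds)
pr_auc = average_precision_score(y_test_full, total_preds)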

print("Final model ROC AUC:", roc_auc)
print("Final model PR AUC:", pr_auc)

Final model F2 Score: 0.31344168354492913
Final model Recall: 0.4409312022237665
Final model Precision: 0.14534417592486543
Final model Accuracy: 0.9557535937096058
Final model ROC AUC: 0.7313537579325938
Final model PR AUC: 0.10456909034776057


Here are the test-set results for the RBF ensemble method:
```
Final model F2 Score: 0.31344168354492913
Final model Recall: 0.4409312022237665
Final model Precision: 0.14534417592486543
Final model Accuracy: 0.9557535937096058
```

Here are the test-set results for the linear kernel method:
```
Final model F2 Score: 0.31082471435533654
Final model Recall: 0.3988881167477415
Final model Precision: 0.1650611071171819
Final model Accuracy: 0.963236119037515
```

The RBF ensemble method and the linear kernel method achieved similar F2 scores (0.313 vs. 0.311). The RBF ensemble had noticeably higher recall (0.441 vs. 0.399) at the cost of lower precision (0.145 vs. 0.165).

Note that the RBF ensemble method used here is almost certainly suboptimal: N = 7 is a very small ensemble size, and we only ran a coarse random search over log-spaced hyperparameter ranges without a finer linear-space search afterwards. Therefore, the F2 score of the RBF ensemble method would likely see some non-negligible improvement with further tuning.

However, a fundamental limitation of using nonlinear kernels here is that the growth in time and space complexity prevents the model from being feasibly trained on *all* the training data. Moreover, there is no compelling reason to believe that a polynomial kernel would achieve significantly better results, since it shares many of the same computational limitations as the RBF kernel. So although there is likely some more "juice" to be squeezed out of the nonlinear kernel SVM approach, our efforts will likely be better rewarded by focusing on XGBoost from now on. We will keep the test results of the RBF ensemble method as a baseline benchmark.