<a href="https://colab.research.google.com/github/arunmishrarut/Credit-card-fraud/blob/main/Final_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Introduction
This notebook introduces a novel, model-driven approach to handling imbalanced datasets and compares it against three alternatives:

Baseline: No oversampling (original data)

Custom Oversampling: The model-driven oversampling method developed in this notebook

SMOTE: Synthetic Minority Over-sampling Technique

ADASYN: Adaptive Synthetic Sampling

We evaluate each approach on a held-out test set using a common set of metrics (accuracy, precision, recall, F1, G-mean, MCC, ROC-AUC, PR-AUC, log-loss) together with calibration and learning curves.

# 2. Data Preparation





1. Data Indexing and Splitting




*   Each transaction is assigned a unique index for tracking.
*   The dataset is split into 80% training and 20% test sets, stratified by the fraud label so that both sets keep the original fraud ratio.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import recall_score, precision_score
import xgboost as xgb
from collections import Counter

data = pd.read_csv('data1.csv')
data.insert(0, 'index', range(len(data)))  # Add an explicit index column so individual samples can be tracked

rs1 = 42
train, test = train_test_split(
    data, test_size=0.2, stratify=data['Class'], random_state=rs1
)


**The test set is held out for final evaluation and is never used while oversampling is performed.**

2. Stratified Subdivision of Training Data


*   The training set is further divided into 6 equal, non-overlapping, stratified subsets using StratifiedKFold.
*   Each subset preserves the original class distribution and ensures that no instance appears in more than one subset.
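
The two properties stated above can be checked directly. The following is a minimal sketch on synthetic labels; the array sizes, the 1% positive rate, and the variable names are illustrative assumptions and are not taken from the notebook's dataset. It confirms that the test folds produced by `StratifiedKFold` partition the rows and share the same positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, highly imbalanced labels (hypothetical data, not the notebook's CSV)
rng = np.random.default_rng(0)
y = np.zeros(6000, dtype=int)
y[:60] = 1                                  # 1% positives
X = rng.normal(size=(len(y), 3))

skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=42)
folds = [test_idx for _, test_idx in skf.split(X, y)]

# Every row should land in exactly one fold
all_idx = np.concatenate(folds)
is_partition = len(all_idx) == len(y) and len(np.unique(all_idx)) == len(y)

# Every fold should keep the overall positive rate
fold_rates = [float(y[idx].mean()) for idx in folds]
print(is_partition, fold_rates)
```

With 60 positives spread over 6 folds, each fold receives 10 of them, so the per-fold rate matches the overall 1%.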



In [None]:
n_subsets = 6
skf = StratifiedKFold(n_splits=n_subsets, shuffle=True, random_state=42)
subsets = []
for i, (_, idx) in enumerate(skf.split(train, train['Class'])):
    subset = train.iloc[idx].copy()
    subset['subset'] = f'sub{i+1}'
    subsets.append(subset)



3. Nested Cross-Validation-Style Use of the Subsets (with a Tweak) to Identify Misclassified Samples after Model Training
For each of the six subsets, a nested, cross-validation-style loop is used:


*   One subset is fixed as the evaluation block.
*   Of the remaining five subsets, one is left out and the other four are combined for training.
*   An XGBoost classifier is trained on the four training blocks and scored on the evaluation block.
*   The previous two steps are repeated for each of the five possible left-out subsets, so every evaluation block is scored by five models, each trained on a different combination of four blocks.
*   The next subset then becomes the evaluation block, and the procedure is repeated until all six subsets have been evaluated (6 × 5 = 30 model fits in total).

This differs from standard k-fold cross-validation over the same six subsets, in which five subsets (instead of four) would be used for training and the remaining subset for validation.
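
The loop structure described above can be sketched without any model fitting. The snippet below only enumerates which blocks play which role in each fit; the names `schedule`, `left_out`, and `train_blocks` are illustrative and do not appear in the notebook.

```python
n_subsets = 6
schedule = []

for eval_idx in range(n_subsets):                      # outer loop: evaluation block
    others = [i for i in range(n_subsets) if i != eval_idx]
    for left_out in others:                            # inner loop: block left out of training
        train_blocks = [i for i in others if i != left_out]
        schedule.append((eval_idx, left_out, train_blocks))

print(len(schedule))   # number of model fits
print(schedule[0])     # roles for the first fit
```

Each evaluation block appears in five entries of `schedule`, once per left-out block, which is why the same row can be misclassified more than once.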




4. Easy and Hard Case Identification


*   After prediction, easy-to-detect frauds (true frauds with a predicted fraud probability above the threshold `T`) and easy-to-detect legits (true legitimate transactions with a predicted fraud probability below the threshold `B`) are removed from the evaluation set.

*  The remaining "hard" cases (ambiguous or misclassified transactions) are retained for focused analysis and oversampling.
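
As a minimal sketch of this filter, the helper below applies the two thresholds to a few scored transactions. The function `is_hard` and the toy `scored` list are hypothetical; the threshold values 0.55 and 0.45 are the ones set for `T` and `B` in the oversampling cell.

```python
T = 0.55  # true frauds scored above T count as easy
B = 0.45  # true legits scored below B count as easy

def is_hard(label, prob, upper=T, lower=B):
    """Return True when a scored transaction survives the easy-case filter."""
    easy_fraud = label == 1 and prob > upper
    easy_legit = label == 0 and prob < lower
    return not (easy_fraud or easy_legit)

# (true label, predicted fraud probability)
scored = [
    (1, 0.97),  # confident fraud: easy
    (1, 0.30),  # missed fraud: hard
    (0, 0.02),  # confident legit: easy
    (0, 0.80),  # legit flagged as fraud: hard
    (0, 0.50),  # ambiguous legit: hard
]
hard_cases = [(y, p) for y, p in scored if is_hard(y, p)]
print(hard_cases)
```

Only the missed fraud, the flagged legit, and the ambiguous legit remain after filtering.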























5. Adaptive Oversampling of Hard Cases





*   False Negatives (FNs): Missed frauds are oversampled, with the number of duplicates determined by:

FN multiplier = $\left\lceil \dfrac{\text{fold recall} \times \lambda_1 \times \text{frequency}^2}{\max( \text{predicted probability}, 0.01)} \right\rceil$

*   False Positives (FPs): Incorrectly flagged legits are oversampled by:

FP multiplier = $\left\lceil \dfrac{\text{fold precision} \times \lambda_2 \times \text{frequency}^2}{\max(1 - \text{predicted probability}, 0.01)} \right\rceil$






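Here, $\lambda_1$ and $\lambda_2$ are tunable parameters, the fold-level metric is measured on the evaluation block for the fit that produced the prediction, and "frequency" is the number of times an instance is misclassified across fits. The floor of 0.01 in the denominator caps the multiplier for extreme probabilities. Together these terms emphasize the hardest cases during oversampling: those to which the model assigns the least probability for the true class and which it gets wrong most often.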
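
A small worked example may make the multipliers concrete. This is a sketch only: the helper names, the transaction ids, and the numeric inputs are hypothetical. Following the oversampling cell, false negatives are weighted by the fold's recall and false positives by the fold's precision.

```python
import math
from collections import Counter

def fn_multiplier(fold_recall, lam1, frequency, prob):
    """Copies of a missed fraud; grows as the predicted fraud probability falls."""
    return math.ceil(fold_recall * lam1 * frequency**2 / max(prob, 0.01))

def fp_multiplier(fold_precision, lam2, frequency, prob):
    """Copies of a wrongly flagged legit; grows as the fraud probability rises."""
    return math.ceil(fold_precision * lam2 * frequency**2 / max(1 - prob, 0.01))

# Hypothetical transaction ids that were missed in successive fits
missed_ids = ["tx7", "tx3", "tx7"]
frequency = Counter(missed_ids)   # tx7 was missed twice, tx3 once

# Illustrative inputs: fold recall 0.75, lambda_1 = 2, fraud probability 0.25
copies_tx7 = fn_multiplier(0.75, 2, frequency["tx7"], 0.25)   # ceil(0.75 * 2 * 4 / 0.25)
copies_tx3 = fn_multiplier(0.75, 2, frequency["tx3"], 0.25)   # ceil(0.75 * 2 * 1 / 0.25)

# Illustrative inputs: fold precision 0.5, lambda_2 = 1, flagged twice, fraud probability 0.75
copies_fp = fp_multiplier(0.5, 1, 2, 0.75)                    # ceil(0.5 * 1 * 4 / 0.25)
print(copies_tx7, copies_tx3, copies_fp)
```

Because the frequency enters squared, a row missed twice receives four times as many copies as an otherwise identical row missed once (24 versus 6 here).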


In [None]:
#below is final




# Remove easy frauds
T = 0.55  # Upper threshold: true frauds with predicted fraud probability above T are treated as easy-to-detect and dropped (legits are unaffected)
# Remove easy legits
B = 0.45  # Lower threshold: true legits with predicted fraud probability below B are treated as easy-to-detect and dropped (frauds are unaffected)
# Initial lambda values for oversampling; they scale the per-instance recall/precision weights
lamda_1 = 2  # FN weight (original note: a value greater than 10 gives an integer multiplier greater than 10 and lower than 100)
lamda_2 = 1  # FP weight
fnlt = 0.55  # fnlt >= 0.5: frauds with fraud probability below fnlt are treated as tough-to-identify frauds (e.g. 0.7 keeps all frauds scored below 0.7)
fput = 0.45  # fput <= 0.5: legits with fraud probability above fput are treated as tough-to-identify legits (e.g. 0.35 keeps all legits scored above 0.35)
target_fraud_ratio = 0.5


# Storage
all_fn_rows, all_fn_probs, all_fn_recalls = [], [], []
all_fp_rows, all_fp_probs, all_fp_precisions = [], [], []
all_val_sub = []





for eval_idx in range(n_subsets):  # Outer: Evaluation block
    train_val_indices = [i for i in range(n_subsets) if i != eval_idx]
    for val_idx in train_val_indices:  # Inner: Leave-one-out from training blocks
        train_indices = [i for i in train_val_indices if i != val_idx]
        train_subs = pd.concat([subsets[i] for i in train_indices], ignore_index=True)
        val_sub = subsets[eval_idx].copy()  # Evaluation always on this fixed block




        X_train_sub = train_subs.drop(['Class', 'index', 'subset'], axis=1)
        y_train_sub = train_subs['Class']
        X_val = val_sub.drop(['Class', 'index', 'subset'], axis=1)
        y_val = val_sub['Class']

        model = xgb.XGBClassifier(tree_method='hist', device='cuda', eval_metric='logloss', random_state=42)
        model.fit(X_train_sub, y_train_sub)
        y_pred = model.predict(X_val)
        y_proba = model.predict_proba(X_val)[:, 1]

        # Calculate recall and precision for this fit on the full evaluation block (before easy cases are removed)
        fold_recall = recall_score(y_val, y_pred, zero_division=0)
        fold_precision = precision_score(y_val, y_pred, zero_division=0)

        # Remove easy frauds
        y_pred = pd.Series(y_pred, index=val_sub.index)
        y_proba = pd.Series(y_proba, index=val_sub.index)
        easy_frauds_idx = val_sub[(y_proba > T) & (y_val == 1)].index
        val_sub = val_sub.drop(easy_frauds_idx)
        y_val = y_val.drop(easy_frauds_idx)
        y_pred = y_pred.drop(easy_frauds_idx)
        y_proba = y_proba.drop(easy_frauds_idx)

        # Remove easy legits
        easy_legits_idx = val_sub[(y_proba < B) & (y_val == 0)].index
        val_sub = val_sub.drop(easy_legits_idx)
        y_val = y_val.drop(easy_legits_idx)
        y_pred = y_pred.drop(easy_legits_idx)
        y_proba = y_proba.drop(easy_legits_idx)

        # Collect remaining (tough) false negatives and false positives
        false_negatives_idx = val_sub[(y_val == 1) & (y_proba < fnlt)].index
        fn_rows = val_sub.loc[false_negatives_idx].copy()
        fn_probs = y_proba.loc[false_negatives_idx]
        fn_rows['fold_recall'] = fold_recall  # Store per-fold recall with each row; these are frauds scored below fnlt (missed or borderline frauds)
        all_fn_rows.append(fn_rows)
        all_fn_probs.append(fn_probs)
        all_fn_recalls.append(pd.Series([fold_recall] * len(fn_rows), index=fn_rows.index))
        false_positives_idx = val_sub[(y_val == 0) & (y_proba > fput)].index
        fp_rows = val_sub.loc[false_positives_idx].copy()
        fp_probs = y_proba.loc[false_positives_idx]
        fp_rows['fold_precision'] = fold_precision  # Store per-fold precision with each row; these are legits scored above fput (flagged or borderline legits)
        all_fp_rows.append(fp_rows)
        all_fp_probs.append(fp_probs)
        all_fp_precisions.append(pd.Series([fold_precision] * len(fp_rows), index=fp_rows.index))



        # all remaining val_sub
        all_val_sub.append(val_sub)

# Combine all tough cases and their associated probabilities and metrics
all_fn_rows = pd.concat(all_fn_rows, ignore_index=True)
all_fn_probs = pd.concat(all_fn_probs, ignore_index=True)
all_fn_recalls = pd.concat(all_fn_recalls, ignore_index=True)
all_fp_rows = pd.concat(all_fp_rows, ignore_index=True)
all_fp_probs = pd.concat(all_fp_probs, ignore_index=True)
all_fp_precisions = pd.concat(all_fp_precisions, ignore_index=True)
all_val_sub = pd.concat(all_val_sub, ignore_index=True)


all_val_sub = all_val_sub.sample(frac=1.0/25, random_state=42)  # Downsample the pooled hard cases: each evaluation block is appended once per inner-loop fit, so rows repeat across fits (the 1/25 fraction is the author's choice)


# Count FN row frequency
fn_tuples = [tuple(row[:-1]) for row in all_fn_rows.to_numpy()]  # excluding 'fold_recall'
fn_counts = Counter(fn_tuples)
fn_repeat_multiplier = np.array([fn_counts[tuple(row[:-1])] for row in all_fn_rows.to_numpy()]) ** 2

# Count FP row frequency
fp_tuples = [tuple(row[:-1]) for row in all_fp_rows.to_numpy()]  # excluding 'fold_precision'
fp_counts = Counter(fp_tuples)
fp_repeat_multiplier = np.array([fp_counts[tuple(row[:-1])] for row in all_fp_rows.to_numpy()]) ** 2


# Now, apply oversampling with initial lambda values, using per-instance recall/precision
fn_repeat = np.ceil(all_fn_recalls * lamda_1*fn_repeat_multiplier / np.maximum(all_fn_probs, 0.01)).astype(int)
fn_oversampled = pd.DataFrame(
    np.repeat(all_fn_rows.values, fn_repeat, axis=0),
    columns=all_fn_rows.columns)

fp_repeat = np.ceil(all_fp_precisions * lamda_2*fp_repeat_multiplier / np.maximum((1-all_fp_probs), 0.01)).astype(int)
fp_oversampled = pd.DataFrame(
    np.repeat(all_fp_rows.values, fp_repeat, axis=0),
    columns=all_fp_rows.columns)


def adjust_lambdas(train, fn_oversampled, fp_oversampled, all_fn_rows, all_fn_probs, all_fn_recalls,
                   all_fp_rows, all_fp_probs, all_fp_precisions, all_val_sub,
                   target_fraud_ratio):
    """Rebuild the oversampled sets, nudging the lambdas toward target_fraud_ratio."""
    lam1, lam2 = lamda_1, lamda_2
    max_iter = 1  # currently a single pass; raise this to keep adjusting the lambdas

    for it in range(max_iter):
        # Recalculate oversampled sets with the current lambdas and per-instance recall/precision.
        # The frequency^2 terms (module-level fn_repeat_multiplier / fp_repeat_multiplier) are
        # included so the repeat counts match the multiplier formulas.
        fn_repeat = np.ceil(all_fn_recalls * lam1 * fn_repeat_multiplier / np.maximum(all_fn_probs, 0.01)).astype(int)
        fn_oversampled = pd.DataFrame(
            np.repeat(all_fn_rows.values, fn_repeat, axis=0),
            columns=all_fn_rows.columns)

        fp_repeat = np.ceil(all_fp_precisions * lam2 * fp_repeat_multiplier / np.maximum((1 - all_fp_probs), 0.01)).astype(int)
        fp_oversampled = pd.DataFrame(
            np.repeat(all_fp_rows.values, fp_repeat, axis=0),
            columns=all_fp_rows.columns)

        # Combine oversampled data with the retained hard cases
        oversampled_data = pd.concat([fn_oversampled, fp_oversampled, all_val_sub], ignore_index=True)

        # Combine with original train
        combined_dataset = pd.concat([train, oversampled_data], ignore_index=True)

        # Calculate fraud ratio
        fraud_ratio = combined_dataset['Class'].mean()  # mean of binary Class = ratio of 1s

        # Stop if the ratio is close enough to the target
        if abs(fraud_ratio - target_fraud_ratio) < 0.01:
            break

        # Adjust lambdas based on fraud ratio: more FN copies if frauds are too few, more FP copies otherwise
        if fraud_ratio < target_fraud_ratio:
            lam1 += 1
        else:
            lam2 += 1

        # Never go below the initial lambda values
        lam1 = max(lam1, lamda_1)
        lam2 = max(lam2, lamda_2)
        if it % 10 == 0:
            print(it)

        # Stop once the added data outgrows the original training set
        if len(oversampled_data) > len(train):
            break

    return lam1, lam2, oversampled_data, fraud_ratio, combined_dataset, fn_oversampled, fp_oversampled




lamda_1, lamda_2, oversampled_final, fraud_ratio, combined_dataset, fn_oversampled, fp_oversampled = adjust_lambdas(train, fn_oversampled, fp_oversampled,all_fn_rows,
                                                     all_fn_probs, all_fn_recalls,
                                                     all_fp_rows, all_fp_probs, all_fp_precisions,all_val_sub,
                                                     target_fraud_ratio)








In [None]:
print("\n==================================")
print(f"\nFinal lamda_1: {lamda_1}, lamda_2: {lamda_2}")
print(f"Total samples remaining after removing easy frauds and easy legits: {len(all_val_sub)} out of {len(train)}")
print(f"Fraud ratio in oversampled set: {oversampled_final['Class'].mean():.3f}")
print(f"Fraud ratio in resultant set: {combined_dataset['Class'].mean():.3f}")
print(f"Tough fraud samples added by oversampling (False Negatives, FNs): {len(fn_oversampled)}")
print(f"Distinct tough fraud rows among them (FNs): {len(fn_counts)}")
print(f"Tough legitimate samples added by oversampling (False Positives, FPs): {len(fp_oversampled)}")
print(f"Distinct tough legitimate rows among them (FPs): {len(fp_counts)}")
print("\n==================================")
print(f"Samples added by oversampling: {len(oversampled_final)}")
print(f"train: {train['Class'].value_counts()}")
print(f"oversampled: {oversampled_final['Class'].value_counts(ascending=True)}")
print(f"combined dataset : {combined_dataset['Class'].value_counts()}")




In [None]:
# Prepare the final training set from the combined (original + oversampled) data
final_oversampled_df = combined_dataset.reset_index(drop=True)

# Drop helper columns if present
final_oversampled_df = final_oversampled_df.drop(
    columns=['fold_recall', 'fold_precision', 'index', 'subset'], errors='ignore')

# Convert all columns to numeric; invalid entries become NaN
for col in final_oversampled_df.columns:
    final_oversampled_df[col] = pd.to_numeric(final_oversampled_df[col], errors='coerce')
# Drop rows with NaNs if any were introduced
final_oversampled_df = final_oversampled_df.dropna()
df_new = final_oversampled_df.copy()

In [None]:
# Final comparison: XGBoost trained on the original data vs. on the custom-oversampled data
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt



threshold = 0.1  # Decision threshold shared by all models: baseline, custom oversampling, SMOTE and ADASYN

# Model 1 (baseline) training data: the original training set, without oversampling
train_model1 = pd.concat([train], ignore_index=True)

# Separate features (X) and target (y) for Model 1 training
X_train_model1 = train_model1.drop('Class', axis=1)
y_train_model1 = train_model1['Class']

# Separate features (X) and target (y) for Model 2 training
train_model2 = pd.concat([df_new], ignore_index=True)
X_train_model2 = train_model2.drop('Class', axis=1)
y_train_model2 = train_model2['Class']

# Separate features (X) and target (y) for test (final evaluation)
X_test = test.drop('Class', axis=1)
y_test = test['Class']

# Ensure 'index' column is dropped if present in any of the relevant DataFrames
for df_to_clean in [X_train_model1, X_train_model2, X_test]:
    if 'index' in df_to_clean.columns:
        df_to_clean.drop('index', axis=1, inplace=True)

# Ensure data types are suitable for XGBoost and other models
X_train_model1 = X_train_model1.astype(float)
y_train_model1 = y_train_model1.astype(float)
X_train_model2 = X_train_model2.astype(float)
y_train_model2 = y_train_model2.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)

# Initialize and train the first XGBoost model (Model 1) on train_model1, without oversampling
model1 = xgb.XGBClassifier(tree_method='hist', device='cuda', eval_metric='logloss', random_state=42)
model1.fit(X_train_model1, y_train_model1)



# Predict probabilities and apply the custom threshold
y_pred1 = model1.predict(X_test)
y_prob_preds1 = model1.predict_proba(X_test)[:, 1]


y_score1 = (y_prob_preds1 > threshold).astype(int)

# Evaluate the baseline model

print(f"Test Accuracy1: {accuracy_score(y_test, y_score1):.4f}")
print("Classification Report without oversampling:\n", classification_report(y_test, y_score1))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_score1))
# ROC-AUC is computed from predicted probabilities (not thresholded labels), as for Model 2 below
roc_auc1 = roc_auc_score(y_test, y_prob_preds1)
print()
print(f"AUC:{roc_auc1:.4f}")


print("\n")


# Initialize and train the second XGBoost model (Model 2) on the custom-oversampled training data
model2 = xgb.XGBClassifier(tree_method='hist', device='cuda', eval_metric='logloss', random_state=42)
model2.fit(X_train_model2, y_train_model2)

# Predict probabilities and apply the custom threshold
y_pred2 = model2.predict(X_test)
y_prob_preds2 = model2.predict_proba(X_test)[:, 1]
y_score2 = (y_prob_preds2 > threshold).astype(int)

# Evaluate the oversampled model

print(f"Test Accuracy2: {accuracy_score(y_test, y_score2):.4f}")
print(f"Classification Report with {len(oversampled_final)} oversampling:\n", classification_report(y_test, y_score2))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_score2))

print()


# ROC-AUC for Model 2 on the test data, from predicted probabilities
roc_auc2 = roc_auc_score(y_test, y_prob_preds2)
print(f"AUC:{roc_auc2:.4f}")


In [None]:
#smote for comparison
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve, auc, roc_auc_score
)
from imblearn.over_sampling import SMOTE, ADASYN
import xgboost as xgb
import matplotlib.pyplot as plt

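# Reuse the existing train/test split; drop the helper 'index' column so SMOTE
# sees the same features as the baseline and custom-oversampling models
train = pd.concat([train], ignore_index=True)
X_train_smote = train.drop(['Class', 'index'], axis=1)
y_train_smote = train['Class']

X_test = test.drop(['Class', 'index'], axis=1)
y_test = test['Class']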


# Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_smote, y_train_smote)


# Show class distribution before and after SMOTE
from collections import Counter

print("Before SMOTE:", Counter(y_train_smote))
print("After SMOTE:", Counter(y_train_res))

# Calculate how many synthetic samples were added
added_samples = len(y_train_res) - len(y_train_smote)
print(f"\nSMOTE added {added_samples} synthetic samples.")




# Train XGBoost classifier with GPU
model_smote = xgb.XGBClassifier(tree_method='hist', device='cuda', eval_metric='logloss', random_state=42)
model_smote.fit(X_train_res, y_train_res)

# Predict on test set
y_pred_smote = model_smote.predict(X_test)                    # labels at the default 0.5 threshold
y_pred_prob_smote = model_smote.predict_proba(X_test)[:, 1]   # probabilities, needed for curves and AUC
y_score_smote = (y_pred_prob_smote > threshold).astype(int)   # labels at the shared custom threshold

# Evaluate at the shared threshold
print("Accuracy:", accuracy_score(y_test, y_score_smote))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_score_smote))
print("\nClassification Report:\n", classification_report(y_test, y_score_smote))

# Precision-Recall Curve (from probabilities)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_pred_prob_smote)
pr_auc = auc(recall, precision)
roc_auc_smote = roc_auc_score(y_test, y_pred_prob_smote)
print(f"AUC:{roc_auc_smote:.4f}")

#plt.figure(figsize=(8, 6))
#plt.plot(recall, precision, label=f'PR Curve (AUC = {pr_auc:.2f})', color='darkorange')
#plt.xlabel("Recall")
#plt.ylabel("Precision")
#plt.title("Precision-Recall Curve")
#plt.legend(loc="lower left")
#plt.grid(True)
#plt.show()

# ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_test, y_pred_prob_smote)
roc_auc = roc_auc_score(y_test, y_pred_prob_smote)

#plt.figure(figsize=(8, 6))
#plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})', color='blue')
#plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
#plt.xlabel("False Positive Rate")
#plt.ylabel("True Positive Rate")
#plt.title("ROC Curve")
#plt.legend(loc="lower right")
#plt.grid(True)
#plt.show()


In [None]:
#ADASYN
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve, auc, roc_auc_score
)
from imblearn.over_sampling import ADASYN  # Use ADASYN instead of SMOTE
import xgboost as xgb
import matplotlib.pyplot as plt






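# Drop the helper 'index' column so ADASYN sees the same features as the other models
train = pd.concat([train], ignore_index=True)
X_train_ada = train.drop(['Class', 'index'], axis=1)
y_train_ada = train['Class']

X_test = test.drop(['Class', 'index'], axis=1)
y_test = test['Class']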


# Apply ADASYN to training data only
adasyn = ADASYN(random_state=42)
X_train_res, y_train_res = adasyn.fit_resample(X_train_ada, y_train_ada)

# Show class distribution before and after ADASYN
from collections import Counter

print("Before ADASYN:", Counter(y_train_ada))
print("After ADASYN:", Counter(y_train_res))

# Calculate how many synthetic samples were added
added_samples = len(y_train_res) - len(y_train_ada)
print(f"\nADASYN added {added_samples} synthetic samples.")

# Train XGBoost classifier with GPU
model_ada = xgb.XGBClassifier(tree_method='hist', device='cuda', eval_metric='logloss', random_state=42)
model_ada.fit(X_train_res, y_train_res)


# Predict on test set
y_pred_ada = model_ada.predict(X_test)                    # labels at the default 0.5 threshold
y_pred_prob_ada = model_ada.predict_proba(X_test)[:, 1]   # probabilities, needed for curves and AUC
y_score_ada = (y_pred_prob_ada > threshold).astype(int)   # labels at the shared custom threshold

# Evaluate at the shared threshold
print("Accuracy:", accuracy_score(y_test, y_score_ada))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_score_ada))
print("\nClassification Report:\n", classification_report(y_test, y_score_ada))
roc_auc_ada = roc_auc_score(y_test, y_pred_prob_ada)
print(f"AUC:{roc_auc_ada:.4f}")



In [None]:
# Metrics for comparison
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, roc_auc_score, average_precision_score,
    matthews_corrcoef, log_loss
)
import numpy as np

def get_metrics(y_true, y_pred, y_prob):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    gmean = np.sqrt(recall * specificity)
    balanced_acc = balanced_accuracy_score(y_true, y_pred)
    mcc = matthews_corrcoef(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_prob)
    pr_auc = average_precision_score(y_true, y_prob)
    zero_one_loss = 1 - accuracy_score(y_true, y_pred)
    logloss = log_loss(y_true, y_prob)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Balanced Accuracy": balanced_acc,
        "Precision": precision,
        "Recall (Sensitivity)": recall,
        "Specificity": specificity,
        "F1-Score": f1,
        "G-Mean": gmean,
        "MCC": mcc,
        "ROC-AUC": roc_auc,
        "PR-AUC": pr_auc,
        "0-1 Loss": zero_one_loss,
        "Log-loss": logloss
    }


In [None]:
# Metrics for comparison
# All models are scored at the same decision threshold so the comparison is like-for-like
# For baseline
metrics_baseline = get_metrics(y_test, y_score1, y_prob_preds1)

# For custom oversampling
metrics_custom = get_metrics(y_test, y_score2, y_prob_preds2)

# For SMOTE
metrics_smote = get_metrics(y_test, (y_pred_prob_smote > threshold).astype(int), y_pred_prob_smote)

# For ADASYN
metrics_ada = get_metrics(y_test, (y_pred_prob_ada > threshold).astype(int), y_pred_prob_ada)

metrics_df = pd.DataFrame([
    metrics_baseline,  # baseline model
    metrics_custom,    # custom oversampling model
    metrics_smote,     # SMOTE
    metrics_ada        # ADASYN
], index=["Baseline", "Custom Oversampling", "SMOTE", "ADASYN"]).T

# Round and highlight the largest value in each row
# (for "0-1 Loss" and "Log-loss" lower is better, so the highlight there does not mark the best model)
styled_metrics = (
    metrics_df.round(4)
    .style.highlight_max(axis=1, props='background-color: lightgreen; font-weight: bold;')
    .format("{:.4f}")
)

styled_metrics



In [None]:
# calibration curve (for all models)
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))

for label, y_prob in [
    ("Baseline", y_prob_preds1),
    ("Custom Oversampling", y_prob_preds2),
    ("SMOTE", y_pred_prob_smote),
    ("ADASYN", y_pred_prob_ada)
]:
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
    plt.plot(prob_pred, prob_true, marker='o', label=label)

plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve (All Models)')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
#all learning curves for all models
from sklearn.model_selection import learning_curve
import numpy as np

def get_learning_curve(estimator, X, y, scoring='f1', cv=5):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, scoring=scoring, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10), shuffle=True, random_state=42
    )
    train_mean = np.mean(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    return train_sizes, train_mean, test_mean

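# Collect learning curves (use the respective train sets for each model).
# Note: learning_curve refits each estimator on the data passed here; for SMOTE and ADASYN
# this is the original (non-resampled) training data.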
curves = []
curves.append(('Baseline', *get_learning_curve(model1, X_train_model1, y_train_model1)))
curves.append(('Custom Oversampling', *get_learning_curve(model2, X_train_model2, y_train_model2)))
curves.append(('SMOTE', *get_learning_curve(model_smote, X_train_smote, y_train_smote)))
curves.append(('ADASYN', *get_learning_curve(model_ada, X_train_ada, y_train_ada)))

# Plot all on one figure
plt.figure(figsize=(10,7))
for label, train_sizes, train_mean, test_mean in curves:
    plt.plot(train_sizes, test_mean, marker='o', label=f'CV (Test) {label}')
    plt.plot(train_sizes, train_mean, marker='.', linestyle='--', label=f'Train {label}')

plt.title("Learning Curves (F1 Score) - All Models")
plt.xlabel("Training examples")
plt.ylabel("F1 Score")
plt.legend()
plt.grid(True)
plt.show()
