# Binary Classification Model Comparison

This notebook explores and compares multiple machine learning methods for a **binary classification problem** using ranked *League of Legends* solo/duo match data.

The primary goals of this analysis are:
- To **train and evaluate** a variety of classification models
- To **identify top-performing models** based on validation performance
- To **verify robustness** using both **shuffle testing** and **k-fold cross-validation**

The notebook includes:
- Data preprocessing and feature engineering
- Implementation of various classification models (e.g., logistic regression, ensemble methods, neural networks)
- Model evaluation and comparison
- Post-model validation using shuffle tests and k-fold cross-validation

The final result is a selection of the best-performing models, supported by rigorous validation to assess generalization performance.

*Prepared by Barrett James McDonald | PhD Student, University of South Florida*

In [1]:
#python libraries
import numpy as np
import pandas as pd
import numpy.linalg as LA

#data preprocessing & splitting
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

#dimensionality reduction
from sklearn.decomposition import PCA

#classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

#gradient boosting models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

#evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

## 1. Data Preprocessing and Subsampling

This section loads the raw match data, removes irrelevant metadata, filters for ranked solo/duo *CLASSIC* games, and prepares the cleaned numerical dataset for modeling. A random subsample of 2,500 observations is used to enable fast experimentation.


In [4]:
# Load the data
df_csv = pd.read_csv("league_data.csv", dtype={'win': str})

# Drop irrelevant/metadata columns
columns_to_drop = [
    'game_id', 'game_version', 'participant_id', 'puuid', 'summoner_name', 'summoner_id',
    'solo_tier', 'solo_rank', 'solo_lp', 'solo_wins', 'solo_losses',
    'flex_tier', 'flex_rank', 'flex_lp', 'flex_wins', 'flex_losses',
    'champion_mastery_lastPlayTime', 'champion_mastery_lastPlayTime_utc',
    'champion_id', 'map_id', 'platform_id', 'game_type', 'team_id',
    'game_start_utc', 'queue_id', 'game_mode'
]

# Filter for CLASSIC + ranked solo/duo games
df_filtered = df_csv[(df_csv['game_mode'] == 'CLASSIC') & (df_csv['queue_id'] == 420)].copy()

# Drop metadata columns
df_filtered_cleaned = df_filtered.drop(columns=[col for col in columns_to_drop if col in df_filtered.columns])

# Convert 'win' column to binary
df_filtered_cleaned['win'] = (df_filtered_cleaned['win'] == 'TRUE').astype(int)

# Drop non-numeric/categorical columns (and item columns)
df_numeric_only = df_filtered_cleaned.drop(columns=df_filtered_cleaned.select_dtypes(include=['object', 'category']).columns)
df_numeric_only = df_numeric_only.drop(columns=[col for col in df_numeric_only.columns if col.startswith("item")])

# Final predictor/response matrices
X = df_numeric_only.drop(columns=['win']).fillna(df_numeric_only.mean())
y = df_numeric_only['win']

# --- Subsample preparation (2,500 observations) ---
sample_indices = np.random.choice(X.index, size=2500, replace=False)
X_sample = X.loc[sample_indices]
y_sample = y.loc[sample_indices]

## 2. Modeling: Robust PCA + Logistic Regression

This section uses a manually implemented **Robust PCA** to decompose the scaled predictor matrix into low-rank and sparse components. The low-rank matrix is used for further dimensionality reduction via PCA before fitting a logistic regression classifier.

Performance metrics (Accuracy, F1 Score, ROC AUC) are computed on a held-out test set.

In [7]:
# --- Defining Robust PCA Manual Calculation of L and S ---
def robust_pca_fast(M, max_iter=150, tol=1e-4):
    """
    Perform Robust Principal Component Analysis (RPCA) on matrix M.
    Decomposes M into L (low-rank) and S (sparse) components using
    the Principal Component Pursuit algorithm.

    Args:
        M (np.ndarray): Input data matrix (rows = observations, cols = features)
        max_iter (int): Maximum number of iterations
        tol (float): Convergence tolerance (Frobenius norm of residual)

    Returns:
        L (np.ndarray): Low-rank matrix
        S (np.ndarray): Sparse matrix
    """
    
    # --- Helper function: soft-thresholding operator ---
    def shrinkage_operator(x, tau):
        # Applies soft-thresholding elementwise
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.)

    # --- Helper function: thresholded SVD ---
    def svd_thresholding_operator(X, tau):
        # Applies singular value thresholding to keep only large singular values
        U, S, Vh = LA.svd(X, full_matrices=False)
        S_thresh = shrinkage_operator(S, tau)
        return U @ np.diag(S_thresh) @ Vh

    # --- Initialization ---
    S = np.zeros_like(M)              # Start with zero sparse matrix
    Y = np.zeros_like(M)              # Lagrange multiplier (dual variable)
    mu = np.prod(M.shape) / (4.0 * LA.norm(M, ord=1))  # Step size parameter
    mu_inv = 1.0 / mu
    lam = 1.0 / np.sqrt(np.max(M.shape))               # Regularization parameter

    for _ in range(max_iter):
        # --- Low-rank update via SVD thresholding ---
        L = svd_thresholding_operator(M - S + mu_inv * Y, mu_inv)

        # --- Sparse matrix update via elementwise shrinkage ---
        S = shrinkage_operator(M - L + mu_inv * Y, lam * mu_inv)

        # --- Dual variable update (Lagrange multiplier) ---
        Y = Y + mu * (M - L - S)

        # --- Check convergence ---
        error = LA.norm(M - L - S, ord='fro')
        if error < tol:
            break

    return L, S

# --- Preprocessing and robust PCA ---
scaler_raw = StandardScaler()
X_scaled_sample = scaler_raw.fit_transform(X_sample)
L_sample, S_sample = robust_pca_fast(X_scaled_sample)

# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(L_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- PCA and Logistic Regression ---
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_pca, y_train)
y_pred = log_reg.predict(X_test_pca)
y_proba = log_reg.predict_proba(X_test_pca)[:, 1]

# --- Performance output ---
print("PCLR Subsample Results (2,500 rows):")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba):.4f}")

PCLR Subsample Results (2,500 rows):
Accuracy:  0.8293
F1 Score:  0.8232
ROC AUC:   0.8947


## 3. Logistic Regression (No PCA): L1 vs L2 Regularization

This section evaluates logistic regression models trained directly on the full set of numeric features (no dimensionality reduction).

Two forms of regularization are compared:
- **L2 (Ridge):** Penalizes the squared magnitude of coefficients. Tends to shrink coefficients uniformly but keeps all features.
- **L1 (Lasso):** Penalizes the absolute value of coefficients. Can set some coefficients exactly to zero, thus performing feature selection.

We apply both penalties to the same training/test split of a 2,500-observation subsample and evaluate their performance.

In [10]:
# --- Split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- L2 (Ridge) Regularization ---
log_reg_l2 = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
log_reg_l2.fit(X_train_scaled, y_train)
y_pred_l2 = log_reg_l2.predict(X_test_scaled)
y_proba_l2 = log_reg_l2.predict_proba(X_test_scaled)[:, 1]

# --- L1 (Lasso) Regularization ---
log_reg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
log_reg_l1.fit(X_train_scaled, y_train)
y_pred_l1 = log_reg_l1.predict(X_test_scaled)
y_proba_l1 = log_reg_l1.predict_proba(X_test_scaled)[:, 1]

# --- Print results ---
print("Logistic Regression without PCA:")

print("L2 Regularization:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_l2):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_l2):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_l2):.4f}\n")

print("L1 Regularization:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_l1):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_l1):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_l1):.4f}")

Logistic Regression WITHOUT PCA
L2 Regularization:
Accuracy:  0.8893
F1 Score:  0.8846
ROC AUC:   0.9461

L1 Regularization:
Accuracy:  0.8867
F1 Score:  0.8821
ROC AUC:   0.9476


## 4. Decision Tree Classifier

This section implements a basic **Decision Tree Classifier** trained on the same scaled numeric data used in previous models.

Decision Trees are intuitive and interpretable models that recursively split the feature space to classify observations. While they tend to overfit without pruning or regularization, they offer a useful baseline for tree-based methods like Random Forest and Gradient Boosted Trees.

We evaluate the model on a 70/30 train-test split and report standard classification metrics.

In [13]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train Decision Tree ---
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train_scaled, y_train)
y_pred_tree = tree_clf.predict(X_test_scaled)
y_proba_tree = tree_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Decision Tree Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_tree):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_tree):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_tree):.4f}")

Decision Tree Results:
Accuracy:  0.7667
F1 Score:  0.7566
ROC AUC:   0.7662


## 5. Random Forest Classifier

This section applies a **Random Forest**, an ensemble learning method that builds a collection of decision trees and combines their predictions to improve generalization.

Random Forests reduce overfitting by:
- Training each tree on a random bootstrap sample of the data
- Using a random subset of features at each split

Here, we train a forest of 100 trees on a 70/30 split of the scaled data and report accuracy, F1 score, and ROC AUC. This provides a more robust baseline than a single decision tree.

In [19]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train Random Forest ---
forest_clf = RandomForestClassifier(n_estimators=100)
forest_clf.fit(X_train_scaled, y_train)
y_pred_forest = forest_clf.predict(X_test_scaled)
y_proba_forest = forest_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Random Forest Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_forest):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_forest):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_forest):.4f}")

Random Forest Results:
Accuracy:  0.8480
F1 Score:  0.8417
ROC AUC:   0.9284


## 6. XGBoost Classifier

This section trains an **Extreme Gradient Boosting (XGBoost)** classifier, a high-performance ensemble method known for its ability to handle:
- Imbalanced classes
- Nonlinear feature interactions
- Feature importance and missing data

XGBoost builds trees sequentially, where each new tree tries to correct the errors of the previous one, minimizing a specified loss function—in this case, **log loss**.

We train the model on scaled data with default hyperparameters and evaluate using accuracy, F1 score, and ROC AUC.

In [24]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train XGBoost ---
xgb_clf = XGBClassifier(eval_metric='logloss')
xgb_clf.fit(X_train_scaled, y_train)
y_pred_xgb = xgb_clf.predict(X_test_scaled)
y_proba_xgb = xgb_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("XGBoost Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_xgb):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_xgb):.4f}")

XGBoost Results:
Accuracy:  0.8773
F1 Score:  0.8740
ROC AUC:   0.9424


## 7. LightGBM Classifier

This section fits a **LightGBM (Light Gradient Boosting Machine)** model—an efficient gradient boosting framework that uses histogram-based algorithms for faster training and lower memory usage.

Compared to XGBoost, LightGBM is:
- Faster on large datasets with many features
- Capable of handling categorical variables natively (though we use numeric-only data here)
- Often just as accurate (or better) with less tuning

We train LightGBM on a 70/30 train-test split of scaled data, evaluating its predictive performance with standard classification metrics.

In [29]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
# Convert scaled arrays back to DataFrames to retain column names and indices
# This is useful for model types (like LightGBM) that can optionally use feature names for better interpretability,
# and it keeps the structure consistent if we later want to analyze feature importances or visualize results.
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# --- Train LightGBM ---
lgbm_clf = LGBMClassifier(verbose=-1)
lgbm_clf.fit(X_train_scaled, y_train)
y_pred_lgbm = lgbm_clf.predict(X_test_scaled)
y_proba_lgbm = lgbm_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("LightGBM Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_lgbm):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_lgbm):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_lgbm):.4f}")

LightGBM Results:
Accuracy:  0.8613
F1 Score:  0.8575
ROC AUC:   0.9336


## 8. Support Vector Machine (RBF Kernel)

This section trains a **Support Vector Machine (SVM)** with a radial basis function (RBF) kernel. SVMs aim to find the optimal hyperplane that separates classes with the **maximum margin** in a high-dimensional space.

Key notes:
- The **RBF kernel** allows the model to learn nonlinear decision boundaries by implicitly mapping features into a higher-dimensional space.
- We set `probability=True` to enable **probability estimates**, which are required for computing the **ROC AUC** score.

Although SVMs can be computationally intensive, especially on large datasets, they often perform well with clean, scaled features.

In [36]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train SVM (with probability enabled for ROC AUC) ---
svm_clf = SVC(probability=True, kernel='rbf')
svm_clf.fit(X_train_scaled, y_train)
y_pred_svm = svm_clf.predict(X_test_scaled)
y_proba_svm = svm_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Support Vector Machine Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_svm):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_svm):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_svm):.4f}")

Support Vector Machine Results:
Accuracy:  0.8320
F1 Score:  0.8274
ROC AUC:   0.9179


## 9. K-Nearest Neighbors (KNN)

This section applies the **K-Nearest Neighbors (KNN)** algorithm using \( k = 5 \). KNN is a **non-parametric** method that classifies a sample based on the majority label among its nearest neighbors in feature space.

Key characteristics:
- Simple, interpretable, and effective with well-scaled, low-dimensional data
- Sensitive to irrelevant features and class imbalance
- Performance depends heavily on the choice of **k** and the distance metric

Here, we standardize features and evaluate KNN on a 70/30 split of a 2,500-observation subsample.

In [41]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train KNN ---
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
y_pred_knn = knn_clf.predict(X_test_scaled)
y_proba_knn = knn_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("K-Nearest Neighbors Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_knn):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_knn):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_knn):.4f}")

K-Nearest Neighbors Results:
Accuracy:  0.7813
F1 Score:  0.7595
ROC AUC:   0.8527


## 10. Naive Bayes Classifier

This section implements a **Naive Bayes classifier**, specifically using the **GaussianNB** variant. Naive Bayes is a probabilistic model based on Bayes’ Theorem, assuming **feature independence** given the class label.

Why it matters:
- Surprisingly effective in high-dimensional settings
- Fast to train and easy to interpret
- Works best when features are roughly independent (which is rare, but doesn’t always hurt performance)

We fit the model on scaled data and evaluate using accuracy, F1 score, and ROC AUC.

In [44]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train Naive Bayes ---
nb_clf = GaussianNB()
nb_clf.fit(X_train_scaled, y_train)
y_pred_nb = nb_clf.predict(X_test_scaled)
y_proba_nb = nb_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Naive Bayes Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_nb):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_nb):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_nb):.4f}")

Naive Bayes Results:
Accuracy:  0.6880
F1 Score:  0.6422
ROC AUC:   0.7606


## 11. Neural Network: Multi-Layer Perceptron (MLP)

This final model is a **Neural Network**, specifically a **Multi-Layer Perceptron (MLP)**. MLPs are **feedforward neural networks** that learn complex, nonlinear relationships through layers of interconnected neurons.

Model architecture:
- One hidden layer with 100 neurons
- Uses ReLU activation (default)
- Optimized with the Adam solver
- Trained for up to 500 iterations (or until convergence)

While MLPs often require more tuning and training time, they can outperform traditional models when the data contains deep, abstract patterns.

In [47]:
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train Neural Network (MLP) ---
mlp_clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
mlp_clf.fit(X_train_scaled, y_train)
y_pred_mlp = mlp_clf.predict(X_test_scaled)
y_proba_mlp = mlp_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Neural Network (MLP) Results:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_mlp):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_mlp):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_mlp):.4f}")

Neural Network (MLP) Results:
Accuracy:  0.8760
F1 Score:  0.8724
ROC AUC:   0.9424


## 12. Tuned Neural Network (MLP) on Full Dataset

In this final model, we revisit the **Multi-Layer Perceptron (MLP)** and apply **hyperparameter tuning** for improved performance. Key changes include:

- **Architecture:** Two hidden layers with 128 and 64 neurons
- **Activation:** ReLU (Rectified Linear Unit)
- **Optimizer:** Adam (adaptive learning rate)
- **Training:** 1,000 epochs (or until convergence)
- **Dataset:** Full cleaned dataset (not a subsample)

These changes aim to increase the model's capacity to learn complex, nonlinear relationships across the full feature space.

In [50]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.3, stratify=y_full)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Tuned MLP Neural Network ---
mlp_tuned = MLPClassifier(hidden_layer_sizes=(128, 64), activation='relu', solver='adam', max_iter=1000)
mlp_tuned.fit(X_train_scaled, y_train)
y_pred_mlp = mlp_tuned.predict(X_test_scaled)
y_proba_mlp = mlp_tuned.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("Tuned Neural Network Results (Full Dataset):")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_mlp):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_mlp):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_mlp):.4f}")

Tuned Neural Network Results (Full Dataset):
Accuracy:  0.8778
F1 Score:  0.8790
ROC AUC:   0.9461


## 13. LightGBM on Full Dataset

This model revisits **LightGBM**, this time training on the entire preprocessed dataset rather than a subsample.

By scaling and fitting LightGBM on the full data, we aim to:
- Leverage more information for training
- Capture rarer patterns that may not be present in smaller subsamples
- Potentially improve performance and stability

We retain feature names by converting the scaled arrays back into DataFrames, which can help with future interpretability and feature importance analysis.

In [53]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Train/test split and scale ---
# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, stratify=y_sample)
scaler = StandardScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# --- Train LightGBM ---
lgbm_clf = LGBMClassifier(verbose=-1)
lgbm_clf.fit(X_train_scaled, y_train)
y_pred_lgbm = lgbm_clf.predict(X_test_scaled)
y_proba_lgbm = lgbm_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("LightGBM Results (Full Dataset):")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_lgbm):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_lgbm):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_lgbm):.4f}")

LightGBM Results (Full Dataset):
Accuracy:  0.8640
F1 Score:  0.8599
ROC AUC:   0.9427


## 14. XGBoost on Full Dataset

This section returns to **XGBoost**, but this time we train on the entire preprocessed dataset instead of a subsample.

Key details:
- Full dataset provides a richer training signal, potentially capturing subtler relationships
- XGBoost is configured to minimize **log loss**, which is appropriate for binary classification with probabilistic outputs
- Data is standardized before training, though XGBoost can handle unscaled input—scaling helps ensure fair comparison across models

We evaluate model performance using accuracy, F1 score, and ROC AUC.

In [56]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.3, stratify=y_full)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Train XGBoost ---
xgb_clf = XGBClassifier(eval_metric='logloss')
xgb_clf.fit(X_train_scaled, y_train)
y_pred_xgb = xgb_clf.predict(X_test_scaled)
y_proba_xgb = xgb_clf.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("XGBoost Results (Full Dataset):")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_xgb):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_xgb):.4f}")

XGBoost Results (Full Dataset):
Accuracy:  0.8972
F1 Score:  0.8964
ROC AUC:   0.9648


## 15. L1-Regularized Logistic Regression on Full Dataset

Here we train a **logistic regression model with L1 (Lasso) regularization** on the full dataset.

L1 regularization:
- Encourages sparsity in the model coefficients (i.e., some become exactly zero)
- Effectively performs **feature selection** by shrinking less informative features
- Helps prevent overfitting when many features are present

We scale the full dataset before training, and evaluate using accuracy, F1 score, and ROC AUC.

In [59]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Train/test split and scale ---
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.3, stratify=y_full)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- L1 (Lasso) Logistic Regression ---
log_reg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
log_reg_l1.fit(X_train_scaled, y_train)
y_pred_l1 = log_reg_l1.predict(X_test_scaled)
y_proba_l1 = log_reg_l1.predict_proba(X_test_scaled)[:, 1]

# --- Evaluate performance ---
print("L1 Logistic Regression Results (Full Dataset):")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_l1):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred_l1):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba_l1):.4f}")

L1 Logistic Regression Results (Full Dataset):
Accuracy:  0.8848
F1 Score:  0.8826
ROC AUC:   0.9482


## 16. Shuffle Test for Model Validation (LightGBM & XGBoost)

To validate the performance of the top-performing models (**LightGBM** and **XGBoost**), we perform a **shuffle test**:

- The target labels (`y`) are randomly permuted, breaking any true association between the features (`X`) and the outcome.
- We then train and evaluate the models on this **mismatched X/y** pairing.
- If model performance drops significantly (as it should), we gain confidence that the original performance was not

In [62]:
# --- Shuffle labels independently ---
y_shuffled = y_full.sample(frac=1).reset_index(drop=True)
X_shuffled = X_full.reset_index(drop=True)  # Align index with shuffled y

# --- Train/test split and scale (on mismatched X/y) ---
X_train, X_test, y_train_shuffled, y_test_shuffled = train_test_split(
    X_shuffled, y_shuffled, test_size=0.3, stratify=y_shuffled)

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

# --- LightGBM on shuffled labels ---
lgbm_clf = LGBMClassifier(verbose=-1)
lgbm_clf.fit(X_train_scaled, y_train_shuffled)
y_pred_lgbm = lgbm_clf.predict(X_test_scaled)
y_proba_lgbm = lgbm_clf.predict_proba(X_test_scaled)[:, 1]

# --- XGBoost on shuffled labels ---
xgb_clf = XGBClassifier(eval_metric='logloss')
xgb_clf.fit(X_train_scaled, y_train_shuffled)
y_pred_xgb = xgb_clf.predict(X_test_scaled)
y_proba_xgb = xgb_clf.predict_proba(X_test_scaled)[:, 1]

# --- Results ---
print("LightGBM (Shuffled Labels):")
print(f"Accuracy:  {accuracy_score(y_test_shuffled, y_pred_lgbm):.4f}")
print(f"F1 Score:  {f1_score(y_test_shuffled, y_pred_lgbm):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test_shuffled, y_proba_lgbm):.4f}\n")

print("XGBoost (Shuffled Labels):")
print(f"Accuracy:  {accuracy_score(y_test_shuffled, y_pred_xgb):.4f}")
print(f"F1 Score:  {f1_score(y_test_shuffled, y_pred_xgb):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test_shuffled, y_proba_xgb):.4f}")

LightGBM (Shuffled Labels):
Accuracy:  0.5109
F1 Score:  0.5154
ROC AUC:   0.5124

XGBoost (Shuffled Labels):
Accuracy:  0.4934
F1 Score:  0.4944
ROC AUC:   0.4923


## 17. K-Fold Cross-Validation (LightGBM & XGBoost)

As a final step in validating model performance, we apply **5-fold stratified cross-validation** to the top two models: **LightGBM** and **XGBoost**.

This method ensures that:
- Every observation is used in both training and testing
- Class proportions are preserved in each fold (stratified)
- We obtain a **distribution of ROC AUC scores** across folds, allowing us to assess stability and generalization

Results are reported as individual fold AUCs, along with their mean and standard deviation.

In [65]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Scale full data ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_full)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index) #will need to be a dataframe with named columns for the loop below

# --- 5-Fold Stratified CV ---
kf = StratifiedKFold(n_splits=5, shuffle=True)

lgbm_aucs = []
xgb_aucs = []

for train_index, test_index in kf.split(X_scaled_df, y_full):
    X_train, X_test = X_scaled_df.iloc[train_index], X_scaled_df.iloc[test_index]
    y_train, y_test = y_full.iloc[train_index], y_full.iloc[test_index]

    # LightGBM
    lgbm_model = LGBMClassifier(verbose=-1)
    lgbm_model.fit(X_train, y_train)
    lgbm_proba = lgbm_model.predict_proba(X_test)[:, 1]
    lgbm_auc = roc_auc_score(y_test, lgbm_proba)
    lgbm_aucs.append(lgbm_auc)
    
    # XGBoost
    xgb_model = XGBClassifier(eval_metric='logloss')
    xgb_model.fit(X_train, y_train)
    xgb_proba = xgb_model.predict_proba(X_test)[:, 1]
    xgb_auc = roc_auc_score(y_test, xgb_proba)
    xgb_aucs.append(xgb_auc)

# --- Results ---
print("LightGBM K-Fold ROC AUCs:", np.round(lgbm_aucs, 4))
print(f"Mean AUC: {np.mean(lgbm_aucs):.4f} | Std Dev: {np.std(lgbm_aucs):.4f}\n")

print("XGBoost K-Fold ROC AUCs:", np.round(xgb_aucs, 4))
print(f"Mean AUC: {np.mean(xgb_aucs):.4f} | Std Dev: {np.std(xgb_aucs):.4f}")

LightGBM K-Fold ROC AUCs: [0.9652 0.9671 0.9647 0.9653 0.966 ]
Mean AUC: 0.9657 | Std Dev: 0.0008

XGBoost K-Fold ROC AUCs: [0.9664 0.9666 0.9637 0.9661 0.9647]
Mean AUC: 0.9655 | Std Dev: 0.0011


## 18. L1-Regularized Logistic Regression: Shuffle Test & Cross-Validation

To validate the performance of the L1-regularized logistic regression model, we apply two key methods:

### 1. Shuffle Test
- The response variable (`y`) is randomly shuffled to break any feature-target relationships.
- The model is then retrained and evaluated.
- A **low ROC AUC** score indicates that the model’s original performance was based on **real structure** in the data.

### 2. 5-Fold Stratified Cross-Validation
- We use stratified folds to preserve class balance.
- Performance is measured using **ROC AUC** across all folds.
- We report both the fold-wise scores and their **mean and standard deviation**.

These validation steps provide confidence that the logistic model generalizes well and is not simply overfitting the dataset.

In [68]:
# --- Full dataset (already preprocessed into X and y) ---
X_full = X.copy()
y_full = y.copy()

# --- Scale full data ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_full)

# ------------------------------
# Shuffle Test (L1 Logistic)
# ------------------------------
_, y_shuffled = shuffle(X_scaled, y_full)

X_train, X_test, y_train_shuff, y_test_shuff = train_test_split(X_scaled, y_shuffled, test_size=0.3, stratify=y_shuffled)

logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
logreg_l1.fit(X_train, y_train_shuff)
y_proba_shuff = logreg_l1.predict_proba(X_test)[:, 1]

print("L1 Logistic Regression – Shuffle Test")
print(f"ROC AUC (Shuffled): {roc_auc_score(y_test_shuff, y_proba_shuff):.4f}")
print()

# ------------------------------
# 5-Fold Cross-Validation
# ------------------------------
kf = StratifiedKFold(n_splits=5, shuffle=True)
auc_scores = []

for train_index, test_index in kf.split(X_scaled, y_full):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y_full.iloc[train_index], y_full.iloc[test_index]
    
    model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    auc_scores.append(auc)

print("L1 Logistic Regression – 5-Fold Cross-Validation")
print("AUCs:", np.round(auc_scores, 4))
print(f"Mean AUC: {np.mean(auc_scores):.4f} | Std Dev: {np.std(auc_scores):.4f}")

L1 Logistic Regression – Shuffle Test
ROC AUC (Shuffled): 0.5040

L1 Logistic Regression – 5-Fold Cross-Validation
AUCs: [0.9513 0.9426 0.9478 0.9533 0.943 ]
Mean AUC: 0.9476 | Std Dev: 0.0043


## 19. Neural Network Validation: Shuffle Test & Cross-Validation

To ensure the reliability of the tuned **MLP neural network**, we validate its performance using both:

### 1. Shuffle Test
- The target labels are randomly permuted to destroy true patterns.
- A strong model should perform poorly in this setting.
- A ROC AUC near 0.5 confirms the model is not just fitting noise.

### 2. 5-Fold Stratified Cross-Validation
- The dataset is split into 5 stratified folds to preserve class distribution.
- The model is trained and evaluated across all folds.
- Mean and standard deviation of ROC AUC scores are reported to assess generalizability.

This double validation confirms whether the neural network is truly learning structure from the data.

In [71]:
# --- Full dataset (already preprocessed into X and y) ---
X_nn = X.copy()
y_nn = y.copy()

# =========================
# Shuffle Test
# =========================
_, y_shuffled = shuffle(X_nn, y_nn)

X_train, X_test, y_train_shuff, y_test_shuff = train_test_split(
    X_nn, y_shuffled, test_size=0.3, stratify=y_shuffled
)

mlp_shuff = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=1000)
mlp_shuff.fit(X_train, y_train_shuff)
y_proba_shuff = mlp_shuff.predict_proba(X_test)[:, 1]

print("MLP – Shuffle Test")
print(f"ROC AUC (Shuffled): {roc_auc_score(y_test_shuff, y_proba_shuff):.4f}\n")

# =========================
# 5-Fold Cross-Validation
# =========================
kf = StratifiedKFold(n_splits=5, shuffle=True)
auc_scores = []

for train_index, test_index in kf.split(X_nn, y_nn):
    X_train, X_test = X_nn.iloc[train_index], X_nn.iloc[test_index]
    y_train, y_test = y_nn.iloc[train_index], y_nn.iloc[test_index]
    
    model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=1000)
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    auc_scores.append(auc)

print("MLP – 5-Fold Cross-Validation")
print("AUCs:", np.round(auc_scores, 4))
print(f"Mean AUC: {np.mean(auc_scores):.4f} | Std Dev: {np.std(auc_scores):.4f}")

MLP – Shuffle Test
ROC AUC (Shuffled): 0.4990

MLP – 5-Fold Cross-Validation
AUCs: [0.8819 0.8776 0.8375 0.7912 0.8698]
Mean AUC: 0.8516 | Std Dev: 0.0340
