# Lab 3 - Advanced Cross-Validation & Reliable Model Comparison

## Task 0: Imports

In [2]:
# Libraries and Packages
# all the imports live here so I don't have to repeat them later
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# display() is a notebook builtin; importing it keeps the cells runnable as a script too
from IPython.display import display

# model + pipeline pieces
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import (
    train_test_split,
    KFold,
    StratifiedKFold,
    StratifiedShuffleSplit,
    RepeatedKFold,
    RepeatedStratifiedKFold,
    cross_val_score,
    GridSearchCV,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# metrics
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    average_precision_score,
    f1_score,
    brier_score_loss,
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

## Task 1: Setup & baselines
### Goal: A reminder of what bad protocol looks like

In [3]:
# loading breast cancer dataset as a pandas frame
bc = load_breast_cancer(as_frame=True)
X_bc = bc.data
y_bc = bc.target
# loading wine dataset as a pandas frame
wine = load_wine(as_frame=True)
X_wine = wine.data
y_wine = wine.target
# defining base pipelines for the three models
pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))
])
pipe_rf = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(
        n_estimators=200,
        random_state=42
    ))
])
pipe_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(
        kernel="rbf",
        probability=True,
        random_state=42
    ))
])

In [4]:


# small helper to evaluate a model under a given CV object and scoring
def eval_cv(model, X, y, cv, scoring):
    """
    Returns mean, std, and elapsed time for a given model/CV/scoring combo.
    """
    start = time.perf_counter()
    scores = cross_val_score(
        model,
        X,
        y,
        cv=cv,
        scoring=scoring,
        n_jobs=-1
    )
    elapsed = time.perf_counter() - start
    return scores.mean(), scores.std(), elapsed
# global container so I can build a final results table later
results_rows = []
# helper to log one row into the global results list
def log_result(dataset, model, cv_strategy, metric_mean, metric_std, time_sec, metric_name):
    results_rows.append({
        "dataset": dataset,
        "model": model,
        "cv_strategy": cv_strategy,
        "metric_name": metric_name,
        "metric_mean": metric_mean,
        "metric_std": metric_std,
        "time_sec": time_sec
    })
    

In [5]:
# doing a single 80/20 train/test split for breast cancer (stratified)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc,
    y_bc,
    test_size=0.2,
    random_state=42,
    stratify=y_bc
)

# doing a single 80/20 train/test split for wine (also stratified, so all three classes appear in both parts)
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(
    X_wine,
    y_wine,
    test_size=0.2,
    random_state=42,
    stratify=y_wine
)

# creating a fresh logistic regression pipeline to use for both datasets
baseline_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))
])

# fitting the baseline model on breast cancer train split
baseline_lr.fit(X_bc_train, y_bc_train)

# evaluating accuracy on the hold-out test split (breast cancer)
y_bc_test_pred = baseline_lr.predict(X_bc_test)
bc_test_acc = accuracy_score(y_bc_test, y_bc_test_pred)

# evaluating ROC-AUC on the hold-out test split (breast cancer)
y_bc_test_proba = baseline_lr.predict_proba(X_bc_test)[:, 1]
bc_test_roc = roc_auc_score(y_bc_test, y_bc_test_proba)

# refitting the same pipeline on the wine train split
baseline_lr.fit(X_wine_train, y_wine_train)

# evaluating accuracy on the hold-out test split (wine)
y_wine_test_pred = baseline_lr.predict(X_wine_test)
wine_test_acc = accuracy_score(y_wine_test, y_wine_test_pred)

# evaluating macro-F1 on the hold-out test split (wine)
wine_test_f1_macro = f1_score(
    y_wine_test,
    y_wine_test_pred,
    average="macro"
)

# printing baseline metrics so I can reference them in my writeup
print("Breast Cancer baseline logistic regression (single split)")
print(f"Test accuracy:  {bc_test_acc}")
print(f"Test ROC-AUC:   {bc_test_roc}")

print("Wine baseline logistic regression (single split)")
print(f"Test accuracy:  {wine_test_acc}")
print(f"Test macro-F1:  {wine_test_f1_macro}")


Breast Cancer baseline logistic regression (single split)
Test accuracy:  0.9824561403508771
Test ROC-AUC:   0.9953703703703703
Wine baseline logistic regression (single split)
Test accuracy:  0.9722222222222222
Test macro-F1:  0.9709618874773139


### Task 1 Markdown - bad protocol

We begin Task 1 by importing the breast cancer and wine datasets and defining a base pipeline for each model. Each pipeline starts with a scaler step where, as in Lab 2, StandardScaler standardizes every numeric feature to a mean of 0 and a standard deviation of 1. This keeps features on a comparable scale and is especially important for logistic regression and support vector classification, which are sensitive to feature magnitudes. Each pipeline then ends with a classifier step (clf), which differs by pipeline.

For logistic regression, we use the lbfgs solver with max_iter=1000. The model fits a linear decision boundary in the scaled feature space by modeling the log-odds of the positive class. The large iteration cap gives the optimizer room to converge on its objective, which in scikit-learn's default configuration is the logistic loss plus an L2 penalty.

For random forest, we use an ensemble of 200 decision trees. Averaging over this many trees helps reduce variance, and we set random_state for reproducible results. Random forest does not strictly need scaling, but we keep the scaler for consistency across pipelines.

For support vector classification, SVC(kernel="rbf", probability=True, random_state=42), we use a non-linear SVM that can learn curved decision boundaries. The Gaussian radial basis function (RBF) kernel measures how close two points are in the scaled feature space: points that are close get a similarity near 1, and points that are far apart get a similarity near 0. This lets the model draw flexible, curved decision boundaries instead of straight lines. Because the kernel depends on distances, it is very sensitive to feature scale, so the StandardScaler step is critical. probability=True enables predict_proba (scikit-learn fits an internal Platt-scaling calibration), which we use for ROC-AUC and other probability-based metrics later.
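To make the RBF similarity described above concrete, here is a minimal sketch that is not part of the lab code. The three points and the value gamma = 0.5 are illustrative assumptions; the only library call is `sklearn.metrics.pairwise.rbf_kernel`, which computes exp(-gamma * ||x - x'||^2) for every pair of rows.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three illustrative points in a 2-D, already-scaled feature space
points = np.array([
    [0.0, 0.0],   # reference point
    [0.1, 0.0],   # very close to the reference
    [3.0, 3.0],   # far from the reference
])

gamma = 0.5  # illustrative value, not the one SVC would use on the lab data

# K[i, j] = exp(-gamma * ||points[i] - points[j]||^2)
K = rbf_kernel(points, gamma=gamma)

# Manual check of one entry against the formula
manual = np.exp(-gamma * np.sum((points[0] - points[1]) ** 2))

print(np.round(K, 4))
print("library vs manual:", K[0, 1], manual)
```

The nearby pair gets a similarity close to 1 and the distant pair a similarity close to 0, which is why unscaled features with large ranges would dominate the distance and distort the kernel.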

Then we create two helpers: one evaluates a model under a given cross-validation object and scoring metric (returning the mean, standard deviation, and elapsed time), and the other logs one row into the global results list.

For the breast cancer dataset, the single-split baseline (logistic regression on an 80/20 split) gave test accuracy of about 0.982 and ROC-AUC of about 0.995. A ROC-AUC this close to 1 means the model ranks malignant vs. benign cases very well on this one split. However, because this is a single random split on a slightly imbalanced dataset, these numbers could easily shift depending on whether the split happened to be easy or hard. On the wine data, logistic regression reached test accuracy of about 0.972 and macro-F1 of about 0.971. Macro-F1 being close to accuracy suggests the model is not ignoring any class, but again this is only one realization of the data split. Relying on a single split is risky because, on small datasets, the test set contains few samples (114 for breast cancer, 36 for wine), so the estimate has high variance: a different random split might move accuracy or ROC-AUC by several points. Our splits are stratified, which limits the damage, but without stratification a bad split could also under-represent the minority class in either train or test, inflating accuracy while hiding poor minority-class performance. Because of this, a single split is not a reliable basis for picking models or tuning hyperparameters, so later in the lab we use cross-validation to average over many different train/test partitions.
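The single-split risk can be checked directly. The sketch below is an illustration rather than part of the graded protocol: it repeats the same stratified 80/20 baseline on breast cancer for 20 different seeds (the number of seeds and the helper name `single_split_accuracy` are assumptions made for this example) and reports how much test accuracy moves.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)


def single_split_accuracy(seed):
    """Fit the baseline pipeline on one stratified 80/20 split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, solver="lbfgs")),
    ])
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# 20 seeds is an arbitrary, illustrative choice
split_accs = np.array([single_split_accuracy(seed) for seed in range(20)])

print(f"min={split_accs.min():.3f}  max={split_accs.max():.3f}  "
      f"mean={split_accs.mean():.3f}  std={split_accs.std():.3f}")
```

The gap between the minimum and maximum is the kind of swing a single reported number hides, which is the motivation for the cross-validation tasks that follow.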

## Task 2 - K-Fold & Stratified K-Fold
### Goal: Show that using the wrong CV for classification gives unstable/biased estimates.

In [6]:
# eval_cv, log_result, and results_rows are already defined in the Task 1 helper cell
# (In [4]) and are reused here; redefining them would reset results_rows

# fresh LR pipeline for the CV comparisons (same steps as pipe_lr)
cv_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))
])

# setting up plain 5-fold and stratified 5-fold CV for breast cancer
kf5_bc = KFold(n_splits=5, shuffle=True, random_state=42)
skf5_bc = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# evaluating k fold on breast cancer for accuracy
bc_kf_acc_mean, bc_kf_acc_std, bc_kf_acc_time = eval_cv(
    cv_lr, X_bc, y_bc, kf5_bc, scoring="accuracy"
)
# evaluating k fold on breast cancer for ROC-AUC
bc_kf_roc_mean, bc_kf_roc_std, bc_kf_roc_time = eval_cv(
    cv_lr, X_bc, y_bc, kf5_bc, scoring="roc_auc"
)
# evaluating stratified k fold accuracy
bc_skf_acc_mean, bc_skf_acc_std, bc_skf_acc_time = eval_cv(
    cv_lr, X_bc, y_bc, skf5_bc, scoring="accuracy"
)
# evaluating stratified k fold ROC-AUC
bc_skf_roc_mean, bc_skf_roc_std, bc_skf_roc_time = eval_cv(
    cv_lr, X_bc, y_bc, skf5_bc, scoring="roc_auc"
)

#logging ROC-AUC results for breast cancer into the global table
log_result(
    dataset="breast_cancer",
    model="log_reg",
    cv_strategy="kfold_5",
    metric_mean=bc_kf_roc_mean,
    metric_std=bc_kf_roc_std,
    time_sec=bc_kf_roc_time,
    metric_name="roc_auc"
)
log_result(
    dataset="breast_cancer",
    model="log_reg",
    cv_strategy="stratkfold_5",
    metric_mean=bc_skf_roc_mean,
    metric_std=bc_skf_roc_std,
    time_sec=bc_skf_roc_time,
    metric_name="roc_auc"
)

# setting up 5 fold and stratified 5 fold for wine
kf5_wine = KFold(n_splits=5, shuffle=True, random_state=42)
skf5_wine = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# evaluating k Fold on wine for accuracy
wine_kf_acc_mean, wine_kf_acc_std, wine_kf_acc_time = eval_cv(
    cv_lr, X_wine, y_wine, kf5_wine, scoring="accuracy"
)

# evaluating k fold for ROC-AUC (one vs rest)
wine_kf_roc_mean, wine_kf_roc_std, wine_kf_roc_time = eval_cv(
    cv_lr, X_wine, y_wine, kf5_wine, scoring="roc_auc_ovr"
)

# evaluating stratified k fold for accuracy
wine_skf_acc_mean, wine_skf_acc_std, wine_skf_acc_time = eval_cv(
    cv_lr, X_wine, y_wine, skf5_wine, scoring="accuracy"
)

# evaluating stratified k fold for ROC-AUC (one-vs-rest)
wine_skf_roc_mean, wine_skf_roc_std, wine_skf_roc_time = eval_cv(
    cv_lr, X_wine, y_wine, skf5_wine, scoring="roc_auc_ovr"
)

# logging primary ROC AUC results for wine into the global table
log_result(
    dataset="wine",
    model="log_reg",
    cv_strategy="kfold_5",
    metric_mean=wine_kf_roc_mean,
    metric_std=wine_kf_roc_std,
    time_sec=wine_kf_roc_time,
    metric_name="roc_auc_ovr"
)
log_result(
    dataset="wine",
    model="log_reg",
    cv_strategy="stratkfold_5",
    metric_mean=wine_skf_roc_mean,
    metric_std=wine_skf_roc_std,
    time_sec=wine_skf_roc_time,
    metric_name="roc_auc_ovr"
)

# building a quick comparison table for the eight evaluations (four per dataset)
cv_compare = pd.DataFrame({
    "dataset": ["breast_cancer"] * 4 + ["wine"] * 4,
    "cv_type": [
        "kf5_acc", "skf5_acc", "kf5_roc", "skf5_roc",
        "kf5_acc", "skf5_acc", "kf5_roc_ovr", "skf5_roc_ovr"
    ],
    "mean": [
        bc_kf_acc_mean, bc_skf_acc_mean, bc_kf_roc_mean, bc_skf_roc_mean,
        wine_kf_acc_mean, wine_skf_acc_mean, wine_kf_roc_mean, wine_skf_roc_mean
    ],
    "std": [
        bc_kf_acc_std, bc_skf_acc_std, bc_kf_roc_std, bc_skf_roc_std,
        wine_kf_acc_std, wine_skf_acc_std, wine_kf_roc_std, wine_skf_roc_std
    ]
})
# showing the comparison table so we can inspect which metrics move the most
display(cv_compare)


|   | dataset | cv_type | mean | std |
|---|---|---|---|---|
| 0 | breast_cancer | kf5_acc | 0.977146 | 0.008964 |
| 1 | breast_cancer | skf5_acc | 0.973669 | 0.016627 |
| 2 | breast_cancer | kf5_roc | 0.99477 | 0.005573 |
| 3 | breast_cancer | skf5_roc | 0.995314 | 0.005345 |
| 4 | wine | kf5_acc | 0.98873 | 0.013805 |
| 5 | wine | skf5_acc | 0.983333 | 0.013608 |
| 6 | wine | kf5_roc_ovr | 0.999277 | 0.000987 |
| 7 | wine | skf5_roc_ovr | 1.0 | 0.0 |


### Task 2 Markdown - using the right cross validation

In Task 2, I compare plain K-Fold vs Stratified K-Fold on both the breast cancer and wine datasets. Unlike Task 1's single 80/20 split, this task uses 5-fold cross-validation to average performance over multiple partitions and to see how stratification changes the metrics and their variability.

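Comparing accuracy on the breast cancer data, plain 5-fold K-Fold logistic regression reached about 0.977 +/- 0.009, while StratifiedKFold(5) reached roughly 0.974 +/- 0.017. The means are within one standard deviation of each other, and on this run the stratified folds actually show the larger fold-to-fold spread.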

Comparing ROC-AUC on the breast cancer data, plain K-Fold gave about 0.995 +/- 0.006 and StratifiedKFold gave about 0.995 +/- 0.005. The mean is essentially unchanged and the standard deviation drops only slightly (0.0056 to 0.0053). The direction is consistent with the idea that ROC-AUC depends on how many examples of each class appear in each fold, so keeping class ratios stable makes the estimate a little more reliable, but a difference this small over five folds is within noise.

For the wine dataset, comparing accuracy across folds shows that plain 5-fold K-Fold reached about 0.989 +/- 0.014, while StratifiedKFold(5) produced roughly 0.983 +/- 0.014. The differences here are small, which makes sense given that the wine dataset is multiclass but not strongly imbalanced. Because the class proportions are already fairly even, stratification does not dramatically change how the folds behave, so accuracy remains similar across both CV strategies.

For ROC-AUC (one-vs-rest) on the wine dataset, plain K-Fold produced about 0.999 +/- 0.001, and StratifiedKFold reached a perfect 1.000 +/- 0.000. These nearly perfect AUC values indicate that the wine classes are almost linearly separable in the scaled feature space. As a result, the model ranks the wine classes almost perfectly in every fold, regardless of whether stratification is applied.
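As a side note on what the "roc_auc_ovr" scorer measures, the sketch below (an illustration on one arbitrary 70/30 split, not one of the lab's CV runs) computes one-vs-rest ROC-AUC with `roc_auc_score(..., multi_class="ovr")` and compares it with the plain average of the three per-class binary AUCs. The split size and seed are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs")),
])
model.fit(X_tr, y_tr)

# One probability column per class, in the order of model.classes_
proba = model.predict_proba(X_te)

# Library one-vs-rest AUC (macro average over classes by default)
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr")

# Same quantity by hand: treat each class as "positive" against the rest
per_class = [
    roc_auc_score((y_te == cls).astype(int), proba[:, col])
    for col, cls in enumerate(model.classes_)
]

print("OvR AUC:", auc_ovr)
print("per-class AUCs:", per_class)
```

A score of 1.000 therefore means every class is ranked perfectly against the other two, not just that overall accuracy is high.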

Now for which metric was most affected by stratification. In absolute terms, the largest change is in breast cancer accuracy, whose standard deviation rose from 0.009 to 0.017 under StratifiedKFold, while the breast cancer ROC-AUC standard deviation shrank slightly (0.0056 to 0.0053). With a single 5-fold run, both changes are small enough to be split-to-split noise rather than a systematic effect. The intuition still holds that, on an imbalanced binary problem, ranking metrics like AUC are sensitive to how the minority class is distributed across folds, and stratification removes that source of variation. For context, PR-AUC focuses on the positive class and depends strongly on class prevalence, while ROC-AUC is built from true-positive and false-positive rates and is much less sensitive to imbalance.

Stratification matters here because stratifying the folds keeps the class proportions in each fold close to the original dataset, which is crucial for imbalanced problems like breast cancer.
On breast cancer, this helps avoid folds that are almost all majority class, which would artificially inflate accuracy or destabilize AUC.
On wine, stratification has a smaller effect but still provides slightly more consistent results, reinforcing that stratified CV is generally a safe default for classification.
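The claim that stratification keeps class proportions stable can be inspected directly. This is a minimal sketch, assuming the same seeds as the lab cells; the helper name `class1_rate_per_fold` is made up for the example. It reports the fraction of class 1 (benign) samples in each test fold under both splitters.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)


def class1_rate_per_fold(cv):
    """Return the fraction of class-1 samples in each test fold of the splitter."""
    return np.array([y[test_idx].mean() for _, test_idx in cv.split(X, y)])


kf_rates = class1_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
skf_rates = class1_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("overall class-1 rate:", round(y.mean(), 4))
print("KFold per fold:          ", np.round(kf_rates, 4))
print("StratifiedKFold per fold:", np.round(skf_rates, 4))
```

Under StratifiedKFold every fold stays within about one sample of the overall rate, while plain KFold lets the rate drift from fold to fold.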


## Task 3 - Repeated (K-Fold & Stratified) + variance analysis
### Goal: Show that one CV run is still a sample: repeating reduces variance & gives better model comparison.

In [7]:
# defining CV objects for breast cancer: a single stratified 5-fold run and
# a repeated stratified 5-fold run (3 repeats, so 15 train/test partitions)
skf5_bc = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rskf_bc = RepeatedStratifiedKFold(
    n_splits=5,
    n_repeats=3,
    random_state=42
)

# defining CV objects for wine: a single 5-fold run and a repeated (non-stratified) 5-fold run
kf5_wine = KFold(n_splits=5, shuffle=True, random_state=42)
rkf_wine = RepeatedKFold(
    n_splits=5,
    n_repeats=3,
    random_state=42
)

# storing models in a small dict so I can loop over them
models = {
    "log_reg": pipe_lr,
    "rand_forest": pipe_rf
}

# container for a summary table across models / CV strategies
rows_repeated = []

#looping over models for breast cancer with single vs repeated stratified CV
for model_name, model in models.items():
    # single stratified 5-fold on breast cancer (accuracy)
    bc_skf_acc_mean, bc_skf_acc_std, bc_skf_acc_time = eval_cv(
        model, X_bc, y_bc, skf5_bc, scoring="accuracy"
    )
    # single stratified 5-fold on breast cancer (ROC-AUC)
    bc_skf_roc_mean, bc_skf_roc_std, bc_skf_roc_time = eval_cv(
        model, X_bc, y_bc, skf5_bc, scoring="roc_auc"
    )

    # repeated stratified 5x3 on breast cancer (accuracy)
    bc_rskf_acc_mean, bc_rskf_acc_std, bc_rskf_acc_time = eval_cv(
        model, X_bc, y_bc, rskf_bc, scoring="accuracy"
    )
    # repeated stratified 5x3 on breast cancer (ROC-AUC)
    bc_rskf_roc_mean, bc_rskf_roc_std, bc_rskf_roc_time = eval_cv(
        model, X_bc, y_bc, rskf_bc, scoring="roc_auc"
    )

    # logging ROC-AUC results for breast cancer into the global results table
    log_result(
        dataset="breast_cancer",
        model=model_name,
        cv_strategy="stratkfold_5",
        metric_mean=bc_skf_roc_mean,
        metric_std=bc_skf_roc_std,
        time_sec=bc_skf_roc_time,
        metric_name="roc_auc"
    )
    log_result(
        dataset="breast_cancer",
        model=model_name,
        cv_strategy="repeated_stratkfold_5x3",
        metric_mean=bc_rskf_roc_mean,
        metric_std=bc_rskf_roc_std,
        time_sec=bc_rskf_roc_time,
        metric_name="roc_auc"
    )

    # adding breast cancer rows to the repeated-CV summary
    rows_repeated.append({
        "dataset": "breast_cancer",
        "model": model_name,
        "cv_strategy": "stratkfold_5",
        "mean_acc": bc_skf_acc_mean,
        "std_acc": bc_skf_acc_std,
        "mean_roc": bc_skf_roc_mean,
        "std_roc": bc_skf_roc_std
    })
    rows_repeated.append({
        "dataset": "breast_cancer",
        "model": model_name,
        "cv_strategy": "repeated_stratkfold_5x3",
        "mean_acc": bc_rskf_acc_mean,
        "std_acc": bc_rskf_acc_std,
        "mean_roc": bc_rskf_roc_mean,
        "std_roc": bc_rskf_roc_std
    })

    # single K-Fold on wine (accuracy)
    wine_kf_acc_mean, wine_kf_acc_std, wine_kf_acc_time = eval_cv(
        model, X_wine, y_wine, kf5_wine, scoring="accuracy"
    )
    # single K-Fold on wine (ROC-AUC one-vs-rest)
    wine_kf_roc_mean, wine_kf_roc_std, wine_kf_roc_time = eval_cv(
        model, X_wine, y_wine, kf5_wine, scoring="roc_auc_ovr"
    )

    # repeated K-Fold on wine (accuracy)
    wine_rkf_acc_mean, wine_rkf_acc_std, wine_rkf_acc_time = eval_cv(
        model, X_wine, y_wine, rkf_wine, scoring="accuracy"
    )
    # repeated K-Fold on wine (ROC-AUC one-vs-rest)
    wine_rkf_roc_mean, wine_rkf_roc_std, wine_rkf_roc_time = eval_cv(
        model, X_wine, y_wine, rkf_wine, scoring="roc_auc_ovr"
    )

    # logging ROC-AUC results for wine into the global results table
    log_result(
        dataset="wine",
        model=model_name,
        cv_strategy="kfold_5",
        metric_mean=wine_kf_roc_mean,
        metric_std=wine_kf_roc_std,
        time_sec=wine_kf_roc_time,
        metric_name="roc_auc_ovr"
    )
    log_result(
        dataset="wine",
        model=model_name,
        cv_strategy="repeated_kfold_5x3",
        metric_mean=wine_rkf_roc_mean,
        metric_std=wine_rkf_roc_std,
        time_sec=wine_rkf_roc_time,
        metric_name="roc_auc_ovr"
    )

    # adding wine rows to the repeated-CV summary
    rows_repeated.append({
        "dataset": "wine",
        "model": model_name,
        "cv_strategy": "kfold_5",
        "mean_acc": wine_kf_acc_mean,
        "std_acc": wine_kf_acc_std,
        "mean_roc": wine_kf_roc_mean,
        "std_roc": wine_kf_roc_std
    })
    rows_repeated.append({
        "dataset": "wine",
        "model": model_name,
        "cv_strategy": "repeated_kfold_5x3",
        "mean_acc": wine_rkf_acc_mean,
        "std_acc": wine_rkf_acc_std,
        "mean_roc": wine_rkf_roc_mean,
        "std_roc": wine_rkf_roc_std
    })

#turning the collected rows into a DataFrame
repeated_summary = pd.DataFrame(rows_repeated)
#sorting by dataset, then model, then CV strategy
repeated_summary = repeated_summary.sort_values(
    by=["dataset", "model", "cv_strategy"]
)
display(repeated_summary)


|   | dataset | model | cv_strategy | mean_acc | std_acc | mean_roc | std_roc |
|---|---|---|---|---|---|---|---|
| 1 | breast_cancer | log_reg | repeated_stratkfold_5x3 | 0.975972 | 0.01265 | 0.994643 | 0.004763 |
| 0 | breast_cancer | log_reg | stratkfold_5 | 0.973669 | 0.016627 | 0.995314 | 0.005345 |
| 5 | breast_cancer | rand_forest | repeated_stratkfold_5x3 | 0.958397 | 0.01911 | 0.989861 | 0.009027 |
| 4 | breast_cancer | rand_forest | stratkfold_5 | 0.954324 | 0.010166 | 0.989578 | 0.007703 |
| 2 | wine | log_reg | kfold_5 | 0.98873 | 0.013805 | 0.999277 | 0.000987 |
| 3 | wine | log_reg | repeated_kfold_5x3 | 0.985026 | 0.017296 | 0.999601 | 0.000844 |
| 6 | wine | rand_forest | kfold_5 | 0.983175 | 0.022304 | 0.999767 | 0.000466 |
| 7 | wine | rand_forest | repeated_kfold_5x3 | 0.981376 | 0.021958 | 0.999108 | 0.002226 |


### Task 3 Markdown - repeated cross validation

In Task 3, I compare single stratified CV to repeated stratified CV (5x3) on the breast cancer dataset, and single K-Fold to repeated K-Fold (5x3) on the wine dataset. The goal is to see how repeating CV changes the mean and variability of accuracy and ROC-AUC, and whether it changes the relative ranking between logistic regression and random forest.

For breast cancer with logistic regression, single StratifiedKFold(5) gave accuracy around 0.974 +/- 0.017 and ROC-AUC around 0.995 +/- 0.005. RepeatedStratifiedKFold(5x3) produced accuracy around 0.976 +/- 0.013 and ROC-AUC around 0.995 +/- 0.005. The means are very similar, but the standard deviation for accuracy shrinks slightly under repeated CV, which makes the performance estimate a bit more stable across different train/test splits.

For breast cancer with random forest, StratifiedKFold(5) reached accuracy of about 0.954 +/- 0.010 and ROC-AUC of roughly 0.990 +/- 0.008. RepeatedStratifiedKFold(5x3) gave accuracy of about 0.958 +/- 0.019 and ROC-AUC of about 0.990 +/- 0.009. Here the mean accuracy nudges upward, but the standard deviations do not drop (the accuracy std rises from 0.010 to 0.019). This is a reminder that repeating CV stabilizes the mean estimate by averaging over more partitions; it does not shrink the fold-to-fold spread, which is what the reported std measures.

On the wine dataset with logistic regression, single KFold(5) gave accuracy of about 0.989 +/- 0.014 and ROC-AUC(ovr) of about 0.999 +/- 0.001. RepeatedKFold(5x3) gave accuracy around 0.985 +/- 0.017 and ROC-AUC(ovr) around 1.000 +/- 0.001. The average performance is essentially unchanged, but repeated CV smooths the ROC-AUC variability slightly, which is expected on a small but very separable dataset.

For wine with random forest, KFold(5) produced accuracy around 0.983 +/- 0.022 and ROC-AUC(ovr) around 1.000 +/- 0.0005, while RepeatedKFold(5x3) gave accuracy around 0.981 +/- 0.022 and ROC-AUC(ovr) around 0.999 +/- 0.002. In this case, repeating CV does not dramatically change either the means or the standard deviations, which fits the idea that the wine classes are already easy to separate and the model is near its performance ceiling.

Overall, repeated CV does not radically change the average scores, but it gives me more observations of how each model behaves across different splits. On breast cancer, it reinforces that logistic regression is slightly ahead of random forest in ROC-AUC, and on wine, it confirms that both models perform extremely well and are very close. The main value of repeated CV here is that it makes the ranking of models less dependent on the luck of a single CV partition, even if the standard deviations do not always shrink in every single metric.
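One way to see what repetition buys is to look at the mean of each repeat separately. The sketch below is an illustration using the same 5x3 setup as above on breast cancer with logistic regression; it relies on the fact that `RepeatedStratifiedKFold` yields all folds of the first repeat, then all folds of the second, and so on, so the flat score array can be reshaped to (repeats, folds).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs")),
])

n_splits, n_repeats = 5, 3
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)

# 15 scores, ordered repeat by repeat
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

# Row r holds the 5 fold scores of repeat r
per_repeat = scores.reshape(n_repeats, n_splits)
repeat_means = per_repeat.mean(axis=1)

print("fold-to-fold std over all 15 scores:", scores.std())
print("mean of each repeat:", repeat_means)
print("spread of the repeat means:", repeat_means.std())
```

Each repeat mean is what a single 5-fold run would have reported, so the spread of those means shows how much a one-off CV result depends on the particular partition, while the pooled mean averages that dependence away.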


## Task 4 - Nested CV for hyperparameter tuning
### Goal: teach the “inner loop tunes, outer loop estimates generalization.” No peeking.

In [8]:
#pipeline A: logistic regression inside a pipeline
pipeA = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs"))
])

# grid for pipe A (logistic regression C values)
param_grid_A = {
    "clf__C": [0.01, 0.1, 1.0, 10.0]
}

#pipe B: SVC with RBF kernel inside a pipeline
pipeB = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="rbf", probability=True, random_state=42))
])
# grid for pipeline B (C and gamma)
param_grid_B = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__gamma": ["scale", "auto"]
}

# setting up outer stratified 5-fold CV - generalization estimate
outer_cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)
# setting up inner stratified 3-fold CV - hyperparameter tuning
inner_cv = StratifiedKFold(
    n_splits=3,
    shuffle=True,
    random_state=42
)

# helper function to run nested CV for a given pipeline and param grid
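def run_nested_cv(pipeline, param_grid, X, y, outer_cv, inner_cv, scoring="roc_auc"):
    """
    Runs nested CV: inner GridSearchCV for tuning, outer loop for generalization.
    Returns outer scores, chosen params per outer fold, and total elapsed time.
    Note: `scoring` only drives the inner search; the outer score is always binary
    ROC-AUC from predict_proba, and X / y must be pandas objects (indexed with .iloc).
    """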
    outer_scores = []
    chosen_params = []
    start_total = time.perf_counter()

    for train_idx, test_idx in outer_cv.split(X, y):
        # splitting into outer train/test for this fold
        X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
        y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

        # setting up grid search with the inner CV
        grid = GridSearchCV(
            pipeline,
            param_grid=param_grid,
            cv=inner_cv,
            scoring=scoring,
            n_jobs=-1
        )

        # fitting the grid on the outer training fold
        grid.fit(X_tr, y_tr)

        # getting predicted probabilities from the best model for ROC-AUC
        y_proba = grid.best_estimator_.predict_proba(X_te)[:, 1]

        # computing ROC-AUC on the outer test fold
        score = roc_auc_score(y_te, y_proba)
        outer_scores.append(score)

        # storing the best hyperparameters for this outer fold
        chosen_params.append(grid.best_params_)

    elapsed_total = time.perf_counter() - start_total
    return np.array(outer_scores), chosen_params, elapsed_total

# running nested CV for pipeline A (logistic regression)
outer_scores_A, chosen_params_A, elapsed_A = run_nested_cv(
    pipeA, param_grid_A, X_bc, y_bc, outer_cv, inner_cv, scoring="roc_auc"
)
# running nested CV for pipeline B (SVC with RBF)
outer_scores_B, chosen_params_B, elapsed_B = run_nested_cv(
    pipeB, param_grid_B, X_bc, y_bc, outer_cv, inner_cv, scoring="roc_auc"
)
# computing mean and std of outer ROC-AUC for both pipelines
A_mean, A_std = outer_scores_A.mean(), outer_scores_A.std()
B_mean, B_std = outer_scores_B.mean(), outer_scores_B.std()
# logging nested CV results into the global table
log_result(
    dataset="breast_cancer",
    model="log_reg_nested",
    cv_strategy="nested_stratkfold_5x3",
    metric_mean=A_mean,
    metric_std=A_std,
    time_sec=elapsed_A,
    metric_name="roc_auc"
)
log_result(
    dataset="breast_cancer",
    model="svc_nested",
    cv_strategy="nested_stratkfold_5x3",
    metric_mean=B_mean,
    metric_std=B_std,
    time_sec=elapsed_B,
    metric_name="roc_auc"
)

# printing nested CV results and chosen hyperparameters per outer fold
print("Nested CV results on breast cancer (ROC-AUC):")
print(f"  Pipeline A (LogReg): mean={A_mean:.3f}, std={A_std:.3f}")
print(f"  Pipeline B (SVC RBF): mean={B_mean:.3f}, std={B_std:.3f}\n")
print("Chosen hyperparameters per outer fold for Pipeline A:")
for i, params in enumerate(chosen_params_A, start=1):
    print(f"  Fold {i}: {params}")
print("\nChosen hyperparameters per outer fold for Pipeline B:")
for i, params in enumerate(chosen_params_B, start=1):
    print(f"  Fold {i}: {params}")
# computing a non-nested (over-optimistic) estimate using GridSearchCV on full data
grid_non_nested_A = GridSearchCV(
    pipeA,
    param_grid_A,
    cv=inner_cv,
    scoring="roc_auc",
    n_jobs=-1
)

# fitting on the full breast cancer dataset - the "wrong" method
grid_non_nested_A.fit(X_bc, y_bc)

# taking the best_score_ from the inner CV as the non-nested estimate
non_nested_A_score = grid_non_nested_A.best_score_

print("\nNon-nested estimate for Pipeline A:")
print(f"  GridSearchCV best_score_ (inner CV only): {non_nested_A_score}")


Nested CV results on breast cancer (ROC-AUC):
  Pipeline A (LogReg): mean=0.994, std=0.010
  Pipeline B (SVC RBF): mean=0.995, std=0.006

Chosen hyperparameters per outer fold for Pipeline A:
  Fold 1: {'clf__C': 0.1}
  Fold 2: {'clf__C': 0.1}
  Fold 3: {'clf__C': 10.0}
  Fold 4: {'clf__C': 0.1}
  Fold 5: {'clf__C': 1.0}

Chosen hyperparameters per outer fold for Pipeline B:
  Fold 1: {'clf__C': 1.0, 'clf__gamma': 'scale'}
  Fold 2: {'clf__C': 1.0, 'clf__gamma': 'scale'}
  Fold 3: {'clf__C': 10.0, 'clf__gamma': 'scale'}
  Fold 4: {'clf__C': 1.0, 'clf__gamma': 'scale'}
  Fold 5: {'clf__C': 1.0, 'clf__gamma': 'scale'}

Non-nested estimate for Pipeline A:
  GridSearchCV best_score_ (inner CV only): 0.9945093530369894


### Task 4 Markdown - nesting! and not nesting

After running nested cross-validation on the breast cancer dataset, Pipeline A (LogReg) under nested StratifiedKFold (5 outer, 3 inner folds) gave ROC-AUC of about 0.994 +/- 0.010. Pipeline B (SVC with RBF kernel) gave ROC-AUC of about 0.995 +/- 0.006. Both scores are very close to 1.0, which means both pipelines separate malignant vs. benign cases extremely well. The gap between 0.994 and 0.995 is well within one standard deviation, so it is likely sampling noise rather than a practical difference.

Looking at hyperparameter stability across outer folds, for logistic regression the best C from the inner loop changed by outer fold: C = 0.1 in three folds, C = 10.0 once, and C = 1.0 once. This variation suggests that several C values lie on a relatively flat performance plateau: changing the regularization strength within this range does not move ROC-AUC much, so the grid search settles on slightly different values per fold. For SVC, the best hyperparameters were more stable: four of the five outer folds chose C = 1.0 (fold 3 chose C = 10.0), and every fold chose gamma = "scale". This indicates a consistent sweet spot for this model under this CV setup.
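The stability check can be made mechanical by tallying the per-fold winners. In this small sketch the two lists are copied by hand from the printed nested-CV output above; the helper name `tally_params` is an assumption for the example.

```python
from collections import Counter

# Best params per outer fold, copied from the nested-CV output above
chosen_params_A = [
    {"clf__C": 0.1}, {"clf__C": 0.1}, {"clf__C": 10.0},
    {"clf__C": 0.1}, {"clf__C": 1.0},
]
chosen_params_B = [
    {"clf__C": 1.0, "clf__gamma": "scale"},
    {"clf__C": 1.0, "clf__gamma": "scale"},
    {"clf__C": 10.0, "clf__gamma": "scale"},
    {"clf__C": 1.0, "clf__gamma": "scale"},
    {"clf__C": 1.0, "clf__gamma": "scale"},
]


def tally_params(params_per_fold):
    """Count how often each full hyperparameter combination was selected."""
    return Counter(tuple(sorted(p.items())) for p in params_per_fold)


tally_A = tally_params(chosen_params_A)
tally_B = tally_params(chosen_params_B)

for name, tally in [("Pipeline A", tally_A), ("Pipeline B", tally_B)]:
    print(name)
    for combo, count in tally.most_common():
        print(f"  {dict(combo)}: {count}/5 folds")
```

In the notebook itself the same helper could be applied to the chosen_params lists returned by run_nested_cv instead of hand-copied values.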

Comparing non-nested and nested estimates, the non-nested GridSearchCV estimate for Pipeline A (best_score_ from a 3-fold search on the full dataset) was ROC-AUC of about 0.995, slightly higher than the nested mean of about 0.994. Even though the difference is small, the non-nested estimate is optimistically biased because the same data is used both to pick hyperparameters and to report performance. The nested estimate of 0.994 +/- 0.010 keeps tuning and evaluation strictly separated (inner loop tunes, outer loop evaluates), so it is the value we should trust when reporting results or choosing a model for deployment.
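For reference, the same inner-tunes/outer-evaluates structure can be written more compactly by passing a GridSearchCV object to cross_val_score. This is a sketch of an alternative formulation, not the helper used in this lab: it mirrors Pipeline A's grid and CV seeds but does not record the chosen hyperparameters per fold the way run_nested_cv does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs")),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: the search tunes C using only the data it is fitted on
search = GridSearchCV(pipe, param_grid=param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop: each outer fold refits the whole search on the outer-train part
# and scores the refitted best model on the held-out outer-test part
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"nested ROC-AUC: mean={nested_scores.mean():.3f}, std={nested_scores.std():.3f}")
```

Because the search object is cloned and refitted inside every outer fold, the outer test data never influences which C is selected, which is the "no peeking" property the nested estimate depends on.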


## Task 5 - Nested CV + feature selection

In [9]:
# building a pipeline that includes feature selection and random forest
pipe_fs = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", RandomForestClassifier(random_state=42))
])

#defining the search space for k and RF hyperparameters
param_grid_fs = {
    #how many features
    "select__k": [5, 10, 15, "all"],
    #how many trees
    "clf__n_estimators": [100, 300],
    #how deep each tree can go, none meaning trees can grow until pure
    "clf__max_depth": [None, 5, 10]
}

# running nested CV on breast cancer with feature selection inside the pipeline
fs_scores_nested, fs_params_nested, fs_elapsed = run_nested_cv(
    pipe_fs,
    param_grid_fs,
    X_bc,
    y_bc,
    outer_cv,
    inner_cv,
    scoring="roc_auc"
)
#computing mean and std of nested ROC-AUC with feature selection
fs_nested_mean = fs_scores_nested.mean()
fs_nested_std = fs_scores_nested.std()
# logging nested and FS result into global results table
log_result(
    dataset="breast_cancer",
    model="rf_fs_nested",
    cv_strategy="nested_stratkfold_5x3",
    metric_mean=fs_nested_mean,
    metric_std=fs_nested_std,
    time_sec=fs_elapsed,
    metric_name="roc_auc"
)
# printing nested CV + feature selection results
print("Nested CV with feature selection (RandomForest + SelectKBest):")
print(f"  Mean ROC-AUC: {fs_nested_mean:.3f}")
print(f"  Std ROC-AUC:  {fs_nested_std:.3f}\n")

print("Chosen hyperparameters per outer fold:")
for i, params in enumerate(fs_params_nested, start=1):
    print(f"  Fold {i}: {params}")

# collecting how often each k value was selected
k_values = [p["select__k"] for p in fs_params_nested]
k_counts = pd.Series(k_values).value_counts()
print("\nFrequency of selected k values across outer folds:")
print(k_counts)

# trying the wrong way, feature selection on full data first, then CV
selector_wrong = SelectKBest(score_func=f_classif, k=10)

# fitting the selector on the full dataset (this leaks label information)
selector_wrong.fit(X_bc, y_bc)

# transforming the full dataset using selected features
X_bc_fs = selector_wrong.transform(X_bc)

# building a pipeline without feature selection (since it's already applied)
pipe_rf_no_fs = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(
        n_estimators=300,
        max_depth=None,
        random_state=42
    ))
])

# running plain StratifiedKFold CV on the already feature-selected data (wrong)
skf5_bc = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_wrong_scores, rf_wrong_std, rf_wrong_time = eval_cv(
    pipe_rf_no_fs,
    pd.DataFrame(X_bc_fs),
    y_bc,
    skf5_bc,
    scoring="roc_auc"
)
print("WRONG pattern (FS before CV on full data):")
print(f"  ROC-AUC (leaky): {rf_wrong_scores:} ± {rf_wrong_std:}")


Nested CV with feature selection (RandomForest + SelectKBest):
  Mean ROC-AUC: 0.990
  Std ROC-AUC:  0.008

Chosen hyperparameters per outer fold:
  Fold 1: {'clf__max_depth': None, 'clf__n_estimators': 300, 'select__k': 'all'}
  Fold 2: {'clf__max_depth': None, 'clf__n_estimators': 300, 'select__k': 'all'}
  Fold 3: {'clf__max_depth': 5, 'clf__n_estimators': 300, 'select__k': 'all'}
  Fold 4: {'clf__max_depth': 5, 'clf__n_estimators': 100, 'select__k': 'all'}
  Fold 5: {'clf__max_depth': 5, 'clf__n_estimators': 300, 'select__k': 'all'}

Frequency of selected k values across outer folds:
all    5
Name: count, dtype: int64
WRONG pattern (FS before CV on full data):
  ROC-AUC (leaky): 0.9849559783333941 ± 0.010715329994244767


### Task 5 Markdown - More nesting, expanded pipeline building, what not to do

In Task 5, we extend the nested CV setup by adding feature selection into the pipeline. Our methodology here is StandardScaler to SelectKBest(f_classif) to RandomForestClassifier. The idea is to tune both the number of selected features and the random forest hyperparameters inside the inner CV loop, and then estimate generalization with the outer loop. Under the correct nested setup, the RandomForest + SelectKBest pipeline reached a mean ROC-AUC of about 0.990 +/- 0.008 on the breast cancer data. This score is very close to 1, which means the tuned pipeline is ranking malignant vs benign cases quite well across outer folds, but there is still a bit of fold-to-fold variability reflected in the standard deviation.

When we look at the chosen hyperparameters per outer fold, we see that select__k is always "all" and n_estimators is 300 in four of the five folds (fold 4 chose 100), while max_depth is None for the first two folds and 5 for the last three. This pattern suggests that the model is relatively insensitive to the exact depth and forest-size settings within this grid, and that in this dataset keeping all features works slightly better than aggressively cutting down to a small subset.

The frequency table for select__k shows that "all" was chosen in 5 out of 5 outer folds. This tells me that, according to the nested CV search, using every feature tends to give the best ROC-AUC on this particular dataset, even though I allowed the search to consider k values like 5, 10, and 15.

For the wrong pattern, I first ran SelectKBest (k = 10) on the full dataset and then did StratifiedKFold on the already reduced features. This leaky setup produced a ROC-AUC of about 0.985 +/- 0.011. Even though 0.985 sounds strong, it is not directly comparable to the nested result: the feature selector has already seen all labels, including those in the CV test folds, and it was fixed at k = 10 with fixed forest settings, whereas the nested search always chose k = "all".

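The key takeaway is that the nested CV with feature selection inside the pipeline provides the more trustworthy ROC-AUC estimate. In this run it was also slightly higher than the leaky estimate (0.990 vs 0.985), most likely because the leaky run was restricted to 10 features, so the leak did not show up as an inflated score here. Any label-dependent step like SelectKBest must live inside the pipeline so that it is refit on the training portion of every fold. Doing feature selection once on the full dataset before CV tends to give optimistic performance estimates that would not hold up on truly unseen data, and the bias grows when there are many features relative to samples.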
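
Because the leak did not inflate the score on the breast cancer data, a synthetic example makes the optimism easier to see. The sketch below is an assumption-laden illustration, not part of the lab results: the data is pure noise (100 samples, 2000 random features, labels unrelated to the features), so any honest ROC-AUC should hover around 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# synthetic noise data (assumption): features carry no signal about the labels
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = np.repeat([0, 1], 50)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# WRONG: the selector sees every label, including those of future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=cv, scoring="roc_auc"
)

# RIGHT: the selector is a pipeline step, refit on each training split only
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_scores = cross_val_score(
    honest_pipe, X_noise, y_noise, cv=cv, scoring="roc_auc"
)

print(f"leaky  ROC-AUC: {leaky_scores.mean():.3f}")
print(f"honest ROC-AUC: {honest_scores.mean():.3f}")
```

With thousands of noise features, some correlate with the labels by chance. Selecting them on the full dataset bakes that chance correlation into the test folds, which is the mechanism behind the optimistic estimate.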


In [10]:
# turning the accumulated global results into a DataFrame
results_df = pd.DataFrame(results_rows)

# quick look at the combined results table
display(results_df.sort_values(by=["dataset", "model", "cv_strategy"]))

# example: finding the most reliable (lowest std) strategy per dataset/model
reliability_summary = (
    results_df
    .groupby(["dataset", "model", "cv_strategy"])
    .agg(
        mean_metric=("metric_mean", "mean"),
        std_metric=("metric_std", "mean"),
        mean_time_sec=("time_sec", "mean")
    )
    .reset_index()
)

# sorting so I can see which strategies have the smallest std for each dataset/model
reliability_summary = reliability_summary.sort_values(
    by=["dataset", "model", "std_metric"]
)

display(reliability_summary)

# exporting the results table to CSV for submission
results_df.to_csv("cv_results_summary.csv", index=False)


Unnamed: 0,dataset,model,cv_strategy,metric_name,metric_mean,metric_std,time_sec
0,breast_cancer,log_reg,kfold_5,roc_auc,0.99477,0.005573,0.802109
5,breast_cancer,log_reg,repeated_stratkfold_5x3,roc_auc,0.994643,0.004763,0.030671
1,breast_cancer,log_reg,stratkfold_5,roc_auc,0.995314,0.005345,0.787515
4,breast_cancer,log_reg,stratkfold_5,roc_auc,0.995314,0.005345,0.015047
12,breast_cancer,log_reg_nested,nested_stratkfold_5x3,roc_auc,0.99359,0.009728,0.217222
9,breast_cancer,rand_forest,repeated_stratkfold_5x3,roc_auc,0.989861,0.009027,0.527414
8,breast_cancer,rand_forest,stratkfold_5,roc_auc,0.989578,0.007703,0.342995
14,breast_cancer,rf_fs_nested,nested_stratkfold_5x3,roc_auc,0.99007,0.007646,7.649145
13,breast_cancer,svc_nested,nested_stratkfold_5x3,roc_auc,0.994525,0.00598,0.309485
2,wine,log_reg,kfold_5,roc_auc_ovr,0.999277,0.000987,0.622545


Unnamed: 0,dataset,model,cv_strategy,mean_metric,std_metric,mean_time_sec
1,breast_cancer,log_reg,repeated_stratkfold_5x3,0.994643,0.004763,0.030671
2,breast_cancer,log_reg,stratkfold_5,0.995314,0.005345,0.401281
0,breast_cancer,log_reg,kfold_5,0.99477,0.005573,0.802109
3,breast_cancer,log_reg_nested,nested_stratkfold_5x3,0.99359,0.009728,0.217222
5,breast_cancer,rand_forest,stratkfold_5,0.989578,0.007703,0.342995
4,breast_cancer,rand_forest,repeated_stratkfold_5x3,0.989861,0.009027,0.527414
6,breast_cancer,rf_fs_nested,nested_stratkfold_5x3,0.99007,0.007646,7.649145
7,breast_cancer,svc_nested,nested_stratkfold_5x3,0.994525,0.00598,0.309485
10,wine,log_reg,stratkfold_5,1.0,0.0,0.015067
9,wine,log_reg,repeated_kfold_5x3,0.999601,0.000844,0.045315


### Task 6 Markdown - Pulling everything together

In Task 6, we pull everything together into one table with columns (dataset, model, cv_strategy, metric_name, metric_mean, metric_std, time_sec). This combines the earlier experiments from Tasks 2–5 and allows us to compare holistically not just performance, but also variability and compute cost across K-Fold, StratifiedKFold, Repeated CV, and Nested CV.

On the breast cancer dataset with logistic regression, the ROC-AUC under plain KFold(5) was about 0.995 +/- 0.006, StratifiedKFold(5) was about 0.995 +/- 0.005, and RepeatedStratifiedKFold(5x3) was about 0.995 +/- 0.005. Nested stratified CV gave a slightly lower mean of around 0.994 +/- 0.010. All of these runs finished in under a second in the table above (about 0.80 s for KFold, 0.79 s and 0.02 s for the two logged StratifiedKFold runs, 0.03 s for repeated stratified, and 0.22 s for nested), and the gap between the two identical StratifiedKFold runs suggests that timings at this scale are dominated by one-off overhead rather than by the CV strategy itself. This pattern shows that nested CV gives the most honest estimate, while repeated stratified CV gives almost the same mean AUC as the best non-nested strategy with a small standard deviation.

For breast cancer with random forest, single StratifiedKFold(5) and RepeatedStratifiedKFold(5x3) both landed around 0.990 ROC-AUC, with standard deviations in the 0.008–0.009 range and runtimes of roughly 0.3–0.5 seconds. The nested CV run for random forest with feature selection (rf_fs_nested) also achieved roughly 0.990 +/- 0.008 but took about 7.6 seconds, by far the slowest breast cancer entry in the table. The SVC model with nested stratified CV reached about 0.995 +/- 0.006 in under a second, illustrating how nested CV can still be affordable when the model and grid are relatively small.

On the wine dataset, all strategies performed extremely well. Logistic regression with KFold(5) produced ROC-AUC(ovr) around 0.999 +/- 0.001, StratifiedKFold(5) essentially hit 1.000 +/- 0.000, and RepeatedKFold(5x3) gave about 1.000 +/- 0.001. Random forest with KFold(5) and RepeatedKFold(5x3) also stayed near 0.999–1.000 with very small standard deviations and sub-second runtimes. This confirms that for this small, highly separable dataset, the choice of CV strategy does not change the ranking or conclusions very much.

Looking across models and datasets, the strategies with the lowest standard deviation tend to be stratified or repeated stratified CV on the breast cancer data, and simple (stratified) K-Fold on the wine data. Nested CV usually has slightly larger variance because it does a harder job: it estimates the performance of a tuned pipeline with an outer test fold that has never influenced hyperparameter choice.

In terms of realism, nested stratified CV is the most trustworthy protocol for tuned models. For example, the nested logistic regression and SVC runs on breast cancer produce ROC-AUC values that are a bit lower than the non-nested scores, reflecting the fact that we are now evaluating models that have not seen their outer test folds during tuning. These are the numbers I would report if I had to make a formal claim about deployed performance.

For day-to-day work, repeated stratified K-Fold is the most practical compromise. On breast cancer, it gives very similar mean ROC-AUC to single StratifiedKFold, comparable standard deviations, and runtimes that stay well under a second for both logistic regression and random forest. This makes it a good default for comparing many models, with the option to then confirm the final chosen pipeline using nested CV.
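
As a rough illustration of what the repeats add, the sketch below scores one pipeline with RepeatedStratifiedKFold and then looks at the mean of each repeat separately. The pipeline and `max_iter` value are assumptions. The reshape relies on the splitter yielding all folds of one repeat before starting the next.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5 folds x 3 repeats = 15 scores, each repeat using a different shuffle
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(pipe, X, y, cv=rskf, scoring="roc_auc")

# one row per repeat, one column per fold
per_repeat_means = scores.reshape(3, 5).mean(axis=1)

print("per-repeat mean ROC-AUC:", per_repeat_means.round(4))
print(f"overall: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the per-repeat means differ noticeably, a single 5-fold run could have landed on any of them, which is the partition luck that repeating averages out.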

Based on this table, my decision rules going forward are: 

1. For imbalanced classification with n < 2k, I will use StratifiedKFold or RepeatedStratifiedKFold as my baseline CV strategy.
2. Whenever I am doing serious hyperparameter tuning or feature selection, I will wrap the whole pipeline in nested stratified CV to avoid leakage.
3. When I have many candidate models, I will first screen them with repeated stratified CV, then run nested CV on the top one or two models before choosing a final pipeline to deploy.
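
The three rules above can be written down as a small helper so they are applied consistently. This is only a sketch: the function name `plan_cv`, the reading of "many" as more than two candidates, the order in which the rules are checked, and the fall-through message are all assumptions that the rules themselves do not specify.

```python
def plan_cv(n_samples, imbalanced, tunes_or_selects, n_candidates=1):
    """Return the CV steps suggested by the three decision rules (hypothetical helper)."""
    plan = []
    if n_candidates > 2:  # assumption: "many" means more than two candidates
        plan.append("screen: RepeatedStratifiedKFold on every candidate")
        plan.append("confirm: nested stratified CV on the top one or two")
    elif tunes_or_selects:
        plan.append("nested stratified CV around the whole pipeline")
    elif imbalanced and n_samples < 2000:
        plan.append("baseline: StratifiedKFold or RepeatedStratifiedKFold")
    else:
        # assumption: the rules do not cover this case
        plan.append("no rule applies: decide case by case")
    return plan


# breast cancer has 569 samples; one untuned model, classes not perfectly balanced
print(plan_cv(n_samples=569, imbalanced=True, tunes_or_selects=False))
# comparing several tuned pipelines
print(plan_cv(n_samples=569, imbalanced=True, tunes_or_selects=True, n_candidates=5))
```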


### Task 7 Markdown - Nested CV as the gold standard

Nested cross-validation is the gold standard because it separates tuning (inner loop) from evaluation (outer loop), so the test folds are never used to pick hyperparameters. This design prevents subtle forms of data leakage and produces a slightly lower but more honest estimate of performance for the tuned pipeline. Because the procedure is explicit about what data is “seen” during tuning vs evaluation, it is the protocol I would trust for final reporting and deployment decisions.

Stratification is non-negotiable whenever the target is imbalanced or when misclassifying the minority class is costly (e.g., malignant vs benign tumors in the breast cancer dataset). Without stratification, some folds can end up with very few or even zero minority samples, which leads to inflated accuracy and unstable or even undefined metrics like ROC-AUC or F1. In this lab, logistic regression on the breast cancer dataset did marginally better with StratifiedKFold than with plain KFold (ROC-AUC 0.9953 vs 0.9948), and both StratifiedKFold and RepeatedStratifiedKFold showed slightly lower standard deviations.
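
The zero-minority-fold problem is easy to reproduce on a toy target. In the sketch below the labels are made up for illustration (95 negatives followed by 5 positives, an assumption, not lab data), and neither splitter shuffles, so the effect is deterministic.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy imbalanced target (assumption): 95 negatives followed by 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting


def minority_per_fold(splitter):
    """Number of positive samples that land in each test fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]


kfold_counts = minority_per_fold(KFold(n_splits=5))
strat_counts = minority_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per test fold:          ", kfold_counts)
print("StratifiedKFold positives per test fold:", strat_counts)
```

Any fold with zero positives cannot produce a ROC-AUC at all, and a classifier that always predicts the majority class would still score 100% accuracy on it.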

Repeated CV is worth the extra compute for small datasets, where a single partition can dramatically change the model ranking. By averaging over multiple random splits, repeated CV reduces the variance of the performance estimate, making it less likely that I pick the best model purely by chance. The extra compute is justified when the dataset is small and the stakes of picking the wrong model are high. It is less critical for very large datasets where each fold already approximates the full distribution well.

In deployment, I will not re-run CV; instead, I will use a strong CV protocol such as repeated stratified cross-validation or nested cross-validation during training to choose and tune a single best pipeline. I will then freeze that pipeline, including preprocessing, feature selection, and hyperparameters. I will ship only that one fixed model into production, where it receives new data and outputs predictions. In other words, cross-validation would live entirely in the training/evaluation phase, while the deployment model is the final product of that CV-based selection, not a model that continues to use data from cross-validation in production.
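
One possible way to freeze the chosen pipeline is sketched below. The grid, the use of `pickle`, and the stand-in for production data are all assumptions made for illustration. The search refits the best pipeline on all training data (the default `refit=True` behavior), and only that fitted object is serialized.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# training phase: CV is used only to choose hyperparameters
search = GridSearchCV(
    pipe,
    {"clf__C": [0.1, 1.0, 10.0]},  # assumed grid, for illustration
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)  # refit=True: best pipeline is refit on all of X, y

# freeze: serialize the single fitted pipeline (scaler + classifier together)
frozen_bytes = pickle.dumps(search.best_estimator_)

# deployment phase: load the frozen pipeline and predict, no CV involved
deployed = pickle.loads(frozen_bytes)
new_batch = X[:5]  # stand-in for new production data (assumption)
malignancy_scores = deployed.predict_proba(new_batch)[:, 1]
print(malignancy_scores.round(3))
```

Because the scaler lives inside the pipeline, new data is transformed with the statistics learned during training. The pickled object should be loaded with the same scikit-learn version that created it.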
