## 📚 1. Setup and Data Loading (Imbalanced Data)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# --- Load a standard dataset (Breast Cancer for Classification) ---
data_set = load_breast_cancer(as_frame=True)
X = data_set.data
y = data_set.target

# --- SIMULATE IMBALANCE for a clear demonstration ---
# We keep every '0' (Malignant) case and only 40% of the '1' (Benign) cases.
# The result is a 212 vs. 143 split, which makes the effect of stratification easier to see.
y_0 = y[y == 0] # Malignant
y_1 = y[y == 1].sample(frac=0.4, random_state=42) # Keep only 40% of Benign cases

# Recombine the indices to create the imbalanced dataset
indices = list(y_0.index) + list(y_1.index)
X_imbalanced = X.loc[indices].reset_index(drop=True)
y_imbalanced = y.loc[indices].reset_index(drop=True)

print(f"Dataset loaded: {X_imbalanced.shape[0]} samples.")
print("\nTarget Class Distribution (0=Malignant, 1=Benign):")
print(y_imbalanced.value_counts())

# Check the baseline accuracy (predicting everything as the majority class '0')
baseline_accuracy = y_imbalanced.value_counts(normalize=True).max()
print(f"\nBaseline (predicting all 'Malignant'): {baseline_accuracy:.2f}")

Dataset loaded: 355 samples.

Target Class Distribution (0=Malignant, 1=Benign):
target
0    212
1    143
Name: count, dtype: int64

Baseline (predicting all 'Malignant'): 0.60


## ⚠️ 2. The Risk of Basic K-Fold (The Biased Test)

Standard **K-Fold** splits the data by row position (after an optional shuffle) without looking at the target classes (`y`). On an imbalanced dataset, the class mix of each test fold is therefore left to chance, and some folds can end up **unrepresentative** of the full dataset.
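A minimal sketch of how this can go wrong, using toy labels arranged like the recombined dataset: 212 class-0 rows followed by 143 class-1 rows. The names `y_toy`, `X_toy`, and `benign_share_per_fold` are illustrative and not part of the notebook. It compares the share of class 1 in each test fold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels ordered by class (an assumption that mirrors the recombined data):
# 212 samples of class 0 followed by 143 samples of class 1.
y_toy = np.array([0] * 212 + [1] * 143)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split


def benign_share_per_fold(splitter, X, y):
    """Return the share of class-1 labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


unshuffled = benign_share_per_fold(KFold(n_splits=5), X_toy, y_toy)
shuffled = benign_share_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=42), X_toy, y_toy
)

print(f"Overall class-1 share: {y_toy.mean():.3f}")
print("Unshuffled K-Fold:", [round(s, 3) for s in unshuffled])
print("Shuffled K-Fold:  ", [round(s, 3) for s in shuffled])
```

Without shuffling, the first test folds contain only class 0 and the last ones only class 1. With shuffling, the per-fold shares are no longer extreme, but nothing forces them to match the overall share.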

### 2.1. Defining the Folds

We will use $K=5$ folds.

In [2]:
# Initialize a simple K-Fold. shuffle=True is necessary here because the
# recombined rows are ordered by class (all Malignant rows first, then Benign).
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize a simple classification model (Logistic Regression)
model_kf = LogisticRegression(solver='liblinear', random_state=42, max_iter=2000)

In [3]:
# Use cross_val_score to perform K-Fold CV
# Scoring is set to 'accuracy' for this classification problem
cv_scores_kf = cross_val_score(
    model_kf, 
    X_imbalanced, 
    y_imbalanced, 
    cv=kf, 
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 folds (K-Fold):")
print(cv_scores_kf)

print(f"\nFinal K-Fold CV Score (Average Accuracy): {cv_scores_kf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_kf.std():.4f}")


Accuracy scores for each of the 5 folds (K-Fold):
[0.92957746 0.95774648 0.98591549 0.95774648 0.8028169 ]

Final K-Fold CV Score (Average Accuracy): 0.9268
Standard Deviation of Accuracy: 0.0645


## 📏 3. Stratified K-Fold (The Fair Test)

**Stratified K-Fold** addresses the unrepresentative-fold problem. It ensures that the **proportion of each target class (`y`) is roughly the same** in every training fold and test fold as in the full dataset.
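As a rough illustration, assume toy labels with the same class counts as the imbalanced dataset (212 class-0 and 143 class-1 samples). The names `y_toy`, `X_toy`, `skf_toy`, and `overall_share` are illustrative, not from the notebook. The class-1 share of every stratified test fold should stay close to the overall share:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with the same class counts as the imbalanced dataset (assumption).
y_toy = np.array([0] * 212 + [1] * 143)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

overall_share = float(y_toy.mean())  # 143 / 355

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_shares = [
    float(y_toy[test_idx].mean()) for _, test_idx in skf_toy.split(X_toy, y_toy)
]

print(f"Overall class-1 share: {overall_share:.3f}")
for i, share in enumerate(fold_shares, start=1):
    print(f"Fold {i}: class-1 share = {share:.3f}")
```

Because each class is divided across the folds separately, per-fold class counts differ by at most one sample.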

### 3.1. Defining the Stratified Folds

In [4]:
# Initialize Stratified K-Fold
# This preserves the overall class proportions in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Stratified K-Fold object created, preserving class proportions.")

# Initialize model again
model_skf = LogisticRegression(solver='liblinear', random_state=42, max_iter=2000)

Stratified K-Fold object created, preserving class proportions.


In [5]:
# Use cross_val_score with the Stratified object
cv_scores_skf = cross_val_score(
    model_skf, 
    X_imbalanced, 
    y_imbalanced, 
    cv=skf,  # Using the StratifiedKFold object
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 stratified folds:")
print(cv_scores_skf)

print(f"\nFinal Stratified CV Score (Average Accuracy): {cv_scores_skf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_skf.std():.4f}")


Accuracy scores for each of the 5 stratified folds:
[0.91549296 0.88732394 0.91549296 0.97183099 0.95774648]

Final Stratified CV Score (Average Accuracy): 0.9296
Standard Deviation of Accuracy: 0.0309


## 🌟 4. Conclusion: Stability is Key

By comparing the results, we can see the clear advantage of Stratified K-Fold:

| CV Method | Average Accuracy | Standard Deviation (Stability) |
| :--- | :--- | :--- |
| **K-Fold (Basic)** | 0.9268 | 0.0645 |
| **Stratified K-Fold** | 0.9296 | 0.0309 |

The average accuracy is nearly identical (0.9268 vs. 0.9296), but the standard deviation for Stratified K-Fold is roughly half as large (0.0309 vs. 0.0645). This indicates a more **stable and reliable** performance estimate across folds, with less risk that we just got lucky (or unlucky) with a non-representative fold.
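The comparison can be recomputed directly from the fold scores printed in the two runs. This small sketch hard-codes those reported values (the array names `kf_scores` and `skf_scores` are illustrative) and uses NumPy's default population standard deviation, which is what `.std()` computes on the score arrays:

```python
import numpy as np

# Fold accuracies copied from the printed outputs of the two runs.
kf_scores = np.array([0.92957746, 0.95774648, 0.98591549, 0.95774648, 0.8028169])
skf_scores = np.array([0.91549296, 0.88732394, 0.91549296, 0.97183099, 0.95774648])

for name, scores in [("K-Fold", kf_scores), ("Stratified K-Fold", skf_scores)]:
    # .std() defaults to ddof=0 (population standard deviation).
    print(f"{name:<18} mean={scores.mean():.4f}  std={scores.std():.4f}")

# Ratio of the two spreads, as a single summary number.
std_ratio = kf_scores.std() / skf_scores.std()
print(f"K-Fold std is about {std_ratio:.1f}x the stratified std")
```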

**Rule:** Always use **Stratified K-Fold** for classification problems!
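In scikit-learn, passing an integer as `cv` already applies this rule for classifiers: with a binary or multiclass target, the integer is resolved to an unshuffled `StratifiedKFold`. A short sketch using `check_cv`, the helper that resolves the `cv` argument, on illustrative toy labels (`y_toy` is not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy binary labels with the same class counts as the imbalanced dataset (assumption).
y_toy = np.array([0] * 212 + [1] * 143)

# How an integer cv is resolved for a classifier versus a non-classifier.
cv_for_classifier = check_cv(5, y_toy, classifier=True)
cv_for_regressor = check_cv(5, y_toy, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

An explicit `StratifiedKFold` object, as used in this notebook, remains useful when you want shuffling with a fixed `random_state`.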

