In [3]:
## 📚 1. Setup and Data Loading
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score



# --- Load a standard classification dataset (Iris) ---
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

print(f"Dataset loaded: {iris.frame.shape[0]} samples.")
# sort_index() orders the counts by class label so they line up with (0, 1, 2)
print(f"Target distribution (0, 1, 2): {y.value_counts().sort_index().tolist()}")

Dataset loaded: 150 samples.
Target distribution (0, 1, 2): [50, 50, 50]


## 🔁 2. Basic K-Fold Cross-Validation

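**K-Fold** is the standard workhorse for general CV. It divides the data into $K$ blocks (folds) of nearly equal size; each fold serves once as the test set while the remaining $K-1$ folds are used for training. Since the order of the samples carries no meaning here, we can **shuffle** the data so that each fold is a random mix. Shuffling matters for Iris in particular: the dataset is stored sorted by class, so unshuffled consecutive blocks would each be dominated by one or two species.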
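To make the fold mechanics concrete, here is a minimal sketch on ten toy samples. The array `X_toy`, the name `kf_demo`, and the seed are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples indexed 0..9 (illustrative only, not the Iris data)
X_toy = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_toy), start=1):
    test_folds.append(test_idx)
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Across the 5 folds, each sample appears in exactly one test set
all_test = np.sort(np.concatenate(test_folds))
print("Every sample tested exactly once:", np.array_equal(all_test, np.arange(10)))
```

With 10 samples and 5 splits, each test fold holds 2 samples and the other 8 are used for training.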

### 2.1. Defining the Folds

We will use $K=5$ folds.

In [5]:
# --- Define the K-Fold cross-validator and model ---
# shuffle=True is the standard choice for non-time-series data
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Logistic Regression model for classification
model_kf = LogisticRegression(max_iter=1000, random_state=42)


In [6]:
# Use cross_val_score to perform K-Fold CV
# Scoring is set to 'accuracy' for this classification problem
cv_scores_kf = cross_val_score(
    model_kf, 
    X, 
    y, 
    cv=kf, 
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 folds:")
print(cv_scores_kf)

print(f"\nFinal K-Fold CV Score (Average Accuracy): {cv_scores_kf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_kf.std():.4f}")


Accuracy scores for each of the 5 folds:
[1.         1.         0.93333333 0.96666667 0.96666667]

Final K-Fold CV Score (Average Accuracy): 0.9733
Standard Deviation of Accuracy: 0.0249
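
For readers who want to see what `cross_val_score` is doing, the following is a rough sketch of the equivalent manual loop: split, fit a fresh copy of the model on the training indices, and score on the held-out indices. The names `X_demo`, `y_demo`, `kf_manual`, and `base_model` are introduced here for illustration; this is an approximation of the helper's behavior, not its actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
kf_manual = KFold(n_splits=5, shuffle=True, random_state=42)
base_model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, test_idx in kf_manual.split(X_demo):
    fold_model = clone(base_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], preds))
manual_scores = np.array(manual_scores)

# The helper should reproduce the same per-fold numbers for the same splitter
helper_scores = cross_val_score(base_model, X_demo, y_demo, cv=kf_manual, scoring="accuracy")
print("Manual loop:    ", manual_scores)
print("cross_val_score:", helper_scores)
print("Match:", np.allclose(manual_scores, helper_scores))
```

Writing the loop by hand is mainly useful when you need per-fold artifacts (predictions, fitted models) that the one-line helper does not return.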


## ⚠️ 3. The Problem: When Classes are Imbalanced

In our Iris dataset, the classes are perfectly balanced (50 samples each). But what if they weren't?

Imagine you are classifying a rare disease (95% healthy, 5% sick).

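If you use **standard K-Fold**, a random split might leave one of your test folds (the exam questions) with very few sick samples, or none at all:
* With **no** sick samples, the fold's score says nothing about detecting the disease: a model that always predicts "healthy" scores 100% accuracy on it.
* With only one or two sick samples, the score swings widely from fold to fold, so the average becomes unreliable.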

**Solution:** We need to ensure that every fold is a miniature, representative sample of the whole dataset. This is called **Stratification**.
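
The risk described above can be checked empirically. The sketch below builds a hypothetical label vector (95 healthy, 5 sick; these numbers and the names `y_rare` and `X_rare` are assumptions for illustration) and counts how often a shuffled `KFold` leaves at least one test fold without a single sick sample. A quick combinatorial check suggests this should be common: all five folds receive a sick sample only if each gets exactly one, which has probability $20^5 / \binom{100}{5} \approx 4\%$.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical labels: 95 healthy (0) and 5 sick (1)
y_rare = np.array([0] * 95 + [1] * 5)
X_rare = np.zeros((len(y_rare), 1))  # features do not affect how KFold splits

n_trials = 200
trials_with_empty_fold = 0
for seed in range(n_trials):
    kf_rare = KFold(n_splits=5, shuffle=True, random_state=seed)
    # Number of sick samples that land in each of the 5 test folds
    sick_per_fold = [int(y_rare[test_idx].sum()) for _, test_idx in kf_rare.split(X_rare)]
    if min(sick_per_fold) == 0:
        trials_with_empty_fold += 1

print(f"Last trial, sick samples per test fold: {sick_per_fold}")
print(f"Trials with at least one sick-free test fold: {trials_with_empty_fold}/{n_trials}")
```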

## 📏 4. Stratified K-Fold (The Fair Exam)

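**Stratified K-Fold** keeps the proportion of each target class (`y`) roughly the same in every training fold and test fold as in the full dataset. It is the **recommended default** for virtually all classification problems; scikit-learn's `cross_val_score` already uses it automatically when `cv` is an integer and the estimator is a classifier.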
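
As a small illustration of what "roughly the same proportion" means, the sketch below stratifies a hypothetical 95/5 label vector (the data and the names `y_rare`, `X_rare`, and `skf_rare` are assumptions for this example). Note that `split` needs the labels as its second argument so it can balance the classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 95 healthy (0) and 5 sick (1)
y_rare = np.array([0] * 95 + [1] * 5)
X_rare = np.zeros((len(y_rare), 1))  # placeholder features

skf_rare = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

sick_counts = []
for fold, (train_idx, test_idx) in enumerate(skf_rare.split(X_rare, y_rare), start=1):
    n_sick = int(y_rare[test_idx].sum())
    sick_counts.append(n_sick)
    print(f"Fold {fold}: test size={len(test_idx)}, sick={n_sick}, "
          f"sick share={n_sick / len(test_idx):.0%}")
```

With 5 sick samples and 5 folds, each test fold receives one sick sample, matching the 5% rate of the full dataset.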

### 4.1. Defining the Stratified Folds

In [8]:

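# Initialize Stratified K-Fold
# Note: stratification works with or without shuffling; shuffle=True randomizes
# which samples of each class land in each fold, and random_state makes it reproducible
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Stratified K-Fold object created with 5 splits and guaranteed class balance.")

# Initialize model again
# Note: solver='liblinear' differs from model_kf above (default solver, max_iter=1000),
# so score differences reflect the solver as well as the splitting strategy
model_skf = LogisticRegression(solver='liblinear', random_state=42)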

Stratified K-Fold object created with 5 splits and guaranteed class balance.


In [9]:
# Use cross_val_score with the Stratified object
cv_scores_skf = cross_val_score(
    model_skf, 
    X, 
    y, 
    cv=skf,  # Using the StratifiedKFold object
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 stratified folds:")
print(cv_scores_skf)

print(f"\nFinal Stratified CV Score (Average Accuracy): {cv_scores_skf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_skf.std():.4f}")


Accuracy scores for each of the 5 stratified folds:
[0.96666667 1.         0.9        0.93333333 1.        ]

Final Stratified CV Score (Average Accuracy): 0.9600
Standard Deviation of Accuracy: 0.0389




## 🌟 5. Conclusion and Next Step

### Summary of Results:

| Method | Average Accuracy | Standard Deviation |
| :--- | :--- | :--- |
| **K-Fold (Basic)** | 0.9733 | 0.0249 |
| **Stratified K-Fold** | 0.9600 | 0.0389 |

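For **balanced datasets** like Iris, the results are often very similar: here the two averages differ by about 0.013, which is within one standard deviation of either run. The two runs also use different solver settings, so part of that gap may come from the model rather than from the split. For real-world **imbalanced classification problems**, however, **Stratified K-Fold** is essential to ensure a reliable and honest evaluation of the model.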
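
To isolate the effect of the splitting strategy, one option is to run a single estimator under both splitters. The following is a sketch of such a like-for-like check, assuming the default solver with `max_iter=1000`; the names `shared_model`, `splitters`, and `results` are introduced here for illustration, and the printed numbers may differ from the table above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X_cmp, y_cmp = load_iris(return_X_y=True)

# One estimator configuration shared by both runs (an assumption for this sketch)
shared_model = LogisticRegression(max_iter=1000)

splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

results = {}
for name, splitter in splitters.items():
    scores = cross_val_score(shared_model, X_cmp, y_cmp, cv=splitter, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```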

### ⏭️ What About Time Series?

In our previous notebook, we used K-Fold on time-series data, which is technically incorrect: K-Fold breaks the chronological order, so the model is trained on observations that come after the ones it is tested on (mixing past and future).
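
The leakage is easy to see on a toy example. In the sketch below, the row index stands in for time (the array `t`, the name `kf_time`, and the seed are assumptions for illustration), and for each fold we count how many training rows come later than the earliest test row.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twenty observations in chronological order; the index plays the role of time
t = np.arange(20).reshape(-1, 1)

kf_time = KFold(n_splits=5, shuffle=True, random_state=0)

leaky_folds = 0
for fold, (train_idx, test_idx) in enumerate(kf_time.split(t), start=1):
    # Training rows that occur after the earliest test row are "future" data
    future_in_train = int((train_idx > test_idx.min()).sum())
    if future_in_train > 0:
        leaky_folds += 1
    print(f"Fold {fold}: earliest test time={test_idx.min()}, "
          f"training rows from the future={future_in_train}")

print(f"Folds that train on future data: {leaky_folds}/5")
```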

In the next notebook, we will learn the correct CV method for time-series data!
