# 🧩 K-Fold vs Stratified K-Fold Cross-Validation

## 📚 1. Setup and Data Loading

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score



# --- Load a standard classification dataset (Iris) ---
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

print(f"Dataset loaded: {iris.frame.shape[0]} samples.")
print(f"Target distribution (0, 1, 2): {y.value_counts().tolist()}")

Dataset loaded: 150 samples.
Target distribution (0, 1, 2): [50, 50, 50]


## 🔁 2. Basic K-Fold Cross-Validation

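**K-Fold** is the standard workhorse for general CV. It divides the data into $K$ folds of (nearly) equal size; each fold serves once as the test set while the remaining $K-1$ folds are used for training. Since row order carries no meaning here (and the Iris rows are sorted by class), we **shuffle** the data so that each fold is randomly mixed.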
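
To make the mechanics concrete, here is a minimal sketch on a hypothetical 10-row toy array (`toy_X` is an assumption for illustration, not part of the notebook's data). It only prints which row indices land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 rows, 1 feature (illustration only)
toy_X = np.arange(10).reshape(-1, 1)

toy_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the test indices of every fold
test_folds = [test_idx for _, test_idx in toy_kf.split(toy_X)]

for fold_no, test_idx in enumerate(test_folds, start=1):
    print(f"Fold {fold_no}: test rows = {sorted(test_idx.tolist())}")
```

Every row appears in exactly one test fold, so each sample is tested once and used for training in the other $K-1$ rounds.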

### 2.1. Defining the Folds

We will use $K=5$ folds.

```python
# Initialize a simple K-Fold (shuffle=True is the usual choice for non-time-series data)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("K-Fold object created with 5 splits and shuffling enabled.")

# Initialize a simple classification model (Logistic Regression)
model_kf = LogisticRegression(max_iter=1000, random_state=42)
```

In [5]:
# --- Define the K-Fold cross-validator and model ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Logistic Regression model for classification
model_kf = LogisticRegression(max_iter=1000, random_state=42)


In [6]:
# Use cross_val_score to perform K-Fold CV
# Scoring is set to 'accuracy' for this classification problem
cv_scores_kf = cross_val_score(
    model_kf, 
    X, 
    y, 
    cv=kf, 
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 folds:")
print(cv_scores_kf)

print(f"\nFinal K-Fold CV Score (Average Accuracy): {cv_scores_kf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_kf.std():.4f}")


Accuracy scores for each of the 5 folds:
[1.         1.         0.93333333 0.96666667 0.96666667]

Final K-Fold CV Score (Average Accuracy): 0.9733
Standard Deviation of Accuracy: 0.0249


## ⚠️ 3. The Problem: When Classes are Imbalanced

In our Iris dataset, the classes are perfectly balanced (50 samples each). But what if they weren't?

Imagine you are classifying a rare disease (95% healthy, 5% sick).

If you use **standard K-Fold**, a random split might leave one of your test folds (the exam questions) with:
* **No** sick samples at all, so the score on that fold says nothing about how well the model detects the disease.
* **Far fewer or far more** sick samples than the overall 5%, so the fold's score is not comparable with the others.

**Solution:** We need to ensure that every fold is a miniature, representative sample of the whole dataset. This is called **Stratification**.
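
How likely is the bad case? A rough worked example, assuming a hypothetical dataset of 100 patients with 5 sick ones, split into 5 test folds of 20 (these numbers are assumptions, not from the notebook): the chance that one particular fold contains no sick patient is a ratio of two binomial coefficients.

```python
from math import comb

# Assumed toy numbers (not from the notebook): 100 patients, 5 sick, folds of 20
n_total, n_sick, fold_size = 100, 5, 20

# Ways to fill a fold with healthy patients only / ways to fill it from everyone
p_no_sick = comb(n_total - n_sick, fold_size) / comb(n_total, fold_size)

print(f"P(a given test fold has no sick patient) = {p_no_sick:.3f}")
```

Under these assumptions the probability is roughly one in three for any single fold, which is why the problem shows up so readily with rare classes.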

## 📏 4. Stratified K-Fold (The Fair Exam)

**Stratified K-Fold** keeps the proportion of each target class (`y`) approximately the same in every training fold and test fold as in the full dataset. It is the **recommended default** for classification problems; scikit-learn itself uses it when you pass an integer `cv` to `cross_val_score` with a classifier.
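
As a quick check of that property, the sketch below counts the classes inside each test fold for the Iris labels. The names `iris_y`, `dummy_X`, and `check_skf` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris_y = load_iris().target           # 50 samples each of classes 0, 1, 2
dummy_X = np.zeros((len(iris_y), 1))  # feature values do not influence the split

check_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class counts inside every test fold
fold_counts = [
    np.bincount(iris_y[test_idx], minlength=3).tolist()
    for _, test_idx in check_skf.split(dummy_X, iris_y)
]

for fold_no, counts in enumerate(fold_counts, start=1):
    print(f"Fold {fold_no}: class counts in test set = {counts}")
```

With 50 samples per class and 5 splits, every test fold receives 10 samples of each class.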

### 4.1. Defining the Stratified Folds

In [8]:

# Initialize Stratified K-Fold
# Note: shuffle=True is optional; StratifiedKFold also stratifies with the default shuffle=False
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Stratified K-Fold object created with 5 splits and guaranteed class balance.")

# Initialize model again
model_skf = LogisticRegression(solver='liblinear', random_state=42)

Stratified K-Fold object created with 5 splits and guaranteed class balance.


In [9]:
# Use cross_val_score with the Stratified object
cv_scores_skf = cross_val_score(
    model_skf, 
    X, 
    y, 
    cv=skf,  # Using the StratifiedKFold object
    scoring='accuracy'
)

print("\nAccuracy scores for each of the 5 stratified folds:")
print(cv_scores_skf)

print(f"\nFinal Stratified CV Score (Average Accuracy): {cv_scores_skf.mean():.4f}")
print(f"Standard Deviation of Accuracy: {cv_scores_skf.std():.4f}")


Accuracy scores for each of the 5 stratified folds:
[0.96666667 1.         0.9        0.93333333 1.        ]

Final Stratified CV Score (Average Accuracy): 0.9600
Standard Deviation of Accuracy: 0.0389




## 🌟 5. Conclusion and Next Step

### Summary of Results:

| Method | Average Accuracy | Standard Deviation |
| :--- | :--- | :--- |
| **K-Fold (Basic)** | 0.9733 | 0.0249 |
| **Stratified K-Fold** | 0.9600 | 0.0389 |

For **balanced datasets** like Iris, the results are often very similar. However, for real-world **imbalanced classification problems**, **Stratified K-Fold** is essential to ensure a reliable and honest evaluation of the model.

### ⏭️ What About Time Series?

In our previous notebook, we used K-Fold on time-series data, which is technically incorrect because it breaks the chronological order (mixing past and future).

In the next notebook, we will learn the correct CV method for time-series data!


Final Summary Analysis

# 🧩 K-Fold vs Stratified K-Fold Cross-Validation

## 🔍 Concept

**The Fair Exam Problem**: How do you ensure every test fairly represents all student skill levels? Stratified K-Fold solves this for imbalanced classification problems.

---

## 💡 Key Points

### The Setup

**Dataset**: Iris flower classification
- **Samples**: 150 observations
- **Classes**: 3 types (Setosa, Versicolor, Virginica)
- **Distribution**: Perfectly balanced (50, 50, 50)
- **Features**: 4 measurements (sepal/petal length/width)

**Model**: Logistic Regression (default solver with `max_iter=1000` in the K-Fold run, `solver='liblinear'` in the Stratified run)  
**Evaluation**: 5-Fold Cross-Validation  
**Goal**: Compare basic K-Fold vs Stratified K-Fold

### The Problem Being Solved

**Imagine a Rare Disease Dataset**:
- 95% healthy patients (Class 0)
- 5% sick patients (Class 1)

**What Could Go Wrong with Basic K-Fold**:
```
Fold 1: [95% healthy, 5% sick]     ✅ Representative
Fold 2: [100% healthy, 0% sick]    ❌ Missing sick patients!
Fold 3: [92% healthy, 8% sick]     ⚠️ Slightly off
Fold 4: [98% healthy, 2% sick]     ⚠️ Underrepresents sick
Fold 5: [90% healthy, 10% sick]    ⚠️ Overrepresents sick
```

**Result**: Fold 2 never tests the model on sick patients - unreliable evaluation!

**Stratified K-Fold Solution**:
```
EVERY fold: [95% healthy, 5% sick]  ✅ All folds representative!
```

---

## 📊 Results Comparison

### Overall Performance

| Method | Avg. Accuracy | Std. Dev. | Key Insight |
|--------|--------------|-----------|-------------|
| **K-Fold** | **0.9733** (97.33%) | ±0.0249 (2.49%) | Works well for balanced data |
| **Stratified K-Fold** | **0.9600** (96.00%) | ±0.0389 (3.89%) | Guarantees class balance in each fold |

### Individual Fold Performance

**K-Fold Results**:
```
Fold 1: 100.0%  ━━━━━━━━━━━━━━━━━━━━ Perfect
Fold 2: 100.0%  ━━━━━━━━━━━━━━━━━━━━ Perfect
Fold 3:  93.3%  ━━━━━━━━━━━━━━━━━ Good
Fold 4:  96.7%  ━━━━━━━━━━━━━━━━━━ Very Good
Fold 5:  96.7%  ━━━━━━━━━━━━━━━━━━ Very Good

Average: 97.33% ± 2.49%
```

**Stratified K-Fold Results**:
```
Fold 1:  96.7%  ━━━━━━━━━━━━━━━━━━ Very Good
Fold 2: 100.0%  ━━━━━━━━━━━━━━━━━━━━ Perfect
Fold 3:  90.0%  ━━━━━━━━━━━━━━━━ Good
Fold 4:  93.3%  ━━━━━━━━━━━━━━━━━ Good
Fold 5: 100.0%  ━━━━━━━━━━━━━━━━━━━━ Perfect

Average: 96.00% ± 3.89%
```

---

## 🎯 Analysis & Interpretation

### Why K-Fold Scored Slightly Higher (97.33% vs 96.00%)

**Reason 1: Dataset is Perfectly Balanced**
- Iris has exactly 50 samples per class
- Shuffled K-Fold produces folds that are close to balanced on average
- Stratification has little to add here

**Reason 2: Lucky Random Split**
- K-Fold got 2 perfect folds (100% accuracy)
- This is random luck, not systematic superiority
- A different random_state would change the results

**Reason 3: Model is Strong and the Gap is Tiny**
- Logistic Regression separates Iris classes easily
- 97.33% vs 96.00% is 146 vs 144 correct predictions out of 150, a difference of 2 flowers
- The two runs also used different solver settings (default solver with `max_iter=1000` for K-Fold, `liblinear` for Stratified), so the gap cannot be attributed to the splitting strategy alone
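
The size of the gap can be checked directly from the fold scores. The sketch below copies the printed fold accuracies from the two outputs and converts them into misclassified-sample counts, assuming 30 test samples per fold (150 / 5). The helper `errors_per_fold` is introduced here for illustration.

```python
from statistics import mean, pstdev

TEST_FOLD_SIZE = 30  # 150 Iris samples / 5 folds

# Fold accuracies copied from the two printed outputs
kf_scores = [1.0, 1.0, 0.93333333, 0.96666667, 0.96666667]
skf_scores = [0.96666667, 1.0, 0.9, 0.93333333, 1.0]

def errors_per_fold(scores):
    """Convert fold accuracies into misclassified-sample counts."""
    return [round((1 - s) * TEST_FOLD_SIZE) for s in scores]

kf_errors = errors_per_fold(kf_scores)
skf_errors = errors_per_fold(skf_scores)

# pstdev is the population standard deviation, matching NumPy's default .std()
print(f"K-Fold:     mean={mean(kf_scores):.4f} std={pstdev(kf_scores):.4f} errors={kf_errors}")
print(f"Stratified: mean={mean(skf_scores):.4f} std={pstdev(skf_scores):.4f} errors={skf_errors}")
```

K-Fold misclassified 4 flowers in total and Stratified K-Fold 6, so the whole 1.33-point gap comes down to 2 samples.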

### Why Stratified Has Higher Standard Deviation (3.89% vs 2.49%)

**This seems counterintuitive but makes sense**:

**K-Fold's Lower Variability** (±2.49%):
- Fold scores ranged from 93.3% to 100%
- Two folds were 100% accurate
- With only 5 folds, a spread this small can easily be due to chance

**Stratified's Higher Variability** (±3.89%):
- **Guarantees** equal class proportions in each fold (10 flowers per class in every test fold)
- But can't control which specific samples go where
- Some folds got harder-to-classify flowers by chance
- Range: 90% to 100% (one challenging fold)

**Key Insight**: Each test fold holds 30 flowers, so a single misclassification moves a fold's accuracy by 3.33 points. A higher standard deviation here doesn't mean Stratified is worse; it reflects which flowers landed in which fold. Stratification controls class proportions, not sample difficulty.

---

## 🧠 The "Fair Exam" Analogy

### Basic K-Fold: The Lucky Random Test

**Scenario**: Teacher randomly assigns students to 5 different exams
- **Exam 1**: Happens to get all A-students → 100% pass rate
- **Exam 2**: Happens to get all A-students → 100% pass rate  
- **Exam 3**: Mix of A/B/C students → 93% pass rate
- **Average**: 97.3% (but exams 1 & 2 were unrealistically easy!)

**Problem**: The average depends on which students happened to sit which exam, so some exams were easier purely by chance.

### Stratified K-Fold: The Deliberately Fair Test

**Scenario**: Teacher ensures EVERY exam has:
- 33% A-students
- 33% B-students
- 34% C-students

**Result**:
- **Every exam** is taken by the same mix of student levels
- No exam gets an unfair advantage/disadvantage from its mix
- The average better reflects overall class performance
- Remaining variability comes from individual students, not from an uneven mix

---

## ⚠️ When Does Stratification Matter Most?

### Critical for Imbalanced Datasets

**Example: Fraud Detection**
- 99.5% legitimate transactions (Class 0)
- 0.5% fraudulent transactions (Class 1)

**Without Stratification**:
```
# K-Fold might create:
Fold 1: 1,000 transactions → 4 fraudulent (0.4%) ❌ Underrepresented
Fold 2: 1,000 transactions → 7 fraudulent (0.7%) ⚠️ Overrepresented
Fold 3: 1,000 transactions → 0 fraudulent (0.0%) ❌❌ DISASTER!
```
- Fold 3 never tests fraud detection!
- Model evaluation is completely unreliable

**With Stratified K-Fold**:
```
# Every fold gets:
All Folds: 1,000 transactions → 5 fraudulent (0.5%) ✅ Matches the overall rate
```
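
The contrast can be reproduced with a small sketch. The labels below are hypothetical (5,000 transactions, 25 of them fraudulent, matching the 0.5% rate in the example), and `fraud_per_fold` is a helper introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 5,000 transactions, 25 fraudulent (0.5%)
fraud_y = np.zeros(5000, dtype=int)
fraud_y[:25] = 1
fraud_X = np.zeros((len(fraud_y), 1))  # placeholder features; only labels matter here

def fraud_per_fold(splitter):
    """Number of fraudulent samples in each test fold."""
    return [int(fraud_y[test_idx].sum()) for _, test_idx in splitter.split(fraud_X, fraud_y)]

plain_counts = fraud_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = fraud_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold fraud per fold:          ", plain_counts)
print("StratifiedKFold fraud per fold:", strat_counts)
```

The stratified splitter places exactly 5 fraudulent samples in every test fold, while the plain K-Fold counts depend on the shuffle.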

### Impact on Performance Metrics

**Imbalanced Dataset Performance** (illustrative figures, not measured in this notebook):

| Scenario | K-Fold | Stratified K-Fold |
|----------|--------|-------------------|
| **Fold gets no minority class** | Possible; the fold's accuracy then says nothing about fraud detection | Prevented whenever the minority class has at least `n_splits` samples |
| **Standard Deviation** | ±10% (high variability) | ±2% (stable) |
| **Minority Class Recall** | 0% to 100% (unstable) | 85% to 95% (reliable) |
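
Accuracy alone can hide this problem. As a hedged illustration, the sketch below scores a baseline that always predicts the majority class on hypothetical labels (1,000 transactions, 5 fraudulent); the data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical labels: 1,000 transactions, 5 fraudulent (0.5%)
y_toy = np.zeros(1000, dtype=int)
y_toy[:5] = 1
X_toy = np.zeros((len(y_toy), 1))  # placeholder features

# A "model" that always predicts the majority class (legitimate)
always_legit = DummyClassifier(strategy="most_frequent")

results = cross_validate(
    always_legit,
    X_toy,
    y_toy,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring=["accuracy", "recall"],
)

print("Accuracy per fold:", results["test_accuracy"])
print("Recall per fold:  ", results["test_recall"])
```

Each stratified test fold holds 199 legitimate and 1 fraudulent transaction, so this baseline reaches 99.5% accuracy while its recall on fraud is 0. Reporting a minority-class metric next to accuracy makes that visible.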

---

## 📈 Visual Comparison: How They Work

### K-Fold (Random Split)

```
Dataset: [●●●●●●●●●●○○○○○] (10 Class A, 5 Class B)

Fold 1: [●●●]          0% Class B ❌ Class B missing!
Fold 2: [●●○]          33% Class B ✅ On target
Fold 3: [●○○]          67% Class B ❌ Way overrepresented!
Fold 4: [●●○]          33% Class B ✅ On target
Fold 5: [●●○]          33% Class B ✅ On target

Result: Fold proportions vary from 0% to 67%
```

### Stratified K-Fold (Guaranteed Balance)

```
Dataset: [●●●●●●●●●●○○○○○] (10 Class A, 5 Class B)

Fold 1: [●●○]          33.3% Class B ✅ Exact
Fold 2: [●●○]          33.3% Class B ✅ Exact
Fold 3: [●●○]          33.3% Class B ✅ Exact
Fold 4: [●●○]          33.3% Class B ✅ Exact
Fold 5: [●●○]          33.3% Class B ✅ Exact

Result: ALL folds have exactly 33.3% Class B
```

---

## 🎓 Key Learnings

### 1. Why Results Were Similar for Iris

✅ **Iris is perfectly balanced** (50, 50, 50)  
✅ **Model is strong** (Logistic Regression works well)  
✅ **Shuffled K-Fold produced roughly balanced splits** on its own  
⚠️ **Don't be fooled**: With imbalanced data, K-Fold can produce folds with few or no minority samples

### 2. When to Use Each Method

**Use Basic K-Fold**:
- ❌ Rarely for classification
- ✅ For regression problems (no classes to balance)
- ✅ For classification only when classes are well balanced and the data is shuffled

**Use Stratified K-Fold**:
- ✅ **By default for classification**
- ✅ **Essential for imbalanced classes**
- ✅ Even when balanced (no downside, adds safety)
- ✅ Fraud detection, disease diagnosis, rare event prediction

### 3. Understanding the Variance Difference

**K-Fold: ±2.49% (lower spread)**
- Looks better, but the comparison is **misleading**
- Based on only 5 folds from a single random split
- Will change with a different random_state

**Stratified: ±3.89% (higher spread)**
- Not evidence that Stratified is less stable
- Comes from which samples landed in which fold
- Will also change with random_state; only the class proportions per fold stay fixed
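
One way to see how much of the spread is split luck is to repeat both procedures over several seeds with the same model. This is a sketch under assumptions: the ten seeds and the `LogisticRegression(max_iter=1000)` settings are choices made here, not results reported in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)  # same model for both splitters

seeds = range(10)  # arbitrary seeds, chosen for illustration
kf_means, skf_means = [], []
for seed in seeds:
    kf_cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    skf_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    kf_means.append(cross_val_score(clf, X_iris, y_iris, cv=kf_cv, scoring="accuracy").mean())
    skf_means.append(cross_val_score(clf, X_iris, y_iris, cv=skf_cv, scoring="accuracy").mean())

print(f"K-Fold     mean over seeds: {np.mean(kf_means):.4f} "
      f"(min {min(kf_means):.4f}, max {max(kf_means):.4f})")
print(f"Stratified mean over seeds: {np.mean(skf_means):.4f} "
      f"(min {min(skf_means):.4f}, max {max(skf_means):.4f})")
```

Comparing the ranges over seeds, rather than a single run each, gives a fairer picture of how the two splitters behave on this dataset.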

---

## 🚀 Practical Recommendations

### For Production Machine Learning

**Classification Tasks**:
```python
# ❌ DON'T DO THIS
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)

# ✅ ALWAYS DO THIS
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
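
Why the first variant is risky: `KFold(n_splits=5)` does not shuffle, and the Iris rows are ordered by class. The sketch below counts the classes in each test fold; `sorted_y` and `placeholder_X` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

sorted_y = load_iris().target  # rows are ordered by class: 0...0, 1...1, 2...2
placeholder_X = np.zeros((len(sorted_y), 1))

unshuffled = KFold(n_splits=5)  # shuffle defaults to False

# Class counts inside every test fold
unshuffled_counts = [
    np.bincount(sorted_y[test_idx], minlength=3).tolist()
    for _, test_idx in unshuffled.split(placeholder_X)
]

for fold_no, counts in enumerate(unshuffled_counts, start=1):
    print(f"Fold {fold_no}: class counts in test set = {counts}")
```

The first test fold contains only class 0 and the last only class 2, so each fold tests the model on a badly skewed slice of the data.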

**Regression Tasks**:
```python
# ✅ Use regular K-Fold (no classes to balance)
from sklearn.model_selection import KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```

**Time-Series Tasks**:
```python
# ✅ Use TimeSeriesSplit (coming in Notebook 03!)
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
```

### Best Practices

1. ✅ **Default to Stratified K-Fold** for all classification
2. ✅ **Use 5-10 folds** (5 is standard, 10 for small datasets)
3. ✅ **Always set random_state** for reproducibility (it only takes effect with shuffle=True)
4. ✅ **Always shuffle=True** unless time-series
5. ⚠️ **Check class distribution** before choosing method
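
Related to the first point: when `cv` is given as an integer, scikit-learn picks the splitter itself. The sketch below uses `check_cv` from `sklearn.model_selection` to show which splitter results; the `labels` array is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

labels = np.array([0, 1] * 10)  # hypothetical binary labels

# What scikit-learn builds from cv=5 for a classifier vs. a regressor
cv_for_classifier = check_cv(5, labels, classifier=True)
cv_for_regressor = check_cv(5, labels, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

The integer shortcut does not shuffle, so passing an explicit `StratifiedKFold(..., shuffle=True, random_state=...)` is still worthwhile when row order may carry structure.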

---

## 📊 Connection to Other Notebooks

### How This Fits the Project

**Notebook 01**: Intro to CV
- Showed why single split is unreliable ($5.00 vs $6.12)
- Used basic K-Fold on time-series (wrong method!)

**Notebook 02 (This One)**: K-Fold vs Stratified
- Explains stratification for classification
- Shows both methods work on balanced data
- **Critical for imbalanced problems**

**Notebook 03**: TimeSeriesSplit
- Will show the RIGHT method for temporal data
- Achieves $5.08 MAE (best result!)
- Respects chronological order

**Notebooks 04-05**: Hyperparameter Tuning
- All tuning uses TimeSeriesSplit (proper temporal CV)
- Stratified K-Fold used in Grid Search example (Iris)
- Shows CV must match data structure

---

## 🎯 Key Takeaways

### The Bottom Line

> **Always use Stratified K-Fold for classification.** Even when data is balanced (like Iris), there's no downside. When data is imbalanced, it's absolutely critical.

### Critical Numbers

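- **K-Fold**: 97.33% ± 2.49% (one shuffled split)
- **Stratified**: 96.00% ± 3.89% (class proportions fixed per fold)
- **Difference**: 1.33 percentage points, or 2 of 150 flowers (negligible on balanced data)
- **On imbalanced data**: The gap can be far larger, especially for minority-class metrics such as recall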

### Why This Matters

**Scenario**: 1% fraud detection dataset (illustrative)

| Method | Fold 1 Fraud % | Fold 2 Fraud % | Fold 3 Fraud % | Reliability |
|--------|---------------|---------------|---------------|-------------|
| K-Fold | 0.5% | 0.0% ❌ | 2.0% | Terrible |
| Stratified | 1.0% | 1.0% | 1.0% | Perfect ✅ |

Without Stratified K-Fold, a fold like Fold 2 could have **zero fraud examples**, making its evaluation of fraud detection worthless!

---

## ⚠️ Warning: Common Mistake

**FutureWarning in Results**:
```
FutureWarning: Using 'liblinear' solver for multiclass 
classification is deprecated. Use another solver...
```

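**What This Means**:
- Code works but relies on a deprecated solver/task combination
- Update to: `LogisticRegression(max_iter=1000, random_state=42)`
- Or use: `LogisticRegression(solver='lbfgs', max_iter=1000)`

**Limited Impact on the Analysis**: the warning itself is a library version issue, but a different solver can shift fold scores slightly, so use the same model settings in both CV runs when comparing them.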

---

## 🎓 Educational Value

**This notebook teaches**:
1. ✅ Why stratification matters (fair exams analogy)
2. ✅ When balanced data "hides" the need (Iris example)
3. ✅ How catastrophic imbalance can be (fraud example)
4. ✅ The foundation for proper model evaluation
5. ✅ Best practices for classification CV

**Next Step**: Learn TimeSeriesSplit for temporal data (Notebook 03) - where neither K-Fold nor Stratified K-Fold works!

---

*Stratified K-Fold: The standard for classification CV* 🎯
