# 🎯 K-Fold Cross Validation: Professional Implementation

## 📊 Overview
This notebook compares three classifiers (Logistic Regression, SVM, and Random Forest) on the scikit-learn digits dataset using K-Fold cross validation. We'll explore:

- 🔄 **K-Fold Cross Validation Concepts**
- 📈 **Model Performance Comparison** 
- ⚙️ **Hyperparameter Tuning**
- 📋 **Professional Results Analysis**

### 🎪 What You'll Learn:
- How to implement manual and automated cross validation
- Best practices for model evaluation
- Professional code structure and documentation
- Variability of model performance across folds (mean and standard deviation)

## 📚 Step 1: Import Required Libraries

**Purpose:** Import all necessary libraries organized by category for clean code structure.

- 🔢 **NumPy & Matplotlib:** For numerical computations and visualizations
- 🤖 **Scikit-learn:** For machine learning algorithms and cross validation tools
- 🎲 **Models:** Logistic Regression, SVM, and Random Forest classifiers

In [28]:
# K-Fold Cross Validation Comparison
# Professional implementation of K-Fold cross validation for model comparison

# Data Science Libraries
import numpy as np
import matplotlib.pyplot as plt

# Scikit-learn Core
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [29]:
# Data Loading and Preparation
digits = load_digits()
print(f"Dataset shape: {digits.data.shape}")
print(f"Number of classes: {len(np.unique(digits.target))}")

# Initial train-test split for baseline comparison
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

Dataset shape: (1797, 64)
Number of classes: 10


## 🗃️ Step 2: Data Loading & Preparation

**Objective:** Load the digits dataset and prepare it for model training.

- 📋 **Dataset:** Handwritten digits (0-9) classification
- 🔄 **Train-Test Split:** 70% training, 30% testing
- 🎯 **Random State:** Fixed for reproducible results

![Data Loading](https://media.giphy.com/media/3oKIPnAiaMCws8nOsE/giphy.gif)

In [30]:
# Baseline Model Performance (Single Train-Test Split)
print("Baseline Performance with Single Train-Test Split:")
print("=" * 50)

# Logistic Regression
lr = LogisticRegression(solver='newton-cg', random_state=42)
lr_score = lr.fit(X_train, y_train).score(X_test, y_test)
print(f"Logistic Regression: {lr_score:.4f}")

# Support Vector Machine
svm = SVC(gamma='auto', random_state=42)
svm_score = svm.fit(X_train, y_train).score(X_test, y_test)
print(f"Support Vector Machine: {svm_score:.4f}")

# Random Forest
rf = RandomForestClassifier(n_estimators=40, random_state=42)
rf_score = rf.fit(X_train, y_train).score(X_test, y_test)
print(f"Random Forest: {rf_score:.4f}")

Baseline Performance with Single Train-Test Split:
Logistic Regression: 0.9704
Support Vector Machine: 0.3926
Random Forest: 0.9685


## 🎯 Step 3: Baseline Model Performance

**Goal:** Establish baseline performance using a single train-test split.

- 🔫 **Quick Fire:** Test all three models rapidly
- 📊 **Comparison:** Get initial performance estimates
- ⚠️ **Limitation:** Single split may not be representative

> 💡 **Pro Tip:** This is just the starting point - cross validation will give us more reliable results!
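To see why a single split can mislead, here is a minimal sketch that repeats the 70/30 split with several seeds and reports the spread in accuracy. The seed list and the small forest (`n_estimators=10`, chosen only to keep the run short) are illustrative assumptions, not values used elsewhere in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
seeds = [0, 1, 2, 3, 4]  # hypothetical seeds, chosen for illustration

split_scores = []
for seed in seeds:
    # Only the split changes between runs; the model seed stays fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

print("Accuracy per split:", [f"{s:.4f}" for s in split_scores])
print(f"Range across splits: {max(split_scores) - min(split_scores):.4f}")
```

If the range is not negligible compared with the gap between two models, a single split is probably not enough to rank them.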

In [31]:
# K-Fold Cross Validation Demonstration
print("\nK-Fold Cross Validation Concept:")
print("=" * 40)

# Simple K-Fold example
kf = KFold(n_splits=3, shuffle=True, random_state=42)
print("K-Fold split demonstration with sample data [1,2,3,4,5,6,7,8,9]:")
for fold, (train_index, test_index) in enumerate(kf.split([1,2,3,4,5,6,7,8,9]), 1):
    print(f"Fold {fold}: Train indices: {train_index}, Test indices: {test_index}")


K-Fold Cross Validation Concept:
K-Fold split demonstration with sample data [1,2,3,4,5,6,7,8,9]:
Fold 1: Train indices: [0 2 3 4 6 8], Test indices: [1 5 7]
Fold 2: Train indices: [1 3 4 5 6 7], Test indices: [0 2 8]
Fold 3: Train indices: [0 1 2 5 7 8], Test indices: [3 4 6]


## 🔄 Step 4: K-Fold Cross Validation Concept

**Understanding K-Fold:** Learn how data is split into multiple folds for robust evaluation.

- 📂 **K=3 Folds:** Data divided into 3 equal parts
- 🔄 **Rotation:** Each fold serves as test set once
- 🎯 **Result:** 3 different train-test combinations

![K-Fold Animation](https://miro.medium.com/v2/resize:fit:1400/1*J2B_bcbd1-s1kfJeHPEEiQ.gif)

### 🧠 Why K-Fold?
- ✅ More reliable performance estimates
- ✅ Better use of available data
- ✅ Reduces impact of data splitting variance
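The rotation idea can be sketched without any library. The helper below, `kfold_indices`, is a hypothetical name used for illustration; it builds contiguous, unshuffled folds, so the indices will not match the shuffled `KFold` output shown above.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(kfold_indices(9, 3))
for fold, (train, test) in enumerate(splits, 1):
    print(f"Fold {fold}: train={train}, test={test}")

# Every sample lands in a test fold exactly once
all_test = sorted(i for _, test in splits for i in test)
print("Each index tested once:", all_test == list(range(9)))
```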

In [32]:
# Utility Functions
def get_score(model, X_train, X_test, y_train, y_test):
    """
    Train a model and return its accuracy score.
    
    Parameters:
    model: sklearn model instance
    X_train, X_test: training and testing features
    y_train, y_test: training and testing targets
    
    Returns:
    float: accuracy score
    """
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

def display_cv_results(scores, model_name):
    """Display cross-validation results in a formatted way."""
    print(f"{model_name}:")
    print(f"  Individual scores: {[f'{score:.4f}' for score in scores]}")
    print(f"  Mean: {np.mean(scores):.4f} (+/- {np.std(scores) * 2:.4f})")
    print()

## 🛠️ Step 5: Professional Utility Functions

**Code Organization:** Create reusable helper functions for clean, maintainable code.

- 🎯 **get_score():** Train model and return accuracy
- 📊 **display_cv_results():** Format cross validation results professionally
- 📈 **Documentation:** Clear docstrings describing parameters and return values

> 🏗️ **Best Practice:** Well-documented utility functions make code reusable and professional!
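A quick sketch of what the `+/-` figure printed by `display_cv_results()` means: it is twice the population standard deviation of the fold scores (NumPy's default for `np.std`). With only three folds this is a rough indicator of spread, not a formal confidence interval. The scores below are the Logistic Regression values from the manual cross validation output in this notebook; the standard-library `statistics` module is used here only to keep the example dependency-free.

```python
import statistics

# Logistic Regression fold scores from the manual CV output in this notebook
scores = [0.9616, 0.9783, 0.9633]

mean = statistics.mean(scores)
pop_std = statistics.pstdev(scores)    # population std, same as np.std(scores)
sample_std = statistics.stdev(scores)  # sample std, same as np.std(scores, ddof=1)

print(f"Mean: {mean:.4f}")
print(f"+/- 2 * population std: {2 * pop_std:.4f}")
print(f"+/- 2 * sample std:     {2 * sample_std:.4f}")
```

The sample standard deviation is always the larger of the two, and the difference is noticeable when there are only a few folds.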

In [33]:
# Manual Stratified K-Fold Cross Validation
print("Manual Stratified K-Fold Cross Validation (3 folds):")
print("=" * 55)

# Initialize stratified K-fold
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Initialize score lists
scores_logistic = []
scores_svm = []
scores_rf = []

# Perform cross validation
for fold, (train_index, test_index) in enumerate(folds.split(digits.data, digits.target), 1):
    print(f"Processing Fold {fold}...")
    
    # Split data
    X_train_fold = digits.data[train_index]
    X_test_fold = digits.data[test_index]
    y_train_fold = digits.target[train_index]
    y_test_fold = digits.target[test_index]
    
    # Evaluate models
    scores_logistic.append(get_score(
        LogisticRegression(solver='newton-cg', random_state=42), 
        X_train_fold, X_test_fold, y_train_fold, y_test_fold
    ))
    scores_svm.append(get_score(
        SVC(gamma='auto', random_state=42), 
        X_train_fold, X_test_fold, y_train_fold, y_test_fold
    ))
    scores_rf.append(get_score(
        RandomForestClassifier(n_estimators=40, random_state=42), 
        X_train_fold, X_test_fold, y_train_fold, y_test_fold
    ))

print("\nManual Cross Validation Results:")
print("-" * 35)
display_cv_results(scores_logistic, "Logistic Regression")
display_cv_results(scores_svm, "Support Vector Machine")
display_cv_results(scores_rf, "Random Forest")

Manual Stratified K-Fold Cross Validation (3 folds):
Processing Fold 1...
Processing Fold 2...
Processing Fold 3...

Manual Cross Validation Results:
-----------------------------------
Logistic Regression:
  Individual scores: ['0.9616', '0.9783', '0.9633']
  Mean: 0.9677 (+/- 0.0150)

Support Vector Machine:
  Individual scores: ['0.5659', '0.5659', '0.5710']
  Mean: 0.5676 (+/- 0.0047)

Random Forest:
  Individual scores: ['0.9733', '0.9683', '0.9766']
  Mean: 0.9727 (+/- 0.0069)



## 🔧 Step 6: Manual Stratified K-Fold Implementation

**Deep Dive:** Implement cross validation manually to understand the process step-by-step.

- 🎯 **Stratified K-Fold:** Maintains class distribution in each fold
- 🔄 **3 Folds:** Train on 2 folds, test on 1 fold
- 📊 **All Models:** Test Logistic Regression, SVM, and Random Forest
- 💪 **Manual Control:** Full understanding of the validation process
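To make "maintains class distribution" concrete, here is a small sketch on a hypothetical imbalanced label array (9 samples of class 0, 3 of class 1). The toy data is an assumption for illustration; the digits dataset itself is close to balanced.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 9 samples of class 0, 3 of class 1
y = np.array([0] * 9 + [1] * 3)
X = np.zeros((len(y), 1))  # features do not affect how the split is made

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    counts = np.bincount(y[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(f"Fold {fold}: test class counts = {counts.tolist()}")
```

Each test fold keeps the 3:1 class ratio of the full label array. A plain `KFold` gives no such guarantee and could leave a fold with no minority-class samples.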

<p style="text-align:center;">
  <img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="500" height="300" alt="K-fold cross validation diagram">
</p>


### 🎪 The Magic Happens Here:
Each fold gives us a separate performance estimate on data the model never saw during training! 🎭

In [34]:
# Automated Cross Validation using cross_val_score
print("Automated Cross Validation with cross_val_score (3 folds):")
print("=" * 60)

# Define models
models = {
    'Logistic Regression': LogisticRegression(solver='newton-cg', random_state=42),
    'Support Vector Machine': SVC(gamma='auto', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=40, random_state=42)
}

# Perform automated cross validation
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, digits.data, digits.target, cv=3, scoring='accuracy')
    cv_results[name] = scores
    display_cv_results(scores, name)

Automated Cross Validation with cross_val_score (3 folds):
Logistic Regression:
  Individual scores: ['0.9232', '0.9416', '0.9182']
  Mean: 0.9277 (+/- 0.0201)

Support Vector Machine:
  Individual scores: ['0.3806', '0.4107', '0.5125']
  Mean: 0.4346 (+/- 0.1129)

Random Forest:
  Individual scores: ['0.9316', '0.9366', '0.9249']
  Mean: 0.9310 (+/- 0.0096)



## ⚡ Step 7: Automated Cross Validation with Scikit-Learn

**Efficiency Boost:** Use scikit-learn's built-in `cross_val_score` for rapid evaluation.

- 🚀 **One-Liner:** Powerful cross validation in minimal code
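- 🔄 **Comparable Results:** Scores differ from the manual run because `cv=3` uses an unshuffled `StratifiedKFold`; pass the same splitter object to get matching folds
- ⚡ **Speed:** Less code to write, and folds can be evaluated in parallel via `n_jobs`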
- 🎯 **Industry Standard:** How professionals do it in practice

![Automated Process](https://media.giphy.com/media/JIX9t2j0ZTN9S/giphy.gif)

### 🏆 Pro Level: Compare with manual results!
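With `cv=3`, `cross_val_score` uses an unshuffled `StratifiedKFold` for classifiers, while the manual loop shuffles with `random_state=42`, so the two sets of scores above are computed on different folds. A minimal sketch of how to line them up is to pass the same splitter object to both. The small forest (`n_estimators=10`) is an assumption made only to keep the run short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_digits(return_X_y=True)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42)

# Manual loop: fit a fresh copy of the model on each training fold
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fitted.score(X[test_idx], y[test_idx]))

# Automated: hand the same splitter to cross_val_score instead of cv=3
auto_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("Manual:   ", np.round(manual_scores, 4))
print("Automated:", np.round(auto_scores, 4))
print("Match:", np.allclose(manual_scores, auto_scores))
```

Because the splitter and the model both have fixed seeds, the two approaches see identical folds and should report identical accuracies.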

In [35]:
# Hyperparameter Tuning: Random Forest n_estimators
print("Random Forest Hyperparameter Tuning (n_estimators):")
print("=" * 50)

# Test different n_estimators values
n_estimators_values = [5, 20, 30, 40]
rf_tuning_results = {}

for n_est in n_estimators_values:
    model = RandomForestClassifier(n_estimators=n_est, random_state=42)
    scores = cross_val_score(model, digits.data, digits.target, cv=10, scoring='accuracy')
    mean_score = np.mean(scores)
    rf_tuning_results[n_est] = {
        'scores': scores,
        'mean': mean_score,
        'std': np.std(scores)
    }
    print(f"n_estimators={n_est:2d}: Mean accuracy = {mean_score:.4f} (+/- {np.std(scores) * 2:.4f})")

# Find best parameter
best_n_estimators = max(rf_tuning_results.keys(), key=lambda x: rf_tuning_results[x]['mean'])
print(f"\nBest n_estimators: {best_n_estimators} with accuracy: {rf_tuning_results[best_n_estimators]['mean']:.4f}")

Random Forest Hyperparameter Tuning (n_estimators):
n_estimators= 5: Mean accuracy = 0.8787 (+/- 0.0912)
n_estimators=20: Mean accuracy = 0.9332 (+/- 0.0658)
n_estimators=30: Mean accuracy = 0.9421 (+/- 0.0540)
n_estimators=40: Mean accuracy = 0.9466 (+/- 0.0500)

Best n_estimators: 40 with accuracy: 0.9466


## 🎛️ Step 8: Hyperparameter Tuning with Cross Validation

**Optimization:** Find the best parameters using systematic cross validation approach.

- 🔫 **Target:** Random Forest `n_estimators` parameter
- 🎯 **Values Tested:** [5, 20, 30, 40] trees
- 📊 **10-Fold CV:** More robust evaluation with 10 folds
- 🏆 **Best Performance:** Data-driven parameter selection

![Parameter Tuning](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdW4zNjloNTMwZnVnamN5bmQ0djExcWFybXZ0OTA5MXo0NDNoMDE2dSZlcD12MV9naWZzX3NlYXJjaCZjdD1n/o6S51npJYQM48/giphy.gif)

### 🎪 The Gun Show: 💪
Watch as we systematically find the best number of trees among the tested values! 🌳🌳🌳 Accuracy is still rising at 40 trees, so larger values may be worth testing.
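The same search can be expressed with scikit-learn's `GridSearchCV`, which runs the loop, stores per-parameter results, and refits the best model. This is a sketch rather than a replacement for the cell above: the reduced grid (`[5, 20]`) and `cv=3` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Hypothetical reduced grid and fold count, chosen to keep the sketch fast
param_grid = {"n_estimators": [5, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# One mean CV score per candidate parameter setting
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(f"{params}: mean accuracy = {mean:.4f}")
print("Best:", search.best_params_)
```

The best cross validation score from a search like this was also used to pick the parameter, so it tends to be slightly optimistic as an estimate of performance on new data.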

In [36]:
# Summary and Model Comparison
print("=" * 60)
print("COMPREHENSIVE MODEL PERFORMANCE SUMMARY")
print("=" * 60)

print("\n1. BASELINE PERFORMANCE (Single Train-Test Split):")
print("   - Shows performance on a single random split")
print("   - May not be representative due to variance in data splits")

print("\n2. CROSS VALIDATION RESULTS:")
print("   - More reliable performance estimates")
print("   - Reduces variance by averaging over multiple folds")
print("   - Helps detect overfitting and model stability")

print("\n3. HYPERPARAMETER TUNING:")
print("   - Systematic approach to find optimal parameters")
print("   - Uses cross validation to avoid overfitting")
print("   - Demonstrates the impact of model complexity")

print("\n4. KEY INSIGHTS:")
print("   - Cross validation provides more robust performance estimates")
print("   - Model ranking may change between single split and CV")
print("   - Proper hyperparameter tuning can significantly improve performance")
print("   - Random Forest shows sensitivity to n_estimators parameter")

print("\nRECOMMENDATIONS:")
print("- Always use cross validation for model evaluation")
print("- Consider stratified K-fold for classification tasks")
print("- Tune hyperparameters using nested cross validation")
print("- Report mean and standard deviation of CV scores")

COMPREHENSIVE MODEL PERFORMANCE SUMMARY

1. BASELINE PERFORMANCE (Single Train-Test Split):
   - Shows performance on a single random split
   - May not be representative due to variance in data splits

2. CROSS VALIDATION RESULTS:
   - More reliable performance estimates
   - Reduces variance by averaging over multiple folds
   - Helps detect overfitting and model stability

3. HYPERPARAMETER TUNING:
   - Systematic approach to find optimal parameters
   - Uses cross validation to avoid overfitting
   - Demonstrates the impact of model complexity

4. KEY INSIGHTS:
   - Cross validation provides more robust performance estimates
   - Model ranking may change between single split and CV
   - Proper hyperparameter tuning can significantly improve performance
   - Random Forest shows sensitivity to n_estimators parameter

RECOMMENDATIONS:
- Always use cross validation for model evaluation
- Consider stratified K-fold for classification tasks
- Tune hyperparameters using nested cross validation
- Report mean and standard deviation of CV scores

## 🏁 Conclusion & Next Steps

### 🎯 What We Accomplished:
- ✅ **Professional Implementation** of K-Fold cross validation
- ✅ **Manual vs Automated** comparison and understanding
- ✅ **Hyperparameter Tuning** with systematic approach
- ✅ **Best Practices** for model evaluation

### 🚀 Next Level Actions:
1. **Grid Search CV** for comprehensive hyperparameter tuning
2. **Nested Cross Validation** for unbiased performance estimates  
3. **Pipeline Integration** for preprocessing and model training
4. **Statistical Testing** for significance of performance differences
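As a starting point for the first three items, here is a hedged sketch that combines a preprocessing pipeline, a grid search, and nested cross validation. The grid values for `C`, the fold seeds, and the use of `SVC` with its default `gamma` are all illustrative assumptions. Scaling is included because unscaled pixel values are a plausible reason for the low SVM scores earlier in this notebook, though that is not verified here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Pipeline: the scaler is re-fit on each training fold, so no test data leaks in
pipeline = make_pipeline(StandardScaler(), SVC())

# Inner loop picks C; outer loop scores the whole tuning procedure
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, {"svc__C": [1, 10]}, cv=inner_cv)

nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Outer-fold accuracy:", [f"{s:.4f}" for s in nested_scores])
print(f"Mean: {nested_scores.mean():.4f} (+/- {nested_scores.std() * 2:.4f})")
```

The outer folds are never seen by the inner search, which is what makes the nested estimate less biased than reporting the best score from a single grid search.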

---

### 💪 The Gun Show Results: 🔫
**You've just witnessed professional-grade machine learning evaluation!**

![Mission Accomplished](https://media.giphy.com/media/3ohzdIuqJoo8QdKlnW/giphy.gif)

---
*📈 Keep practicing, keep improving! The journey to ML mastery continues...* 🎯

## 🎊 Step 9: Comprehensive Analysis & Recommendations

**Final Showdown:** Professional summary of all findings and best practices.

- 📊 **Performance Comparison:** Baseline vs Cross Validation
- 🎯 **Key Insights:** What we learned from the analysis
- 💡 **Best Practices:** Professional recommendations
- 🏆 **Winner Selection:** Data-driven model choice

![Success](https://media.giphy.com/media/26u4cqiYI30juCOGY/giphy.gif)

### 🔫 The Final Gun: 
Time to deliver the knockout punch with professional insights! 💥