## Cross Validation

Cross validation estimates how well a machine learning method performs on data it has not seen, which makes it possible to compare different methods fairly. It covers both steps of working with a model:
- **Training**: Estimating parameters (learning from data)
- **Testing**: Evaluating performance (how well it works on new data)

Key principle: never use the same data for both training and testing. What matters is how the method performs on data it wasn't trained on.
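To make the principle concrete, here is a minimal sketch of a single hold-out split on the full iris dataset. The names `X_full`, `y_full`, `train_acc` and `test_acc`, the 30% test size and the choice of a decision tree are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_full, y_full = load_iris(return_X_y=True)

# Hold out 30% of the samples; the model never sees them while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.3, stratify=y_full, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # optimistic: data the model has already seen
test_acc = model.score(X_test, y_test)     # estimate of performance on new data
print(f"Train accuracy: {train_acc:.3f}   |   Test accuracy: {test_acc:.3f}")
```

A single split like this depends on which samples happen to land in the test set. Cross validation reduces that dependence by repeating the split several times.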


In [1]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True) # A classification dataset (150 samples, 3 classes)
X, y = X[:20], y[:20]             # Keep the first 20 samples so the printed indices stay readable (all 20 are class 0)

X.shape, y.shape

((20, 4), (20,))

#### K-Fold Cross Validation
Divides the data into K blocks (folds). Each fold serves as the test set exactly once while the remaining K-1 folds train the model, so every sample is tested once and used for training K-1 times.


In [2]:
from sklearn.model_selection import KFold

kf = KFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

for i, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}   |   Train index: {train_index}   |   Test index: {test_index}")

Fold 1   |   Train index: [ 2  3  4  5  6  7  8  9 10 11 12 13 14 16 18 19]   |   Test index: [ 0  1 15 17]
Fold 2   |   Train index: [ 0  1  2  4  6  7  9 10 12 13 14 15 16 17 18 19]   |   Test index: [ 3  5  8 11]
Fold 3   |   Train index: [ 0  1  3  4  5  6  7  8  9 10 11 12 14 15 17 19]   |   Test index: [ 2 13 16 18]
Fold 4   |   Train index: [ 0  1  2  3  5  6  7  8 10 11 13 14 15 16 17 18]   |   Test index: [ 4  9 12 19]
Fold 5   |   Train index: [ 0  1  2  3  4  5  8  9 11 12 13 15 16 17 18 19]   |   Test index: [ 6  7 10 14]
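The bookkeeping behind K-fold can be checked directly. The sketch below counts how often each sample lands in a test set and in a training set; the placeholder array `X_demo` and the counter names are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X_demo = np.zeros((n_samples, 1))  # placeholder features; only the length matters here

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in kf_demo.split(X_demo):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print("Times each sample was tested: ", test_counts)
print("Times each sample was trained on: ", train_counts)
```

With 5 folds, every sample should appear once in a test set and four times in a training set.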


#### Stratified K-Fold
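Maintains the same proportion of each class in every fold. This matters most for imbalanced datasets, where a plain K-fold split can leave a fold with few or no samples of a minority class. The 20-sample subset used here contains a single class, so the folds printed below only illustrate the API.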


In [3]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

for i, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {i}   |   Train index: {train_index}   |   Test index: {test_index}")

Fold 1   |   Train index: [ 1  2  4  5  6  8  9 10 12 13 14 15 16 17 18 19]   |   Test index: [ 0  3  7 11]
Fold 2   |   Train index: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 15 17 18]   |   Test index: [ 5 14 16 19]
Fold 3   |   Train index: [ 0  1  2  3  5  7  8  9 10 11 13 14 15 16 18 19]   |   Test index: [ 4  6 12 17]
Fold 4   |   Train index: [ 0  1  3  4  5  6  7  8  9 11 12 13 14 16 17 19]   |   Test index: [ 2 10 15 18]
Fold 5   |   Train index: [ 0  2  3  4  5  6  7 10 11 12 14 15 16 17 18 19]   |   Test index: [ 1  8  9 13]
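Because the subset above holds only one class, the effect of stratification is easier to see on labels with two classes. The following sketch uses a hypothetical imbalanced label vector `y_demo` (15 samples of class 0, 5 of class 1) and a helper `minority_per_fold`, both invented for this illustration, to compare plain and stratified splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 15 samples of class 0 followed by 5 of class 1
y_demo = np.array([0] * 15 + [1] * 5)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

def minority_per_fold(splitter):
    """Count the class-1 samples in each test fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]

plain = minority_per_fold(KFold(n_splits=5))            # unshuffled, contiguous folds
strat = minority_per_fold(StratifiedKFold(n_splits=5))  # class ratio preserved per fold

print("KFold minority samples per test fold:          ", plain)
print("StratifiedKFold minority samples per test fold:", strat)
```

The unshuffled `KFold` leaves several test folds without any minority samples, while `StratifiedKFold` places one minority sample in each.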


#### Leave One Out Cross Validation (LOOCV)
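The extreme case of K-fold where K equals the number of samples n: each sample forms its own test set, so the model trains on n-1 samples and tests on 1. This gives each model the most training data possible, but it requires n model fits and becomes computationally expensive on large datasets.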


In [4]:
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()

for i, (train_index, test_index) in enumerate(loo.split(X), start=1):
    print(f"Fold {str(i).zfill(2)}   |   Train index: {train_index}   |   Test index: {test_index}")

Fold 01   |   Train index: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [0]
Fold 02   |   Train index: [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [1]
Fold 03   |   Train index: [ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [2]
Fold 04   |   Train index: [ 0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [3]
Fold 05   |   Train index: [ 0  1  2  3  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [4]
Fold 06   |   Train index: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [5]
Fold 07   |   Train index: [ 0  1  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [6]
Fold 08   |   Train index: [ 0  1  2  3  4  5  6  8  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [7]
Fold 09   |   Train index: [ 0  1  2  3  4  5  6  7  9 10 11 12 13 14 15 16 17 18 19]   |   Test index: [8]
...
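One way to see LOOCV as a special case of K-fold is to compare its test sets with those of an unshuffled `KFold` whose `n_splits` equals the number of samples. The array `X_demo` and the variable names below are assumptions for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.zeros((20, 4))  # placeholder with the same shape as the subset used above

loo_demo = LeaveOneOut()
n_fits = loo_demo.get_n_splits(X_demo)  # one model fit per sample

loo_tests = [test.tolist() for _, test in loo_demo.split(X_demo)]
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(X_demo)).split(X_demo)]

print("Number of model fits: ", n_fits)
print("Same test sets as KFold(n_splits=n): ", loo_tests == kfold_tests)
```

The number of fits grows linearly with the dataset size, which is where the computational cost comes from.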

#### Repeated Stratified K-Fold
Runs stratified K-fold multiple times, reshuffling the data before each repetition. Averaging the results across these different partitions gives a more robust performance estimate. With `n_splits=5` and `n_repeats=3`, the splitter yields 15 train/test splits.


In [5]:
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(
    n_splits=5,
    n_repeats=3,
    random_state=42
)

for i, (train_index, test_index) in enumerate(rskf.split(X, y), start=1):
    print(f"Fold {str(i).zfill(2)}   |   Train index: {train_index}   |   Test index: {test_index}")

Fold 01   |   Train index: [ 1  2  4  5  6  8  9 10 12 13 14 15 16 17 18 19]   |   Test index: [ 0  3  7 11]
Fold 02   |   Train index: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 15 17 18]   |   Test index: [ 5 14 16 19]
Fold 03   |   Train index: [ 0  1  2  3  5  7  8  9 10 11 13 14 15 16 18 19]   |   Test index: [ 4  6 12 17]
Fold 04   |   Train index: [ 0  1  3  4  5  6  7  8  9 11 12 13 14 16 17 19]   |   Test index: [ 2 10 15 18]
Fold 05   |   Train index: [ 0  2  3  4  5  6  7 10 11 12 14 15 16 17 18 19]   |   Test index: [ 1  8  9 13]
Fold 06   |   Train index: [ 0  1  2  3  4  5  6  7  9 11 12 13 14 15 16 17]   |   Test index: [ 8 10 18 19]
Fold 07   |   Train index: [ 0  1  2  5  6  8 10 11 12 13 14 15 16 17 18 19]   |   Test index: [3 4 7 9]
Fold 08   |   Train index: [ 0  1  2  3  4  5  6  7  8  9 10 12 14 16 18 19]   |   Test index: [11 13 15 17]
Fold 09   |   Train index: [ 0  1  3  4  7  8  9 10 11 13 14 15 16 17 18 19]   |   Test index: [ 2  5  6 12]
...
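The split count can be verified without training a model. This sketch uses a hypothetical balanced label vector `y_demo` and checks that each sample is tested once per repetition; all names here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical balanced labels: 10 samples per class
y_demo = np.array([0] * 10 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

rskf_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

test_folds = [test_idx for _, test_idx in rskf_demo.split(X_demo, y_demo)]
test_counts = np.bincount(np.concatenate(test_folds), minlength=len(y_demo))

print("Total splits: ", len(test_folds))                   # n_splits * n_repeats
print("Times each sample was tested: ", test_counts)       # once per repetition
```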

#### Stratified Group K-Fold
Combines group-based splitting with stratification. Keeps related samples (groups) together while maintaining class proportions. Prevents data leakage when samples are naturally grouped.

**Example**: Medical dataset with multiple lab tests per patient. You want to predict disease but need to ensure all tests from one patient stay in the same fold (no patient in both train/test) while maintaining diseased/healthy ratios.


In [6]:
from sklearn.model_selection import StratifiedGroupKFold
import numpy as np

# Create groups - specifying which sample belongs to which group
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10])

sgkf = StratifiedGroupKFold(
    n_splits=4,
    shuffle=True, 
    random_state=42
)

for i, (train_index, test_index) in enumerate(sgkf.split(X, y, groups), start=1):
    # Show that groups don't overlap between train/test
    train_groups = np.unique(groups[train_index]).tolist()
    test_groups = np.unique(groups[test_index]).tolist()
    overlap = len(np.intersect1d(train_groups, test_groups))
    print(f"Fold {i}   |   Train Groups: {train_groups}   |   Test Groups: {test_groups}   |   Overlap: {overlap}")


Fold 1   |   Train Groups: [2, 3, 4, 6, 7, 8, 9]   |   Test Groups: [1, 5, 10]   |   Overlap: 0
Fold 2   |   Train Groups: [1, 3, 4, 5, 7, 8, 9, 10]   |   Test Groups: [2, 6]   |   Overlap: 0
Fold 3   |   Train Groups: [1, 2, 4, 5, 6, 8, 10]   |   Test Groups: [3, 7, 9]   |   Overlap: 0
Fold 4   |   Train Groups: [1, 2, 3, 5, 6, 7, 9, 10]   |   Test Groups: [4, 8]   |   Overlap: 0


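#### Sample Usage
A splitter such as `rskf` can be passed as `cv` to `cross_val_score`, which fits the model once per split and returns one score per split. The perfect scores below are not a meaningful result: the 20-sample subset contains only class 0, so every prediction is trivially correct.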

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(clf, X, y, cv=rskf)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Average CV Score:  1.0
Number of CV Scores used in Average:  15
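For scores that say something about the model, the same call can be run on the full iris dataset with all three classes. This is a sketch under that assumption; `X_full`, `y_full`, `cv_full` and `full_scores` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_full, y_full = load_iris(return_X_y=True)  # all 150 samples, 3 classes

cv_full = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
full_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_full, y_full, cv=cv_full
)

# Report the spread as well as the mean: the standard deviation shows how much
# the estimate depends on the particular split.
print(f"Mean accuracy: {full_scores.mean():.3f} +/- {full_scores.std():.3f} over {len(full_scores)} splits")
```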
