# Part 2.18: Supervised Learning - Cross-Validation

A single train/test split is a good start, but the score it produces depends on which samples happen to land in the test set. **Cross-Validation (CV)** evaluates the model on several different splits and averages the results, which gives a more reliable estimate of your model's performance on unseen data.
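
To make the dependence on the split concrete, here is a minimal sketch that scores the same model on five different random train/test splits. The 80/20 split size and the seeds 0 to 4 are illustrative assumptions, not values taken from this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):  # seeds chosen arbitrarily for illustration
    # Each seed produces a different 80/20 partition of the same data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print(f"Accuracy per split: {split_scores}")
```

Any spread between these scores comes only from how the data was partitioned, since the model and the dataset stay the same.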

### The K-Fold Cross-Validation Process
1.  **Split**: The training data is split into 'k' smaller sets or 'folds'.
2.  **Train & Evaluate**: A model is trained on k-1 of the folds, and the resulting model is validated on the one remaining fold (the holdout fold).
3.  **Repeat**: This process is repeated k times, with each of the k folds used exactly once as the validation data.
4.  **Average**: The k results are then averaged to produce a single, more robust performance estimate.
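
The four steps above can be sketched by hand with `KFold`. This is an illustrative loop, not how `cross_val_score` is implemented internally; `shuffle=True` and `random_state=0` are assumptions made here so that the folds mix the classes and are reproducible.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# 1. Split: define k=5 folds (shuffled, since Iris is ordered by class)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
# 3. Repeat: each fold serves as the validation set exactly once
for train_idx, val_idx in kf.split(X):
    # 2. Train & Evaluate: fit a fresh copy on k-1 folds, score on the holdout
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# 4. Average: combine the k scores into one estimate
mean_score = np.mean(fold_scores)
print(f"Fold scores: {fold_scores}")
print(f"Mean accuracy: {mean_score:.4f}")
```

Using `clone` gives every fold an unfitted copy of the model, so no state carries over from one fold to the next.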

In [1]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Create a model
model = LogisticRegression(max_iter=200)

### Using `cross_val_score`
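Scikit-learn provides a helper function, `cross_val_score`, that performs the whole procedure in one call: it fits and scores the model once per fold and returns an array with one score per fold.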

In [2]:
# The 'cv' parameter determines the number of folds (k)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"Scores for each fold: {scores}")
print(f"\nMean accuracy: {scores.mean():.4f}")
print(f"Standard deviation: {scores.std():.4f}")

Scores for each fold: [0.96666667 1.         0.93333333 0.96666667 1.        ]

Mean accuracy: 0.9733
Standard deviation: 0.0249


### Stratified K-Fold
When doing cross-validation for a classification problem, it's important to use **stratification**. This means that each fold has approximately the same percentage of samples of each target class as the complete set. Without it, a fold can over- or under-represent a class, which distorts the score for that fold.
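
As a sketch of why this matters, the snippet below counts the classes in each validation fold for plain `KFold` and for `StratifiedKFold`, both without shuffling. The helper `validation_class_counts` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def validation_class_counts(splitter):
    """Return, for each fold, how many samples of each class are held out."""
    counts = []
    for _, val_idx in splitter.split(X, y):
        counts.append(np.bincount(y[val_idx], minlength=3).tolist())
    return counts

# Iris is stored sorted by class, so unshuffled KFold cuts along class boundaries
plain_counts = validation_class_counts(KFold(n_splits=3))
stratified_counts = validation_class_counts(StratifiedKFold(n_splits=3))

print(f"KFold:           {plain_counts}")
print(f"StratifiedKFold: {stratified_counts}")
```

Because the Iris samples are ordered by class, each unshuffled `KFold` validation fold contains a single class, so the model is always tested on a class it never saw during training. The stratified folds keep all three classes in roughly equal proportion.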

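When `cv` is an integer, `cross_val_score` automatically uses stratified k-fold if the model is a classifier and the target is binary or multiclass. You can also use the `StratifiedKFold` object explicitly for more control, for example to inspect the folds or to enable shuffling.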

In [4]:
from sklearn.model_selection import StratifiedKFold
import numpy as np

# shuffle defaults to False, so the folds are the same on every run
skf = StratifiedKFold(n_splits=3)

fold_no = 1
for train_index, test_index in skf.split(X, y):
    # Get the data for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Print the distribution of classes in the test set for this fold
    unique, counts = np.unique(y_test, return_counts=True)
    # Convert NumPy integers to plain Python ints for readable output
    distribution = dict(zip(unique.tolist(), counts.tolist()))
    print(f"Fold {fold_no} - Class Distribution: {distribution}")
    fold_no += 1

Fold 1 - Class Distribution: {0: 17, 1: 17, 2: 16}
Fold 2 - Class Distribution: {0: 17, 1: 16, 2: 17}
Fold 3 - Class Distribution: {0: 16, 1: 17, 2: 17}
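
An explicit `StratifiedKFold` object can also be passed to `cross_val_score` through its `cv` parameter. The sketch below does this with shuffling enabled; `n_splits=5` and `random_state=42` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Shuffle before splitting; a fixed random_state keeps the folds reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"Scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean():.4f}")
```

Passing the splitter object instead of an integer keeps the fold definition in one place, so the same folds can be reused when comparing several models.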
