# K-Fold Cross-Validation

# Math Notation: K-Fold Cross-Validation

Let:

- \( D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n \): the dataset with \( n \) samples
- \( D_1, D_2, \ldots, D_K \): the \( K \) disjoint partitions (folds) of \( D \), of roughly equal size

Then the **K-Fold Cross-Validation estimate of error** is:

$$
\text{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}(f_{-k}, D_k)
$$

Where:

- \( f_{-k} \): the model trained on all data **except** fold \( k \)
- \( \mathcal{L} \): a loss function (e.g., MSE, log-loss) or a score such as accuracy, in which case \( \text{CV}_K \) estimates a score rather than an error
- \( D_k \): the held-out test data (fold \( k \))
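To make the formula concrete, here is a minimal sketch that evaluates \( \text{CV}_K \) by hand. It assumes MSE as \( \mathcal{L} \) and a deliberately trivial model that always predicts the training-set mean; the toy data, the function name `mean_predictor_cv`, and the contiguous (unshuffled) split are illustrative choices, not part of the definition above.

```python
import numpy as np

def mean_predictor_cv(y, k):
    """CV_K with MSE loss for a model that predicts the training mean (illustrative)."""
    folds = np.array_split(np.arange(len(y)), k)   # index sets for D_1, ..., D_K
    fold_losses = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False               # all data except fold k
        prediction = y[train_mask].mean()          # "fit" f_{-k}
        # L(f_{-k}, D_k): mean squared error on the held-out fold
        fold_losses.append(np.mean((y[test_idx] - prediction) ** 2))
    return np.mean(fold_losses), fold_losses       # CV_K is the average over folds

y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
cv_k, losses = mean_predictor_cv(y_toy, k=3)
print("Per-fold MSE:", losses)
print("CV_K:", cv_k)
```

Each entry of `losses` corresponds to one term of the sum, and `cv_k` is their average.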

# [Code from Scratch]

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def k_fold_cross_val(X, y, model_cls, k=5):
    """
    Perform k-fold cross-validation from scratch.
    
    Parameters:
        X: np.ndarray of shape (n_samples, n_features)
        y: np.ndarray of shape (n_samples,)
        model_cls: Class of model (e.g., LogisticRegression)
        k: int - number of folds

    Returns:
        List of accuracy scores across folds
    """
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    np.random.shuffle(indices)  # Unseeded shuffle: fold membership changes between runs

    # Fold sizes differ by at most one sample
    fold_sizes = np.full(k, n_samples // k, dtype=int)
    fold_sizes[:n_samples % k] += 1  # Distribute remainder
    current = 0
    scores = []

    for fold_size in fold_sizes:
        # Hold out the current slice of shuffled indices; train on the rest
        start, stop = current, current + fold_size
        test_idx = indices[start:stop]
        train_idx = np.concatenate([indices[:start], indices[stop:]])

        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Fit a fresh model per fold so no state carries over between folds
        model = model_cls()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        scores.append(acc)
        current = stop

    return scores
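Because `np.random.shuffle` draws from NumPy's global random state, the fold assignment above differs from run to run unless that state is seeded. The sketch below isolates the splitting step and makes it reproducible; the helper name `shuffled_folds` and its `seed` parameter are hypothetical additions for illustration, and `np.array_split` is used as a shortcut that spreads the remainder over the first folds in the same way as the `fold_sizes` logic.

```python
import numpy as np

def shuffled_folds(n_samples, k=5, seed=0):
    """Return k disjoint index arrays covering range(n_samples) (illustrative helper)."""
    rng = np.random.default_rng(seed)        # local generator, independent of global state
    indices = rng.permutation(n_samples)     # shuffled copy of 0..n_samples-1
    return np.array_split(indices, k)        # first n_samples % k folds get one extra sample

# 569 is the number of samples in the breast cancer dataset used in this notebook
folds = shuffled_folds(569, k=5, seed=0)
fold_lengths = [len(f) for f in folds]
print("Fold sizes:", fold_lengths)
```

Calling `shuffled_folds` twice with the same `seed` yields identical folds, which makes per-fold scores comparable across runs.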

In [2]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

acc_scores = k_fold_cross_val(X, y, LogisticRegression, k=5)
print("Fold accuracies:", acc_scores)
print("Mean accuracy:", np.mean(acc_scores))

Fold accuracies: [0.9298245614035088, 0.8947368421052632, 0.9298245614035088, 0.956140350877193, 0.9292035398230089]
Mean accuracy: 0.9279459711224967


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
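The warning text suggests either raising `max_iter` or scaling the data. One way to act on the second suggestion is sketched below, assuming a `StandardScaler` placed in front of `LogisticRegression` via `make_pipeline`; this is an illustrative variant, not the configuration that produced the scores above, so its results will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so the scaler is re-fit on each training
# split and the held-out fold never influences its mean/variance estimates.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)
print("Scaled CV mean:", scaled_scores.mean())
```

Since `k_fold_cross_val` only ever calls `model_cls()` with no arguments, the same pipeline could also be passed to it as a zero-argument callable, for example `lambda: make_pipeline(StandardScaler(), LogisticRegression())`.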

# [Code using sklearn]

In [3]:
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)
print("Sklearn CV Mean:", scores.mean())

Sklearn CV Mean: 0.9367644775655954
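The mean here is not expected to match the from-scratch result exactly, in part because the splits differ: when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, whereas the from-scratch function shuffles and ignores class balance. A closer analogue, sketched below, passes an explicit `KFold` splitter; the `random_state=0` value is an arbitrary choice for reproducibility, and the model is left at its defaults, so the same convergence warning may appear.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Plain (unstratified) K-fold with shuffling, mirroring the from-scratch logic
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)

# Inspect the test-fold sizes produced by the splitter
test_sizes = [len(test_idx) for _, test_idx in kfold.split(X)]
print("Test fold sizes:", test_sizes)
print("KFold CV mean:", kfold_scores.mean())
```

The shuffle here uses scikit-learn's own random stream, so the per-fold scores still will not reproduce the from-scratch numbers digit for digit.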


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(