# K-Fold Cross Validation

## Mathematical Definition of K-Fold Cross Validation

K-Fold Cross Validation is a widely used resampling procedure to evaluate machine learning models on a limited data sample. The general procedure is as follows:

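1.  **Shuffle the Dataset**: Randomly shuffle the dataset so that fold membership does not depend on the order in which the samples were stored (for example, sorted by class), making each fold more likely to be representative of the overall data distribution.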

2.  **Split into K Folds**: Divide the shuffled dataset into $K$ equally sized (or as equally sized as possible) folds or subsets. Let the total number of samples in the dataset be $N$. Each fold will contain approximately $N/K$ samples.

3.  **Iterate K Times**: For each of the $K$ folds, the following steps are performed:
    a.  **Validation Set**: One fold is held out as the validation (or test) set.
    b.  **Training Set**: The remaining $K-1$ folds are combined to form the training set.
    c.  **Model Training**: A machine learning model is trained on the training set.
    d.  **Model Evaluation**: The trained model is evaluated on the validation set, and a performance metric (e.g., accuracy, precision, recall, F1-score, MSE) is recorded.

4.  **Aggregate Results**: After $K$ iterations, $K$ values of the chosen performance metric are obtained, one per fold. The final performance estimate is typically the average of these $K$ values.
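
The four steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper names (`k_fold_indices`, `fit_majority`, `accuracy`), the toy labels, and the majority-class "model" are all assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # step 1: shuffle

    # Step 2: split into k folds whose sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Step 3: each fold serves as the validation set exactly once.
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def fit_majority(labels):
    """Toy 'model' (hypothetical): remember the most frequent training label."""
    return max(set(labels), key=labels.count)


def accuracy(prediction, labels):
    """Fraction of labels equal to the single predicted label."""
    return sum(label == prediction for label in labels) / len(labels)


labels = [0] * 7 + [1] * 3                          # hypothetical toy labels
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(labels), k=3):
    model = fit_majority([labels[i] for i in train_idx])               # step 3c
    fold_scores.append(accuracy(model, [labels[i] for i in val_idx]))  # step 3d

mean_score = sum(fold_scores) / len(fold_scores)    # step 4: aggregate
print(fold_scores, mean_score)
```

With $N = 10$ and $K = 3$ the fold sizes come out as 4, 3, and 3, which is what "as equally sized as possible" means in step 2.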

### Mathematical Notation
Let $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the entire dataset with $N$ samples.

The dataset $D$ is partitioned into $K$ disjoint subsets (folds):

$$D = D_1 \cup D_2 \cup \cdots \cup D_K \quad \text{where} \quad D_i \cap D_j = \emptyset \text{ for } i \neq j$$

For each iteration $k \in \{1, \ldots, K\}$:

-   **Validation Set**: $D_{test}^{(k)} = D_k$
-   **Training Set**: $D_{train}^{(k)} = D \setminus D_k = \bigcup_{j=1, j \neq k}^{K} D_j$

Let $M$ be a machine learning model with parameters $\theta$. In each iteration $k$, the model is trained on $D_{train}^{(k)}$ to obtain parameters $\hat{\theta}^{(k)}$:

$$\hat{\theta}^{(k)} = \text{train}(D_{train}^{(k)}), \qquad M_k = M(\,\cdot\,; \hat{\theta}^{(k)})$$

The performance of $M_k$ is evaluated on $D_{test}^{(k)}$ using a chosen metric $E$. Let $E(M_k, D_{test}^{(k)})$ denote this performance.

The final estimated performance of the model is the average of the $K$ individual performances:

$$\text{Estimated Performance} = \frac{1}{K} \sum_{k=1}^{K} E(M_k, D_{test}^{(k)})$$ 
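
As a quick worked instance of this average, suppose $K = 3$ and the per-fold scores are $0.90$, $0.80$, and $0.85$. These numbers are hypothetical and chosen only to make the arithmetic easy to follow:

$$\text{Estimated Performance} = \frac{1}{3}\,(0.90 + 0.80 + 0.85) = \frac{2.55}{3} = 0.85$$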

### Advantages
-   **Reduced Bias**: Every data point is used for validation exactly once and for training $K-1$ times, so the performance estimate does not hinge on which samples happened to land in a single held-out set.
-   **Reduced Variance**: The variance of the performance estimate is reduced compared to a single train-test split, as it averages results over multiple splits.
-   **Efficient Data Usage**: Particularly useful when the dataset is small, as it makes maximum use of the available data.
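
The data-usage claim can be checked directly by counting how often each sample lands in a training or validation set. The sketch below uses scikit-learn's `KFold`; the dataset size, the number of folds, and the placeholder feature array are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 10, 5                      # hypothetical sizes
X = np.arange(n_samples).reshape(-1, 1)   # placeholder features

val_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    train_counts[train_idx] += 1          # sample was part of this training set
    val_counts[val_idx] += 1              # sample was held out in this fold

print(val_counts)     # every sample is validated on exactly once
print(train_counts)   # every sample is trained on k - 1 times
```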

### Disadvantages
-   **Computationally Expensive**: Training the model $K$ times can be computationally intensive, especially for large datasets or complex models.
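-   **Not Suitable for Time Series**: Randomly shuffling the samples lets the model train on observations that come after the ones it is validated on, so plain shuffled K-Fold is unsuitable for time series data where the temporal order is important.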
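
For ordered data, one option scikit-learn provides is `TimeSeriesSplit`, which is a different splitter from the K-Fold procedure described here. The sketch below, on eight hypothetical observations, only illustrates the property that matters: training indices always precede test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)   # 8 hypothetical observations in time order

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # Every training index comes before every test index,
    # so the model never sees data from the future.
    print("train:", train_idx, "test:", test_idx)
```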

## Example

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

In [15]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)

In [16]:
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.975

In [17]:
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9861111111111112

In [35]:
rf = RandomForestClassifier(n_estimators=40, min_samples_split=2, max_depth=10)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9722222222222222

In [36]:
from sklearn.model_selection import KFold

splits = 3
kf = KFold(n_splits=splits, shuffle=True, random_state=42)

lrscore = 0
svmscore = 0
rfscore = 0

for train_index, test_index in kf.split(digits.data):
    X_train_kf, X_test_kf = digits.data[train_index], digits.data[test_index]
    y_train_kf, y_test_kf = digits.target[train_index], digits.target[test_index]
    
    lr.fit(X_train_kf, y_train_kf)
    print("Logistic Regression Score:", lr.score(X_test_kf, y_test_kf))
    lrscore += lr.score(X_test_kf, y_test_kf)
    
    svm.fit(X_train_kf, y_train_kf)
    print("SVM Score:", svm.score(X_test_kf, y_test_kf))
    svmscore += svm.score(X_test_kf, y_test_kf)

    rf.fit(X_train_kf, y_train_kf)
    print("Random Forest Score:", rf.score(X_test_kf, y_test_kf))
    rfscore += rf.score(X_test_kf, y_test_kf)
    print("-" * 30)

print("Average Logistic Regression Score:", lrscore / splits)
print("Average SVM Score:", svmscore / splits)
print("Average Random Forest Score:", rfscore / splits)

Logistic Regression Score: 0.9666110183639399
SVM Score: 0.986644407345576
Random Forest Score: 0.9682804674457429
------------------------------
Logistic Regression Score: 0.9649415692821369
SVM Score: 0.988313856427379
Random Forest Score: 0.9616026711185309
------------------------------
Logistic Regression Score: 0.9515859766277128
SVM Score: 0.986644407345576
Random Forest Score: 0.9766277128547579
------------------------------
Average Logistic Regression Score: 0.9610461880912632
Average SVM Score: 0.9872008903728436
Average Random Forest Score: 0.9688369504730105


In [37]:
from sklearn.model_selection import StratifiedKFold

splits= 3
lrscore = 0
svmscore = 0
rfscore = 0

folds = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)

for train_index, test_index in folds.split(digits.data, digits.target):
    X_train_skf, X_test_skf = digits.data[train_index], digits.data[test_index]
    y_train_skf, y_test_skf = digits.target[train_index], digits.target[test_index]
    
    lr.fit(X_train_skf, y_train_skf)
    print("Stratified Logistic Regression Score:", lr.score(X_test_skf, y_test_skf))
    lrscore += lr.score(X_test_skf, y_test_skf)  # accumulate this stratified fold's score

    svm.fit(X_train_skf, y_train_skf)
    print("Stratified SVM Score:", svm.score(X_test_skf, y_test_skf))
    svmscore += svm.score(X_test_skf, y_test_skf)

    rf.fit(X_train_skf, y_train_skf)
    print("Stratified Random Forest Score:", rf.score(X_test_skf, y_test_skf))
    rfscore += rf.score(X_test_skf, y_test_skf)

    print("-" * 30)

print("Average Stratified Logistic Regression Score:", lrscore / splits)
print("Average Stratified SVM Score:", svmscore / splits)
print("Average Stratified Random Forest Score:", rfscore / splits)

Stratified Logistic Regression Score: 0.9616026711185309
Stratified SVM Score: 0.986644407345576
Stratified Random Forest Score: 0.9749582637729549
------------------------------
Stratified Logistic Regression Score: 0.9782971619365609
Stratified SVM Score: 0.991652754590985
Stratified Random Forest Score: 0.9699499165275459
------------------------------
Stratified Logistic Regression Score: 0.9632721202003339
Stratified SVM Score: 0.9833055091819699
Stratified Random Forest Score: 0.9782971619365609
------------------------------
Average Stratified Logistic Regression Score: 0.9677239844184752
Average Stratified SVM Score: 0.9872008903728436
Average Stratified Random Forest Score: 0.9744017807456872


## Differences between K-Fold Cross Validation and Stratified K-Fold Cross Validation

### K-Fold Cross Validation

* Process: The dataset is randomly divided into k equal-sized
folds. In each iteration, one fold is used as the test set,
and the remaining k-1 folds are used as the training set.

* Class Distribution: It does not guarantee that each fold will
have the same proportion of class labels as the original
dataset. If the dataset has an imbalanced class distribution,
some folds might end up with a disproportionate number of
samples from a particular class, leading to biased model
evaluation.

### Stratified K-Fold Cross Validation

* Process: This is a variation of K-Fold Cross Validation that
ensures each fold maintains the same proportion of class
labels as the original dataset. The data is divided into k
folds such that the percentage of samples for each class is
preserved in each fold.

* Class Distribution: It is particularly useful when dealing
with imbalanced datasets, as it prevents any single fold from
having a significantly different class distribution, thus
providing a more reliable and less biased estimate of the
model's performance.

Key Difference Summary:

* K-Fold: Randomly splits data, no guarantee of class proportion in
folds.

* Stratified K-Fold: Preserves the percentage of samples for each
class in each fold, making it suitable for imbalanced datasets.
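
To make the difference concrete, the sketch below counts minority-class samples in the test part of each fold for both splitters. The 90/10 label array and the helper `minority_per_fold` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))           # features are irrelevant for splitting


def minority_per_fold(splitter):
    """Count class-1 samples in the test part of every fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


kfold_counts = minority_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold:          ", kfold_counts)   # counts may vary from fold to fold
print("StratifiedKFold:", strat_counts)   # 2 per fold, matching the 10% class share
```

Each test fold holds 20 samples, so a 10% minority share corresponds to exactly 2 class-1 samples per fold under stratification.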

## cross_val_score

In [44]:
from sklearn.model_selection import cross_val_score
import numpy as np

np.mean(cross_val_score(LogisticRegression(max_iter=10000), digits.data, digits.target, cv=3))

0.9259877573734001

In [45]:
np.mean(cross_val_score(SVC(), digits.data, digits.target, cv=3))

0.9699499165275459

In [48]:
np.mean(cross_val_score(RandomForestClassifier(n_estimators=40, min_samples_split=2, max_depth=10), digits.data, digits.target, cv=3))

0.9343350027824151
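
These averages are lower than the ones from the manual loops. A likely reason is that, per the scikit-learn documentation, an integer `cv` with a classifier uses `StratifiedKFold` without shuffling, whereas the loops above shuffled with `random_state=42`. The sketch below, offered as an illustration rather than a definitive diagnosis, checks that equivalence and shows how to pass a shuffled splitter instead.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# An integer cv with a classifier behaves like an unshuffled StratifiedKFold.
default_scores = cross_val_score(SVC(), digits.data, digits.target, cv=3)
explicit_scores = cross_val_score(
    SVC(), digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

# Passing a splitter object allows shuffling with a fixed seed.
shuffled_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(SVC(), digits.data, digits.target, cv=shuffled_cv)

print("cv=3:    ", default_scores.mean())
print("shuffled:", shuffled_scores.mean())
```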