# **Cross-Validation in Machine Learning**

## **Introduction**

Cross-validation is a statistical method used to evaluate the performance of a machine learning model. It estimates how well the model performs on unseen data rather than only on the training data, which helps detect overfitting and underfitting.

##
---

## **Key Characteristics**

1. **Purpose**: Cross-validation is an **evaluation technique**, not a feature engineering or preprocessing method.
2. **Main Idea**: Split the dataset into training and testing sets multiple times to evaluate model performance more reliably.
3. **Advantage**: Provides a better estimate of a model’s generalization error compared to a single train-test split.

![Cross validation idea.png](../images/cross_validation.png)

##
---

## **Why Use Cross-Validation?**

- **Model Validation**: Provides a reliable estimate of model performance.

- **Detects Overfitting**: Reveals when a model fits the training data well but performs poorly on held-out folds.

- **Model Selection**: Helps compare the performance of different algorithms.

- **Improved Generalization**: Evaluates how well the model performs on unseen data.
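As one illustration of the model-selection use listed above, the sketch below scores two candidate classifiers on the same folds. The choice of logistic regression and a decision tree, the Iris dataset, and the names `candidates` and `mean_scores` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Use the same seeded splitter for every candidate so the comparison is fair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative candidates (an assumption, not a recommendation)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the standard deviation next to the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.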

##
---

## **How It Works**

The core idea of cross-validation is to divide the dataset into subsets (folds) and perform multiple training and testing cycles to ensure the model is validated on different portions of the data.
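The splitting loop described above can be sketched with NumPy alone. This is a minimal illustration, not how any particular library implements it; the helper name `k_fold_indices` and its `seed` parameter are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds of roughly equal size."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(indices, k)     # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]                # fold i is held out for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train size={len(train_idx)}, test size={len(test_idx)}")
```

Every sample lands in exactly one test fold, so each data point is used for evaluation once and for training `k - 1` times.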


##
---

## **Types of Cross-Validation**

### 1. **K-Fold Cross-Validation**

- The dataset is divided into `K` folds of (approximately) equal size.
- For each iteration:
  - One fold is used as the test set.
  - The remaining `K-1` folds are used for training.
- The process is repeated `K` times, so every fold serves as the test set exactly once, and the average performance across all folds is calculated.

#### Example Code:
```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier()  # unseeded, so scores can vary slightly between runs

# Evaluate model
scores = cross_val_score(model, X, y, cv=kf)
print("Accuracy for each fold:", scores)
print("Average Accuracy:", scores.mean())

# Output from one run:
# Accuracy for each fold: [1.         0.96666667 0.93333333 0.93333333 0.96666667]
# Average Accuracy: 0.9600000000000002
```


###
---

### 2. **Stratified K-Fold Cross-Validation**

- Similar to K-Fold but preserves the proportion of classes (target variable) in each fold.

- Useful for imbalanced datasets.

#### Python Example:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Reuses X, y, and model from the K-Fold example
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))

# Output from one run:
# Accuracy: 1.0
# Accuracy: 0.9666666666666667
# Accuracy: 0.9333333333333333
# Accuracy: 0.9666666666666667
# Accuracy: 0.9
```
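To make the effect of stratification visible, the following sketch counts how many minority-class samples land in each test fold with and without it. The 90/10 label vector `y_imb` is synthetic and the helper `positives_per_fold` is hypothetical; both exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels (an assumption): 90 negatives followed by 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # feature values do not affect how the data is split

def positives_per_fold(splitter):
    """Count the positive samples in each test fold produced by the splitter."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]

kfold_counts = positives_per_fold(KFold(n_splits=5))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per test fold:          ", kfold_counts)   # [0, 0, 0, 0, 10]
print("StratifiedKFold positives per test fold:", strat_counts)   # [2, 2, 2, 2, 2]
```

Without shuffling, plain K-Fold puts every positive sample into the last fold, so four of the five test folds contain no positives at all. Stratification keeps the 10% proportion in every fold.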


###
---

### 3. **Leave-One-Out Cross-Validation (LOOCV)**

- Uses one data point as the validation set and the rest for training.

- Repeated for every data point.

- Computationally expensive for large datasets.

#### Python Example:

```python
from sklearn.model_selection import LeaveOneOut

# Reuses X, y, and model from the K-Fold example
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("True:", y_test, "Predicted:", predictions)

# Output (one line per sample; only the first lines are shown):
# True: [0] Predicted: [0]
# True: [0] Predicted: [0]
# True: [0] Predicted: [0]
# ...
```
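Printing one line per sample becomes hard to read, so a common alternative is to pass the splitter to `cross_val_score` and average the per-sample results. The sketch below swaps in a k-nearest-neighbours classifier purely because it is quick to fit 150 times; that model choice and the names `knn` and `loo_scores` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNN is used here only to keep the 150 fits fast (an assumption of this sketch)
knn = KNeighborsClassifier(n_neighbors=5)

# Each "fold" holds out one sample, so every score is either 0.0 or 1.0
loo_scores = cross_val_score(knn, X, y, cv=LeaveOneOut())

print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

The mean of these 0/1 scores is the fraction of samples that were predicted correctly when left out.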


###
---

### 4. **Hold-Out Validation**

- The dataset is split into separate training and testing sets.

- Simpler but prone to bias if the split isn't representative.

#### Python Example:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing (reuses X, y, and model from the K-Fold example)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

# Output from one run:
# Accuracy: 1.0
```
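Because a hold-out score depends on which samples end up in the test set, repeating the split with different seeds gives a rough feel for that sensitivity. This is a small sketch under assumed choices: a seeded decision tree, five seeds, and the name `holdout_scores` are all illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):
    # Only the split changes between iterations; the model is seeded identically
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    holdout_scores.append(accuracy_score(y_test, clf.predict(X_test)))

print("Hold-out accuracy per seed:", holdout_scores)
print("Spread (max - min):", max(holdout_scores) - min(holdout_scores))
```

Any spread between the seeds comes from the split alone, which is the variability that cross-validation averages out.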


###
---

### 5. **Time Series Cross-Validation**

- Used for time-dependent data.

- Splits the data sequentially to respect temporal order.

- Avoids "data leakage" from future observations.

#### Python Example:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Reuses X, y, and model from the K-Fold example
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))

# Output from one run:
# Accuracy: 0.2972972972972973
# Accuracy: 0.6486486486486487
# Accuracy: 0.7567567567567568
```

The low scores are an artifact of using Iris here, which is not a time series and serves only to show the API. Its rows are sorted by class, so the first training window contains a single class and the model cannot predict the classes that appear later.
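The expanding-window behaviour is easier to see on the indices themselves. The sketch below uses a made-up sequence of 12 ordered observations; the names `X_ts` and `ts_splits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations in time order (synthetic data for illustration)
X_ts = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
ts_splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X_ts)]

for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)
# train: [0, 1, 2] test: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7, 8]
# train: [0, 1, 2, 3, 4, 5, 6, 7, 8] test: [9, 10, 11]
```

Each training window ends before its test window begins, so the model never sees observations from the future.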


##
---

## **Evaluating Cross-Validation Results**

After performing cross-validation, metrics such as accuracy, precision, recall, or F1-score are averaged across all folds.

#### Example:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier()  # unseeded, so scores can vary slightly between runs

scores = cross_val_score(model, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean Accuracy:", scores.mean())

# Output from one run:
# Accuracy per fold: [0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
# Mean Accuracy: 0.9666666666666668
```

- **cross_val_score(model, X, y, cv=cv)**: returns an array with one score per fold.
  - *model*: the machine learning model (estimator) to evaluate.
  - *X*: 2D array of feature values.
  - *y*: target values of the dataset.
  - *cv*: the number of folds, or a splitter object such as `KFold`. With an integer and a classifier, scikit-learn uses stratified folds.
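`cross_val_score` reports a single metric. When several metrics such as precision, recall, and F1 are needed at once, scikit-learn's `cross_validate` can compute them in one pass. The sketch below assumes macro averaging and a seeded random forest; the names `metrics`, `results`, and `mean_by_metric` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Macro averaging treats the three Iris classes equally (an assumption of this sketch)
metrics = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(model, X, y, cv=5, scoring=metrics)

# cross_validate stores per-fold scores under the key "test_<metric name>"
mean_by_metric = {m: results[f"test_{m}"].mean() for m in metrics}
for name, value in mean_by_metric.items():
    print(f"{name}: {value:.3f}")
```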


##
---

## **Advantages & Disadvantages of Cross-Validation**

### Advantages of Cross-Validation

- Efficient use of data for training and testing.

- Provides more reliable performance estimates.

- Reduces variance in evaluation compared to a single train-test split.



### Disadvantages of Cross-Validation

- Computationally Intensive:

  - Requires training the model multiple times.

- Complexity:

  - Standard random splits are not valid for some data, such as time series, which need a specialized splitting scheme.

- Bias-Variance Tradeoff:

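  - LOOCV gives a nearly unbiased estimate but can have high variance, while K-Fold with a small `K` tends to be pessimistically biased because each model is trained on less data.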
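The computational cost can be made concrete by counting how many times each scheme has to fit the model. The sketch below assumes a dataset of 150 samples (the size of Iris); the names `X_demo` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Placeholder data with 150 samples; only the number of rows matters here
X_demo = np.zeros((150, 2))

# get_n_splits reports how many train/test rounds a splitter will produce
n_fits = {
    "5-fold": KFold(n_splits=5).get_n_splits(X_demo),
    "10-fold": KFold(n_splits=10).get_n_splits(X_demo),
    "LOOCV": LeaveOneOut().get_n_splits(X_demo),
}

for scheme, count in n_fits.items():
    print(f"{scheme}: {count} model fits")
```

K-Fold needs `K` fits regardless of dataset size, while LOOCV needs one fit per sample, which is why it becomes expensive on large datasets.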

##
---

## **Mathematical Equations**

For K-Fold Cross-Validation, the average score across all folds is calculated as:

$$
\text{Score}_{\text{avg}} = \frac{1}{K} \sum_{i=1}^{K} \text{Score}_i
$$

Where:

- $K$ = Number of folds.

- $\text{Score}_i$ = Performance metric (e.g., accuracy) for the $i$-th fold.
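As a worked example, plugging in the five fold accuracies printed in the K-Fold example (rounded to four decimals, $K = 5$) gives:

$$
\text{Score}_{\text{avg}} = \frac{1.0000 + 0.9667 + 0.9333 + 0.9333 + 0.9667}{5} = \frac{4.8}{5} = 0.96
$$

This matches the average accuracy reported in that run.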

##
---

## **Comparison of Cross-Validation Methods**

| Method                | Strengths                            | Weaknesses                            |
|-----------------------|--------------------------------------|---------------------------------------|
| K-Fold                | Reliable, balances bias and variance | Computational cost for large datasets |
| Stratified K-Fold     | Handles imbalanced datasets          | Similar to K-Fold limitations         |
| Leave-One-Out (LOOCV) | Best for small datasets              | Very computationally expensive        |
| Hold-Out Validation   | Simple and fast                      | Results depend on the split           |
| Time Series           | Respects time order                  | Limited to sequential data            |

##
---

Cross-validation is a cornerstone technique in machine learning: it gives a more robust estimate of model performance than a single train-test split, which makes it a standard part of most ML workflows.
