# Validation Set & Cross-Validation

---

## Validation Set
Beyond a simple **train**/**test** split, the data can be divided into three parts:

- **Training Set** → used to fit the model  
- **Validation Set** → used for tuning hyperparameters & model selection  
- **Testing Set** → used only for final evaluation  

Keeping the test set out of both training and tuning helps prevent **data leakage** and gives a fair final evaluation.

---

## Example Split (60/20/20)
- 60% Training  
- 20% Validation  
- 20% Testing  
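
A 60/20/20 split like the one above can be sketched with two calls to scikit-learn's `train_test_split`. This is a minimal illustration, not the only way to do it; the toy arrays and the variable names (`X_temp`, `X_val`, and so on) are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 1 feature (for illustration only)
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Step 1: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which is 60% / 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second call uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full dataset.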

---

## Limitation of Train / Validation / Test
- When the dataset is small, splitting it into three parts leaves less data for training.
- The measured performance can vary depending on how the data happens to be split.
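
The dependence on the split can be seen in a small experiment. The sketch below keeps the model and data fixed and changes only the random seed of the split; the noisy linear data is hypothetical and generated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical noisy linear data (30 samples), made up for illustration
rng = np.random.default_rng(0)
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=5.0, size=30)

split_scores = []
for seed in range(5):
    # Same data and model; only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))  # R² on the held-out part

print("R² per split:", np.round(split_scores, 3))
print("Spread (max - min):", max(split_scores) - min(split_scores))
```

Each seed gives a different held-out set, so the reported score shifts even though nothing about the model has changed.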

---

# Cross-Validation (CV)

To reduce the impact of a small dataset and of any single split, we use **Cross-Validation**.

---

## k-Fold Cross-Validation
1. Split the dataset into **k folds** of (roughly) equal size.
2. Train on **k-1 folds** and evaluate on the **remaining fold**.
3. Repeat the process **k times**, using a different fold for evaluation each time.
4. Average the k scores to get the final performance estimate.
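
The four steps above can be written out as an explicit loop. This is a rough sketch using scikit-learn's `KFold.split`; the perfectly linear toy data and the names `fold_scores` and `seen_test_indices` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical toy data following y = 2x + 1 exactly (illustration only)
X = np.arange(1, 13, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

kf = KFold(n_splits=4)  # step 1: split into k = 4 folds

fold_scores = []
seen_test_indices = []
for train_idx, test_idx in kf.split(X):
    # Step 2: train on k-1 folds, evaluate on the remaining fold
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # R² for this fold
    seen_test_indices.extend(test_idx.tolist())
    # Step 3: the loop repeats with the next fold held out

# Step 4: average the per-fold scores
mean_score = float(np.mean(fold_scores))
print("Fold scores:", fold_scores)
print("Mean score:", mean_score)
```

Every sample index appears in exactly one evaluation fold, which is what allows the whole dataset to contribute to the final estimate.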

---

## Formula

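For a dataset $D$ split into $k$ folds, the cross-validated accuracy is the mean of the per-fold accuracies:

$$
\text{Accuracy}_{\text{CV}} = \frac{1}{k} \sum_{i=1}^{k} \text{Accuracy}_i
$$

where:
- $\text{Accuracy}_i$ → accuracy on the $i^{\text{th}}$ fold  
- $k$ → number of folds (commonly 5 or 10)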
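
As a quick worked example with made-up numbers (not results from this notebook), suppose $k = 5$ and the fold accuracies are 0.80, 0.85, 0.90, 0.75, and 0.95:

$$
\frac{1}{5} \sum_{i=1}^{5} \text{Accuracy}_i = \frac{0.80 + 0.85 + 0.90 + 0.75 + 0.95}{5} = \frac{4.25}{5} = 0.85
$$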

---

## Advantages of k-Fold CV
- Uses the **entire dataset** for both training & testing: each sample is tested on once and trained on k-1 times.  
- Reduces the variance that comes from relying on a single split.  
- Works well with small datasets.  

---


```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.2, 2.3, 2.9, 4.2, 5.1, 6.1, 7.0, 8.2])

# Model
model = LinearRegression()

# Define 4-Fold CV
kf = KFold(n_splits=4, shuffle=True, random_state=42)

# Evaluate using cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring="r2")

print("Cross-validation scores:", scores)
print("Average R²:", scores.mean())
```

Output:

```
Cross-validation scores: [0.99445983 0.99836735 0.96051054 0.99506043]
Average R²: 0.9870995380002167
```


# Stratified k-Fold Cross-Validation

---

## What is it?
- A special version of **k-Fold CV**.  
- Ensures each fold has the **same class proportion** as the original dataset.  
- Useful when data is **imbalanced** (e.g., 90% class A, 10% class B).  

---

## Why Stratification?
- Plain k-Fold can produce folds with uneven class distributions; a fold may contain very few or even no samples of the minority class.  
- Stratification avoids this, so each fold reflects the class balance of the full dataset.  

---

## Advantages
- Works better for classification problems with **imbalanced data**.  
- Reduces the risk of misleading accuracy from folds that under-represent a class.  

---

## Example
Suppose a dataset has 100 samples:
- 80 are Class 0  
- 20 are Class 1  

With **5-Fold Stratified CV**:  
- Each fold will contain 16 Class 0 and 4 Class 1 samples → every fold keeps the original 80/20 class ratio.  
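
The example above can be checked by counting the classes in each test fold. The sketch below compares `StratifiedKFold` with plain `KFold` on an 80/20 label array; the placeholder feature matrix and the names `strat_counts` and `plain_counts` are assumptions for illustration. Plain `KFold` is used here without shuffling on sorted labels, which is the worst case.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 100 labels: 80 of Class 0 followed by 20 of Class 1 (sorted by class)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # placeholder features; only the labels matter here

# Stratified 5-fold: class counts [n_class0, n_class1] in each test fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]

# Plain 5-fold without shuffling: folds are contiguous blocks of the sorted labels
kf = KFold(n_splits=5)
plain_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in kf.split(X)
]

print("Stratified:", strat_counts)  # [[16, 4], [16, 4], [16, 4], [16, 4], [16, 4]]
print("Plain:     ", plain_counts)  # [[20, 0], [20, 0], [20, 0], [20, 0], [0, 20]]
```

In the plain case, four folds contain no Class 1 samples at all and the last fold contains nothing else, so per-fold accuracy would say little about the minority class.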

---


```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Sample dataset (features + binary labels)
X = np.array([[i] for i in range(1, 21)])
y = np.array([0]*15 + [1]*5)  # Imbalanced dataset

# Model
model = LogisticRegression()

# Define Stratified 5-Fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate using cross-validation
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print("Stratified CV scores:", scores)
print("Average Accuracy:", scores.mean())
```

Output:

```
Stratified CV scores: [1.   1.   0.75 1.   1.  ]
Average Accuracy: 0.95
```
