# Standardization vs Normalization

## 1. Standardization

Standardization (Z-score scaling) transforms the data to have a mean (μ) of 0 and a standard deviation (σ) of 1.

$$
X_{\text{standardized}} = \frac{X - \mu}{\sigma}
$$
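
As a quick check of the formula, here is a minimal NumPy sketch. The sample values are made up for illustration, and the population standard deviation (`ddof=0`, NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the data
sigma = x.std()   # population standard deviation (ddof=0)

# Subtract the mean first, then divide by the standard deviation
z = (x - mu) / sigma

print("mean:", mu, "std:", sigma)
print("standardized:", z)
print("mean after:", z.mean(), "std after:", z.std())
```

After the transform, the mean should be 0 and the standard deviation 1, up to floating-point error.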

## 2. Normalization

### Min-Max Scaling Formula

Min-Max Scaling transforms data into a fixed range, typically \([0,1]\).

$$
X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$

Where:

- \( X \) is the original value,  
- \( X_{\min} \) is the minimum value in the dataset,  
- \( X_{\max} \) is the maximum value in the dataset,  
- \( X' \) is the normalized value in the range \([0,1]\).  

For a custom range \([a, b]\):

$$
X' = a + \frac{(X - X_{\min}) (b - a)}{X_{\max} - X_{\min}}
$$
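
Both formulas can be covered by one small function, since the \([0,1]\) case is just \(a = 0\), \(b = 1\). The helper name `min_max_scale` and the sample values below are hypothetical, used only to sketch the idea.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Rescale x linearly so its minimum maps to a and its maximum to b."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # The denominator would be zero, so the formula is undefined
        raise ValueError("all values are equal; min-max scaling is undefined")
    return a + (x - x_min) * (b - a) / (x_max - x_min)

# Hypothetical sample values
x = np.array([10.0, 20.0, 30.0, 50.0])

print(min_max_scale(x))         # default range [0, 1]
print(min_max_scale(x, -1, 1))  # custom range [-1, 1]
```

The explicit check for a constant column is a design choice for this sketch. Library scalers may handle that case differently.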



## 3. Key Differences

| Feature                     | Standardization (Z-score)       | Normalization (Min-Max)      |
|-----------------------------|--------------------------------|-----------------------------|
| **Formula**                 | \( \frac{X - \mu}{\sigma} \) | \( X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \) |
| **Range**                   | No fixed range                | \([0,1]\) or \([-1,1]\)    |
| **Effect on Data**          | Centered at 0, with unit variance | Rescales data within a fixed range |
| **Best Used When**          | Data is roughly normally distributed | Data has known bounds and no extreme outliers |
| **Examples of Use Cases**   | SVM, PCA, Linear Regression, K-Means | Neural Networks, KNN, Image Processing |
| **Sensitive to Outliers?**  | Less so (no fixed bounds, but μ and σ still shift) | Yes (min and max are set by extreme values) |
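
The outlier row can be checked on a small array. The sketch below uses made-up values with one obvious outlier and applies both transforms by hand. It is only an illustration, not a general result.

```python
import numpy as np

# Hypothetical data: four ordinary values and one outlier (100)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: the outlier becomes 1 and defines the whole range
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: output is unbounded, but the outlier still pulls
# the mean and standard deviation used in the transform
z_score = (x - x.mean()) / x.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```

In this example, min-max scaling squeezes the four ordinary values into a narrow band near 0. The z-scores have no fixed bounds, but the mean is pulled to 22, so the outlier affects that transform too.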



# Cross-Validation: A Technique to Evaluate Model Performance

Cross-validation is a technique used to evaluate the performance of a machine learning model by training and testing it on different subsets of the data.

## Why Use Cross-Validation?

- **Helps Detect Overfitting** – Shows whether the model generalizes to new data instead of just memorizing the training set.  
- **Improves Model Evaluation** – Tests the model on different subsets of data, providing a more comprehensive assessment.  
- **Provides Reliable Performance Estimates** – Reduces variability compared to a single train-test split, leading to more consistent results.  


## 1. K-Fold Cross-Validation (Most Common)

### How It Works

1. Split the dataset into **K** folds of (approximately) equal size.  
2. Train the model on **K-1** folds and test on the remaining fold.  
3. Repeat the process **K** times, each time using a different fold as the test set.  
4. Compute the **average performance** across all **K** iterations.  


### Example (K = 5)

| Iteration | Training Folds      | Testing Fold |
|-----------|---------------------|--------------|
| 1         | 2, 3, 4, 5          | 1            |
| 2         | 1, 3, 4, 5          | 2            |
| 3         | 1, 2, 4, 5          | 3            |
| 4         | 1, 2, 3, 5          | 4            |
| 5         | 1, 2, 3, 4          | 5            |
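
The four steps can also be written out as an explicit loop, which makes each train/test rotation in the table visible. This is a minimal sketch using scikit-learn's `KFold.split` with synthetic data. The variable names (`fold_scores`, `train_idx`, `test_idx`) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic data, used here only for illustration
X, y = make_regression(n_samples=100, n_features=2, noise=20, random_state=42)

# Step 1: split the data into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

# Steps 2-3: each iteration holds out a different fold for testing
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])           # train on K-1 folds
    preds = model.predict(X[test_idx])              # predict on the held-out fold
    score = r2_score(y[test_idx], preds)
    fold_scores.append(score)
    print(f"Iteration {i}: train={len(train_idx)}, test={len(test_idx)}, R^2={score:.4f}")

# Step 4: average the K scores
print("Average R^2:", np.mean(fold_scores))
```

With 100 samples and K = 5, every iteration trains on 80 rows and tests on 20, and each row is used for testing exactly once.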


In [21]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=20, random_state=42)

# K-Fold Cross-Validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize Linear Regression model
model = LinearRegression()

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print cross-validation results
print("Cross-validation scores:", scores)
print("Average score:", np.mean(scores))


Cross-validation scores: [0.93708841 0.97124696 0.96886821 0.94274007 0.96240286]
Average score: 0.9564693037880823


In [22]:
from sklearn.datasets import fetch_california_housing

# Load the California housing data into a DataFrame
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)

In [26]:
# Features and target
X = df
y = california.target

In [27]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [31]:
model = LinearRegression()

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

# Print results
print("Cross-validation scores:", scores)
print("Average score:", np.mean(scores))

Cross-validation scores: [0.57578771 0.61374822 0.60856043 0.62126494 0.5875292 ]
Average score: 0.6013781013684618
