<a href="https://colab.research.google.com/github/anjalikokare/MLP-lectures/blob/main/week2_mlp_2_6_2ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Feature Scaling**

# **Introduction**
* **Machine learning models need clean, preprocessed data.**
* **Features often have different scales (e.g., age vs. income).**
* **This can cause models to misinterpret feature importance.**
* Feature scaling puts all features on a similar scale for better performance.

* **Feature scaling**: a way to adjust numerical data so that all features have a similar range.


In [43]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler

In [32]:
house_price = [
    {'age':10, 'area':6070, 'bedrooms':3,'stories': 3, 'belcony': 1, 'price(in M)': 5.0},
    {'age':20, 'area':7888, 'bedrooms':4,'stories': 2, 'belcony': 3, 'price(in M)': 7.0},
    {'age':20, 'area':8925, 'bedrooms':4,'stories': 2, 'belcony': 3, 'price(in M)': 7.3},
    {'age':30, 'area':8700,'bedrooms':2, 'stories': 4, 'belcony': 2, 'price(in M)': 6.0},
    {'age':40, 'area':8000,'bedrooms':5, 'stories': 4, 'belcony': 4, 'price(in M)': 5.0},
    {'age':40, 'area':7520, 'bedrooms':3,'stories': 4,'belcony': 1, 'price(in M)': 6.0}

]

In [33]:
dp = pd.DataFrame(house_price)

In [34]:
print(dp)

   age  area  bedrooms  stories  belcony  price(in M)
0   10  6070         3        3        1          5.0
1   20  7888         4        2        3          7.0
2   20  8925         4        2        3          7.3
3   30  8700         2        4        2          6.0
4   40  8000         5        4        4          5.0
5   40  7520         3        4        1          6.0


In [35]:
sc = StandardScaler()
sc.fit_transform(dp)

array([[-1.50755672, -1.91636945, -0.52223297, -0.18569534, -1.20604538,
        -1.18952649],
       [-0.60302269,  0.04036161,  0.52223297, -1.29986737,  0.60302269,
         1.07623825],
       [-0.60302269,  1.15649479,  0.52223297, -1.29986737,  0.60302269,
         1.41610296],
       [ 0.30151134,  0.91432511, -1.5666989 ,  0.92847669, -0.30151134,
        -0.05664412],
       [ 1.20604538,  0.1609083 ,  1.5666989 ,  0.92847669,  1.50755672,
        -1.18952649],
       [ 1.20604538, -0.35572036, -0.52223297,  0.92847669, -1.20604538,
        -0.05664412]])

**Feature Scaling Advantages**

* Feature scaling makes numerical features comparable in scale.
* It prevents models from favoring features with larger values.
* Without it, algorithms may converge slowly or perform poorly.



# **Types of Feature Scaling**

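* **Standardization (z-score normalization)**
$$x' = \frac{x - \mu}{\sigma}$$
where $\mu$ is the feature's mean and $\sigma$ is its standard deviation.

* Recommended when a model works best with features that have zero mean and unit variance, as standard normal data does. Standardization shifts and rescales each feature; it does not change the shape of its distribution.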

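* **Min-max scaling (`MinMaxScaler`)**
* $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
* The scaled values lie in the range [0, 1] (the `MinMaxScaler` default).
* `StandardScaler` and `MinMaxScaler` are both very sensitive to outliers, because the mean, standard deviation, minimum, and maximum all shift when extreme values are present.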
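As a quick check of the two formulas, here is a minimal sketch that applies them by hand with NumPy. The values `1..5` and the variable names are illustrative assumptions, not part of the original notebook cells.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative feature values

# Standardization: subtract the mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
z = (x - x.mean()) / x.std()

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1.
mm = (x - x.min()) / (x.max() - x.min())

print("standardized:", z)
print("min-max:     ", mm)
```

After standardization the values have mean 0 and standard deviation 1; after min-max scaling they span exactly [0, 1].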

In [36]:
data = [
    {'feature1':1 , 'feature32':50, 'feature3':200},
    {'feature1':2 , 'feature32':60, 'feature3':180},
    {'feature1':3 , 'feature32':70, 'feature3':160},
    {'feature1':4 , 'feature32':80, 'feature3':140},
    {'feature1':5 , 'feature32':90, 'feature3':120}
]

In [37]:
df = pd.DataFrame(data)
print(df)

   feature1  feature32  feature3
0         1         50       200
1         2         60       180
2         3         70       160
3         4         80       140
4         5         90       120


In [38]:
ss = StandardScaler()
ss.fit_transform(df)

array([[-1.41421356, -1.41421356,  1.41421356],
       [-0.70710678, -0.70710678,  0.70710678],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.70710678,  0.70710678, -0.70710678],
       [ 1.41421356,  1.41421356, -1.41421356]])

In [39]:
mms = MinMaxScaler()
mms.fit_transform(df)

array([[0.  , 0.  , 1.  ],
       [0.25, 0.25, 0.75],
       [0.5 , 0.5 , 0.5 ],
       [0.75, 0.75, 0.25],
       [1.  , 1.  , 0.  ]])

**Robust Scaling (median and IQR based)**

* Formula: $x' = \frac{x - Q_2(x)}{Q_3(x) - Q_1(x)}$
* $Q_2$ = median (50th percentile) of the feature
* $Q_3$ = 75th percentile of the feature
* $Q_1$ = 25th percentile of the feature
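The robust-scaling formula can be reproduced by hand. This is a small sketch under the assumption that percentiles are computed with NumPy's default linear interpolation; the sample values are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative feature values

# Quartiles of the feature: 25th, 50th (median), and 75th percentiles.
q1, q2, q3 = np.percentile(x, [25, 50, 75])

# Center on the median and divide by the interquartile range (IQR).
robust = (x - q2) / (q3 - q1)

print("Q1, median, Q3:", q1, q2, q3)
print("robust-scaled: ", robust)
```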

In [41]:
rs = RobustScaler()
rs.fit_transform(df)

array([[-1. , -1. ,  1. ],
       [-0.5, -0.5,  0.5],
       [ 0. ,  0. ,  0. ],
       [ 0.5,  0.5, -0.5],
       [ 1. ,  1. , -1. ]])
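To illustrate why a median/IQR-based scaler helps when outliers are present, the sketch below scales one column that contains a single extreme value with all three scalers. The data (`1, 2, 3, 4, 100`) is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with four ordinary values and one outlier (illustrative data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),
    "minmax": MinMaxScaler().fit_transform(x).ravel(),
    "robust": RobustScaler().fit_transform(x).ravel(),
}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

With `StandardScaler` and `MinMaxScaler` the four ordinary values are squeezed into a narrow band because the outlier drives the mean, standard deviation, and maximum. With `RobustScaler` the ordinary values keep their spread and only the outlier lands far from zero.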

# **Max Abs Scaling**

Formula: $x' = \frac{x}{\max |x|}$

* Recommended for sparse data: the scaler only divides and never shifts the values, so zero entries stay zero.

In [44]:
mas = MaxAbsScaler()
mas.fit_transform(df)

array([[0.2       , 0.55555556, 1.        ],
       [0.4       , 0.66666667, 0.9       ],
       [0.6       , 0.77777778, 0.8       ],
       [0.8       , 0.88888889, 0.7       ],
       [1.        , 1.        , 0.6       ]])
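The sparse-data recommendation can be checked with a small SciPy sparse matrix. This is a sketch with made-up values; it assumes SciPy is available, which is normally the case wherever scikit-learn is installed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A mostly-zero matrix stored in sparse (CSR) format (illustrative data).
X = csr_matrix(np.array([
    [0.0, 2.0,  0.0],
    [4.0, 0.0,  0.0],
    [0.0, 0.0, -5.0],
    [8.0, 0.0,  0.0],
]))

# Each column is divided by its largest absolute value; nothing is shifted.
X_scaled = MaxAbsScaler().fit_transform(X)

print(X_scaled.toarray())
print("non-zeros before/after:", X.nnz, X_scaled.nnz)
```

Because the zeros are untouched, the matrix stays sparse. A scaler that subtracts a mean or minimum would turn most zeros into non-zero values.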

# **Distance-based Models**
* **K-Nearest Neighbours (KNN)** --> a distance-based algorithm (Euclidean distance), so features must be on comparable scales.

* **Support Vector Machine (SVM)** --> SVMs rely on distances between points, so scaling boosts performance, especially with the RBF kernel.
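To make the distance argument concrete, the sketch below measures how much of the squared Euclidean distance between two houses comes from `area` rather than `age`, before and after standardization. It reuses the `age` and `area` columns from the house table; the helper `area_share` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, area (taken from the house table earlier in the notebook).
X = np.array([
    [10, 6070], [20, 7888], [20, 8925],
    [30, 8700], [40, 8000], [40, 7520],
], dtype=float)


def area_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the area column."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()


X_scaled = StandardScaler().fit_transform(X)

raw_share = area_share(X[0], X[1])
scaled_share = area_share(X_scaled[0], X_scaled[1])

print(f"area share of squared distance, raw:    {raw_share:.5f}")
print(f"area share of squared distance, scaled: {scaled_share:.5f}")
```

On the raw data the distance is determined almost entirely by `area`, so `age` is effectively ignored by a distance-based model. After scaling, both features contribute.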


# **Dimensionality Reduction Technique**

* **Principal Component Analysis (PCA)** --> PCA looks for directions of maximum variance, and high-magnitude features have high variance, so without scaling they dominate the components. Scaling is therefore critical.
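A small sketch of this effect, again using the `age` and `area` columns of the house table: PCA is fitted once on the raw values and once on standardized values, and the first component is compared. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Columns: age, area (taken from the house table earlier in the notebook).
X = np.array([
    [10, 6070], [20, 7888], [20, 8925],
    [30, 8700], [40, 8000], [40, 7520],
], dtype=float)

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Share of total variance captured by the first principal component.
raw_ratio = pca_raw.explained_variance_ratio_[0]
scaled_ratio = pca_scaled.explained_variance_ratio_[0]

print("first component (raw):   ", pca_raw.components_[0], raw_ratio)
print("first component (scaled):", pca_scaled.components_[0], scaled_ratio)
```

On the raw data the first component points almost entirely along `area` and appears to explain nearly all the variance, which reflects units rather than structure. After standardization both features carry weight.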


# **Gradient-Based Models**
* Scaling ensures smooth convergence of gradient descent and consistent update rates across all features, e.g., in Logistic Regression.
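The convergence claim can be illustrated with plain gradient descent. For simplicity this sketch swaps in least-squares linear regression instead of logistic regression, uses synthetic data, and picks the step size as 1 / (largest Hessian eigenvalue); all of these are assumptions made for the illustration, and `gd_final_loss` is a hypothetical helper.

```python
import numpy as np


def gd_final_loss(X, y, steps=200):
    """Run gradient descent on mean squared error; return the final loss."""
    n = len(y)
    hessian = 2.0 / n * X.T @ X
    # A standard safe constant step size: 1 / largest eigenvalue of the Hessian.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)


rng = np.random.default_rng(0)
# Two synthetic features on very different scales: [0, 1] and [0, 1000].
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1]  # noise-free target, so the best loss is 0

# Standardize the features and center the target.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

loss_raw = gd_final_loss(X, y)
loss_scaled = gd_final_loss(X_scaled, y_centered)

print("loss after 200 steps, raw features:   ", loss_raw)
print("loss after 200 steps, scaled features:", loss_scaled)
```

With raw features the large-scale column forces a tiny step size, so the weight of the small-scale column barely moves in 200 steps. With scaled features the same number of steps reaches the optimum.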

# **Tree-based Models (Minimal Impact)**

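Trees split on a threshold of one feature at a time, so rescaling a feature moves the thresholds without changing the resulting partitions.

* Classification and Regression Trees (CART)
* Random Forest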
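A minimal sketch of the "minimal impact" claim: the same decision tree is trained on raw and on standardized versions of a synthetic dataset whose features have very different scales, and the test predictions are compared. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Three synthetic features with scales of roughly 1, 100, and 10000.
X = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10000.0])
y = (X[:, 0] + X[:, 1] / 100.0 + X[:, 2] / 10000.0 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_train)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

print("identical predictions:", np.array_equal(pred_raw, pred_scaled))
print("accuracy (raw):", (pred_raw == y_test).mean())
```

Standardization preserves the ordering of values within each feature, which is all a threshold split depends on.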


# **Key Takeaways**
* Feature scaling is an important preprocessing step, but it doesn't always improve performance.

* Scaling is crucial for models that rely on distance calculations, like SVM and KNN, and for gradient-based models like Logistic Regression.

* Tree-based models (e.g., Decision Trees, Random Forest) are mostly unaffected by scaling.

* Avoid data leakage by fitting the scaler only on the training data, then applying that same fitted scaler to the test data.

* Different scalers (StandardScaler, MinMaxScaler, etc.) can impact model performance differently, so it's important to choose the right one.
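The data-leakage takeaway can be sketched as follows: split first, call `fit_transform` on the training rows only, and call `transform` (never `fit`) on the test rows. The data here is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))  # synthetic features

# Split BEFORE scaling so the test rows never influence the statistics.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuses those statistics, no refit

print("mean learned from training data:", scaler.mean_)
print("test mean after scaling:", X_test_scaled.mean(axis=0))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected: it is measured against the training statistics, just as unseen data would be in production.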