# Feature Scaling

Feature scaling is an essential preprocessing step in machine learning, particularly for algorithms that are sensitive to the magnitude of features. It puts features on comparable scales so that no feature dominates the model simply because of its units. Here’s why feature scaling is important:

01. Prevents Dominance of Large-Scale Features <br/>
 - Many datasets have features with different units (e.g., age in years vs. income in dollars).
 - Without scaling, features with larger magnitudes can dominate those with smaller values, leading to biased models. <br/>
    Example: In a model that uses both age (20-60 years) and income (tens of thousands of dollars), income will heavily influence the model unless the features are scaled.
02. Required for Distance-Based Algorithms <br/>
 - Algorithms like KNN, K-Means, and SVM rely on distance metrics (Euclidean distance, Manhattan distance), and PCA relies on feature variances, so a feature with a larger scale dominates the result. <br/>

<br/>
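
To make the dominance effect concrete, here is a minimal sketch that computes a Euclidean distance between two hypothetical samples of [age, income]. The values and variable names are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np

# Hypothetical samples: [age in years, income in dollars]
a = np.array([25.0, 50000.0])
b = np.array([60.0, 50500.0])  # very different age, similar income

# Raw Euclidean distance between the two samples
raw_distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature
squared_diff = (a - b) ** 2
contribution = squared_diff / squared_diff.sum()

print(raw_distance)   # about 501.2
print(contribution)   # income accounts for more than 99% of the squared distance
```

Even though the two people differ by 35 years in age and only 1% in income, the unscaled distance is determined almost entirely by income.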

When Is Feature Scaling Not Needed? <br/>
Tree-based models (e.g., Decision Trees, Random Forest, XGBoost) are not affected by scaling because they use conditions like if X > threshold rather than distances.
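
This invariance can be illustrated with a single threshold split, the building block of a decision tree. The sketch below is not a full tree implementation; the income values, the threshold, and the `min_max` helper are hypothetical. Because min-max scaling is monotonic, the rescaled threshold partitions the samples exactly as the raw one does.

```python
import numpy as np

# Hypothetical income values and a hypothetical split threshold
income = np.array([50000.0, 70000.0, 55000.0, 41000.0])
threshold = 52000.0

def min_max(values, lo, hi):
    """Rescale values to [0, 1] using the given minimum and maximum."""
    return (values - lo) / (hi - lo)

lo, hi = income.min(), income.max()

# Split on the raw feature
raw_split = income > threshold

# Split on the scaled feature, with the threshold rescaled the same way
scaled_split = min_max(income, lo, hi) > min_max(threshold, lo, hi)

print(raw_split)     # [False  True  True False]
print(scaled_split)  # same partition as the raw split
```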



In [1]:
import numpy as np
# Toy dataset: each row is one sample with two features on very different scales
data = np.array([[26, 50000],
                 [29, 70000],
                 [34, 55000],
                 [31, 41000]])


## Normalization

Normalization rescales feature values to a fixed range, usually [0, 1] or [-1, 1]. <br/>

When to Use Normalization? <br/>
✔ When features have different scales (e.g., age vs. salary). <br/>
✔ When the data has a bounded range and follows a uniform or non-Gaussian distribution. <br/>
✔ Suitable for distance-based algorithms like KNN, K-Means, SVM, and Neural Networks (since weights adjust better with normalized inputs). <br/>

Example Use Case: <br/>
If the temperature feature ranges from 0°C to 100°C and the humidity feature ranges from 0 to 1 (as a fraction), normalization brings them to the same scale.
<br/>
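
Min-max normalization to [0, 1] is commonly written as x' = (x - min) / (max - min), applied to each feature separately. The sketch below applies that formula column by column with NumPy. The array repeats the toy data defined earlier so the block runs on its own, and `col_min`, `col_max`, and `normalized` are illustrative names.

```python
import numpy as np

# Same toy data as above, stored as floats
data = np.array([[26, 50000],
                 [29, 70000],
                 [34, 55000],
                 [31, 41000]], dtype=float)

# Per-column minimum and maximum (axis=0 reduces over rows)
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Broadcasting applies the formula to every row at once
normalized = (data - col_min) / (col_max - col_min)

print(normalized)
```

Every column now spans exactly 0 to 1. A constant column would make the denominator zero, so hand-written code needs to guard against that case.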



In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
scaled_data

array([[0.        , 0.31034483],
       [0.375     , 1.        ],
       [1.        , 0.48275862],
       [0.625     , 0.        ]])

## Standardization

Standardization transforms each feature to have a mean of 0 and a standard deviation of 1. It shifts and rescales the data but does not change the shape of its distribution, so skewed data remains skewed. <br/>

When to Use Standardization? <br/>
✔ When the data follows a normal (Gaussian) distribution. <br/>
✔ When features have different scales, but we want to maintain outliers and natural data spread. <br/>
✔ Suitable for linear regression, logistic regression, PCA, and gradient descent-based models. <br/>
✔ Useful for SVM and k-means clustering, where distance metrics matter and the scaling should be less sensitive to extreme values than min-max scaling. <br/>

Example Use Case: <br/>
If the dataset contains features like height (cm), weight (kg), and age (years), standardization ensures they have comparable influence.
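
Standardization is commonly written as z = (x - mean) / std, applied to each feature separately. The sketch below computes it by hand with NumPy on the same toy data, repeated here so the block runs on its own. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np

# Same toy data as above, stored as floats
data = np.array([[26, 50000],
                 [29, 70000],
                 [34, 55000],
                 [31, 41000]], dtype=float)

# Per-column mean and population standard deviation (ddof=0 is NumPy's default)
mean = data.mean(axis=0)
std = data.std(axis=0)

# z-score for every value, column by column
standardized = (data - mean) / std

print(standardized)
print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```

Unlike min-max normalization, the result is not confined to a fixed range; values simply express how many standard deviations each sample lies from its column mean.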



In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
scaled_data

array([[-1.37198868, -0.3805212 ],
       [-0.34299717,  1.52208478],
       [ 1.37198868,  0.0951303 ],
       [ 0.34299717, -1.23669388]])