# Part 1.3: Scaling & Normalization

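Feature scaling is a critical preprocessing step. Many machine learning algorithms (such as SVMs, logistic regression, and neural networks) perform better or converge faster when features are on a similar scale, because distance computations and gradient updates are otherwise dominated by the features with the largest magnitudes. Tree-based models are largely insensitive to feature scale.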

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
from sklearn.model_selection import train_test_split

# Toy dataset; the last value in each column is a deliberate outlier
data = {
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 100],
    'salary': [50000, 60000, 70000, 80000, 90000,
               100000, 110000, 120000, 130000, 250000],
}
df = pd.DataFrame(data)

# IMPORTANT: Split data first to prevent data leakage
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

**Best Practice**: Always fit your scaler on the training data only, then use that fitted scaler to transform both the training and the test data. Fitting on the full dataset lets test-set statistics leak into preprocessing.
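
As a minimal sketch of why this matters, the snippet below rebuilds the toy dataset and compares the statistics a `StandardScaler` learns from the training rows with those it learns from the full dataset. The names `scaler_ok` and `scaler_leaky` are illustrative only, and the "leaky" fit is shown purely for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 100],
    "salary": [50000, 60000, 70000, 80000, 90000,
               100000, 110000, 120000, 130000, 250000],
})
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Correct: the statistics come from the training rows only.
scaler_ok = StandardScaler().fit(X_train)

# Leaky (for comparison only): the statistics also see the test rows.
scaler_leaky = StandardScaler().fit(df)

print("Train-only means:", scaler_ok.mean_)
print("Full-data means: ", scaler_leaky.mean_)

# The test set is transformed with the training statistics, so its own
# mean is generally not 0 after scaling. That is expected, not a bug.
X_test_scaled = scaler_ok.transform(X_test)
print("Mean of scaled test data:", X_test_scaled.mean(axis=0))
```

The two sets of means differ, which is exactly the information that would leak into the model if the scaler were fitted before splitting.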

### Standardization (Z-score Scaling)
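Standardization rescales each feature to have a mean (μ) of 0 and a standard deviation (σ) of 1, where μ and σ are computed from the training data. `StandardScaler` uses the population standard deviation (divisor `n`, not `n - 1`). Formula: `(x - μ) / σ`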

In [2]:
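scaler_std = StandardScaler()
# fit_transform learns μ and σ from the training data, then scales it
X_train_std = scaler_std.fit_transform(X_train)
# transform reuses the training μ and σ; the test data is never used for fitting
X_test_std = scaler_std.transform(X_test)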

print("StandardScaler - Mean of transformed train data:", X_train_std.mean(axis=0))
print("StandardScaler - Std Dev of transformed train data:", X_train_std.std(axis=0))

StandardScaler - Mean of transformed train data: [2.77555756e-17 0.00000000e+00]
StandardScaler - Std Dev of transformed train data: [1. 1.]
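
The z-score formula can also be checked by hand. The sketch below hard-codes the eight training rows from this example into a NumPy array (the array `X` and the variable names are assumptions made for illustration) and compares a manual computation against `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training rows from the example (columns: age, salary).
X = np.array([
    [50, 100000], [25, 50000], [60, 120000], [35, 70000],
    [100, 250000], [45, 90000], [40, 80000], [55, 110000],
], dtype=float)

mu = X.mean(axis=0)
sigma = X.std(axis=0)  # NumPy's default ddof=0 is the population std
X_manual = (X - mu) / sigma

X_sklearn = StandardScaler().fit_transform(X)
print("Matches StandardScaler:", np.allclose(X_manual, X_sklearn))

# The sample standard deviation (ddof=1, the pandas default) gives
# slightly different values, so the two should not be mixed.
X_sample_std = (X - mu) / X.std(axis=0, ddof=1)
print("Matches with ddof=1:", np.allclose(X_sample_std, X_sklearn))
```

The tiny non-zero mean that appears in printed output for standardized data is floating-point rounding error and can be treated as 0.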


### Min-Max Scaling
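Min-max scaling rescales each feature to a fixed range, [0, 1] by default, using the minimum and maximum seen in the training data. Formula: `(x - min) / (max - min)`. Because the range is set by the most extreme values, this scaler is sensitive to outliers, and unseen values outside the training range map outside [0, 1].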

In [3]:
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train)
X_test_minmax = scaler_minmax.transform(X_test)

print("MinMaxScaler - Min of transformed train data:", X_train_minmax.min(axis=0))
print("MinMaxScaler - Max of transformed train data:", X_train_minmax.max(axis=0))

MinMaxScaler - Min of transformed train data: [0. 0.]
MinMaxScaler - Max of transformed train data: [1. 1.]
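
A short sketch of the min-max formula applied by hand, using the same eight training rows hard-coded as a NumPy array. The unseen sample `x_new` is hypothetical and was chosen so that its age exceeds the training maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training rows from the example (columns: age, salary).
X = np.array([
    [50, 100000], [25, 50000], [60, 120000], [35, 70000],
    [100, 250000], [45, 90000], [40, 80000], [55, 110000],
], dtype=float)

x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_manual = (X - x_min) / (x_max - x_min)

scaler = MinMaxScaler().fit(X)
print("Matches MinMaxScaler:", np.allclose(X_manual, scaler.transform(X)))

# Hypothetical unseen sample: age 110 is above the training maximum of 100,
# so its scaled age is greater than 1.
x_new = np.array([[110, 120000]], dtype=float)
x_new_scaled = scaler.transform(x_new)
print("Scaled unseen sample:", x_new_scaled)
```

The [0, 1] guarantee therefore holds only for the data the scaler was fitted on.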


### Robust Scaling
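Robust scaling centers each feature on its median and divides by the interquartile range (IQR, the 75th percentile minus the 25th percentile), statistics that are robust to outliers. Formula: `(x - median) / IQR`. Outliers remain in the data and still map to large values, but they no longer distort the scale of the other points.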

In [4]:
scaler_robust = RobustScaler()
X_train_robust = scaler_robust.fit_transform(X_train)
X_test_robust = scaler_robust.transform(X_test)

print("Original Training Data:\n", X_train)
print("\nRobust Scaler Transformed Training Data:\n", pd.DataFrame(X_train_robust, columns=X_train.columns))

Original Training Data:
    age  salary
5   50  100000
0   25   50000
7   60  120000
2   35   70000
9  100  250000
4   45   90000
3   40   80000
6   55  110000

Robust Scaler Transformed Training Data:
         age    salary
0  0.142857  0.142857
1 -1.285714 -1.285714
2  0.714286  0.714286
3 -0.714286 -0.714286
4  3.000000  4.428571
5 -0.142857 -0.142857
6 -0.428571 -0.428571
7  0.428571  0.428571
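
The numbers above can be reproduced by hand. This sketch assumes `RobustScaler` is used with its default settings (median centering and the 25th to 75th percentile range) and hard-codes the training rows in the same order as the printed output.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Training rows from the example (columns: age, salary).
X = np.array([
    [50, 100000], [25, 50000], [60, 120000], [35, 70000],
    [100, 250000], [45, 90000], [40, 80000], [55, 110000],
], dtype=float)

median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
X_manual = (X - median) / iqr

print("Median:", median)
print("IQR:   ", iqr)

X_sklearn = RobustScaler().fit_transform(X)
print("Matches RobustScaler:", np.allclose(X_manual, X_sklearn))
```

For the first row, an age of 50 becomes `(50 - 47.5) / 17.5 ≈ 0.142857`, which agrees with the transformed table.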


### Normalization
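Normalization rescales each individual sample (row), not each feature (column), to unit norm (a length of 1). It is used when the direction of a sample's feature vector matters, not its magnitude, for example when comparing samples by cosine similarity.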

In [5]:
normalizer = Normalizer(norm='l2')
# Normalizer is stateless: fit learns nothing, each row is scaled by its own norm
X_train_normalized = normalizer.fit_transform(X_train)

print("Original first row:", X_train.iloc[0].values)
print("Normalized first row:", X_train_normalized[0])
print("L2 Norm of first row:", np.linalg.norm(X_train_normalized[0]))

Original first row: [    50 100000]
Normalized first row: [4.99999938e-04 9.99999875e-01]
L2 Norm of first row: 0.9999999999999999
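
To make the row-wise behavior concrete, here is a minimal sketch that divides each row by its own L2 norm and compares the result with `Normalizer`. The three rows are taken from the training data; the array `X` and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three training rows (columns: age, salary). The second row is exactly
# half the first, so both point in the same direction.
X = np.array([
    [50, 100000],
    [25, 50000],
    [100, 250000],
], dtype=float)

# Divide every row by its own Euclidean length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

X_sklearn = Normalizer(norm="l2").fit_transform(X)
print("Matches Normalizer:", np.allclose(X_manual, X_sklearn))

# Rows with the same direction become identical; magnitude is discarded.
print("Rows 0 and 1 identical:", np.allclose(X_manual[0], X_manual[1]))
print("Rows 0 and 2 identical:", np.allclose(X_manual[0], X_manual[2]))
```

Note that salary dominates each normalized row here because the features were not put on a comparable scale first, so the age component is close to 0.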
