# Scaling

Scaling transforms the columns so that they share the **same order of magnitude**. This is sometimes necessary because features on very different scales can distort the importance of each feature as *perceived* by the model.
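
To make the effect concrete, here is a minimal sketch with a distance-based comparison. The three samples and their two features (an age in years and an income in dollars) are made up for illustration and are not taken from the dataset used below.

```python
import numpy as np

# Hypothetical samples: [age in years, income in dollars]
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])  # very different age, similar income
c = np.array([26.0, 50_600.0])  # similar age, slightly larger income gap

# Raw Euclidean distances: the income column dominates the result
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print("raw:   ", dist_ab, dist_ac)

# Min-max scale each column to [0, 1] so both features contribute comparably
data = np.vstack([a, b, c])
scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print("scaled:", scaled_ab, scaled_ac)
```

With these made-up numbers, the raw distance ranks `b` as closer to `a` than `c` is, even though `b` is 35 years older; after scaling, the ranking flips.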

Because the scalers used here apply a linear transformation to each column, scaling doesn't change the correlation between the features and the target, so it doesn't affect the predictive power of the features.
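
This can be checked numerically. The sketch below assumes a synthetic feature and target generated with NumPy (not the notebook's data) and compares the Pearson correlation before and after min-max scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=20.0, size=200)  # synthetic feature
y = 3.0 * x + rng.normal(scale=10.0, size=200)   # synthetic target

# Min-max scale the feature to [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())

# Pearson correlation with the target, before and after scaling
corr_raw = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]
print(corr_raw, corr_scaled)
```

The two values agree up to floating-point error, since Pearson correlation is unaffected by shifting a variable or multiplying it by a positive constant.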

Additionally, some models work best when the features lie in an interval around 0 (logistic regression, KNN, neural networks...).

# 5. Normalization, Standardization and Robust Scaling



In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("sample_dataset.csv")

In [3]:
X = df.iloc[:,0:3].dropna()

In [4]:
X

Unnamed: 0,mean radius,mean texture,mean perimeter
1,20.57,17.77,132.90
2,19.69,21.25,130.00
3,11.42,20.38,77.58
5,12.45,15.70,82.57
7,13.71,20.83,90.20
...,...,...,...
562,15.22,30.62,103.40
563,20.92,25.09,143.00
564,21.56,22.39,142.00
566,16.60,28.08,108.30


## 5.1 Normalization (Min-Max Scaling)

- Transforms the data to a specified range, usually [0, 1].
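
For each column, min-max scaling computes `(x - min) / (max - min)`. The sketch below uses a small made-up array (`X_demo` is a hypothetical name, not the notebook's `X`) to compare the manual formula with `MinMaxScaler`, and shows the `feature_range` parameter for a range other than [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up array: two columns with different magnitudes
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [4.0, 600.0]])

# Manual min-max scaling, column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# Same operation with scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))

# A different target range can be requested with feature_range
X_custom = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_demo)
print(X_custom.min(axis=0), X_custom.max(axis=0))
```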

In [5]:
from sklearn.preprocessing import MinMaxScaler

In [6]:
scaler = MinMaxScaler()

In [12]:
X_scaled = scaler.fit_transform(X)

In [13]:
X_scaled

array([[0.63004759, 0.27257355, 0.60432679],
       [0.58687012, 0.3902604 , 0.58368915],
       [0.18110004, 0.36083869, 0.21064617],
       ...,
       [0.67862225, 0.42881299, 0.66908625],
       [0.43525833, 0.62123774, 0.42926274],
       [0.63151955, 0.66351031, 0.65556504]])

In [9]:
import numpy as np

In [14]:
np.apply_over_axes(np.max, X_scaled, 0)

array([[1., 1., 1.]])

In [15]:
np.apply_over_axes(np.min, X_scaled, 0)

array([[0., 0., 0.]])

## 5.2 Standardization (Z-score scaling)

- Transforms the data to have a mean of 0 and a standard deviation of 1.
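
For each column, standardization computes `(x - mean) / std`. A minimal sketch on a small made-up array (`X_demo` is a hypothetical name) comparing the manual formula with `StandardScaler`; note that `StandardScaler` uses the population standard deviation, which corresponds to NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up array: two columns with different magnitudes
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [4.0, 600.0]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_manual = (X_demo - mean) / std

# Same operation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```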

In [16]:
from sklearn.preprocessing import StandardScaler

In [18]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

In [20]:
X_scaled

array([[ 1.87149535, -0.35527432,  1.7252216 ],
       [ 1.61663638,  0.44774298,  1.6034021 ],
       [-0.77845877,  0.24698866, -0.59859036],
       ...,
       [ 2.1582117 ,  0.71080038,  2.10748279],
       [ 0.72173384,  2.02377982,  0.6918562 ],
       [ 1.88018373,  2.31221994,  2.02767001]])

In [21]:
np.apply_over_axes(np.mean, X_scaled, 0)

array([[-7.38226219e-17, -2.86062660e-16, -1.84556555e-17]])

In [23]:
np.apply_over_axes(np.var, X_scaled, 0)

array([[1., 1., 1.]])

## 5.3 Robust scaling

- Uses the median and interquartile range (IQR) to scale the data, which makes it less sensitive to outliers.
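
For each column, robust scaling computes `(x - median) / IQR`, where the IQR is the difference between the 75th and 25th percentiles (the default `quantile_range` of `RobustScaler`). The sketch below uses a made-up column with one extreme value to compare the manual formula with `RobustScaler`, and contrasts the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up single column with one obvious outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# Manual robust scaling: center on the median, divide by the IQR
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
X_manual = (x - median) / (q3 - q1)

X_robust = RobustScaler().fit_transform(x)
print(np.allclose(X_manual, X_robust))

# With standardization, the outlier inflates the mean and the std,
# so the five ordinary values end up squeezed close together
X_standard = StandardScaler().fit_transform(x)
print(np.ptp(X_robust[:5]), np.ptp(X_standard[:5]))
```

In this toy case the five ordinary values keep a much wider spread under robust scaling than under standardization, because the median and IQR are barely affected by the outlier.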

In [24]:
from sklearn.preprocessing import RobustScaler

In [26]:
scaler = RobustScaler()

In [27]:
X_scaled = scaler.fit_transform(X)

In [28]:
X_scaled

array([[ 1.78484108, -0.18650089,  1.64689365],
       [ 1.56968215,  0.43161634,  1.54510355],
       [-0.45232274,  0.27708703, -0.29484029],
       ...,
       [ 2.02689487,  0.63410302,  1.96630397],
       [ 0.81418093,  1.64476021,  0.78343278],
       [ 1.79217604,  1.86678508,  1.8996139 ]])

In [29]:
np.apply_over_axes(np.median, X_scaled, 0)

array([[0., 0., 0.]])

In [30]:
scaler.inverse_transform(X_scaled)

array([[ 20.57,  17.77, 132.9 ],
       [ 19.69,  21.25, 130.  ],
       [ 11.42,  20.38,  77.58],
       ...,
       [ 21.56,  22.39, 142.  ],
       [ 16.6 ,  28.08, 108.3 ],
       [ 20.6 ,  29.33, 140.1 ]])

In [32]:
X.head(3)

Unnamed: 0,mean radius,mean texture,mean perimeter
1,20.57,17.77,132.9
2,19.69,21.25,130.0
3,11.42,20.38,77.58
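
Rather than comparing the two outputs by eye, the round trip can be checked numerically. A minimal sketch, assuming synthetic data in place of the notebook's `X`:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for X: 50 rows, 3 columns with different magnitudes
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[15.0, 20.0, 90.0], scale=[3.0, 4.0, 20.0], size=(50, 3))

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_demo)

# inverse_transform undoes the scaling using the statistics learned in fit
X_restored = scaler.inverse_transform(X_scaled)
print(np.allclose(X_restored, X_demo))
```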


### Learnings from `5-scaling.ipynb`

**Pseudocode:**
1. Import necessary libraries.
2. Load and preprocess the dataset.
3. Apply different scaling techniques: Normalization, Standardization, and Robust Scaling.
4. Observe the results of each scaling technique.

**Code:**
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Load and preprocess the dataset
df = pd.read_csv("sample_dataset.csv")
X = df.iloc[:, 0:3].dropna()

# Normalization (Min-Max Scaling)
min_max_scaler = MinMaxScaler()
X_min_max_scaled = min_max_scaler.fit_transform(X)

# Standardization (Z-score scaling)
standard_scaler = StandardScaler()
X_standard_scaled = standard_scaler.fit_transform(X)

# Robust Scaling
robust_scaler = RobustScaler()
X_robust_scaled = robust_scaler.fit_transform(X)
```

**Learnings:**
1. **Normalization (Min-Max Scaling):** Transforms the data to a specified range, usually [0, 1].
2. **Standardization (Z-score scaling):** Transforms the data to have a mean of 0 and a standard deviation of 1.
3. **Robust Scaling:** Uses the median and interquartile range (IQR) to scale the data, which is less sensitive to outliers.

## Exercises

- Load `sample_dataset.csv`.
- Clean the numerical variables by filling missing values with the median, then normalize them.
- Change the scaler to the standard scaler and transform the dataset again.
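
One possible way to approach these steps is sketched below. It assumes a small made-up DataFrame (`df_demo`, a hypothetical stand-in for the CSV file) and fills missing values with pandas' `fillna`; other cleaning strategies would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical stand-in for pd.read_csv("sample_dataset.csv")
df_demo = pd.DataFrame({
    "mean radius": [20.57, np.nan, 11.42, 12.45],
    "mean texture": [17.77, 21.25, np.nan, 15.70],
    "label": ["a", "b", "a", "b"],
})

# Keep only the numerical columns and fill missing values with the column median
num_cols = df_demo.select_dtypes(include="number").columns
X_filled = df_demo[num_cols].fillna(df_demo[num_cols].median())

# Normalize to [0, 1]
X_norm = MinMaxScaler().fit_transform(X_filled)

# Swap in the standard scaler and transform again
X_std = StandardScaler().fit_transform(X_filled)
print(X_norm)
print(X_std)
```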