# Data Normalisation

Normalisation is a common data preprocessing step that rescales the values of numeric columns to a common scale, without distorting the relative differences between values within each column.

In [None]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format

## Data Loading

Load in a sample dataset:

In [None]:
df = pd.read_csv("penguins_af.csv", index_col=0)
df.head(10)

Inspect the numeric columns:

In [None]:
numeric_columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year"]
df[numeric_columns].head()

The summary statistics for these numeric features show that their ranges are quite different:

In [None]:
df[numeric_columns].describe().transpose()

## Min-Max Normalisation

One common preprocessing approach is **min-max normalisation**, which rescales each feature's values to the range [0, 1], based on that feature's minimum and maximum values.
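
As a rough sketch of the arithmetic involved, each value is rescaled as x' = (x - min) / (max - min). The example below applies this to a small made-up array using NumPy. The function name `min_max_scale` and the sample values are assumptions for illustration only; they are not part of the dataset or of scikit-learn.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] using its own minimum and maximum."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # note: a constant feature (hi == lo) would cause a division by zero here
    return (values - lo) / (hi - lo)

# hypothetical flipper lengths in mm, for illustration only
sample = np.array([180.0, 190.0, 200.0, 220.0])
scaled = min_max_scale(sample)
# the smallest value maps to 0 and the largest maps to 1
print(scaled)
```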

We can use the *MinMaxScaler* implementation in scikit-learn.

In [None]:
from sklearn.preprocessing import MinMaxScaler
# copy the original data
X = df.copy()
# apply the scaling process to the numeric columns
scaler = MinMaxScaler()
X[numeric_columns] = scaler.fit_transform(X[numeric_columns])
# inspect the result
X.head()

We can now see that the numeric features all share the same range, from 0 to 1:

In [None]:
X[numeric_columns].describe().transpose()

We can also reverse the normalisation process if we need to recover the original data. This is done by applying an inverse transform:

In [None]:
# apply the inverse transform
Z = scaler.inverse_transform(X[numeric_columns])
# turn the array back into a DataFrame
pd.DataFrame(Z, index=X.index, columns=numeric_columns)
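
For min-max normalisation, the inverse transform undoes the rescaling: x = x' * (max - min) + min. The following sketch illustrates the round trip on made-up values using NumPy only. The variable names and data are assumptions for illustration.

```python
import numpy as np

# hypothetical body masses in grams, for illustration only
original = np.array([3000.0, 3500.0, 4200.0, 5000.0])
lo, hi = original.min(), original.max()

# forward transform: rescale to [0, 1]
scaled = (original - lo) / (hi - lo)
# inverse transform: map the scaled values back to the original units
restored = scaled * (hi - lo) + lo

# the round trip should recover the original values, up to floating-point error
print(np.allclose(restored, original))
```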

## Z-Score Normalisation

Another common approach is **z-score normalisation** (also sometimes called **standard scaling**). Here, for every value of a feature, we subtract the feature's mean and divide by the feature's standard deviation.
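
In formula terms, each value becomes z = (x - mean) / std. A minimal sketch on made-up values is shown below. The function name `z_score` is hypothetical, and the population standard deviation (`ddof=0`) is assumed here, which corresponds to the default behaviour of scikit-learn's *StandardScaler*.

```python
import numpy as np

def z_score(values):
    """Standardise a 1-D array to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    # ddof=0 gives the population standard deviation
    return (values - values.mean()) / values.std(ddof=0)

# made-up values, for illustration only
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = z_score(sample)
# after scaling, the mean is 0 and the population standard deviation is 1
print(z.mean(), z.std(ddof=0))
```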

We can apply this using the scikit-learn *StandardScaler* implementation.

In [None]:
from sklearn.preprocessing import StandardScaler
# copy the original data
X = df.copy()
# apply the scaling process to the numeric columns
scaler = StandardScaler()
X[numeric_columns] = scaler.fit_transform(X[numeric_columns])
# inspect the result
X.head()

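We can see that each feature now has a mean of approximately 0 and a standard deviation of approximately 1. Unlike min-max normalisation, the minimum and maximum values still differ between features: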

In [None]:
X[numeric_columns].describe().transpose()
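
One detail worth keeping in mind when reading this summary: pandas `describe()` reports the sample standard deviation (`ddof=1`), whereas *StandardScaler* scales by the population standard deviation (`ddof=0`), so the reported `std` can sit marginally above 1. The sketch below shows the gap on a small made-up array, using NumPy only. The values are assumptions for illustration.

```python
import numpy as np

# made-up values, for illustration only
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# standardise using the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)

pop_std = z.std(ddof=0)     # the quantity that is scaled to exactly 1
sample_std = z.std(ddof=1)  # the sample estimate, slightly larger than 1
print(pop_std, sample_std)
```

The two estimates differ by a factor of sqrt(n / (n - 1)), so with a few hundred rows the difference is very small.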