# Video: Standardizing Data with Scikit-Learn

This video shows how to standardize a dataset with scikit-learn.

In [None]:
import pandas as pd

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")
abalone

Script:
* Here's our usual abalone dataset.
* If we look at some basic summary statistics, the scale varies significantly by column.


In [None]:
abalone.describe()

Script:
* The rings column has the highest standard deviation, over three.
* The next highest is whole weight which is just under one half.
* And the smallest standard deviation is for height which is just 0.04.
* Some kinds of models are affected a lot by these scale differences, so let's try standardizing the columns.
* First, we will drop the Sex column since standardization only works with numbers.

In [None]:
abalone = abalone.drop("Sex", axis=1)
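When a table has several non-numeric columns, naming each one in `drop` gets tedious. As a minimal sketch of an alternative, pandas can select columns by dtype. The `toy` table below is a made-up stand-in for the abalone data, not the real values.

```python
import pandas as pd

# Hypothetical miniature stand-in for the abalone table (values are made up).
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Height": [0.095, 0.090, 0.135, 0.125],
    "Rings": [15, 7, 9, 10],
})

# Keep only the numeric columns, without naming the non-numeric ones.
toy_numeric = toy.select_dtypes(include="number")
print(list(toy_numeric.columns))
```

This silently discards the string columns, so it only suits cases where those columns are handled separately or are not needed.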

Script:
* Then we will compute the mean and standard deviation of each column.

In [None]:
abalone_mean = abalone.mean(axis=0)
abalone_mean

In [None]:
abalone_std = abalone.std(axis=0)
abalone_std

Script:
* And now we can compute a new version, standardized to mean zero and standard deviation one.

In [None]:
abalone_standardized = (abalone - abalone_mean) / abalone_std
abalone_standardized

Script:
* An earlier draft of this code skipped dropping the Sex column and used the numeric_only option for mean and std.
* However, the Sex column that resulted was all Not a Number values.
* And since that column is a string, the usual tweaks for missing data do not work so well.
* Generally, you will want to deal with non-numeric columns first before neatening up the numbers.
* Let's sanity check the result of standardization.

In [None]:
abalone_standardized.describe()
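The same sanity check can be done programmatically instead of by eye. The sketch below assumes a small made-up table in place of abalone and uses `numpy.allclose`, which compares values within a small tolerance so that tiny floating-point leftovers do not count as failures.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric table standing in for abalone (values are made up).
toy = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Height": [0.095, 0.090, 0.135, 0.125],
    "Rings": [15.0, 7.0, 9.0, 10.0],
})

# Same manual standardization as in the notebook.
toy_standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Every column mean should be approximately zero,
# and every column standard deviation approximately one.
means_ok = np.allclose(toy_standardized.mean(axis=0), 0.0)
stds_ok = np.allclose(toy_standardized.std(axis=0), 1.0)
print(means_ok, stds_ok)
```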

Script:
* The mean row doesn't look like zero at first glance, but if you look closely, the exponents range from -16 to -18.
* So these are all very small numbers close to zero.
* We are seeing the results of numerical imprecision here, and the mean is zero for practical purposes.
* For the standard deviation row, all the numbers are one.
* So there were fewer numerical issues here.
* Looks like the transformation worked.
* One question you might have about this transformation is why I broke it down into three steps.
* Could I not have done it in one expression like this?

In [None]:
(abalone - abalone.mean(axis=0)) / abalone.std(axis=0)

Script:
* I can do this calculation, but then I have not saved the mean and standard deviation to transform new data later.
* When we are running this process later, we will want to subtract the same means and divide by the same standard deviations that we just used.
* We do not want to use the means and standard deviations of the new data, because then the transformation would change with every batch and no longer match how we built the model.
* This gets particularly inane when we look at one new row of data.

In [None]:
new_abalone = abalone.iloc[0:1]
new_abalone

Script:
* Pretend this is a fresh row of data.
* What happens when we subtract the mean?


In [None]:
new_abalone - new_abalone.mean(axis=0)

Script:
* Subtracting the mean from a new batch of data with one row will give all zeros.
* And the standard deviation will be undefined if you apply the sample adjustment (dividing by n - 1, which is the pandas default) or zero if you do not.
* Either way, the calculation does not work.

In [None]:
(new_abalone - new_abalone.mean(axis=0)) / new_abalone.std(axis=0)
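Both outcomes for the one-row standard deviation can be seen by changing the `ddof` argument of `std`. This is a minimal sketch with a made-up table named `toy`. It shows that the division produces missing values either way.

```python
import pandas as pd

# Hypothetical numeric table; only the first row is treated as "new" data.
toy = pd.DataFrame({"Length": [0.455, 0.350, 0.530], "Rings": [15.0, 7.0, 9.0]})
one_row = toy.iloc[0:1]

# pandas defaults to ddof=1 (divide by n - 1), which is 0/0 for a single row.
sample_std = one_row.std(axis=0)
# ddof=0 divides by n instead, giving a standard deviation of exactly zero.
population_std = one_row.std(axis=0, ddof=0)

# Centering a single row on its own mean gives all zeros.
centered = one_row - one_row.mean(axis=0)

print(sample_std)
print(population_std)
print(centered / sample_std)      # zero divided by NaN
print(centered / population_std)  # zero divided by zero
```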

Script:
* The proper way to do these transformations is to save the means and standard deviations, so the transformation is consistent between training and later predictions.
* Scikit-learn has a built-in preprocessing class called StandardScaler which does this for you.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
standardize_transform = StandardScaler()
standardize_transform.fit(abalone)

In [None]:
standardize_transform.transform(abalone)
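Because the fitted scaler keeps the training statistics, it can be applied to rows it has not seen before. The sketch below assumes two made-up tables, `train` and `new_row`, and compares the scaler output with the manual formula. One detail to be aware of: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `std` defaults to `ddof=1`, so the manual version earlier in this notebook will differ very slightly from the scikit-learn result. With many rows the difference is small.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and one hypothetical new row (values are made up).
train = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Rings": [15.0, 7.0, 9.0, 10.0],
})
new_row = pd.DataFrame({"Length": [0.500], "Rings": [12.0]})

# Fit once on the training data; the statistics are stored on the scaler.
scaler = StandardScaler()
scaler.fit(train)
print(scaler.mean_)   # per-column training means
print(scaler.scale_)  # per-column training standard deviations (ddof=0)

# Transform the new row using the training statistics, not its own.
scaled_new = scaler.transform(new_row)

# Manual equivalent, using ddof=0 to match StandardScaler.
manual_new = (new_row - train.mean(axis=0)) / train.std(axis=0, ddof=0)
print(np.allclose(scaled_new, manual_new.to_numpy()))
```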

Script:
* The output that you get back from the scikit-learn StandardScaler is a numpy array, so the column names are lost.
* Usually that is ok since this will just be a temporary result.
* And it is very convenient for scikit-learn to track the standardization data for you.
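If the column names are needed after all, one option is to wrap the array back into a data frame. This is a minimal sketch using a made-up table named `toy`. It reuses the original column names and index.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric table standing in for abalone (values are made up).
toy = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Rings": [15.0, 7.0, 9.0, 10.0],
})

scaler = StandardScaler().fit(toy)

# transform returns a numpy array; rebuild a DataFrame with the original labels.
toy_scaled = pd.DataFrame(
    scaler.transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(toy_scaled)
```

In scikit-learn 1.2 and later, calling `set_output(transform="pandas")` on the scaler is another way to get a data frame back directly.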