# Chapter 2: Scale Machine Learning Data

Many machine learning algorithms expect data to be scaled consistently. There are two popular
methods that you should consider when scaling your data for machine learning. In this tutorial,
you will discover how you can rescale your data for machine learning. After reading this tutorial
you will know:

* How to normalize your data from scratch.
* How to standardize your data from scratch.
* When to normalize as opposed to standardize data.

Let’s get started.

## 2.1 Description

Many machine learning algorithms expect the scale of the input and even the output data to be
equivalent. Scaling can help in methods that weight inputs in order to make a prediction, such as
linear regression and logistic regression. It is practically required in methods that combine
weighted inputs in complex ways, such as artificial neural networks and deep learning.

### 2.1.1 Pima Indians Diabetes Dataset
In this tutorial we will use the Pima Indians Diabetes Dataset. This dataset involves the
prediction of the onset of diabetes within 5 years. The baseline performance on the problem is
approximately 65%. You can learn more about it in Appendix A, Section A.4. Download the dataset
and save it in a data subdirectory of your current working directory with the filename
pima-indians-diabetes.csv, which is the path used by the examples below.

## 2.2 Tutorial

This tutorial is divided into 3 parts:
1. Normalize Data.
2. Standardize Data.
3. When to Normalize and Standardize.

These steps will provide the foundations you need to handle scaling your own data.

### 2.2.1 Normalize Data

Normalization can refer to different techniques depending on context. Here, we use normalization
to refer to rescaling an input variable to the range between 0 and 1. Normalization requires
that you know the minimum and maximum values for each attribute.

These values can be estimated from training data or specified directly if you have deep knowledge
of the problem domain. You can easily estimate the minimum and maximum values for each
attribute in a dataset by scanning through the values. The snippet of code below defines
the dataset_minmax() function that calculates the min and max value for each attribute in a
dataset, then returns them as a PDL array with one [min, max] row per column.

In [1]:
# Load libraries
use strict;
use warnings;
use PDL;
use PDL::NiceSlice;
use Data::Dump qw(dump);
use sml; # Statistical Machine Learning Library

In [2]:
# Calculate the minimum and maximum of each column using PDL
sub dataset_minmax {
    my ($dataset) = @_;

    # Swap the two dimensions so that the reductions below run down each column
    my $transposed = $dataset->xchg(0, 1);

    # Minimum and maximum of each column
    my $mins = $transposed->minimum;
    my $maxs = $transposed->maximum;

    # Combine them into a single piddle with one [min, max] row per column
    my $minmax = cat($mins, $maxs)->xchg(0, 1);

    return $minmax;
}

# Register the function as a method of the sml class
sml->add_to_class('dataset_minmax', \&{'dataset_minmax'});


We can contrive a small dataset of two rows and two columns to test our function for
calculating the min and max for each column.

In [3]:
# Create a 2x2 piddle using PDL
my $dataset = pdl [[50, 30], [20, 90]];
print "Dataset:\n", $dataset;

# Calculate the minimum and maximum of each column
my $minmax = dataset_minmax($dataset);
print "Min and max per column:\n", $minmax;

Dataset:

[
 [50 30]
 [20 90]
]
Min and max per column:

[
 [20 50]
 [30 90]
]

Once we have estimates of the maximum and minimum allowed values for each column, we
can normalize the raw data to the range between 0 and 1. The calculation to normalize a single
value for a column is:

<center>$scaled\_value = (value - min)\ /\ (max - min)$</center>  (2.1)
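
As a quick check of equation (2.1), take the first column of the contrived dataset above, which
has min = 20 and max = 50. The value 35 is a hypothetical extra value, included only to show a
point inside the range:

<center>$scaled\_value(50) = (50 - 20)\ /\ (50 - 20) = 1$</center>
<center>$scaled\_value(20) = (20 - 20)\ /\ (50 - 20) = 0$</center>
<center>$scaled\_value(35) = (35 - 20)\ /\ (50 - 20) = 0.5$</center>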

Below is an implementation of this in a function called normalize_dataset() that normalizes
values in each column of a provided dataset.

In [4]:
# Normalize a dataset to the range 0-1 using PDL
sub normalize_dataset {
    my ($dataset, $minmax) = @_;

    # Each row of $minmax is [min, max] for one column, so the first element
    # of every row is that column's minimum and the second is its maximum
    my $mins = $minmax->slice('(0),:');
    my $maxs = $minmax->slice('(1),:');

    # Range of each column
    my $range = $maxs - $mins;

    # Normalization: (value - min) / (max - min).
    # $mins and $range hold one element per column, so PDL broadcasts them
    # across every row of the dataset
    my $normalized = ($dataset - $mins) / $range;

    return $normalized;
}

We can tie this function together with the dataset_minmax() function and normalize the
contrived dataset.

In [5]:
# Create a 2x2 piddle using PDL
my $dataset = pdl [[50, 30], [20, 90]];
print "Dataset:\n", $dataset;

# Calculate the minimum and maximum of each column
my $minmax = dataset_minmax($dataset);
print "Min and max per column:\n", $minmax;

# Normalize the dataset
my $normalized = normalize_dataset($dataset, $minmax);
print "Normalized dataset:\n", $normalized;

Dataset:

[
 [50 30]
 [20 90]
]
Min and max per column:

[
 [20 50]
 [30 90]
]
Normalized dataset:

[
 [1 0]
 [0 1]
]

We can combine this code with code for loading a CSV dataset and load and normalize the
Pima Indians Diabetes dataset. The example first loads the dataset and converts the values for
each column from string to floating point values. The minimum and maximum values for each
column are estimated from the dataset, and finally, the values in the dataset are normalized.

In [6]:
# Load the dataset from a CSV file
my $filename = './data/pima-indians-diabetes.csv';
my $raw_dataset = sml->load_csv($filename);

# Convert the string values to numbers and build a PDL piddle
my $dataset = pdl(map { [map {$_ + 0} @$_] } @$raw_dataset);

# In PDL, dim(0) is the number of columns and dim(1) is the number of rows
printf "Loaded data file %s with %d rows and %d columns.\n\n", $filename, $dataset->dim(1), $dataset->dim(0);
print "Dataset[0]: ", join(", ", map { sprintf "%0.2f", $_ } list $dataset->slice(":,(0)")), "\n";

# Calculate the minimum and maximum of each column
my $minmax = dataset_minmax($dataset);

# Normalize the dataset and show the first row again
$dataset = normalize_dataset($dataset, $minmax);
print "Normalized[0]: ", join(", ", map { sprintf "%0.2f", $_ } list $dataset->slice(":,(0)")), "\n";

Loaded data file ./data/pima-indians-diabetes.csv with 768 rows and 9 columns.

Dataset[0]: 6.00, 148.00, 72.00, 35.00, 0.00, 33.60, 0.63, 50.00, 1.00
Normalized[0]: 0.35, 0.74, 0.59, 0.35, 0.00, 0.50, 0.23, 0.48, 1.00

### 2.2.2 Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the
value 0 and scales its standard deviation to the value 1. Together, the mean and the standard
deviation can be used to summarize a normal distribution, also called the Gaussian distribution
or bell curve.

Standardization requires that the mean and standard deviation of the values for each column be
known prior to scaling. As with normalizing above, we can estimate these values from training data, or
use domain knowledge to specify their values. Let’s start with creating functions to estimate
the mean and standard deviation statistics for each column from a dataset. The mean describes
the middle or central tendency for a collection of numbers. The mean for a column is calculated
as the sum of all values for a column divided by the total number of values.<br><br>

<center>$mean = \sum_{i=1}^n values_i\ /\ count(values)$</center> (2.2)

The function below named column_means() calculates the mean values for each column in
the dataset.

In [7]:
# Calculate the mean of each column using PDL
sub column_means {
    my ($dataset) = @_;

    # Swap the two dimensions so that the reduction below runs down each column
    my $transposed = $dataset->xchg(0, 1);

    # Mean of each column
    my $means = $transposed->average;

    return $means;
}

# Register the function as a method of the sml class
sml->add_to_class('column_means', \&{'column_means'});


The standard deviation describes the average spread of values from the mean. It is calculated
by summing the squared difference between each value and the mean, dividing that sum by the
number of values minus 1, and taking the square root of the result.<br><br>

<center>$standard\ deviation = \sqrt{\sum_{i=1}^n (values_i - mean)^2\ /\ (count(values) - 1)}$</center> (2.3)
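
To make equation (2.3) concrete, here is the calculation for a small hypothetical column
containing the values 2, 4 and 6 (chosen for easy arithmetic, not taken from the dataset):

<center>$mean = (2 + 4 + 6)\ /\ 3 = 4$</center>
<center>$\sum_{i=1}^n (values_i - mean)^2 = (2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2 = 8$</center>
<center>$standard\ deviation = \sqrt{8\ /\ (3 - 1)} = \sqrt{4} = 2$</center>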

The function below named column_stdevs() calculates the standard deviation of values for
each column in the dataset and assumes the means have already been calculated.

In [8]:
# Calculate the sample standard deviation of each column using PDL
sub column_stdevs {
    my ($dataset, $means) = @_;

    # Swap the two dimensions so that the reductions below run down each column
    my $transposed = $dataset->xchg(0, 1);

    # Give the means a leading dummy dimension so they broadcast along each column
    $means = $means->dummy(0);

    # Sum of squared differences from the mean for each column
    my $sq_diff_sum = (($transposed - $means)**2)->sumover;

    # Divide by the number of values minus 1, then take the square root
    my $stdevs = sqrt($sq_diff_sum / ($transposed->dim(0) - 1));

    return $stdevs;
}

# Register the function as a method of the sml class
sml->add_to_class('column_stdevs', \&{'column_stdevs'});

Using the contrived dataset, extended with a third row, we can estimate the summary statistics.

In [9]:
# Create a piddle with 3 rows and 2 columns using PDL
my $dataset = pdl [[50, 30], [20, 90], [30, 50]];
print "Dataset:\n", $dataset;

# Calculate the mean and standard deviation of each column
my $means = column_means($dataset);
my $stdevs = column_stdevs($dataset, $means);

# Format the results to two decimal places
my $formatted_means = join(", ", map { sprintf "%0.2f", $_ } list $means);
my $formatted_stdevs = join(", ", map { sprintf "%0.2f", $_ } list $stdevs);

# Print the results
print "Means: [$formatted_means]\n";
print "Stdevs: [$formatted_stdevs]\n";

Dataset:

[
 [50 30]
 [20 90]
 [30 50]
]
Means: [33.33, 56.67]
Stdevs: [15.28, 30.55]

Once the summary statistics are calculated, we can easily standardize the values in each
column. The calculation to standardize a given value is as follows:<br><br>

<center>$standardized\_value_i = (value_i - mean)\ /\ stdev$</center>  (2.4)

Below is a function named standardize_dataset() that implements this equation.

In [10]:
# Standardize a dataset using PDL
sub standardize_dataset {
    my ($dataset, $means, $stdevs) = @_;

    # Standardization: (value - mean) / standard deviation.
    # $means and $stdevs hold one element per column, so PDL broadcasts them
    # across every row of the dataset
    my $standardized = ($dataset - $means) / $stdevs;

    return $standardized;
}

# Register the function as a method of the sml class
sml->add_to_class('standardize_dataset', \&{'standardize_dataset'});

Combining this with the functions to estimate the mean and standard deviation summary
statistics, we can standardize our contrived dataset.

In [11]:
# Create a piddle with 3 rows and 2 columns using PDL
my $dataset = pdl [[50, 30], [20, 90], [30, 50]];
print "Dataset:\n", $dataset;

# Estimate the mean and standard deviation of each column
my $means = column_means($dataset);
my $stdevs = column_stdevs($dataset, $means);

# Standardize the dataset
my $standardized = standardize_dataset($dataset, $means, $stdevs);

# Print each standardized row to two decimal places
print "Standardized dataset:\n";
for my $i (0 .. $standardized->dim(1) - 1) {
    my @row = map { sprintf "%0.2f", $_ } list $standardized->slice(":,($i)");
    print "[", join(", ", @row), "]\n";
}

Dataset:

[
 [50 30]
 [20 90]
 [30 50]
]
Standardized dataset:
[1.09, -0.87]
[-0.87, 1.09]
[-0.22, -0.22]

Again, we can demonstrate the standardization of a machine learning dataset. The example
below demonstrates how to load and standardize the Pima Indians diabetes dataset, assumed
to be in the same location as in the previous normalization example.

In [12]:
# Load the dataset from a CSV file
my $filename = './data/pima-indians-diabetes.csv';
my $raw_dataset = sml->load_csv($filename);

# Convert the string values to numbers and build a PDL piddle
my $dataset = pdl(map { [map {$_ + 0} @$_] } @$raw_dataset);

# In PDL, dim(0) is the number of columns and dim(1) is the number of rows
print "Loaded data file $filename with ", $dataset->dim(1), " rows and ", $dataset->dim(0), " columns.\n\n";

# Show the first row of the original dataset
print "Dataset[0]: ", join(", ", map { sprintf "%0.2f", $_ } list $dataset->slice(":,(0)")), "\n\n";

# Estimate the mean and standard deviation of each column
my $means = column_means($dataset);
my $stdevs = column_stdevs($dataset, $means);

# Standardize the dataset and show the first row again
$dataset = standardize_dataset($dataset, $means, $stdevs);
print "Standardized[0]: ", join(", ", map { sprintf "%0.2f", $_ } list $dataset->slice(":,(0)")), "\n\n";

Loaded data file ./data/pima-indians-diabetes.csv with 768 rows and 9 columns.

Dataset[0]: 6.00, 148.00, 72.00, 35.00, 0.00, 33.60, 0.63, 50.00, 1.00

Standardized[0]: 0.64, 0.85, 0.15, 0.91, -0.69, 0.20, 0.47, 1.43, 1.37

### 2.2.3 When to Normalize and Standardize

Standardization is a scaling technique that assumes your data conforms to a normal distribution.
If a given data attribute is normal or close to normal, this is probably the scaling method to use.
It is good practice to record the summary statistics used in the standardization process so that
you can apply them when standardizing data in the future that you may want to use with your
model. Normalization is a scaling technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your
machine learning algorithm. It is good practice to record the minimum and maximum values
for each column used in the normalization process, again, in case you need to normalize new
data in the future to be used with your model.
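
The advice to record the scaling parameters can be sketched as follows. This is a minimal
illustration rather than part of the sml library: the %scaler hash, the $train data and the
$new_rows values are all hypothetical, and the statistics are computed inline so the block runs
on its own with only PDL installed.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical training data: 3 rows, 2 columns
my $train = pdl [[50, 30], [20, 90], [30, 50]];

# Fit: per-column statistics estimated from the training data only
my $by_col = $train->xchg(0, 1);      # reductions now run down each column
my $mins   = $by_col->minimum;
my $maxs   = $by_col->maximum;
my $means  = $by_col->average;
my $n      = $by_col->dim(0);         # number of rows
my $stdevs = sqrt((($by_col - $means->dummy(0))**2)->sumover / ($n - 1));

# Keep the fitted values together so they can be stored alongside the model
my %scaler = (mins => $mins, maxs => $maxs, means => $means, stdevs => $stdevs);

# Apply: rescale unseen rows with the stored training statistics,
# never with statistics recomputed from the new data
my $new_rows = pdl [[35, 60], [60, 30]];
my $new_normalized   = ($new_rows - $scaler{mins}) / ($scaler{maxs} - $scaler{mins});
my $new_standardized = ($new_rows - $scaler{means}) / $scaler{stdevs};

print "Normalized new rows:", $new_normalized;
print "Standardized new rows:", $new_standardized;
```

Note that the value 60 in the first column lies outside the training range of 20 to 50, so it
normalizes to a value above 1. Whether to clip such values or leave them as they are is a
design choice that depends on the model.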

## 2.3 Extensions

There are many other data transforms you could apply. The idea of data transforms is to best
expose the structure of your problem in your data to the learning algorithm. It may not be
clear what transforms are required upfront. A combination of trial and error and exploratory
data analysis (plots and stats) can help tease out what may work. Below are some additional
transforms you may want to consider researching and implementing:
* Normalization that permits a configurable range, such as -1 to 1 and more.
* Standardization that permits a configurable spread, such as 1, 2 or more standard deviations
from the mean.
* Exponential transforms such as logarithm, square root and exponents.
* Power transforms such as Box-Cox for reducing the skew in data that is close to normally distributed.
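
As a starting point for the first extension, a configurable-range normalization might look like
the sketch below. The function name normalize_to_range and its $lo and $hi parameters are
hypothetical, not part of the sml library, and the block computes its own per-column minimum
and maximum so that it runs on its own with only PDL installed.

```perl
use strict;
use warnings;
use PDL;

# Rescale each column of $dataset to the range [$lo, $hi].
# Hypothetical helper for illustration only.
sub normalize_to_range {
    my ($dataset, $lo, $hi) = @_;

    my $by_col = $dataset->xchg(0, 1);   # reductions now run down each column
    my $mins   = $by_col->minimum;
    my $maxs   = $by_col->maximum;

    # First map each column to 0..1, then stretch and shift it to lo..hi
    my $unit = ($dataset - $mins) / ($maxs - $mins);
    return $unit * ($hi - $lo) + $lo;
}

my $dataset = pdl [[50, 30], [20, 90], [30, 50]];
my $scaled  = normalize_to_range($dataset, -1, 1);
print "Scaled to [-1, 1]:", $scaled;
```

With $lo = 0 and $hi = 1 this reduces to equation (2.1). A column whose values are all equal has
max - min = 0, so a real implementation would need to guard against that division by zero.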

## 2.4 Review

In this tutorial, you discovered how to rescale your data for machine learning from scratch.
Specifically, you learned:
* How to normalize data from scratch.
* How to standardize data from scratch.
* When to use normalization or standardization on your data.