# Data Preprocessing - Normalization
As noted in my previous tutorial on data normalization, normalization adjusts values measured on different scales to a common scale.

Normalization applies only to columns containing numeric values. Five common normalization methods are listed below; this tutorial covers the first three:
* single feature scaling
* min max
* z-score
* log scaling
* clipping
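
As a quick orientation, the five methods listed above can be sketched with plain `numpy` on a toy array. This is only an illustrative sketch: the array `x` and the clipping bounds `5.0` and `500.0` are made-up values, not taken from the dataset used later.

```python
import numpy as np

# Toy column of strictly positive values (hypothetical data, for illustration only).
x = np.array([1.0, 10.0, 100.0, 1000.0])

# Single feature scaling: divide by the maximum value.
single_feature = x / x.max()

# Min max: subtract the minimum, then divide by the range.
min_max = (x - x.min()) / (x.max() - x.min())

# z-score: subtract the mean, then divide by the standard deviation.
z_score = (x - x.mean()) / x.std()

# Log scaling: take the logarithm (requires strictly positive values).
log_scaled = np.log(x)

# Clipping: cap values outside an interval chosen by the analyst (assumed bounds).
clipped = np.clip(x, 5.0, 500.0)

print(single_feature, min_max, z_score, log_scaled, clipped, sep="\n")
```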

In this tutorial, we use the `scikit-learn` library to perform normalization, whereas my previous tutorial dealt with data normalization using the `pandas` library. The `scikit-learn` library can also be used to deal with missing values, as explained in my previous post.

All the `scikit-learn` operations described in this tutorial follow the same steps:
* select a preprocessing methodology
* fit it through the `fit()` function
* apply it to data through the `transform()` function.
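
The three steps above can be sketched on a tiny array as follows. The array `X_toy` is a made-up example; `fit_transform()` is the `scikit-learn` shortcut that performs the fit and the transformation in a single call.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: three samples, one feature each.
X_toy = np.array([[1.0], [3.0], [5.0]])

# 1. Select a preprocessing methodology.
scaler = MinMaxScaler()

# 2. Fit it: the scaler learns the statistics it needs (here, min and max).
scaler.fit(X_toy)

# 3. Apply it to the data.
X_out = scaler.transform(X_toy)

# Steps 2 and 3 can also be combined in one call.
X_out_one_step = MinMaxScaler().fit_transform(X_toy)

print(X_out)
```

Keeping `fit()` and `transform()` separate is useful when the same fitted scaler has to be applied later to other data.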

The `scikit-learn` scalers expect a two-dimensional array as input, where each row represents a sample of the dataset and each column a feature. Thus, to scale a single dataframe column, we first convert it to an array through the `numpy.array()` function, which receives the dataframe column as input. The result is a one-dimensional array, so we use the `reshape()` function to convert it to a two-dimensional array with a single column.
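
The conversion can be sketched as follows. The dataframe `toy` and its column `value` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a single numeric column.
toy = pd.DataFrame({"value": [5, 0, 1, 10]})

# A dataframe column converted to an array is one-dimensional: shape (4,).
flat = np.array(toy["value"])

# reshape(-1, 1) yields one row per sample and one column: shape (4, 1).
X_2d = flat.reshape(-1, 1)

# Alternative: selecting with double brackets keeps the data two-dimensional.
X_alt = toy[["value"]].to_numpy()

# ravel() flattens the two-dimensional array back to one dimension.
back = X_2d.ravel()

print(flat.shape, X_2d.shape, X_alt.shape, back.shape)
```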

## Data Import
As an example dataset, in this tutorial we consider the dataset provided by the Italian Protezione Civile, related to the number of COVID-19 cases registered since the beginning of the COVID-19 pandemic. The dataset is updated daily and can be downloaded from [this link](https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv).

First of all, we need to import the Python `pandas` library and read the dataset through the `read_csv()` function. Then we can drop all the columns that contain `NaN` values. This is done through the `dropna()` function with `axis=1`.

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv')
df.dropna(axis=1,inplace=True)
df.head(10)

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148
5,2020-02-24T18:00:00,ITA,6,Friuli Venezia Giulia,45.649435,13.768136,0,0,0,0,0,0,0,0,0,0,58
6,2020-02-24T18:00:00,ITA,12,Lazio,41.89277,12.483667,1,1,2,0,2,0,2,1,0,3,124
7,2020-02-24T18:00:00,ITA,7,Liguria,44.411493,8.932699,0,0,0,0,0,0,0,0,0,0,1
8,2020-02-24T18:00:00,ITA,3,Lombardia,45.466794,9.190347,76,19,95,71,166,0,166,0,6,172,1463
9,2020-02-24T18:00:00,ITA,11,Marche,43.61676,13.518875,0,0,0,0,0,0,0,0,0,0,16


## Single Feature Scaling
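Single Feature Scaling converts every value of a column into a number between 0 and 1. The new value is calculated as the current value divided by the max value of the column. This can be done through the `MaxAbsScaler` class, which divides each value by the maximum absolute value of the column. For a column without negative values, such as `tamponi`, the result lies between 0 and 1; if the column contained negative values, the result would lie between -1 and 1.
We apply the scaler to the `tamponi` column, which must be converted to an array and reshaped.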

In [2]:
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array(df['tamponi']).reshape(-1,1)
scaler = MaxAbsScaler()

Now we can fit the scaler and then apply the transformation. We convert the result back to a one-dimensional array through `reshape(1,-1)[0]` and store it in a new column of the dataframe `df`.

In [4]:
scaler.fit(X)
X_scaled = scaler.transform(X)
df['single feature scaling'] = X_scaled.reshape(1,-1)[0]
df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,9.55625e-07
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,1.91125e-07
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10,1.91125e-06
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,2.82865e-05
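
To check what `MaxAbsScaler` computes, we can compare it with the manual formula on a handful of values. In this sketch the array `X_small` reuses the first five `tamponi` values shown above, so the maximum is 148 rather than the maximum of the whole column, and the scaled values differ from those in the table. The array `X_signed` is a made-up example showing what happens with negative values.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# First five values of the tamponi column, as a column vector.
X_small = np.array([5, 0, 1, 10, 148], dtype=float).reshape(-1, 1)

# Result computed by scikit-learn.
scaled = MaxAbsScaler().fit_transform(X_small)

# Manual formula: value divided by the maximum absolute value.
manual = X_small / np.abs(X_small).max()

print(np.allclose(scaled, manual))

# Hypothetical column with a negative value: the output falls in [-1, 1].
X_signed = np.array([[-4.0], [2.0]])
signed_scaled = MaxAbsScaler().fit_transform(X_signed)  # [[-1.0], [0.5]]
```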


The `scikit-learn` library also provides a method, `inverse_transform()`, to restore the original values from the transformed ones. This method also works for the transformations described later in this article.

In [6]:
scaler.inverse_transform(X_scaled)

array([[5.000000e+00],
       [0.000000e+00],
       [1.000000e+00],
       ...,
       [5.507300e+05],
       [6.654400e+04],
       [3.643743e+06]])
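
A minimal round-trip check, assuming a small made-up array instead of the full column, shows that `inverse_transform()` gives back the values passed to `transform()`:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sample values, already shaped as a column vector.
X_small = np.array([[5.0], [0.0], [1.0], [10.0], [148.0]])

# Fit once, then transform and invert with the same fitted scaler.
scaler = MaxAbsScaler().fit(X_small)
X_scaled_small = scaler.transform(X_small)
restored = scaler.inverse_transform(X_scaled_small)

# The restored values match the originals up to floating-point error.
print(np.allclose(restored, X_small))
```

Note that the inversion relies on the statistics stored in the fitted scaler, so the same `scaler` object must be used for both calls.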

## Min Max
Similarly to Single Feature Scaling, Min Max converts every value of a column into a number between 0 and 1. The new value is calculated as the difference between the current value and the min value, divided by the range of the column values (max value minus min value). In `scikit-learn` we use the `MinMaxScaler` class. For example, we can apply the min max method to the column `totale_casi`.

In [10]:
from sklearn.preprocessing import MinMaxScaler
X = np.array(df['totale_casi']).reshape(-1,1)
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
df['min max'] = X_scaled.reshape(1,-1)[0]
df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling,min max
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,9.55625e-07,0.0
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,1.91125e-07,0.0
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10,1.91125e-06,0.0
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,2.82865e-05,3.5e-05
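
The min max formula can be verified by hand on a few values. In this sketch `X_small` contains four `totale_casi` values taken from the first rows of the dataset, so the min and max are those of this small sample only. The second part uses the `feature_range` parameter of `MinMaxScaler` to target a different interval, which goes slightly beyond what the tutorial needs.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few totale_casi values, as a column vector.
X_small = np.array([[0.0], [18.0], [3.0], [172.0]])

# Result computed by scikit-learn.
scaled = MinMaxScaler().fit_transform(X_small)

# Manual formula: (value - min) / (max - min).
manual = (X_small - X_small.min()) / (X_small.max() - X_small.min())

print(np.allclose(scaled, manual))

# Optional: scale to [-1, 1] instead of the default [0, 1].
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_small)
```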


## Z-Score
Z-Score converts every value of a column into a number around 0. Typical values obtained by a z-score transformation range between -3 and 3. The new value is calculated as the difference between the current value and the average value, divided by the standard deviation. In `scikit-learn` we can use the `StandardScaler` class. For example, we can calculate the z-score of the column `deceduti`.

In [11]:
from sklearn.preprocessing import StandardScaler

X = np.array(df['deceduti']).reshape(-1,1)
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
df['z score'] = X_scaled.reshape(1,-1)[0]
df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling,min max,z score
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,9.55625e-07,0.0,-0.46295
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,-0.46295
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,1.91125e-07,0.0,-0.46295
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10,1.91125e-06,0.0,-0.46295
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,2.82865e-05,3.5e-05,-0.46295
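
One detail worth checking when comparing with a manual calculation: `StandardScaler` divides by the population standard deviation (`ddof=0`), while the `std()` method in `pandas` uses the sample standard deviation (`ddof=1`) by default. The following sketch, based on a made-up series `s`, shows the difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, for illustration only.
s = pd.Series([0.0, 0.0, 6.0, 2.0])
X_small = s.to_numpy().reshape(-1, 1)

# Result computed by scikit-learn, flattened to one dimension.
scaled = StandardScaler().fit_transform(X_small).ravel()

# Manual z-score with the population standard deviation (matches scikit-learn).
manual_pop = (s - s.mean()) / s.std(ddof=0)

# Manual z-score with the pandas default (ddof=1): slightly different values.
manual_sample = (s - s.mean()) / s.std()

print(np.allclose(scaled, manual_pop), np.allclose(scaled, manual_sample))
```

After the transformation, the column has mean 0 and standard deviation 1 (computed with `ddof=0`).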


For more details, you can take a look at [this link](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).