# Data Preprocessing - Normalization
Data normalization adjusts values measured on different scales to a common scale. In a dataframe, this means bringing the values of different columns onto a common scale. Normalization is strongly recommended when the columns of a dataframe are used as input features of a machine learning algorithm, because it gives all the features a comparable weight.

Normalization applies only to columns containing numeric values. This tutorial covers five normalization methods:
* single feature scaling
* min max
* z-score
* log scaling
* clipping

In the remainder of the tutorial, we apply each method to a single column. However, if you want to use every column of the dataset as an input feature of a machine learning algorithm, you should apply the same normalization method to all the columns.
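As a minimal sketch of applying one method to every numeric column at once, the snippet below uses min max (described later in this tutorial) on a small made-up dataframe. The frame `toy` and its column names are hypothetical and are not part of the Protezione Civile dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Protezione Civile dataset.
toy = pd.DataFrame({
    "region": ["A", "B", "C"],
    "cases": [10, 20, 40],
    "tests": [100, 300, 500],
})

# Pick the numeric columns only; "region" is left untouched.
numeric_cols = toy.select_dtypes(include="number").columns

# Min max on every numeric column at once: pandas aligns the
# per-column minima and maxima with the matching columns.
col_min = toy[numeric_cols].min()
col_max = toy[numeric_cols].max()
normalized = (toy[numeric_cols] - col_min) / (col_max - col_min)

# Reassemble the frame with the non-numeric columns unchanged.
result = toy.drop(columns=numeric_cols).join(normalized)
print(result)
```

Selecting columns with `select_dtypes()` keeps text columns such as region names out of the arithmetic, which would otherwise raise an error.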

In this tutorial, we use the `pandas` library to perform normalization. As an alternative, you could use the [preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) methods of the `scikit-learn` library. **A little note for readers**: if you would like to learn how to use the preprocessing package of `scikit-learn`, please drop me a message or comment on this post :)

## Data Import
As an example dataset, we use the data provided by the Italian Protezione Civile on the number of COVID-19 cases registered since the beginning of the COVID-19 pandemic. The dataset is updated daily and can be downloaded from [this link](https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv).

First of all, we need to import the Python `pandas` library and read the dataset through the `read_csv()` function. Then we can drop all the columns that contain `NaN` values. This is done through the `dropna()` function with `axis=1`.

In [1]:
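import pandas as pd
# Load the regional dataset directly from the Protezione Civile repository
df = pd.read_csv('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv')
# axis=1 drops every column that contains at least one NaN value
df.dropna(axis=1, inplace=True)
df.head(10)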

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148
5,2020-02-24T18:00:00,ITA,6,Friuli Venezia Giulia,45.649435,13.768136,0,0,0,0,0,0,0,0,0,0,58
6,2020-02-24T18:00:00,ITA,12,Lazio,41.89277,12.483667,1,1,2,0,2,0,2,1,0,3,124
7,2020-02-24T18:00:00,ITA,7,Liguria,44.411493,8.932699,0,0,0,0,0,0,0,0,0,0,1
8,2020-02-24T18:00:00,ITA,3,Lombardia,45.466794,9.190347,76,19,95,71,166,0,166,0,6,172,1463
9,2020-02-24T18:00:00,ITA,11,Marche,43.61676,13.518875,0,0,0,0,0,0,0,0,0,0,16
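To make the effect of `axis=1` concrete, here is a small sketch on a made-up frame (the frame `toy` and its values are hypothetical). It contrasts dropping columns with dropping rows, since a single missing value is enough to remove a whole column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" has one missing value.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, np.nan, 3.0]})

cols_dropped = toy.dropna(axis=1)  # removes column "b" entirely
rows_dropped = toy.dropna(axis=0)  # removes only the row holding the NaN

print(cols_dropped.columns.tolist())
print(rows_dropped.shape)
```

Which of the two is preferable depends on the data: dropping columns keeps every row but may discard useful features.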


## Single Feature Scaling
Single Feature Scaling converts every value of a column into a number between 0 and 1, provided the column contains no negative values. The new value is calculated as the current value divided by the maximum value of the column. For example, we can apply single feature scaling to the column `tamponi` by dividing it by the result of the `max()` function, which calculates the maximum value of the column:

In [5]:
df['tamponi'] = df['tamponi']/df['tamponi'].max()
df.head(10)

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0.0,9.55625e-07
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0.0,0.0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0.0,1.91125e-07
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0.0,1.91125e-06
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,3.5e-05,2.82865e-05
5,2020-02-24T18:00:00,ITA,6,Friuli Venezia Giulia,45.649435,13.768136,0,0,0,0,0,0,0,0,0,0.0,1.108525e-05
6,2020-02-24T18:00:00,ITA,12,Lazio,41.89277,12.483667,1,1,2,0,2,0,2,1,0,6e-06,2.36995e-05
7,2020-02-24T18:00:00,ITA,7,Liguria,44.411493,8.932699,0,0,0,0,0,0,0,0,0,0.0,1.91125e-07
8,2020-02-24T18:00:00,ITA,3,Lombardia,45.466794,9.190347,76,19,95,71,166,0,166,0,6,0.000335,0.0002796159
9,2020-02-24T18:00:00,ITA,11,Marche,43.61676,13.518875,0,0,0,0,0,0,0,0,0,0.0,3.058e-06
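The following sketch repeats the same division on two short, made-up series (the names `swabs` and `deltas` and their values are hypothetical). It also illustrates that the result stays within 0 and 1 only when the column has no negative values.

```python
import pandas as pd

# Hypothetical non-negative values: the result lies in [0, 1].
swabs = pd.Series([5, 0, 10, 20])
scaled = swabs / swabs.max()
print(scaled.tolist())

# Hypothetical values with a negative entry: the result can fall below 0.
deltas = pd.Series([-10, 0, 5])
scaled_deltas = deltas / deltas.max()
print(scaled_deltas.tolist())
```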


## Min Max
Like Single Feature Scaling, Min Max converts every value of a column into a number between 0 and 1. The new value is calculated as the difference between the current value and the minimum value, divided by the range of the column values (the maximum minus the minimum). For example, we can apply the min max method to the column `totale_casi`.

In [3]:
df['totale_casi'] = (df['totale_casi'] - df['totale_casi'].min())/(df['totale_casi'].max() - df['totale_casi'].min())
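The formula divides by the range of the column, which is zero when every value is identical. Below is a minimal sketch of a helper that guards against that case. The function name `min_max` and the choice to map a constant column to zeros are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def min_max(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1]; a constant column maps to 0."""
    span = column.max() - column.min()
    if span == 0:
        # Assumption: a constant column carries no information, so return zeros.
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span

cases = pd.Series([18, 3, 172, 0])   # hypothetical values
constant = pd.Series([7, 7, 7])      # hypothetical constant column

print(min_max(cases).round(3).tolist())
print(min_max(constant).tolist())
```

It could be applied to the dataset as `df['totale_casi'] = min_max(df['totale_casi'])`.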

## Z-Score
Z-Score converts every value of a column into a number around 0. For roughly normally distributed data, most values obtained by a z-score transformation fall between -3 and 3. The new value is calculated as the difference between the current value and the average value, divided by the standard deviation. The average value of a column can be obtained through the `mean()` function, and the standard deviation through the `std()` function. For example, we can calculate the z-score of the column `deceduti`.

In [11]:
df['deceduti'] = (df['deceduti']-df['deceduti'].mean())/df['deceduti'].std()

Now we can calculate the minimum and maximum value obtained by the z-score transformation:

In [13]:
df['deceduti'].min()

-0.4329770144818199

In [14]:
df['deceduti'].max()

5.945028962275545
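The maximum above is well beyond 3, which is what one would expect from a strongly skewed column. The sketch below, on made-up values (the series `deaths` is hypothetical), checks that the transformed column has mean 0 and standard deviation 1, and shows that the result depends on which standard deviation is used: `pandas` defaults to the sample standard deviation (`ddof=1`), while `numpy`'s `std()` defaults to the population one (`ddof=0`).

```python
import pandas as pd

deaths = pd.Series([0, 0, 6, 2, 12])  # hypothetical values

# pandas uses the sample standard deviation (ddof=1) by default.
z_sample = (deaths - deaths.mean()) / deaths.std()

# Population standard deviation (ddof=0), the default of numpy's std().
z_population = (deaths - deaths.mean()) / deaths.std(ddof=0)

# After the transformation the mean is 0 and the matching std is 1.
print(round(z_sample.mean(), 10), round(z_sample.std(), 10))
print(round(z_population.mean(), 10), round(z_population.std(ddof=0), 10))
```

The two variants differ only by a constant factor, so the choice rarely matters for large columns, but it should be consistent across a project.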

## Log Scaling
Log Scaling converts a column to the logarithmic scale. To use the natural logarithm, we can use the `log()` function of the `numpy` library. For example, we can apply log scaling to the column `dimessi_guariti`. We must handle zero values separately, because `log(0)` is undefined; here we map them to 0. We use a `lambda` function with `apply()` to transform each value of the column.

In [20]:
import numpy as np

df['dimessi_guariti'] = df['dimessi_guariti'].apply(lambda x: np.log(x) if x != 0 else 0)

0        0.000000
1        0.000000
2        0.000000
3        0.000000
4        0.000000
          ...    
5812     9.846388
5813    10.794296
5814     9.474088
5815     8.372861
5816    10.922389
Name: dimessi_guariti, Length: 5817, dtype: float64
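One side effect of mapping zeros to 0 is that the values 0 and 1 become indistinguishable, since `log(1)` is also 0. A possible alternative, sketched here on made-up values (the series `recovered` is hypothetical), is `numpy`'s `log1p()`, which computes `log(1 + x)` and is defined at 0. This is a swapped-in technique, not the one used in the notebook.

```python
import numpy as np
import pandas as pd

recovered = pd.Series([0, 1, 10, 1000])  # hypothetical values

# log1p computes log(1 + x), so 0 maps to 0 and 1 maps to log(2).
log_scaled = np.log1p(recovered)

# The approach used above maps both 0 and 1 to 0.
zero_guard = recovered.apply(lambda x: np.log(x) if x != 0 else 0)

print(log_scaled.round(3).tolist())
print(zero_guard.round(3).tolist())
```

Being vectorized, `np.log1p()` also avoids the per-row `apply()` call, at the cost of shifting every value by one before taking the logarithm.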

## Clipping
Clipping caps all values below or above a certain threshold. Clipping is useful when a column contains some outliers. We can choose a maximum value `vmax` and a minimum value `vmin`, then set all values greater than `vmax` to `vmax` and all values lower than `vmin` to `vmin`. For example, we can consider the column `ricoverati_con_sintomi` and set `vmax = 10000` and `vmin = 10`.

In [22]:
vmax = 10000
vmin = 10

df['ricoverati_con_sintomi'] = df['ricoverati_con_sintomi'].apply(lambda x: vmax if x > vmax else vmin if x < vmin else x)
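The same result can be obtained with the built-in `clip()` method of `pandas`, which avoids the nested conditional. A minimal sketch on made-up values follows (the series `hospitalized` is hypothetical; `vmin` and `vmax` are the thresholds used above).

```python
import pandas as pd

vmax = 10000
vmin = 10

hospitalized = pd.Series([0, 76, 10, 12000])  # hypothetical values

# Values below vmin become vmin, values above vmax become vmax.
clipped = hospitalized.clip(lower=vmin, upper=vmax)
print(clipped.tolist())  # [10, 76, 10, 10000]
```

On the dataset this would read `df['ricoverati_con_sintomi'].clip(lower=vmin, upper=vmax)`.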