# Data Preprocessing - Standardization
This tutorial explains how to preprocess data using the Pandas library. Preprocessing is a preliminary analysis of the data that transforms it into a standard, consistent format. Preprocessing involves the following aspects:
* missing values
* data formatting
* data normalization
* data standardization
* data binning
In this tutorial we deal only with standardization. Standardization is often confused with normalization, but they refer to different things. Normalization adjusts values measured on different scales to a common scale, while standardization transforms data to have a mean of zero and a standard deviation of one.
Standardization is typically done through a z-score transformation, where the new value is calculated as the difference between the current value and the mean, divided by the standard deviation.
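
To make the difference concrete, here is a minimal sketch on a toy series (the values are made up for illustration). It assumes min-max scaling as the form of normalization, which is one common choice among several, and it uses the population standard deviation (`ddof=0`) for the z-score, i.e. z = (x - mean) / std.

```python
import pandas as pd

# Toy values, made up for illustration only.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization (an assumed choice): rescale to the [0, 1] interval.
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ddof=0 selects the population standard deviation.
standardized = (values - values.mean()) / values.std(ddof=0)

print(normalized.tolist())             # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardized.round(3).tolist())  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After normalization the values are bounded between 0 and 1. After standardization they are unbounded, but have mean 0 and standard deviation 1.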

The z-score is a statistical measure that specifies how far a single data point is from the mean of the dataset, expressed in standard deviations. As highlighted by Mahbubul Alam in [his article](https://towardsdatascience.com/z-score-for-anomaly-detection-d98b0006f510), the z-score can be used to detect outliers in a dataset.

The z-score can be calculated manually as described in [my previous post](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-3-normalisation-5b5392d27673). However, in this tutorial I will show you how to calculate it using functions from the `scipy.stats` module.

In this tutorial we consider two types of standardization:
* z-score
* z-map

## Data Import
As an example dataset, this tutorial uses the data provided by the Italian Protezione Civile on the number of COVID-19 cases registered since the beginning of the pandemic. The dataset is updated daily and can be downloaded from [this link](https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv).

First, we import the Python `pandas` library and read the dataset with the `read_csv()` function. We also import `numpy`, which is used later for outlier detection. Then we drop every column that contains `NaN` values, using the `dropna()` function with `axis=1`.
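
To see what column-wise `dropna()` does in isolation, here is a small sketch on a hypothetical toy dataframe (the column names `region`, `cases` and `notes` are made up and are not part of the real dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are made up for illustration.
toy = pd.DataFrame({
    "region": ["A", "B", "C"],
    "cases": [10, 20, 30],
    "notes": [np.nan, "checked", np.nan],
})

# axis=1 drops every column containing at least one NaN; rows are untouched.
cleaned = toy.dropna(axis=1)

print(list(cleaned.columns))  # ['region', 'cases']
```

Note that a single missing value is enough to remove a whole column. If that is too aggressive for a given dataset, the `thresh` parameter of `dropna()` can be used to keep columns that have at least a given number of non-missing values.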

In [5]:
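import pandas as pd
import numpy as np

# Load the regional COVID-19 dataset published by the Protezione Civile
df = pd.read_csv('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv')
# Drop every column that contains at least one NaN value
df.dropna(axis=1, inplace=True)
df.tail(10)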

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
8432,2021-03-31T17:00:00,ITA,21,P.A. Bolzano,46.499335,11.356624,90,20,110,576,686,-33,120,67142,1126,68954,1123271
8433,2021-03-31T17:00:00,ITA,22,P.A. Trento,46.068935,11.121231,201,51,252,2611,2863,-143,187,37086,1282,41231,702081
8434,2021-03-31T17:00:00,ITA,1,Piemonte,45.073274,7.680687,3873,376,4249,30810,35059,-358,2298,263913,10308,309280,3310922
8435,2021-03-31T17:00:00,ITA,16,Puglia,41.125596,16.867367,1840,260,2100,44757,46857,81,1962,141343,4812,193012,1871149
8436,2021-03-31T17:00:00,ITA,20,Sardegna,39.215312,9.110616,222,34,256,14141,14397,366,444,29872,1234,45503,1005266
8437,2021-03-31T17:00:00,ITA,19,Sicilia,38.115697,13.362357,891,140,1031,18889,19920,2503,2904,150806,4628,175354,3162046
8438,2021-03-31T17:00:00,ITA,9,Toscana,43.769231,11.255889,1560,265,1825,26282,28107,217,1538,161919,5348,195374,3427730
8439,2021-03-31T17:00:00,ITA,10,Umbria,43.106758,12.388247,349,60,409,4397,4806,-239,162,44846,1256,50908,980871
8440,2021-03-31T17:00:00,ITA,2,Valle d'Aosta,45.737503,7.320149,49,8,57,845,902,0,62,7971,425,9298,95322
8441,2021-03-31T17:00:00,ITA,5,Veneto,45.434905,12.338452,1676,282,1958,36739,38697,30,2317,333516,10625,382838,6184340


## z-score
The z-score of a value is calculated as the difference between that value and the mean, divided by the standard deviation. For example, we can calculate the z-score of the column `deceduti` with the `zscore()` function of the `scipy.stats` module and store the result in a new column, `zscore-deceduti`.

In [6]:
from scipy.stats import zscore
df['zscore-deceduti'] = zscore(df['deceduti'])
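
As a sanity check, the sketch below compares `zscore()` with the manual formula on a toy series (values made up for illustration). One detail worth knowing: `zscore()` uses the population standard deviation (`ddof=0`) by default, whereas the pandas `std()` method defaults to `ddof=1`, so `ddof=0` has to be passed explicitly for the two results to match.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy values, made up for illustration only (mean 5, population std 2).
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# zscore() uses the population standard deviation (ddof=0) by default.
z_scipy = zscore(s)

# Equivalent manual computation; pandas' std() defaults to ddof=1,
# so ddof=0 is passed explicitly to match scipy.
z_manual = (s - s.mean()) / s.std(ddof=0)

print(np.allclose(z_scipy, z_manual))  # True
```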

## z-map
The z-map of a value is calculated as the difference between that value and the mean of a comparison array, divided by the standard deviation of the same comparison array. For example, we can calculate the z-map of the column `deceduti`, using the column `terapia_intensiva` as the comparison array, with the `zmap()` function of the `scipy.stats` module.

In [7]:
from scipy.stats import zmap
zmap(df['deceduti'], df['terapia_intensiva'])

array([-0.54814841, -0.54814841, -0.54814841, ...,  8.50292398,
        2.51451542, 76.01844725])
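
The following sketch reproduces `zmap()` by hand on toy arrays (values made up for illustration), to show that the mean and standard deviation come from the comparison array only.

```python
import numpy as np
from scipy.stats import zmap

# Toy arrays, made up for illustration only.
scores = np.array([1.0, 5.0, 9.0])
compare = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std 2

z_scipy = zmap(scores, compare)

# Manual equivalent: numpy's std() defaults to ddof=0, like zmap().
z_manual = (scores - compare.mean()) / compare.std()

print(np.allclose(z_scipy, z_manual))  # True
# The resulting values are -2, 0 and 2.
```

Because the scores are measured against another array's distribution, the results can be very large when the two arrays are on different scales, as the value 76.02 in the output above suggests.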

## Detect outliers
Standardization can be used to detect and remove outliers. A threshold on the z-score defines which values are considered outliers. In this example, we set `threshold = 2`. We add a new column to the dataframe, called `outliers`, which is set to `True` if the z-score is less than `-2` or greater than `2`, and to `False` otherwise. We use the `numpy` function `where()` to build the column from this comparison.

In [8]:
threshold = 2
df['outliers'] = np.where(df['zscore-deceduti'].abs() > threshold, True, False)

Now we can remove the rows flagged as outliers, using the `drop()` function.

In [9]:
df.drop(df[df['outliers']].index, inplace=True)

In [10]:
df.shape

(8072, 19)
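
The whole procedure can also be sketched on a small toy dataframe, which makes the effect easy to verify by eye. The column names `value`, `z` and `outlier` are made up for illustration. As an alternative to `np.where()` and `drop()`, this version builds the flag with a direct boolean comparison and filters with a boolean mask, which leaves the original dataframe untouched.

```python
import pandas as pd
from scipy.stats import zscore

# Toy data: nine values close to 10 and one extreme value (made up).
toy = pd.DataFrame({"value": [10, 11, 9, 10, 12, 8, 10, 11, 9, 100]})

threshold = 2

# Standardize, then flag rows whose absolute z-score exceeds the threshold.
toy["z"] = zscore(toy["value"])
toy["outlier"] = toy["z"].abs() > threshold

# Keep only the non-outlier rows; `toy` itself is not modified.
filtered = toy[~toy["outlier"]]

print(toy.shape, filtered.shape)  # (10, 3) (9, 3)
```

Only the value 100 is flagged here: its z-score is about 3, while all other values have an absolute z-score below 0.5.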