# What is data normalization?

Data normalization is one of the steps of data preprocessing. It rescales the values of each attribute so that their ranges are consistent with one another.
For example, two variables with very different ranges can both be normalized into values that range from 0 to 1.

Let's look at an example below.

## Demonstrating

Consider the dataset df_payslip. Its features "Salary", "Addition" and "Charge" contain values in very different ranges.

"Salary" ranges from 0 to 80000, while "Addition" ranges from 0 to 80 and "Charge" ranges from 0 to 900.

"Salary" values are therefore roughly 100 times larger than "Charge" values and 1,000 times larger than "Addition" values.

In [6]:
import pandas as pd

dict_payslip = {"IdRegistration":[231,232,332,224,335,632,327,856,923,105], 
                 "Salary":[20000, 0, 40000,10000,10000, 80000, 10000,20000, 50000, 40000], 
                 "Charge": [100, 150, 240,210,310, 0, 900, 120, 550, 240],
                 "Addition":[20, 50, 40,0,10, 80, 10,20, 50, 40]}

df_payslip = pd.DataFrame.from_dict(dict_payslip)

df_payslip

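| | IdRegistration | Salary | Charge | Addition |
|---|---|---|---|---|
| 0 | 231 | 20000 | 100 | 20 |
| 1 | 232 | 0 | 150 | 50 |
| 2 | 332 | 40000 | 240 | 40 |
| 3 | 224 | 10000 | 210 | 0 |
| 4 | 335 | 10000 | 310 | 10 |
| 5 | 632 | 80000 | 0 | 80 |
| 6 | 327 | 10000 | 900 | 10 |
| 7 | 856 | 20000 | 120 | 20 |
| 8 | 923 | 50000 | 550 | 50 |
| 9 | 105 | 40000 | 240 | 40 |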
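
The ranges quoted earlier can be checked directly against the data. The following is a minimal sketch that rebuilds only the three feature columns so it runs on its own; the names `features`, `ranges` and `ratios` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Same feature values as df_payslip, rebuilt so the snippet is self-contained.
features = pd.DataFrame({
    "Salary": [20000, 0, 40000, 10000, 10000, 80000, 10000, 20000, 50000, 40000],
    "Charge": [100, 150, 240, 210, 310, 0, 900, 120, 550, 240],
    "Addition": [20, 50, 40, 0, 10, 80, 10, 20, 50, 40],
})

# One row per statistic ("min", "max"), one column per feature.
ranges = features.agg(["min", "max"])
print(ranges)

# How many times larger the Salary maximum is than each feature's maximum.
ratios = ranges.loc["max", "Salary"] / ranges.loc["max"]
print(ratios.round(1))
```

The ratios come out close to 100 for "Charge" and exactly 1,000 for "Addition", which is the order-of-magnitude gap that normalization is meant to remove.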


## Reasons for normalization

1. It is important for computational reasons.
2. Mainly, though, it avoids bias. Statistically speaking, the attribute "Salary" will carry greater weight because of its larger values, but this doesn’t necessarily mean it is more ‘important’ as a predictor.
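
To make the bias concrete, here is a rough sketch that measures how much of the squared Euclidean distance between two employees comes from "Salary", before and after rescaling every feature to the 0 to 1 range. Euclidean distance is only an illustrative choice (the original text does not name a specific model), and the helper `salary_share` is hypothetical.

```python
import pandas as pd

# Same feature values as df_payslip, rebuilt so the snippet is self-contained.
features = pd.DataFrame({
    "Salary": [20000, 0, 40000, 10000, 10000, 80000, 10000, 20000, 50000, 40000],
    "Charge": [100, 150, 240, 210, 310, 0, 900, 120, 550, 240],
    "Addition": [20, 50, 40, 0, 10, 80, 10, 20, 50, 40],
})


def salary_share(df, i, j):
    """Fraction of the squared Euclidean distance between rows i and j due to Salary."""
    squared_diff = (df.loc[i] - df.loc[j]) ** 2
    return float(squared_diff["Salary"] / squared_diff.sum())


# Rows 0 and 6 differ by 10000 in Salary and by 800 in Charge
# (800 is almost the full Charge range of 0 to 900).
raw_share = salary_share(features, 0, 6)

# Rescale every column to the 0 to 1 range, then measure again.
scaled = (features - features.min()) / (features.max() - features.min())
scaled_share = salary_share(scaled, 0, 6)

print(f"Salary share of squared distance, raw:    {raw_share:.3f}")
print(f"Salary share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw data almost all of the distance is attributable to "Salary", even though the two rows differ far more in "Charge" relative to that feature's range. After rescaling, the "Charge" gap dominates instead.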

## Ways to normalize data

There are several ways to normalize data.
I will just outline three techniques.



*   The first method, called “simple feature scaling”, just divides each value by the maximum value for that feature. For a feature with no negative values, this makes the new values range between 0 and 1.

*   The second method, called “Min-Max”, takes each value, X_old, subtracts the minimum
value of that feature from it, then divides by the range of that feature (maximum minus minimum).
Again, the resulting new values range between 0 and 1.

*   The third method is called “z-score” or “standard score”.
For each value, you subtract mu, the average of the feature,
and then divide by sigma, the standard deviation.
The resulting values are centered on 0 and typically range between -3 and +3, but can be higher
or lower.
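
Written as formulas, the three methods above look as follows. The symbols $x_{new}$, $x_{min}$ and $x_{max}$ are notation introduced here for illustration; $x_{old}$, $\mu$ and $\sigma$ follow the descriptions in the list.

$$
\text{Simple feature scaling:} \quad x_{new} = \frac{x_{old}}{x_{max}}
$$

$$
\text{Min-Max:} \quad x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}
$$

$$
\text{Z-score:} \quad x_{new} = \frac{x_{old} - \mu}{\sigma}
$$

For instance, a "Charge" value of 100 with $x_{min} = 0$ and $x_{max} = 900$ gives $(100 - 0) / (900 - 0) \approx 0.111$ under Min-Max.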



### Simple feature scaling

In [8]:
df_payslip['Salary'] = df_payslip['Salary'] / df_payslip['Salary'].max()

# Inspect the result for Salary
df_payslip

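| | IdRegistration | Salary | Charge | Addition |
|---|---|---|---|---|
| 0 | 231 | 0.25 | 100 | 20 |
| 1 | 232 | 0.0 | 150 | 50 |
| 2 | 332 | 0.5 | 240 | 40 |
| 3 | 224 | 0.125 | 210 | 0 |
| 4 | 335 | 0.125 | 310 | 10 |
| 5 | 632 | 1.0 | 0 | 80 |
| 6 | 327 | 0.125 | 900 | 10 |
| 7 | 856 | 0.25 | 120 | 20 |
| 8 | 923 | 0.625 | 550 | 50 |
| 9 | 105 | 0.5 | 240 | 40 |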


### Min-Max

In [10]:
df_payslip['Charge'] = (df_payslip['Charge'] - df_payslip['Charge'].min()) / (df_payslip['Charge'].max() - df_payslip['Charge'].min())

# Inspect the result for Charge
df_payslip

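| | IdRegistration | Salary | Charge | Addition |
|---|---|---|---|---|
| 0 | 231 | 0.25 | 0.111111 | 20 |
| 1 | 232 | 0.0 | 0.166667 | 50 |
| 2 | 332 | 0.5 | 0.266667 | 40 |
| 3 | 224 | 0.125 | 0.233333 | 0 |
| 4 | 335 | 0.125 | 0.344444 | 10 |
| 5 | 632 | 1.0 | 0.0 | 80 |
| 6 | 327 | 0.125 | 1.0 | 10 |
| 7 | 856 | 0.25 | 0.133333 | 20 |
| 8 | 923 | 0.625 | 0.611111 | 50 |
| 9 | 105 | 0.5 | 0.266667 | 40 |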


### Z-score or Standard score

In [12]:
df_payslip['Addition'] = (df_payslip['Addition'] - df_payslip['Addition'].mean()) / df_payslip['Addition'].std()

# Inspect the result for Addition
df_payslip

| | IdRegistration | Salary | Charge | Addition |
|---|---|---|---|---|
| 0 | 231 | 0.25 | 0.111111 | -0.491723 |
| 1 | 232 | 0.0 | 0.166667 | 0.737584 |
| 2 | 332 | 0.5 | 0.266667 | 0.327815 |
| 3 | 224 | 0.125 | 0.233333 | -1.31126 |
| 4 | 335 | 0.125 | 0.344444 | -0.901491 |
| 5 | 632 | 1.0 | 0.0 | 1.96689 |
| 6 | 327 | 0.125 | 1.0 | -0.901491 |
| 7 | 856 | 0.25 | 0.133333 | -0.491723 |
| 8 | 923 | 0.625 | 0.611111 | 0.737584 |
| 9 | 105 | 0.5 | 0.266667 | 0.327815 |


## Conclusion

Now that all three variables are normalized, the dataset is ready for use in statistical models.
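
The cells above apply a different method to each column for demonstration purposes and overwrite df_payslip in place. In practice you might prefer to apply a single method to all features and keep the original data. The following is a minimal sketch of that idea using Min-Max; the helper `min_max`, the list `feature_cols` and the name `df_normalized` are assumptions introduced here, and mapping a constant column to 0.0 is just one possible convention.

```python
import pandas as pd

df_payslip = pd.DataFrame({
    "IdRegistration": [231, 232, 332, 224, 335, 632, 327, 856, 923, 105],
    "Salary": [20000, 0, 40000, 10000, 10000, 80000, 10000, 20000, 50000, 40000],
    "Charge": [100, 150, 240, 210, 310, 0, 900, 120, 550, 240],
    "Addition": [20, 50, 40, 0, 10, 80, 10, 20, 50, 40],
})

# The identifier column is not a feature, so it is left untouched.
feature_cols = ["Salary", "Charge", "Addition"]


def min_max(column):
    """Rescale a column to the 0 to 1 range using Min-Max."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column has zero range; return zeros instead of dividing by zero.
        return column * 0.0
    return (column - column.min()) / span


# Work on a copy so the original values stay available.
df_normalized = df_payslip.copy()
df_normalized[feature_cols] = df_payslip[feature_cols].apply(min_max)

print(df_normalized)
```

Every feature column in `df_normalized` now runs from 0 to 1, while `df_payslip` still holds the raw values.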