# Data Normalization

### Definition

The goal of normalization is to transform features so that they are on a similar scale. Putting features on a similar scale improves the performance and training stability of the model.

## Normalization Techniques

- Scaling to a range (min-max)
- Clipping
- Log scaling
- Z-score

## Scaling to a Range

Scaling to a range is a good choice when both of the following conditions are met:

- You know the approximate upper and lower bounds on your data, with few or no outliers.
- Your data is approximately uniformly distributed across that range.

In [None]:
# Simple feature scaling (min-max): map depth linearly onto [0, 1]

dim['depth'] = (dim['depth'] - dim['depth'].min()) / (dim['depth'].max() - dim['depth'].min())
dim.head()
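
The same formula can be wrapped in a small helper. This is only a sketch: the function name `min_max_scale` and the sample values are illustrative assumptions, not part of the original notebook. It also guards against a constant column, where the range is zero and the formula would divide by zero.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series linearly onto [0, 1] (hypothetical helper)."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no range; return zeros instead of dividing by zero.
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - lo) / (hi - lo)


# Illustrative depth-like values chosen for easy arithmetic.
sample = pd.Series([43.0, 61.0, 79.0], name="depth")
scaled = min_max_scale(sample)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.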

## Feature Clipping

- If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain threshold at a fixed value.
- For example, you could clip all temperature values above 40 to be exactly 40.
- You may apply feature clipping before or after other normalizations.
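
As a minimal sketch of the temperature example, pandas' `Series.clip` caps values at given bounds. The readings and the lower bound of -10 are assumptions made for illustration; only the upper bound of 40 comes from the text above.

```python
import pandas as pd

# Hypothetical temperature readings with one extreme value on each side.
temperature = pd.Series([-60.0, 12.5, 21.0, 38.0, 55.0], name="temperature")

# Cap values above 40 at exactly 40 and values below -10 at exactly -10.
# The lower threshold is an assumption chosen for this illustration.
clipped = temperature.clip(lower=-10.0, upper=40.0)
print(clipped.tolist())  # [-10.0, 12.5, 21.0, 38.0, 40.0]
```

Values inside the bounds are left unchanged, so clipping only affects the outliers.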



In [None]:
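# Feature Clipping
# Note: dividing by the column maximum rescales every value to at most 1
# (max scaling). It does not cap outliers; true clipping would use
# Series.clip() with chosen lower/upper thresholds.

dim['depth'] = dim['depth'] / dim['depth'].max()
dim['table'] = dim['table'] / dim['table'].max()
dim.head()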

## Log Scaling

- Log scaling computes the log of your values to compress a wide range into a narrow range.
- Log scaling is helpful when a handful of your values have many points, while most other values have few points.
- This data distribution is known as the power law distribution.
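
The compression effect can be seen on a small example. The counts below are made-up values spanning several orders of magnitude. `np.log` requires strictly positive inputs, so the sketch also shows `np.log1p`, which computes log(1 + x) and is a common choice when zeros are present.

```python
import numpy as np
import pandas as pd

# Hypothetical counts spanning several orders of magnitude.
counts = pd.Series([1, 10, 100, 1000, 100000], name="count")

# The natural log compresses the range; all inputs must be strictly positive.
log_counts = np.log(counts)
print(counts.max() - counts.min())          # spread before: 99999
print(log_counts.max() - log_counts.min())  # spread after: about 11.5

# log1p(x) = log(1 + x) stays finite at x = 0.
with_zero = pd.Series([0, 9, 99], name="count")
log1p_values = np.log1p(with_zero)
print(log1p_values.round(3).tolist())
```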

In [None]:
# Log Scaling (natural log; values must be strictly positive)

dim['depth'] = np.log(dim['depth'])
dim.head()

## Z-Score

- Z-score is a variation of scaling that represents the number of standard deviations a value lies away from the mean.
- You would use z-score to ensure your feature distributions have mean = 0 and std = 1.
- It’s useful when there are a few outliers, but not so extreme that you need clipping.
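
A short sketch on made-up values shows the mean = 0 and std = 1 property. One detail to be aware of: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `std` defaults to the population form (`ddof=0`), so the two can give slightly different z-scores on small samples.

```python
import numpy as np
import pandas as pd

# Illustrative values with mean 5 and population standard deviation 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name="feature")

# pandas default: sample standard deviation (ddof=1).
z_sample = (values - values.mean()) / values.std()

# ddof=0 gives the population standard deviation, matching NumPy's default.
z_population = (values - values.mean()) / values.std(ddof=0)

print(z_population.tolist())  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(round(z_sample.std(), 6))  # 1.0
```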

In [None]:
# Z score (standard score): subtract the mean, divide by the standard deviation

dim['depth'] = (dim['depth'] - dim['depth'].mean()) / dim['depth'].std()
dim.head()

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # pyplot is the plotting interface usually aliased as plt
from numpy.random import poisson, randn, seed
from scipy import stats
from scipy.stats import shapiro

### Separate copies of the dataset for comparing the normalization techniques

In [22]:
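# Load the diamonds dataset once, then give each technique its own copy
# so the transformations do not overwrite one another.
dim = sns.load_dataset("diamonds")
dim_FC = dim.copy()      # feature clipping
dim_Z = dim.copy()       # z-score
dim_Log = dim.copy()     # log scaling
dim_Srange = dim.copy()  # scaling to a range (min-max)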

In [16]:
dim.head(5)

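|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |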


In [17]:
dim.shape

(53940, 10)

In [29]:
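# dropna() returns a new DataFrame without rows that contain missing values;
# dim itself is not modified because the result is not assigned.
dim.dropna()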

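|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |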


In [18]:
dim.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

In [19]:
dim.describe()

|   | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |



## Simple Feature Scaling (Min-Max)

In [24]:
# Simple feature scaling (min-max): map depth linearly onto [0, 1]
dim_Srange['depth'] = (dim_Srange['depth'] - dim_Srange['depth'].min()) / (dim_Srange['depth'].max() - dim_Srange['depth'].min())
dim_Srange.head()

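|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 0.513889 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 0.466667 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 0.386111 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 0.538889 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 0.563889 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |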


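## Feature Clipping

In [25]:
# Feature Clipping
# Note: dividing by the column maximum rescales every value to at most 1
# (max scaling). It does not cap outliers; true clipping would use
# Series.clip() with chosen lower/upper thresholds.

dim_FC['depth'] = dim_FC['depth'] / dim_FC['depth'].max()
dim_FC['table'] = dim_FC['table'] / dim_FC['table'].max()
dim_FC.head()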

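|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 0.778481 | 0.578947 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 0.756962 | 0.642105 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 0.720253 | 0.684211 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 0.789873 | 0.610526 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 0.801266 | 0.610526 | 335 | 4.34 | 4.35 | 2.75 |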


## Log Scaling

In [26]:
# Log Scaling

dim_Log['depth'] = np.log(dim_Log['depth'])
dim_Log.head()

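|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 4.119037 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 4.091006 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 4.041295 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 4.133565 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 4.147885 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |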


## Z-Score

In [27]:
# Z score (standard score): subtract the mean, divide by the standard deviation

dim_Z['depth'] = (dim_Z['depth'] - dim_Z['depth'].mean()) / dim_Z['depth'].std()
dim_Z.head()

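|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | -0.17409 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | -1.360726 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | -3.384987 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 0.454129 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 1.082348 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |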
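
To compare the techniques side by side, one option is to apply each of them to the same column and summarize the results in a single table. The sketch below uses a tiny synthetic series rather than the diamonds data, and the clipping threshold of 10 is an assumption picked for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive feature with one large outlier (not the diamonds data).
raw = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="raw")

transformed = pd.DataFrame({
    "raw": raw,
    "min_max": (raw - raw.min()) / (raw.max() - raw.min()),
    "clipped": raw.clip(upper=10.0),  # threshold is an assumption
    "log": np.log(raw),
    "z_score": (raw - raw.mean()) / raw.std(),
})

# One row per technique: how each transformation changes range and spread.
summary = transformed.agg(["min", "max", "mean", "std"]).T
print(summary.round(3))
```

In this toy example the outlier squeezes the min-max values of the other points toward 0, while clipping and log scaling reduce its influence, which mirrors the guidance given for each technique earlier in the notebook.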
