# Scaling a dataset

This notebook shows how to scale a dataset for machine learning, using two of the most common scaling methods: normalization and standardization. Many machine learning algorithms require scaled input and output data, so one of these methods is typically applied to the dataset before training.

## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Loading data

### Pima indians diabetes dataset

**Description**: This dataset is used to predict the onset of diabetes in a Native American population (the [Akimel O'odham](https://en.wikipedia.org/wiki/Akimel_O%27odham)) from medical details. There are 768 observations with 8 input variables and 1 output variable. The baseline performance is a classification accuracy of approximately 65%. Missing values are encoded as zero values. The input features are as follows:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skinfold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
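
The quoted baseline of roughly 65% is consistent with always predicting the majority class. A minimal sketch of that arithmetic, assuming the positive-class fraction of 0.348958 that `data_pima.mean()` reports for `class` later in this notebook (the variable names are illustrative):

```python
# Majority-class baseline computed from figures quoted in this notebook.
n_total = 768                 # number of observations
positive_fraction = 0.348958  # mean of the binary `class` column

n_positive = round(n_total * positive_fraction)
n_negative = n_total - n_positive

# Always predicting the more frequent class is correct this often.
baseline_accuracy = max(n_positive, n_negative) / n_total
print(n_positive, n_negative, round(baseline_accuracy, 3))  # prints: 268 500 0.651
```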

In [2]:
fname_pima = 'data_set/pima-indians-diabetes.csv'

In [3]:
names = ['pregnant', 'glucose', 'pressure', 'triceps', 'insulin', 'BMI', 'pedigree', 'age','class']

In [4]:
pima = pd.read_csv(
    fname_pima,
    usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8),
    names=names,
    dtype={
        'pregnant': np.int64,
        'glucose': np.float64,
        'pressure': np.float64,
        'triceps': np.float64,
        'insulin': np.float64,
        'BMI': np.float64,
        'pedigree': np.float64,
        'age': np.int64,
        'class': np.int64,
    },
)

In [5]:
data_pima = pd.DataFrame(pima)

In [6]:
data_pima

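| | pregnant | glucose | pressure | triceps | insulin | BMI | pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72 | 35.0 | 0.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66 | 29.0 | 0.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64 | 0.0 | 0.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76 | 48.0 | 180.0 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122.0 | 70 | 27.0 | 0.0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121.0 | 72 | 23.0 | 112.0 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126.0 | 60 | 0.0 | 0.0 | 30.1 | 0.349 | 47 | 1 |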
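
Because the column minimum feeds directly into the scaling, it can help to check how placeholder values affect it. The sketch below uses only the `triceps` and `insulin` values of the five rows displayed above and assumes that a zero in these columns means "not measured"; that interpretation and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Values of the first five displayed rows, restricted to two columns.
sample = pd.DataFrame({
    "triceps": [35.0, 29.0, 0.0, 23.0, 35.0],
    "insulin": [0.0, 0.0, 0.0, 94.0, 168.0],
})

# Assumption: a zero in these columns stands for a missing measurement.
cleaned = sample.replace(0.0, np.nan)

missing_per_column = cleaned.isna().sum()
# min() skips NaN by default, so the minimum is now a real measurement.
min_with_zeros = sample.min()
min_without_zeros = cleaned.min()

print(missing_per_column.to_dict())
print(min_with_zeros.to_dict(), min_without_zeros.to_dict())
```

In this small sample the minimum moves from 0 to 23 for `triceps` and from 0 to 94 for `insulin`, which would change every scaled value in those columns.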


## Normalize dataset

Normalization rescales an input variable to the range between zero and one. This method therefore requires knowing the minimum and maximum values of each feature. With min and max denoting the minimum and maximum of a column, the normalized i-th value of that column attribute is:

$$ Normalized[i] = \frac{value[i] - min}{max - min} \, . $$
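
As a small worked example of this formula, the sketch below normalizes the five `age` values displayed earlier. The helper name `min_max_normalize` is hypothetical, and the minimum and maximum are taken from this five-value sample only, so the results differ from those obtained on the full dataset.

```python
import pandas as pd

def min_max_normalize(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1] using (value - min) / (max - min)."""
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

ages = pd.Series([50, 31, 32, 21, 33], name="age")  # illustrative sample
normalized = min_max_normalize(ages)
print(normalized.round(3).tolist())
```

The smallest value (21) maps to 0, the largest (50) maps to 1, and every other value falls in between in proportion to its distance from the minimum.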

In [7]:
pima_normalized = pd.DataFrame()

In [8]:
col_min = data_pima.min(axis=0)  # per-column minimum
col_max = data_pima.max(axis=0)  # per-column maximum
for name in names:
    # Min-max scaling: map each column to the range [0, 1]
    pima_normalized[name] = (data_pima[name] - col_min[name]) / (col_max[name] - col_min[name])

In [9]:
pima_normalized 

| | pregnant | glucose | pressure | triceps | insulin | BMI | pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352941 | 0.743719 | 0.590164 | 0.353535 | 0.000000 | 0.500745 | 0.234415 | 0.483333 | 1.0 |
| 1 | 0.058824 | 0.427136 | 0.540984 | 0.292929 | 0.000000 | 0.396423 | 0.116567 | 0.166667 | 0.0 |
| 2 | 0.470588 | 0.919598 | 0.524590 | 0.000000 | 0.000000 | 0.347243 | 0.253629 | 0.183333 | 1.0 |
| 3 | 0.058824 | 0.447236 | 0.540984 | 0.232323 | 0.111111 | 0.418778 | 0.038002 | 0.000000 | 0.0 |
| 4 | 0.000000 | 0.688442 | 0.327869 | 0.353535 | 0.198582 | 0.642325 | 0.943638 | 0.200000 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0.588235 | 0.507538 | 0.622951 | 0.484848 | 0.212766 | 0.490313 | 0.039710 | 0.700000 | 0.0 |
| 764 | 0.117647 | 0.613065 | 0.573770 | 0.272727 | 0.000000 | 0.548435 | 0.111870 | 0.100000 | 0.0 |
| 765 | 0.294118 | 0.608040 | 0.590164 | 0.232323 | 0.132388 | 0.390462 | 0.071307 | 0.150000 | 0.0 |
| 766 | 0.058824 | 0.633166 | 0.491803 | 0.000000 | 0.000000 | 0.448584 | 0.115713 | 0.433333 | 1.0 |
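
The same normalization can also be written as a single vectorized expression, because pandas aligns the per-column minimum and maximum with the DataFrame's columns. A minimal sketch follows; the helper name `normalize_frame`, the toy data, and the choice to map constant columns to 0 are illustrative assumptions.

```python
import pandas as pd

def normalize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale every column of df; constant columns map to 0."""
    col_min = df.min(axis=0)
    col_range = df.max(axis=0) - col_min
    # A constant column has range 0; dividing by 1 instead avoids 0/0.
    safe_range = col_range.replace(0, 1)
    return (df - col_min) / safe_range

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = normalize_frame(toy)
print(scaled)
```

Column `a` becomes 0, 0.5, 1, while the constant column `b` becomes all zeros rather than producing a division by zero.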


## Standardize dataset

Standardization is a rescaling technique that centers the distribution of the data on 0 and scales its standard deviation to 1. It works best when the data are approximately normally distributed (also called Gaussian). This method requires estimating the mean and standard deviation of each attribute before scaling. The mean $\mu_i$ of the i-th column is calculated as:

$$ \mu_i = \frac{1}{n_i} \sum_{j=0}^{n_i -1} value_j $$
where the index $i$ refers to the $i$-th column attribute, $j$ refers to the $j$-th value of that column, and $n_i$ is the number of values in the column. The standard deviation $\sigma_i$ (the sample standard deviation, which is what `DataFrame.std` computes by default) is defined as follows:

$$ \sigma_i = \sqrt{\frac{\sum_{j=0}^{n_i -1} (value_j - \mu_i)^2}{n_i -1}}.$$

Finally, we standardize the dataset by applying the relation:

$$ s_j = \frac{value_j - \mu_i}{\sigma_i}.$$
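
The denominator of the standard deviation matters in practice: pandas' `std` divides by $n - 1$ by default (`ddof=1`), whereas NumPy's `np.std` divides by $n$ by default (`ddof=0`). The sketch below illustrates the difference on a small made-up series and then applies the standardization formula; the data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

mu = values.mean()                            # sum of the values divided by n
sigma_sample = values.std()                   # pandas default ddof=1: divides by n - 1
sigma_population = np.std(values.to_numpy())  # NumPy default ddof=0: divides by n

# Standardize with the sample standard deviation, as pandas does by default.
standardized = (values - mu) / sigma_sample
print(mu, sigma_sample, sigma_population)
print(standardized.mean(), standardized.std())
```

For this series the mean is 5 and the population standard deviation is 2, while the sample standard deviation is slightly larger. After standardization the mean is 0 and the sample standard deviation is 1, up to floating-point error.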




In [10]:
pima_standardized = pd.DataFrame()

In [11]:
data_pima.std(axis=0)

pregnant      3.369578
glucose      31.972618
pressure     19.355807
triceps      15.952218
insulin     115.244002
BMI           7.884160
pedigree      0.331329
age          11.760232
class         0.476951
dtype: float64

In [12]:
data_pima.mean(axis=0)

pregnant      3.845052
glucose     120.894531
pressure     69.105469
triceps      20.536458
insulin      79.799479
BMI          31.992578
pedigree      0.471876
age          33.240885
class         0.348958
dtype: float64

In [13]:
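col_mean = data_pima.mean(axis=0)  # per-column mean
col_std = data_pima.std(axis=0)    # per-column sample standard deviation (ddof=1)
for name in names:
    # Z-score: subtract the column mean and divide by its standard deviation
    pima_standardized[name] = (data_pima[name] - col_mean[name]) / col_std[name]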

In [14]:
pima_standardized

| | pregnant | glucose | pressure | triceps | insulin | BMI | pedigree | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.639530 | 0.847771 | 0.149543 | 0.906679 | -0.692439 | 0.203880 | 0.468187 | 1.425067 | 1.365006 |
| 1 | -0.844335 | -1.122665 | -0.160441 | 0.530556 | -0.692439 | -0.683976 | -0.364823 | -0.190548 | -0.731643 |
| 2 | 1.233077 | 1.942458 | -0.263769 | -1.287373 | -0.692439 | -1.102537 | 0.604004 | -0.105515 | 1.365006 |
| 3 | -0.844335 | -0.997558 | -0.160441 | 0.154433 | 0.123221 | -0.493721 | -0.920163 | -1.040871 | -0.731643 |
| 4 | -1.141108 | 0.503727 | -1.503707 | 0.906679 | 0.765337 | 1.408828 | 5.481337 | -0.020483 | 1.365006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 1.826623 | -0.622237 | 0.356200 | 1.721613 | 0.869464 | 0.115094 | -0.908090 | 2.530487 | -0.731643 |
| 764 | -0.547562 | 0.034575 | 0.046215 | 0.405181 | -0.692439 | 0.609757 | -0.398023 | -0.530677 | -0.731643 |
| 765 | 0.342757 | 0.003299 | 0.149543 | 0.154433 | 0.279412 | -0.734711 | -0.684747 | -0.275580 | -0.731643 |
| 766 | -0.844335 | 0.159683 | -0.470426 | -1.287373 | -0.692439 | -0.240048 | -0.370859 | 1.169970 | 1.365006 |
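
As a quick sanity check, a single entry of this table can be reproduced by hand from the mean and standard deviation printed earlier. The sketch below does so for the `glucose` value of row 0, using only numbers copied from the outputs in this notebook; the variable names are illustrative.

```python
# Values copied from the outputs shown in this notebook.
glucose_row0 = 148.0       # glucose of row 0 in the original data
glucose_mean = 120.894531  # from data_pima.mean(axis=0)
glucose_std = 31.972618    # from data_pima.std(axis=0)

# Standardization: subtract the mean, divide by the standard deviation.
z = (glucose_row0 - glucose_mean) / glucose_std
matches_table = abs(z - 0.847771) < 1e-5
print(round(z, 4), matches_table)
```

The result agrees with the tabulated value 0.847771 to within the rounding of the printed mean and standard deviation.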
