#### Prepare Data

In [1]:
# Prepare Data
# 1) Rescale data     - rescale each attribute into a chosen range [min, max]
# 2) Standardize data - rescale each attribute to have a mean of 0 and a standard deviation of 1 (unit variance)
# 3) Normalize data   - rescale each row (observation) to have a length of 1 (unit norm)
# 4) Binarize data    - set values at or below a threshold to 0 and values above it to 1

The scikit-learn library provides two idioms for transforming data:
    - Fit and multiple transform
    - Combined fit and transform

The fit and multiple transform method is the preferred approach. You call fit() once to learn the transformation parameters from the training data, then call transform() to apply the same transformation to the training data, the test data, the validation data, or new data.
The combined method, fit_transform(), is a convenient shortcut and is useful for one-off tasks such as plotting or summarizing the transformed data.
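
The two idioms can be compared on a small example. This is a minimal sketch that assumes a hypothetical one-feature data set already split into `X_train` and `X_test`; those names and values are illustrative and are not part of the diabetes data used below.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, split into training and test rows
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [6.0]])

# Idiom 1: fit once on the training data, then transform each data set
scaler = MinMaxScaler()
scaler.fit(X_train)                       # learns min and max from the training rows only
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)    # reuses the training min and max

# Idiom 2: fit and transform in a single call
combined = MinMaxScaler().fit_transform(X_train)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test rows are scaled with the parameters learned from the training rows, a test value larger than the training maximum ends up above 1. That is expected, and it is the reason the fitted scaler is reused rather than refitted on each data set.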

<b>1) Rescale Data</b><br/>
When your data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling 
the attributes so that they all share the same scale. This is often referred to as <b>normalization</b> (not to be confused with the row-wise normalization in section 3), and attributes are commonly rescaled into the range between <b>0</b> and <b>1</b>. Rescaling helps the optimization algorithms at the core of many machine learning methods, such as <b>gradient descent</b>. It also helps algorithms that <u>weight inputs</u>, such as <b>regression</b> and <b>neural networks</b>, and algorithms that use <u>distance measures</u>, such as <b>k-Nearest Neighbors</b>. You can rescale your data in scikit-learn with the <b>MinMaxScaler</b> 
class.<br/>
__When to use?__<br/>
 - _For algorithms that weight inputs like regression, neural networks_
 - _For algorithms that measure distances like k-nearest neighbors_

__Which library to use?__ <br/>
 - _Scikit's MinMaxScaler_

__What does it do?__ <br/>
 - _Scales each attribute so that its values lie between the __min__ and __max__ values that we specify_

In [2]:
from pandas import read_csv

filename = 'diabetes.csv'
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 
         'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = read_csv(filename, header = 0, names = names)
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [3]:
# Split the data into the 8 input attributes (X) and the Outcome column (Y)
array = df.values
X = array[:, 0:8]
Y = array[:, 8]

In [4]:
from numpy import set_printoptions
set_printoptions(precision=2)

In [11]:
# Transforms features by scaling each feature to a given range (min and max)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
rescaledX[0:5, :]

array([[0.35, 0.74, 0.59, 0.35, 0.  , 0.5 , 0.23, 0.48],
       [0.06, 0.43, 0.54, 0.29, 0.  , 0.4 , 0.12, 0.17],
       [0.47, 0.92, 0.52, 0.  , 0.  , 0.35, 0.25, 0.18],
       [0.06, 0.45, 0.54, 0.23, 0.11, 0.42, 0.04, 0.  ],
       [0.  , 0.69, 0.33, 0.35, 0.2 , 0.64, 0.94, 0.2 ]])
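
To make the rescaling concrete, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to each column and compares the result with `MinMaxScaler`. The `sample` array is an illustrative subset (three rows, three attributes) taken from the table above, so its scaled values differ from the full-data output shown in the previous cell.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative sample: three rows of (Pregnancies, Glucose, BloodPressure)
sample = np.array([[6.0, 148.0, 72.0],
                   [1.0, 85.0, 66.0],
                   [8.0, 183.0, 64.0]])

# Min-max formula applied column by column: (x - min) / (max - min)
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual = (sample - col_min) / (col_max - col_min)

# The same rescaling with scikit-learn's default feature_range of (0, 1)
with_sklearn = MinMaxScaler().fit_transform(sample)

print(np.allclose(manual, with_sklearn))
```

After the transform, the smallest value in each column maps to 0 and the largest to 1.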

***

<b>2) Standardize Data</b><br/>
Standardization is a useful technique to <u>transform attributes with a Gaussian distribution and differing means and 
standard deviations</u> to a standard Gaussian distribution with a <b>mean</b> of <b>0</b> and a <b>standard deviation</b> of 
<b>1</b>. It is <u>most suitable for techniques that assume a Gaussian distribution in the input variables</u> and 
work better with rescaled data, such as <b>linear regression</b>, <b>logistic regression</b> and <b>linear discriminant 
analysis</b>. You can standardize data in scikit-learn with the <b>StandardScaler</b> class.

In [37]:
# Standardize data (0 mean, 1 stdev)
# After transformation, each attribute will have a mean value of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX[0:5, :]

array([[ 0.64,  0.85,  0.15,  0.91, -0.69,  0.2 ,  0.47,  1.43],
       [-0.84, -1.12, -0.16,  0.53, -0.69, -0.68, -0.37, -0.19],
       [ 1.23,  1.94, -0.26, -1.29, -0.69, -1.1 ,  0.6 , -0.11],
       [-0.84, -1.  , -0.16,  0.15,  0.12, -0.49, -0.92, -1.04],
       [-1.14,  0.5 , -1.5 ,  0.91,  0.77,  1.41,  5.48, -0.02]])
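
As a quick check of the claim that each attribute ends up with a mean of 0 and a standard deviation of 1, the sketch below standardizes a small illustrative `sample` (four rows of Pregnancies and Glucose from the table above) and compares the result with the formula `(x - mean) / std` computed by hand. The variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative sample: four rows of (Pregnancies, Glucose)
sample = np.array([[6.0, 148.0],
                   [1.0, 85.0],
                   [8.0, 183.0],
                   [1.0, 89.0]])

scaled = StandardScaler().fit_transform(sample)

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (NumPy's default population form, ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

col_means = scaled.mean(axis=0)   # approximately 0 for every column
col_stds = scaled.std(axis=0)     # approximately 1 for every column

print(np.allclose(manual, scaled))
```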

***

<b>3) Normalize Data</b><br/>
Normalizing in scikit-learn refers to rescaling <u>each observation (row)</u> to have a length of 1 (called a <b>unit norm</b> or a vector with the length of 1 in linear algebra i.e. <b>unit vector</b>).<br/>
This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using
algorithms that weight input values such as <b>neural networks</b> and <u>algorithms that use distance measures</u> such as 
<b>k-Nearest Neighbors</b>. You can normalize data in Python with scikit-learn using the <b>Normalizer</b> class. 

In [7]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
normalizedX[0:5, :]

array([[0.03, 0.83, 0.4 , 0.2 , 0.  , 0.19, 0.  , 0.28],
       [0.01, 0.72, 0.56, 0.24, 0.  , 0.22, 0.  , 0.26],
       [0.04, 0.92, 0.32, 0.  , 0.  , 0.12, 0.  , 0.16],
       [0.01, 0.59, 0.44, 0.15, 0.62, 0.19, 0.  , 0.14],
       [0.  , 0.6 , 0.17, 0.15, 0.73, 0.19, 0.01, 0.14]])
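
Unlike the two scalers above, `Normalizer` works row by row. The sketch below, which uses an illustrative two-row `sample` taken from the table above, checks that every transformed row has a Euclidean length of 1 and that the result matches dividing each row by its own length. It relies on the default `norm='l2'` setting.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative sample: two rows of (Pregnancies, Glucose, BloodPressure, SkinThickness)
sample = np.array([[6.0, 148.0, 72.0, 35.0],
                   [1.0, 85.0, 66.0, 29.0]])

# Default norm is 'l2': each row is divided by its Euclidean length
normalized = Normalizer().fit_transform(sample)

# Length of each transformed row; every entry should be 1
row_lengths = np.linalg.norm(normalized, axis=1)

# The same result computed by hand
manual = sample / np.linalg.norm(sample, axis=1, keepdims=True)

print(row_lengths)
```

Because each row is scaled independently, `Normalizer` learns nothing in `fit()`. The call is kept only so that the class follows the same fit and transform interface as the other transformers.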

***

__4) Binarize Data__ <br/>
You can transform your data using a binary threshold. All values above the threshold are
marked 1 and all values equal to or below it are marked 0. This is called binarizing your data or
thresholding your data. It can be useful when you have probabilities that you want to convert into crisp
values. It is also useful in feature engineering, when you want to add new features that indicate
something meaningful. You can create new binary attributes in Python using scikit-learn with
the Binarizer class.

In [8]:
# Binarize data: values greater than the threshold (0.0) become 1, all others become 0
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
binaryX[0:5, :]

array([[1., 1., 1., 1., 0., 1., 1., 1.],
       [1., 1., 1., 1., 0., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1., 1.]])
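
The probability use case mentioned above can be sketched as follows. The `probabilities` array and the 0.5 threshold are hypothetical values chosen for illustration. Note that a value exactly equal to the threshold is mapped to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities for four observations
probabilities = np.array([[0.20, 0.50, 0.51, 0.93]])

# Values strictly greater than the threshold become 1, the rest become 0
crisp = Binarizer(threshold=0.5).fit_transform(probabilities)

print(crisp)
```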