### Pre-process data

Expose the structure of the problem through the following techniques:

- Rescale
- Standardize
- Normalize
- Binarize

Different algorithms make different assumptions about the data and may therefore require different transforms, and some transforms will not benefit certain algorithms. A good approach is to create multiple views of the data using different transforms, then run a handful of algorithms on each view to identify the transforms that best expose the problem structure and discard the rest.
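As a rough sketch of this spot-checking idea, the snippet below scores one algorithm on several views of the same data. The synthetic dataset, the choice of k-Nearest Neighbors, and the names `X_demo`, `y_demo`, and `views` are assumptions made for illustration; they are not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic stand-in data (an assumption; not the dataset used in these notes).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
# Stretch one column so the attributes end up on very different scales.
X_demo[:, 0] *= 1000.0

# Each "view" is the same data passed through a different transform.
views = {
    "raw": None,
    "rescaled": MinMaxScaler(),
    "standardized": StandardScaler(),
    "normalized": Normalizer(),
}

scores = {}
for view_name, transform in views.items():
    model = KNeighborsClassifier()
    # Wrapping the transform in a pipeline refits it inside each CV fold.
    estimator = model if transform is None else make_pipeline(transform, model)
    scores[view_name] = cross_val_score(estimator, X_demo, y_demo, cv=5).mean()

for view_name, score in scores.items():
    print(f"{view_name}: {score:.3f}")
```

Views that score clearly worse than the raw data are candidates to discard; which view wins depends on the data and the algorithm.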

##### Steps

1. Load the dataset from a URL
2. Split the dataset into *input* and *output* variables for ML
3. Apply a preprocessing transform to the *input*
4. Summarize the data to show the change
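The four steps can be sketched end to end as follows. To keep the example self-contained, a tiny in-memory `DataFrame` stands in for the CSV loaded from a URL; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 1. Load: a small hypothetical frame replaces read_csv(url, names=...).
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "label": [0, 1, 0, 1],
})
values = frame.values

# 2. Split into input (all columns but the last) and output (last column).
inputs = values[:, :-1]
outputs = values[:, -1]

# 3. Apply a preprocessing transform to the input only.
transformed = MinMaxScaler().fit_transform(inputs)

# 4. Summarize the data to show the change.
np.set_printoptions(precision=3)
print(inputs[:3])
print(transformed[:3])
```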

##### Scikit-learn

The `sklearn.preprocessing` module provides two standard idioms for transforming data, applicable to the training data you have now and to data that arrives later:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

Transforms:

- Fit and Multiple Transform [preferred]

  1. `fit()`: computes the parameters of the transform once, on your training data.
  2. `transform()`: applies the fitted transform. Use it on the training data to prepare for modelling, then again on the test and validation datasets, and again on any new data.

- Combined Fit-and-Transform

  `fit_transform()`: use for one-off tasks, such as plotting or summarizing data.
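A minimal sketch of the two idioms, using `StandardScaler` on made-up arrays (`train` and `new_rows` are hypothetical names and values, not data from these notes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
new_rows = np.array([[4.0, 400.0]])  # hypothetical unseen data

# Fit once: learns the per-column mean and standard deviation from train only.
scaler = StandardScaler().fit(train)

# Transform many times, always with the same learned parameters.
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new_rows)

# Combined form: equivalent to fit(train) followed by transform(train).
train_scaled_once = StandardScaler().fit_transform(train)

print(scaler.mean_)
print(new_scaled)
```

Reusing the fitted `scaler` on `new_rows` means new data is scaled with the statistics learned from the training data, which is why the first idiom is preferred for modelling.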


##### 1. Rescale Data

Use when the data is comprised of attributes with varying scales and the algorithm benefits from all attributes being on the same scale. This is often called normalization: each attribute is rescaled into the range 0 to 1.

Useful in:
- Optimization algorithms at the core of ML, such as gradient descent
- Algorithms that weight inputs, such as regression and neural networks
- Algorithms that use distance measures, such as k-Nearest Neighbors

Using scikit-learn's `MinMaxScaler` class:

In [1]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
# separate IN and OUT components
X = array[:, 0:8]
Y = array[:, 8]


In [8]:
#...
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(X[0:5, :])
print(rescaledX[0:5, :])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
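To make the rescaling concrete, the sketch below recomputes it by hand as `(x - min) / (max - min)` per column and compares the result with `MinMaxScaler`. The small `sample` array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two attributes on different scales.
sample = np.array([[1.0, 10.0], [2.0, 30.0], [5.0, 20.0]])

# Column-wise min-max rescaling into the range 0 to 1.
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual = (sample - col_min) / (col_max - col_min)

# The same operation through scikit-learn.
library = MinMaxScaler(feature_range=(0, 1)).fit_transform(sample)

print(manual)
print(np.allclose(manual, library))
```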


##### 2. Standardize Data

Useful for attributes that follow a Gaussian distribution but have differing means and standard deviations. Standardization transforms them so that each attribute has a mean of 0 and a standard deviation of 1.

Suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, e.g.:

- linear regression
- logistic regression
- linear discriminant analysis


Using scikit-learn's `StandardScaler` class:


In [14]:
#...
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(X[0:5, :])
print(rescaledX[0:5, :])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
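As a hand-computed check, standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this with NumPy on a made-up `sample` array and compares it with `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two attributes with different means and spreads.
sample = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])

# Column-wise (x - mean) / std; NumPy's std defaults to ddof=0.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# The same operation through scikit-learn.
library = StandardScaler().fit_transform(sample)

print(manual)
print(manual.mean(axis=0), manual.std(axis=0))
```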


##### 3. Normalize Data

Normalizing rescales each observation (row) so that it has a length of 1 (a unit norm). Useful for sparse datasets (many zeros) with attributes of varying scales, and best for:

- Algorithms that weight input values, such as neural networks
- Algorithms that use distance measures, such as k-Nearest Neighbors

Using scikit-learn's `Normalizer` class:

In [3]:
# ...
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5, :])

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
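Unlike the two previous transforms, normalization works per row rather than per column. The sketch below divides each row of a made-up `sample` array by its Euclidean (L2) length and compares the result with `Normalizer`, whose default norm is `"l2"`.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one observation.
sample = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Row-wise Euclidean length, kept as a column so it broadcasts over each row.
row_norms = np.linalg.norm(sample, axis=1, keepdims=True)
manual = sample / row_norms

# The same operation through scikit-learn.
library = Normalizer(norm="l2").fit_transform(sample)

print(manual)
print(np.linalg.norm(manual, axis=1))
```

Rows `[3, 4]` and `[6, 8]` map to the same unit vector, which shows that normalization keeps the direction of an observation and drops its magnitude.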


##### 4. Binarize Data

Transform data using a binary threshold. Values above the threshold are marked as *1*; values equal to or below it are marked as *0*.

Useful when you have probabilities that you want to turn into crisp values, or for feature engineering, where a new binary attribute indicates something meaningful.

Using scikit-learn's `Binarizer` class:

In [5]:
#...
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5, :])

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
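For the probabilities use case, a sketch with a threshold of 0.5 on made-up values (`probs` is a hypothetical array). Note that a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities.
probs = np.array([[0.2, 0.5, 0.9], [0.51, 0.0, 0.49]])

# By hand: strictly greater than the threshold becomes 1, everything else 0.
manual = (probs > 0.5).astype(float)

# The same operation through scikit-learn.
library = Binarizer(threshold=0.5).fit_transform(probs)

print(library)
```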


> Source: Jason Brownlee