# Prepare your data for ML

We will prepare our data for ML in Python using ***scikit-learn***. Focus will be on:

1. Rescaling the data
2. Standardizing the data
3. Normalizing the data
4. Binarizing the data

NOTE: none of this is carved in stone; different approaches may suit different applied-ML problems. The most important take-away of this part is to be very careful about how you prepare your data _before_ trying out any algorithm and hoping to get something reasonable out of it.

## General need for data pre-processing

Discussed in the hands-on session.

Focus on **the structure of your problem**. 

Each of the pre-processing recipes described below follows the same steps:

* Load the dataset
* Split the dataset into the input and output variables for ML
* Apply a pre-processing transform to the input variables
* Summarize the data to show the change

## How to do this? Scikit-learn

The ***scikit-learn*** library provides two standard idioms for transforming data. Each is useful in different circumstances. 

* Fit and Multiple Transform
* Combined Fit-And-Transform

You can review the [preprocess API in scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), where all the calls are listed and explained in detail.
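As a minimal sketch of the two idioms, assuming a small made-up array (the names `demo_train`, `demo_new` and `demo_scaler` are hypothetical): fitting once and transforming several times is useful when the parameters learned on one array must be reused on another, while the combined call is a shortcut when a single array is both fitted and transformed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
demo_new = np.array([[4.0, 400.0]])

# Idiom 1: fit once, then transform as many arrays as needed
demo_scaler = StandardScaler().fit(demo_train)
demo_train_scaled = demo_scaler.transform(demo_train)
demo_new_scaled = demo_scaler.transform(demo_new)   # reuses mean/std learned on demo_train

# Idiom 2: fit and transform the same array in a single call
demo_train_scaled_again = StandardScaler().fit_transform(demo_train)

# Both idioms give the same result on the array used for fitting
print(np.allclose(demo_train_scaled, demo_train_scaled_again))
```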

If you do not have it:

    pip install -U scikit-learn

or (on anaconda):

    conda install scikit-learn

## 1. Rescale data

Discussed in the hands-on session.

You can rescale your data with scikit-learn using the MinMaxScaler class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

In [1]:
from pandas import read_csv
from numpy import set_printoptions

In [2]:
from sklearn.preprocessing import MinMaxScaler

In [3]:
# data import
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
array

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

In [4]:
# separate array into input and output components
X = array[:,0:8]   # features: build an array with each element being a full row with all columns but the last (values) one 
Y = array[:,8]     # labels: build an array with only last column, i.e. labels only

In [5]:
X

array([[   6.   ,  148.   ,   72.   , ...,   33.6  ,    0.627,   50.   ],
       [   1.   ,   85.   ,   66.   , ...,   26.6  ,    0.351,   31.   ],
       [   8.   ,  183.   ,   64.   , ...,   23.3  ,    0.672,   32.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,   26.2  ,    0.245,   30.   ],
       [   1.   ,  126.   ,   60.   , ...,   30.1  ,    0.349,   47.   ],
       [   1.   ,   93.   ,   70.   , ...,   30.4  ,    0.315,   23.   ]])

In [6]:
Y

array([ 1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,
        1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,
        1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0

In [8]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize original data...
set_printoptions(precision=3)
print(X[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]


In [11]:
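# .. and rescaled data
# NOTE: the output shown below contains negative values and is identical to the
# StandardScaler output in section 2: this cell was run (In [11]) after rescaledX
# had been overwritten in In [10]. Re-run the MinMaxScaler cell first to see values in [0, 1].
print(rescaledX[0:5,:])   # first five rows, with the 8 feature columns rescaled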

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
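To make explicit what the rescaling does, here is a small sketch that applies the column-wise min-max formula, (x - min) / (max - min), by hand and compares it with `MinMaxScaler`. The array `toy` and the other variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0, 50.0], [3.0, 60.0], [5.0, 100.0]])

# Column-wise min-max formula: (x - min) / (max - min)
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
by_hand = (toy - col_min) / (col_max - col_min)

# Same operation through scikit-learn
by_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

print(by_hand)
print(np.allclose(by_hand, by_sklearn))
```

After a min-max rescaling to (0, 1), every column has minimum 0 and maximum 1, so negative values cannot appear.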


## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Change the feature range in the scaling, e.g. to (0,10): the effect is immediately visible in the output.
</div>

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Can you change this from a combined <code>fit_transform</code> call to a separate fit first and transform later? Put your solution in the box below.
</div>

## <font color='red'>Solution</font>

In [None]:
### put your code here

## <font color='red'>Done.</font> Let's continue.

## 2. Standardize data

Discussed in the hands-on session.

You can standardize data using scikit-learn with the StandardScaler, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


The values for each attribute now have a mean value of 0 and a standard deviation of 1.
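This claim can be checked directly. The sketch below uses a made-up array (`toy` and `toy_std` are hypothetical names) and relies on the fact that `StandardScaler` uses the population standard deviation, which is also the default of `numpy.std` (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 80.0]])
toy_std = StandardScaler().fit_transform(toy)

# Per-column mean and standard deviation after standardization
col_means = toy_std.mean(axis=0)
col_stds = toy_std.std(axis=0)   # population std (ddof=0)

print(col_means)   # approximately 0 for each column
print(col_stds)    # approximately 1 for each column
```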

## 3. Normalize data

Discussed in the hands-on session.

You can normalize data in Python with scikit-learn using the Normalizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html). 

In [14]:
from sklearn.preprocessing import Normalizer

In [15]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]


An alternative way, still in scikit-learn:

In [16]:
from sklearn.preprocessing import normalize
alternative_normalizedX = normalize(X)
alternative_normalizedX

array([[ 0.034,  0.828,  0.403, ...,  0.188,  0.004,  0.28 ],
       [ 0.008,  0.716,  0.556, ...,  0.224,  0.003,  0.261],
       [ 0.04 ,  0.924,  0.323, ...,  0.118,  0.003,  0.162],
       ..., 
       [ 0.027,  0.651,  0.388, ...,  0.141,  0.001,  0.161],
       [ 0.007,  0.838,  0.399, ...,  0.2  ,  0.002,  0.313],
       [ 0.008,  0.736,  0.554, ...,  0.241,  0.002,  0.182]])

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
The rows should now be normalized to length 1. Check that this is true for both methods above.
</div>

## <font color='red'>Solution</font>

In [None]:
### put your code here

Note that you can compare both the l2-norm (default) and the l1-norm of each row:

**l1-norm**: the sum of the absolute values of the components of a row $x$. (Applied to the differences between target and estimated values, minimizing this same norm is known as least absolute deviations (LAD), or least absolute errors (LAE).)

   $$\|x\|_1 = \sum_{i=1}^n |x_i|$$

**l2-norm**: the square root of the sum of the squared components of a row $x$, i.e. its Euclidean length. (Applied to the differences between target and estimated values, minimizing its square is known as least squares.)

   $$\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$$
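For the row normalization done by `normalize`, the relevant quantity is the norm of each row vector: the sum of absolute values for `norm="l1"`, the Euclidean length for `norm="l2"`. A minimal sketch on a made-up array (all variable names are hypothetical) that checks both:

```python
import numpy as np
from sklearn.preprocessing import normalize

toy = np.array([[3.0, -4.0], [1.0, 1.0]])

l2_rows = normalize(toy, norm="l2")   # each row divided by its Euclidean length
l1_rows = normalize(toy, norm="l1")   # each row divided by its sum of absolute values

# Norms of the normalized rows: all should be 1
l2_norms = np.linalg.norm(l2_rows, ord=2, axis=1)
l1_norms = np.abs(l1_rows).sum(axis=1)

print(l2_rows)
print(l2_norms)
print(l1_norms)
```

The absolute value matters for the l1 case: with negative entries, a plain sum of the components would not give 1.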

In [17]:
from sklearn import preprocessing   # needed for preprocessing.normalize
alternative_l2norm_normalizedX = preprocessing.normalize(X, norm="l2")
alternative_l2norm_normalizedX

In [None]:
alternative_l1norm_normalizedX = preprocessing.normalize(X, norm="l1")
alternative_l1norm_normalizedX

In [None]:
# try to compute manually the l1-norm of the first row (sum of absolute values; it should be 1)
l1_sum = 0.
for element in alternative_l1norm_normalizedX[0,:]:
    l1_sum += abs(element)
l1_sum

## <font color='red'>Done.</font> Let's continue.

## 4. Binarize data

Discussed in the hands-on session.

You can binarize data in Python with scikit-learn using the Binarizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html). 

In [19]:
from sklearn.preprocessing import Binarizer

In [20]:
# binarization
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]


In [22]:
# .. compare with original data
print(X[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
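The comparison shows the rule at work: values strictly greater than the threshold become 1, all others (including values equal to the threshold) become 0. A small sketch on a made-up array, with a second, arbitrarily chosen threshold to show how the result changes (`toy`, `toy_binary_0` and `toy_binary_2` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

toy = np.array([[0.0, 2.5, -1.0], [4.0, 0.0, 0.1]])

# threshold=0.0: only strictly positive values map to 1
toy_binary_0 = Binarizer(threshold=0.0).fit_transform(toy)

# threshold=2.0: only values strictly greater than 2 map to 1
toy_binary_2 = Binarizer(threshold=2.0).fit_transform(toy)

print(toy_binary_0)
print(toy_binary_2)
```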


## Summary

What we did:

* we discovered how to prepare data for ML in Python using scikit-learn, with 4 recipes: rescaling, standardizing, normalizing and binarizing.

## What's next 

Now that we know how to transform the data to best expose the structure of our problem to the modeling algorithms, we next need to discover how to select the features of our data that are most relevant to making predictions.