# Prepare your data for ML

It is often said (and read) that ML algorithms are black boxes: you send data in and get results out, sort of magically. Of course, **it is not magic**. But, most importantly for this part, **it is not even true that you just "send data in"**. Or, at least, you shouldn't!

One often neglected factor to remember is that many ML algorithms are not really "agnostic" about the data, as one might think. Quite the opposite: **many ML algorithms make assumptions about your data**. So it is often a very good idea to **prepare your data in such a way as to best expose the structure of the problem to the ML algorithms** that you intend to use.

We will look at how to prepare your data for ML in Python using ***scikit-learn***. The focus will be on:

1. Rescaling the data
2. Standardizing the data
3. Normalizing the data
4. Binarizing the data

NOTE: this is not carved in stone. Different approaches in applied ML may offer you a different, longer, or shorter list of things to do. The most important take-away message of this part is to be very careful about how you prepare your data _before_ trying out any algorithm and hoping to get something reasonable out of it.


## General need for data pre-processing

You (almost?) always need to pre-process your data: it is a general and required step. 

A difficulty is that **different algorithms make different assumptions about your data** and may require different transforms. Further, even when you follow all of the rules and prepare your data, algorithms can sometimes deliver better results without pre-processing. So the process needs care... and patience!

Tip: focus on **the structure of your problem**. You need to make it emerge from your data, and you need it NOW.

Generally, a good idea is to create many different views and transforms of your data, then exercise a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.
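The "many views, a handful of algorithms" idea can be sketched as a small loop. This is only an illustration under assumptions: the data is synthetic (generated with `make_classification`, not the dataset used in this part), and the choice of two classifiers and 5-fold cross-validation is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Synthetic stand-in data (an assumption, used only to keep the sketch runnable)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Each "view" is one transform of the same input data
views = {
    'rescaled': MinMaxScaler(),
    'standardized': StandardScaler(),
    'normalized': Normalizer(),
}
# A handful of algorithms to exercise on each view (arbitrary choice)
models = {
    'logreg': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        # The pipeline re-fits the transform inside each cross-validation fold
        pipeline = make_pipeline(transform, model)
        cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
        scores[(view_name, model_name)] = cv_scores.mean()

for key, value in sorted(scores.items()):
    print(key, round(value, 3))
```

Comparing the mean scores across rows gives a first hint of which transform helps which algorithm; the numbers on synthetic data say nothing about your own problem.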

The 4 **pre-ML data pre-processing recipes** described below all follow the same steps:

* (Load the dataset from a URL)
* Split the dataset into the input and output variables for ML
* Apply a pre-processing transform to the input variables
* Summarize the data to show the change

## How to do this? Scikit-learn

The ***scikit-learn*** library provides two standard idioms for transforming data. Each is useful in different circumstances. The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future. The two different pre-processing methods are:

* Fit and Multiple Transform
* Combined Fit-And-Transform

The first method, *Fit and Multiple Transform*, is the preferred approach.

In the former, you call the `fit()` function to prepare the parameters of the transform once on your data. Then later you can use the `transform()` function on the same data to prepare it for modeling and again on the test or validation dataset or new data that you may see in the future. 

The latter, the *Combined Fit-And-Transform*, is a convenience that you can use for one-off tasks. This might be useful if you are interested in plotting or summarizing the transformed data. You can review the [preprocessing API in scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), where all the calls are listed and explained in detail.
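To make the difference between the two idioms concrete, here is a minimal sketch. It assumes a small random array and a train/test split made with `train_test_split`; neither is part of the examples in this part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption): 100 rows, 3 features
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 100, size=(100, 3))
X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=0)

# Fit and Multiple Transform: learn the per-column min and max once...
scaler = MinMaxScaler().fit(X_train)
# ...then reuse the same parameters on every dataset
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Combined Fit-And-Transform: same result on the data it was fitted on
one_off = MinMaxScaler().fit_transform(X_train)
print(np.allclose(one_off, X_train_scaled))
```

Note that `X_test_scaled` is computed with the minimum and maximum of the training data, so its values may fall slightly outside the 0 to 1 range.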

If you do not have scikit-learn installed:

    pip install -U scikit-learn

or (on anaconda):

    conda install scikit-learn

## 1. Rescale data

In a nutshell: **try to get the same scale for all attributes**.

When your data is made up of attributes with varying scales, many ML algorithms can benefit from rescaling the attributes so that they *all* have the same scale.

Attributes are often rescaled into the range between 0 and 1. 

Where/when is it most useful?

* It is useful for the optimization algorithms used at the core of many ML algorithms, like gradient descent (GD).
* It is also useful for algorithms that weight inputs, like regression and neural networks, and for algorithms that use distance measures, like k-Nearest Neighbors.

You can rescale your data with scikit-learn using the `MinMaxScaler` class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

In [1]:
from pandas import read_csv
from numpy import set_printoptions

In [2]:
from sklearn.preprocessing import MinMaxScaler

In [3]:
# data import
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
array

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

The next cell is run only once, but its output is used in all the examples below.

In [4]:
# separate array into input and output components
X = array[:,0:8]   # features: all rows, all columns except the last one
Y = array[:,8]     # labels: all rows, last column only

In [5]:
X

array([[   6.   ,  148.   ,   72.   , ...,   33.6  ,    0.627,   50.   ],
       [   1.   ,   85.   ,   66.   , ...,   26.6  ,    0.351,   31.   ],
       [   8.   ,  183.   ,   64.   , ...,   23.3  ,    0.672,   32.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,   26.2  ,    0.245,   30.   ],
       [   1.   ,  126.   ,   60.   , ...,   30.1  ,    0.349,   47.   ],
       [   1.   ,   93.   ,   70.   , ...,   30.4  ,    0.315,   23.   ]])

In [6]:
Y

array([ 1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,
        1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,
        1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0

In [12]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])   # first few rows, you see 8 feature columns rescaled

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]


In [13]:
set_printoptions(precision=3)
print(X[0:5,:])   #compare with X before..

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]


After rescaling: all of the values are in the range between 0 and 1.
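What the scaler does to each column can be reproduced by hand with the usual min-max formula, `(x - min) / (max - min)`. The sketch below uses a tiny made-up array (an assumption, chosen so the result is easy to check mentally) and compares the manual result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up array: two features on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [5.0, 1000.0]])

# Manual min-max rescaling, column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

sklearn_result = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(manual)                                # rows: [0, 0], [0.25, 0.25], [1, 1]
print(np.allclose(manual, sklearn_result))   # True
```

Both columns end up with identical values after rescaling, even though the raw numbers differed by a factor of 200.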

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Change the feature range of the scaling, e.g. to (0, 10): the effect is immediately visible.
</div>

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
Can you change this from a combined fit_transform() to a fit() first and a transform() later? Put your solution in the cell below.
</div>

In [15]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])   # first few rows, you see 8 feature columns rescaled

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]


## <font color='red'>Done.</font> Let's continue.

## 2. Standardize data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a **standard Gaussian distribution with a mean of 0 and a standard deviation of 1**. 

Where/when is it most useful?

* **It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis**.

You can standardize data using scikit-learn with the StandardScaler, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [16]:
from sklearn.preprocessing import StandardScaler

In [17]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


The values for each attribute now have a mean value of 0 and a standard deviation of 1.
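This claim is easy to check numerically. The following sketch assumes synthetic Gaussian data with made-up means and standard deviations, and also reproduces the transform by hand as `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic Gaussian columns with different means and spreads (an assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=[10.0, -3.0, 50.0], scale=[2.0, 0.5, 20.0], size=(500, 3))

standardized = StandardScaler().fit_transform(X_demo)

# Manual version: subtract the column mean, divide by the column standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(standardized.mean(axis=0))   # close to 0 for every column
print(standardized.std(axis=0))    # close to 1 for every column
print(np.allclose(standardized, manual))
```

The means come out as tiny floating-point numbers rather than exact zeros, which is expected.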

## 3. Normalize data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra).

Where/when is it most useful?
* This pre-processing method can be useful **for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors**. 

You can normalize data in Python with scikit-learn using the Normalizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html). 

In [18]:
from sklearn.preprocessing import Normalizer

In [19]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]


An alternative way, still in scikit-learn:

In [24]:
from sklearn.preprocessing import normalize
alternative_normalizedX = normalize(X)
alternative_normalizedX

array([[ 0.034,  0.828,  0.403, ...,  0.188,  0.004,  0.28 ],
       [ 0.008,  0.716,  0.556, ...,  0.224,  0.003,  0.261],
       [ 0.04 ,  0.924,  0.323, ...,  0.118,  0.003,  0.162],
       ..., 
       [ 0.027,  0.651,  0.388, ...,  0.141,  0.001,  0.161],
       [ 0.007,  0.838,  0.399, ...,  0.2  ,  0.002,  0.313],
       [ 0.008,  0.736,  0.554, ...,  0.241,  0.002,  0.182]])

## <font color='red'>Exercise</font>

<div class="alert alert-block alert-info">
The rows should now be normalized to length 1. Check that this is true for both methods above.
</div>

## <font color='red'>Solution</font>

In [29]:
# manually compute the squared l2-norm of one row (if it equals 1, so does the norm)
sum2=0.
for element in normalizedX[1,:]: # try this for various rows..
    sum2 += element**2
sum2

0.99999999999999989

In [30]:
sum2=0.
for element in alternative_normalizedX[1,:]: # try this for various rows..
    sum2 += element**2
sum2

0.99999999999999989

## <font color='red'>Done.</font> Let's continue.

Note that you can compare both l2-norm (default) and l1-norm:

In [31]:
alternative_l2norm_normalizedX = normalize(X, norm="l2")
alternative_l2norm_normalizedX

array([[ 0.034,  0.828,  0.403, ...,  0.188,  0.004,  0.28 ],
       [ 0.008,  0.716,  0.556, ...,  0.224,  0.003,  0.261],
       [ 0.04 ,  0.924,  0.323, ...,  0.118,  0.003,  0.162],
       ..., 
       [ 0.027,  0.651,  0.388, ...,  0.141,  0.001,  0.161],
       [ 0.007,  0.838,  0.399, ...,  0.2  ,  0.002,  0.313],
       [ 0.008,  0.736,  0.554, ...,  0.241,  0.002,  0.182]])

In [32]:
alternative_l1norm_normalizedX = normalize(X, norm="l1")
alternative_l1norm_normalizedX

array([[ 0.017,  0.429,  0.209, ...,  0.097,  0.002,  0.145],
       [ 0.004,  0.356,  0.276, ...,  0.111,  0.001,  0.13 ],
       [ 0.026,  0.588,  0.206, ...,  0.075,  0.002,  0.103],
       ..., 
       [ 0.013,  0.311,  0.185, ...,  0.067,  0.001,  0.077],
       [ 0.004,  0.476,  0.227, ...,  0.114,  0.001,  0.178],
       [ 0.004,  0.374,  0.281, ...,  0.122,  0.001,  0.092]])

In [33]:
# manually compute the l1-norm (sum of absolute values) of one row
sum1=0.
for element in alternative_l1norm_normalizedX[0,:]:
    sum1 += abs(element)
sum1

0.99999999999999989
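The manual loops above check one row at a time. As a hedged alternative, `numpy.linalg.norm` can check every row at once; the small array below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Small made-up array; each row is one observation
X_demo = np.array([[3.0, 4.0, 0.0],
                   [1.0, 1.0, 2.0],
                   [0.5, 0.0, 0.5]])

l2_rows = normalize(X_demo, norm='l2')
l1_rows = normalize(X_demo, norm='l1')

# axis=1 computes one norm per row
l2_lengths = np.linalg.norm(l2_rows, ord=2, axis=1)
l1_lengths = np.linalg.norm(l1_rows, ord=1, axis=1)

print(l2_rows[0])    # the row (3, 4, 0) becomes (0.6, 0.8, 0)
print(l2_lengths)    # all close to 1
print(l1_lengths)    # all close to 1
```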

## 4. Binarize data

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0. This is called binarizing your data or thresholding your data. 

Where/when is it most useful?
* **When you have probabilities that you want to make crisp values.**
* **It is also useful in feature engineering, when you want to add new features that indicate something meaningful.**

You can binarize data in Python with scikit-learn using the `Binarizer` class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html).

In [34]:
from sklearn.preprocessing import Binarizer

In [35]:
# binarization
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]


You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.
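The "probabilities to crisp values" use case mentioned earlier can be sketched the same way. The probabilities below are made up, and the threshold of 0.5 is an assumption; it is just a common choice.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up probabilities (an assumption, not the output of any model in this part)
probabilities = np.array([[0.10, 0.50, 0.51],
                          [0.95, 0.49, 0.00]])

# Values strictly above the threshold become 1, everything else becomes 0
crisp = Binarizer(threshold=0.5).fit_transform(probabilities)
print(crisp)
# [[0. 0. 1.]
#  [1. 0. 0.]]
```

Note that a value exactly equal to the threshold (0.50 here) is marked 0, consistent with the rule stated above.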

## Summary

What we did:

* we discovered how you can prepare your data for ML in Python using scikit-learn, with 4 recipes.

## What's next 

Now that we know how to transform the data to best expose the structure of our problem to the modeling algorithms, we need to discover how to select the features of our data that are most relevant to making predictions.