# Prepare your data for ML

Key points for an introductory discussion on this topic:
* A common claim about ML systems is that "you just send data into a black box". This is not correct!
* Are ML systems agnostic about the data they receive?

Let's explore how to prepare your data in Python so that it best exposes the structure of the problem to the ML algorithms. We will do this using ***scikit-learn***. The focus will be on:

1. Rescaling the data
2. Standardizing the data
3. Normalizing the data
4. Binarizing the data



NOTE: none of this is carved in stone. Different approaches exist in applied ML. The most important take-away message from this part is: be very careful about how you prepare your data _before_ trying out any ML algorithm.




## How to do this? Scikit-learn

The ***scikit-learn*** library provides two standard idioms for transforming data. Each is useful in different circumstances. 

* Fit and Multiple Transform
* Combined Fit-And-Transform
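
The two idioms can be sketched as follows. This is a minimal illustration on made-up toy arrays (`X_train` and `X_new` are hypothetical names, not part of the dataset used below), and `StandardScaler` is chosen here only as an example transformer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 "training" rows and 2 "new" rows
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0], [5.0, 50.0]])

# Idiom 1: fit once, then transform as many times as needed
scaler = StandardScaler()
scaler.fit(X_train)                        # learns per-column mean and std from X_train only
train_scaled = scaler.transform(X_train)
new_scaled = scaler.transform(X_new)       # reuses the parameters learned above

# Idiom 2: combined fit-and-transform, a shortcut when the same data is used for both steps
train_scaled_again = StandardScaler().fit_transform(X_train)

print(np.allclose(train_scaled, train_scaled_again))
print(new_scaled)
```

The first idiom is typically the one to prefer when the same transformation has to be applied later to data that was not available at fit time (for example a test set), since the parameters are learned only once.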

You can review the [preprocessing API in scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), where all the calls are listed and explained in detail.

Everything is pre-installed in Colab, but if you are working locally and do not have it, just run:

    pip install -U scikit-learn

or (on anaconda):

    conda install scikit-learn

For now we are running on Colab, so there is nothing to install: we just need some imports.

## 0. Import the data

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AMLBas2122/main/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

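|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...   | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 0.171 | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.340 | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 0.245 | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 0.349 | 47  | 1     |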


## 1. Rescale data

You can rescale your data with scikit-learn using the `MinMaxScaler` class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).
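
To make concrete what min-max rescaling computes, here is a minimal sketch on a small made-up array (`X_toy` is a hypothetical name, not the dataset loaded above). For the range (0, 1), each column is transformed as `(x - min) / (max - min)`, with the minimum and maximum taken per column; the sketch compares this manual formula with the `MinMaxScaler` result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration: 3 rows, 2 features on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [5.0, 200.0]])

# Manual min-max rescaling to [0, 1], column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# Same operation through scikit-learn
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(manual)                    # each column now spans exactly 0 to 1
print(np.allclose(manual, sk))
```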

In [2]:
from numpy import set_printoptions

In [3]:
from sklearn.preprocessing import MinMaxScaler

In [4]:
array = data.values
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [5]:
type(array)

numpy.ndarray

In [6]:
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

In [7]:
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [8]:
Y

array([1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1.,
       1., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0.,
       0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
       0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 1., 0., 1., 1., 1.,
       1., 0., 0., 0., 0.

In [9]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [10]:
# summarize original data...
set_printoptions(precision=3)
print(X[0:5,:])

[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01]]


In [12]:
# ... and rescaled data
print(rescaledX[0:79,:])   # first 79 rows: each of the 8 feature columns is rescaled to [0, 1]

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]
 [0.294 0.583 0.607 0.    0.    0.382 0.053 0.15 ]
 [0.176 0.392 0.41  0.323 0.104 0.462 0.073 0.083]
 [0.588 0.578 0.    0.    0.    0.526 0.024 0.133]
 [0.118 0.99  0.574 0.455 0.642 0.455 0.034 0.533]
 [0.471 0.628 0.787 0.    0.    0.    0.066 0.55 ]
 [0.235 0.553 0.754 0.    0.    0.56  0.048 0.15 ]
 [0.588 0.844 0.607 0.    0.    0.566 0.196 0.217]
 [0.588 0.698 0.656 0.    0.    0.404 0.582 0.6  ]
 [0.059 0.95  0.492 0.232 1.    0.449 0.137 0.633]
 [0.294 0.834 0.59  0.192 0.207 0.385 0.217 0.5  ]
 [0.412 0.503 0.    0.    0.    0.447 0.173 0.183]
 [0.    0.593 0.689 0.475 0.272 0.683 0.202 0.167]
 [0.412 0.538 0.607 0.    0.    0.441 0.075 0.167]
 [0.059 0.518 0.246 0.384 0.098 0.645 0.045 0.2  ]
 [0.059 0.578 0.574 0.303 0.113

### <font color='red'>Exercise 1</font>

Try changing the feature range in the scaling, e.g. to (0, 10). Make sure you understand what it does. Try different ranges to get familiar with it.


In [None]:
### put your code here

### <font color='red'>Exercise 2</font>

Can you change this from a combined fit-and-transform to a separate fit followed by a (single) transform?

In [None]:
### put your code here

## 2. Standardize data

You can standardize data using scikit-learn with the StandardScaler, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [14]:
from sklearn.preprocessing import StandardScaler

In [16]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:10,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]
 [ 0.343 -0.153  0.253 -1.288 -0.693 -0.811 -0.818 -0.276]
 [-0.251 -1.342 -0.988  0.719  0.071 -0.126 -0.676 -0.616]
 [ 1.828 -0.184 -3.573 -1.288 -0.693  0.42  -1.02  -0.361]
 [-0.548  2.382  0.046  1.535  4.022 -0.189 -0.948  1.681]
 [ 1.234  0.128  1.39  -1.288 -0.693 -4.06  -0.724  1.766]]


The values for each attribute now have a mean value of 0 and a standard deviation of 1.
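
As a quick sanity check of this statement, here is a minimal sketch on a small made-up array (`X_toy` is a hypothetical name, not the dataset above). It assumes the usual definition `z = (x - mean) / std` applied per column, with the population standard deviation (NumPy's default, `ddof=0`), and compares it with the `StandardScaler` output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [5.0, 200.0], [4.0, 400.0]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

# Manual standardization: subtract the column mean, divide by the column std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaled.mean(axis=0))          # approximately 0 for every column
print(scaled.std(axis=0))           # approximately 1 for every column
print(np.allclose(scaled, manual))
```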

## 3. Normalize data

You can normalize data in Python with scikit-learn using the Normalizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html). 

In [17]:
from sklearn.preprocessing import Normalizer

In [19]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:10,:])

[[3.355e-02 8.276e-01 4.026e-01 1.957e-01 0.000e+00 1.879e-01 3.506e-03
  2.796e-01]
 [8.424e-03 7.160e-01 5.560e-01 2.443e-01 0.000e+00 2.241e-01 2.957e-03
  2.611e-01]
 [4.040e-02 9.241e-01 3.232e-01 0.000e+00 0.000e+00 1.177e-01 3.393e-03
  1.616e-01]
 [6.612e-03 5.885e-01 4.364e-01 1.521e-01 6.215e-01 1.858e-01 1.104e-03
  1.389e-01]
 [0.000e+00 5.964e-01 1.741e-01 1.524e-01 7.313e-01 1.876e-01 9.960e-03
  1.437e-01]
 [3.491e-02 8.099e-01 5.167e-01 0.000e+00 0.000e+00 1.787e-01 1.403e-03
  2.095e-01]
 [2.177e-02 5.659e-01 3.628e-01 2.322e-01 6.385e-01 2.249e-01 1.799e-03
  1.886e-01]
 [8.055e-02 9.263e-01 0.000e+00 0.000e+00 0.000e+00 2.843e-01 1.079e-03
  2.336e-01]
 [3.408e-03 3.357e-01 1.193e-01 7.669e-02 9.254e-01 5.198e-02 2.693e-04
  9.032e-02]
 [4.796e-02 7.494e-01 5.756e-01 0.000e+00 0.000e+00 0.000e+00 1.391e-03
  3.237e-01]]
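
Unlike the two scalers above, which work column by column, `Normalizer` works row by row. A minimal sketch on a made-up array (`X_toy` is a hypothetical name) may help to see this: with the default settings each row is divided by its own Euclidean length, so rows that point in the same direction end up identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data, made up for illustration: rows 0 and 2 point in the same direction
X_toy = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Manual version: divide each row by its own Euclidean length
row_lengths = np.sqrt((X_toy ** 2).sum(axis=1, keepdims=True))
manual = X_toy / row_lengths

# Same operation through scikit-learn (default settings)
sk = Normalizer().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sk))
```

Here `[3, 4]` and `[6, 8]` both become `[0.6, 0.8]`: the direction of each row is kept and its magnitude is discarded.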


An alternative way, still in scikit-learn:

In [20]:
#from sklearn import preprocessing
from sklearn.preprocessing import normalize

alternative_normalizedX = normalize(X)
alternative_normalizedX

array([[0.034, 0.828, 0.403, ..., 0.188, 0.004, 0.28 ],
       [0.008, 0.716, 0.556, ..., 0.224, 0.003, 0.261],
       [0.04 , 0.924, 0.323, ..., 0.118, 0.003, 0.162],
       ...,
       [0.027, 0.651, 0.388, ..., 0.141, 0.001, 0.161],
       [0.007, 0.838, 0.399, ..., 0.2  , 0.002, 0.313],
       [0.008, 0.736, 0.554, ..., 0.241, 0.002, 0.182]])

### <font color='red'>Exercise 3</font>

The rows should now be normalized to length 1. Check that this is true for both methods above.

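Note that you can compare the l2-norm (the default) and the l1-norm. For a row $x = (x_1, \dots, x_n)$:

**l1-norm**: the sum of the absolute values of the components. It is the same norm that is minimized in least absolute deviations (LAD), also called least absolute errors (LAE), where it is applied to the differences between target and estimated values. For normalization it is applied to the row itself:

   $$\lVert x \rVert_1 = \sum_{i=1}^n |x_i|$$

**l2-norm**: the Euclidean length, i.e. the square root of the sum of the squared components. It is the same norm that underlies least squares. For normalization it is applied to the row itself:

   $$\lVert x \rVert_2 = \sqrt{\sum_{i=1}^n x_i^2}$$

Normalizing a row means dividing each of its components by the chosen norm.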

_HINT: check the scikit-learn documentation for `preprocessing.normalize` to find out how to choose between these options; the rest is implementation in Python._


In [None]:
### put your code here

## 4. Binarize data

You can binarize data in Python with scikit-learn using the Binarizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html). 

In [22]:
from sklearn.preprocessing import Binarizer

In [23]:
# binarization: values above the threshold become 1, all other values become 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:20,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 0. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 0. 0. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 0. 0. 0. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]]


In [24]:
# .. compare with original data
print(X[0:5,:])

[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01]]
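
The comparison above suggests a simple rule: every value strictly greater than the threshold becomes 1, and every other value becomes 0. A minimal sketch on a made-up array (`X_toy` is a hypothetical name, with values only loosely inspired by the dataset) checks that a plain NumPy comparison gives the same result as `Binarizer`.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data, made up for illustration
X_toy = np.array([[0.0, 148.0, 0.627],
                  [1.0, 0.0, 2.288],
                  [8.0, 183.0, 0.0]])

threshold = 0.0

# Manual version: strictly greater than the threshold -> 1.0, otherwise -> 0.0
manual = (X_toy > threshold).astype(float)

# Same operation through scikit-learn
sk = Binarizer(threshold=threshold).fit_transform(X_toy)

print(manual)
print(np.array_equal(manual, sk))
```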


## Summary

What we did:

* we discovered how to prepare data for ML in Python using scikit-learn, with 4 recipes: rescaling, standardizing, normalizing and binarizing.

## What's next 

Now that we know how to transform the data to best expose the structure of the problem to the modeling algorithms, we need to discover how to select the features of the data that are most relevant to making predictions.