# Prepare your data for ML

We will prepare our data for machine learning (ML) in Python using ***scikit-learn***. The focus is on four transformations:

1. Rescaling the data
2. Standardizing the data
3. Normalizing the data
4. Binarizing the data

## How to do this? Scikit-learn

The ***scikit-learn*** library provides two standard idioms for transforming data. Each is useful in different circumstances. 

* Fit and Multiple Transform
* Combined Fit-And-Transform
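
The two idioms listed above can be sketched as follows. This is a minimal illustration on a small made-up array (the values and the names `X_train` and `X_new` are assumptions, not part of the Pima dataset used later). MinMaxScaler is used here only as an example transformer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, chosen only for illustration)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[1.5, 25.0]])

# Idiom 1: fit once, then transform as many arrays as needed
scaler = MinMaxScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
new_scaled = scaler.transform(X_new)   # reuses the min/max learned from X_train

# Idiom 2: fit and transform the same array in a single call
train_scaled_combined = MinMaxScaler().fit_transform(X_train)

print(train_scaled)
print(new_scaled)
```

The first idiom is the one to prefer when the same transformation must later be applied to other data (for example a test set), because the parameters are learned only once.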

You can review the [preprocessing API in scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), where all the calls are listed and explained in detail.

If scikit-learn is not installed yet, you can install it with pip:

    pip install -U scikit-learn

or (on Anaconda):

    conda install scikit-learn

Since we are running on Google Colab, nothing needs to be installed: scikit-learn is already available, and we only need a few imports.

## 0. Import the data

In [0]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

## 1. Rescale data

You can rescale your data in scikit-learn with the MinMaxScaler class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html). It maps each feature (column) linearly onto a chosen range, by default between 0 and 1.
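
To make the rescaling concrete, here is a small sketch that applies the min-max formula by hand, column by column, and compares it with MinMaxScaler at its default range. The array `X_toy` is made up for illustration and is not the dataset used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): 3 samples, 2 features on very different scales
X_toy = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]])

# Min-max rescaling by hand: (x - min) / (max - min), computed per column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same operation with scikit-learn (default feature_range is (0, 1))
sklearn_result = MinMaxScaler().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sklearn_result))
```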

In [0]:
#from pandas import read_csv
from numpy import set_printoptions

In [0]:
from sklearn.preprocessing import MinMaxScaler

In [0]:
#filename = 'pima-indians-diabetes.data.csv'
#names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#dataframe = read_csv(filename, names=names)
array = data.values
array

In [0]:
type(array)

In [0]:
# separate the array into input and output components
X = array[:,0:8]   # features: all rows, first 8 columns (everything except the class label)
Y = array[:,8]     # labels: all rows, last column only

In [0]:
X

In [0]:
Y

In [0]:
# Rescale data (between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize original data...
set_printoptions(precision=3)
print(X[0:5,:])

In [0]:
#.. and rescaled data
print(rescaledX[0:5,:])   # first few rows, you see 8 feature columns rescaled

## <font color='red'>Exercise 1</font>

<div class="alert alert-block alert-info">
Change the feature range used in the scaling, e.g. set it to (0, 10), and look at how the output changes.
</div>

## <font color='green'>Solution</font>

In [0]:
### put your code here

## <font color='red'>Exercise 2</font>

<div class="alert alert-block alert-info">
Can you change this from a combined fit_transform call to a separate fit followed by a transform? Put your solution in the cell below.
</div>

## <font color='green'>Solution</font>

In [0]:
### put your code here

## 2. Standardize data

You can standardize data using scikit-learn with the StandardScaler, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
# Standardize data (0 mean, 1 stdev)
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.
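
The statement above can be checked numerically. The sketch below does so on a small made-up array (`X_toy` is an assumption, not the notebook's dataset). Note that StandardScaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): 4 samples, 2 features
X_toy = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 100.0], [4.0, 110.0]])

standardized = StandardScaler().fit_transform(X_toy)

# Per-column mean and standard deviation after the transformation
col_means = standardized.mean(axis=0)
col_stds = standardized.std(axis=0)   # ddof=0, same convention as StandardScaler

print(col_means)   # expected to be numerically close to 0
print(col_stds)    # expected to be numerically close to 1
```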

## 3. Normalize data

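You can normalize data in Python with scikit-learn using the Normalizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html). Unlike the scalers above, which work column by column, it rescales each row (sample) so that it has length 1 (L2 norm by default).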
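
As a rough illustration of what this normalization does, the sketch below divides each row of a small made-up array by its Euclidean (L2) norm and compares the result with Normalizer. The array `X_toy` is hypothetical and the L2 norm is passed explicitly for clarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (hypothetical values): each row is one sample
X_toy = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Normalization by hand: divide every row by its own L2 norm
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)
manual = X_toy / row_norms

# The same operation with scikit-learn
sklearn_result = Normalizer(norm='l2').fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sklearn_result))
```

Rows 1 and 3 point in the same direction, so they end up identical after normalization: only the direction of each sample is kept, not its magnitude.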

In [0]:
from sklearn.preprocessing import Normalizer

In [0]:
# Normalize data (length of 1)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

An alternative, still in scikit-learn, is the normalize function:

In [0]:
from sklearn.preprocessing import normalize

alternative_normalizedX = normalize(X)   # L2 norm by default, applied row by row
alternative_normalizedX

## <font color='red'>Exercise 3</font>

<div class="alert alert-block alert-info">
Each row should now have length (Euclidean norm) 1. Check that this is true for both methods above.
</div>

## <font color='green'>Solution</font>

In [0]:
### put your code here

## 4. Binarize data

You can binarize data in Python with scikit-learn using the Binarizer class, documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html). Values above the threshold are set to 1, and all other values are set to 0.
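
A minimal sketch of how the threshold works in binarization, on a small made-up array (`X_toy` and the threshold value 1.0 are assumptions chosen for illustration). Values strictly greater than the threshold become 1, and values equal to or below it become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data (hypothetical values)
X_toy = np.array([[0.0, 2.5], [1.0, 0.5], [3.0, 1.0]])

# Binarize with scikit-learn: values > 1.0 map to 1, all others to 0
binary = Binarizer(threshold=1.0).fit_transform(X_toy)

# The same rule written by hand with a boolean comparison
manual = (X_toy > 1.0).astype(float)

print(binary)
print(np.array_equal(binary, manual))
```

Note that the entries equal to the threshold (1.0) are mapped to 0, because the comparison is strict.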

In [0]:
from sklearn.preprocessing import Binarizer

In [0]:
# Binarize data (values > 0.0 become 1, all others become 0)
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

In [0]:
# .. compare with original data
print(X[0:5,:])

## Summary

What we did:

* we saw how to prepare data for ML in Python using scikit-learn, with 4 recipes: rescaling, standardizing, normalizing and binarizing.

## What's next 

Now that we know how to transform the data to best expose the structure of our problem to the modeling algorithms, the next step is to select the features of our data that are most relevant for making predictions.