
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/demos/workflow/DataPreprocessing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


# Demo: Data Preprocessing with scikit-learn

<https://scikit-learn.org/stable/modules/preprocessing.html>

<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>

<https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>

## Package setup

In [2]:
import numpy as np
from sklearn import preprocessing
from sklearn import model_selection

## Training/test set splitting

In [34]:
# Generate a random (30,3) tensor
x = np.random.rand(30, 3)
test_size = .25

print(f'x: {x.shape}')

x: (30, 3)


In [36]:
# Using scikit-learn's train_test_split() function
x_train, x_test = model_selection.train_test_split(x, test_size=test_size)

print(f'x_train: {x_train.shape}. x_test: {x_test.shape}')

x_train: (22, 3). x_test: (8, 3)
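
In practice, features usually come with labels that must be split the same way. The following is a minimal sketch, assuming a hypothetical binary label vector `y` (not part of the demo above) and a fixed `random_state` for reproducibility.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical data: 30 samples, 3 features, one binary label per sample
rng = np.random.default_rng(0)
x = rng.random((30, 3))
y = rng.integers(0, 2, size=30)

# Passing several arrays splits them consistently (same rows in each subset)
# random_state makes the shuffle reproducible across runs
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')
```

With 30 samples and `test_size=.25`, the test set size is rounded up to 8 samples, leaving 22 for training.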


## Feature scaling

In [37]:
# Generate a random (3,3) tensor of integers between 1 and 9 (the upper bound 10 is exclusive)
x = np.random.randint(1, 10, (3,3))

print(x)

[[3 8 6]
 [9 4 3]
 [3 4 9]]


### Min-Max scaling

Features are shifted and rescaled to the `[0,1]` range by subtracting the `min` value and dividing by `(max-min)`. Both values are computed per feature, i.e. column-wise along the first axis.

In [39]:
# Using scikit-learn's MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

print(x_scaled)
print(f'Minimum: {x_scaled.min()}. Maximum: {x_scaled.max()}')

[[0.  1.  0.5]
 [1.  0.  0. ]
 [0.  0.  1. ]]
Minimum: 0.0. Maximum: 1.0
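
The min-max formula described above can be checked by hand with NumPy. This is a small sketch that hardcodes the sample matrix printed earlier and compares the manual result with `MinMaxScaler`.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample matrix printed above
x = np.array([[3, 8, 6],
              [9, 4, 3],
              [3, 4, 9]])

# Per-feature (column-wise) minimum and maximum, computed along axis 0
x_min = x.min(axis=0)
x_max = x.max(axis=0)

# Manual min-max scaling; this would divide by zero for a constant column
x_manual = (x - x_min) / (x_max - x_min)

x_sklearn = preprocessing.MinMaxScaler().fit_transform(x)
print(np.allclose(x_manual, x_sklearn))
```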


### Standardization

Features are centered then reduced: their mean is subtracted, then the result is divided by their standard deviation.

Each resulting feature has a mean of 0 and a standard deviation of 1.

In [41]:
# Using scikit-learn's scale() function
x_scaled = preprocessing.scale(x)

print(x_scaled)
print(f'Mean: {x_scaled.mean()}. Std: {x_scaled.std()}')

[[-0.70710678  1.41421356  0.        ]
 [ 1.41421356 -0.70710678 -1.22474487]
 [-0.70710678 -0.70710678  1.22474487]]
Mean: 7.401486830834377e-17. Std: 1.0
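
The same computation can be sketched by hand to make the per-feature behavior explicit. The example below hardcodes the sample matrix printed earlier (as floats) and relies on the fact that both `np.std` and `scale()` use the population standard deviation by default.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample matrix printed above, stored as floats
x = np.array([[3., 8., 6.],
              [9., 4., 3.],
              [3., 4., 9.]])

# Center and reduce each feature (column) independently
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

x_sklearn = preprocessing.scale(x)
print(np.allclose(x_manual, x_sklearn))

# Mean and standard deviation are checked per feature, not globally
print(x_manual.mean(axis=0))
print(x_manual.std(axis=0))
```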


### Feature scaling on training/test sets

To avoid information leakage, the test set must be scaled with values calculated on the training set.

See https://stats.stackexchange.com/a/174865 for details.

In [43]:
# Using scikit-learn's StandardScaler class
scaler = preprocessing.StandardScaler().fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
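
To see what "values calculated on the training set" means concretely, the fitted scaler can be inspected. This is a self-contained sketch that regenerates random data with a fixed seed (an assumption made here for reproducibility) and checks that the stored statistics come from the training set only.

```python
import numpy as np
from sklearn import model_selection, preprocessing

# Regenerate random data with a fixed seed (assumption for reproducibility)
rng = np.random.default_rng(0)
x = rng.random((30, 3))
x_train, x_test = model_selection.train_test_split(
    x, test_size=0.25, random_state=0
)

scaler = preprocessing.StandardScaler().fit(x_train)

# The fitted statistics are those of the training set
print(np.allclose(scaler.mean_, x_train.mean(axis=0)))
print(np.allclose(scaler.scale_, x_train.std(axis=0)))

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Training features have a mean of 0; test features are only close to 0
print(x_train_scaled.mean(axis=0))
print(x_test_scaled.mean(axis=0))
```

The test set is transformed with the training mean and standard deviation, so its own mean and standard deviation are generally not exactly 0 and 1.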

## Feature encoding

### One-hot encoding

Each categorical feature with `n` distinct values is transformed into `n` binary features, with one of them 1 and all others 0. With `categories='auto'`, the distinct values are determined from the data passed to `fit()`.

In [3]:
x = np.random.randint(0, 10, (8,1))

print(x)

[[4]
 [4]
 [5]
 [8]
 [8]
 [4]
 [1]
 [8]]


In [4]:
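# Encoder input must be a 2D array of shape (n_samples, n_features)
# By default, output is a sparse matrix where each column corresponds to one distinct value of one feature
# toarray() converts it to a dense NumPy array for display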
encoder = preprocessing.OneHotEncoder(categories='auto').fit(x)
x_hot = encoder.transform(x).toarray()

print(x_hot)

[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]
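
To relate each output column to an input value, the fitted encoder can be inspected. The sketch below hardcodes the sample values printed earlier, lists the learned categories, and maps the one-hot rows back to the original values.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample column printed above
x = np.array([[4], [4], [5], [8], [8], [4], [1], [8]])

encoder = preprocessing.OneHotEncoder(categories='auto').fit(x)

# Sorted distinct values found during fit: one output column per value
print(encoder.categories_)

x_hot = encoder.transform(x).toarray()

# inverse_transform() maps one-hot rows back to the original values
x_back = encoder.inverse_transform(x_hot)
print(np.array_equal(x_back, x))
```

Only four distinct values (1, 4, 5, 8) appear in the data, which is why the encoded output has four columns.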
