
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/demos/workflow/DataPreprocessing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


# Data Preprocessing With scikit-learn

<https://scikit-learn.org/stable/modules/preprocessing.html>

<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>

<https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>

## Package setup

In [11]:
import numpy as np
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn import preprocessing
from sklearn import model_selection

scikit-learn version: 0.22.2.post1


## Training/test set splitting

In [12]:
# Generate a random (30,3) tensor
x = np.random.rand(30, 3)
print(f'x: {x.shape}')

# Using scikit-learn's train_test_split() function
x_train, x_test = model_selection.train_test_split(x, test_size=0.25)

print(f'x_train: {x_train.shape}')
print(f'x_test: {x_test.shape}')

x: (30, 3)
x_train: (22, 3)
x_test: (8, 3)


In [13]:
# Generate a random vector with as many elements as x has rows (its first dimension)
y = np.random.rand(30)

# Split data while preserving matching between x and y
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=1/3)

print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

x_train: (20, 3). y_train: (20,)
x_test: (10, 3). y_test: (10,)
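
Because `train_test_split()` shuffles the data before splitting, two runs generally select different rows. The sketch below shows one way to make a split reproducible through the function's `random_state` parameter. The seed value `42` and the names `split_a` and `split_b` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn import model_selection

# Illustrative data with the same shapes as above
x = np.random.rand(30, 3)
y = np.random.rand(30)

# random_state fixes the shuffling seed (42 is an arbitrary value)
split_a = model_selection.train_test_split(x, y, test_size=1/3, random_state=42)
split_b = model_selection.train_test_split(x, y, test_size=1/3, random_state=42)

# Each call returns [x_train, x_test, y_train, y_test]
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f'Identical splits: {same}')
```

With the same seed, both calls should pick exactly the same rows for the training and test sets.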


## Feature scaling

In [14]:
# Generate a random (3,3) tensor of integers between 1 and 9 (the upper bound, 10, is exclusive)
x = np.random.randint(1, 10, (3,3))

print(x)

[[5 3 9]
 [2 7 1]
 [4 8 2]]


### Min-Max scaling

Features are shifted and rescaled to the `[0,1]` range by subtracting the `min` value and dividing by `(max-min)`. Both values are computed along the first axis, i.e. separately for each feature (column).

In [15]:
# Using scikit-learn's MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

print(x_scaled)
print(f'Minimum: {x_scaled.min()}. Maximum: {x_scaled.max()}')

[[1.         0.         1.        ]
 [0.         0.8        0.        ]
 [0.66666667 1.         0.125     ]]
Minimum: 0.0. Maximum: 1.0
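
As a sanity check, the min-max formula can be applied by hand with NumPy and compared to the scaler's output. This is a minimal sketch that reuses the sample matrix printed above; the names `col_min`, `col_max`, `x_manual` and `x_sklearn` are introduced here for illustration only.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample matrix shown above
x = np.array([[5, 3, 9],
              [2, 7, 1],
              [4, 8, 2]], dtype=float)

# Per-feature (column) minimum and maximum, computed along the first axis
col_min = x.min(axis=0)
col_max = x.max(axis=0)

# Manual min-max scaling: (x - min) / (max - min)
x_manual = (x - col_min) / (col_max - col_min)

# Reference result from scikit-learn
x_sklearn = preprocessing.MinMaxScaler().fit_transform(x)

print(x_manual)
print(f'Same as MinMaxScaler: {np.allclose(x_manual, x_sklearn)}')
```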


### Standardization

Features are centered then reduced: their mean is subtracted, then the result is divided by their standard deviation.

Each resulting feature has a mean of 0 and a standard deviation of 1.

In [16]:
# Using scikit-learn's scale() function
x_scaled = preprocessing.scale(x)

print(x_scaled)
print(f'Mean: {x_scaled.mean()}. Std: {x_scaled.std()}')

[[ 1.06904497 -1.38873015  1.40487872]
 [-1.33630621  0.46291005 -0.84292723]
 [ 0.26726124  0.9258201  -0.56195149]]
Mean: 4.9343245538895844e-17. Std: 1.0
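
The same check can be done for standardization. The sketch below applies the formula per feature with NumPy and compares it with `preprocessing.scale()`. It relies on `np.std()` computing the population standard deviation by default (`ddof=0`); the names `x_manual` and `x_sklearn` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample matrix shown above
x = np.array([[5, 3, 9],
              [2, 7, 1],
              [4, 8, 2]], dtype=float)

# Manual standardization: subtract the per-feature mean,
# then divide by the per-feature (population) standard deviation
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Reference result from scikit-learn
x_sklearn = preprocessing.scale(x)

print(f'Per-feature mean: {x_manual.mean(axis=0)}')
print(f'Per-feature std: {x_manual.std(axis=0)}')
print(f'Same as scale(): {np.allclose(x_manual, x_sklearn)}')
```

The per-feature means should be 0 up to floating-point rounding, which is why the global mean printed above is a tiny nonzero number rather than exactly 0.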


### Feature scaling on training/test sets

To avoid information leakage, the test set must be scaled with the statistics (here, mean and standard deviation) computed on the training set only.

See https://stats.stackexchange.com/a/174865 for details.

In [17]:
# Using scikit-learn's StandardScaler class, fitted on the training set only
scaler = preprocessing.StandardScaler().fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
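
To make the mechanism visible, the following sketch inspects the statistics stored by the fitted scaler (`mean_` and `scale_`) and confirms that they come from the training set. The data, the seed `0` and the split are assumptions chosen for this illustration, not the values used above.

```python
import numpy as np
from sklearn import model_selection, preprocessing

# Illustrative data (arbitrary seed for reproducibility)
x = np.random.RandomState(0).rand(30, 3)
x_train, x_test = model_selection.train_test_split(x, test_size=1/3, random_state=0)

# Statistics are learned from the training set only
scaler = preprocessing.StandardScaler().fit(x_train)
print(f'Learned means: {scaler.mean_}')
print(f'Learned stds: {scaler.scale_}')

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# The training set is exactly standardized; the test set is only approximately so,
# since it is scaled with statistics it did not contribute to
print(f'Train mean per feature: {x_train_scaled.mean(axis=0)}')
print(f'Test mean per feature: {x_test_scaled.mean(axis=0)}')
```

A test-set mean that is close to, but not exactly, 0 is expected here and is not a sign of an error.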

## Feature encoding

### One-hot encoding

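Each categorical feature with `n` possible integer values is transformed into `n` binary features, with one of them 1 and all others 0. With `categories='auto'`, the possible values are the distinct values found in the data used to fit the encoder.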

In [18]:
x = np.random.randint(0, 10, (8,1))

print(x)

[[3]
 [2]
 [9]
 [9]
 [5]
 [4]
 [8]
 [1]]


In [19]:
# Encoder input must be a 2D array (samples x features)
# transform() returns a sparse matrix, converted here to a dense array with toarray()
# Each output column corresponds to one possible value of one feature
encoder = preprocessing.OneHotEncoder(categories='auto').fit(x)
x_hot = encoder.transform(x).toarray()

print(x_hot)

[[0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]
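
The output above has 7 columns rather than 10 because only 7 distinct values appear in `x`. The sketch below, which reuses the sample values printed above, shows how the fitted encoder's `categories_` attribute maps columns to values, and how `inverse_transform()` recovers the original data. The name `x_back` is illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample column shown above
x = np.array([[3], [2], [9], [9], [5], [4], [8], [1]])

encoder = preprocessing.OneHotEncoder(categories='auto').fit(x)
x_hot = encoder.transform(x).toarray()

# One array of sorted distinct values per input feature;
# the i-th output column corresponds to the i-th value
print(f'Categories: {encoder.categories_}')
print(f'Encoded shape: {x_hot.shape}')

# Map the one-hot rows back to the original values
x_back = encoder.inverse_transform(x_hot)
print(f'Round trip OK: {np.array_equal(x_back, x)}')
```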
