Real-world data usually needs preprocessing before it is passed to a learning algorithm. Two common cases are: (1) Many machine learning algorithms are sensitive to the range of the data, so data standardization is a routine preprocessing step. For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centred around zero and have variance of the same order. (2) Some features are provided as categorical data, e.g., gender ([male, female]) or education level, so encoding categorical features is another important preprocessing step. Sklearn provides functions for both of these preprocessing tasks. (Important for your coursework!)

(1) Example:
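Standardization based on mean and standard deviation (each feature is shifted to zero mean and scaled to unit variance, using statistics learned from the training data):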

In [1]:
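from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.], [ 0.,  1., -1.]])
# Learn the per-feature mean and standard deviation from the training data
scaler = preprocessing.StandardScaler().fit(X_train)

# Apply the training statistics to a new, unseen sample
vector_unnorm = np.array([[0.3, 0.5, 3]])
scaler.transform(vector_unnorm)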


array([[-0.85732141,  0.61237244,  2.13808994]])
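
To make the arithmetic behind this output explicit, here is a minimal sketch that repeats the computation by hand using only the Python standard library. It assumes the scaler divides by the population standard deviation (dividing by n rather than n - 1); the helper name `standardize` is hypothetical and is not part of sklearn.

```python
from statistics import fmean, pstdev

X_train = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]

# Per-feature statistics: work column by column
columns = list(zip(*X_train))
means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]  # population std (divides by n)

def standardize(row):
    """Hypothetical helper: z = (x - mean) / std for each feature."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

z = standardize([0.3, 0.5, 3])
print([round(v, 8) for v in z])
```

The printed values should agree with the array returned by `scaler.transform` above, which shows that the new sample is scaled with the training mean and standard deviation, not with its own.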

Min-max scaling to rescale each feature to the range [0, 1]:

In [2]:
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaler.fit(data)
print(scaler.data_max_)
print(scaler.transform(data))


[ 1. 18.]
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
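
The same result can be reproduced by hand, which also shows what happens to values outside the training range. This is a rough sketch in plain Python; the helper name `min_max_scale` is hypothetical, and it assumes the default target range [0, 1] with no clipping.

```python
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

# Per-feature minimum and maximum, learned from the fitted data
columns = list(zip(*data))
col_min = [min(col) for col in columns]
col_max = [max(col) for col in columns]

def min_max_scale(row):
    """Hypothetical helper: (x - min) / (max - min) for each feature."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(row, col_min, col_max)]

scaled = [min_max_scale(row) for row in data]
print(scaled)

# A new sample whose first feature exceeds the training maximum
unseen = min_max_scale([2, 2])
print(unseen)  # first value is 1.5, i.e. outside [0, 1]
```

Only the data used for fitting is guaranteed to land in [0, 1]; a later sample that falls outside the fitted minimum or maximum is mapped outside that range.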


(2) Example:
OrdinalEncoder, which maps each category of a feature to an integer code:

In [3]:
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari']])


array([[0., 1., 1.]])
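
The codes in this output follow from the sorted order of the categories seen in each column during fitting. A minimal sketch of that mapping in plain Python, assuming sorted category order and using the hypothetical helper `ordinal_encode`:

```python
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# One sorted category list per feature (column)
categories = [sorted(set(col)) for col in zip(*X)]
print(categories)

def ordinal_encode(row):
    """Hypothetical helper: replace each value by its index in the sorted list."""
    return [float(cats.index(value)) for value, cats in zip(row, categories)]

codes = ordinal_encode(['female', 'from US', 'uses Safari'])
print(codes)  # [0.0, 1.0, 1.0]
```

Because 'female' sorts before 'male' it receives code 0, while 'from US' and 'uses Safari' sort second in their columns and receive code 1. The integer codes impose an order on the categories, which may not be meaningful for features such as browser or region.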

OneHotEncoder, which maps each category of a feature to its own binary (0/1) column:

In [4]:
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()


array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
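
The column layout of this output can be read off by building the encoding by hand. The sketch below is plain Python under the assumption that columns are ordered feature by feature, with categories sorted within each feature; `one_hot_encode` is a hypothetical helper, not an sklearn function.

```python
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# One sorted category list per feature (column)
categories = [sorted(set(col)) for col in zip(*X)]

def one_hot_encode(row):
    """Hypothetical helper: one 0/1 indicator per (feature, category) pair."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1.0 if value == cat else 0.0 for cat in cats)
    return encoded

rows = [['female', 'from US', 'uses Safari'],
        ['male', 'from Europe', 'uses Safari']]
for row in rows:
    print(one_hot_encode(row))
# [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
# [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
```

The six columns correspond to female, male, from Europe, from US, uses Firefox, uses Safari, so each row contains exactly one 1 per original feature.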