### Standardization (Mean Removal & Variance Scaling)

* We often ignore the shape of the data distribution and simply 1) remove the mean value of each feature and 2) divide by its standard deviation, so that every feature ends up with zero mean and unit variance.

[API, scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale) |
[API, StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |
[demo](plot_robust_scaling.ipynb)

In [2]:
# scale: quick transformation of a single array-like dataset

from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)

print(X_scaled)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[ 0.  0.  0.]
[ 1.  1.  1.]


In [3]:
# StandardScaler usage
# find mean & stddev on training set, to be reapplied later to test set

scaler = preprocessing.StandardScaler().fit(X)

print(scaler, scaler.mean_, scaler.scale_)

scaler.transform(X)

StandardScaler(copy=True, with_mean=True, with_std=True) [ 1.          0.          0.33333333] [ 0.81649658  0.81649658  1.24721913]


array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
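
The comment in the cell above mentions reapplying the training statistics to a test set, but the cell only transforms the training data. A minimal sketch of that step follows; the `X_test` row is a made-up sample for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Hypothetical test sample, chosen only for illustration
X_test = np.array([[-1., 1., 0.]])

scaler = preprocessing.StandardScaler().fit(X_train)

# transform() reuses the mean and std learned from X_train
X_test_scaled = scaler.transform(X_test)

# The same result computed by hand: z = (x - mean) / std
X_test_manual = (X_test - scaler.mean_) / scaler.scale_

print(X_test_scaled)
print(np.allclose(X_test_scaled, X_test_manual))
```

Because the statistics come from the training set only, the scaled test data will generally not have exactly zero mean or unit variance.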

### Scaling Features to a Range

[MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) - scales each feature to a given [min, max] range, usually [0, 1]

[MaxAbsScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) - divides each feature by its maximum absolute value, so the training data lies in [-1, +1]

In [8]:
# scale training data to [0,1]
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [9]:
# applying same transformer to new data
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [10]:
# scaler attributes
print(min_max_scaler.scale_)
print(min_max_scaler.min_)

[ 0.5         0.5         0.33333333]
[ 0.          0.5         0.33333333]
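
The two attributes printed above define a per-feature linear map. As a rough sketch, assuming the default `feature_range=(0, 1)`, they can be reproduced from the column minima and maxima of the training data:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler().fit(X_train)

# For feature_range=(0, 1):
#   scale_ = 1 / (max - min)   and   min_ = -min * scale_
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)
scale_manual = 1.0 / (data_max - data_min)
min_manual = -data_min * scale_manual

# The transform is then X * scale_ + min_
X_manual = X_train * scale_manual + min_manual

print(np.allclose(scale_manual, min_max_scaler.scale_))
print(np.allclose(min_manual, min_max_scaler.min_))
print(np.allclose(X_manual, min_max_scaler.transform(X_train)))
```

This also explains the test output above: values outside the training range, such as -3 in the first column, map outside [0, 1].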


In [15]:
# MaxAbsScaler
# Designed for data that is ALREADY CENTERED AT ZERO or sparse data

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)

X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
print(X_test_maxabs)               

print(max_abs_scaler.scale_)

[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]
[[-1.5 -1.   2. ]]
[ 2.  1.  2.]
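
As a small sanity check, the `MaxAbsScaler` result can be reproduced with plain NumPy by dividing each column by its largest absolute value. This is only a sketch of the arithmetic, not the library's implementation.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Largest absolute value per column
max_abs = np.abs(X_train).max(axis=0)

# No centering is applied, so zeros stay zeros (which keeps sparse data sparse)
X_manual = X_train / max_abs

max_abs_scaler = preprocessing.MaxAbsScaler().fit(X_train)
print(np.allclose(max_abs, max_abs_scaler.scale_))
print(np.allclose(X_manual, max_abs_scaler.transform(X_train)))
```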


### Scaling with Outliers

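* If the data contains many outliers, scaling with the mean and variance probably won't work well, because outliers distort both statistics.
* Use `robust_scale` or `RobustScaler` instead; by default they center on the median and scale by the interquartile range.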

[RobustScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) | [demo](plot_robust_scaling.ipynb)
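
A minimal sketch of the difference, using a made-up single feature with one extreme value (the data is hypothetical and only meant to show the effect):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature with one extreme outlier
X = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler().fit(X)
standard = StandardScaler().fit(X)

# With default settings RobustScaler uses the median and the
# interquartile range (75th minus 25th percentile)
q25, q50, q75 = np.percentile(X[:, 0], [25, 50, 75])
X_manual = (X - q50) / (q75 - q25)

print(robust.center_, robust.scale_)    # median and IQR
print(standard.mean_, standard.scale_)  # mean and std, both pulled by the outlier
print(np.allclose(robust.transform(X), X_manual))
```

The median and interquartile range barely move when the outlier changes, whereas the mean and standard deviation are dominated by it.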

### Centering Kernel Matrices

[KernelCenterer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KernelCenterer.html#sklearn.preprocessing.KernelCenterer)
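
The notes give no example for this section. As an illustrative sketch, assuming a plain linear kernel, centering the kernel matrix gives the same result as computing the kernel on mean-centered data:

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Linear kernel (Gram matrix) computed on the raw data
K = X @ X.T

# Center the kernel matrix without touching the features themselves
K_centered = KernelCenterer().fit_transform(K)

# For a linear kernel this matches the Gram matrix of mean-centered data
X_centered = X - X.mean(axis=0)
K_expected = X_centered @ X_centered.T

print(np.allclose(K_centered, K_expected))
```

The practical use is with nonlinear kernels, where the feature space is implicit and the data cannot be centered directly.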

### Normalization

* Definition: scaling each individual sample (row) to have unit norm.
* `normalize` and `Normalizer` accept both dense and sparse matrices as inputs.

[normalize](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html#sklearn.preprocessing.normalize) |
[Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) |
[Vector Space Model](https://en.wikipedia.org/wiki/Vector_Space_Model)

In [16]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized  

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [17]:
# Normalizer wraps the same operation in the transformer API, so it can be used in Pipelines

normalizer = preprocessing.Normalizer().fit(X)
print(normalizer)

normalizer.transform(X)                            
normalizer.transform([[-1.,  1., 0.]]) 

Normalizer(copy=True, norm='l2')


array([[-0.70710678,  0.70710678,  0.        ]])
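
For reference, the L2 normalization above amounts to dividing each row by its Euclidean length. A short sketch that checks this by hand and also shows the `'l1'` option:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) length of each row, kept as a column for broadcasting
l2_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / l2_norms

X_l2 = preprocessing.normalize(X, norm='l2')
X_l1 = preprocessing.normalize(X, norm='l1')

print(np.allclose(X_l2, X_manual))
print(np.abs(X_l1).sum(axis=1))  # absolute values in each row sum to 1
```

Note that normalization works row by row, unlike the scalers earlier in these notes, which work column by column.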

### Binarization

* Thresholds numerical features to get **boolean values** (encoded as 0 and 1).
* Useful for downstream probabilistic estimators that assume the input follows a Bernoulli distribution.
* Accepts both dense & sparse array inputs.

[Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer) |
[binarize](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.binarize.html#sklearn.preprocessing.binarize)

In [18]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [19]:
# adjust threshold
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
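
The thresholding rule is a strict comparison: values greater than the threshold become 1, everything else becomes 0. A minimal sketch of the equivalent NumPy expression:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

threshold = 1.1
X_bin = preprocessing.Binarizer(threshold=threshold).fit_transform(X)

# Same rule written directly: strictly greater than the threshold -> 1.0
X_manual = (X > threshold).astype(float)

print(np.array_equal(X_bin, X_manual))
```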

### Encoding Categories

* Categorical features are often stored as integer codes.
* Problem: scikit-learn estimators expect continuous inputs, so they would treat those codes as ordered numeric values.
* Instead, convert each categorical feature into one-of-K (one-hot) encoded values.

[OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) |
[demo](plot_feature_transformation.ipynb)

In [21]:
# example
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
enc.transform([[0, 1, 3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

In [22]:
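# handling category values that do not appear in the training data
# (n_values was removed in scikit-learn 0.22; categories lists the allowed values per feature)
enc = preprocessing.OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# Note: some category values of the 2nd & 3rd features are missing from the training data
enc.fit([[1, 2, 3], [0, 2, 0]])
enc.transform([[1, 0, 0]]).toarray()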

array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
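
To read the encoded columns, it can help to look at the fitted categories. The sketch below assumes scikit-learn 0.20 or later, where `OneHotEncoder` exposes `categories_` and `inverse_transform`:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of distinct values per input feature; the output columns
# follow this order: 2 + 3 + 4 = 9 columns in total
print(enc.categories_)

encoded = enc.transform([[0, 1, 3]]).toarray()
print(encoded)

# inverse_transform maps the one-hot row back to the original codes
decoded = enc.inverse_transform(encoded)
print(decoded)
```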

### Missing Values

* "Impute" = infer missing values from known values
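* `Imputer` fills each missing entry with the mean, median or most frequent value of the relevant row or column. In scikit-learn 0.20 and later it is replaced by `sklearn.impute.SimpleImputer`, which always works column-wise.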

[Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) |
[demo](missing_values.ipynb)

In [23]:
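# example
# SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22.
# It always imputes column-wise, so there is no axis argument.
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))

[[ 4.          2.        ]
 [ 6.          3.66666667]
 [ 7.          6.        ]]


In [24]:
# handling sparse matrices
# SimpleImputer rejects missing_values=0 on sparse input, because every implicit
# zero would count as missing; this version marks missing entries with -1 instead.
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X_test).toarray())

[[ 3.  2.]
 [ 6.  3.]
 [ 7.  6.]]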
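
The fill values themselves are ordinary column statistics computed while ignoring missing entries. A brief sketch, assuming scikit-learn 0.20 or later (where `sklearn.impute.SimpleImputer` is available), compares the fitted `statistics_` attribute with NumPy's NaN-aware functions:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_fit = np.array([[1, 2], [np.nan, 3], [7, 6]])

imp_mean = SimpleImputer(strategy='mean').fit(X_fit)
imp_median = SimpleImputer(strategy='median').fit(X_fit)

# statistics_ holds the per-column fill value learned during fit
print(imp_mean.statistics_)
print(imp_median.statistics_)

# The same numbers computed directly, skipping NaN entries
col_means = np.nanmean(X_fit, axis=0)
col_medians = np.nanmedian(X_fit, axis=0)
print(col_means, col_medians)
```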


### Polynomial Features

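* Adds complexity to a model by generating nonlinear features: polynomial and interaction terms of the inputs. For two features (a, b) and degree 2 the result is (1, a, b, a², ab, b²).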

[PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) |
[demo](plot_polynomial_interpolation.ipynb)

In [28]:
# example
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
print(X)                                                 
poly = PolynomialFeatures(2)
poly.fit_transform(X)

[[0 1]
 [2 3]
 [4 5]]


array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

In [29]:
# using only interaction terms
X = np.arange(9).reshape(3, 3)
print(X)                                                 
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly.fit_transform(X) 

[[0 1 2]
 [3 4 5]
 [6 7 8]]


array([[   1.,    0.,    1.,    2.,    0.,    0.,    2.,    0.],
       [   1.,    3.,    4.,    5.,   12.,   15.,   20.,   60.],
       [   1.,    6.,    7.,    8.,   42.,   48.,   56.,  336.]])
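
The column layout of both outputs above can be reproduced by hand. The sketch below rebuilds the same matrices with NumPy to make the ordering of the terms explicit:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree 2 on two features (a, b): columns are 1, a, b, a^2, a*b, b^2
X2 = np.arange(6).reshape(3, 2)
a, b = X2[:, 0], X2[:, 1]
manual_full = np.column_stack([np.ones(len(X2)), a, b, a**2, a * b, b**2])
poly_full = PolynomialFeatures(degree=2).fit_transform(X2)

# interaction_only=True on three features (a, b, c) drops the pure powers:
# columns are 1, a, b, c, a*b, a*c, b*c, a*b*c
X3 = np.arange(9).reshape(3, 3)
a, b, c = X3[:, 0], X3[:, 1], X3[:, 2]
manual_inter = np.column_stack(
    [np.ones(len(X3)), a, b, c, a * b, a * c, b * c, a * b * c])
poly_inter = PolynomialFeatures(degree=3, interaction_only=True).fit_transform(X3)

print(np.allclose(poly_full, manual_full))
print(np.allclose(poly_inter, manual_inter))
```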

### Custom Transformers

[FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) |
[demo](plot_function_transformer.ipynb)

In [31]:
# example - a log transformation that can be used as a pipeline step
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
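
The cell above calls the transformer on its own. A minimal sketch of actually placing it in a pipeline follows; the choice of `StandardScaler` as the second step and the use of `inverse_func=np.expm1` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = np.array([[0., 1.], [2., 3.]])

# log1p as the forward function, expm1 as its inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

# Log-transform first, then standardize the transformed features
pipe = make_pipeline(log_transformer, StandardScaler())
X_out = pipe.fit_transform(X)
print(X_out)

# Round trip through the inverse function recovers the original data
X_back = log_transformer.inverse_transform(log_transformer.transform(X))
print(np.allclose(X_back, X))
```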