> Reference
+ [scikit-learn: Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In [1]:
from sklearn import preprocessing
import numpy as np

# Standardization #

## Mean removal and variance scaling ##

Standardize a dataset so that each feature has **zero mean and unit variance**, like a standard Gaussian. Standardization only shifts and rescales each feature; it does not change the shape of its distribution.

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variances of the same order of magnitude. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected.

### Standard Scaler ###
StandardScaler implements fit() and transform(), which makes it suitable for use in the early steps of a [sklearn.pipeline.Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)

It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to the constructor of StandardScaler.
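
As a rough sketch of the Pipeline usage mentioned above: the toy data, the variable names, and the choice of LogisticRegression as the downstream estimator are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: two features on very different scales
X_toy = np.array([[1., 1000.],
                  [2., 1100.],
                  [8., 5000.],
                  [9., 5200.]])
y_toy = np.array([0, 0, 1, 1])

# the scaler is fitted inside the pipeline, then reused unchanged at predict time
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)
pred = model.predict(X_toy)
print(pred)
```

Placing the scaler inside the pipeline means the mean and scale are learned from the training data only, which avoids leaking test-set statistics during cross-validation.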

In [2]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],              
              [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [3]:
scaler.mean_

array([ 1.        ,  0.        ,  0.33333333])

In [4]:
scaler.scale_

array([ 0.81649658,  0.81649658,  1.24721913])

In [5]:
scaler.transform(X) # transform train data

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [6]:
scaler.transform([[-1.,1.,0.]]) # transform test data

array([[-2.44948974,  1.22474487, -0.26726124]])
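
The cells above can be checked by hand. A minimal sketch, assuming the same training matrix: it recomputes the transform from mean_ and scale_, confirms that the scaled columns have zero mean and unit variance, and shows the effect of with_mean=False.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# transform() is equivalent to subtracting mean_ and dividing by scale_
X_manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(X_scaled, X_manual))

# each column of the scaled training data has zero mean and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# with_mean=False skips centering, so only the division by scale_ is applied
scale_only = preprocessing.StandardScaler(with_mean=False).fit(X)
X_scale_only = scale_only.transform(X)
print(np.allclose(X_scale_only, X / scale_only.scale_))
```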

## Scaling features to a range ##

The motivations for this kind of scaling include robustness to very small standard deviations of features and preservation of zero entries in sparse data.

### MinMaxScaler ###

Scale features to lie between a given minimum and maximum value (0 and 1 by default).

In [7]:
# training data
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler=preprocessing.MinMaxScaler()
X_train_minmax=min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [8]:
min_max_scaler.scale_

array([ 0.5       ,  0.5       ,  0.33333333])

In [9]:
min_max_scaler.min_

array([ 0.        ,  0.5       ,  0.33333333])

In [10]:
# testing data
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

If MinMaxScaler is given an explicit feature_range=(min, max) the full formula is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
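
The feature_range rescaling can be verified by hand. A minimal sketch, assuming an illustrative target range of (-1, 1): X_std is multiplied by the width of the range and shifted by its lower bound, and the result is compared with MinMaxScaler. The names lo and hi are placeholders, not part of the scikit-learn API.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

lo, hi = -1.0, 1.0  # example target range, chosen for illustration

# manual computation: rescale each column to [0, 1], then stretch and shift to [lo, hi]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_std = (X_train - col_min) / (col_max - col_min)
X_manual = X_std * (hi - lo) + lo

# the same rescaling done by MinMaxScaler
scaler = preprocessing.MinMaxScaler(feature_range=(lo, hi))
X_sklearn = scaler.fit_transform(X_train)

print(np.allclose(X_manual, X_sklearn))
```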

### MaxAbsScaler ###

Scale features to lie within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.

In [11]:
# training data
X_train = np.array([[ 1., -1.,  2.],
                   [ 2.,  0.,  0.],
                   [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [12]:
# testing data
X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs                 

array([[-1.5, -1. ,  2. ]])

In [13]:
max_abs_scaler.scale_         

array([ 2.,  1.,  2.])

### Scaling sparse data ###

Centering sparse data would destroy the sparseness structure in the data, and thus is rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this.

However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor.
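
A small sketch of both options on a scipy.sparse matrix. The matrix values are made up for illustration; the point is that the output stays sparse and no new non-zero entries are created.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# illustrative sparse input in CSR format
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 0.],
                                       [4., 1., 0.]]))

# MaxAbsScaler divides each column by its maximum absolute value
X_maxabs = preprocessing.MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler works on sparse input only when centering is disabled
X_scaled_sparse = preprocessing.StandardScaler(with_mean=False).fit_transform(X_sparse)

# both results are still sparse, with the same number of stored non-zeros
print(sparse.issparse(X_maxabs), X_maxabs.nnz == X_sparse.nnz)
print(sparse.issparse(X_scaled_sparse), X_scaled_sparse.nnz == X_sparse.nnz)
print(X_maxabs.toarray())
```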

### Scaling data with outliers ###

If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well, because outliers distort both statistics.

In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
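
To illustrate the difference, here is a minimal sketch on a single made-up feature with one extreme value. With its default settings, RobustScaler centers on the median and scales by the interquartile range, which the manual computation below reproduces.

```python
import numpy as np
from sklearn import preprocessing

# one feature with a single extreme outlier (illustrative data)
X_out = np.array([[1.], [2.], [3.], [4.], [1000.]])

standard = preprocessing.StandardScaler().fit_transform(X_out)
robust = preprocessing.RobustScaler().fit_transform(X_out)

# manual equivalent of the default RobustScaler: (x - median) / IQR
median = np.median(X_out, axis=0)
q25, q75 = np.percentile(X_out, [25, 75], axis=0)
robust_manual = (X_out - median) / (q75 - q25)

print(standard.ravel())  # the four inliers are squeezed close together
print(robust.ravel())    # the four inliers keep a usable spread
print(np.allclose(robust, robust_manual))
```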

### Scaling vs Whitening ###

It is sometimes not enough to center and scale the features independently, since a downstream model may also assume that the features are linearly independent.

To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features. (The separate RandomizedPCA class has been removed from recent scikit-learn releases; PCA with svd_solver='randomized' covers that use case.)
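
A minimal sketch of whitening on synthetic data, assuming two strongly correlated features generated with NumPy. After whitening, the sample covariance matrix should be close to the identity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# illustrative data: the second feature is strongly correlated with the first
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X_corr = np.column_stack([x1, x2])

# whiten=True rescales the principal components to unit variance
X_white = PCA(whiten=True).fit_transform(X_corr)

cov_before = np.cov(X_corr, rowvar=False)
cov_after = np.cov(X_white, rowvar=False)

print(np.round(cov_before, 3))  # large off-diagonal terms
print(np.round(cov_after, 3))   # approximately the identity matrix
```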

### Scaling target variables in regression ###

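scale works out-of-the-box with 1d arrays, which is very useful for scaling the target / response variables used for regression. StandardScaler accepted 1d arrays in older releases, but recent scikit-learn versions expect 2d input, so a 1d target needs to be reshaped to a column first (for example with reshape(-1, 1)).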
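
A short sketch of scaling a regression target, assuming a recent scikit-learn release in which StandardScaler expects 2d input. The target values are made up; the reshape to a column and back is the only extra step compared with scale.

```python
import numpy as np
from sklearn import preprocessing

y = np.array([10., 20., 30., 40.])  # illustrative target values

# scale() accepts the 1d array directly
y_scaled = preprocessing.scale(y)

# StandardScaler is fitted on a column vector, then the result is flattened back
scaler = preprocessing.StandardScaler()
y_scaled_2d = scaler.fit_transform(y.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values (e.g. predictions) back to the original units
y_back = scaler.inverse_transform(y_scaled_2d.reshape(-1, 1)).ravel()

print(y_scaled)
print(np.allclose(y_scaled, y_scaled_2d), np.allclose(y_back, y))
```

Keeping the fitted scaler around is what makes it possible to convert predictions back to the original units.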

### References: ###

[Further discussion on the importance of centering and scaling data is available on this FAQ](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)

# Normalization #

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.

**Sparse input**

normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.

In [14]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer

Normalizer(copy=True, norm='l2')

In [15]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [16]:
normalizer.transform([[-1.,1.,0.]])

array([[-0.70710678,  0.70710678,  0.        ]])
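
The l2 normalization above amounts to dividing each row by its Euclidean norm. A minimal sketch that reproduces it by hand and shows why it matters for similarity: the dot product of two unit-norm samples is their cosine similarity.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# l2 normalization divides each row (sample) by its Euclidean norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

X_l2 = preprocessing.normalize(X, norm='l2')
print(np.allclose(X_manual, X_l2))

# dot products of unit-norm rows are the pairwise cosine similarities
cosine = X_l2 @ X_l2.T
print(np.round(cosine, 3))
```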

# Feature binarization #

Feature binarization is the process of thresholding numerical features to get boolean values.

This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM.

It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.

**Sparse input**

binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.

In [17]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

Binarizer(copy=True, threshold=0.0)

In [18]:
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [19]:
# adjusting the threshold of the binarizer
binarizer=preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
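
The thresholding rule can be written out directly. A minimal sketch using the binarize function with the same threshold of 1.1: values strictly greater than the threshold become 1 and all others become 0.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

threshold = 1.1

# function form of Binarizer: no estimator object needed
X_bin = preprocessing.binarize(X, threshold=threshold)

# manual equivalent: strictly greater than the threshold maps to 1, otherwise 0
X_manual = (X > threshold).astype(float)

print(X_bin)
print(np.array_equal(X_bin, X_manual))
```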