> Reference
+ [scikit-learn: Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
+ [machinelearningmastery: prepare data](http://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/)
+ [machinelearningmastery: rescaling data](http://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/)

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

**Tip: Which Method To Use**

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data for given models, and which rescaling method may be worth further investigation.
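A minimal sketch of such a race, under assumptions that are not in the original notes: the wine dataset, a k-nearest-neighbors classifier, and 5-fold cross-validation stand in for your own data, algorithms, and test harness.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative dataset; swap in your own X and y.
X, y = load_wine(return_X_y=True)

# Rescaling variants to compare (None means "leave the data as it is").
candidates = {
    "raw": None,
    "standardized": StandardScaler(),
    "min-max": MinMaxScaler(),
}

scores = {}
for name, scaler in candidates.items():
    model = KNeighborsClassifier()
    if scaler is not None:
        # Putting the scaler in a pipeline refits it on each training fold.
        model = make_pipeline(scaler, model)
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Repeating the loop over a handful of estimators gives a quick picture of which model and rescaling combinations deserve a closer look.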

In [12]:
from sklearn import preprocessing
import numpy as np

# Standardization #

## Mean removal and variance scaling ##

Standardize each feature so that it has **zero mean and unit variance**, like standard Gaussian-distributed data. This rescales the features; it does not change the shape of their distribution.

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Useful for:
+ linear regression
+ logistic regression
+ linear discriminant analysis

### Standard Scaler ###
StandardScaler has fit() and transform() methods, which makes it suitable for use in the early steps of a [sklearn.pipeline.Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)

It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to the constructor of StandardScaler.
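As a rough sketch of the pipeline use, assuming the Iris data and a logistic regression as the downstream estimator (both are illustrative choices, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only; the same mean and scale
# are then applied to the test data inside score().
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into the training step.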

In [13]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],              
              [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [14]:
scaler.mean_

array([ 1.        ,  0.        ,  0.33333333])

In [15]:
scaler.scale_

array([ 0.81649658,  0.81649658,  1.24721913])

In [16]:
scaler.transform(X) # transform train data

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [17]:
scaler.transform([[-1.,1.,0.]]) # transform test data

array([[-2.44948974,  1.22474487, -0.26726124]])
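
The numbers in the cells above can be reproduced by hand with NumPy alone. This is only a sketch of the arithmetic, subtracting the per-column mean and dividing by the per-column population standard deviation:

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

mean = X.mean(axis=0)   # per-feature mean
std = X.std(axis=0)     # population standard deviation (ddof=0)

# Standardize the training data.
X_manual = (X - mean) / std

# New data is transformed with the statistics learned from the training data.
X_new = np.array([[-1., 1., 0.]])
X_new_manual = (X_new - mean) / std

print(X_manual)
print(X_new_manual)
```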

In [38]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(rescaledX[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]


In [43]:
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
X = iris.data
y = iris.target
# standardize the data attributes
scaler = StandardScaler().fit(X)
standardized_X = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(standardized_X[0:5,:])

(150, 4)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
[[-0.901  1.032 -1.341 -1.313]
 [-1.143 -0.125 -1.341 -1.313]
 [-1.385  0.338 -1.398 -1.313]
 [-1.507  0.106 -1.284 -1.313]
 [-1.022  1.263 -1.341 -1.313]]


## Scaling features to a range ##

The motivations for this kind of scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

### MinMaxScaler ###

Scale each feature to lie between a given minimum and maximum value (0 and 1 by default).

Useful for:
+ optimization algorithm - gradient descent
+ algorithms that weigh inputs - regression and neural networks
+ algorithms that use distance measures - k-nearest neighbors

In [18]:
# training data
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler=preprocessing.MinMaxScaler()
X_train_minmax=min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [19]:
min_max_scaler.scale_

array([ 0.5       ,  0.5       ,  0.33333333])

In [20]:
min_max_scaler.min_

array([ 0.        ,  0.5       ,  0.33333333])

In [21]:
# testing data
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [37]:
# Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(rescaledX[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]


If MinMaxScaler is given an explicit feature_range=(min, max) the full formula is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
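
A small check of the feature_range rescaling against MinMaxScaler itself. The target range (-1, 1) and the names `lo` and `hi` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

lo, hi = -1.0, 1.0  # illustrative target range

# Step 1: map each column onto [0, 1].
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch to the width of the target range and shift to its minimum.
X_manual = X_std * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```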

### MaxAbsScaler ###

Scale each feature to lie within the range [-1, 1] by dividing by the largest absolute value in that feature. It is meant for data that is already centered at zero or for sparse data.

In [22]:
# training data
X_train = np.array([[ 1., -1.,  2.],
                   [ 2.,  0.,  0.],
                   [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [23]:
# testing data
X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs                 

array([[-1.5, -1. ,  2. ]])

In [24]:
max_abs_scaler.scale_         

array([ 2.,  1.,  2.])

### Scaling sparse data ###

Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this.

However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor.
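
A short sketch of both options on a scipy.sparse matrix. The matrix values are made up for illustration:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X_dense = np.array([[1., 0., 2.],
                    [0., 0., 4.],
                    [3., 5., 0.]])
X_sparse = sp.csr_matrix(X_dense)

# MaxAbsScaler only divides, so zero entries stay zero.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler must skip centering to keep the matrix sparse.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(sp.issparse(X_maxabs), sp.issparse(X_std))
print(X_maxabs.toarray())
```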

### Scaling data with outliers ###

If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well, because both statistics are pulled toward the outliers.

In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
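
A minimal illustration, assuming RobustScaler's default settings (center on the median, scale by the interquartile range) and a toy column with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single large outlier.
X = np.array([[1.], [2.], [3.], [4.], [100.]])

# Median is 3 and the interquartile range is 4 - 2 = 2.
X_robust = RobustScaler().fit_transform(X)

# For comparison: mean and standard deviation are dominated by the outlier.
X_standard = StandardScaler().fit_transform(X)

print(X_robust.ravel())
print(X_standard.ravel())
```

With the robust scaler the four ordinary points keep a sensible spread around zero, while the outlier remains clearly separated.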

### Scaling vs Whitening ###

It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features.

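To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features. (sklearn.decomposition.RandomizedPCA, which older releases offered for the same purpose, has since been removed; PCA(svd_solver='randomized') covers that case.)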
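
A sketch of whitening on synthetic correlated data. The random data and the mixing matrix are assumptions made only for this example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Build three correlated features by mixing independent noise.
base = rng.normal(size=(200, 3))
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = base @ mixing

# whiten=True rescales the principal components to unit variance.
X_white = PCA(whiten=True).fit_transform(X)

# The covariance of the whitened data is (numerically) the identity matrix.
cov = np.cov(X_white, rowvar=False)
print(np.round(cov, 3))
```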

### Scaling target variables in regression ###

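scale works out-of-the-box with 1d arrays, and StandardScaler did too in the scikit-learn release this notebook was written against; more recent releases expect StandardScaler input to be 2-D, so a 1d target needs reshape(-1, 1) first. Either way, this is very useful for scaling the target / response variables used for regression.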
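
A hedged sketch of scaling a regression target, assuming a recent scikit-learn release in which transformers expect 2-D input (hence the reshape). The target values are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([10., 20., 30., 40.])

y_scaler = StandardScaler()
# Reshape the 1d target to a single column, scale it, then flatten it again.
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# Predictions made on the scaled target can be mapped back to original units.
y_back = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(y_back)
```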

### References: ###

[Further discussion on the importance of centering and scaling data is available on this FAQ](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)

# Normalization #

Normalization is the process of scaling individual samples to have unit norm, i.e. rescaling each observation (row) to have a length of 1.

The term is also used loosely for rescaling real-valued numeric attributes into the range 0 to 1, which is what MinMaxScaler does. The two differ in direction: MinMaxScaler works per feature (column) and maps each one onto a chosen range, while Normalizer works per sample (row) and only fixes the row's norm, so individual values end up somewhere in [-1, 1].

This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.
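
To make the link to similarity concrete, here is a small sketch of L2 normalization done by hand. After it, the dot product of two rows equals their cosine similarity:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Divide each row by its Euclidean (L2) length.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

# Cosine similarity of rows 0 and 2, computed from the raw vectors.
cosine = (X[0] @ X[2]) / (np.linalg.norm(X[0]) * np.linalg.norm(X[2]))

print(np.allclose(X_unit, normalize(X)))        # matches sklearn's normalize
print(np.isclose(X_unit[0] @ X_unit[2], cosine))
```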

**Sparse input**

normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.

Useful for sparse datasets (lots of zeros) with attributes of varying scales when using:
+ algorithms that weight input values - neural networks
+ algorithms that use distance measures - k-nearest neighbors

In [25]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer

Normalizer(copy=True, norm='l2')

In [26]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [27]:
normalizer.transform([[-1.,1.,0.]])

array([[-0.70710678,  0.70710678,  0.        ]])

In [39]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(normalizedX[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]


In [42]:
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer
# load the iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data from the target attributes
X = iris.data
y = iris.target
# normalize the data attributes
scaler = Normalizer().fit(X)
normalized_X = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(normalized_X[0:5,:])

(150, 4)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
[[ 0.804  0.552  0.221  0.032]
 [ 0.828  0.507  0.237  0.034]
 [ 0.805  0.548  0.223  0.034]
 [ 0.8    0.539  0.261  0.035]
 [ 0.791  0.569  0.221  0.032]]


# Feature binarization #

The process of thresholding numerical features to get boolean values. 

This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for sklearn.neural_network.BernoulliRBM.

It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.

It is also useful in feature engineering, when you want to add new features that flag something meaningful, such as whether a value exceeds a threshold.
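
A toy sketch of such an indicator feature. The `ages` column and the threshold of 50 are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical single-feature dataset.
ages = np.array([[25.], [40.], [61.], [50.]])

# Values strictly greater than the threshold become 1, all others 0.
is_over_50 = Binarizer(threshold=50.0).fit_transform(ages)

# Append the indicator as a new column next to the original feature.
X_augmented = np.hstack([ages, is_over_50])
print(X_augmented)
```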
 
**Sparse input**
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.

In [28]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

Binarizer(copy=True, threshold=0.0)

In [29]:
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [30]:
# adjusting the threshold of the binarizer
binarizer=preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [40]:
# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(X[0:5,:])
print(binaryX[0:5,:])

[[  6.000e+00   1.480e+02   7.200e+01   3.500e+01   0.000e+00   3.360e+01
    6.270e-01   5.000e+01]
 [  1.000e+00   8.500e+01   6.600e+01   2.900e+01   0.000e+00   2.660e+01
    3.510e-01   3.100e+01]
 [  8.000e+00   1.830e+02   6.400e+01   0.000e+00   0.000e+00   2.330e+01
    6.720e-01   3.200e+01]
 [  1.000e+00   8.900e+01   6.600e+01   2.300e+01   9.400e+01   2.810e+01
    1.670e-01   2.100e+01]
 [  0.000e+00   1.370e+02   4.000e+01   3.500e+01   1.680e+02   4.310e+01
    2.288e+00   3.300e+01]]
[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]


# Encoding categorical features #
**Need to read this again to understand and then summarize here**
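
Until that summary is written, one possible starting point is a tiny one-hot encoding example. It is only a sketch: it assumes a scikit-learn release whose OneHotEncoder accepts string categories, and the colour values are invented.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display.
X_onehot = encoder.fit_transform(X).toarray()

print(encoder.categories_)  # categories are sorted: blue, green, red
print(X_onehot)
```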

# Imputation of missing values #

Missing values in datasets are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning.

Instead of discarding rows that contain missing values, it is often better to impute them, i.e. to infer them from the known part of the data.

Imputer strategies to apply to row/column in which the missing values are located:
+ use mean value
+ use median value
+ use mode (most frequent value)

The Imputer class also supports sparse matrices. In this case, missing values are encoded by 0 and are thus implicitly stored in the matrix. This format is thus suitable when there are many more missing values than observed values.

In [31]:
import numpy as np
from sklearn.preprocessing import Imputer

# impute np.nan using mean value of the columns (axis=0)
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [32]:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))                           

[[ 4.          2.        ]
 [ 6.          3.66666667]
 [ 7.          6.        ]]
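
The cells above rely on sklearn.preprocessing.Imputer, which later scikit-learn releases removed. On a recent release the same column-mean imputation can be sketched with sklearn.impute.SimpleImputer, which always works column-wise and so has no axis argument:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the column means from the training data, ignoring NaN entries.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Replace NaN entries in new data with the learned column means.
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```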


In [33]:
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(X)

Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)

In [34]:
X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
print(imp.transform(X_test))                      

[[ 4.          2.        ]
 [ 6.          3.66666675]
 [ 7.          6.        ]]


# Generating polynomial features #
**Need to read this again to understand and then summarize here**
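
As a placeholder until that summary exists, a minimal sketch of what PolynomialFeatures produces for two input features a and b with degree 2, namely the columns 1, a, b, a², ab, b²:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0, 1],
              [2, 3],
              [4, 5]])

poly = PolynomialFeatures(degree=2)
# Output columns: bias, a, b, a^2, a*b, b^2
X_poly = poly.fit_transform(X)
print(X_poly)
```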

# Custom transformers #

In [35]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# custom transformer to apply a log transformation in a pipeline
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
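
A possible extension, sketched under the assumption that you also want to undo the transformation: FunctionTransformer accepts an inverse_func, and np.expm1 is the exact inverse of np.log1p.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Pair the forward function with its inverse.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0., 1.], [2., 3.]])
X_log = log_transformer.fit_transform(X)

# inverse_transform maps the transformed values back to the original ones.
X_back = log_transformer.inverse_transform(X_log)
print(X_back)
```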