# Preprocessing in sklearn Python


The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

## Standardisation (mean removal and variance scaling)


Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance. With mean μ and standard deviation σ, the formula for standardisation is
\begin{align}
\dot{x}&=\frac{x-\mu}{\sigma}
\end{align}

__Standardisation__

_Standardize across the entire space_: Calculate a single mean and standard deviation (one value each) for the entire matrix, then subtract the mean from every cell and divide by the standard deviation.

_Standardize on row/input case level_: Calculate the mean and standard deviation of each row, then subtract/divide element-wise for every feature in that row.

_Standardize on column/feature basis_: Calculate the mean and standard deviation of each column (feature), then subtract/divide element-wise for every cell in that column.
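
The three variants above can be sketched in plain NumPy. This is only an illustration on a small toy matrix; the variable names are assumptions made for this example and are not part of scikit-learn.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Entire space: one mean and one standard deviation for all cells.
X_global = (X - X.mean()) / X.std()

# Row level: keepdims=True keeps shape (3, 1) so the statistics broadcast per row.
X_rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Column level: the statistics have shape (3,) and broadcast down each column.
X_cols = (X - X.mean(axis=0)) / X.std(axis=0)
```

The column-wise variant is the one most estimators expect, because each column holds one feature.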


Many machine learning algorithms assume that the data is centered around zero and that all features have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly. In scikit-learn, the _scale_ function performs this standardisation.

__Scale function:__

    sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
__Parameters:__

__X : {array-like, sparse matrix}__
The data to center and scale.

__axis : int (0 by default)__
Axis along which the means and standard deviations are computed. If 0, independently standardize each feature, otherwise (if 1) standardize each sample.

__with_mean : boolean, True by default__
If True, center the data before scaling.

__with_std : boolean, True by default__
If True, scale the data to unit variance (or equivalently, unit standard deviation).

__copy : boolean, optional, default True__
Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSC matrix and if axis is 1).
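
As a small sketch of how the axis and with_std parameters change the result, the calls below standardize the same toy matrix per sample and then only center it. The array and variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# axis=1 standardizes each sample (row) instead of each feature (column).
X_per_sample = preprocessing.scale(X, axis=1)

# with_std=False removes the column means but leaves the spread unchanged.
X_centered = preprocessing.scale(X, with_std=False)
```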
    

In [11]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_scaled = preprocessing.scale(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

When scale is used with its default axis, the mean and the standard deviation are computed along each feature (column), as the manual calculation in the cell below confirms.

In [39]:
X_test = (X_train - np.mean(X_train,axis=0)) / np.std(X_train,axis=0)
X_test

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

### Standard Scaler 

StandardScaler provides the same functionality as scale, but as an estimator. It is a utility class in the preprocessing module that implements the Transformer API: it computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the testing set.

    class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
__Parameters:__
 
__copy : boolean, optional, default True__
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

__with_mean : boolean, True by default__
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

__with_std : boolean, True by default__
If True, scale the data to unit variance (or equivalently, unit standard deviation)
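
The note on sparse input for with_mean can be checked with a short sketch. The sparse matrix below is made up for illustration, and the example assumes SciPy is installed alongside scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 3., 0.],
                                       [4., 0., 0.]]))

# Scaling without centering keeps the zero entries, so the result stays sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)

# Centering would turn the zeros into non-zeros, so it is rejected for sparse input.
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True
```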



__Attributes:__

__scale_ : ndarray or None, shape (n_features,)__
Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.

__mean_ : ndarray or None, shape (n_features,)__
The mean value for each feature in the training set. Equal to None when with_mean=False.

__var_ : ndarray or None, shape (n_features,)__
The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

__n_samples_seen_ : int or array, shape (n_features,)__
The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer; otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.

__Methods:__

1. fit(X[, y]) - Compute the mean and std to be used for later scaling.
2. fit_transform(X[, y]) - Fit to data, then transform it.
3. get_params([deep]) - Get parameters for this estimator.
4. inverse_transform(X[, copy]) - Scale back the data to the original representation.
5. partial_fit(X[, y]) - Online computation of mean and std on X for later scaling.
6. set_params(**params) - Set the parameters of this estimator.
7. transform(X[, y, copy]) - Perform standardization by centering and scaling.
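
Two of the methods listed above, inverse_transform and partial_fit, are easy to try on a toy matrix. The following is a minimal sketch; the split into two batches is an arbitrary choice made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# inverse_transform undoes the scaling and recovers the original values.
scaler = StandardScaler().fit(X)
X_restored = scaler.inverse_transform(scaler.transform(X))

# partial_fit updates the statistics batch by batch instead of all at once.
online = StandardScaler()
online.partial_fit(X[:2])   # first two samples
online.partial_fit(X[2:])   # remaining sample
```

After both batches, the online scaler should hold the same mean_ and scale_ as the scaler fitted on the full matrix, up to floating point error.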


In [13]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [14]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [15]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [16]:
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
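
The main reason to prefer the class over the scale function is that a fitted scaler can be reused on data it has not seen. A minimal sketch, where X_new is a made-up sample standing in for test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
scaler = StandardScaler().fit(X_train)

# The new sample is scaled with the mean and std learned from X_train,
# not with statistics computed from the new sample itself.
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = scaler.transform(X_new)
# approximately [[-2.449, 1.225, -0.267]]
```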

### Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

The motivations for this kind of scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

#### MinMaxScaler

    class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

Transforms features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.
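
The transformation above can be checked against MinMaxScaler for a range other than the default. This sketch assumes a target range of (-1, 1), chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
lo, hi = -1.0, 1.0  # assumed target range for this example

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

# Same computation written out by hand.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (hi - lo) + lo
```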

__Parameters:__

__feature_range : tuple (min, max), default=(0, 1)__
Desired range of transformed data.

__copy : boolean, optional, default True__
Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).



__Attributes:__

__min_ : ndarray, shape (n_features,)__
Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

__scale_ : ndarray, shape (n_features,)__
Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

__data_min_ : ndarray, shape (n_features,)__
Per feature minimum seen in the data

__data_max_ : ndarray, shape (n_features,)__
Per feature maximum seen in the data

__data_range_ : ndarray, shape (n_features,)__
Per feature range (data_max_ - data_min_) seen in the data

__Methods:__

1. fit(X[, y]) - Compute the minimum and maximum to be used for later scaling.
2. fit_transform(X[, y]) - Fit to data, then transform it.
3. get_params([deep]) - Get parameters for this estimator.
4. inverse_transform(X) - Undo the scaling of X according to feature_range.
5. partial_fit(X[, y]) - Online computation of min and max on X for later scaling.
6. set_params(**params) - Set the parameters of this estimator.
7. transform(X) - Scale the features of X according to feature_range.


__Here is an example to scale a toy data matrix to the [0, 1] range:__

In [19]:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax


array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [38]:
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
print(X_std)


[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]


In [39]:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])
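
A fitted MinMaxScaler applies the training minimum and maximum to later data, so unseen values can fall outside [0, 1]. The sketch below uses a made-up sample, X_unseen, and also shows that the transform is the linear map X * scale_ + min_.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
min_max_scaler = MinMaxScaler().fit(X_train)

# Values below the training minimum map below 0, values above the maximum map above 1.
X_unseen = np.array([[-3., -1., 4.]])
X_unseen_minmax = min_max_scaler.transform(X_unseen)
# approximately [[-1.5, 0., 1.667]]

# The same result from the fitted attributes.
X_from_attributes = X_unseen * min_max_scaler.scale_ + min_max_scaler.min_
```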

#### MaxAbsScaler

MaxAbsScaler works in a very similar fashion, but scales the training data to lie within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.

    class sklearn.preprocessing.MaxAbsScaler(copy=True)

Scale each feature by its maximum absolute value.

This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

__Parameters:__

__copy : boolean, optional, default is True__
Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).

__Attributes:__

__scale_ : ndarray, shape (n_features,)__
Per feature relative scaling of the data.
New in version 0.17: scale_ attribute.

__max_abs_ : ndarray, shape (n_features,)__
Per feature maximum absolute value.

__n_samples_seen_ : int__
The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

In [41]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs


array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
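
The same result can be reproduced by hand, and the scaler can be applied to a sparse matrix without losing sparsity. This is a sketch on the toy data, assuming SciPy is available; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Manual version: divide each column by its largest absolute value.
X_manual = X_train / np.abs(X_train).max(axis=0)

# The scaler accepts a CSR matrix and returns a sparse result.
X_sparse = sparse.csr_matrix(X_train)
X_sparse_scaled = MaxAbsScaler().fit_transform(X_sparse)
```

Because no value is shifted, every zero entry stays zero and the number of stored elements does not change.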

### Robust Scaler

    class sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
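
To illustrate the effect of an outlier, the sketch below fits both scalers on a single made-up feature in which one value is far from the rest. The data is invented for this example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature, five samples; the last value is an outlier.
x = np.array([[1.], [2.], [3.], [4.], [100.]])

standard = StandardScaler().fit(x)
robust = RobustScaler().fit(x)

# The mean is pulled towards the outlier, the median is not.
mean_center = standard.mean_[0]     # 22.0
median_center = robust.center_[0]   # 3.0

# The IQR ignores the extreme value, the standard deviation does not.
std_scale = standard.scale_[0]      # roughly 39
iqr_scale = robust.scale_[0]        # 2.0
```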


__Parameters:__

__with_centering : boolean, True by default__
If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

__with_scaling : boolean, True by default__
If True, scale the data to interquartile range.

__quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0__
Quantile range used to calculate scale_. Default: (25.0, 75.0) = (1st quartile, 3rd quartile) = IQR.

New in version 0.18.

__copy : boolean, optional, default is True__
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

__Attributes:__

__center_ : array of floats__
The median value for each feature in the training set.

__scale_ : array of floats__
The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

In [47]:
robscaler = preprocessing.RobustScaler()
X_train_robust = robscaler.fit_transform(X_train)
X_train_robust


array([[ 0.        , -1.        ,  1.33333333],
       [ 1.        ,  0.        ,  0.        ],
       [-1.        ,  1.        , -0.66666667]])
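
The output above can be reproduced with NumPy's median and percentile functions. This is a sketch of the default behaviour only (quantile_range=(25.0, 75.0)); the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature median and interquartile range.
median = np.median(X_train, axis=0)
q25, q75 = np.percentile(X_train, [25, 75], axis=0)
X_manual = (X_train - median) / (q75 - q25)

robust = RobustScaler().fit(X_train)
X_robust = robust.transform(X_train)
```

The fitted center_ and scale_ attributes hold the same median and interquartile range that the manual calculation uses.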