# Preprocessing data

## Standardization, or mean removal and variance scaling


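Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data, that is, Gaussian with zero mean and unit variance.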

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.



In [1]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)

X_scaled     

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [None]:
X_scaled.mean(axis=0)

In [None]:
X_scaled.std(axis=0)

The standard score of a sample x is calculated as:
<tt>z = (x - u) / s</tt>
where u is the mean of the training samples or zero if <tt>with_mean=False</tt>, and s is the standard deviation of the training samples or one if <tt>with_std=False</tt>.
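As a quick check of the formula above, the standard score can be computed by hand with NumPy and compared against `preprocessing.scale`. This is a minimal sketch; the names `u`, `s`, `z_manual` and `z_sklearn` are illustrative, and it assumes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature (column-wise) mean and population standard deviation (ddof=0).
u = X_train.mean(axis=0)
s = X_train.std(axis=0)

# Standard score computed by hand: z = (x - u) / s
z_manual = (X_train - u) / s

# The same transformation done by scikit-learn.
z_sklearn = preprocessing.scale(X_train)

print(np.allclose(z_manual, z_sklearn))
```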


The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.

In [4]:
scaler = preprocessing.StandardScaler().fit(X_train)

In [5]:
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
scaler.mean_ 

array([1.        , 0.        , 0.33333333])

In [7]:
scaler.scale_ 

array([0.81649658, 0.81649658, 1.24721913])

In [8]:
scaler.transform(X_train) 

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

The scaler instance can then be used on new data to transform it the same way it transformed the training set:



In [None]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)    
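To make explicit that the test data is scaled with statistics learned from the training set, the transformation can be reproduced from the fitted attributes `mean_` and `scale_`. This is only a sketch of what `transform` computes with the default settings; the variable `manual` is illustrative.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

scaler = preprocessing.StandardScaler().fit(X_train)

# Apply the *training* mean and standard deviation to the test sample.
manual = (X_test - scaler.mean_) / scaler.scale_

print(np.allclose(manual, scaler.transform(X_test)))
```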

### Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.

In [9]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [10]:
min_max_scaler = preprocessing.MinMaxScaler() # default [0,1]

In [11]:
X_train_minmax = min_max_scaler.fit_transform(X_train)

# Calling fit and transform as separate steps keeps the scaler fitted on the training
# values, so the same parameters can later be reused to transform the test data and
# scale it in the same way.


In [12]:
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [13]:
X_test = np.array([[-3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

The transformation is computed as:

<tt>X_scaled = scale * X + min - X.min(axis=0) * scale</tt>

where <tt>scale = (max - min) / (X.max(axis=0) - X.min(axis=0))</tt>

In [None]:
min_max_scaler.scale_  
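The formula above can be checked by hand on the same training and test arrays. In this sketch, `feature_min` and `feature_max` stand for the target range (the default `[0, 1]`); these names and `X_test_manual` are illustrative, not part of the scikit-learn API.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

feature_min, feature_max = 0.0, 1.0  # target range, the MinMaxScaler default

# Per-feature minimum and maximum observed in the training data.
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)

# scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
scale = (feature_max - feature_min) / (data_max - data_min)

# X_scaled = scale * X + min - X.min(axis=0) * scale
X_test_manual = scale * X_test + feature_min - data_min * scale

min_max_scaler = preprocessing.MinMaxScaler().fit(X_train)
print(np.allclose(X_test_manual, min_max_scaler.transform(X_test)))
```

Because the test sample lies outside the range seen during training, its scaled values fall outside `[0, 1]`.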

MaxAbsScaler scales each feature by its maximum absolute value.
Each feature is scaled individually so that its maximal absolute value in the training set is 1.0. The estimator does not shift or center the data, and thus does not destroy any sparsity.

In [None]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

In [None]:
X_test = np.array([[ -3., -1.,  4.]])
X_test

In [None]:
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs  

In [None]:
max_abs_scaler.scale_   
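The MaxAbsScaler cells above amount to dividing each column by its largest absolute training value. The sketch below reproduces this by hand; `max_abs` and `X_test_manual` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Largest absolute value of each feature in the training set.
max_abs = np.abs(X_train).max(axis=0)

# Pure rescaling: no shift, so zero entries stay zero.
X_test_manual = X_test / max_abs

max_abs_scaler = preprocessing.MaxAbsScaler().fit(X_train)
print(np.allclose(X_test_manual, max_abs_scaler.transform(X_test)))
```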

If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements. They use more robust estimates for the center and range of your data (by default, the median and the interquartile range).



In [None]:
from sklearn.preprocessing import RobustScaler
X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]
X

In [None]:
transformer = RobustScaler().fit(X)

In [None]:
transformer  


In [None]:
transformer.transform(X)
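With its default settings, RobustScaler centers each feature on the median and divides by the interquartile range. The sketch below reproduces that by hand, assuming the default `quantile_range=(25.0, 75.0)` and NumPy's default linear interpolation for percentiles; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Robust center: per-feature median.
median = np.median(X, axis=0)

# Robust spread: per-feature interquartile range (75th minus 25th percentile).
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25

X_manual = (X - median) / iqr

print(np.allclose(X_manual, RobustScaler().fit_transform(X)))
```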

## Normalization


Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

In [None]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')

In [None]:
X_normalized                                      
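Unlike the scalers above, normalization works row by row. As a sketch, L2 normalization can be reproduced by dividing each sample by its Euclidean norm; `row_norms` and `X_manual` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) norm of each sample; keepdims keeps a column for broadcasting.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

X_manual = X / row_norms

print(np.allclose(X_manual, preprocessing.normalize(X, norm='l2')))
# Every row now has unit length.
print(np.linalg.norm(X_manual, axis=1))
```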


## Encoding categorical features

In [None]:
enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)  

In [None]:
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()

In [None]:
enc.categories_

In [None]:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
# Note that some categorical values are missing from the data for the
# 2nd and 3rd features
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X) 

In [None]:
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()


In [None]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X) 





In [None]:
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

In [None]:
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)

In [None]:
drop_enc.categories_

In [None]:
drop_enc.transform(X).toarray()

## Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

In [None]:
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
est

In [None]:
est.transform(X)   
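To see where the bin boundaries fall, the fitted `bin_edges_` attribute can be inspected. The sketch below deliberately sets `strategy='uniform'` (an assumption made here so the edges are easy to verify by hand; the cell above relies on the default strategy), which splits each feature's range into equal-width bins.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[-3., 5., 15.],
              [0., 6., 14.],
              [6., 3., 11.]])

# Equal-width bins: 3 bins for the first feature, 2 for the others.
est = preprocessing.KBinsDiscretizer(
    n_bins=[3, 2, 2], encode='ordinal', strategy='uniform'
).fit(X)

# One array of edges per feature; a feature with k bins has k + 1 edges.
for j, edges in enumerate(est.bin_edges_):
    print(f"feature {j}: edges = {edges}")

# Each value is replaced by the index of the bin it falls into.
X_binned = est.transform(X)
print(X_binned)
```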

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.

In [None]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer


In [None]:

binarizer.transform(X)

In [None]:
binarizer = preprocessing.Binarizer(threshold=1.1)

In [None]:
binarizer.transform(X)
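Binarization is a plain element-wise comparison: values strictly greater than the threshold map to 1 and all others to 0. A minimal sketch of the equivalent NumPy expression follows; `X_manual` is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Strictly-greater-than comparison, converted from booleans to floats.
X_manual = (X > threshold).astype(float)

binarizer = preprocessing.Binarizer(threshold=threshold)
print(np.array_equal(X_manual, binarizer.transform(X)))
```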

## Custom transformers

Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. 

In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
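A FunctionTransformer can also carry an inverse function, which makes the transformation reversible. The sketch below pairs `np.log1p` with `np.expm1` through the `inverse_func` parameter; the names `log_transformer`, `X_log` and `X_back` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0., 1.], [2., 3.]])

# log1p(x) = log(1 + x); expm1(y) = exp(y) - 1 undoes it.
log_transformer = FunctionTransformer(
    np.log1p, inverse_func=np.expm1, validate=True
)

X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

# The round trip recovers the original values up to floating-point error.
print(np.allclose(X_back, X))
```

When `inverse_func` is supplied, `fit` checks by default (`check_inverse=True`) that the two functions really are inverses of each other on the data and warns if they are not.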

In [None]:
import warnings
warnings.filterwarnings("error", message=".*check_inverse*.",
                        category=UserWarning, append=False)