# Preprocessing with sklearn: a complete and comprehensive guide

https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9

For aspiring data scientists it can sometimes be difficult to find their way through the forest of preprocessing techniques. Sklearn’s preprocessing library forms a solid foundation to guide you through this important task in the data science pipeline. Although sklearn has pretty solid documentation, it often lacks a clear thread and intuition connecting the different concepts.

> **This article intends to be a complete guide on preprocessing with sklearn v0.20.0.**

The following subjects will be handled:

- Missing values
- Polynomial features
- Categorical features
- Numerical features
- Custom transformations
- Feature scaling
- Normalization

In [1]:
# Import standard libs
import numpy as np
import pandas as pd
import math

## Missing Values

In [2]:
X = pd.DataFrame(
    np.array([5, 7, 8, np.nan, np.nan, np.nan, -5,
              0, 25, 999, 1, -1, np.nan, 0, np.nan])
    .reshape((5, 3)))
X.columns = ['f1', 'f2', 'f3']  # feature 1, feature 2, feature 3
X_original = X.copy()
X

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,,,
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,,0.0,


In [3]:
# We update our dataset by deleting all the rows (axis=0) that contain only missing values.
# Note that instead of setting thresh to 1, you can also set the how parameter to 'all'.
# As a result our second sample is dropped, since it consists only of missing values.
X.dropna(axis=0, thresh=1, inplace=True)
# Reset the index and discard the old one (drop=True) for future convenience.
X.reset_index(drop=True, inplace=True)
X

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,-5.0,0.0,25.0
2,999.0,1.0,-1.0
3,,0.0,


In [4]:
from sklearn.impute import MissingIndicator

# Treat the sentinel value 999 as a missing value before building the indicator.
X.replace({999.0: np.nan}, inplace=True)
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(X)
indicator = pd.DataFrame(indicator, columns=['m1', 'm3'])
indicator

Unnamed: 0,m1,m3
0,False,False
1,False,False
2,True,False
3,True,True


After deciding to keep (some of) your missing values and creating missing value indicators, the next question is whether you should replace the missing values. Most learning algorithms perform poorly when missing values are expressed as not a number (np.nan) and need some form of missing value imputation.

> **Be aware that some libraries and algorithms, such as XGBoost, can handle missing values and impute these values automatically by learning.**

In [5]:
from sklearn.impute import SimpleImputer

# SimpleImputer works, but it returns a plain NumPy array (column names are lost)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(X)

array([[ 5.        ,  7.        ,  8.        ],
       [-5.        ,  0.        , 25.        ],
       [ 0.        ,  1.        , -1.        ],
       [ 0.        ,  0.        , 10.66666667]])

Note that the returned values are placed in a NumPy array, so we lose all the meta-information (column names and index). Since all these strategies can be mimicked in pandas, we are going to use the pandas fillna method to impute missing values. For ‘mean’ we can use the following code. Pandas also provides options to fill forward (ffill) or fill backward (bfill), which are convenient when working with time series.

In [6]:
# Use .fillna()
X.fillna(X.mean(), inplace=True)
X

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,-5.0,0.0,25.0
2,0.0,1.0,-1.0
3,0.0,0.0,10.666667
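
The forward and backward fill options mentioned above can be sketched on a small series. The values below are made up for illustration and are not part of the dataset used in this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical time series with gaps (illustrative values only).
ts = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = ts.ffill()   # carry the last observed value forward
backward = ts.bfill()  # carry the next observed value backward

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill cannot fill a gap at the very end of the series (and a forward fill cannot fill one at the start), so a trailing or leading missing value may remain.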


Other popular ways to impute missing data are clustering the data with the k-nearest neighbor (KNN) algorithm or interpolating the values using a wide range of interpolation methods. Neither technique is implemented in sklearn’s preprocessing library, so they won’t be discussed here.

## Polynomial features

Sklearn provides a PolynomialFeatures class to create polynomial features from scratch. The degree parameter determines the maximum degree of the polynomial. For example, when degree is set to two and X = (x1, x2), the features created will be 1, x1, x2, x1², x1x2 and x2². The interaction_only parameter lets the function know we only want the interaction features, i.e. 1, x1, x2 and x1x2.
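
To make the enumeration above concrete, here is a small plain-Python sketch that lists the same terms for a single sample. The helper `polynomial_terms` is hypothetical and only mimics the idea; it is not how sklearn implements PolynomialFeatures.

```python
from itertools import combinations, combinations_with_replacement
from math import prod


def polynomial_terms(x, degree, interaction_only=False):
    """List all monomials of x up to `degree`, starting with the bias term 1."""
    # Interaction terms never repeat a feature, full polynomials may.
    combos = combinations if interaction_only else combinations_with_replacement
    terms = []
    for d in range(degree + 1):
        for idx in combos(range(len(x)), d):
            terms.append(prod(x[i] for i in idx))  # the empty product is 1
    return terms


x = [2.0, 3.0]
full = polynomial_terms(x, degree=2)                          # 1, x1, x2, x1^2, x1*x2, x2^2
inter = polynomial_terms(x, degree=2, interaction_only=True)  # 1, x1, x2, x1*x2
print(full)
print(inter)
```

With three features and `degree=3, interaction_only=True`, the same helper yields eight terms: the bias, the three features, the three pairwise products and the single three-way product.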

In [7]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, interaction_only=True)
polynomials = pd.DataFrame(poly.fit_transform(X),
                           columns=['0', '1', '2', '3',
                                    'p1', 'p2', 'p3', 'p4'])
polynomials

Unnamed: 0,0,1,2,3,p1,p2,p3,p4
0,1.0,5.0,7.0,8.0,35.0,40.0,56.0,280.0
1,1.0,-5.0,0.0,25.0,-0.0,-125.0,0.0,-0.0
2,1.0,0.0,1.0,-1.0,0.0,-0.0,-1.0,-0.0
3,1.0,0.0,0.0,10.666667,0.0,0.0,0.0,0.0


In [8]:
polynomials = polynomials[['p1', 'p2', 'p3', 'p4']]
polynomials

Unnamed: 0,p1,p2,p3,p4
0,35.0,40.0,56.0,280.0
1,-0.0,-125.0,0.0,-0.0
2,0.0,-0.0,-1.0,-0.0
3,0.0,0.0,0.0,0.0


In [9]:
X_with_polynomials = pd.concat([X, indicator, polynomials], axis=1)
X_with_polynomials

Unnamed: 0,f1,f2,f3,m1,m3,p1,p2,p3,p4
0,5.0,7.0,8.0,False,False,35.0,40.0,56.0,280.0
1,-5.0,0.0,25.0,False,False,-0.0,-125.0,0.0,-0.0
2,0.0,1.0,-1.0,True,False,0.0,-0.0,-1.0,-0.0
3,0.0,0.0,10.666667,True,True,0.0,0.0,0.0,0.0


## Categorical features

Munging categorical data is another essential process during data preprocessing. 

**Unfortunately, sklearn’s machine learning library does not support handling categorical data. Even for tree-based models, it is necessary to convert categorical features to a numerical representation.**

Before you start transforming your data, it is important to figure out if the feature you’re working on is ordinal (as opposed to nominal). **An ordinal feature** is best described as **a feature with natural, ordered categories where the distances between the categories are not known.**

Once you know what type of categorical data you’re working on, you can pick a suitable transformation tool. In sklearn that will be an OrdinalEncoder for ordinal data, and a OneHotEncoder for nominal data.

> Cardinal numbers, known as the “counting numbers,” indicate quantity. Ordinal numbers indicate the order or rank of things in a set (e.g., sixth in line; fourth place). Nominal numbers name or identify something (e.g., a zip code or a player on a team.) They do not show quantity or rank.

In [10]:
X = pd.DataFrame(
    np.array(['M', 'O-', 'medium',
              'M', 'O-', 'high',
              'F', 'O+', 'high',
              'F', 'AB', 'low',
              'F', 'B+', np.nan])
    .reshape((5, 3)))
X.columns = ['sex', 'blood_type', 'edu_level']
X

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,medium
1,M,O-,high
2,F,O+,high
3,F,AB,low
4,F,B+,


In [11]:
from sklearn.preprocessing import OrdinalEncoder

X1 = X.copy()
encoder = OrdinalEncoder()
# encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])  # one list of ordered categories per feature
X1.edu_level = encoder.fit_transform(X1.edu_level.values.reshape(-1, 1))
X1

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,2.0
1,M,O-,0.0
2,F,O+,0.0
3,F,AB,1.0
4,F,B+,3.0
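
When the categories are inferred from the data, they are sorted alphabetically, which is why ‘high’ ends up as 0 and ‘medium’ as 2 above. The sketch below contrasts that with an explicit order in plain Python. It assumes the intended order is low < medium < high, and the `rank` dictionary is a hypothetical helper, not part of sklearn.

```python
levels = ['medium', 'high', 'high', 'low']

# Alphabetical order, as obtained when categories are inferred from the data.
alphabetical = {level: i for i, level in enumerate(sorted(set(levels)))}
encoded_alphabetical = [alphabetical[level] for level in levels]

# Explicit order that reflects the meaning of the categories.
order = ['low', 'medium', 'high']
rank = {level: i for i, level in enumerate(order)}
encoded_ordered = [rank[level] for level in levels]

print(encoded_alphabetical)
print(encoded_ordered)
```

Only the second encoding preserves the natural ranking, which is what a model needs if it is to treat the feature as ordinal.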


In [12]:
cat = pd.Categorical(X.edu_level,
                     categories=['missing', 'low',
                                 'medium', 'high'],
                     ordered=True)
cat

[medium, high, high, low, NaN]
Categories (4, object): [missing < low < medium < high]

In [13]:
cat = cat.fillna('missing')
cat

[medium, high, high, low, missing]
Categories (4, object): [missing < low < medium < high]

In [14]:
labels, unique = pd.factorize(cat, sort=True)
X.edu_level = labels
X

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,2
1,M,O-,3
2,F,O+,3
3,F,AB,1
4,F,B+,0


In [15]:
from sklearn.preprocessing import OneHotEncoder

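# dtype=int replaces the np.int alias, which newer NumPy releases no longer provide.
# The encoder returns a sparse matrix by default, hence the .toarray() call below.
onehot = OneHotEncoder(dtype=int)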
nominals = pd.DataFrame(
    onehot.fit_transform(X[['sex', 'blood_type']])
    .toarray(),
    columns=['F', 'M', 'AB', 'B+', 'O+', 'O-'])
nominals['edu_level'] = X.edu_level
nominals

Unnamed: 0,F,M,AB,B+,O+,O-,edu_level
0,0,1,0,0,0,1,2
1,0,1,0,0,0,1,3
2,1,0,0,0,1,0,3
3,1,0,1,0,0,0,1
4,1,0,0,1,0,0,0


Since there were no missing values in our nominal features, it is worth saying a word on how to handle missing values with the OneHotEncoder. A missing value can easily be handled as an extra feature. Note that to do this, you need to replace the missing value by an arbitrary value first (e.g. ‘missing’). If you, on the other hand, want to ignore the missing value and create an instance with all zeros (False), you can set the handle_unknown parameter of the OneHotEncoder to ‘ignore’, which encodes any category not seen during fitting as all zeros.
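
A minimal sketch of the second option, using made-up data: the encoder is fitted on two known categories and then asked to transform a value it has never seen. The arrays `train` and `new` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['F'], ['M'], ['F']])
new = np.array([['M'], ['unknown']])  # 'unknown' was never seen during fit

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The default output is a sparse matrix, so convert it for display.
encoded = encoder.transform(new).toarray()
print(encoded)
```

The first row is the usual one-hot vector for ‘M’, while the unseen category becomes a row of zeros instead of raising an error.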

## Numerical features

Just like categorical data can be encoded, numerical features can be ‘decoded’ into categorical features. The two most common ways to do this are **discretization** and **binarization**.

### Discretization

Let’s turn to our example for some clarifications. Import the KBinsDiscretizer class and create a new instance with three bins, ordinal encoding and a uniform strategy (all bins have the same width). Then, fit and transform all our original, missing indicator and polynomial data.

In [16]:
from sklearn.preprocessing import KBinsDiscretizer

X = X_with_polynomials.copy()

disc = KBinsDiscretizer(n_bins=3, encode='ordinal',
                        strategy='uniform')
pd.DataFrame(disc.fit_transform(X))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,2.0,2.0,1.0,0.0,0.0,2.0,2.0,2.0,2.0
1,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0
3,1.0,0.0,1.0,2.0,2.0,0.0,2.0,0.0,0.0


If the output doesn’t make sense to you, invoke the bin_edges_ attribute on the discretizer (disc) and take a look at how the bins are divided. Then try another strategy and see how the bin edges change accordingly.

In [17]:
pd.DataFrame([e.tolist() for e in disc.bin_edges_.tolist()]).T

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,-5.0,0.0,-1.0,0.0,0.0,0.0,-125.0,-1.0,0.0
1,-1.666667,2.333333,7.666667,0.333333,0.333333,11.666667,-70.0,18.0,93.333333
2,1.666667,4.666667,16.333333,0.666667,0.666667,23.333333,-15.0,37.0,186.666667
3,5.0,7.0,25.0,1.0,1.0,35.0,40.0,56.0,280.0
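
The uniform strategy can be approximated by hand for a single feature. The sketch below assumes that ‘uniform’ simply splits the interval between the minimum and maximum into equal-width bins; it uses NumPy only and is not the library’s implementation. The values are those of feature 3.

```python
import numpy as np

f3 = np.array([8.0, 25.0, -1.0, 32.0 / 3.0])  # feature 3 after mean imputation
n_bins = 3

# Equal-width bin edges between the minimum and the maximum.
edges = np.linspace(f3.min(), f3.max(), n_bins + 1)

# Assign each value to a bin using only the interior edges.
bins = np.digitize(f3, edges[1:-1])

print(edges)
print(bins)
```

The edges run from -1 to 25 in steps of 26/3, and the resulting bin indices can be compared with the third column of the discretized output.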


### Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. In other words, it assigns a boolean value (True or False) to each sample based on a threshold. Note that binarization is an extreme form of two-bin discretization.

In [18]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0, copy=True)
binarizer.fit_transform(X.f3.values.reshape(-1, 1))

array([[1.],
       [1.],
       [0.],
       [1.]])
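
The same result can be sketched with a plain NumPy comparison, assuming that values strictly greater than the threshold map to 1 and all others to 0.

```python
import numpy as np

f3 = np.array([8.0, 25.0, -1.0, 32.0 / 3.0])  # feature 3
threshold = 0.0

# Strictly greater than the threshold becomes 1.0, everything else 0.0.
binary = (f3 > threshold).astype(float)
print(binary)
```

Under this assumption a value exactly equal to the threshold ends up as 0.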

## Custom transformers

If you want to convert an existing function into a transformer to assist in data cleaning or processing, you can implement a transformer from an arbitrary function with FunctionTransformer. This class can be useful if you’re working with a Pipeline in sklearn, but can easily be replaced by applying a lambda function to the feature you want to transform (as shown below).

In [19]:
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p, validate=True)
transformer.fit_transform(X.f2.values.reshape(-1, 1))  # same output
X.f2.apply(lambda x: np.log1p(x))  # same output

0    2.079442
1    0.000000
2    0.693147
3    0.000000
Name: f2, dtype: float64

## Feature scaling

The next logical step in our preprocessing pipeline is to scale our features. Before applying any scaling transformations it is very important to **split your data into a train set and a test set.** If you scale before splitting, your training (and test) data might end up scaled around a mean value (see below) that is not actually the mean of the train or test data, which defeats the whole purpose of scaling in the first place.
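
A minimal sketch of this order of operations, using synthetic data (the array `data` and the split sizes are assumptions for illustration): the scaler learns its statistics from the train set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic features

# Split first, scale afterwards.
X_train, X_test = train_test_split(data, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the train set only
X_test_scaled = scaler.transform(X_test)        # the train statistics are reused here

print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```

The scaled train set has a mean of zero by construction, while the scaled test set will generally be close to zero but not exactly zero, which is expected.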

### Standardization

Standardization is a transformation that **centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation.** After standardizing data the mean will be zero and the standard deviation one.

Standardization can drastically improve the performance of models. For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Depending on your needs and data, sklearn provides a bunch of scalers: StandardScaler, MinMaxScaler, MaxAbsScaler and RobustScaler.

### Standard Scaler

Sklearn’s main scaler, the StandardScaler, uses a strict definition of standardization to standardize data. It centers and scales the data using the following formula, where u is the mean and s is the standard deviation.

> x_scaled = (x - u) / s

Let’s take a look at our example to see this in practice. Before we start coding, we should remember that the value of feature 3 for our fourth instance was missing, and we replaced it by the mean. If we input the mean in the above formula, the result after standardizing should be zero. Let’s test this.

Import the StandardScaler class and create a new instance. Note that for sparse matrices you can set the with_mean parameter to False in order not to center the values around zero. Then, fit and transform the scaler to feature 3.

In [20]:
X.f3.values.reshape(-1, 1)

array([[ 8.        ],
       [25.        ],
       [-1.        ],
       [10.66666667]])

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[-0.28562322],
       [ 1.53522482],
       [-1.24960159],
       [ 0.        ]])
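
The formula can be checked by hand with NumPy. The sketch assumes the population standard deviation (ddof=0), which is what NumPy's `std` computes by default.

```python
import numpy as np

f3 = np.array([8.0, 25.0, -1.0, 32.0 / 3.0])  # feature 3, last value is the imputed mean

u = f3.mean()  # mean of the feature
s = f3.std()   # population standard deviation (ddof=0)

scaled = (f3 - u) / s
print(scaled)
```

The imputed value equals the mean, so it is mapped to (numerically) zero, as predicted.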

### MinMax Scaler

The MinMaxScaler transforms features by scaling each feature to a given range. This range can be set by specifying the feature_range parameter (default at (0,1)). This scaler works better for cases where the distribution is not Gaussian or the standard deviation is very small. However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider another scaler.

> x_scaled = (x - min(x)) / (max(x) - min(x))

Importing and using the MinMaxScaler works, just like all the following scalers, in exactly the same way as the StandardScaler. The only difference lies in the parameters passed when initializing a new instance.

Here we scale feature 3 (f3) to a scale between -3 and 3. As expected our maximum value (25) is transformed to 3 and our minimum value (-1) is transformed to -3. All the other values are linearly scaled between these values.

In [22]:
X.f3.values.reshape(-1, 1)

array([[ 8.        ],
       [25.        ],
       [-1.        ],
       [10.66666667]])

In [23]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-3,3))
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[-0.92307692],
       [ 3.        ],
       [-3.        ],
       [-0.30769231]])
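
The formula above scales to the range (0, 1). A hand-written sketch for an arbitrary range follows, assuming the result is stretched and shifted linearly to the bounds `low` and `high` (names chosen here for illustration).

```python
import numpy as np

f3 = np.array([8.0, 25.0, -1.0, 32.0 / 3.0])  # feature 3
low, high = -3.0, 3.0                         # target range

# First scale to [0, 1], then stretch and shift to [low, high].
unit = (f3 - f3.min()) / (f3.max() - f3.min())
scaled = unit * (high - low) + low
print(scaled)
```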

### MaxAbs Scaler

The MaxAbsScaler works very similarly to the MinMaxScaler but automatically scales the data to a [-1,1] range based on the **absolute maximum**. This scaler is meant for **data that is already centered at zero or sparse data**. It does not shift/center the data, and thus does not destroy any sparsity.

> x_scaled = x / max(abs(x))

Let’s once again tackle feature 3 by transforming it using the MaxAbsScaler and compare the output with the original data.

In [24]:
X.f3.values.reshape(-1, 1)

array([[ 8.        ],
       [25.        ],
       [-1.        ],
       [10.66666667]])

In [25]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaler.fit_transform(X.f3.values.reshape(-1, 1))

array([[ 0.32      ],
       [ 1.        ],
       [-0.04      ],
       [ 0.42666667]])

### Robust Scaler

If your data contains many outliers, scaling using the mean and standard deviation of the data is likely to not work very well. In these cases, you can use the RobustScaler. **It removes the median and scales the data according to the quantile range**. The exact formula of the RobustScaler is not specified by the documentation. If you want full details you can always check [the source code](https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/preprocessing/data.py#L1035).

By default, the scaler uses the interquartile range (IQR), which is the range between the 1st quartile and the 3rd quartile. The quantile range can be set manually by specifying the quantile_range parameter when initializing a new instance of the RobustScaler. Here, we transform feature 3 using a quantile range from 10% to 90%.

In [26]:
X.f3.values.reshape(-1, 1)

array([[ 8.        ],
       [25.        ],
       [-1.        ],
       [10.66666667]])

In [27]:
from sklearn.preprocessing import RobustScaler

# quantile_range is given in percentages (0 to 100), so 10% to 90% is (10.0, 90.0)
robust = RobustScaler(quantile_range=(10.0, 90.0))
robust.fit_transform(X.f3.values.reshape(-1, 1))

array([[-0.07017544],
       [ 0.8245614 ],
       [-0.54385965],
       [ 0.07017544]])
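
Since the formula is not spelled out, here is a hand-written sketch of the computation. It assumes the scaler subtracts the median and divides by the difference between the upper and lower percentile, with percentiles expressed on a 0 to 100 scale and linearly interpolated (NumPy's default).

```python
import numpy as np

f3 = np.array([8.0, 25.0, -1.0, 32.0 / 3.0])  # feature 3

median = np.median(f3)
# Percentiles are on a 0-100 scale: 10.0 and 90.0 mean 10% and 90%.
# Passing 0.1 and 0.9 would select the 0.1th and 0.9th percentiles instead.
q_low, q_high = np.percentile(f3, [10.0, 90.0])

scaled = (f3 - median) / (q_high - q_low)
print(median, q_low, q_high)
print(scaled)
```

Because the median and percentiles barely move when a single extreme value is added, this kind of scaling is less affected by outliers than the mean and standard deviation.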

## Normalization

Normalization is the process of scaling individual samples to have unit norm. In basic terms you need to normalize data when the algorithm predicts based on the weighted relationships formed between data points. Scaling inputs to unit norms is a common operation for text classification or clustering.

> One of the key differences between scaling (e.g. standardizing) and normalizing, is that normalizing is a row-wise operation, while scaling is a column-wise operation.

Although there are many other ways to normalize data, sklearn provides three norms (the value to which the individual values are compared): l1, l2 and max. When creating a new instance of the Normalizer class you can specify the desired norm under the norm parameter.
    
Below, the formulas for the available norms are discussed and implemented in Python code, where the result is a list of denominators for each sample in data set X.

### 'max'

The max norm uses the absolute maximum and does for samples what the MaxAbsScaler does for features.

> x_normalized = x / max(abs(x))

In [28]:
norm_max = list(max(list(abs(i) for i in X.iloc[r])) for r in range(len(X)))
norm_max

[280.0, 125.0, 1.0, 10.666666666666666]

### 'l1'

The l1 norm uses the sum of the absolute values, and thus gives equal penalty to all parameters, enforcing sparsity.

> x_normalized = x / sum(abs(x))

In [29]:
norm_l1 = list(sum(list(abs(i) for i in X.iloc[r])) for r in range(len(X)))
norm_l1

[431.0, 155.0, 4.0, 12.666666666666666]

### 'l2'

The l2 norm uses the square root of the sum of all the squared values. This creates smoothness and rotational invariance. Some models, like PCA, assume rotational invariance, and so l2 will perform better.

> x_normalized = x / sqrt(sum((i**2) for i in x))

In [30]:
norm_l2 = list(math.sqrt(sum(list((i**2) for i in X.iloc[r])))
               for r in range(len(X)))
norm_l2

[290.6871170175933, 127.57350822173073, 2.0, 10.760008261045982]
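
The lists above only contain the denominators. The sketch below completes the step by dividing each row by its norm, using a small made-up matrix (`data` is an illustrative assumption) and vectorized NumPy operations along axis 1, i.e. per sample.

```python
import numpy as np

data = np.array([[3.0, -4.0],
                 [1.0, 1.0]])  # two samples, two features (illustrative values)

# One denominator per row (sample) for each norm.
norm_max = np.abs(data).max(axis=1)
norm_l1 = np.abs(data).sum(axis=1)
norm_l2 = np.sqrt((data ** 2).sum(axis=1))

# Divide every row by its own l2 norm.
normalized_l2 = data / norm_l2[:, np.newaxis]

print(norm_max, norm_l1, norm_l2)
print(normalized_l2)
```

After the division every row has an l2 norm of one, which illustrates that normalization acts on samples rather than on features.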