# Preprocessing
* Preprocessing is the procedure of preparing data so it can be used for machine learning.

* Preprocessing is important because, in general, learning algorithms benefit from standardization and normalization of a data set.

* Scikit-learn’s sklearn.preprocessing package provides many preprocessing utilities. We will take a look at standardization, normalization, encoding categorical features, imputation of missing values, and generating polynomial features for algorithms like LinearRegression.

An important rule for almost all preprocessing methods is that we fit them only on the training set, not on the whole dataset, and then transform both the training and testing sets with the parameters learned from the training set. Otherwise, information from the testing set leaks into the preprocessing step.

To see how preprocessing works, we first need to load a dataset to preprocess. We will use the famous Iris dataset because its features have different scales and it contains a non-numeric column.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

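# The raw file has no header row, so the column names are supplied here
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])
# Features are the four numeric columns; the target is the species label
X = np.array(iris.drop(['label'], axis=1))
y = np.array(iris['label'])
# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
iris.head()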

| | sepal_length | sepal_width | petal_length | petal_width | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


## Standardization
We will start off by taking a look at standardization.

Standardization is a very common procedure, and for some algorithms and datasets it is effectively required. Algorithms like SVM with an RBF kernel, or Ridge and Lasso regression, assume that all features are centered around zero and have variance of the same order; otherwise a feature with a much larger variance can dominate the objective. For distance-based algorithms like Nearest Neighbors, scaling matters because features with larger ranges dominate the distance computation.

An easy way to standardize our dataset is Scikit-learn’s _**StandardScaler**_.

In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)
print(scaler.scale_)
X_train_scaled = scaler.transform(X_train)
X_train_scaled[:5] 

[5.80916667 3.0575     3.7275     1.1825    ]
[0.82036535 0.44453393 1.7439401  0.75029578]


array([[-1.47393679,  1.22037928, -1.5639872 , -1.30948358],
       [-0.13307079,  3.02001693, -1.27728011, -1.04292204],
       [ 1.08589829,  0.09560575,  0.38562104,  0.28988568],
       [-1.23014297,  0.77046987, -1.21993869, -1.30948358],
       [-1.7177306 ,  0.32056046, -1.39196294, -1.30948358]])

As you can see above, we first imported the StandardScaler and fit it on the training features. Then we printed the per-feature mean (mean_) and standard deviation (scale_) that the scaler learned. Lastly, we transformed the training features and displayed the first 5 rows.
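
The arithmetic behind standardization is to subtract each feature's training mean and divide by its training standard deviation. Below is a minimal NumPy sketch of that calculation. The toy arrays and the names `train` and `test` are made up for illustration and are not part of the Iris data.

```python
import numpy as np

# Hypothetical toy data: 4 training samples and 2 test samples, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

# The statistics are computed on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every feature
print(train_scaled.std(axis=0))   # close to 1 for every feature
print(test_scaled)
```

Only the training columns are guaranteed to end up with zero mean and unit variance; the test set is shifted and scaled by the same amounts but keeps its own distribution.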

Now we need to transform our testing features using the same scaler.

In [3]:
X_test_scaled = scaler.transform(X_test)
X_test_scaled[:5]

array([[ 0.35451684, -0.57925837,  0.5576453 ,  0.02332414],
       [-0.13307079,  1.67028869, -1.16259727, -1.17620281],
       [ 2.30486738, -1.02916778,  1.81915651,  1.48941263],
       [ 0.23261993, -0.35430366,  0.44296246,  0.42316645],
       [ 1.2077952 , -0.57925837,  0.61498672,  0.28988568]])

All of the standardization and normalization transformers we use in this tutorial follow the same pattern:

1. Fitting on the training dataset
2. Transforming the training dataset
3. Transforming the testing dataset
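
The three steps can be wrapped in a small helper, sketched below. The function name `scale_split` is hypothetical, and the sketch loads the copy of Iris bundled with scikit-learn (`load_iris`) instead of the CSV file so that it runs without a download.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler


def scale_split(scaler, X_train, X_test):
    """Hypothetical helper: fit on the training set only, then transform both sets."""
    scaler.fit(X_train)                         # 1. fit on the training dataset
    X_train_scaled = scaler.transform(X_train)  # 2. transform the training dataset
    X_test_scaled = scaler.transform(X_test)    # 3. transform the testing dataset
    return X_train_scaled, X_test_scaled


# Bundled copy of Iris, used here so the sketch needs no network access
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The same three steps work for any of the scalers
for scaler in (StandardScaler(), MinMaxScaler(), MaxAbsScaler()):
    X_train_scaled, X_test_scaled = scale_split(scaler, X_train, X_test)
    print(type(scaler).__name__, X_train_scaled.shape, X_test_scaled.shape)
```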

If we want our values between a minimum and maximum value we can use the _**MinMaxScaler**_.

In [4]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(0,1))
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax[:5] 

array([[0.08823529, 0.66666667, 0.        , 0.04166667],
       [0.41176471, 1.        , 0.0877193 , 0.125     ],
       [0.70588235, 0.45833333, 0.59649123, 0.54166667],
       [0.14705882, 0.58333333, 0.10526316, 0.04166667],
       [0.02941176, 0.5       , 0.05263158, 0.04166667]])

In [5]:
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax[:5] 

array([[0.52941176, 0.33333333, 0.64912281, 0.45833333],
       [0.41176471, 0.75      , 0.12280702, 0.08333333],
       [1.        , 0.25      , 1.03508772, 0.91666667],
       [0.5       , 0.375     , 0.61403509, 0.58333333],
       [0.73529412, 0.33333333, 0.66666667, 0.54166667]])

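As you can see, the usage is the same as for the StandardScaler. The only addition is the optional feature_range parameter, which defaults to (0, 1). Because the minimum and maximum are learned from the training set, transformed test values can fall slightly outside the range, like the 1.03508772 above.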
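
As a rough illustration of what min-max scaling computes, the NumPy sketch below rescales each feature with the minimum and maximum found in the training data. The toy arrays and the names `low`, `high` and `min_max` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data, not taken from Iris
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

low, high = 0.0, 1.0  # plays the role of feature_range

# Per-feature minimum and maximum, learned from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def min_max(X):
    """Map [col_min, col_max] linearly onto [low, high]."""
    return (X - col_min) / (col_max - col_min) * (high - low) + low


train_minmax = min_max(train)
test_minmax = min_max(test)

print(train_minmax)  # every column spans exactly 0 to 1
print(test_minmax)   # the second row lies above 1 because it exceeds the training maximum
```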

The last scaling method we will look at divides each feature by its maximum absolute value using _**MaxAbsScaler**_.

In [6]:
from sklearn.preprocessing import MaxAbsScaler

max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs[:5]

array([[0.5974026 , 0.81818182, 0.14925373, 0.08      ],
       [0.74025974, 1.        , 0.2238806 , 0.16      ],
       [0.87012987, 0.70454545, 0.65671642, 0.56      ],
       [0.62337662, 0.77272727, 0.23880597, 0.08      ],
       [0.57142857, 0.72727273, 0.19402985, 0.08      ]])

In [7]:
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs[:5] 

array([[0.79220779, 0.63636364, 0.70149254, 0.48      ],
       [0.74025974, 0.86363636, 0.25373134, 0.12      ],
       [1.        , 0.59090909, 1.02985075, 0.92      ],
       [0.77922078, 0.65909091, 0.67164179, 0.6       ],
       [0.88311688, 0.63636364, 0.71641791, 0.56      ]])

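Because each feature is divided by its maximum absolute value in the training set, the largest absolute value of every training feature is one. Test values can exceed one, as the 1.02985075 above shows.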
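
The same idea written out by hand, as a minimal NumPy sketch on made-up data (the arrays below include negative values on purpose, which Iris does not have):

```python
import numpy as np

# Hypothetical toy data with negative entries
train = np.array([[-4.0, 1.0], [2.0, 5.0], [1.0, -2.0]])
test = np.array([[6.0, -10.0]])

# Largest absolute value of each feature in the training data
max_abs = np.abs(train).max(axis=0)

train_maxabs = train / max_abs  # every value now lies in [-1, 1]
test_maxabs = test / max_abs    # may fall outside [-1, 1]

print(max_abs)
print(train_maxabs)
print(test_maxabs)
```

Since the data is only divided and never shifted, zeros stay zeros and the sign of every value is preserved.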

## Normalization
Normalization is the process of scaling individual samples (rows) to have unit norm.

Scikit-learn’s _**Normalizer**_ allows us to normalize our samples. We only need to pass it the norm we want to use; in this example we use the L2 norm. Unlike the scalers above, the Normalizer works on each row independently, so fitting it does not learn anything from the training data.

In [8]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2') # norm specifies the norm used to normalize the data
X_train_normalized = normalizer.fit_transform(X_train)
X_train_normalized[:5] 

array([[0.77577075, 0.60712493, 0.16864581, 0.03372916],
       [0.77381111, 0.59732787, 0.2036345 , 0.05430253],
       [0.76945444, 0.35601624, 0.50531337, 0.16078153],
       [0.786991  , 0.55745196, 0.26233033, 0.03279129],
       [0.78609038, 0.57170209, 0.23225397, 0.03573138]])

In [9]:
X_test_normalized = normalizer.transform(X_test)
X_test_normalized[:5]

array([[0.73659895, 0.33811099, 0.56754345, 0.14490471],
       [0.8068282 , 0.53788547, 0.24063297, 0.04246464],
       [0.70600618, 0.2383917 , 0.63265489, 0.21088496],
       [0.73350949, 0.35452959, 0.55013212, 0.18337737],
       [0.76467269, 0.31486523, 0.53976896, 0.15743261]])
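
To make the row-wise behaviour concrete, here is a small NumPy sketch that divides a single sample by its own L2 norm. The sample is the Iris row (4.6, 3.6, 1.0, 0.2), and the result should agree with the first row of the normalized training output above.

```python
import numpy as np

# One Iris sample: sepal length, sepal width, petal length, petal width
sample = np.array([[4.6, 3.6, 1.0, 0.2]])

# L2 norm of each row (square root of the sum of squared values)
norms = np.linalg.norm(sample, axis=1, keepdims=True)
normalized = sample / norms

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # each row now has length 1
```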

## Encoding categorical features
Sometimes our data isn’t in the format we need for machine learning. An example is categorical data represented as strings, like the label column of our Iris dataset. We can use Scikit-learn’s LabelEncoder or one hot encoding to convert such data into a numeric format.

First we will print the first 5 labels so you can see that they are strings.

In [10]:
y[:5]

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa'], dtype=object)

Next we will take a look at how Scikit-learn’s _**LabelEncoder**_ works. It transforms categorical text labels into integers and is intended for the target column rather than for input features.

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_le = le.fit_transform(y)
y_le[:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
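
The first ten labels all belong to the same class, so the output above only shows zeros. The short sketch below uses a hand-picked list of labels (an assumption for illustration) to show how the integer codes relate to the learned classes and how to map them back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hand-picked labels covering all three species
labels = np.array(['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # learned classes, in sorted order
print(encoded)                        # each code is an index into classes_
print(le.inverse_transform(encoded))  # back to the original strings
```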

Next we will take a look at one hot encoding, which is very popular for classification problems using neural networks.

_**One Hot Encoding**_ transforms categorical data into binary vectors in which exactly one element is 1 (true) and all others are 0.

In [12]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit_transform([[1], [2], [3]]).toarray() 

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Note: if you previously used a LabelEncoder to convert the categories to integers before passing them to the OneHotEncoder, you can now use the OneHotEncoder directly.
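
As a sketch of applying this to string categories such as the Iris labels, the example below passes a hand-picked list of labels straight to the OneHotEncoder. It assumes a scikit-learn version (0.20 or later) in which the encoder accepts strings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hand-picked labels covering all three species
labels = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

ohe = OneHotEncoder()
# The encoder expects a 2D array of shape (n_samples, n_features)
one_hot = ohe.fit_transform(labels.reshape(-1, 1)).toarray()

print(ohe.categories_)  # the categories found for each input column
print(one_hot)          # one column per category, a single 1 in every row
```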

## Imputation of missing values
Many real-world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Most scikit-learn estimators cannot handle missing values, so we need to fill them in first. A basic strategy is to delete all rows with missing values, but this loses the remaining information in those rows. A better way is to impute the missing values, for example with the column’s mean, median or most frequent value.

Scikit-learn provides the _**SimpleImputer**_ for this. It replaces the older Imputer class from sklearn.preprocessing, which was removed in version 0.22. We only need to pass it which values count as missing and a strategy to replace them with.

In [13]:
from sklearn.impute import SimpleImputer

# np.nan marks the missing entries; each one is replaced by its column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform([[1, 2], [7, 8], [np.nan, np.nan]]) 



array([[1., 2.],
       [7., 8.],
       [4., 5.]])
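
The sketch below compares the three strategies on a small made-up array and then fills a separate test array with the training means, following the fit-on-train rule from the beginning of this tutorial. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer (0.20 or later); all array values are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries
X_train = np.array([[1.0, 2.0], [7.0, 2.0], [7.0, 8.0], [np.nan, np.nan]])
X_test = np.array([[np.nan, 3.0], [4.0, np.nan]])

# Replacement value learned per column for each strategy
fill_values = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imp.fit(X_train)
    fill_values[strategy] = imp.statistics_
    print(strategy, imp.statistics_)

# Fill the test set with the means learned from the training set
mean_imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(X_train)
X_test_filled = mean_imp.transform(X_test)
print(X_test_filled)
```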

## Generating polynomial features
Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple way of doing so is to use polynomial features. With degree 2, three input features (a, b, c) are expanded to (1, a, b, c, a², ab, ac, b², bc, c²).

In [14]:
from sklearn.preprocessing import PolynomialFeatures

X = [[1,2,3], [4,5,6]]
print(X)
poly = PolynomialFeatures(2)
poly.fit_transform(X) 

[[1, 2, 3], [4, 5, 6]]


array([[ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5.,  6., 16., 20., 24., 25., 30., 36.]])
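
To see where each output column comes from, the degree-2 expansion can be rebuilt by hand. This is a minimal NumPy sketch, not scikit-learn's implementation; the names `X_small`, `columns` and `manual_poly` are made up for the example.

```python
from itertools import combinations_with_replacement

import numpy as np

X_small = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
n_features = X_small.shape[1]

# Degree 0: the constant bias column
columns = [np.ones(len(X_small))]
# Degree 1: the original features
columns += [X_small[:, i] for i in range(n_features)]
# Degree 2: every product x_i * x_j with i <= j (squares and pairwise interactions)
columns += [X_small[:, i] * X_small[:, j]
            for i, j in combinations_with_replacement(range(n_features), 2)]

manual_poly = np.column_stack(columns)
print(manual_poly)
```

The result has 1 + 3 + 6 = 10 columns, matching the array printed above.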

source: https://gilberttanner.com/2018/09/17/scikit-learn-tutorial-12-preprocessing/