<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Preprocessing Libraries in sklearn

_Authors: Richard Harris (CHI)_

---

<a id="learning-objectives"></a>
### Learning Objectives
- Get an overview of pre-processing modules in Sklearn

### Lesson Guide
- [Intro to Preprocessing Modules in Sklearn](#intro-to-preprocessing-modules-in-sklearn)
	- [What's happening with fit, transform and fit_transform?](#whats-happening-with-fit-transform-and-fittransform)
- [Different Modules in sklearn](#different-modules-in-sklearn)
	- [Binarizer](#binarizer)
	- [FunctionTransformer](#functiontransformer)
	- [Imputer](#imputer)
	- [LabelBinarizer](#labelbinarizer)
	- [PolynomialFeatures](#polynomialfeatures)
	- [Scalers](#scalers)
- [Conclusion](#conclusion)


<a id="intro-to-preprocessing-modules-in-sklearn"></a>
## Intro to Preprocessing Modules in Sklearn

As we've seen, creating custom transformers for every one of our columns can be a little time-consuming. Sklearn provides a number of classes that preprocess your data, and we've already used a few of them.

Modeling classes in sklearn come standard with the following methods:

- `fit()`
- `predict()`
- `score()`

The preprocessing classes share another standard set of methods:

- `fit()`
- `transform()`
- `fit_transform()`

This standard interface makes it easy to combine transformations within pipelines and let sklearn do much of our work.
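As a rough sketch of how the two sets of methods fit together, the block below chains a scaler and a model in a `Pipeline`. The toy data, the step names, and the choice of `LinearRegression` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, with y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Every step except the last must offer fit() and transform();
# the last step only needs fit() (plus predict()/score() if we call them)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X, y)                 # fits the scaler, transforms X, then fits the model
predictions = pipe.predict(X)  # reapplies the stored scaling before predicting
r_squared = pipe.score(X, y)   # R^2 of the model on the scaled data
```

Calling `pipe.predict()` on new data reuses the scaling learned during `pipe.fit()`, so we never rescale by hand.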

<a id="whats-happening-with-fit-transform-and-fittransform"></a>
### What's happening with fit, transform and fit_transform?

**fit()** learns whatever the transformation needs from the data, but does not apply it. For example, if we fit `StandardScaler()` on a column, the following instructions are stored _inside_ of our instantiated object:

1. Here is the mean I will apply: *x*
2. Here is the standard deviation I will apply: *y*
3. When transform is called, take each value, subtract the mean I have stored, divide it by the standard deviation I have stored, return those transformed values

In other words, it lets us apply the same standards across multiple sets of data!

**transform()** just carries out the steps that have been stored during fitting. 

**fit_transform()** does both of these steps in one go -- it's what we do when we just want to transform a column right out of the gate.
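The three methods can be sketched with `StandardScaler` as follows. The arrays `train` and `new` are made-up values for illustration; `mean_` and `scale_` are the attributes where the scaler stores what it learned during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one column to learn from, and new values arriving later
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
new = np.array([[3.0], [6.0]])

scaler = StandardScaler()
scaler.fit(train)  # stores the mean and standard deviation; transforms nothing

learned_mean = scaler.mean_[0]  # 3.0
learned_std = scaler.scale_[0]  # population standard deviation, sqrt(2)

# transform() applies the stored values, even to data the scaler never saw
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)  # uses the mean and std learned from train

# fit_transform() gives the same result as fit() followed by transform()
one_step = StandardScaler().fit_transform(train)
```

Note that `new_scaled` is computed with the mean and standard deviation of `train`, which is what keeps the two sets of data on the same scale.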

<a id="different-modules-in-sklearn"></a>
## Different Modules in sklearn

We'll talk through each of the different libraries below with some sample code. The documentation and other details can be found in the sklearn API reference [here](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing). Libraries are presented in alphabetical order.

<a id="binarizer"></a>
### Binarizer

Binarizer takes a threshold and will code your data '1' if the value is above that threshold and '0' if the value is at or below that threshold.

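```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Example data: one column of values (sklearn expects a 2D array)
data = np.array([[5], [20], [35]])

# Values above 20 become 1; values at or below 20 become 0
binarizer = Binarizer(threshold=20)
print(binarizer.fit_transform(data))
```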

<a id="functiontransformer"></a>
### FunctionTransformer

This lets us take a function and apply it to the input as if it were an sklearn transformer. It is easiest to understand through an example.

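```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def get_first_column(numpy_array):
    # Select the first column of a 2D array (returns a 1D array)
    return numpy_array[:, 0]

# Example data with two columns
data = np.array([[1, 10], [2, 20], [3, 30]])

ft = FunctionTransformer(get_first_column)
ft.fit_transform(data)
```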

This will let you reproducibly extract the first column out of any numpy array that you pass to the transformer. Good for custom scalers, data extraction, etc.
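As a minimal sketch of the "custom scaler" use case, the block below wraps NumPy's `np.log1p` in a `FunctionTransformer`. The sample values and the choice of a log transform are assumptions for illustration, not part of the lesson's data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical skewed data (one column)
skewed = np.array([[0.0], [9.0], [99.0], [999.0]])

# func is applied by transform(); inverse_func is applied by inverse_transform()
log_scaler = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

logged = log_scaler.fit_transform(skewed)        # log(1 + x) for each value
restored = log_scaler.inverse_transform(logged)  # back to the original scale
```

Because the function carries no learned state, `fit()` has nothing to store here; the wrapper mainly exists so the function can sit inside a pipeline like any other transformer.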

<a id="imputer"></a>
### Imputer

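This processor gives you the equivalent of `.fillna()` in Pandas, but within sklearn. In current versions of sklearn the class is `SimpleImputer` in the `sklearn.impute` module; the older `sklearn.preprocessing.Imputer` has been removed. Your choices for imputation are the average value, median value, most frequent value, or a constant. If you have a more specific imputation that you want to add, you may want to write a function and use `FunctionTransformer` instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# Example data with one missing value
data = np.array([[1.0], [2.0], [np.nan], [10.0]])

impute = SimpleImputer(strategy='median')
impute.fit_transform(data)  # the NaN becomes the median, 2.0
```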

<a id="labelbinarizer"></a>
### LabelBinarizer

LabelBinarizer transforms a set of classes into binary features (yes or no). You could imagine this working similarly to `pd.get_dummies()`:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
data = np.array([1, 3, 7, 2, 4])

lb.fit_transform(data)

# Output:
# array([[1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 1, 0, 0, 0],
#        [0, 0, 0, 1, 0]])
```

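This creates a column for each distinct value (in sorted order) and then encodes each row with a 1 in the column matching its value and 0 everywhere else. With only two classes, the result is a single column of 0s and 1s.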
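To see which value each column stands for, a fitted `LabelBinarizer` exposes `classes_`, and `inverse_transform()` maps the encoding back to the original labels. The sketch below reuses the same five example values; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array([1, 3, 7, 2, 4])

lb = LabelBinarizer()
encoded = lb.fit_transform(labels)

column_labels = lb.classes_              # array([1, 2, 3, 4, 7])
decoded = lb.inverse_transform(encoded)  # array([1, 3, 7, 2, 4])

# transform() on new data reuses the columns learned during fit()
new_encoded = lb.transform(np.array([7, 1]))
# array([[0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0]])
```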

<a id="polynomialfeatures"></a>
### PolynomialFeatures

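This preprocessor creates polynomial terms from your input and, when there is more than one input column, interaction terms between the columns (set `interaction_only=True` to keep only the interaction terms). By default it also adds a _bias_ term (a column of ones). This example shows a case without the bias term.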

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

data = np.array([1, 2, 3, 4, 5])
pf = PolynomialFeatures(3, include_bias=False)
pf.fit_transform(data.reshape(-1, 1))

# Output:
# array([[   1.,    1.,    1.],
#        [   2.,    4.,    8.],
#        [   3.,    9.,   27.],
#        [   4.,   16.,   64.],
#        [   5.,   25.,  125.]])
```
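Interaction terms only appear when there is more than one input column. The sketch below uses a made-up two-column array (call the columns `a` and `b`) to compare the default behavior with `interaction_only=True`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with two columns, a and b
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# Default: powers of each column plus their products
pf_all = PolynomialFeatures(degree=2, include_bias=False)
all_terms = pf_all.fit_transform(X)
# Columns are a, b, a^2, a*b, b^2:
# [[ 2.,  3.,  4.,  6.,  9.],
#  [ 4.,  5., 16., 20., 25.]]

# interaction_only=True drops the pure powers (a^2, b^2)
pf_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
inter_terms = pf_inter.fit_transform(X)
# Columns are a, b, a*b:
# [[ 2.,  3.,  6.],
#  [ 4.,  5., 20.]]
```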

<a id="scalers"></a>
### Scalers

The following sklearn scalers each rescale data in a different way, but all are used the same way:

- **MinMaxScaler** - scales the data using the max and min values so that it fits between 0 and 1 (the default range)
- **StandardScaler** - scales the data so that it has a mean of 0 and a variance of 1
- **RobustScaler** - scales the data similarly to StandardScaler, but centers on the median and scales by the interquartile range to limit the influence of large outliers

There are other scalers, but these are commonly used and easy to interpret.
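The sketch below runs all three scalers on the same made-up column, which includes one large outlier, to show that the calling pattern is identical. The data and the `results` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical column with one large outlier
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Same call for every scaler; only the rescaling rule differs
    results[type(scaler).__name__] = scaler.fit_transform(data).ravel()

for name, scaled in results.items():
    print(name, scaled.round(2))
```

With `RobustScaler`, the median (3) and interquartile range (2) are barely affected by the outlier, so the first four values stay evenly spaced at -1, -0.5, 0 and 0.5. With `MinMaxScaler`, those same four values are squeezed into a narrow band near 0 because the outlier sets the maximum.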

<a id="conclusion"></a>
## Conclusion

These tools can make it very easy to do common tasks as part of a reproducible data pipeline. Every transformation here can be reapplied to new data as it comes in. It's not necessarily the most user-friendly system for iterating during model generation, but can come in very handy when creating a data _pipeline_.