<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Preprocessing Libraries in Scikit-Learn

_Authors: Richard Harris (CHI)_

---

<a id="learning-objectives"></a>
### Learning Objectives
- Get an overview of preprocessing modules in scikit-learn.

### Lesson Guide
- [Intro to Preprocessing Modules in Scikit-Learn](#intro-to-preprocessing-modules-in-sklearn)
	- [What's Happening With `fit`, `transform`, and `fit_transform`?](#whats-happening-with-fit-transform-and-fittransform)
- [Different Modules in Scikit-Learn](#different-modules-in-sklearn)
	- [Binarizer](#binarizer)
	- [FunctionTransformer](#functiontransformer)
	- [Imputer](#imputer)
	- [LabelBinarizer](#labelbinarizer)
	- [PolynomialFeatures](#polynomialfeatures)
	- [Scalers](#scalers)
- [Conclusion](#conclusion)


<a id="intro-to-preprocessing-modules-in-sklearn"></a>
## Intro to Preprocessing Modules in Scikit-Learn

As we've seen, creating custom transformers for every one of our columns can be a little time-consuming. Luckily, scikit-learn provides a number of libraries that preprocess your data and streamline the work. We've already used a few of them.

For modeling techniques in scikit-learn, all estimators come standard with the following methods:

- `fit()`
- `predict()`
- `score()`

Within the preprocessing libraries, we see a similar set of standard methods:

- `fit()`
- `transform()`
- `fit_transform()`

This standard interface makes it easy to combine transformations within pipelines and let scikit-learn do much of our work.

<a id="whats-happening-with-fit-transform-and-fittransform"></a>
### What's Happening with `fit`, `transform`, and `fit_transform`?

**fit()** works out what the transformation needs and stores it _inside_ our instantiated object, but does not apply it. For example, if we fit `StandardScaler()` to a column, the object records the following:

1) Here is the mean we will subtract: *x*.
2) Here is the standard deviation we will divide by: *y*.
3) When `transform` is called, take each value, subtract the stored mean, divide by the stored standard deviation, and return the transformed values.

In other words, fitting once lets us apply the same standards across multiple sets of data.

**transform()** just carries out the steps that have been stored during fitting. 

**fit_transform()** accomplishes both of these steps in one go. It's what we do when we just want to transform a column right out of the gate.
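
The difference between the three methods can be sketched with a small example. The arrays `train` and `new` below are hypothetical stand-ins for a training column and a later batch of data; the point is that `new` is scaled with the statistics learned from `train`, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a single feature stored as one column.
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
new = np.array([[3.0], [6.0]])

scaler = StandardScaler()
scaler.fit(train)  # only stores the mean and standard deviation of `train`

train_scaled = scaler.transform(train)  # applies the stored statistics
new_scaled = scaler.transform(new)      # reuses the same stored statistics

# fit_transform gives the same result as fit followed by transform
same = StandardScaler().fit_transform(train)
```

After fitting, the stored values are available as `scaler.mean_` and `scaler.scale_`, which is a handy way to check what `transform` will apply.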

<a id="different-modules-in-sklearn"></a>
## Different Modules in Scikit-Learn

We'll talk through each of the different libraries below with some sample code. Documentation and other details can be found in the scikit-learn API reference [here](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing). Libraries are presented in alphabetical order.

<a id="binarizer"></a>
### Binarizer

Binarizer codes data as 1 or 0 according to a particular threshold: 1 if the data point is above the threshold and 0 if it is below or equal to the threshold.

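```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sample data (Binarizer expects a 2D array)
data = np.array([[5, 20], [25, 40]])

# Pass the threshold by keyword; recent scikit-learn releases require it
binarizer = Binarizer(threshold=20)
print(binarizer.fit_transform(data))
```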

<a id="functiontransformer"></a>
### FunctionTransformer

This processor lets us take a function and apply it to the input as if it were a scikit-learn object. It can be helpful to look at this in terms of an example: 

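```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def get_first_column(numpy_array):
    # Select every row of the first column
    return numpy_array[:, 0]

# Hypothetical sample data
data = np.array([[1, 2], [3, 4]])

ft = FunctionTransformer(get_first_column)
ft.fit_transform(data)
```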

This will let you reproducibly extract the first column out of any NumPy array to which you fit the transformer. This processor is useful for custom scalers, data extraction, etc.
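
As a rough illustration of the "custom scaler" use case, the sketch below wraps NumPy's `np.log1p` so that a log transform can be used like any other preprocessor. The `skewed` array is made-up sample data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical right-skewed data
skewed = np.array([[0.0, 9.0], [99.0, 999.0]])

# Wrap an existing NumPy function; log1p computes log(1 + x) element-wise
log_tf = FunctionTransformer(np.log1p)
logged = log_tf.fit_transform(skewed)
```

Because the wrapped function has nothing to learn, `fit` stores no statistics here; the benefit is that the step exposes the same `fit`/`transform` interface as the other preprocessors.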

<a id="imputer"></a>
### Imputer

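This processor is scikit-learn's counterpart to `.fillna()` in Pandas. Your choices for imputation are the mean, the median, the most frequent value, or a constant. If you want a more specific imputation, you may want to write a function and use `FunctionTransformer` instead.

Note that the `Imputer` class in `sklearn.preprocessing` was removed in scikit-learn 0.22. In current releases, the same functionality is provided by `SimpleImputer` in `sklearn.impute`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical sample data with one missing value
data = np.array([[1.0], [np.nan], [3.0], [8.0]])

impute = SimpleImputer(strategy='median')
impute.fit_transform(data)
```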

<a id="labelbinarizer"></a>
### LabelBinarizer

LabelBinarizer transforms a set of classes into binary features (yes or no). You can think of this as working similarly to `pd.get_dummies()`:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
data = np.array([1, 3, 7, 2, 4])

lb.fit_transform(data)

# array([[1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 1, 0, 0, 0],
#        [0, 0, 0, 1, 0]])
```

This creates a column for each potential value (in numeric order) and then encodes it using 0s and 1s.
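
The same idea applies to string labels, where the columns follow sorted label order. The sketch below uses made-up animal labels and also shows `classes_` and `inverse_transform`, which map the encoded columns back to the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical string labels
labels = np.array(["cat", "dog", "bird", "dog"])

lb = LabelBinarizer()
encoded = lb.fit_transform(labels)

# classes_ holds the label that each column represents, in sorted order
columns = list(lb.classes_)

# inverse_transform recovers the original labels from the 0/1 matrix
decoded = lb.inverse_transform(encoded)
```

One caveat: when there are only two classes, `LabelBinarizer` returns a single column rather than one column per class.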

<a id="polynomialfeatures"></a>
### PolynomialFeatures

This preprocessor creates polynomial and interaction terms from your input, up to the degree you specify. Passing `interaction_only=True` keeps only the interaction terms. By default it will also create a _bias_ term. The following example shows a case without the bias term:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

data = np.array([1, 2, 3, 4, 5])
pf = PolynomialFeatures(3, include_bias=False)
pf.fit_transform(data.reshape(-1, 1))

# array([[   1.,    1.,    1.],
#        [   2.,    4.,    8.],
#        [   3.,    9.,   27.],
#        [   4.,   16.,   64.],
#        [   5.,   25.,  125.]])
```

<a id="scalers"></a>
### Scalers

The following scikit-learn scalers rescale data in different ways but are all used the same way:

- **MinMaxScaler**: Scales the data using the max and min values so that they fit between 0 and 1 (the default range).
- **StandardScaler**: Scales the data so that they have a mean of 0 and a variance of 1.
- **RobustScaler**: Scales the data similarly to StandardScaler but centers on the median and divides by the interquartile range, which limits the influence of large outliers.

Other scalers exist, but these are the most commonly used and easiest to interpret.
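
To make the comparison concrete, here is a small sketch that runs all three scalers with their default settings on the same made-up column, which contains one large outlier. The `values` array and the `results` dictionary are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical column with one large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Every scaler is used the same way: fit_transform on a 2D array
    results[type(scaler).__name__] = scaler.fit_transform(values)

for name, scaled in results.items():
    print(name, scaled.ravel().round(2))
```

Comparing the printed rows shows how much the outlier compresses the remaining values under each method.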

<a id="conclusion"></a>
## Conclusion

These tools can make it easy to perform common tasks as part of a reproducible data pipeline. Every transformation here can be reapplied to new data as they come in. It's not necessarily the most user-friendly system for iterating during model generation, but it can certainly come in handy when creating a _data pipeline_.
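
As a closing illustration, the sketch below chains two preprocessors in a scikit-learn `Pipeline` and reapplies the fitted steps to new data. The arrays and step names are made up for the example, and it uses `SimpleImputer` from `sklearn.impute`, which is where the imputation class lives in recent scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and a later batch, both with missing values
train = np.array([[1.0], [np.nan], [3.0], [5.0]])
new = np.array([[np.nan], [7.0]])

# Step names ("impute", "scale") are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

train_ready = pipe.fit_transform(train)  # learns the median, mean, and std
new_ready = pipe.transform(new)          # reapplies them to the new batch
```

Calling `pipe.transform(new)` fills the missing value with the median learned from `train` and scales with the training mean and standard deviation, so the new batch is treated exactly like the data the pipeline was fitted on.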