# Binarization

## Table of Contents
1. Binarization
2. Encoding categorical features

## 1. Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. It creates binary values from numeric values by assigning
- a `0` to all values less than or equal to a given threshold
- a `1` to all values greater than that threshold.
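
The thresholding rule can be sketched directly with NumPy. This is only an illustration: the helper name `threshold_to_binary` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def threshold_to_binary(values, threshold=0.0):
    """Return 1 where a value is strictly greater than `threshold`, else 0."""
    arr = np.asarray(values, dtype=float)
    return (arr > threshold).astype(int)

sample = [-0.5, 0.0, 0.3, 2.0]
result = threshold_to_binary(sample)             # threshold 0.0
result_half = threshold_to_binary(sample, 0.5)   # threshold 0.5

print(result.tolist())       # [0, 0, 1, 1]
print(result_half.tolist())  # [0, 0, 0, 1]
```

A value equal to the threshold (here `0.0`) maps to `0`, because the comparison is strict.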

Binarization is a common operation on text count data, where the analyst may decide to consider only the presence or absence of a feature rather than its number of occurrences.
It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

The utility class `Binarizer` can be used to binarize features according to a threshold. Below is an example using `Binarizer` to transform data.

Load libraries.

In [8]:
from sklearn import preprocessing

Create a sample dataset `X`.

In [10]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

Create an object `binarizer`.

In [12]:
binarizer = preprocessing.Binarizer()
binarizer

The fit method does nothing as each sample is treated independently of others.

In [14]:
binarizer.fit(X)

The transform method binarizes each element of `X`.

In [16]:
binarizer.transform(X)

It is possible to adjust the threshold of the binarizer:

In [18]:
binarizer_adjust = preprocessing.Binarizer(threshold=1)
binarizer_adjust.transform(X)
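
The cell outputs are not shown in this session, so the following self-contained sketch states the arrays these two transforms would be expected to return for the sample `X`. The `expected_*` names are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]

# Default threshold is 0.0: only strictly positive values become 1.
default_result = Binarizer().fit(X).transform(X)

# With threshold=1, only values strictly greater than 1 become 1.
adjusted_result = Binarizer(threshold=1).fit(X).transform(X)

expected_default = np.array([[1., 0., 1.],
                             [1., 0., 0.],
                             [0., 1., 0.]])
expected_adjusted = np.array([[0., 0., 1.],
                              [1., 0., 0.],
                              [0., 0., 0.]])

print(np.array_equal(default_result, expected_default))    # True
print(np.array_equal(adjusted_result, expected_adjusted))  # True
```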

The above session introduces the `Binarizer` class, which sets feature values to 0 or 1 according to a threshold. The class can also be useful in the early stages of a `sklearn.pipeline.Pipeline`.
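
As a minimal sketch of that pipeline use, the example below places a `Binarizer` in front of a Bernoulli naive Bayes classifier. The word counts and labels are made up for illustration, and the step names `"binarize"` and `"classify"` are arbitrary choices.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Hypothetical word counts: rows are documents, columns are vocabulary terms.
counts = [[3, 0, 1],
          [0, 2, 0],
          [4, 0, 0],
          [0, 1, 2]]
labels = [1, 0, 1, 0]  # made-up class labels

model = Pipeline([
    ("binarize", Binarizer(threshold=0)),      # counts -> presence/absence
    ("classify", BernoulliNB(binarize=None)),  # input is already binary
])
model.fit(counts, labels)

presence = model.named_steps["binarize"].transform(counts)
predictions = model.predict(counts)
print(presence.tolist())
print(predictions.tolist())
```

`BernoulliNB` can also binarize its input itself through its own `binarize` parameter; it is set to `None` here so that the thresholding happens only in the explicit `Binarizer` step.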

## 2. Encoding categorical features

Often features are not given as continuous values but as categorical values. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country).

Categorical variables are often encoded as numerical variables, either intentionally (e.g. to censor the original values) or implicitly, so that they can be used as features in a model, e.g. ["Red", "Yellow", "Blue", "Red"] becomes [0, 1, 2, 0]. This method imparts an ordinal property to the variable, i.e. Red < Yellow < Blue, which may not be desired.
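
The integer encoding described above can be sketched in plain Python. The codes here are assigned in order of first appearance to match the example; this ordering is an assumption of the sketch, and library encoders may order the codes differently (for instance alphabetically).

```python
colors = ["Red", "Yellow", "Blue", "Red"]

# Assign the next unused integer to each value the first time it is seen.
codes = {}
for color in colors:
    codes.setdefault(color, len(codes))

encoded = [codes[color] for color in colors]

print(codes)    # {'Red': 0, 'Yellow': 1, 'Blue': 2}
print(encoded)  # [0, 1, 2, 0]
```

Any model that treats `encoded` as a numeric feature will see Blue (2) as "larger" than Red (0), which is the unintended ordering the text warns about.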

As the ordinal characteristic is usually not desired, one-hot encoding is necessary for the proper representation of the distinct elements of the variable. 

- One-hot encoding transforms a single variable with \\(n\\) observations and \\(d\\) distinct values into \\(d\\) binary variables with \\(n\\) observations each. Each entry indicates the presence (1) or absence (0) of the corresponding distinct value in that observation.

e.g. [Red, Yellow, Blue, Red] becomes the following, with one row per distinct value (Red, Yellow, Blue) and one column per observation:
[[1,0,0,1],
[0,1,0,0],
[0,0,1,0]]
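
The same transformation can be sketched in plain Python, without any library. The variable names are illustrative only. The second layout, with one row per observation, is the orientation that scikit-learn transformers return.

```python
values = ["Red", "Yellow", "Blue", "Red"]

# Distinct values in order of first appearance.
categories = list(dict.fromkeys(values))

# d rows (one per distinct value), n columns (one per observation).
by_variable = [[int(v == c) for v in values] for c in categories]

# The same data with n rows (one per observation) and d columns.
by_observation = [list(row) for row in zip(*by_variable)]

print(by_variable)     # [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
print(by_observation)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```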

### `OneHotEncoder` Transformer Class

Objects of this class:
- take as input (to the `fit` method) a matrix of integers, denoting the values taken on by categorical (discrete) features. It is assumed that input features take on values in the range [0, n_values).
- return as output (by the `transform` method) a sparse matrix where each column corresponds to one possible value of one feature.

Load libraries.

In [25]:
import pandas as pd
import numpy as np

Create a sample dataset.

In [27]:
testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],                         
                         'age': [3, 6, 3, 1],                         
                         'size':[4, 5, 2, 1]})
testdata

Create a `OneHotEncoder` object and store it in the `enc` variable. Set the parameter `sparse=False` to return a dense array instead of a sparse matrix in the output. (In scikit-learn 1.2 and later, this parameter is named `sparse_output`.)

In [29]:
enc = preprocessing.OneHotEncoder(sparse=False)

Use the "age" column in the sample data to fit the `OneHotEncoder` object `enc`. The `fit` method records (in the `enc` object) the number of values of the feature.

In [31]:
enc.fit(testdata[["age"]])

Transform "age" data to a binary one-hot encoding. The output has 3 binary variables and 4 observations because the "age" feature has 3 distinct values and 4 observations.

In [33]:
enc.transform(testdata[["age"]]),\
enc.transform(testdata[["age"]]).shape
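
As a cross-check that does not depend on the `sparse` keyword, the sketch below encodes the same "age" values and converts the default sparse result to a dense array with `.toarray()`. It assumes a scikit-learn release that exposes the fitted `categories_` attribute; the plain list `ages` stands in for the DataFrame column.

```python
from sklearn.preprocessing import OneHotEncoder

ages = [[3], [6], [3], [1]]  # the "age" column as a single-feature 2-D input

encoder = OneHotEncoder()    # default output is a sparse matrix
encoded = encoder.fit_transform(ages).toarray()  # densify for inspection

learned = [c.tolist() for c in encoder.categories_]

print(learned)           # [[1, 3, 6]]
print(encoded.tolist())  # one row per observation, one column per distinct age
print(encoded.shape)     # (4, 3)
```

The columns follow the sorted distinct values 1, 3, 6, so the first observation (age 3) has its single 1 in the middle column.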

Now fit and transform two features, "age" and "size", using the object `enc`. The `fit` method records the number of values per feature.

In [35]:
enc.fit(testdata[["age","size"]])

Transform "age" and "size" data to a binary one-hot encoding. The output has 7 binary variables and 4 observations: the "age" feature has 3 distinct values and the "size" feature has 4, giving 3 + 4 = 7 columns for the 4 observations.

In [37]:
enc.transform(testdata[["age","size"]]),\
enc.transform(testdata[["age","size"]]).shape
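
The column count can be verified with a short sketch. It uses a plain nested list in place of the DataFrame and, as an assumption, relies on the fitted `categories_` attribute to count the distinct values per feature.

```python
from sklearn.preprocessing import OneHotEncoder

# Rows are observations; columns are "age" and "size" from the sample data.
age_size = [[3, 4], [6, 5], [3, 2], [1, 1]]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(age_size).toarray()

# Number of distinct values learned for each feature.
per_feature = [len(c) for c in encoder.categories_]

print(per_feature)          # [3, 4]
print(encoded.shape)        # (4, 7)
print(encoded[0].tolist())  # first 3 columns encode age, last 4 encode size
```

Each row contains exactly two 1s, one in the "age" block and one in the "size" block.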

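Note: By default, the number of values each feature can take is inferred automatically from the dataset. It is possible to specify this explicitly using the parameter `n_values` (available in scikit-learn releases before 0.22; later releases use the `categories` parameter instead).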

The following command demonstrates how the number of values is defined for each feature. Create another `OneHotEncoder` object `enc_n` and define the number of values each feature can take using the `n_values` parameter. Here the "age" feature is allowed seven possible values (0 to 6) and the "size" feature six possible values (0 to 5), even though only some of them occur in our dataset. Also define the parameter `sparse=False` to return an array instead of a sparse matrix in the output.

In [40]:
enc_n = preprocessing.OneHotEncoder(n_values=[7, 6], sparse=False)

Fit `enc_n` and transform the data. There are now 13 binary variables because the "age" feature is fit to have 7 values and the "size" feature is fit to have 6 values.

In [42]:
enc_n.fit_transform(testdata[["age","size"]]),\
enc_n.fit_transform(testdata[["age","size"]]).shape
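
In scikit-learn releases where `n_values` is no longer accepted, a comparable result can be obtained by listing the allowed values per feature through the `categories` parameter. The sketch below assumes such a release and uses `range(7)` and `range(6)` to mirror the seven ages and six sizes above.

```python
from sklearn.preprocessing import OneHotEncoder

age_size = [[3, 4], [6, 5], [3, 2], [1, 1]]

# Seven possible ages (0-6) and six possible sizes (0-5), listed explicitly.
encoder = OneHotEncoder(categories=[list(range(7)), list(range(6))])
encoded = encoder.fit_transform(age_size).toarray()

print(encoded.shape)  # (4, 13)
```

Columns 0 to 6 correspond to ages 0 to 6 and columns 7 to 12 to sizes 0 to 5, so the first observation (age 3, size 4) has its 1s in columns 3 and 11.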

The above session introduces the `OneHotEncoder` class, which converts categorical integer features to a binary one-hot encoding.