# Label Transformation

## Table of Contents

1. Transforming categorical labels
2. Transforming string variables

## 1. Transforming categorical labels

If we're using a supervised machine learning technique, we need to distinguish between the features and the label of each observation. In multi-class classification problems, labels can take three or more categories. `Scikit-Learn` provides two utility classes, `LabelEncoder` and `LabelBinarizer`, to transform categorical labels.

### `LabelEncoder`

`LabelEncoder` can be used to normalize labels so that they take values between 0 and n_classes-1.

Load library.

In [8]:
from sklearn import preprocessing

Load the utility class `LabelEncoder` and use an object `le` to store it. Then fit the label encoder using a sample data array.

In [10]:
le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])

The `fit` method records (in the object `le`) the distinct classes found in the data and holds (in the attribute `classes_`) the label for each class, in sorted order.

In [12]:
le.classes_

The `transform` method maps new data to integer codes between 0 and n_classes-1.

In [14]:
le.transform([1, 1, 2, 6])

The `inverse_transform` method transforms the encoded labels back to the original encoding.

In [16]:
le.inverse_transform([0, 0, 1, 2])
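
The integer example above can be collected into one runnable sketch with the values each step returns. The final `try` block is an addition for illustration: it assumes we want to see what happens when `transform` receives a label (here `3`) that was not present during `fit`.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])

classes = le.classes_                         # sorted unique labels seen during fit
encoded = le.transform([1, 1, 2, 6])          # original labels -> integer codes
decoded = le.inverse_transform([0, 0, 1, 2])  # integer codes -> original labels

print(classes)   # [1 2 6]
print(encoded)   # [0 0 1 2]
print(decoded)   # [1 1 2 6]

# Illustrative addition: a label that was not seen during fit cannot be encoded.
unseen_error = None
try:
    le.transform([3])
except ValueError as err:
    unseen_error = str(err)
    print("transform failed:", unseen_error)
```

Because the encoder only knows the labels it was fitted on, it is usually safest to fit it on the full set of labels that can occur.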

`LabelEncoder` can also be used to transform non-numerical labels to numerical labels.

Load the utility class `LabelEncoder` and use another object `le2` to store it. Then fit the label encoder using string data.

In [19]:
le2 = preprocessing.LabelEncoder()
le2.fit(["paris", "paris", "tokyo", "amsterdam"])

The `fit` method records (in the object `le2`) the distinct classes found in the data and holds (in the attribute `classes_`) the label for each class. Check the labels in a list:

In [21]:
list(le2.classes_)

The `transform` method maps new string data to integer codes between 0 and n_classes-1.

In [23]:
le2.transform(["tokyo", "tokyo", "paris"])

The `inverse_transform` method transforms the encoded labels back to the original strings.

In [25]:
list(le2.inverse_transform([2, 2, 1]))
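
To make the string-to-integer mapping explicit, the following sketch builds a lookup dictionary from `classes_`. The name `label_to_code` is introduced here for illustration only; it relies on the fact that `classes_` is stored in sorted order, so the position of each label is its integer code.

```python
from sklearn import preprocessing

le2 = preprocessing.LabelEncoder()
le2.fit(["paris", "paris", "tokyo", "amsterdam"])

# classes_ is sorted alphabetically, so each label's index is its code.
label_to_code = {str(label): code for code, label in enumerate(le2.classes_)}
print(label_to_code)  # {'amsterdam': 0, 'paris': 1, 'tokyo': 2}

codes = le2.transform(["tokyo", "tokyo", "paris"])
print(codes)          # [2 2 1]

# Round trip back to the original strings.
restored = le2.inverse_transform(codes).tolist()
print(restored)       # ['tokyo', 'tokyo', 'paris']
```

Note that the codes follow alphabetical order, not the order in which the labels first appear in the data.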

The section above introduces the utility class `LabelEncoder`, which encodes categorical integer or string labels as values in [0, n_classes-1].

### `LabelBinarizer`

`LabelBinarizer` is a utility class to help transform multi-class labels to binary labels.

Load the utility class `LabelBinarizer` and use an object `lb` to store it. Then fit the label binarizer using a sample data array.

In [30]:
lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

The `fit` method records (in the object `lb`) the distinct classes found in the data and holds (in the attribute `classes_`) the label for each class.

In [32]:
lb.classes_

The `transform` method converts multi-class labels to binary labels, where `1` indicates belonging to the corresponding class and `0` indicates not belonging to the class.

In [34]:
lb.transform([1, 6])
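
As a self-contained sketch of the step above, the following shows the indicator matrix that results for the sample data. Each row corresponds to one input label and each column to one entry of `classes_`; the variable name `indicator` is introduced here for illustration.

```python
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

print(lb.classes_)  # [1 2 4 6]

# One row per input label, one column per class in lb.classes_.
indicator = lb.transform([1, 6])
print(indicator)
# [[1 0 0 0]
#  [0 0 0 1]]
```

The label `1` is the first class, so its row has a `1` in the first column; the label `6` is the fourth class, so its row has a `1` in the last column.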

`LabelBinarizer` can also be used to transform non-numerical labels to a label indicator matrix.

Load the utility class `LabelBinarizer` and use another object `lb2` to store it. Then fit the label binarizer using string data.

In [37]:
lb2 = preprocessing.LabelBinarizer()
lb2.fit(["paris", "paris", "tokyo", "amsterdam"])

The `fit` method records (in the object `lb2`) the distinct classes found in the data and holds (in the attribute `classes_`) the label for each class. Check the labels in a list:

In [39]:
list(lb2.classes_)

The `transform` method converts the new data from three-class labels to binary labels using the 1-of-K coding scheme.

In [41]:
y_binarize = lb2.transform(["tokyo", "tokyo", "paris"])
y_binarize

The `inverse_transform` method transforms binary labels back to multi-class labels.

In [43]:
lb2.inverse_transform(y_binarize)
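
The round trip for the city names can be sketched as follows, using the binarizer that was fitted on the strings (`lb2`). The last part is an illustrative addition with made-up `"yes"`/`"no"` labels: it assumes we also want to see the two-class case, where `LabelBinarizer` returns a single column rather than one column per class.

```python
from sklearn import preprocessing

lb2 = preprocessing.LabelBinarizer()
lb2.fit(["paris", "paris", "tokyo", "amsterdam"])

# Columns follow lb2.classes_: amsterdam, paris, tokyo.
y_binarize = lb2.transform(["tokyo", "tokyo", "paris"])
print(y_binarize)
# [[0 0 1]
#  [0 0 1]
#  [0 1 0]]

# inverse_transform must be called on the binarizer that produced the matrix.
restored = lb2.inverse_transform(y_binarize).tolist()
print(restored)  # ['tokyo', 'tokyo', 'paris']

# Illustrative addition: with only two classes the output has a single column.
lb_binary = preprocessing.LabelBinarizer()
y_two_class = lb_binary.fit_transform(["yes", "no", "yes"])
print(y_two_class.shape)  # (3, 1)
```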

The section above introduces using `LabelBinarizer` to transform integer or string multi-class labels to binary labels.

## 2. Transforming string variables

The utility classes introduced above (`LabelEncoder` and `LabelBinarizer`) can also be useful when transforming string variables in preprocessing data for machine learning.

Load libraries.

In [48]:
import pandas as pd
import numpy as np

Create a sample dataset.

In [50]:
testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                         'age': [3, 6, 3, 1],
                         'size': [4, 5, 2, 1]})
testdata

Create a `LabelBinarizer` object and store it in the `binarizer` variable. Use the "pet" column in the sample data to fit the `LabelBinarizer` object `binarizer`. The `fit` method records (in the `binarizer` object) the number of classes of the feature.

In [52]:
binarizer = preprocessing.LabelBinarizer()
binarizer.fit(testdata[["pet"]])

Transform "pet" data to a binary one-hot encoding. The output has 3 binary variables and 4 observations because the "pet" feature has 3 distinct values and 4 observations.

In [54]:
pet_onehot = binarizer.transform(testdata[["pet"]])
pet_onehot, pet_onehot.shape
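
In practice the binary columns usually need to be joined back onto the remaining features. The following is a minimal sketch of one way to do that with pandas; the `pet_<value>` column names and the variables `onehot_df` and `encoded` are assumptions made for illustration, and the 1-D column `testdata['pet']` is passed to the binarizer.

```python
import pandas as pd
from sklearn import preprocessing

testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                         'age': [3, 6, 3, 1],
                         'size': [4, 5, 2, 1]})

binarizer = preprocessing.LabelBinarizer()
onehot = binarizer.fit_transform(testdata['pet'])  # shape (4, 3)

# Hypothetical column names of the form pet_<value>, one per class.
onehot_df = pd.DataFrame(onehot,
                         columns=[f"pet_{c}" for c in binarizer.classes_],
                         index=testdata.index)

# Replace the string column with its binary columns.
encoded = pd.concat([testdata.drop(columns='pet'), onehot_df], axis=1)
print(encoded)
```

Reusing the index of `testdata` keeps the rows aligned when the two frames are concatenated.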

This section shows an example of applying the `LabelBinarizer` class to transform string variables.