In [8]:
import pandas as pd
from numpy import asarray
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from category_encoders.binary import BinaryEncoder

# Machine Learning Mastery
    
    AUTHOR: Dr. Jason Brownlee 

### Ordinal and One-Hot Encodings for Categorical Data

## Nominal and ordinal variables

**Numerical data**, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

**Categorical data** are variables that contain label values rather than numeric values; they are often called **nominal** variables. The number of possible values is often limited to a fixed set.

>**Ordinal variables** are categorical variables whose values have a natural order and can therefore be ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin.
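As a minimal sketch of this binning step, `pandas.cut` can turn a numeric column into ordered bins. The ages, bin edges, and labels below are made up for illustration and are not taken from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration
ages = pd.Series([23, 37, 45, 52, 68, 71])

# Bin edges and labels are assumptions; right=False makes each bin [low, high)
bins = [20, 40, 60, 80]
labels = ['20-39', '40-59', '60-79']
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# The result is an ordered categorical; .cat.codes gives the rank of each bin
codes = age_group.cat.codes.tolist()
print(age_group.tolist())
print(codes)
```

The integer codes follow the order of the bins, so the binned variable can be treated as ordinal.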

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric. This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

There are three common approaches for converting ordinal and categorical variables to numerical values:

- Ordinal Encoding
- One-Hot Encoding
- Dummy variable encoding

## Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value. This is called an **ordinal encoding or an integer encoding** and is easily reversible. Often, integer values starting at zero are used.

>For example, “red” is 1, “green” is 2, and “blue” is 3.

It is a natural encoding for ordinal variables. **For categorical variables, it imposes an ordinal relationship where no such relationship may exist**. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the `scikit-learn` Python machine learning library via the `OrdinalEncoder` class.

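By default, the categories of each feature are determined from the training data and sorted (alphabetically for strings), and integers are assigned in that sorted order; this is why “blue” becomes 0, “green” 1, and “red” 2 in the example below. If a specific order is desired, it can be specified via the `categories` argument as a list, one per feature, with the rank order of all expected labels.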
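A short sketch of passing an explicit rank order through `categories` follows. The `sizes` feature and its labels are hypothetical and serve only to show the argument in use.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with a known rank order
sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# One list of categories per input column, given in the desired rank order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())

# inverse_transform maps the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```

With this ordering, 'small' maps to 0, 'medium' to 1, and 'large' to 2, regardless of the order in which the labels appear in the data.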

In [2]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]


If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the `LabelEncoder` class can be used. It does the same thing as the `OrdinalEncoder`, although it expects a one-dimensional input for the single target variable.

In [3]:
# Sample target data
y = np.array(['cat', 'dog', 'bird', 'cat', 'dog'])
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)  # [1 2 0 1 2] (classes are sorted: bird=0, cat=1, dog=2)

# Convert back to original labels
y_original = encoder.inverse_transform(y_encoded)
print(y_original)  # ['cat' 'dog' 'bird' 'cat' 'dog']

[1 2 0 1 2]
['cat' 'dog' 'bird' 'cat' 'dog']


## One-Hot Encoding

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results. 

In this case, a one-hot encoding can be applied to the ordinal representation. This is where **the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable**.
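The step from integer codes to binary columns can be sketched by hand with NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the mapping blue=0, green=1, red=2 is assumed for the example.

```python
import numpy as np

# Integer (ordinal) codes for three samples, assuming blue=0, green=1, red=2
codes = np.array([2, 1, 0])

# Row i of the identity matrix is the one-hot vector for integer i,
# so indexing the identity matrix with the codes yields one row per sample
onehot = np.eye(3)[codes]
print(onehot)
```

Each row contains a single 1 in the column that corresponds to its integer code, and the integer column itself is no longer needed.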

In [5]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


## Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category. The problem is that this representation includes redundancy. For example...
> if we know that `[1, 0, 0]` represents “blue” and `[0, 1, 0]` represents “green”, we don’t need another binary variable to represent “red”. Instead, we could use 0 values for both “blue” and “green”, e.g. `[0, 0]`.

This is called a **dummy variable encoding**, and always represents C categories with C-1 binary variables.

A dummy variable representation is required for some models. For example, in the case of a **linear regression model** (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.
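The redundancy can be checked numerically. The sketch below builds two small design matrices by hand (the four samples are made up) and compares their ranks: with a bias column, the full one-hot columns sum to the bias column, so one column is linearly dependent on the others.

```python
import numpy as np

# One-hot columns for three categories (blue, green, red), one row per sample
onehot = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]], dtype=float)
bias = np.ones((onehot.shape[0], 1))

# Bias column plus the full one-hot encoding: 4 columns
X_onehot = np.hstack([bias, onehot])
# Bias column plus a dummy encoding (first one-hot column dropped): 3 columns
X_dummy = np.hstack([bias, onehot[:, 1:]])

rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(X_onehot.shape[1], rank_onehot)  # more columns than rank: X^T X is singular
print(X_dummy.shape[1], rank_dummy)    # full column rank: X^T X is invertible
```

Dropping one category removes the dependency, which is what `drop='first'` does in `OneHotEncoder`.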

In [7]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define dummy variable encoding (one-hot with the first category dropped)
encoder = OneHotEncoder(drop='first', sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]


## Binary Encoding

Binary encoding is a categorical encoding technique that uses binary code – that is, a sequence of zeroes and ones – to represent the different categories of the variable. 

Binary encoding uses fewer dimensions than one-hot encoding. In general, the number of binary features needed to encode a variable is $\lceil \log_2(\text{number of distinct categories}) \rceil$, that is, the base-2 logarithm rounded up to the next integer. This is particularly useful for high-cardinality variables. For example, if a variable contains 128 unique categories, one-hot encoding needs 128 features (127 with a dummy variable encoding), whereas binary encoding needs only 7 ($\log_2(128)=7$). An implementation that numbers its categories from 1 rather than 0, as the `BinaryEncoder` output below does, may need one more column than this formula gives.
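The arithmetic can be sketched with the standard library alone. The helper `binary_width` is a hypothetical name introduced here, and the manual encoding numbers the labels from 1 only because the `BinaryEncoder` output in this notebook encodes the first label as 0, 0, 1; that is an observation from the printed output, not a statement about the library's internals.

```python
import math

def binary_width(n_codes):
    """Number of bits needed to give n_codes values distinct binary codes."""
    return max(1, math.ceil(math.log2(n_codes)))

# Column counts for a variable with 128 categories
n = 128
widths = {'one_hot': n, 'dummy': n - 1, 'binary': binary_width(n)}
print(widths)

# Manual binary encoding of seven labels, numbering them from 1
labels = ['red', 'green', 'blue', 'purple', 'pink', 'black', 'brown']
width = binary_width(len(labels) + 1)  # +1 because code 0 is left unused
codes = {label: format(i, f'0{width}b')
         for i, label in enumerate(labels, start=1)}
print(codes)
```

Each label receives a distinct three-bit string, and each bit position would become one column of the encoded data.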

In [16]:
# define data
data = asarray([['red'], ['green'], ['blue'],['purple'],['pink'],['black'],['brown']])
# define binary encoding
encoder = BinaryEncoder()
# transform data
result = encoder.fit_transform(data)
# display as a dataframe
import pandas as pd
result_df = pd.DataFrame(result)
display(result_df)  # more explicit display in notebook

# how to get the original labels back
# inverse_transform lets us recover the original labels from the binary encoding
original_labels = encoder.inverse_transform(result)
print("Recovered labels:")
print(original_labels)

| | 0_0 | 0_1 | 0_2 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 1 |
| 5 | 1 | 1 | 0 |
| 6 | 1 | 1 | 1 |


Recovered labels:
        0
0     red
1   green
2    blue
3  purple
4    pink
5   black
6   brown


# Breast Cancer Dataset
## OrdinalEncoder Transform

In [19]:
# load dataset
dataset = pd.read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)

# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]


In [20]:
# Same example as above, this time working with the DataFrame directly
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# Columns to be encoded are specified
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
ordinal_encoder = OrdinalEncoder()
ordinal_ar = ordinal_encoder.fit_transform(dataset[categorical_variables])
# The returned ndarray is converted to a dataframe
ordinal_df = pd.DataFrame(ordinal_ar, columns=categorical_variables)
ordinal_df

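| | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 2.0 | 1.0 | 2.0 | 0.0 |
| 1 | 3.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 3.0 | 0.0 | 6.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.0 | 2.0 | 6.0 | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| 4 | 2.0 | 2.0 | 5.0 | 4.0 | 1.0 | 1.0 | 0.0 | 4.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 3.0 | 0.0 | 5.0 | 5.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 282 | 3.0 | 2.0 | 4.0 | 4.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 283 | 1.0 | 2.0 | 5.0 | 5.0 | 1.0 | 1.0 | 1.0 | 4.0 | 0.0 |
| 284 | 3.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |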


# Breast Cancer Dataset
## OneHotEncoder Transform

In [23]:
# load dataset
dataset = pd.read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# one-hot encode input variables, dropping the first category of each (dummy encoding)
onehot_encoder = OneHotEncoder(drop='first', sparse_output=False)
X = onehot_encoder.fit_transform(X)

# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

Input (286, 34)
[[0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 1. 1. 0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 1. 1. 1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.
  0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]]


In [26]:
# Same example as above, this time working with the DataFrame directly
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# Columns to be encoded are specified
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
onehot_encoder = OneHotEncoder(drop='first', sparse_output=False)
onehot_ar = onehot_encoder.fit_transform(dataset[categorical_variables])
# The returned ndarray is converted to a dataframe
onehot_df = pd.DataFrame(onehot_ar)
onehot_df.columns = onehot_encoder.get_feature_names_out()
onehot_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,276,277,278,279,280,281,282,283,284,285
age_'30-39',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
age_'40-49',1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
age_'50-59',0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0
age_'60-69',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
age_'70-79',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
menopause_'lt40',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
menopause_'premeno',1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
tumor-size_'10-14',0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
tumor-size_'15-19',1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
tumor-size_'20-24',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


# Breast Cancer Dataset
## BinaryEncoder Transform

In [None]:
dataset = pd.read_csv('breast-cancer.csv', header=None)
dataset.columns = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class']
# Columns to be encoded are specified
categorical_variables = ['age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']
binary_encoder = BinaryEncoder()
binary_df = binary_encoder.fit_transform(dataset[categorical_variables])
# fit_transform returns a DataFrame with named columns directly
binary_df

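| | age_0 | age_1 | age_2 | menopause_0 | menopause_1 | tumor-size_0 | tumor-size_1 | tumor-size_2 | tumor-size_3 | inv-nodes_0 | ... | node-caps_1 | deg-malig_0 | deg-malig_1 | breast_0 | breast_1 | breast-quad_0 | breast-quad_1 | breast-quad_2 | irradiat_0 | irradiat_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 282 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 283 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 284 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |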


In [None]:
print(dataset['age'].nunique())
print(dataset['age'].unique())

6
["'40-49'" "'50-59'" "'60-69'" "'30-39'" "'70-79'" "'20-29'"]


The 6 categories in `age` were transformed into:

*   (6 - 1) = 5 columns with one-hot encoding and `drop='first'` (`age_'30-39'`, `age_'40-49'`, `age_'50-59'`, `age_'60-69'`, `age_'70-79'`)
*   ceil(log2(6)) = ceil(2.5849...) = 3 columns with binary encoding (`age_0`, `age_1`, `age_2`)
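These counts can be reproduced without the CSV file. The sketch below uses `pandas.get_dummies` with `drop_first=True` as a stand-in for `OneHotEncoder(drop='first')` (a swapped-in technique, not the encoder used above) on a hand-typed column holding the six age labels, without the quote characters present in the raw file.

```python
import math
import pandas as pd

# The six age categories listed above, typed by hand without the inner quotes
ages = pd.DataFrame({'age': ['40-49', '50-59', '60-69', '30-39', '70-79', '20-29']})

# Dummy variable encoding: the first category in sorted order ('20-29') is dropped
dummies = pd.get_dummies(ages, drop_first=True)
print(list(dummies.columns))

# Number of binary columns given by the rounded-up base-2 logarithm
n_categories = ages['age'].nunique()
n_binary = math.ceil(math.log2(n_categories))
print(n_categories, dummies.shape[1], n_binary)
```

The dropped category is the one that comes first in sorted order, which is why no `20-29` column appears in the encoded output.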

