<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Feature_Engineering_Encode_Categorical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering: encoding categorical data

#**Encode Categorical Data**



Most machine learning models require all input and output variables to be numeric. This means
that if your data contains categorical data, you must encode it as numbers before you can fit
and evaluate a model. The two most popular techniques are ordinal encoding and one hot
encoding.

In this tutorial, you will learn:

* That encoding is a required pre-processing step when working with categorical data for machine
learning algorithms.
* How to use ordinal encoding for categorical variables that have a natural rank ordering.
* How to use one hot encoding for categorical variables that do not have a natural rank
ordering.


Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

This activity was prepared for the Practical AI/ML for Computational Biology and Chemistry Workshop (June 13-17, 2022, UD) https://github.com/udel-cbcb/al_ml_workshop

Code explanation has been enriched using [CS50.ai](https://cs50.ai/)

##Nominal and Ordinal Variables

* **Nominal Variable**. Variable comprises a finite set of discrete values with no rank-order
relationship between values.
* **Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked
ordering between values.
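
As a small illustration of the distinction, pandas can mark a categorical column as ordered or unordered. This is only a sketch: the color and size values below are made up for illustration and are not part of the workshop data.

```python
import pandas as pd

# Nominal: no rank order between the values (hypothetical example data)
color = pd.Categorical(['red', 'green', 'blue', 'green'])

# Ordinal: the values have a natural rank order, which we state explicitly
size = pd.Categorical(['medium', 'small', 'large', 'small'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)

print('color ordered?', color.ordered)
print('size ordered? ', size.ordered)
# Integer codes follow the position of each value in `categories`
print('size codes:', list(size.codes))
# An ordered categorical supports rank-based operations such as min and max
print('smallest size:', size.min())
```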

Some algorithms can work with categorical data directly. For example, a decision tree can
be learned directly from categorical data with no data transform required (this depends on
the specific implementation). Many machine learning algorithms cannot operate on label data
directly. They require all input variables and output variables to be numeric. In general, this is
mostly a constraint of the efficient implementation of machine learning algorithms rather than
a hard limitation of the algorithms themselves.

When an implementation requires numerical data, categorical data must be converted
to a numerical form. If the categorical variable is an output variable, you may also want to
convert predictions by the model back into a categorical form in order to present them or use
them in some application.

##Encoding Categorical Data

There are three common approaches for converting ordinal and nominal variables to numerical
values. They are:
* Ordinal Encoding
* One Hot Encoding
* Dummy Variable Encoding

###Ordinal Encoding
This is a preprocessing technique used for converting categorical data into numeric values while preserving their inherent ordering.
In ordinal encoding, each unique category value is assigned an integer value. An integer ordinal encoding is a natural encoding for ordinal variables. For nominal
variables, it imposes an ordinal relationship where no such relationship exists. This can
cause problems, and a one hot encoding may be used instead.

The following code uses the OrdinalEncoder class from the sklearn.preprocessing module to convert categorical data into numerical data. This is often necessary in machine learning, as many algorithms require numerical input.

* from numpy import asarray: imports the asarray function from the numpy module. asarray converts a given input into an array.
* from sklearn.preprocessing import OrdinalEncoder: imports the OrdinalEncoder class from the sklearn.preprocessing module.
* data = asarray([['red'], ['green'], ['blue']]): creates a numpy array with the colors 'red', 'green', and 'blue', one per row.
* encoder = OrdinalEncoder(): creates an instance of the OrdinalEncoder class.
* result = encoder.fit_transform(data): fits the encoder to the data and then transforms the data. The fit_transform method is a combination of the fit and transform methods. fit determines the encoding based on the provided data, and transform applies this encoding to the data.
* The transformed data is then printed out. The colors have been replaced with numbers, and each unique color is assigned a unique integer. By default the categories are sorted, so the integers follow alphabetical order (blue = 0, green = 1, red = 2). For colors, this order does not imply any rank or importance.


In [2]:
# This line imports the asarray function from the numpy module. asarray is used to convert a given input into an array.
from numpy import asarray
# This line imports the OrdinalEncoder class from the sklearn.preprocessing module
from sklearn.preprocessing import OrdinalEncoder
# This line creates a numpy array with the colors 'red', 'green', and 'blue'
data = asarray([['red'], ['green'], ['blue']])
print("Original data: \n",data)
# This line creates an instance of the OrdinalEncoder class
encoder = OrdinalEncoder()
# This line fits the encoder to the data and then transforms the data.
# The fit_transform method is a combination of the fit and transform methods.
# fit determines the encoding based on the provided data, and transform applies this encoding to the data.
result = encoder.fit_transform(data)
print("Encoded data: \n",result)

Original data: 
 [['red']
 ['green']
 ['blue']]
Encoded data: 
 [[2.]
 [1.]
 [0.]]


We can see that the integers are assigned to the labels in alphabetical order, as expected: blue is 0, green is 1, and red is 2.
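
When a variable does have a natural order, the default sorted (alphabetical) assignment will usually not match it. As a hedged sketch, the categories argument of OrdinalEncoder lets you state the order yourself. The low/medium/high values below are hypothetical and chosen only for illustration.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal variable; sorted order would give high=0, low=1, medium=2
data = asarray([['low'], ['high'], ['medium'], ['low']])

# one list of categories per input column, in the intended rank order
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
result = encoder.fit_transform(data)
print(result.ravel())
```

With the explicit list, low maps to 0, medium to 1, and high to 2, so the integer codes respect the intended ranking.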

This **OrdinalEncoder** class is intended for input variables that are organized into rows and
columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification
problem, then the **LabelEncoder** class can be used. It does the same
thing as the **OrdinalEncoder**, although it expects a one-dimensional input for the single target
variable.

In [3]:
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

data = asarray([['red'], ['green'], ['blue']])
data = data.ravel()  # Flatten the array
print("Original data: \n",data)

encoder = LabelEncoder()
result = encoder.fit_transform(data)
print("Encoded data: \n",result)

Original data: 
 ['red' 'green' 'blue']
Encoded data: 
 [2 1 0]
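
To convert model predictions back into a categorical form, a fitted LabelEncoder can reverse the mapping with inverse_transform. The sketch below uses made-up integer "predictions" in place of real model output.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

labels = asarray(['red', 'green', 'blue'])
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# pretend these integers are predictions from a model (hypothetical values)
predicted = asarray([0, 2, 2, 1])
# map the integers back to the original string labels
decoded = encoder.inverse_transform(predicted)
# classes_ lists the labels in the order of their integer codes
print(encoder.classes_)
print(decoded)
```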


###One Hot Encoding
For categorical variables where no ordinal relationship exists, the integer encoding may be
insufficient or even misleading to the model. Forcing an ordinal relationship via an ordinal encoding
and allowing the model to assume a natural ordering between categories may result in poor
performance or unexpected results (such as predictions halfway between categories). In this case, a one
hot encoding can be applied instead. Here the integer encoded
variable is removed and one new binary variable is added for each unique value of the
variable.

The code below demonstrates one hot encoding, a common technique used in machine learning to handle categorical data.

* It starts by importing the necessary libraries. asarray is used to convert input to an array, and OneHotEncoder is used for the encoding process.
* An array named data is created with three elements: 'red', 'green', and 'blue'.
* An instance of OneHotEncoder is created with dense output requested, so that it returns a regular numpy array instead of a sparse matrix. The argument is sparse_output=False in scikit-learn 1.2 and later; it was named sparse in older releases.
* The fit_transform method is used to fit the encoder to the data and then transform the data into a one hot encoded format.
* The one hot encoded data is then printed out. For each unique value in the original data, one hot encoding creates a new binary column in the transformed data. In this case, 'red', 'green', and 'blue' each get their own column, and a '1' is placed in the column that corresponds to the original value, with '0's in the other columns.


In [None]:
# example of a one hot encoding
from numpy import asarray
# Encode categorical features as a one-hot numeric array.
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
# sparse_output=False returns a dense numpy array instead of a sparse matrix
# (the argument was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# Fit OneHotEncoder to data, then transform data.
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


We can see that the one hot encoding matches our expectation of three binary variables, with the columns in the order blue, green, and red.
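
Dummy variable encoding, the third approach listed earlier, represents C categories with C-1 binary variables by dropping one of the one hot columns; the dropped category is then encoded as all zeros. A minimal sketch, assuming the drop='first' argument of OneHotEncoder. The call to toarray() converts the default sparse output to a dense array, so the snippet does not depend on the name of the sparse argument.

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
# drop='first' removes the column of the first category (blue, in sorted order)
encoder = OneHotEncoder(drop='first')
# the default output is a sparse matrix; convert to a dense array for printing
dummy = encoder.fit_transform(data).toarray()
print(dummy)
```

The two remaining columns correspond to green and red, and blue is represented by the row of zeros.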

#Now with a real example:
##Breast Cancer Categorical Dataset
Going back to our example in [Basic Data Cleaning](https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Basic_Data_Cleaning.ipynb), we use the breast cancer dataset, which classifies breast cancer
patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine
input variables. It is a binary classification problem. A naive model can achieve an accuracy
of 70 percent on this dataset. A good score is about 76 percent.

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))

###Download Breast Cancer data files

In [15]:
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" -O breast-cancer.csv
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names" -O breast-cancer.names
!head breast-cancer.csv

--2024-04-26 21:29:34--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24373 (24K) [text/plain]
Saving to: ‘breast-cancer.csv’


2024-04-26 21:29:34 (13.3 MB/s) - ‘breast-cancer.csv’ saved [24373/24373]

--2024-04-26 21:29:34--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3162 (3.1K) [text/plain]
Saving to: ‘breast-cancer.names’


2024-04-26 21:29:34 (21.8 MB/s) - ‘brea

In [16]:
# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)

Input (286, 9)
Output (286,)


We can see that we have 286 examples and nine input variables.



###Applying OrdinalEncoder Transform to the dataset
As mentioned before, an ordinal encoding maps each unique label to an integer value. This type of
encoding is really only appropriate if there is a known ordering between the categories. Such an
ordering does exist for some of the variables in our dataset, and ideally, this should be
harnessed when preparing the data. In this case, we will ignore any possible existing ordinal
relationship and treat all variables as plain categorical variables. It can still be helpful to use an ordinal
encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers.

In [17]:
# ordinal encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]


We would expect the number of rows and, in this case, the number of columns to be unchanged,
with all string values replaced by integer codes (stored as floating-point numbers). As expected, the
shape is still 286 rows by nine columns, and all values are now ordinal encoded integers.

Next, let's evaluate machine learning on this dataset with this encoding. The best practice
when encoding variables is to fit the encoding on the training dataset, then apply it to the train
and test datasets. We will first split the dataset, then prepare the encoding on the training set,
and apply it to the test set.

Here is a summary of the different steps in the code below:

* The code starts by importing the necessary libraries and functions. LabelEncoder and OrdinalEncoder are used for preprocessing categorical data, and accuracy_score is used to evaluate the model's performance.
* The read_csv function is used to load a dataset from a CSV file named 'breast-cancer.csv'.
* The data is then split into features (X) and the target variable (y). The features are all columns except the last one, and the target is the last column.
* The dataset is split into training and testing sets using the train_test_split function.
* The OrdinalEncoder is fit on the training features and used to convert the categorical features of both sets into integer values.
* The LabelEncoder is fit on the training target and used to convert the categorical target variable of both sets into integer values.
* A logistic regression model is created and trained using the training data.
* The trained model is used to make predictions on the test data.
* Finally, the accuracy of the model is calculated by comparing the predicted values with the actual values in the test set, and the accuracy is printed out.


In [19]:
# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 75.79


In this case, the model achieved a classification accuracy of about 75.79 percent, which is a
reasonable score.
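
One practical caveat of fitting the encoder on the training set only: a category that appears only in the test set is unknown to the encoder, and by default transform raises an error. The sketch below shows one possible way to handle this, assuming scikit-learn 0.24 or later (for the handle_unknown and unknown_value arguments of OrdinalEncoder) and using made-up values rather than the breast cancer data.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical train/test split where 'purple' never appears in training
X_train = asarray([['red'], ['green'], ['blue']])
X_test = asarray([['green'], ['purple']])

# encode categories not seen during fit as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(X_train)
X_test_encoded = encoder.transform(X_test)
print(X_test_encoded)
```

Whether -1 is a sensible stand-in depends on the model; it is used here only to make the unknown category visible.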

###Applying OneHotEncoder Transform
As mentioned before, one hot encoding is appropriate for categorical data where no relationship exists between
categories. The scikit-learn library provides the OneHotEncoder class to automatically one hot
encode one or more variables. By default, the OneHotEncoder outputs data in a sparse
representation, which is efficient given that most values are 0 in the encoded representation.
We will disable this feature by requesting dense output (the sparse_output=False argument in scikit-learn 1.2 and later, named sparse in older releases) so that we can review the
effect of the encoding. Once defined, we can call the fit_transform() function and pass it
our dataset to create a one hot encoded version of our dataset.

In [None]:
# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one hot encode input variables
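# sparse_output=False returns a dense array (the argument was named sparse before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)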
X = onehot_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

Input (286, 43)
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]]


We would expect the number of rows to remain the same, but the number of columns to
dramatically increase. As expected, in this case, we can see that the number of variables has
leaped up from 9 to 43 and all values are now binary values 0 or 1.
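
The number of one hot columns equals the total number of distinct values across the input columns. The following sketch checks this on a small hypothetical two-column array (not the breast cancer data).

```python
from numpy import asarray, unique
from sklearn.preprocessing import OneHotEncoder

# hypothetical data: column 0 has 3 distinct values, column 1 has 2
X = asarray([['red', 'yes'],
             ['green', 'no'],
             ['blue', 'yes'],
             ['red', 'no']])

# count the distinct values in each column
n_unique = [len(unique(X[:, i])) for i in range(X.shape[1])]
print('Unique values per column:', n_unique)

# one binary column is created per distinct value
encoded = OneHotEncoder().fit_transform(X)
print('Encoded shape:', encoded.shape)
```

Applying the same count to the nine breast cancer input columns is one way to see where the 43 columns come from.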

Next, let's evaluate machine learning on this dataset with this encoding as we did in the
previous section. The encoding is fit on the training set then applied to both train and test sets
as before.

In [None]:
# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 70.53
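
The fit-on-train, apply-to-test pattern used above can also be expressed with a scikit-learn Pipeline, which fits the encoder only on the data passed to fit. This is an alternative arrangement and not the method used in this notebook. It is a sketch on a tiny made-up dataset, with handle_unknown='ignore' so that a category unseen in training is encoded as all zeros instead of raising an error.

```python
from numpy import asarray
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# hypothetical training data: one categorical input, binary target
X_train = asarray([['red'], ['red'], ['blue'], ['blue'], ['green'], ['green']])
y_train = asarray([1, 1, 0, 0, 1, 0])
# 'purple' was never seen during training
X_test = asarray([['red'], ['blue'], ['purple']])

pipeline = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('model', LogisticRegression()),
])
# fit() fits the encoder and the model on the training data only
pipeline.fit(X_train, y_train)
# predict() reuses the fitted encoder on the test data
yhat = pipeline.predict(X_test)
print(yhat)
```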


Which encoder worked better?