# Practical methods for encoding categorical features

Encoding _categorical (discrete) features_ as numerical values is a typical first step in preparing input data for machine learning. Ordinal (aka integer) encoding is commonly used to convert __ordinal (ordered)__ categorical features to numerical values. For __nominal (unordered)__ categorical features, there are many ways to produce values that machine learning algorithms can use. Let's explore a few simple and common methods: one-hot (aka dummy) encoding, frequency encoding, target (aka mean) encoding, binary encoding, and hashing encoding.

In [33]:
# Import modules
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from pprint import pprint

## Load and clean data

Let's work with the legendary Titanic dataset.

In [34]:
# Read CSV file into DataFrame
df = pd.read_csv("../datasets/titanic.csv")
print("Input dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))
df.head(10)

Input dataset has 891 rows and 12 columns


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [35]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)
print("Input dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))

Input dataset has 891 rows and 12 columns


In [36]:
# Define the features, X, and labels, y
features = df.drop(["PassengerId", "Survived", "Name"], axis="columns")
labels = df["Survived"]
print("Input dataset has {} samples, {} features, and {} labels".format(features.shape[0], features.shape[1], np.size(np.unique(labels))))

Input dataset has 891 samples, 9 features, and 2 labels


In [37]:
colnames_dtypes_dict = features.dtypes.to_dict()
for colname, dtype in colnames_dtypes_dict.items():
    print("Column {} has data type {}".format(colname, dtype))

Column Pclass has data type int64
Column Sex has data type object
Column Age has data type float64
Column SibSp has data type int64
Column Parch has data type int64
Column Ticket has data type object
Column Fare has data type float64
Column Cabin has data type object
Column Embarked has data type object


In [38]:
colnames_nancounts_dict = features.isna().sum(axis="rows").to_dict()
for colname, nancount in colnames_nancounts_dict.items():
    if nancount > 0:
        print("Column {} has {} missing values".format(colname, nancount))

Column Age has 177 missing values
Column Cabin has 687 missing values
Column Embarked has 2 missing values


In [39]:
# Select columns by data type using the pandas dtype checks
colnames_wNaNs_isNum_list = [colname for colname, dtype in colnames_dtypes_dict.items() if pd.api.types.is_numeric_dtype(dtype) and colnames_nancounts_dict[colname] > 0]
colnames_wNaNs_isCat_list = [colname for colname, dtype in colnames_dtypes_dict.items() if pd.api.types.is_object_dtype(dtype) and colnames_nancounts_dict[colname] > 0]
print("Columns {} are numerical features which contain missing values".format(colnames_wNaNs_isNum_list))
print("Columns {} are categorical features which contain missing values".format(colnames_wNaNs_isCat_list))

Columns ['Age'] are numerical features which contain missing values
Columns ['Cabin', 'Embarked'] are categorical features which contain missing values


In [40]:
# Replace missing values
# For numerical data, fill NaN values using the median along each column
imp_num = SimpleImputer(missing_values=np.nan, strategy="median")
num_clean = imp_num.fit_transform(features[colnames_wNaNs_isNum_list])
# For categorical data, fill NaN values using the most frequent value along each column
imp_cat = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cat_clean = imp_cat.fit_transform(features[colnames_wNaNs_isCat_list])
# Re-assign cleaned-up columns to DataFrame
num_clean_list = [num_clean[:,idx_arr] for idx_arr in range(num_clean.shape[1])]
cat_clean_list = [cat_clean[:,idx_arr] for idx_arr in range(cat_clean.shape[1])]
for colname, colvals in zip(colnames_wNaNs_isNum_list+colnames_wNaNs_isCat_list, num_clean_list+cat_clean_list):
    features[colname] = colvals
print("Input dataset has {} missing values".format(features.isna().sum(axis="rows").sum(axis="rows")))

Input dataset has 0 missing values


## Do train-test split of inputs

In [41]:
# Split input data into random train (75%) and test (25%) subsets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25, random_state=1) # random state fixed for reproducible output
# Copy the train and test subsets to avoid the SettingWithCopyWarning in pandas
features_train, features_test, labels_train, labels_test = features_train.copy(), features_test.copy(), labels_train.copy(), labels_test.copy()
print("Training dataset has {} samples".format(features_train.shape[0]))
print("Testing dataset has {} samples".format(features_test.shape[0]))

Training dataset has 668 samples
Testing dataset has 223 samples


## Do feature engineering

In [42]:
# To preserve training data for use in a pipeline, copy the dataset
features_train_exampl = features_train.copy()

In [43]:
# Select columns by data type using the pandas dtype checks
colnames_isNum_list = [colname for colname, dtype in colnames_dtypes_dict.items() if pd.api.types.is_numeric_dtype(dtype)]
colnames_isCat_list = [colname for colname, dtype in colnames_dtypes_dict.items() if pd.api.types.is_object_dtype(dtype)]
print("Columns {} are numerical features".format(colnames_isNum_list))
print("Columns {} are categorical features".format(colnames_isCat_list))

Columns ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] are numerical features
Columns ['Sex', 'Ticket', 'Cabin', 'Embarked'] are categorical features


### Standardizing numerical features

Center and scale select numerical features (i.e., remove the mean and scale to unit variance).
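
As a minimal sketch of what "center and scale" means, the snippet below standardizes a toy array by hand. The values are hypothetical stand-ins for a column such as "Fare". It uses the population standard deviation (`ddof=0`), which is the estimator scikit-learn documents for `StandardScaler`.

```python
import numpy as np

# Hypothetical toy values standing in for a numerical column such as "Fare"
x = np.array([10.0, 20.0, 30.0, 40.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
mean = x.mean()
std = x.std(ddof=0)
z = (x - mean) / std

print(z)
```

After this transformation the values have zero mean and unit variance, which keeps features measured on large scales from dominating distance-based or regularized models.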

In [44]:
features_train_exampl[colnames_isNum_list].head(5)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
35,1,42.0,1,0,52.0
46,3,28.0,1,0,15.5
453,1,49.0,1,0,89.1042
291,1,19.0,1,0,91.0792
748,1,19.0,1,0,53.1


In [45]:
# Standardize the "Age" and "Fare" columns
std_scaler = StandardScaler()
features_scaled = std_scaler.fit_transform(features_train_exampl[["Age","Fare"]])
features_train_exampl[["Age","Fare"]] = features_scaled
features_train_exampl.head(5)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,male,0.930512,1,0,113789,0.444264,B96 B98,S
46,3,male,-0.125243,1,0,370371,-0.349514,B96 B98,Q
453,1,male,1.458389,1,0,17453,1.251182,C92,C
291,1,female,-0.803943,1,0,11967,1.294133,B49,C
748,1,male,-0.803943,1,0,113773,0.468186,D30,S


### Encoding categorical features

In [46]:
features_train_exampl[colnames_isCat_list].head(5)

Unnamed: 0,Sex,Ticket,Cabin,Embarked
35,male,113789,B96 B98,S
46,male,370371,B96 B98,Q
453,male,17453,C92,C
291,female,11967,B49,C
748,male,113773,D30,S


### Encoding _ordinal_ categorical features

In the case of ordinal variables (here, "Cabin", whose leading letter denotes the deck), use an ordinal (aka integer) encoding scheme.
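
An ordinal encoding is only meaningful if the integers follow the intended order. According to its documentation, `ce.OrdinalEncoder` accepts an explicit `mapping` for this purpose; without one, the assigned integers need not follow the natural order of the categories. The sketch below shows the idea with plain pandas. The deck letters and the A-to-G ordering are assumptions made for illustration, not something the dataset itself guarantees.

```python
import pandas as pd

# Hypothetical deck letters; the A-to-G order below is an assumption for illustration
decks = pd.Series(["C", "A", "G", "B", "A"])
deck_order = ["A", "B", "C", "D", "E", "F", "G"]

# An ordered Categorical assigns integer codes by position in deck_order
deck_cat = pd.Categorical(decks, categories=deck_order, ordered=True)
deck_codes = pd.Series(deck_cat.codes, index=decks.index) + 1  # start at 1 rather than 0

print(deck_codes.tolist())
```

The `+ 1` offset is only there to mirror the 1-based codes that appear elsewhere in this notebook.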

In [47]:
pprint("Cabin column had {} categories: {}".format(features_train_exampl["Cabin"].nunique(), features_train_exampl["Cabin"].value_counts().to_dict()))

("Cabin column had 113 categories: {'B96 B98': 522, 'C23 C25 C27': 4, 'E101': "
 "3, 'G6': 3, 'C92': 2, 'B20': 2, 'F2': 2, 'C126': 2, 'E121': 2, 'D20': 2, "
 "'E24': 2, 'B22': 2, 'B51 B53 B55': 2, 'D': 2, 'B58 B60': 2, 'C2': 2, 'E33': "
 "2, 'C83': 2, 'C68': 2, 'D36': 2, 'E44': 2, 'D33': 2, 'C22 C26': 2, 'F G73': "
 "2, 'C93': 2, 'B35': 2, 'C125': 2, 'D26': 2, 'C78': 2, 'F4': 2, 'B49': 2, "
 "'E10': 1, 'F E69': 1, 'A36': 1, 'E12': 1, 'C49': 1, 'C47': 1, 'A23': 1, "
 "'B5': 1, 'E8': 1, 'A14': 1, 'C99': 1, 'C52': 1, 'B39': 1, 'E49': 1, 'F38': "
 "1, 'B77': 1, 'C45': 1, 'C30': 1, 'B3': 1, 'D56': 1, 'C148': 1, 'D9': 1, "
 "'E63': 1, 'C123': 1, 'D10 D12': 1, 'A34': 1, 'E67': 1, 'C118': 1, 'C124': 1, "
 "'C128': 1, 'B69': 1, 'C46': 1, 'E40': 1, 'E34': 1, 'A19': 1, 'C62 C64': 1, "
 "'C70': 1, 'D7': 1, 'D47': 1, 'B28': 1, 'A31': 1, 'D46': 1, 'D19': 1, 'F33': "
 "1, 'D37': 1, 'A26': 1, 'C65': 1, 'A10': 1, 'B41': 1, 'C32': 1, 'C82': 1, 'F "
 "G63': 1, 'E38': 1, 'B73': 1, 'C87': 1, 'C50': 1, 'A7'

In [48]:
# First, reduce each "Cabin" value to the first letter that occurs in it (e.g., "C92" becomes "C")
def prep_cabinVar(feature_data, colname="Cabin"):
    feature_data = feature_data.copy()
    feature_data[colname] = feature_data[colname].str.extract(r"([a-zA-Z])", expand=False)
    return feature_data

In [49]:
features_train_exampl = prep_cabinVar(features_train_exampl)
pprint("Cabin column now has {} categories: {}".format(features_train_exampl["Cabin"].nunique(), features_train_exampl["Cabin"].value_counts().to_dict()))

("Cabin column now has 7 categories: {'B': 550, 'C': 48, 'E': 23, 'D': 22, "
 "'A': 12, 'F': 10, 'G': 3}")


In [50]:
ord_enc = ce.OrdinalEncoder(cols=["Cabin"], return_df=True)
features_train_exampl = ord_enc.fit_transform(features_train_exampl)
features_train_exampl.head(5)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,male,0.930512,1,0,113789,0.444264,1,S
46,3,male,-0.125243,1,0,370371,-0.349514,1,Q
453,1,male,1.458389,1,0,17453,1.251182,2,C
291,1,female,-0.803943,1,0,11967,1.294133,1,C
748,1,male,-0.803943,1,0,113773,0.468186,3,S


### Encoding _nominal_ categorical features

In [51]:
colnames_isCat_catcounts_dict = features_train_exampl[colnames_isCat_list].nunique(axis="rows").to_dict()
for colname, catcount in colnames_isCat_catcounts_dict.items():
    if colname != "Cabin":
        print("Column {} has {} categories".format(colname, catcount))

Column Sex has 2 categories
Column Ticket has 537 categories
Column Embarked has 3 categories


In [52]:
# To decide whether one-hot (aka dummy) encoding is practical,
# distinguish between low-cardinality and high-cardinality features.
# The higher the cardinality, the more unique categorical values there are,
# and thus the more new columns one-hot encoding requires.
# Here, features with at most 2 categories are treated as low-cardinality.
colnames_isCat_nomLowCard_list = [colname for colname, catcount in colnames_isCat_catcounts_dict.items() if catcount <= 2 and colname != "Cabin"]
colnames_isCat_nomHighCard_list = [colname for colname, catcount in colnames_isCat_catcounts_dict.items() if catcount > 2 and colname != "Cabin"]
print("Columns {} are nominal variables with low-cardinality".format(colnames_isCat_nomLowCard_list))
print("Columns {} are nominal variables with high-cardinality".format(colnames_isCat_nomHighCard_list))

Columns ['Sex'] are nominal variables with low-cardinality
Columns ['Ticket', 'Embarked'] are nominal variables with high-cardinality


In the case of __nominal variables with low-cardinality__ (i.e., "Sex"), use a one-hot (aka dummy) encoding scheme.
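
To make the scheme concrete, here is a small sketch using `pd.get_dummies` on a hypothetical three-row "Sex" column. It is an illustration of one-hot encoding in general, not a reproduction of `ce.OneHotEncoder`, which names its columns by index (`Sex_1`, `Sex_2`) rather than by category.

```python
import pandas as pd

# Hypothetical toy column
sex = pd.Series(["male", "female", "male"], name="Sex")

# One new 0/1 column per category; exactly one column is 1 in each row
onehot = pd.get_dummies(sex, prefix="Sex", dtype=int)

print(onehot)
```

Because every category becomes its own column, the width of the result grows with the number of categories, which is why this scheme is reserved for low-cardinality features.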

In [53]:
onehot_enc = ce.OneHotEncoder(cols=["Sex"], return_df=True)
features_train_exampl = onehot_enc.fit_transform(features_train_exampl)
features_train_exampl.head(5)

Unnamed: 0,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,1,0,0.930512,1,0,113789,0.444264,1,S
46,3,1,0,-0.125243,1,0,370371,-0.349514,1,Q
453,1,1,0,1.458389,1,0,17453,1.251182,2,C
291,1,0,1,-0.803943,1,0,11967,1.294133,1,C
748,1,1,0,-0.803943,1,0,113773,0.468186,3,S


In the case of __nominal variables with high-cardinality__ (i.e., "Ticket" and "Embarked"), let's try a few commonly used encoding schemes.

In [54]:
# First, reduce each "Ticket" value to its first run of three or more digits;
# tickets without such a run are assigned the placeholder "-9999"
def prep_tixVar(feature_data, colname="Ticket"):
    feature_data = feature_data.copy()
    feature_data[colname] = feature_data[colname].str.extract(r"([0-9]{3,})", expand=False)
    feature_data[colname] = feature_data[colname].fillna("-9999")
    return feature_data

In [55]:
features_train_exampl = prep_tixVar(features_train_exampl)
features_train_exampl.head(5)

Unnamed: 0,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,1,0,0.930512,1,0,113789,0.444264,1,S
46,3,1,0,-0.125243,1,0,370371,-0.349514,1,Q
453,1,1,0,1.458389,1,0,17453,1.251182,2,C
291,1,0,1,-0.803943,1,0,11967,1.294133,1,C
748,1,1,0,-0.803943,1,0,113773,0.468186,3,S


#### 1) Frequency encoding

Encode the categorical values with their counts (i.e., the number of samples associated with a particular category).
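
One point worth keeping in mind: a stateless function that counts categories in whatever data it receives will encode test data with test-set counts. A possible alternative, sketched below, learns the counts from the training column and reuses them for new data. The helper names `fit_freq_map` and `apply_freq_map`, the toy values, and the choice of 0 for unseen categories are all assumptions for illustration.

```python
import pandas as pd

def fit_freq_map(train_col):
    """Learn category counts from the training column only."""
    return train_col.value_counts().to_dict()

def apply_freq_map(col, freq_map, unseen_value=0):
    """Replace each category with its training count; unseen categories get unseen_value."""
    return col.map(freq_map).fillna(unseen_value).astype(int)

# Hypothetical toy data
train = pd.Series(["S", "C", "S", "Q", "S"])
test = pd.Series(["C", "S", "X"])  # "X" never appears in the training data

freq_map = fit_freq_map(train)
train_enc = apply_freq_map(train, freq_map)
test_enc = apply_freq_map(test, freq_map)

print(train_enc.tolist())
print(test_enc.tolist())
```

Note that two categories with the same count become indistinguishable after frequency encoding, which is a known trade-off of this scheme.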

In [56]:
def perform_freq_encode(feature_data, colnames_list):
    feature_data = feature_data.copy()
    for colname in colnames_list:
        cats_counts_dict = feature_data[colname].value_counts().to_dict()
        feature_data[colname] = feature_data[colname].map(cats_counts_dict)
    return feature_data

In [57]:
features_train_freq_enc = perform_freq_encode(features_train_exampl, colnames_isCat_nomHighCard_list)
print("Frequency encoding results in {} new columns".format(len(features_train_freq_enc.columns)-len(features_train_exampl.columns)))
features_train_freq_enc.head(5)

Frequency encoding results in 0 new columns


Unnamed: 0,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,1,0,0.930512,1,0,1,0.444264,1,488
46,3,1,0,-0.125243,1,0,1,-0.349514,1,60
453,1,1,0,1.458389,1,0,2,1.251182,2,120
291,1,0,1,-0.803943,1,0,2,1.294133,1,120
748,1,1,0,-0.803943,1,0,1,0.468186,3,488


#### 2) Target (aka mean) encoding

Encode the categorical values with the target mean (i.e., the mean value of the target values given a particular category).
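
The core computation can be sketched with a pandas `groupby`. The toy labels below are hypothetical, and falling back to the global mean for unseen categories is an assumption of this sketch. `ce.TargetEncoder` additionally blends each category mean with the global mean (see its `smoothing` and `min_samples_leaf` parameters), so its output will generally differ from these raw means.

```python
import pandas as pd

# Hypothetical toy training data
train = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Mean of the target for each category, learned from the training data
target_means = train.groupby("Embarked")["Survived"].mean()
global_mean = train["Survived"].mean()

train_enc = train["Embarked"].map(target_means)

# New data reuses the training means; unseen categories fall back to the global mean
test = pd.Series(["C", "Q", "X"])
test_enc = test.map(target_means).fillna(global_mean)

print(test_enc.tolist())
```

Since the encoding is derived from the labels, rare categories can memorize their targets, which is the overfitting risk that smoothing is meant to reduce.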

In [58]:
target_enc = ce.TargetEncoder(cols=colnames_isCat_nomHighCard_list, return_df=True)
features_train_target_enc = target_enc.fit_transform(features_train_exampl, labels_train)
print("Target encoding results in {} new columns".format(len(features_train_target_enc.columns)-len(features_train_exampl.columns)))
features_train_target_enc.head(5)

Target encoding results in 0 new columns


Unnamed: 0,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
35,1,1,0,0.930512,1,0,0.36976,0.444264,1,0.321721
46,3,1,0,-0.125243,1,0,0.36976,-0.349514,1,0.383333
453,1,1,0,1.458389,1,0,0.830502,1.251182,2,0.558333
291,1,0,1,-0.803943,1,0,0.830502,1.294133,1,0.558333
748,1,1,0,-0.803943,1,0,0.36976,0.468186,3,0.321721


#### 3) Binary encoding

Encode the categorical values as binary code, with each binary digit placed in its own new column.
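
The mechanism can be sketched in plain Python: assign each category an integer, then write that integer in binary with one digit per column. The toy values and the order-of-first-appearance numbering are assumptions of this sketch, and the exact number and order of columns produced by `ce.BinaryEncoder` may differ.

```python
# Hypothetical toy values
categories = ["S", "Q", "C", "S", "C"]

# Step 1: assign each category an integer in order of first appearance (starting at 1)
ordinal = {}
for cat in categories:
    if cat not in ordinal:
        ordinal[cat] = len(ordinal) + 1

# Step 2: number of binary digits needed for the largest integer
n_bits = max(ordinal.values()).bit_length()

# Step 3: write each integer in binary, one digit per column
encoded = [
    [int(bit) for bit in format(ordinal[cat], "0{}b".format(n_bits))]
    for cat in categories
]

for cat, row in zip(categories, encoded):
    print(cat, row)
```

The number of columns grows roughly with the base-2 logarithm of the number of categories, so a feature with hundreds of categories needs only around ten columns instead of hundreds.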

In [59]:
binary_enc = ce.BinaryEncoder(cols=colnames_isCat_nomHighCard_list, return_df=True)
features_train_binary_enc = binary_enc.fit_transform(features_train_exampl)
print("Binary encoding results in {} new columns".format(len(features_train_binary_enc.columns)-len(features_train_exampl.columns)))
features_train_binary_enc.head(5)

Binary encoding results in 12 new columns


Unnamed: 0,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Ticket_0,Ticket_1,Ticket_2,Ticket_3,...,Ticket_6,Ticket_7,Ticket_8,Ticket_9,Ticket_10,Fare,Cabin,Embarked_0,Embarked_1,Embarked_2
35,1,1,0,0.930512,1,0,0,0,0,0,...,0,0,0,0,1,0.444264,1,0,0,1
46,3,1,0,-0.125243,1,0,0,0,0,0,...,0,0,0,1,0,-0.349514,1,0,1,0
453,1,1,0,1.458389,1,0,0,0,0,0,...,0,0,0,1,1,1.251182,2,0,1,1
291,1,0,1,-0.803943,1,0,0,0,0,0,...,0,0,1,0,0,1.294133,1,0,1,1
748,1,1,0,-0.803943,1,0,0,0,0,0,...,0,0,1,0,1,0.468186,3,0,0,1


#### 4) Hashing encoding

Map the categorical values to a fixed number of columns, using a hash function to pick the column for each category. Because the number of columns is fixed in advance, different categories can collide in the same column.
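
A minimal sketch of the hashing trick, using only the standard library, is shown below. The helper names `hash_bucket` and `hash_encode`, the choice of MD5, and the 8 output columns are assumptions for illustration and are not meant to reproduce the exact output of `ce.HashingEncoder`.

```python
import hashlib

def hash_bucket(value, n_components=8):
    """Map a category to a column index via a deterministic hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=8):
    """Return one row per value with a 1 in the hashed column and 0 elsewhere."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

# Hypothetical ticket numbers; the first and third are identical
encoded = hash_encode(["113789", "370371", "113789"])

for row in encoded:
    print(row)
```

No vocabulary of categories has to be stored, and unseen categories are handled automatically, at the cost of occasional collisions when two categories hash to the same column.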

In [60]:
hash_enc = ce.HashingEncoder(n_components=32, cols=colnames_isCat_nomHighCard_list, return_df=True)
features_train_hash_enc = hash_enc.fit_transform(features_train_exampl)
print("Hashing encoding results in {} new columns".format(len(features_train_hash_enc.columns)-len(features_train_exampl.columns)))
features_train_hash_enc.head(5)

Hashing encoding results in 30 new columns


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_30,col_31,Pclass,Sex_1,Sex_2,Age,SibSp,Parch,Fare,Cabin
0,0,0,0,0,0,0,0,1,0,0,...,0,0,1,1,0,0.930512,1,0,0.444264,1
1,0,0,0,0,0,0,1,0,0,0,...,0,0,3,1,0,-0.125243,1,0,-0.349514,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,1.458389,1,0,1.251182,2
3,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,1,-0.803943,1,0,1.294133,1
4,0,0,0,0,0,1,0,0,0,0,...,0,0,1,1,0,-0.803943,1,0,0.468186,3


## Create and train model

In [61]:
# Build data preprocessing and modeling pipeline
def build_pipeline(encoder, classifier):
    if isinstance(encoder, str) and encoder == "freq_enc":
        tixVar_encoder = FunctionTransformer(perform_freq_encode, kw_args={"colnames_list": ["Ticket"]})
        embarkVar_encoder = FunctionTransformer(perform_freq_encode, kw_args={"colnames_list": ["Embarked"]})
    else:
        tixVar_encoder = encoder
        embarkVar_encoder = encoder
        
    cabinVar_preprocesser = FunctionTransformer(prep_cabinVar, kw_args={"colname": "Cabin"})
    tixVar_preprocesser = FunctionTransformer(prep_tixVar, kw_args={"colname": "Ticket"})
    cabinVar_transformer = Pipeline(steps=[("preprocess", cabinVar_preprocesser),
                                           ("encode", ce.OrdinalEncoder(return_df=True))])
    tixVar_transformer = Pipeline(steps=[("preprocess", tixVar_preprocesser),
                                         ("encode", tixVar_encoder)])

    feat_transformer = ColumnTransformer([("numVars", StandardScaler(), ["Age", "Fare"]),
                                          ("cabin", cabinVar_transformer, ["Cabin"]),
                                          ("sex", ce.OneHotEncoder(return_df=True), ["Sex"]),
                                          ("tix", tixVar_transformer, ["Ticket"]),
                                          ("embark", embarkVar_encoder, ["Embarked"])],
                                         remainder="passthrough")

    pipeline = Pipeline(steps=[("transform", feat_transformer),
                               ("classify", classifier)])
    
    return pipeline

In [62]:
# Fit different models according to the training data
# whose nominal categorical features are encoded using different schemes
encs = ["freq_enc", ce.TargetEncoder(return_df=True), ce.BinaryEncoder(return_df=True), ce.HashingEncoder(n_components=32, return_df=True)]
enc_names = ["Frequency encoding", "Target encoding", "Binary encoding", "Hashing encoding"]

clfs = [LogisticRegression(max_iter=500), SVC(kernel="linear"), LinearSVC(dual=False), SVC(kernel="rbf"), SVC(kernel="poly")]
clf_names = ["Logit regression", "Linear SVM", "LinearSVC", "RBF SVM", "Poly SVM"]

enc_clf_scores_dict = defaultdict(lambda: defaultdict(list))
for enc, enc_name in zip(encs, enc_names):
    for clf, clf_name in zip(clfs, clf_names):
        # Build the classifier from the training data
        ml_pipe = build_pipeline(enc, clf)
        ml_pipe.fit(features_train, labels_train)
        # Calculate the accuracy classification score on the testing data
        test_score = ml_pipe.score(features_test, labels_test)
        enc_clf_scores_dict[enc_name][clf_name].append(test_score)

## Evaluate model

Which nominal categorical encoding scheme yields the greatest mean accuracy on the testing data?

In [63]:
# For each encoding scheme, print out the classifier which maximizes testing accuracy
# (i.e., results in the largest fraction of correctly classified samples)
BEST_enc = ""
BEST_clf = ""
BEST_score = 0
for enc_name in enc_names:
    best_clf_key = max(enc_clf_scores_dict[enc_name], key=enc_clf_scores_dict[enc_name].get)
    best_score_val = enc_clf_scores_dict[enc_name][best_clf_key][0]
    if best_score_val > BEST_score:
        BEST_enc = enc_name
        BEST_clf = best_clf_key
        BEST_score = best_score_val
    print("For {}, {} yields the top accuracy score of {:.3f}".format(enc_name, best_clf_key, best_score_val))

For Frequency encoding, Logit regression yields the top accuracy score of 0.821
For Target encoding, Poly SVM yields the top accuracy score of 0.816
For Binary encoding, LinearSVC yields the top accuracy score of 0.807
For Hashing encoding, LinearSVC yields the top accuracy score of 0.830


In [64]:
print("Overall, the ({}, {}) combination yields the top accuracy score of {:.3f}".format(BEST_enc, BEST_clf, BEST_score))

Overall, the (Hashing encoding, LinearSVC) combination yields the top accuracy score of 0.830
