# Cardinality

The number of distinct labels (categories) in a categorical variable is known as its cardinality. A variable with a large number of labels is said to have __high cardinality__.

The variable "gender" contains only 2 labels, but a variable like "city" or "postcode" can contain a huge number of labels.
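As a minimal sketch of the definition above, the cardinality of each column can be counted with pandas `nunique()`. The toy DataFrame and its values are made up for illustration and are not part of the Titanic data used later.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "city": ["London", "Paris", "Madrid", "Berlin", "Rome"],
})

# nunique() counts the distinct labels in each column.
cardinality = toy.nunique()
print(cardinality)  # gender: 2, city: 5
```

Here "city" has as many labels as there are rows, which is the typical signature of a highly cardinal variable.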


## Is high cardinality a problem?

High cardinality poses the following challenges: 

- Variables with too many labels tend to dominate those with only a few labels, particularly in **decision tree-based** algorithms.

- High cardinality may introduce noise.

- Some of the labels may only be present in the training data set and not in the test set, so machine learning algorithms may over-fit to the training set.

- Some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

**Algorithms based on decision trees can be biased towards variables with high cardinality**.
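The bias can be sketched with purely random data. In the example below, neither feature carries any information about the target, yet the impurity-based feature importances of a random forest would be expected to favour the feature with many distinct values. The feature names, the sample size and the seeds are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 500

# Two noise features: one with a distinct value per row, one binary.
X = pd.DataFrame({
    "high_card": rng.permutation(n_rows),
    "low_card": rng.integers(0, 2, size=n_rows),
})

# Random binary target, unrelated to either feature.
y = rng.integers(0, 2, size=n_rows)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances)
```

The trees can keep splitting on `high_card` until every leaf is pure, so most of the impurity reduction is attributed to it even though it is noise.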

The rest of this notebook demonstrates the effect of high cardinality on model performance, using the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# The machine learning models.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# To evaluate the models.
from sklearn.metrics import roc_auc_score

# To separate data into train and test.
from sklearn.model_selection import train_test_split

In [2]:
DATA_PATH = "../datasets/Titanic/Titanic.csv"

data = pd.read_csv(DATA_PATH)
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [3]:
data = data.replace("?", np.nan)
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
# Let's inspect the cardinality: the number
# of different labels.

print(f'Categories of name: {len(data.name.unique())}')
print(f'Categories of gender: {len(data.sex.unique())}')
print(f'Categories of ticket: {len(data.ticket.unique())}')
print(f'Categories of Cabin: {len(data.cabin.unique())}')
print(f'Categories of Embarked: {len(data.embarked.unique())}')
print(f'Total number of passengers: {len(data)}')

Categories of name: 1307
Categories of gender: 2
Categories of ticket: 929
Categories of Cabin: 187
Categories of Embarked: 4
Total number of passengers: 1309
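Note that `unique()` returns NaN as one of the labels, so the counts above include missing values where they exist. The sketch below, on a made-up series, contrasts `len(unique())` with `nunique()`, which ignores NaN unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value; the values are invented.
port = pd.Series(["S", "C", "Q", np.nan, "S"])

n_with_nan = len(port.unique())          # NaN is counted as a label
n_without_nan = port.nunique()           # NaN is ignored by default
n_explicit = port.nunique(dropna=False)  # NaN is counted again

print(n_with_nan, n_without_nan, n_explicit)  # 4 3 4
```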


In [5]:
# Let's explore the unique labels of cabin.
data.cabin.unique()

array(['B5', 'C22 C26', 'E12', 'D7', 'A36', 'C101', nan, 'C62 C64', 'B35',
       'A23', 'B58 B60', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99',
       'C52', 'T', 'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4',
       'E40', 'B38', 'E24', 'B51 B53 B55', 'B96 B98', 'C46', 'E31', 'E8',
       'B61', 'B77', 'A9', 'C89', 'A14', 'E58', 'E49', 'E52', 'E45',
       'B22', 'B26', 'C85', 'E17', 'B71', 'B20', 'A34', 'C86', 'A16',
       'A20', 'A18', 'C54', 'C45', 'D20', 'A29', 'C95', 'E25', 'C111',
       'C23 C25 C27', 'E36', 'D34', 'D40', 'B39', 'B41', 'B102', 'C123',
       'E63', 'C130', 'B86', 'C92', 'A5', 'C51', 'B42', 'C91', 'C125',
       'D10 D12', 'B82 B84', 'E50', 'D33', 'C83', 'B94', 'D49', 'D45',
       'B69', 'B11', 'E46', 'C39', 'B18', 'D11', 'C93', 'B28', 'C49',
       'B52 B54 B56', 'E60', 'C132', 'B37', 'D21', 'D19', 'C124', 'D17',
       'B101', 'D28', 'D6', 'D9', 'B80', 'C106', 'B79', 'C47', 'D30',
       'C90', 'E38', 'C78', 'C30', 'C118', 'D36', 'D48', 'D47', '

In [6]:
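# Reduce the cardinality by capturing only the first letter of cabin.
# Note: astype(str) turns NaN into the string 'nan', so missing
# cabins receive the label 'n'.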

data['Cabin_reduced'] = data['cabin'].astype(str).str[0]
data[['cabin', 'Cabin_reduced']].head()

Unnamed: 0,cabin,Cabin_reduced
0,B5,B
1,C22 C26,C
2,C22 C26,C
3,C22 C26,C
4,C22 C26,C


In [7]:
print(f'Cabin: {len(data.cabin.unique())}')
print(f'Reduced cabin: {len(data.Cabin_reduced.unique())}')

Cabin: 187
Reduced cabin: 9
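In the cells above, `astype(str)` converts missing cabins to the string `'nan'`, so they are counted among the 9 reduced labels as `'n'`. A possible alternative, sketched below on a made-up series, keeps NaN while slicing and then assigns an explicit label. The label name `"Missing"` is an assumption, not something used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy cabin values, including one missing entry.
cabin = pd.Series(["B5", "C22 C26", np.nan, "E12", "C85"])

# .str[0] leaves NaN untouched; fillna then gives missing
# values their own, explicit label.
cabin_reduced = cabin.str[0].fillna("Missing")

print(cabin_reduced.tolist())  # ['B', 'C', 'Missing', 'E', 'C']
```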


In [8]:
# Split the data into training and testing sets.

use_cols = ['cabin', 'Cabin_reduced', 'sex']

X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols],    # x
    data['survived'],  # y  
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

## Uneven distribution of categories

When a variable is highly cardinal, some categories appear only in the training set, and others only in the testing set.

Categories present only in the training set may cause over-fitting. Categories present only in the testing set cannot be handled by the machine learning model, as they were not seen during training.
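One way to find such categories is a set difference between the labels of each split. The sketch below uses two made-up series rather than the Titanic split, and drops NaN first so that missing values are not treated as a label.

```python
import pandas as pd

# Toy train and test columns; the labels are invented.
train = pd.Series(["A1", "B2", "C3", "A1", "D4"])
test = pd.Series(["B2", "E5", "A1"])

train_labels = set(train.dropna().unique())
test_labels = set(test.dropna().unique())

# Labels seen in one split but not in the other.
only_in_train = sorted(train_labels - test_labels)
only_in_test = sorted(test_labels - train_labels)

print(only_in_train)  # ['C3', 'D4']
print(only_in_test)   # ['E5']
```

Set lookups also avoid recomputing `unique()` for every label, which the list comprehensions in the cells below do.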

In [9]:
# Labels present only in the training set.
# The ratio printed below divides the number of such labels
# by the total number of passengers.

unique_to_train_set = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()
]

unique_to_train_set_num = len(unique_to_train_set)
print(f"Unique to train set num: {unique_to_train_set_num}")


total_data = len(data)
print(f"Total: {total_data}")

print(f"Train unique ratio: {unique_to_train_set_num / total_data * 100:.2f} %")

Unique to train set num: 117
Total: 1309
Train unique ratio: 8.94 %


In [10]:
# labels present only in the test set.

unique_to_test_set = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()
]

unique_to_test_set_num = len(unique_to_test_set)
print(f"Unique to test set num: {unique_to_test_set_num}")

print(f"Test unique ratio: {unique_to_test_set_num / total_data * 100:.2f} %")

Unique to test set num: 37
Test unique ratio: 2.83 %


In [11]:
# Labels of the reduced variable present only in the train or test set.

unique_to_reduced_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

unique_to_reduced_train_set_num = len(unique_to_reduced_train_set)
print(f"Reduced train unique ratio: {unique_to_reduced_train_set_num / total_data * 100:.2f} %")

unique_to_reduced_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

# Count the labels of the reduced variable, not those of the original cabin variable.
unique_to_reduced_test_set_num = len(unique_to_reduced_test_set)
print(f"Reduced test unique ratio: {unique_to_reduced_test_set_num / total_data * 100:.2f} %")

Reduced train unique ratio: 0.08 %
Reduced test unique ratio: 0.00 %


In [12]:
cabin_dict = {k: i for i, k in enumerate(X_train.cabin.unique(), 0)}
cabin_dict

{nan: 0,
 'E36': 1,
 'C68': 2,
 'E24': 3,
 'C22 C26': 4,
 'D38': 5,
 'B50': 6,
 'A24': 7,
 'C111': 8,
 'F': 9,
 'C6': 10,
 'C87': 11,
 'E8': 12,
 'B45': 13,
 'C93': 14,
 'D28': 15,
 'D36': 16,
 'C125': 17,
 'B35': 18,
 'T': 19,
 'B73': 20,
 'B57 B59 B63 B66': 21,
 'A26': 22,
 'A18': 23,
 'B96 B98': 24,
 'G6': 25,
 'C78': 26,
 'C101': 27,
 'D9': 28,
 'D33': 29,
 'C128': 30,
 'E50': 31,
 'B26': 32,
 'B69': 33,
 'E121': 34,
 'C123': 35,
 'B94': 36,
 'A34': 37,
 'D': 38,
 'C39': 39,
 'D43': 40,
 'E31': 41,
 'B5': 42,
 'D17': 43,
 'F33': 44,
 'E44': 45,
 'D7': 46,
 'A21': 47,
 'D34': 48,
 'A29': 49,
 'D35': 50,
 'A11': 51,
 'B51 B53 B55': 52,
 'D46': 53,
 'E60': 54,
 'C30': 55,
 'D26': 56,
 'E68': 57,
 'A9': 58,
 'B71': 59,
 'D37': 60,
 'F2': 61,
 'C55 C57': 62,
 'C89': 63,
 'C124': 64,
 'C23 C25 C27': 65,
 'C126': 66,
 'E49': 67,
 'F E46': 68,
 'E46': 69,
 'D19': 70,
 'B58 B60': 71,
 'C82': 72,
 'B52 B54 B56': 73,
 'C92': 74,
 'E45': 75,
 'F G73': 76,
 'C65': 77,
 'E25': 78,
 'B3': 79,
 'D

In [13]:
# Replace the labels in Cabin with the dictionary
# we just created.

X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'cabin']].head(10)

Unnamed: 0,Cabin_mapped,cabin
501,0,
588,0,
402,0,
1193,0,
686,0,
971,0,
117,1,E36
540,0,
294,2,C68
261,3,E24


In [14]:
# Now I will replace the letters in the reduced cabin variable
# using the same procedure.

# Create replacement dictionary.
cabin_reduced_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# Replace labels by numbers using the dictionary.
X_train['Cabin_reduced'] = X_train['Cabin_reduced'].map(cabin_reduced_dict)
X_test['Cabin_reduced'] = X_test['Cabin_reduced'].map(cabin_reduced_dict)

In [15]:
# Replace the labels of sex by numbers.
X_train['sex'] = X_train['sex'].map({'male': 0, 'female': 1})
X_test['sex'] = X_test['sex'].map({'male': 0, 'female': 1})

X_train.sex.head()

501     1
588     1
402     1
1193    0
686     1
Name: sex, dtype: int64

In [16]:
# check if there are missing values in these variables.

X_train[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     0
Cabin_reduced    0
sex              0
dtype: int64

In [17]:
X_test[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     42
Cabin_reduced     0
sex               0
dtype: int64

In the test set, there are now 42 missing values for the highly cardinal variable. These were introduced when encoding the categories into numbers.

**Many categories exist only in the test set**.

Thus, when we created our encoding dictionary using only the train set, we did not generate a number for the labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, I will fill in those missing values with 0. Note that 0 is also the number assigned to missing cabins (`nan: 0` in the dictionary above), so these observations are treated as if the cabin were missing.
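The mechanism can be reproduced on a small, made-up example: `map()` returns NaN for any label that is not a key of the dictionary. The sketch below also fills the gaps with a dedicated code instead of 0. The value `-1` is a hypothetical choice made here for illustration, not the approach used in this notebook.

```python
import pandas as pd

# Toy train and test columns; "Z9" appears only in the test set.
train = pd.Series(["A1", "B2", "C3"])
test = pd.Series(["B2", "Z9", "A1"])

# Dictionary built from the train set only.
mapping = {label: code for code, label in enumerate(train.unique())}

# Labels missing from the dictionary are encoded as NaN.
encoded = test.map(mapping)
n_unseen = int(encoded.isnull().sum())
print(n_unseen)  # 1

# Hypothetical dedicated code for unseen labels.
UNSEEN_CODE = -1
encoded_filled = encoded.fillna(UNSEEN_CODE).astype(int)
print(encoded_filled.tolist())  # [1, -1, 0]
```

A dedicated code keeps unseen labels apart from the existing ones, whereas filling with 0 merges them with whichever label already has that number.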

## Random Forests

In [18]:
# Model trained with data with high cardinality.

# The model.
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# Train the model.
rf.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8561832352985574
Test set
Random Forests roc-auc: 0.7707953099939163


The Random Forests perform markedly better on the training set than on the test set (ROC-AUC of 0.86 vs 0.77). This indicates that the model is over-fitting: it predicts the outcome well on the data it was trained on, but it generalises poorly to unseen data.
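The comparison between train and test performance can be wrapped in a small helper. The function name `roc_auc_gap` and the toy probabilities below are assumptions made for illustration. Only `roc_auc_score` comes from scikit-learn.

```python
from sklearn.metrics import roc_auc_score


def roc_auc_gap(y_train, p_train, y_test, p_test):
    """Return the train ROC-AUC, the test ROC-AUC and their difference."""
    train_auc = roc_auc_score(y_train, p_train)
    test_auc = roc_auc_score(y_test, p_test)
    return train_auc, test_auc, train_auc - test_auc


# Toy labels and predicted probabilities of the positive class.
train_auc, test_auc, gap = roc_auc_gap(
    [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9],   # perfectly ranked
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8],  # one pair ranked wrongly
)
print(train_auc, test_auc, gap)  # 1.0 0.75 0.25
```

A large positive gap suggests over-fitting, while a gap close to zero suggests that the model generalises about as well as it fits.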

In [19]:
# Model trained with data with low cardinality.

# The model.
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# Train the model.
rf.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8163420365403872
Test set
Random Forests roc-auc: 0.8017670482827277


## Summary

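Note that the Random Forests no longer over-fit to the training set: the train and test ROC-AUC are now close (0.82 vs 0.80), whereas the model trained on the highly cardinal variable showed a large gap (0.86 vs 0.77). The model trained on the reduced variable also generalises better, with a higher test ROC-AUC (0.80 vs 0.77).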