
 ### Predicting Survival on the Titanic

Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 people on board. By analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make reasonably accurate predictions about which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others, which also tells us something about society's priorities and privileges at the time.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [3]:

data = data.replace('?', np.nan)
data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

There are 263 missing values for Age, 1014 for Cabin and 2 for Embarked.

In [4]:
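def get_first_cabin(row):
    # keep only the first cabin when several are listed (e.g. 'C22 C26' -> 'C22')
    try:
        return row.split()[0]
    except AttributeError:
        # missing values are floats (NaN) and have no .split() method
        return np.nan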

data['cabin'] = data['cabin'].apply(get_first_cabin)
data.to_csv('titanic.csv', index=False)
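
As an aside, the same "keep the first cabin" step can probably be written without a helper function, using pandas' vectorised string methods. The sketch below runs on a small made-up Series (the values are illustrative, not taken from the dataset) and assumes the column holds strings or NaN only.

```python
import numpy as np
import pandas as pd

# toy stand-in for the cabin column (illustrative values)
cabin = pd.Series(['C22 C26', 'B5', np.nan, 'F G63'])

# split on whitespace and keep the first token; NaN stays NaN
first_cabin = cabin.str.split().str[0]

print(first_cabin.tolist())
```

Missing values pass through untouched, so no `try`/`except` is needed in this version.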

In [5]:
# alternatively, we can take the mean after isnull
# to obtain the fraction of missing values
# for each variable

data.isnull().mean()

pclass       0.000000
survived     0.000000
name         0.000000
sex          0.000000
age          0.200917
sibsp        0.000000
parch        0.000000
ticket       0.000000
fare         0.000764
cabin        0.774637
embarked     0.001528
boat         0.628724
body         0.907563
home.dest    0.430863
dtype: float64

There are missing data in the variables Age (20% missing), Cabin, the cabin in which the passenger was travelling (77% missing), and Embarked, the port at which the passenger boarded the Titanic (about 0.15% missing).
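
The counts and fractions shown in the two cells above can be combined into a single summary table. This is a minimal sketch on a made-up DataFrame; the column names `n_missing` and `pct_missing` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# toy data with a few missing values (illustrative only)
df = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', np.nan, np.nan, np.nan],
    'sex': ['female', 'male', 'female', 'male'],
})

# one row per variable: number and percentage of missing values
summary = pd.DataFrame({
    'n_missing': df.isnull().sum(),
    'pct_missing': (df.isnull().mean() * 100).round(1),
})

# most incomplete variables first
summary = summary.sort_values('pct_missing', ascending=False)
print(summary)
```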

### Missing data Not At Random (MNAR): Systematic missing values
In the Titanic dataset, the missing values of both age and cabin were introduced systematically. For many of the people who did not survive, the age or the cabin they were travelling in could not be established, whereas survivors could be asked for that information.

Can we infer this by looking at the data?

In a situation like this, we could expect a greater number of missing values for people who did not survive.

Let's have a look.

In [6]:
# let's create a binary variable that indicates 
# whether the value of cabin is missing

data['cabin_null'] = np.where(data['cabin'].isnull(), 1, 0)

In [7]:
# let's evaluate the percentage of missing values in
# cabin for the people who survived vs the non-survivors.

# the variable Survived takes the value 1 if the passenger
# survived, or 0 otherwise

# group data by Survived vs Non-Survived
# and find the percentage of nulls for cabin
data.groupby(['survived'])['cabin_null'].mean()

survived
0    0.873918
1    0.614000
Name: cabin_null, dtype: float64

In [8]:
# another way of doing the above, with fewer lines
# of code :)

data['cabin'].isnull().groupby(data['survived']).mean()

survived
0    0.873918
1    0.614000
Name: cabin, dtype: float64

We observe that the percentage of missing values is higher for people who did not survive (87%) than for people who survived (61%). This finding is consistent with our hypothesis that the data are missing because, after people died, the information could not be retrieved.

Note: to truly establish whether the data are missing not at random, we would need to become very familiar with the way the data were collected. Analysing datasets can only point us in the right direction or help us build assumptions.
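
A cross-tabulation gives the same comparison while showing the share of present and missing values side by side. The sketch below uses a tiny invented sample, and the labels `missing` and `present` are hypothetical names introduced only for readability.

```python
import numpy as np
import pandas as pd

# toy sample: 4 non-survivors and 4 survivors (illustrative values)
toy = pd.DataFrame({
    'survived': [0, 0, 0, 0, 1, 1, 1, 1],
    'cabin': [np.nan, np.nan, np.nan, 'C22', 'B5', np.nan, 'E12', 'D7'],
})

# label each row by whether cabin is missing
cabin_status = toy['cabin'].isnull().map({True: 'missing', False: 'present'})

# row-normalised table: share of missing/present cabin values per survival group
rates = pd.crosstab(toy['survived'], cabin_status, normalize='index')
print(rates)
```

As with the groupby approach, a difference between the rows is only suggestive; it does not by itself establish the missingness mechanism.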

In [9]:
# Let's do the same for the variable age:

# First we create a binary variable to indicate
# whether the value of Age is missing

data['age_null'] = np.where(data['age'].isnull(), 1, 0)

# and then look at the mean in the different survival groups:
data.groupby(['survived'])['age_null'].mean()

survived
0    0.234858
1    0.146000
Name: age_null, dtype: float64

In [10]:
# or the same with simpler code :)

data['age'].isnull().groupby(data['survived']).mean()

survived
0    0.234858
1    0.146000
Name: age, dtype: float64

Again, we observe a higher proportion of missing data for the people who did not survive the tragedy (23% vs 15%). The analysis therefore suggests that there is a systematic loss of data: people who did not survive tend to have more missing information. Presumably, the method chosen to gather the information contributes to the generation of these missing data.

### Missing data Completely At Random (MCAR)

In [11]:
# In the titanic dataset, there are also missing values
# for the variable Embarked.
# Let's have a look.

# Let's slice the dataframe to show only the observations
# with missing values for Embarked

data[data['embarked'].isnull()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,cabin_null,age_null
168,1,1,"Icard, Miss. Amelie",female,38,0,0,113572,80,B28,,6,,,0,0
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,,6,,"Cincinatti, OH",0,0


These 2 women were traveling together.

A priori, there is no indication that the missing information in the variable Embarked depends on any other variable, and the fact that these women survived means that they could have been asked for this information.

Very likely the values were lost at the time of building the dataset.

If these values are MCAR, the probability of data being missing for these 2 women is the same as the probability of values being missing for any other person on the Titanic. Of course, this will be hard, if possible at all, to prove.
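
To make the MCAR idea concrete, the sketch below simulates it on synthetic data: every value of a made-up `embarked` column is dropped with the same probability (an assumed 20%), independently of everything else. Under that assumption the missing rate should come out roughly equal across survival groups.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# synthetic passengers (not the real dataset)
toy = pd.DataFrame({
    'survived': rng.integers(0, 2, size=n),
    'embarked': rng.choice(['S', 'C', 'Q'], size=n),
})

# MCAR: each value is removed with the same probability,
# regardless of survival or any other variable
mask = rng.random(n) < 0.2
toy.loc[mask, 'embarked'] = np.nan

# missing rate per survival group; both should be close to 0.2
rates = toy['embarked'].isnull().groupby(toy['survived']).mean()
print(rates)
```

With only 2 missing values in the real Embarked column, a check like this has no statistical power, which is why the MCAR assumption there remains hard to prove.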

### Cardinality
The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.

#### Are multiple labels in a categorical variable a problem?
High cardinality may pose the following problems:

- Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.
- A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.
- Some labels may be present only in the training set and not in the test set, so machine learning algorithms may over-fit to the training set.
- Conversely, some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

In particular, tree methods can be biased towards variables with many labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.

In [12]:
# let's inspect the cardinality, that is, the number
# of different labels, of the categorical variables
# (note that unique() also counts NaN as a label)

print('Number of categories in the variable Name: {}'.format(
    len(data.name.unique())))

print('Number of categories in the variable Gender: {}'.format(
    len(data.sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
    len(data.ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.embarked.unique())))

print('Total number of passengers in the Titanic: {}'.format(len(data)))

Number of categories in the variable Name: 1307
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 929
Number of categories in the variable Cabin: 182
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 1309


While the variable Sex contains only 2 categories and Embarked 4, counting missing values as a category (low cardinality), the variables Ticket, Name and Cabin, as expected, contain a huge number of different labels (high cardinality).
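
One detail worth keeping in mind when reading these counts: `unique()` returns NaN as one of the values, whereas `nunique()` excludes it by default. A small sketch on a made-up Series shows the difference.

```python
import numpy as np
import pandas as pd

# toy column with three ports and one missing value (illustrative)
embarked = pd.Series(['S', 'C', 'Q', np.nan, 'S'])

# unique() keeps NaN, so the length includes it
with_nan = len(embarked.unique())

# nunique() drops NaN by default
without_nan = embarked.nunique()

# dropna=False makes nunique() count NaN as well
also_with_nan = embarked.nunique(dropna=False)

print(with_nan, without_nan, also_with_nan)
```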

To demonstrate the effect of high cardinality in train and test sets and machine learning performance, I will work with the variable Cabin. I will create a new variable with reduced cardinality.

In [13]:
# let's explore the values / categories of Cabin

# we know from the previous cell that there are 182
# different values (including NaN), therefore the variable
# is highly cardinal

data.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


Let's now reduce the cardinality of the variable. How? Instead of using the entire cabin value, I will capture only the first letter.

Rationale: the first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class and how close the cabin was to the top deck (the surface) of the Titanic. Both are associated with a higher probability of survival.

In [14]:
# let's capture the first letter of Cabin
# note: astype(str) turns NaN into the string 'nan',
# so missing cabins end up with the label 'n'
data['Cabin_reduced'] = data['cabin'].astype(str).str[0]

data[['cabin', 'Cabin_reduced']].head()

Unnamed: 0,cabin,Cabin_reduced
0,B5,B
1,C22,C
2,C22,C
3,C22,C
4,C22,C


In [15]:
print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Cabin reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 182
Number of categories in the variable Cabin reduced: 9
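
One of the 9 reduced categories is presumably not a deck at all: converting NaN to a string gives `'nan'`, whose first character is `'n'`. A possible alternative, sketched below on made-up values, is to take the first letter with `.str[0]` directly (which leaves NaN untouched) and then give missing cabins an explicit label. The label `'Missing'` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# toy cabin column (illustrative values)
cabin = pd.Series(['B5', 'C22', np.nan, 'E12'])

# the string form of NaN starts with 'n', which is where the extra label comes from
nan_first_char = str(np.nan)[0]

# .str[0] keeps NaN as NaN, so it can be labelled explicitly afterwards
deck = cabin.str[0].fillna('Missing')

print(nan_first_char, deck.tolist())
```

Either way the number of categories is the same; the explicit label only makes the meaning of the extra category visible.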


In [16]:
import matplotlib.pyplot as plt

# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split

In [17]:
# let's separate into training and testing set
# in order to build machine learning models

use_cols = ['cabin', 'Cabin_reduced', 'sex']

# this function comes from scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], 
    data['survived'],  
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

#### High cardinality leads to uneven distribution of categories in train and test sets
When a variable is highly cardinal, some categories often land only in the training set, or only in the test set. If present only in the training set, they may lead to over-fitting. If present only in the test set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

In [18]:
# Let's find out labels present only in the training set

unique_to_train_set = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()
]

len(unique_to_train_set)

113

There are 113 Cabins only present in the training set, and not in the testing set.

In [19]:
# Let's find out labels present only in the test set

unique_to_test_set = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()
]

len(unique_to_test_set)

36
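
The two comparisons above can also be written with set differences, which avoids recomputing `unique()` inside the loop. The sketch below uses small invented Series in place of the train and test columns, and drops NaN first because NaN does not compare equal to itself and may otherwise be counted as exclusive to both sets.

```python
import numpy as np
import pandas as pd

# toy stand-ins for the cabin column in train and test (illustrative)
train = pd.Series(['B5', 'C22', 'E12', np.nan, 'C22'])
test = pd.Series(['C22', 'D7', np.nan])

# distinct, non-missing labels in each set
train_labels = set(train.dropna().unique())
test_labels = set(test.dropna().unique())

# labels that appear in only one of the two sets
only_train = train_labels - test_labels
only_test = test_labels - train_labels

print(sorted(only_train), sorted(only_test))
```

Because missing values are excluded here, the counts can differ slightly from those of a comparison that keeps NaN in.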

Variables with high cardinality tend to have values (i.e., categories) present in the training set, that are not present in the test set, and vice versa. This will bring problems at the time of training (due to over-fitting) and scoring of new data (how should the model deal with unseen categories?).

This problem is almost entirely overcome by reducing the cardinality of the variable, as the next cells show.

In [20]:
# Let's find out labels present only in the training set
# for Cabin with reduced cardinality

unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)

1

In [21]:
# Let's find out labels present only in the test set
# for Cabin with reduced cardinality

unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)

0

Observe how, by reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and no label in the test set that is missing from the training set.

### Effect of cardinality on Machine Learning Model Performance

In order to evaluate the effect of categorical variables in machine learning models, I will quickly replace the categories with numbers.

In [22]:
# Let's re-map Cabin into numbers so we can use it to train ML models


cabin_dict = {k: i for i, k in enumerate(X_train.cabin.unique(), 0)}
cabin_dict

{'A10': 142,
 'A11': 51,
 'A14': 93,
 'A16': 143,
 'A18': 23,
 'A20': 82,
 'A21': 47,
 'A24': 7,
 'A26': 22,
 'A29': 49,
 'A34': 37,
 'A36': 104,
 'A5': 95,
 'A6': 101,
 'A9': 58,
 'B102': 80,
 'B18': 111,
 'B19': 108,
 'B20': 137,
 'B24': 98,
 'B26': 32,
 'B28': 97,
 'B3': 77,
 'B30': 115,
 'B35': 18,
 'B36': 83,
 'B37': 144,
 'B4': 130,
 'B41': 134,
 'B42': 103,
 'B45': 13,
 'B49': 96,
 'B5': 42,
 'B50': 6,
 'B51': 52,
 'B52': 72,
 'B57': 21,
 'B58': 70,
 'B61': 81,
 'B69': 33,
 'B71': 59,
 'B73': 20,
 'B77': 85,
 'B78': 146,
 'B86': 90,
 'B94': 36,
 'B96': 24,
 'C101': 27,
 'C103': 124,
 'C105': 88,
 'C110': 123,
 'C111': 8,
 'C116': 107,
 'C118': 126,
 'C123': 35,
 'C124': 64,
 'C125': 17,
 'C126': 66,
 'C128': 30,
 'C130': 118,
 'C132': 91,
 'C148': 87,
 'C2': 99,
 'C22': 4,
 'C23': 65,
 'C28': 138,
 'C30': 55,
 'C39': 39,
 'C49': 135,
 'C51': 120,
 'C52': 105,
 'C54': 94,
 'C55': 62,
 'C6': 10,
 'C62': 132,
 'C65': 75,
 'C68': 2,
 'C7': 84,
 'C78': 26,
 'C80': 145,
 'C82': 71,
 ...

In [23]:
# replace the labels in Cabin, using the dictionary created above
# (copy first to avoid pandas' SettingWithCopyWarning)
X_train, X_test = X_train.copy(), X_test.copy()
X_train['Cabin_mapped'] = X_train['cabin'].map(cabin_dict)
X_test['Cabin_mapped'] = X_test['cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'cabin']].head(10)

Unnamed: 0,Cabin_mapped,cabin
501,0,
588,0,
402,0,
1193,0,
686,0,
971,0,
117,1,E36
540,0,
294,2,C68
261,3,E24


We see how NaN takes the value 0 in the new variable, E36 takes the value 1, C68 takes the value 2, and so on.

In [24]:
# Now I will replace the letters in the reduced cabin variable
# with the same procedure

# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(
    cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

X_train[['Cabin_reduced', 'cabin']].head(20)

Unnamed: 0,Cabin_reduced,cabin
501,0,
588,0,
402,0,
1193,0,
686,0,
971,0,
117,1,E36
540,0,
294,2,C68
261,1,E24


In [25]:
# re-map the categorical variable Sex into numbers

X_train.loc[:, 'sex'] = X_train.loc[:, 'sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'sex'] = X_test.loc[:, 'sex'].map({'male': 0, 'female': 1})

X_train.sex.head()

501     1
588     1
402     1
1193    0
686     1
Name: sex, dtype: int64

In [26]:
# check if there are missing values in these variables

X_train[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     0
Cabin_reduced    0
sex              0
dtype: int64

In [27]:
X_test[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     41
Cabin_reduced     0
sex               0
dtype: int64

In the test set, there are now 41 missing values for the highly cardinal variable. These were introduced when encoding the categories into numbers.

How?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set.

For now, I will fill those missing values with 0. Note that 0 is also the code assigned to NaN above, so unseen cabins are treated as if the cabin were missing.
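
The mechanism can be reproduced on a few made-up labels. The sketch below builds the dictionary from the training labels only, shows that an unseen test label maps to NaN, and then, as one possible alternative to filling with 0, reserves a dedicated code for unseen labels. The value `-1` is a hypothetical choice, not something the notebook uses.

```python
import pandas as pd

# toy train and test labels (illustrative values)
train = pd.Series(['B5', 'C22', 'E12', 'C22'])
test = pd.Series(['C22', 'D7', 'B5'])

# encoding dictionary built from the training labels only
mapping = {label: code for code, label in enumerate(train.unique())}

# 'D7' is not in the dictionary, so it becomes NaN
encoded = test.map(mapping)
n_unseen = int(encoded.isnull().sum())

# alternative: reserve a dedicated code (-1 here) for unseen labels
encoded_safe = encoded.fillna(-1).astype(int)

print(mapping, n_unseen, encoded_safe.tolist())
```

A dedicated code keeps unseen labels distinguishable from existing categories, although the model still has no training information about them.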

In [28]:
# let's check the number of different categories in the encoded variables
len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())

(147, 9)

In [29]:
X_train.Cabin_reduced.unique()

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

Let's go ahead and evaluate the effect of the number of labels on machine learning algorithms.

### Random Forest

In [30]:
# model built on data with high cardinality for cabin

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.853790650048556
Test set
Random Forests roc-auc: 0.7691361097284443


We observe that the performance of the Random Forests on the training set (roc-auc 0.85) is considerably better than its performance on the test set (0.77). This indicates that the model is over-fitting: it does a good job of predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction to unseen data.

In [31]:
# model built on data with low cardinality for cabin

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8163420365403872
Test set
Random Forests roc-auc: 0.8017670482827277


We can see now that the Random Forests barely over-fit to the training set (roc-auc 0.82 on train vs 0.80 on test). In addition, the model generalises better: compare the roc-auc of this model on the test set with that of the model above, also on the test set (0.80 vs 0.77).

### Adaboost

In [32]:
# model built on data with plenty of categories in Cabin

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8296861713101102
Test set
Adaboost roc-auc: 0.7604391350035948


In [33]:
# model built on data with fewer categories in Cabin variable

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8161256723642566
Test set
Adaboost roc-auc: 0.8001078480172557


Similarly, the AdaBoost model trained on the variable with high cardinality over-fits to the train set, whereas the AdaBoost model trained on the low-cardinality variable does not, and therefore does a better job of generalising the predictions.

In addition, building an AdaBoost model on a variable with fewer categories in Cabin a) is simpler and b) copes better with new data: should a different cabin appear in the test set, taking just its first letter gives the ML model a category it has most likely seen during training.

### Logistic Regression

In [34]:
# model built on data with plenty of categories in Cabin variable

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8133909298124677
Test set
Logistic regression roc-auc: 0.7750815773463858


In [35]:
# model built on data with fewer categories in Cabin variable

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8123468468695123
Test set
Logistic regression roc-auc: 0.8008268347989602


We can draw the same conclusion for Logistic Regression: reducing the cardinality narrows the gap between train and test roc-auc and improves performance on the test set (0.80 vs 0.78).
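
The fit-and-score pattern repeated in the cells above could be wrapped in a small helper. The sketch below is one way to do it; `train_test_auc` is a hypothetical name, and the data are synthetic stand-ins (a binary `sex` column and an integer-coded `deck` column) rather than the Titanic variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_test_auc(model, X_train, X_test, y_train, y_test, cols):
    """Fit the model on the given columns and return (train AUC, test AUC)."""
    model.fit(X_train[cols], y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train[cols])[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test[cols])[:, 1])
    return auc_train, auc_test


# synthetic data: the outcome depends on 'sex' plus noise, 'deck' is random
rng = np.random.default_rng(0)
n = 400
sex = rng.integers(0, 2, size=n)
deck = rng.integers(0, 9, size=n)
y = ((sex + rng.normal(0, 0.8, size=n)) > 0.5).astype(int)
X = pd.DataFrame({'sex': sex, 'deck': deck})

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

auc_tr, auc_te = train_test_auc(
    LogisticRegression(), X_tr, X_te, y_tr, y_te, ['deck', 'sex'])
print('train roc-auc: {:.3f}, test roc-auc: {:.3f}'.format(auc_tr, auc_te))
```

The same helper accepts any classifier that implements `predict_proba`, so it would apply equally to the tree-based models used in this notebook.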

### Gradient Boosted Classifier

In [36]:
# model built on data with plenty of categories in Cabin variable

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.862631390919749
Test set
Gradient Boosted Trees roc-auc: 0.7733117637298823


In [37]:
# model built on data with fewer categories in Cabin variable

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.816719415917359
Test set
Gradient Boosted Trees roc-auc: 0.8015181682429069
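
To compare all four models at once, the roc-auc values printed above can be collected into a table and the train-test gap computed for each. The numbers below are the outputs reported in this notebook, rounded to 4 decimals; the column names and the `gap` measure are illustrative choices.

```python
import pandas as pd

# roc-auc values reported in the cells above, rounded to 4 decimals
results = pd.DataFrame(
    [
        ('Random Forest', 'Cabin_mapped', 0.8538, 0.7691),
        ('Random Forest', 'Cabin_reduced', 0.8163, 0.8018),
        ('AdaBoost', 'Cabin_mapped', 0.8297, 0.7604),
        ('AdaBoost', 'Cabin_reduced', 0.8161, 0.8001),
        ('Logistic Regression', 'Cabin_mapped', 0.8134, 0.7751),
        ('Logistic Regression', 'Cabin_reduced', 0.8123, 0.8008),
        ('Gradient Boosting', 'Cabin_mapped', 0.8626, 0.7733),
        ('Gradient Boosting', 'Cabin_reduced', 0.8167, 0.8015),
    ],
    columns=['model', 'feature', 'train_auc', 'test_auc'],
)

# a large positive gap between train and test suggests over-fitting
results['gap'] = (results['train_auc'] - results['test_auc']).round(4)

# one row per model, one column per version of the cabin variable
gap_table = results.pivot(index='model', columns='feature', values='gap')
print(gap_table)
```

For every model the gap is smaller with the reduced variable, and the test roc-auc is higher, which matches the conclusions drawn in the sections above.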


### Summary

- Strings need to be encoded as numbers for use with Scikit-Learn
- High Cardinality may cause over-fitting and operationalization problems
- Reducing Cardinality may improve model performance

### Reference

https://www.udemy.com/course/feature-engineering-for-machine-learning/