<a href="https://www.kaggle.com/code/eamonntweedy/spaceship-titanic-tree-models?scriptVersionId=114226274" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Spaceship-titanic: predicting with tree models

In this notebook, we explore the data, pre-process it, investigate features, and make predictions on the Spaceship Titanic Kaggle dataset.  The policies used to impute missing data were inspired by this very helpful thread:
https://www.kaggle.com/competitions/spaceship-titanic/discussion/315987

* [Downloading data](#download)
* [Loading data, etc.](#loading)
    - [Cabin and group features](#cabin_group)
    - [Visualizing feature data](#graphs)
* [Data cleanup](#cleanup)
    - [Assumptions/rules for imputing missing data](#missing)
    - [Imputing missing data](#fill)
    - [Finishing up the feature engineering](#finishing)
* [Model baseline evaluation](#models)
* [Selecting features](#features)
    - [Selecting features using SequentialFeatureSelector()](#sfs)
* [Hyperparameter tuning](#tuning)
    - [BayesSearchCV](#bayes)
        * [XGBoost](#bayes_x)
        * [LightGBM](#bayes_l)
        * [CatBoost](#bayes_c)
* [Predicting and writing to submission file](#pred)
        

<a id="download"></a>
## Downloading spaceship-titanic dataset

First, we download the spaceship-titanic dataset from Kaggle.  We use the kaggle API package for convenience.

In [None]:
%%capture
! pip install kaggle

import os
from pathlib import Path

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    path = Path('../input/spaceship-titanic')
#    !pip install -Uqq fastai
else:
    import zipfile,kaggle
    path = Path('spaceship-titanic')
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

<a id="loading"></a>
## Loading data, initial feature engineering, and visualizing data

Load in our training and test data as dataframes.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv(path/'train.csv')
df_test = pd.read_csv(path/'test.csv')

<a id="cabin_group"></a>
### Cabin and group features

We create several new features with the following data pre-processing function:
1. We split 'Cabin' into its component parts 'CabinDeck', 'CabinNum', 'CabinSide'.
2. We've harvested all that we can from 'Cabin', so we drop it.  The feature 'Name' doesn't appear to be immediately useful, so we drop that too.
3. We split off the 'Group' number from 'PassengerId'.

In [None]:
cat_feat = ['HomePlanet','CryoSleep','Destination','VIP','CabinDeck','CabinSide','Group']
num_feat = ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','CabinNum','Expenses']

def extract_cabin_group(df):
    df[['CabinDeck','CabinNum','CabinSide']]= df.Cabin.str.split('/',expand=True)
    df['CabinNum']=df['CabinNum'].astype(float)
    df.drop(['Cabin','Name'],axis=1,inplace=True)
    df['Group']=df.PassengerId.str.split('_').str[0].astype(float)

extract_cabin_group(df)
extract_cabin_group(df_test)
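
To make the splitting concrete, here is a minimal sketch on a two-row toy frame. The values are made up for illustration; they only mimic the `deck/num/side` layout of 'Cabin' and the `group_position` layout of 'PassengerId'. The name `toy` is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) in the same layout as the competition data.
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01'],
    'Cabin': ['B/0/P', np.nan],
})

# 'Cabin' is 'deck/num/side'; a missing Cabin yields missing values in all three parts.
toy[['CabinDeck', 'CabinNum', 'CabinSide']] = toy.Cabin.str.split('/', expand=True)
toy['CabinNum'] = toy['CabinNum'].astype(float)

# The part of 'PassengerId' before the underscore is the group number.
toy['Group'] = toy.PassengerId.str.split('_').str[0].astype(float)

print(toy[['CabinDeck', 'CabinNum', 'CabinSide', 'Group']])
```

Note that a passenger with no 'Cabin' ends up with all three cabin features missing at once, which is why they are imputed together later.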

<a id="graphs"></a>
### Visualizing feature data

We take a look at bar graphs of counts for each categorical variable in the training set, as well as counts of transported and non-transported passengers by categorical variable.

In [None]:
fig, axes = plt.subplots(6,2, figsize=(20,35))
idx = 0
for col in cat_feat[:-1]:
    sns.countplot(data=df, y=col, palette='magma', orient='h',
                  ax=axes[idx][0]).set_title(f'Count of {col}', fontsize='16')
    sns.countplot(data=df, y=col, palette='mako', orient='h',  hue='Transported',
                  ax=axes[idx][1]).set_title(f'Count of {col} per transported', fontsize='16')
    idx +=1
plt.show()
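
The same comparison can be read off numerically. The sketch below is a possible companion to the bar graphs, not something the notebook runs: it computes the count and the share of transported passengers per category on a made-up toy frame (`toy` and `summary` are illustrative names).

```python
import pandas as pd

# Made-up toy data; in the notebook this would be `df` after loading train.csv.
toy = pd.DataFrame({
    'HomePlanet':  ['Earth', 'Earth', 'Europa', 'Europa', 'Europa', 'Mars'],
    'Transported': [False, True, True, True, False, True],
})

# Number of passengers and share transported for each HomePlanet value.
summary = toy.groupby('HomePlanet')['Transported'].agg(['count', 'mean'])
print(summary)
```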

To make some observations that will help us decide how to impute missing data, we also look at the relative breakdown of pairs of categorical features, in both the training and test data sets.

In [None]:
plot_cols = ['CabinDeck','Destination','VIP','CryoSleep','CabinSide']
fig, axes = plt.subplots(5,4, figsize=(30,35))
idx = 0
for col in plot_cols:
    sns.countplot(data=df, y=col, palette='magma', orient='h',
                  ax=axes[idx][0]).set_title(f'Count of {col} (train)', fontsize='16')
    sns.countplot(data=df, y=col, palette='magma', orient='h',  hue='HomePlanet',
                  ax=axes[idx][1]).set_title(f'Count of {col} per HomePlanet (train)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',
                  ax=axes[idx][2]).set_title(f'Count of {col} (test)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',  hue='HomePlanet',
                  ax=axes[idx][3]).set_title(f'Count of {col} per HomePlanet (test)', fontsize='16')
    idx +=1
plt.show()

In [None]:
plot_cols = ['CabinDeck','VIP','CryoSleep','CabinSide']
fig, axes = plt.subplots(4,4, figsize=(30,35))
idx = 0
for col in plot_cols:
    sns.countplot(data=df, y=col, palette='magma', orient='h',
                  ax=axes[idx][0]).set_title(f'Count of {col} (train)', fontsize='16')
    sns.countplot(data=df, y=col, palette='magma', orient='h',  hue='Destination',
                  ax=axes[idx][1]).set_title(f'Count of {col} per Destination (train)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',
                  ax=axes[idx][2]).set_title(f'Count of {col} (test)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',  hue='Destination',
                  ax=axes[idx][3]).set_title(f'Count of {col} per Destination (test)', fontsize='16')
    idx +=1
plt.show()

In [None]:
plot_cols = ['VIP','CabinDeck','CabinSide']
fig, axes = plt.subplots(3,4, figsize=(30,35))
idx = 0
for col in plot_cols:
    sns.countplot(data=df, y=col, palette='magma', orient='h',
                  ax=axes[idx][0]).set_title(f'Count of {col} (train)', fontsize='16')
    sns.countplot(data=df, y=col, palette='magma', orient='h',  hue='CryoSleep',
                  ax=axes[idx][1]).set_title(f'Count of {col} per CryoSleep (train)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',
                  ax=axes[idx][2]).set_title(f'Count of {col} (test)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',  hue='CryoSleep',
                  ax=axes[idx][3]).set_title(f'Count of {col} per CryoSleep (test)', fontsize='16')
    idx +=1
plt.show()

In [None]:
plot_cols = ['VIP','CabinDeck']
fig, axes = plt.subplots(2, 4, figsize=(30,35))
idx = 0
for col in plot_cols:
    sns.countplot(data=df, y=col, palette='magma', orient='h',
                  ax=axes[idx][0]).set_title(f'Count of {col} (train)', fontsize='16')
    sns.countplot(data=df, y=col, palette='magma', orient='h',  hue='CabinSide',
                  ax=axes[idx][1]).set_title(f'Count of {col} per CabinSide (train)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',
                  ax=axes[idx][2]).set_title(f'Count of {col} (test)', fontsize='16')
    sns.countplot(data=df_test, y=col, palette='magma', orient='h',  hue='CabinSide',
                  ax=axes[idx][3]).set_title(f'Count of {col} per CabinSide (test)', fontsize='16')
    idx +=1
plt.show()

The above charts will help us to deduce some assumptions and rules for imputing missing data.  First, let's also take a look at how CabinNum relates to Group within each CabinDeck class.

In [None]:
decks = ['A','B','C','D','E','F','G']
bill_cols = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
# One row per deck: training set on the left, test set on the right.
fig, axes = plt.subplots(7, 2, figsize=(20,35))
fig.suptitle('CabinNum vs. Group in each CabinDeck class',fontsize='24')
for idx_1, deck in enumerate(decks):
    for idx_2, (data, label) in enumerate([(df,'train'),(df_test,'test')]):
        subset = data[data.CabinDeck==deck]
        sns.scatterplot(data=subset, x='Group', y='CabinNum',ax=axes[idx_1][idx_2]).set_title(f'{deck} ({label})', fontsize='16')
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()

We can see that within each CabinDeck class, CabinNum and Group are fairly close to being linearly related.  We will use this observation later: we shall use linear regression to fill the missing CabinNum fields.
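
One way to put a number on "fairly close to linear" is the Pearson correlation between Group and CabinNum within each deck. The following is a minimal sketch on synthetic data (the slopes and noise level are invented, and `toy`, `slope` and `corr` are illustrative names); on the real data one would apply the same `groupby` to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: two decks, CabinNum roughly linear in Group.
toy = pd.DataFrame({
    'CabinDeck': ['F'] * 50 + ['G'] * 50,
    'Group': np.tile(np.arange(1, 51) * 100.0, 2),
})
slope = toy.CabinDeck.map({'F': 0.20, 'G': 0.15})
toy['CabinNum'] = np.rint(slope * toy.Group + rng.normal(0, 20, len(toy)))

# Pearson correlation between Group and CabinNum within each deck.
corr = toy.groupby('CabinDeck')[['Group', 'CabinNum']].apply(
    lambda g: g['Group'].corr(g['CabinNum'])
)
print(corr)
```

Values close to 1 within a deck would support using a per-deck linear fit.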

<a id="cleanup"></a>
## Data cleanup

<a id="missing"></a>
### Assumptions and rules for imputing missing data:

We describe some observations about missing data and the rules we use to impute it.  Note that it can be helpful if the missing data isn't filled in too carefully: adding some noise to the training set can have a regularizing effect that reduces the chance of the model overfitting the training set.

Missing HomePlanet:
1. If a passenger has CabinDeck in [A,B,C] then they are from Europa.
2. If a passenger has CabinDeck G then they are from Earth.
3. If a passenger has CabinDeck D then they are from Europa or Mars.  We set to Mars (mode among these two).
4. If a passenger has CabinDeck F then they are from Earth or Mars.  We set to Earth (mode among these two).
5. If a passenger has CabinDeck E then they can be from any planet.  We set to Earth (mode among these).
6. If a passenger is going to PSO or Trappist, we set to Earth (mode among these).
7. If a passenger is going to Cancri, we set to Europa (mode among these).

Missing CabinDeck: Just use HomePlanet
1. If a passenger is from Earth, then set CabinDeck to G, the most likely.
2. If a passenger is from Mars, then set CabinDeck to F, the most likely.
3. If a passenger is from Europa, then they're roughly equally likely to be in B or C.  Just choose B arbitrarily.

Missing CabinSide: Equally distributed between P and S - just choose S arbitrarily.

Missing VIP: Almost nobody is VIP.  Fill in all missing VIP as False.

Missing CryoSleep: We base this entirely on spending categories.
1. If a passenger has nonzero spending in any category, then set CryoSleep = False.
2. If a passenger has zero spending in all categories, we set CryoSleep = True (much more likely).

Missing Destination: The vast majority are going to Trappist, and no other feature value makes a different destination likelier, so we fill all missing destinations with Trappist.

Missing Age: Fill with the mean age, rounded to the nearest year (though there may be a more nuanced way to do this).

Missing CabinNum: Seems linearly correlated with Group.  We fill in missing CabinNum using a linear regression prediction on Group, within each CabinDeck class.

Missing Bills: We fill all missing bills with 0.  Note that roughly half of those with missing bills have CryoSleep=True, so necessarily should have all bills 0.
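
Rules like the HomePlanet ones can be checked against the data with a cross-tabulation. The sketch below is an illustration on made-up rows, not the notebook's own check: a deck on which only one home planet ever occurs supports a deterministic rule, while a deck with several planets calls for a "most common value" rule. The names `toy`, `table` and `single_planet_decks` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy data; in the notebook this would be the training frame `df`.
toy = pd.DataFrame({
    'HomePlanet': ['Europa', 'Europa', 'Earth', 'Earth', 'Mars', 'Earth', np.nan],
    'CabinDeck':  ['A',      'B',      'G',     'F',     'F',    'G',     'G'],
})

# Rows with a missing HomePlanet are left out of the table by default.
table = pd.crosstab(toy.CabinDeck, toy.HomePlanet)
print(table)

# Decks on which exactly one home planet occurs.
single_planet_decks = table.index[(table > 0).sum(axis=1) == 1].tolist()
print(single_planet_decks)
```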


<a id="fill"></a>
### Imputing missing data

In [None]:
from sklearn.linear_model import LinearRegression

The following block of functions will be used to fill in the missing data, according to the observations and rules above.

In [None]:
def fill_missing_home(df):
    # All masks are computed up front from the rows whose HomePlanet is missing.
    missing = df.HomePlanet.isna()
    filter_PSO_Trap = missing&((df.Destination == 'PSO J318.5-22')|(df.Destination == 'TRAPPIST-1e'))
    filter_Can = missing&(df.Destination == '55 Cancri e')
    filter_ABC = missing&((df.CabinDeck == 'A')|(df.CabinDeck == 'B')|(df.CabinDeck == 'C'))
    filter_D = missing&(df.CabinDeck == 'D')
    filter_EFG = missing&((df.CabinDeck == 'E')|(df.CabinDeck == 'F')|(df.CabinDeck == 'G'))
    # Destination-based fallbacks go first so that the deck-based rules below
    # take precedence whenever the deck is known.
    df.loc[filter_PSO_Trap,'HomePlanet']='Earth'
    df.loc[filter_Can,'HomePlanet']='Europa'
    df.loc[filter_ABC,'HomePlanet']='Europa'
    df.loc[filter_D,'HomePlanet']='Mars'
    df.loc[filter_EFG,'HomePlanet']='Earth'

def fill_missing_dest(df):
    df.loc[df.Destination.isna(),'Destination']='TRAPPIST-1e'
    
def fill_missing_deck(df):
    filter_Earth = (df.CabinDeck.isna())&(df.HomePlanet == 'Earth')
    filter_Mars = (df.CabinDeck.isna())&(df.HomePlanet == 'Mars')
    df.loc[filter_Earth,'CabinDeck']='G'
    df.loc[filter_Mars,'CabinDeck']='F'
    df.loc[df.CabinDeck.isna(),'CabinDeck']='B'
    
def fill_missing_side(df):
    df.loc[df.CabinSide.isna(),'CabinSide']='S'
    
def fill_missing_cryo(df):
    bill_cols = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
    filter_bill = (df.CryoSleep.isna())&((df[bill_cols]>0).any(axis=1))
    df.loc[filter_bill,'CryoSleep'] = False
    df.loc[df.CryoSleep.isna(),'CryoSleep']=True
    
def fill_missing_bills(df):
    bill_cols = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
    for col in bill_cols:
        df.loc[df[col].isna(),col]=0

def fill_missing_cabin_num(df):
    deck_labels = ['B','C','D','E','F','G']
    for deck in deck_labels:
        df_deck = df[df.CabinDeck==deck]
        df_deck_ok = df_deck[df_deck.CabinNum.notnull()]
        # Fit CabinNum ~ Group on the rows of this deck with a known CabinNum.
        X = df_deck_ok.loc[:,'Group'].values.astype(int).reshape(-1,1)
        Y = df_deck_ok.loc[:,'CabinNum'].values.astype(int)
        lr = LinearRegression()
        lr.fit(X,Y)
        # Predict only for the rows that need filling; round and clip at 0.
        mask = (df.CabinNum.isna())&(df.CabinDeck==deck)
        if mask.any():
            X_nan = df.loc[mask,'Group'].values.astype(int).reshape(-1,1)
            df.loc[mask,'CabinNum'] = np.clip(np.rint(lr.predict(X_nan)),0,None)
    
def fill_missing_cats(df):
    fill_missing_home(df)
    df.loc[df.VIP.isna(),'VIP'] = False
    fill_missing_dest(df)
    fill_missing_side(df)
    fill_missing_deck(df)
    fill_missing_bills(df)
    fill_missing_cryo(df)
    mean_age = np.rint(df['Age'].mean()).astype(float)
    df.loc[df.Age.isna(),'Age']=mean_age
    fill_missing_cabin_num(df)
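
The regression-based fill for CabinNum is the least obvious of these steps, so here is a minimal sketch of the idea in isolation. It uses made-up data for a single deck in which CabinNum is exactly Group / 10; the names `toy`, `known`, `lr_toy` and `preds` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up single-deck data where CabinNum is exactly Group / 10, with two gaps.
toy = pd.DataFrame({
    'Group':    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'CabinNum': [1.0,  2.0,  np.nan, 4.0, np.nan, 6.0],
})

# Fit on the rows where CabinNum is known.
known = toy.CabinNum.notna()
lr_toy = LinearRegression().fit(toy.loc[known, ['Group']], toy.loc[known, 'CabinNum'])

# Predict only the missing rows, round to whole cabin numbers, and clip at 0.
preds = lr_toy.predict(toy.loc[~known, ['Group']])
toy.loc[~known, 'CabinNum'] = np.clip(np.rint(preds), 0, None)
print(toy.CabinNum.tolist())
```

In the notebook the same fit is repeated once per CabinDeck value, since each deck has its own slope.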

In [None]:
fill_missing_cats(df)
fill_missing_cats(df_test)

Verify that we've filled in all missing data:

In [None]:
df.isna().sum().sum()

In [None]:
df_test.isna().sum().sum()
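
If either total is nonzero, a per-column breakdown shows which rule left a gap. A minimal sketch on a made-up frame (`toy` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up toy frame with one gap left in 'HomePlanet'.
toy = pd.DataFrame({
    'HomePlanet': ['Earth', np.nan, 'Mars'],
    'Age': [30.0, 41.0, 18.0],
})

# Missing values per column, keeping only the columns that still have gaps.
missing = toy.isna().sum()
print(missing[missing > 0])
```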

<a id="finishing"></a>
### Finishing up the feature engineering

We convert the categorical features to numerals by their category codes:

In [None]:
for data in [df,df_test]:
    for feature in cat_feat:
        data[feature] = pd.Categorical(data[feature])
    data[cat_feat] = data[cat_feat].apply(lambda x: x.cat.codes)
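
Note that `pd.Categorical` infers its categories from the values it is given, so the loop above assigns the same code to the same label in both frames only when a column contains the same set of labels in train and test. The following sketch shows one possible way to guarantee consistent codes by fixing the category list up front. It is an alternative on made-up data, not what the notebook does, and `train`, `test` and `categories` are illustrative names.

```python
import pandas as pd

# Made-up toy frames: deck 'A' appears in train but not in test.
train = pd.DataFrame({'CabinDeck': ['A', 'G', 'T']})
test = pd.DataFrame({'CabinDeck': ['G', 'T']})

# Build one sorted category list from both frames and reuse it for each.
categories = sorted(set(train.CabinDeck) | set(test.CabinDeck))
for data in [train, test]:
    data['CabinDeck'] = pd.Categorical(data['CabinDeck'], categories=categories).codes

print(train.CabinDeck.tolist())  # codes for A, G, T
print(test.CabinDeck.tolist())   # codes for G, T, matching the training codes
```

Encoded independently, the test frame here would map 'G' to 0 and 'T' to 1, which would not match the training codes.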

Verify that dataframes look as we expect:

In [None]:
df.head()

Now that all expenses have been filled in, we create a new feature that is their sum:

In [None]:
for data in [df,df_test]:
    data['Expenses']=data['RoomService']+data['FoodCourt']+data['ShoppingMall']+data['Spa']+data['VRDeck']

Finally, we build the dataframes we'll use for modeling going forward:

In [None]:
X = df[cat_feat+num_feat]
X_test = df_test[cat_feat+num_feat]
y = df['Transported'].astype(int)

<a id="models"></a>
## Model baseline evaluation and model selection

In [None]:
%%capture
! pip install catboost
! pip install lightgbm
! pip install xgboost
! pip install scikit-optimize

In [None]:
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

We split into training and validation sets:

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X,y,random_state=42,shuffle=True)

In [None]:
skf = StratifiedKFold(shuffle=True,random_state=42)
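
As a small illustration of what the stratified splitter does, the sketch below splits a made-up, imbalanced label vector and prints the positive rate in each validation fold. The names `X_toy`, `y_toy`, `skf_toy` and `rates` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 80 negatives, 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for the split itself

skf_toy = StratifiedKFold(shuffle=True, random_state=42)  # n_splits defaults to 5
rates = []
for train_idx, val_idx in skf_toy.split(X_toy, y_toy):
    # Each validation fold keeps the 80/20 class ratio of the full label vector.
    rates.append(y_toy[val_idx].mean())
print(rates)
```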

We'll begin with a baseline cross-validation for five different tree/forest models: RandomForestClassifier(), AdaBoostClassifier(), XGBClassifier(), LGBMClassifier(), and CatBoostClassifier().

In [None]:
def cross_val(clf,name,features):
    cvs=cross_val_score(clf,X[features],y,cv=skf)
    print(f'{name} has CV scores: {cvs} with mean: {cvs.mean()}')

In [None]:
feat_all = cat_feat+num_feat

In [None]:
rfc = RandomForestClassifier()
abc = AdaBoostClassifier()
xgbc = XGBClassifier(verbosity=0)
lgbmc = LGBMClassifier()
cbc = CatBoostClassifier(verbose=False)
clf_list = [(rfc,'Random Forest'),(abc,'AdaBoost'),(xgbc,'XGBoost'),(lgbmc,'LightGBM'),(cbc,'CatBoost')]

for clf in clf_list:
    cross_val(clf[0],clf[1],feat_all)

Off-the-shelf CV scores are strongest for CatBoostClassifier() and LGBMClassifier(); the other three are a little worse.  We will move forward with the three gradient-boosting libraries: XGBoost, LightGBM, and CatBoost.

<a id="features"></a>
## Selecting features

<a id="sfs"></a>
### Best feature sets using SequentialFeatureSelector()

We shall use scikit-learn's SequentialFeatureSelector to identify the best feature lists.  SequentialFeatureSelector works with a particular model, and so potentially gives different feature lists for different models.  We shall use XGBoost for this selection, although in principle one may want to run a separate feature selection process for each model.

#### Finding N-best feature lists for XGBoost, N = 14,...,10

We save the resulting feature lists from SequentialFeatureSelector() in a dictionary 'sfs_dict'.  Entries of sfs_dict are of the form N:[list of N best features].

In [None]:
xgbc = XGBClassifier()

sfs_dict = {}
feat_all = np.array(cat_feat+num_feat)
for n in [14,13,12,11,10]:
    sfs = SequentialFeatureSelector(xgbc, n_features_to_select=n,direction='backward',cv=skf)
    sfs.fit(X,y)
    best_feat =  feat_all[sfs.get_support()]
    print(f'Best {n} features: {best_feat}')
    sfs_dict[n]=best_feat
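
For readers who have not used SequentialFeatureSelector before, here is a minimal, fast-running sketch of backward selection on synthetic data. It swaps in LogisticRegression for XGBoost purely to keep the example light, and `X_toy`, `y_toy`, `names` and `sfs_toy` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 3 are informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
names = np.array([f'f{i}' for i in range(X_toy.shape[1])])

# Backward selection: start from all features and drop one at a time,
# each time removing the feature whose removal hurts the CV score least.
sfs_toy = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=4,
                                    direction='backward', cv=3)
sfs_toy.fit(X_toy, y_toy)
print(names[sfs_toy.get_support()])
```

`get_support()` returns a boolean mask over the input columns, which is why the notebook indexes an array of feature names with it.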

Now we compute CV scores for XGBoost, LightGBM, and CatBoost using feature lists of various sizes.

In [None]:
xgbc = XGBClassifier(verbosity=0)
lgbmc = LGBMClassifier()
cbc = CatBoostClassifier(verbose=False)

clf_list = [(xgbc,'XGBoost'),(lgbmc,'LightGBM'),(cbc,'CatBoost')]

for num in sfs_dict:
    print(f'CV scores for {num} features: \n ---------- \n')
    for clf in clf_list:
        cross_val(clf[0],clf[1],sfs_dict[num])
    print('========== \n')

Those are pretty good, with the 14-feature list giving the strongest CV accuracy scores among feature lists and CatBoost giving the strongest CV accuracy scores among classifiers.  Going forward we will use the best 14 features.  We will tune hyperparameters for all three classifiers, but I expect CatBoost will give the best results overall.

In [None]:
best_14=sfs_dict[14]

<a id="tuning"></a>
## Hyperparameter tuning XGBoost, LightGBM, and CatBoost

In [None]:
%%capture
! pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import accuracy_score

<a id="bayes"></a>
### BayesSearchCV

We create some hyperparameter spaces for use in our BayesSearchCV searches:

In [None]:
xgb = XGBClassifier(verbosity = 0)
params_xgb = {
    'n_estimators':Integer(50,200),
    'learning_rate':Real(0.01, 0.5,'log-uniform'),
    'max_depth':Integer(1, 12),
    'subsample':Real(0.1, 1.0,'uniform'),
    'colsample_bytree':Real(0.5,1.0,'uniform'),
    'colsample_bylevel':Real(0.5,1.0,'uniform'),
    'colsample_bynode':Real(0.5,1.0,'uniform'),
    'reg_lambda':Real(0,10,'uniform'),
    'reg_alpha':Real(0,10,'uniform'),
    'num_parallel_tree':Integer(1,10),
    'min_child_weight':Real(0,10,'uniform')
    # 'min_child_samples' is a LightGBM parameter rather than an XGBoost one, so it is not searched here
}

opt_xgb = BayesSearchCV(xgb,search_spaces=params_xgb,verbose=1,cv=skf,n_jobs=5, n_iter=100, random_state=42)

lgb = LGBMClassifier(verbosity = -1)
params_lgb = {
    'n_estimators':Integer(50,300),
    'learning_rate':Real(0.01, 0.5,'log-uniform'),
    'max_depth':Integer(1, 12),
    'num_leaves':Integer(2,2000,'log-uniform'),
    'subsample':Real(0.1, 1.0,'uniform'),
    'min_child_weight':Real(1e-3,10,'uniform'),
    'min_child_samples':Integer(10,50),
    'colsample_bytree':Real(0.5,1.0,'uniform'),
    'reg_lambda':Real(0,10,'uniform')
}
opt_lgb = BayesSearchCV(lgb,search_spaces=params_lgb,verbose=1,cv=skf,n_jobs=5, n_iter=100, random_state=42)

cb = CatBoostClassifier(verbose=False)
params_cb = {
    'n_estimators':Integer(600,1200),
    'depth':Integer(1, 12),
    'subsample':Real(0.1, 1.0,'uniform'),
    'random_strength': Real(1e-9, 10, 'log-uniform'),
    'bagging_temperature': Real(0.0, 1.0),
    'rsm':Real(0.5,1.0,'uniform'),
    'l2_leaf_reg':Real(1,30,'uniform')
}
opt_cb = BayesSearchCV(cb,search_spaces=params_cb,verbose=1,cv=skf, n_jobs=5, n_iter=100, random_state=42)

<a id="bayes_x"></a>
#### BayesSearchCV for XGBoost

In [None]:
opt_xgb.fit(X[best_14],y)
print(f'Results for XGBoost ---> Best params: {opt_xgb.best_params_} \n Best score: {opt_xgb.best_score_}')

<a id="bayes_l"></a>
#### BayesSearchCV for LightGBM

In [None]:
opt_lgb.fit(X[best_14],y)
print(f'Results for LightGBM ---> Best params: {opt_lgb.best_params_} \n Best score: {opt_lgb.best_score_}')

<a id="bayes_c"></a>
#### BayesSearchCV for CatBoost

In [None]:
opt_cb.fit(X[best_14],y)
print(f'Results for CatBoost ---> Best params: {opt_cb.best_params_} \n Best score: {opt_cb.best_score_}')

We instantiate all three classifiers with the hyperparameters identified by BayesSearchCV:

In [None]:
xgbc = XGBClassifier(verbosity = 0,colsample_bylevel=1,colsample_bynode=0.5686848205241387,colsample_bytree=1, learning_rate=0.33326322321728113,max_depth=12,min_child_weight=0,n_estimators=79,num_parallel_tree=10, reg_alpha=10,reg_lambda=10, subsample=0.8287935953498862)
lgbmc = LGBMClassifier(colsample_bytree=1,learning_rate=0.0574159608617656,max_depth=11,min_child_samples=19,min_child_weight=2.6513926370045096,n_estimators=300,num_leaves=46,reg_lambda=9.733031038389,subsample=0.1)
cbc = CatBoostClassifier(verbose=False,bagging_temperature=0.5356585211512814, depth=7,l2_leaf_reg=1,n_estimators=1200,random_strength=4.810440963747201,rsm=0.5,subsample=0.702836334143052)
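
The values above were copied by hand from the search output. As a hedged alternative, the sketch below shows how tuned settings can be transferred programmatically with `clone` and `set_params`. It uses scikit-learn's `RandomizedSearchCV` and `GradientBoostingClassifier` on synthetic data as stand-ins for `BayesSearchCV` and the boosted models in this notebook; since the notebook reads `best_params_` from `BayesSearchCV` in the same way, the pattern should carry over, but that is an assumption rather than something tested here.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Stand-in estimator and search; the grid is deliberately tiny to run quickly.
base = GradientBoostingClassifier(random_state=0)
search = RandomizedSearchCV(
    base,
    param_distributions={'n_estimators': [20, 40], 'max_depth': [1, 2, 3]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_toy, y_toy)

# Copy the winning settings onto a fresh, unfitted estimator.
tuned = clone(base).set_params(**search.best_params_)
print(tuned.get_params()['max_depth'], tuned.get_params()['n_estimators'])
```

Hard-coding the values, as the notebook does, has the advantage that later cells can be rerun without repeating the long search.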

In [None]:
clf_list = [('XGBoost',xgbc),('LightGBM',lgbmc),('CatBoost',cbc)]
for clf in clf_list:
    cross_val(clf[1],clf[0],best_14)

CatBoost has the strongest CV scores, and so we train and predict using it:

In [None]:
def train_predict(clf,features):
    clf.fit(X[features],y)
    preds = clf.predict(X_test[features])
    return preds

In [None]:
preds=train_predict(cbc,best_14)

In [None]:
def subm(preds, suff):
    df_test['Transported'] = preds.astype(bool)
    sub_df = df_test[['PassengerId','Transported']]
    sub_df.to_csv(f'sub-{suff}.csv', index=False)

subm(preds, 'CatBoost with 14 best features from backward SFS, and hyperparameters optimized using BayesSearchCV')
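
Before uploading, it can be worth reading the file back to confirm that it has the two columns written above and that 'Transported' round-trips as a boolean. The sketch below does this on made-up predictions, using an in-memory buffer in place of the real file; `sub_toy`, `buffer` and `check` are illustrative names.

```python
import io
import pandas as pd

# Made-up predictions for two passengers.
sub_toy = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01'],
    'Transported': [True, False],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
sub_toy.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

print(list(check.columns))
print(check.Transported.dtype)
```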