# Titanic

In this tutorial we show how, in just a few steps, one can approach a canonical competition (and dataset) hosted on [kaggle](https://www.kaggle.com), a widely known platform for data science competitions. On this toy problem we explore some basic use cases of CatBoost, such as model training and prediction, as well as some useful features like training visualization and model tuning.

[Original](https://github.com/catboost/catboost/blob/master/catboost/tutorials/kaggle_titanic_catboost_demo.ipynb)

In [16]:
import numpy as np
import pandas as pd
# CatboostIpythonWidget is not used below and may be missing from recent CatBoost releases.
from catboost import Pool, CatBoostClassifier, cv

In [None]:
! wget https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv

In [57]:
categorical_features = ['pclass', 'name', 'sex', 'sibsp', 'parch', 'ticket',
          'cabin', 'embarked']

In [58]:
column_types = {feature: "category" for feature in categorical_features}
train_df = pd.read_csv('train.csv', dtype=column_types)
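
To make the effect of the `dtype` mapping concrete, here is a minimal sketch on three rows copied from the preview in this tutorial (only a few columns are kept, and the in-memory `csv_text` stands in for `train.csv`). Columns named in the mapping come back as pandas categoricals, while the remaining columns keep their inferred types.

```python
import io
import pandas as pd

# Three rows copied from the tutorial's preview; only a few columns are kept.
csv_text = """survived,pclass,sex,age,embarked
0,3,male,22.0,S
1,1,female,38.0,C
1,3,female,26.0,S
"""

categorical = ["pclass", "sex", "embarked"]
column_types = {name: "category" for name in categorical}
sample_df = pd.read_csv(io.StringIO(csv_text), dtype=column_types)

# Columns named in the mapping become categoricals; the rest are inferred.
print(sample_df.dtypes)
# A categorical column stores its distinct values once, as categories.
print(list(sample_df["sex"].cat.categories))
```

Note that even a numeric-looking column such as `pclass` is treated as a set of labels once it is read this way.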

In [59]:
train_df.head()

| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


# Feature preparation

First of all, let's check how many missing values we have. As the counts below show, `age`, `cabin` and `embarked` do have some missing values, so we fill them with a value well outside their distributions; that way the model can easily tell them apart and take this into account. After that we separate the features from the target variable.

In [60]:
train_df.isnull().sum(axis=0)

survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

In [61]:
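for column in ("cabin", "embarked"):
    # "NA" must be registered as a category before it can be used as a fill value.
    train_df[column] = train_df[column].cat.add_categories(["NA"]).fillna("NA")
# -1 lies outside the range of real ages, so it marks missing values.
train_df["age"] = train_df["age"].fillna(-1)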
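
The same counting and filling steps can be tried on a toy frame. This is a minimal sketch: the `toy` data is made up for illustration, and it uses plain assignment rather than `inplace=True`, which is an assumption that suits recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three kinds of gaps as the Titanic data.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "cabin": pd.Categorical([None, "C85", None]),
    "embarked": pd.Categorical(["S", "C", None]),
})

# Count missing values per column before filling.
missing_before = toy.isnull().sum(axis=0)

for column in ("cabin", "embarked"):
    # A categorical column rejects fill values that are not registered
    # categories, so "NA" has to be added first.
    toy[column] = toy[column].cat.add_categories(["NA"]).fillna("NA")

# -1 lies outside the range of real ages, so it acts as a "missing" flag.
toy["age"] = toy["age"].fillna(-1)

missing_after = toy.isnull().sum(axis=0)
print(missing_before.to_dict())
print(missing_after.to_dict())
```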

In [62]:
X = train_df.drop('survived', axis=1)
y = train_df.survived
# Series.nonzero() was removed from pandas; np.where returns the same positions.
categorical_features_indices = np.where(X.dtypes == "category")[0]
X[categorical_features] = X[categorical_features].apply(lambda v: v.cat.codes)

Note that our features are of different types: some are numeric, some are categorical, and some are plain strings, which would normally need special handling (for example, a bag-of-words encoding). In our case, however, we can simply treat these string features as categorical ones.
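
Here is a small sketch of what treating strings as categorical means in practice. The `toy_X` frame is made up for illustration: we locate the positions of the categorical columns (the kind of index list passed as `cat_features`) and replace each value with its integer category code.

```python
import numpy as np
import pandas as pd

toy_X = pd.DataFrame({
    "name": pd.Categorical(["Braund", "Cumings", "Heikkinen"]),
    "age": [22.0, 38.0, 26.0],
    "sex": pd.Categorical(["male", "female", "female"]),
})

# Positions of the categorical columns within the frame.
cat_idx = np.where(toy_X.dtypes == "category")[0]

# Replace each category by its integer code, column by column.
# The codes are arbitrary labels, not magnitudes.
encoded = toy_X.copy()
for column in toy_X.columns[cat_idx]:
    encoded[column] = toy_X[column].cat.codes

print(cat_idx)
print(encoded)
```

Because the codes carry no ordering, it matters that the model is told which columns are categorical; otherwise they would be read as ordinary numbers.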

# Model training (CatBoost time!)

First of all, let's split our training data into train and validation sets:

In [63]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X.values, y.values,
                                                                train_size=0.85, random_state=1234)
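
It helps to know how large the two parts are. The sketch below repeats the split on placeholder arrays, assuming the standard Kaggle training file with 891 rows (the row count is not printed anywhere in this tutorial, so treat it as an assumption).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 891 rows (assumed size of train.csv); only shapes matter.
n_rows = 891
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows) % 2

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.85, random_state=1234
)

# Accuracy on the validation part can only move in steps of 1 / len(y_val).
print(len(y_tr), len(y_val), 1 / len(y_val))
```

With a validation set this small, a single example changes accuracy by a little under one percentage point, which is worth keeping in mind when comparing models.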

Now let's create the model itself. I will go with the default parameters here (they provide a _really_ good baseline almost all the time). The only thing I'd like to specify is the `custom_loss` parameter: it lets me see what is going on in terms of the competition metric, accuracy, while still watching logloss, which is smoother on a dataset of this size.

In [64]:
cat_model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42
)
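
To see why logloss is the smoother of the two metrics, consider this small sketch with made-up probabilities. Both sets of predictions classify every example correctly, so accuracy cannot tell them apart, while logloss still rewards the more confident one.

```python
from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 0, 1, 1]

# Two hypothetical sets of predicted probabilities for the positive class.
# Both put every example on the correct side of 0.5.
hesitant = [0.45, 0.40, 0.55, 0.60]
confident = [0.05, 0.10, 0.90, 0.95]

results = {}
for name, probs in (("hesitant", hesitant), ("confident", confident)):
    labels = [int(p >= 0.5) for p in probs]
    results[name] = (accuracy_score(y_true, labels), log_loss(y_true, probs))
    print(name, results[name])
```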

In [65]:
cat_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    # verbose=True,
    # plot=False # doesn't work in Python 3
);

In [66]:
from sklearn.metrics import roc_auc_score

In [67]:
roc_auc_score(y_validation, cat_model.predict_proba(X_validation)[:, 1])

0.8751438434982739
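
The number above is ROC AUC computed from the predicted probability of the positive class. One way to read it: the share of (survivor, non-survivor) pairs in which the survivor gets the higher score. A minimal sketch on made-up scores, with no ties, checks this reading against `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(scores, y_true) if y == 1]
negatives = [s for s, y in zip(scores, y_true) if y == 0]

# Share of positive/negative pairs in which the positive is ranked higher.
pairs = [(p, n) for p in positives for n in negatives]
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(manual_auc, roc_auc_score(y_true, scores))
```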

# Early stopping

If you have a validation set (and you should), it is usually easier and better to use early stopping and make test predictions with the best model:

In [68]:
model_simple = CatBoostClassifier(
    eval_metric='Accuracy',
    use_best_model=False,
    random_seed=42,
    iterations=1000
)

model_with_earlystop = CatBoostClassifier(
    eval_metric='Accuracy',
    use_best_model=True,
    random_seed=42,
    iterations=1000
)

model_simple.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
)

model_with_earlystop.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
);

In [69]:
from sklearn.metrics import accuracy_score

print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model_simple.predict(X_validation))
))

print('Early-stopped model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model_with_earlystop.predict(X_validation))
))

Simple model validation accuracy: 0.8284
Early-stopped model validation accuracy: 0.8433


Although a simple validation scheme does not precisely describe the model's out-of-train score (it may be biased by the particular dataset split), it is still useful for tracking how the model improves during training. As this example shows, it really pays to stop the boosting process early, before overfitting kicks in.
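
The idea behind picking the best model can be sketched in plain Python. This is an illustration of the logic only, not CatBoost's implementation; the `history` values and the `patience` parameter are hypothetical. `best_iteration` mirrors what `use_best_model=True` does conceptually (keep the trees up to the best validation score), and `stop_iteration` shows patience-based stopping, where training halts once the score has not improved for a given number of rounds.

```python
def best_iteration(metric_per_iteration):
    """Index of the first iteration with the highest validation metric."""
    return metric_per_iteration.index(max(metric_per_iteration))


def stop_iteration(metric_per_iteration, patience):
    """Iteration at which training halts: `patience` rounds without a new best."""
    best, best_at = float("-inf"), -1
    for i, value in enumerate(metric_per_iteration):
        if value > best:
            best, best_at = value, i
        elif i - best_at >= patience:
            return i
    # The metric kept improving often enough; training ran to the end.
    return len(metric_per_iteration) - 1


# Hypothetical validation accuracies, one per boosting iteration.
history = [0.78, 0.81, 0.84, 0.83, 0.84, 0.82, 0.82, 0.81]

print(best_iteration(history))
print(stop_iteration(history, patience=3))
```

Note the difference between the two: selecting the best iteration after the fact still pays for every boosting round, while patience-based stopping also saves training time.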