## Titanic - Machine Learning from Disaster

### 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place.

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the ["Join Competition" button](https://www.kaggle.com/account/login?returnUrl=%2Fc%2Ftitanic) to create an account and gain access to the [competition data](https://www.kaggle.com/c/titanic/data). Then check out [Alexis Cook’s Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial), which walks you step by step through making your first submission!

[![img](https://storage.googleapis.com/kaggle-media/welcome/video_thumbnail.jpg)](https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be)

### Data Description

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

**The training set** should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use [feature engineering](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/) to create new features.

**The test set** should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include **gender_submission.csv**, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
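
The rule behind gender_submission.csv can be sketched in a few lines of pandas. The tiny frame below is made-up illustrative data (not rows from test.csv), and the column names `PassengerId`, `Sex`, and `Survived` are assumed to follow the competition files.

```python
import pandas as pd

# Hypothetical stand-in for test.csv; the real file has 418 rows.
test = pd.DataFrame(
    {
        "PassengerId": [1001, 1002, 1003, 1004],
        "Sex": ["male", "female", "female", "male"],
    }
)

# Baseline rule: predict survival for every female passenger and for no one else.
baseline = pd.DataFrame(
    {
        "PassengerId": test["PassengerId"],
        "Survived": (test["Sex"] == "female").astype(int),
    }
)
print(baseline)
```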

### Data Dictionary

| **Variable** | **Definition**                             | **Key**                                        |
| :----------- | :----------------------------------------- | :--------------------------------------------- |
| survival     | Survival                                   | 0 = No, 1 = Yes                                |
| pclass       | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex          | Sex                                        |                                                |
| age          | Age in years                               |                                                |
| sibsp        | # of siblings / spouses aboard the Titanic |                                                |
| parch        | # of parents / children aboard the Titanic |                                                |
| ticket       | Ticket number                              |                                                |
| fare         | Passenger fare                             |                                                |
| cabin        | Cabin number                               |                                                |
| embarked     | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

**pclass**: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
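
The notes above translate into simple derived columns. The sketch below uses made-up rows and assumes the raw column names `Age`, `SibSp`, and `Parch`. The definition `FamilySize = SibSp + Parch + 1` is a common convention and an assumption here, not something the data description states.

```python
import pandas as pd

# Made-up passengers for illustration only.
raw = pd.DataFrame(
    {
        "Age": [0.75, 22.0, 28.5, 40.0],
        "SibSp": [1, 1, 0, 0],
        "Parch": [2, 0, 0, 0],
    }
)

# Ages below 1 are fractional by definition, so only flag xx.5 values from 1 upward.
raw["AgeIsEstimated"] = (raw["Age"] >= 1) & (raw["Age"] % 1 == 0.5)

# The passenger plus siblings/spouses plus parents/children (assumed convention).
raw["FamilySize"] = raw["SibSp"] + raw["Parch"] + 1
print(raw)
```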

Link: https://www.kaggle.com/c/titanic/overview  

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split
from catboost import (
    CatBoostClassifier,
    Pool,
    sum_models,
    to_classifier,
)

In [2]:
%load_ext nb_black

## Initial data loading

In [3]:
df = pd.read_csv("../../data/titanic/prepare.csv", index_col="PassengerId")
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Title,FamilySize,TicketSeries,NameSeries,SurnameSeries,TitleSeries,IsTest,isAgePreds,CabinSeries,isCabinSeriesPreds
1,0.0,3,Owen Harris,1,22.000000,1,0,A/5 21171,-0.503595,,...,Mr,2,2,855,100,12,0,0,5,1
2,1.0,1,John Bradley (Florence Briggs Thayer),0,38.000000,1,0,PC 17599,0.734503,C85,...,Mrs,2,17,601,182,13,0,0,2,0
3,1.0,3,Laina,0,26.000000,0,0,STON/O2. 3101282,-0.490544,,...,Miss,1,34,690,329,9,0,0,4,1
4,1.0,1,Jacques Heath (Lily May Peel),0,35.000000,1,0,113803,0.382925,C123,...,Mrs,2,40,541,267,13,0,0,2,0
5,0.0,3,William Henry,1,35.000000,0,0,373450,-0.488127,,...,Mr,1,40,1102,15,12,0,0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,Woolf,1,26.994279,0,0,A.5. 3236,-0.488127,,...,Mr,1,5,1119,753,12,1,1,5,1
1306,,1,Fermina,0,39.000000,0,0,PC 17758,1.461829,C105,...,Dona,1,17,366,593,3,1,0,2,0
1307,,3,Simon Sivertsen,1,38.500000,0,0,SOTON/O.Q. 3101262,-0.503595,,...,Mr,1,31,973,699,12,1,0,5,1
1308,,3,Frederick,1,31.073377,0,0,359309,-0.488127,,...,Mr,1,40,390,827,12,1,1,5,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Survived            891 non-null    float64
 1   Pclass              1309 non-null   int64  
 2   Name                1309 non-null   object 
 3   Sex                 1309 non-null   int64  
 4   Age                 1309 non-null   float64
 5   SibSp               1309 non-null   int64  
 6   Parch               1309 non-null   int64  
 7   Ticket              1309 non-null   object 
 8   Fare                1309 non-null   float64
 9   Cabin               295 non-null    object 
 10  Embarked            1309 non-null   int64  
 11  FullName            1309 non-null   object 
 12  Surname             1309 non-null   object 
 13  Title               1309 non-null   object 
 14  FamilySize          1309 non-null   int64  
 15  TicketSeries        1309 non-null   int64  
 16  NameSe

# Prepare

In [5]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Title,FamilySize,TicketSeries,NameSeries,SurnameSeries,TitleSeries,IsTest,isAgePreds,CabinSeries,isCabinSeriesPreds
1,0.0,3,Owen Harris,1,22.000000,1,0,A/5 21171,-0.503595,,...,Mr,2,2,855,100,12,0,0,5,1
2,1.0,1,John Bradley (Florence Briggs Thayer),0,38.000000,1,0,PC 17599,0.734503,C85,...,Mrs,2,17,601,182,13,0,0,2,0
3,1.0,3,Laina,0,26.000000,0,0,STON/O2. 3101282,-0.490544,,...,Miss,1,34,690,329,9,0,0,4,1
4,1.0,1,Jacques Heath (Lily May Peel),0,35.000000,1,0,113803,0.382925,C123,...,Mrs,2,40,541,267,13,0,0,2,0
5,0.0,3,William Henry,1,35.000000,0,0,373450,-0.488127,,...,Mr,1,40,1102,15,12,0,0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,Woolf,1,26.994279,0,0,A.5. 3236,-0.488127,,...,Mr,1,5,1119,753,12,1,1,5,1
1306,,1,Fermina,0,39.000000,0,0,PC 17758,1.461829,C105,...,Dona,1,17,366,593,3,1,0,2,0
1307,,3,Simon Sivertsen,1,38.500000,0,0,SOTON/O.Q. 3101262,-0.503595,,...,Mr,1,31,973,699,12,1,0,5,1
1308,,3,Frederick,1,31.073377,0,0,359309,-0.488127,,...,Mr,1,40,390,827,12,1,1,5,1


In [6]:
X = df.drop(
    [
        "Survived",
        "Name",
        "Ticket",
        "Cabin",
        "FullName",
        "Surname",
        "Title",
        "IsTest",
        "isAgePreds",
        "isCabinSeriesPreds",
    ],
    axis=1,
)
X

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,FamilySize,TicketSeries,NameSeries,SurnameSeries,TitleSeries,CabinSeries
1,3,1,22.000000,1,0,-0.503595,2,2,2,855,100,12,5
2,1,0,38.000000,1,0,0.734503,0,2,17,601,182,13,2
3,3,0,26.000000,0,0,-0.490544,2,1,34,690,329,9,4
4,1,0,35.000000,1,0,0.382925,2,2,40,541,267,13,2
5,3,1,35.000000,0,0,-0.488127,2,1,40,1102,15,12,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3,1,26.994279,0,0,-0.488127,2,1,5,1119,753,12,5
1306,1,0,39.000000,0,0,1.461829,0,1,17,366,593,3,2
1307,3,1,38.500000,0,0,-0.503595,2,1,31,973,699,12,5
1308,3,1,31.073377,0,0,-0.488127,2,1,40,390,827,12,5


In [7]:
y = df[["Survived"]]
y.value_counts(dropna=False)

Survived
0.0         549
NaN         418
1.0         342
dtype: int64

In [8]:
train_index = y[~y["Survived"].isna()].index
train_index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            882, 883, 884, 885, 886, 887, 888, 889, 890, 891],
           dtype='int64', name='PassengerId', length=891)

In [9]:
X_train = X[X.index.isin(train_index)]
X_test = X[~X.index.isin(train_index)]

y_train = y[y.index.isin(train_index)]

X_train.shape, X_test.shape, y_train.shape

((891, 13), (418, 13), (891, 1))
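
The same labelled/unlabelled split can be written with a single boolean mask. A minimal sketch on made-up data, assuming only that unlabelled (test) rows carry NaN in `Survived`:

```python
import numpy as np
import pandas as pd

# Toy frame: two labelled rows and two unlabelled rows.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3, 2], "Survived": [0.0, 1.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

labelled = toy["Survived"].notna()  # True for rows with a known outcome

X_labelled = toy.loc[labelled].drop(columns="Survived")
y_labelled = toy.loc[labelled, "Survived"].astype(int)
X_unlabelled = toy.loc[~labelled].drop(columns="Survived")

print(X_labelled.shape, X_unlabelled.shape, y_labelled.tolist())
```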

In [10]:
X_train, X_true, y_train, y_true = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((801, 13), (90, 13), (801, 1), (90, 1))
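
The hold-out split above does not pass `stratify`, so the survival rate in the 90-row hold-out can drift from the rate in the training part. A hedged sketch of a stratified variant on synthetic labels with the same 549/342 class balance (the features are random placeholders, not the Titanic columns):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_demo = np.array([0] * 549 + [1] * 342)    # class counts from the notebook
X_demo = rng.normal(size=(len(y_demo), 3))  # placeholder features

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# With stratify, both parts keep roughly the overall survival rate (342 / 891).
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_ho.mean(), 3))
```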

# Train

In [11]:
skf = StratifiedKFold(n_splits=5)

In [12]:
ensemble = []

for fold_train_idx, fold_val_idx in skf.split(X_train, y_train):
    # fold_* names avoid shadowing the earlier train_index of labelled rows
    X_sub_train, X_sub_val = X_train.iloc[fold_train_idx], X_train.iloc[fold_val_idx]
    y_sub_train, y_sub_val = y_train.iloc[fold_train_idx], y_train.iloc[fold_val_idx]

    model = CatBoostClassifier()

    model.fit(
        Pool(X_sub_train, y_sub_train),
        eval_set=Pool(X_sub_val, y_sub_val),
        verbose=False,
    )

    ensemble.append(model)
    print(model.best_score_)

{'learn': {'Logloss': 0.04144909381866455}, 'validation': {'Logloss': 0.39303266479286475}}
{'learn': {'Logloss': 0.054608813290848934}, 'validation': {'Logloss': 0.44493270396210916}}
{'learn': {'Logloss': 0.05476003343038924}, 'validation': {'Logloss': 0.325476066507587}}
{'learn': {'Logloss': 0.05555668419485624}, 'validation': {'Logloss': 0.4295912293065287}}
{'learn': {'Logloss': 0.05457798958219902}, 'validation': {'Logloss': 0.3543510821864202}}
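
The `Logloss` values printed above are binary cross-entropy averaged over rows. In every fold the learn loss is far below the validation loss, which may point to overfitting. A small sketch of the formula on made-up probabilities, written with NumPy only:

```python
import numpy as np

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for fully confident predictions.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

labels = [1, 0, 1, 0]
confident = [0.95, 0.05, 0.90, 0.10]  # made-up predicted probabilities
hesitant = [0.60, 0.40, 0.55, 0.45]

print(binary_logloss(labels, confident))  # lower: confident and correct
print(binary_logloss(labels, hesitant))   # higher: correct but unsure
```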


In [13]:
models_avrg = sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
models_avrg

<catboost.core.CatBoost at 0x7f6bba3c7ca0>
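
As far as the CatBoost documentation describes it, `sum_models` blends models by weighting their leaf values, so the merged model's raw output is a weighted sum of the individual raw outputs (log-odds for a classifier). Equal weights of 1/5 then amount to averaging log-odds rather than probabilities. A NumPy sketch of that arithmetic with made-up per-fold scores (no CatBoost calls involved):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores (log-odds): 5 fold models x 3 passengers.
raw_scores = np.array(
    [
        [2.0, -1.0, 0.3],
        [1.5, -0.5, -0.2],
        [2.5, -1.5, 0.1],
        [1.0, -2.0, -0.4],
        [3.0, -1.0, 0.4],
    ]
)
weights = np.full(len(raw_scores), 1.0 / len(raw_scores))

blended_raw = weights @ raw_scores          # weighted sum of log-odds
blended_proba = sigmoid(blended_raw)        # probability from the blended score
labels = (blended_proba > 0.5).astype(int)  # equivalent to blended_raw > 0

print(blended_raw, blended_proba.round(3), labels)
```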

In [14]:
models_avrg.get_feature_importance()

array([11.33028341, 25.79584303, 10.78633536,  2.39988041,  1.5656157 ,
        7.53589078,  4.14638897,  4.04107384,  4.29316402,  8.73023567,
        9.82632591,  4.89756951,  4.6513934 ])

In [15]:
pd.DataFrame(
    {
        "Column": X_test.columns,
        "Score": models_avrg.get_feature_importance(),
    }
).sort_values(by="Score", ascending=False)

,Column,Score
1,Sex,25.795843
0,Pclass,11.330283
2,Age,10.786335
10,SurnameSeries,9.826326
9,NameSeries,8.730236
5,Fare,7.535891
11,TitleSeries,4.89757
12,CabinSeries,4.651393
8,TicketSeries,4.293164
6,Embarked,4.146389
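
The table above stops after ten rows. Pairing the thirteen scores printed by In [14] with the column order of `X` recovers the remaining three features and shows that the scores add up to roughly 100, so each can be read as a percentage share. A short sketch using only values copied from the outputs above:

```python
import numpy as np
import pandas as pd

# Column order of X and scores as printed earlier in the notebook.
columns = [
    "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
    "FamilySize", "TicketSeries", "NameSeries", "SurnameSeries",
    "TitleSeries", "CabinSeries",
]
scores = np.array(
    [11.33028341, 25.79584303, 10.78633536, 2.39988041, 1.5656157,
     7.53589078, 4.14638897, 4.04107384, 4.29316402, 8.73023567,
     9.82632591, 4.89756951, 4.6513934]
)

importance = pd.Series(scores, index=columns).sort_values(ascending=False)
print(importance.round(2))
print(importance.sum())  # close to 100
```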


# Validate

In [16]:
y_preds_1 = to_classifier(models_avrg).predict(X_true)
y_preds_1

array([1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
       1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1.,
       0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
       1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0.,
       0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0.,
       0., 1., 1., 0., 0.])

In [17]:
(y_true["Survived"] == y_preds_1).sum() / len(y_true)

0.8333333333333334
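
The expression in In [17] is plain accuracy: the share of hold-out rows where prediction and label agree. A value of 0.8333 on 90 rows corresponds to 75 correct predictions. A sketch on made-up labels showing the manual formula next to `sklearn.metrics.accuracy_score`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, not the notebook's hold-out set.
y_demo = pd.Series([1, 0, 0, 1, 1, 0], name="Survived")
preds = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# .to_numpy() compares by position and sidesteps index-alignment surprises.
manual = (y_demo.to_numpy() == preds).mean()
print(manual, accuracy_score(y_demo, preds))
```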

# Submission

In [18]:
y_preds_avrg = to_classifier(models_avrg).predict(X_test).astype(int)
y_preds_avrg

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [19]:
submission = pd.DataFrame(
    {"PassengerId": X_test.index, "Survived": y_preds_avrg}
).set_index("PassengerId")
submission

PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0


In [20]:
submission.to_csv("../../data/titanic/submission.csv")
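
Before uploading, it can help to sanity-check the file. The checks below reflect what the notebook shows (418 test passengers, unique IDs, 0/1 integer labels). The helper name `check_submission` is hypothetical, and the example runs on an in-memory frame rather than the real file.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(sub.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(sub) == expected_rows, "wrong number of rows"
    assert sub["PassengerId"].is_unique, "duplicate passenger ids"
    assert set(sub["Survived"].unique()) <= {0, 1}, "labels must be 0 or 1"

# Made-up example: 418 rows with ids 892..1309 and alternating labels.
demo = pd.DataFrame(
    {"PassengerId": range(892, 1310), "Survived": [0, 1] * 209}
)
check_submission(demo, expected_rows=418)
print("submission looks well-formed")
```

For the real file, the same helper could be applied to the result of `pd.read_csv` on the path written above.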
