## Titanic - Machine Learning from Disaster

### 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place.

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the ["Join Competition" button](https://www.kaggle.com/account/login?returnUrl=%2Fc%2Ftitanic) to create an account and gain access to the [competition data](https://www.kaggle.com/c/titanic/data). Then check out [Alexis Cook’s Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial), which walks you through, step by step, how to make your first submission!

[![img](https://storage.googleapis.com/kaggle-media/welcome/video_thumbnail.jpg)](https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be)

### Data Description

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

**The training set** should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use [feature engineering](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/) to create new features.

**The test set** should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include **gender_submission.csv**, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
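The rule behind gender_submission.csv can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the code Kaggle used: the two sample rows are the first two passengers from the test set preview in this notebook, and the function names `gender_baseline` and `to_submission_csv` are assumptions made for the example.

```python
import csv
import io


def gender_baseline(rows):
    """Predict Survived = 1 for female passengers and 0 for everyone else."""
    return [(passenger_id, 1 if sex == "female" else 0) for passenger_id, sex in rows]


def to_submission_csv(predictions):
    """Render predictions in the two-column submission layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(predictions)
    return buffer.getvalue()


# (PassengerId, Sex) pairs for the first two test passengers.
sample = [(892, "male"), (893, "female")]
print(to_submission_csv(gender_baseline(sample)))
```

Any model's predictions can be written in the same layout: one row per test passenger, with the passenger id and a 0/1 label.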

### Data Dictionary

| **Variable** | **Definition**                             | **Key**                                        |
| :----------- | :----------------------------------------- | :--------------------------------------------- |
| survival     | Survival                                   | 0 = No, 1 = Yes                                |
| pclass       | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex          | Sex                                        |                                                |
| age          | Age in years                               |                                                |
| sibsp        | # of siblings / spouses aboard the Titanic |                                                |
| parch        | # of parents / children aboard the Titanic |                                                |
| ticket       | Ticket number                              |                                                |
| fare         | Passenger fare                             |                                                |
| cabin        | Cabin number                               |                                                |
| embarked     | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

**pclass**: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
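The age convention above can be turned into a small check. The sketch below is only one possible reading of the note: it assumes that values below 1 are genuine fractional ages and never estimates, and the helper name `is_estimated_age` is hypothetical.

```python
def is_estimated_age(age):
    """Return True when an age looks estimated: at least 1 and ending in .5."""
    if age is None or age < 1:
        # Missing ages and infant ages (fractions of a year) are not treated as estimates.
        return False
    return age % 1 == 0.5


for value in (34.5, 22.0, 0.42):
    print(value, is_estimated_age(value))
```

A flag like this could serve as an extra feature, although the notebook itself does not use one.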

**sibsp**: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Link: https://www.kaggle.com/c/titanic/overview  

**Help**

- https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
- https://www.kaggle.com/ash316/eda-to-prediction-dietanic 
- https://www.kaggle.com/pavlofesenko/simplest-top-10-titanic-0-80861

In [1]:
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from catboost import (
    CatBoostClassifier,
    CatBoostRegressor,
    Pool,
    sum_models,
    to_classifier,
)
from sklearn.model_selection import StratifiedKFold

In [2]:
%load_ext nb_black


In [3]:
le = preprocessing.LabelEncoder()


## Initial data loading

In [4]:
gender_submission = pd.read_csv(
    "../data/titanic/gender_submission.csv", index_col="PassengerId"
)
gender_submission

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0



In [5]:
train = pd.read_csv("../data/titanic/train.csv", index_col="PassengerId")
train

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C



In [6]:
test = pd.read_csv("../data/titanic/test.csv", index_col="PassengerId")
test

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...
1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S



Three data frames were provided as input. The training and test sets are combined into a single frame to make the rest of the work more convenient.

In [7]:
df = pd.concat([train, test])
# Test rows have no label; 0 is only a placeholder so the column can be cast to int.
df["Survived"] = df["Survived"].fillna(0).astype(int)
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,0,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S
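An alternative to filling the missing test labels with 0 is to leave them missing and record where each row came from. The sketch below uses tiny made-up frames, not the competition files, and the `source` column name is an assumption introduced for the example.

```python
import pandas as pd

# Miniature stand-ins for train.csv and test.csv (illustrative values only).
train_demo = pd.DataFrame(
    {"Survived": [0, 1], "Sex": ["male", "female"]},
    index=pd.Index([1, 2], name="PassengerId"),
)
test_demo = pd.DataFrame(
    {"Sex": ["male"]}, index=pd.Index([892], name="PassengerId")
)

# Tag each frame before stacking so the split can be undone later.
combined = pd.concat(
    [train_demo.assign(source="train"), test_demo.assign(source="test")]
)

# Test rows keep NaN in Survived instead of a placeholder label.
print(combined)
print(combined.loc[combined["source"] == "test", "Survived"].isna().all())
```

With this layout the training rows can be recovered with `combined[combined["source"] == "train"]`, and no artificial label can leak into later feature engineering.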



In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  1309 non-null   int32  
 1   Pclass    1309 non-null   int64  
 2   Name      1309 non-null   object 
 3   Sex       1309 non-null   object 
 4   Age       1046 non-null   float64
 5   SibSp     1309 non-null   int64  
 6   Parch     1309 non-null   int64  
 7   Ticket    1309 non-null   object 
 8   Fare      1308 non-null   float64
 9   Cabin     295 non-null    object 
 10  Embarked  1307 non-null   object 
dtypes: float64(2), int32(1), int64(3), object(5)
memory usage: 117.6+ KB



In [9]:
(df.isna().sum() / len(df)).sort_values()

Survived    0.000000
Pclass      0.000000
Name        0.000000
Sex         0.000000
SibSp       0.000000
Parch       0.000000
Ticket      0.000000
Fare        0.000764
Embarked    0.001528
Age         0.200917
Cabin       0.774637
dtype: float64


# Simplest Top 10% Titanic [0.80861]

Link: https://www.kaggle.com/code/pavlofesenko/simplest-top-10-titanic-0-80861

In [10]:
# "Boy" flags passengers titled "Master."; "Surname" is the text before the comma.
train["Boy"], test["Boy"] = [
    (frame.Name.str.split().str[1] == "Master.").astype("int")
    for frame in [train, test]
]
train["Surname"], test["Surname"] = [
    frame.Name.str.split(",").str[0] for frame in [train, test]
]

model = CatBoostClassifier(
    one_hot_max_size=4, iterations=100, random_seed=0, verbose=False
)

model.fit(
    train[["Sex", "Pclass", "Embarked", "Boy", "Surname"]].fillna(""),
    train["Survived"],
    cat_features=[0, 2, 4],  # column positions of Sex, Embarked, Surname
)

pred = model.predict(
    test[["Sex", "Pclass", "Embarked", "Boy", "Surname"]].fillna("")
).astype("int")

submission_X = pd.DataFrame({"PassengerId": test.index, "Survived": pred}).set_index(
    "PassengerId"
)
submission_X

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0
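The two string features used by this baseline can be illustrated on single names without pandas. This is a sketch of the same splitting rules; the helper names `surname` and `is_boy` are hypothetical, and the example names are taken from the outputs shown in this notebook.

```python
def surname(name):
    """Family name: everything before the first comma."""
    return name.split(",")[0]


def is_boy(name):
    """Return 1 if the second whitespace-separated token is 'Master.', else 0."""
    tokens = name.split()
    return int(len(tokens) > 1 and tokens[1] == "Master.")


for full_name in (
    "Braund, Mr. Owen Harris",
    "Johnson, Master. Harold Theodor",
    "Oliva y Ocana, Dona. Fermina",
):
    print(surname(full_name), is_boy(full_name))
```

The token-position rule only works when the surname is a single word. For a name such as "Oliva y Ocana, Dona. Fermina" the second token is "y", so a "Master." with a multi-word surname would not be flagged by this rule.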



# Field processing and feature generation

## Pclass

In [11]:
df["Pclass"].value_counts(dropna=False)

3    709
1    323
2    277
Name: Pclass, dtype: int64


## Name

In [12]:
df["Name"].value_counts(dropna=False)

Connolly, Miss. Kate                                   2
Kelly, Mr. James                                       2
Braund, Mr. Owen Harris                                1
Johnson, Master. Harold Theodor                        1
Gustafsson, Mr. Alfred Ossian                          1
                                                      ..
Carter, Miss. Lucile Polk                              1
Silvey, Mr. William Baird                              1
Kallio, Mr. Nikolai Erland                             1
Louch, Mrs. Charles Alexander (Alice Adelaide Slow)    1
Peter, Master. Michael J                               1
Name: Name, Length: 1307, dtype: int64


### Surname

In [13]:
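# Note: this keeps the given names after the title (e.g. "Owen Harris"), not the family name before the comma.
df["Surname"] = df["Name"].str.split(",").str[1].str.split(".").str[1].str.strip()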


In [14]:
cnt = df["Surname"].value_counts()
cnt[cnt == 1].index

Index(['Samuel L (Edwiga Grabowska)', 'Nathan', 'Douglas Bullen', 'Marin',
       'John Samuel', 'Mary Conover', 'Sigvard Harald Elias', 'Nourelain',
       'Thomas Parham', 'Ernest Courtenay (Lilian Hughes)',
       ...
       'William Baird', 'Lucile Polk', 'Doolina Margaret "Daisy"',
       'Sidney (Emily Hocking)', 'Mark', 'Johan Henrik Johannesson',
       'Benjamin (Esther Ada Bloomfield)', 'Leon', 'Johan Emil', 'Michael J'],
      dtype='object', length=1038)


In [15]:
df.loc[df["Surname"].isin(cnt[cnt == 1].index), "Surname"] = "X"
df["Surname"].value_counts()

X                1038
John               15
William            11
Patrick            10
James               9
                 ... 
Frank               2
Ernest              2
Juho                2
Charles Henry       2
John James          2
Name: Surname, Length: 89, dtype: int64
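Collapsing values that occur only once into a single placeholder can be sketched with `collections.Counter`. This plain-Python version mirrors the pandas logic above on a short made-up list; the function name `collapse_singletons` is an assumption.

```python
from collections import Counter


def collapse_singletons(values, placeholder="X"):
    """Replace every value that occurs exactly once with a shared placeholder."""
    counts = Counter(values)
    return [value if counts[value] > 1 else placeholder for value in values]


names = ["John", "William", "John", "Nathan", "William", "Marin"]
print(collapse_singletons(names))
```

Grouping rare values this way keeps the number of categories small, at the cost of treating all one-off values as if they were the same.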


In [16]:
df["Surname"] = le.fit_transform(df["Surname"])
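`LabelEncoder` assigns integer codes in the sorted order of the distinct values, and each new `fit_transform` call on the shared `le` object replaces the classes learned for the previous column. A plain-Python sketch of the same idea, keeping one mapping per column so that codes can still be decoded later, might look like this. `fit_codes` and the `encoders` dictionary are illustrative names, not part of scikit-learn, and the sample values are made up.

```python
def fit_codes(values):
    """Map each distinct value to its rank in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


columns = {
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
}

# One mapping per column, kept around for decoding.
encoders = {name: fit_codes(values) for name, values in columns.items()}
encoded = {
    name: [encoders[name][value] for value in values]
    for name, values in columns.items()
}

print(encoders["Sex"])
print(encoded["Embarked"])
```

Integer codes produced this way impose an arbitrary order on the categories, which tree-based models usually tolerate better than linear ones.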


### Prefix

In [17]:
df["Prefix"] = df["Name"].str.split(",").str[1].str.split(".").str[0].str.strip()


In [18]:
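# Group rare titles under the common ones.
# Note: the raw title is "the Countess", so the "Countess" entry below does not
# match and that single row stays unmapped (it appears in the counts that follow).
df["Prefix"] = (
    df["Prefix"]
    .replace(["Ms", "Mlle"], "Miss")
    .replace(["Mme", "Countess", "Lady", "Dona"], "Mrs")
    .replace(["Dr", "Major", "Col", "Sir", "Rev", "Jonkheer", "Capt", "Don"], "Mr")
)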


In [19]:
df["Is_boy"] = (df["Prefix"] == "Master").astype(int)


In [20]:
df["Prefix"].value_counts()

Mr              783
Miss            264
Mrs             200
Master           61
the Countess      1
Name: Prefix, dtype: int64
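The counts above still contain one "the Countess" row, because the extracted title includes the leading "the". A regex-based sketch that extracts the title and maps it with a dictionary keyed on the exact raw strings is shown below. It is an illustration, not the notebook's code: `TITLE_PATTERN`, `TITLE_GROUPS` and the two helpers are hypothetical names, and the last example name is invented to exercise the "the Countess" case.

```python
import re

# Title = text between the first comma and the following period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

# Same grouping as in the notebook, but keyed on "the Countess".
TITLE_GROUPS = {
    "Ms": "Miss", "Mlle": "Miss",
    "Mme": "Mrs", "the Countess": "Mrs", "Lady": "Mrs", "Dona": "Mrs",
    "Dr": "Mr", "Major": "Mr", "Col": "Mr", "Sir": "Mr", "Rev": "Mr",
    "Jonkheer": "Mr", "Capt": "Mr", "Don": "Mr",
}


def extract_title(name):
    """Return the raw title from a 'Surname, Title. Given names' string."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


def normalize_title(title):
    """Map rare titles to a common group; leave common titles unchanged."""
    return TITLE_GROUPS.get(title, title)


for full_name in (
    "Montvila, Rev. Juozas",
    "Heikkinen, Miss. Laina",
    "Example, the Countess. of (Jane Doe)",  # invented name
):
    print(normalize_title(extract_title(full_name)))
```

Using this mapping in the notebook would change the Prefix counts and possibly the downstream results, so the recorded outputs here correspond to the original grouping only.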


In [21]:
df["Prefix"] = le.fit_transform(df["Prefix"])


In [22]:
df.drop("Name", axis=1, inplace=True)


## Sex

In [23]:
df["Sex"].value_counts(dropna=False)

male      843
female    466
Name: Sex, dtype: int64


In [24]:
df["Sex"] = le.fit_transform(df["Sex"])


## SibSp/Parch

### Siblings/spouses aboard (SibSp)

In [25]:
df["SibSp"].value_counts()

0    891
1    319
2     42
4     22
3     20
8      9
5      6
Name: SibSp, dtype: int64


### Number of parents/children aboard (Parch)

In [26]:
df["Parch"].value_counts(dropna=False)

0    1002
1     170
2     113
3       8
5       6
4       6
6       2
9       2
Name: Parch, dtype: int64


In [27]:
df["Family_Size"] = 1 + df["Parch"] + df["SibSp"]
df["Family_Size"].value_counts()

1     790
2     235
3     159
4      43
6      25
5      22
7      16
11     11
8       8
Name: Family_Size, dtype: int64


In [28]:
df["Is_alone"] = df["Family_Size"].map(lambda x: 1 if x == 1 else 0)
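The two family features reduce to simple arithmetic on a single passenger. The sketch below restates them as a function; `family_features` is a hypothetical name, and the inputs are the SibSp/Parch values of PassengerId 1 and 3 from the training preview.

```python
def family_features(sibsp, parch):
    """Return (Family_Size, Is_alone) for one passenger."""
    family_size = 1 + sibsp + parch  # the passenger plus relatives aboard
    is_alone = int(family_size == 1)
    return family_size, is_alone


print(family_features(1, 0))  # PassengerId 1: one sibling/spouse aboard
print(family_features(0, 0))  # PassengerId 3: travelling without relatives
```

As the variable notes point out, a child travelling only with a nanny has parch=0, so "alone" here means "without recorded relatives", not necessarily unaccompanied.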


## Ticket

In [29]:
df["Ticket"].value_counts()

CA. 2343        11
CA 2144          8
1601             8
PC 17608         7
S.O.C. 14879     7
                ..
113792           1
36209            1
323592           1
315089           1
359309           1
Name: Ticket, Length: 929, dtype: int64


In [30]:
df["Ticket_Class"] = (
    df["Ticket"]
    .str.replace(".", "", regex=False)
    .str.replace("/", "", regex=False)
    .str.split(" ")
    .map(lambda x: "X" if x[0].isdigit() else x[0])
)

df["Ticket_Class"].value_counts()

X          957
PC          92
CA          68
A5          28
SOTONOQ     24
WC          15
SCPARIS     14
STONO       14
A4          10
FCC          9
C            8
SOC          8
SOPP         7
STONO2       7
SCParis      5
SCAH         5
PP           4
LINE         4
WEP          4
FC           3
SOTONO2      3
SCA4         2
SWPP         2
PPP          2
SC           2
SCA3         1
A            1
LP           1
AQ4          1
STONOQ       1
Fa           1
CASOTON      1
AS           1
SCOW         1
SOP          1
SP           1
AQ3          1
Name: Ticket_Class, dtype: int64
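The prefix extraction can be followed on individual ticket strings. The sketch below reproduces the same steps in plain Python and adds an optional case fold, which is an assumption of this example rather than something the notebook does. It would merge variants such as SCPARIS and SCParis that appear as separate categories in the counts above. The name `ticket_prefix` and the "SC/Paris 2123" string are illustrative.

```python
def ticket_prefix(ticket, fold_case=False):
    """Return the alphabetic ticket prefix, or 'X' for purely numeric tickets."""
    cleaned = ticket.replace(".", "").replace("/", "")
    first_token = cleaned.split(" ")[0]
    if first_token.isdigit():
        return "X"
    return first_token.upper() if fold_case else first_token


for ticket in ("A/5 21171", "113803", "STON/O2. 3101282", "SOTON/O.Q. 3101262"):
    print(ticket, "->", ticket_prefix(ticket))

# Optional normalization (not applied in the notebook).
print(ticket_prefix("SC/Paris 2123", fold_case=True))
```

Many of the resulting prefixes occur only once or twice, so grouping rare prefixes may be worth considering before encoding them.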


In [31]:
df.drop("Ticket", axis=1, inplace=True)


In [32]:
df["Ticket_Class"] = le.fit_transform(df["Ticket_Class"])


## Fare

In [33]:
df["Fare"].value_counts(dropna=False, normalize=True)

8.0500     0.045837
13.0000    0.045073
7.7500     0.042017
26.0000    0.038197
7.8958     0.037433
             ...   
26.2833    0.000764
14.0000    0.000764
15.0000    0.000764
6.2375     0.000764
7.7208     0.000764
Name: Fare, Length: 282, dtype: float64


In [34]:
df["Fare"] = df["Fare"].fillna(0)  # one fare is missing; 0 is used as a placeholder


In [35]:
df["Fare"].describe()

count    1309.000000
mean       33.270043
std        51.747063
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: Fare, dtype: float64
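Filling the single missing fare with 0 treats that passenger as having travelled for free. A common alternative, shown here only as a sketch and not as what this notebook does, is to substitute the median of the known fares. The function name is hypothetical, and the list uses four fares from the training preview plus one artificially blanked entry.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [value for value in values if value is not None]
    fallback = median(known)
    return [fallback if value is None else value for value in values]


fares = [7.25, 71.2833, None, 8.05, 53.1]
print(fill_missing_with_median(fares))
```

With only one missing value out of 1309, either choice is unlikely to move the results much. The median simply avoids introducing an artificial extreme value.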


## Embarked

In [36]:
df["Embarked"].value_counts(dropna=False)

S      914
C      270
Q      123
NaN      2
Name: Embarked, dtype: int64


In [37]:
df["Embarked"] = df["Embarked"].fillna("S")  # "S" is the most frequent port


In [38]:
df["Embarked"] = le.fit_transform(df["Embarked"])


## Age

In [39]:
df["Age"].isna().value_counts()

False    1046
True      263
Name: Age, dtype: int64


In [40]:
# Features for the age model; note that Survived is still among the 13 columns here.
X_age_train = df[~df["Age"].isna()].drop(["Age", "Cabin"], axis=1)
y_age_train = df[~df["Age"].isna()][["Age"]]
X_age_train.shape, y_age_train.shape

((1046, 13), (1046, 1))


In [41]:
X_age_test = df[df["Age"].isna()].drop(["Age", "Cabin"], axis=1)
X_age_test.shape

(263, 13)
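The 13 feature columns used for age imputation include `Survived`, which holds real labels for training rows and placeholder zeros for test rows, so the imputed ages can depend on the target. A hedged sketch of building the imputation features without the target is shown below on a tiny made-up frame. Whether this changes the final score is not something this notebook measures.

```python
import pandas as pd

# Miniature stand-in for the combined frame (illustrative values only).
demo = pd.DataFrame(
    {
        "Survived": [0, 1, 0, 0],
        "Pclass": [3, 1, 3, 2],
        "Age": [22.0, 38.0, None, None],
        "Cabin": [None, "C85", None, None],
    }
)

# Columns that should not be used to predict Age: the value being imputed,
# the still-unprocessed Cabin column, and the competition target.
excluded = ["Age", "Cabin", "Survived"]

known_age = demo["Age"].notna()
X_known = demo.loc[known_age].drop(columns=excluded)
y_known = demo.loc[known_age, "Age"]
X_missing = demo.loc[~known_age].drop(columns=excluded)

print(list(X_known.columns), X_known.shape, X_missing.shape)
```

The same consideration applies to any other column imputed from the combined frame.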


In [42]:
X_age_train, X_age_val, y_age_train, y_age_val = train_test_split(
    X_age_train, y_age_train, test_size=0.1, random_state=42
)
X_age_train.shape, X_age_val.shape, y_age_train.shape, y_age_val.shape

((941, 13), (105, 13), (941, 1), (105, 1))


In [43]:
model = CatBoostRegressor()


In [44]:
model.fit(
    Pool(X_age_train, y_age_train),
    eval_set=Pool(X_age_val, y_age_val),
    plot=True,
    verbose=False,
)


<catboost.core.CatBoostRegressor at 0x2da1b2ffcd0>


In [45]:
df_age = pd.DataFrame(
    {"PassengerId": X_age_test.index, "Age": model.predict(X_age_test)}
).set_index("PassengerId")
df_age

PassengerId,Age
6,35.862693
18,32.631904
20,28.739220
27,26.750433
29,23.285582
...,...
1300,23.334662
1302,23.288575
1305,30.672859
1308,34.394330



In [46]:
df["Age"] = df["Age"].fillna(df_age["Age"]).astype(int)  # fills by PassengerId; astype(int) truncates
df["Age"].describe()

count    1309.000000
mean       29.600458
std        13.440037
min         0.000000
25%        22.000000
50%        28.000000
75%        37.000000
max        80.000000
Name: Age, dtype: float64
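Two details of the fill step are easy to miss: `fillna` with a Series matches values by index label rather than by position, and `astype(int)` drops the fractional part instead of rounding. The sketch below demonstrates both on a four-row Series, using the predicted ages for PassengerId 6 and 18 from the output above. The known ages and the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([22.0, None, 26.0, None], index=[1, 6, 3, 18], name="Age")

# Predictions listed in a different order than the gaps they fill.
predicted = pd.Series([32.631904, 35.862693], index=[18, 6], name="Age")

filled = ages.fillna(predicted)  # matched on index labels 6 and 18

print(filled.astype(int).tolist())          # truncation
print(filled.round().astype(int).tolist())  # rounding to the nearest year
```

Truncation also turns every age below 1 into 0, which is why the minimum in the summary above is 0.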


## Cabin

In [47]:
df["Cabin"].isna().value_counts()

True     1014
False     295
Name: Cabin, dtype: int64


In [48]:
df["Cabin"].value_counts()

C23 C25 C27        6
G6                 5
B57 B59 B63 B66    5
C22 C26            4
F33                4
                  ..
A14                1
E63                1
E12                1
E38                1
C105               1
Name: Cabin, Length: 186, dtype: int64


In [49]:
df["Cabin_Class"] = df["Cabin"].str.get(0)
df["Cabin_Class"].value_counts(dropna=False).sort_index()

A        22
B        65
C        94
D        46
E        41
F        21
G         5
T         1
NaN    1014
Name: Cabin_Class, dtype: int64


In [50]:
X_cabin_train = df[~df["Cabin"].isna()].drop(["Cabin", "Cabin_Class"], axis=1)
y_cabin_train = df[~df["Cabin"].isna()][["Cabin_Class"]]
X_cabin_train.shape, y_cabin_train.shape

((295, 14), (295, 1))


In [51]:
X_cabin_test = df[df["Cabin"].isna()].drop(["Cabin", "Cabin_Class"], axis=1)
X_cabin_test.shape

(1014, 14)


In [52]:
X_cabin_train, X_cabin_val, y_cabin_train, y_cabin_val = train_test_split(
    X_cabin_train, y_cabin_train, test_size=0.1, random_state=42
)
X_cabin_train.shape, X_cabin_val.shape, y_cabin_train.shape, y_cabin_val.shape

((265, 14), (30, 14), (265, 1), (30, 1))


In [53]:
model = CatBoostClassifier(loss_function="MultiClass")


In [54]:
model.fit(
    Pool(X_cabin_train, y_cabin_train),
    eval_set=Pool(X_cabin_val, y_cabin_val),
    plot=True,
    verbose=False,
)


<catboost.core.CatBoostClassifier at 0x2da1b9a4b80>


In [55]:
df_cabin = pd.DataFrame(
    {
        "PassengerId": X_cabin_test.index,
        "Cabin_Class": model.predict(X_cabin_test).flatten(),
    }
).set_index("PassengerId")
df_cabin

PassengerId,Cabin_Class
1,F
3,E
5,F
6,F
8,F
...,...
1304,F
1305,F
1307,F
1308,F



In [56]:
df["Cabin_Class"] = df["Cabin_Class"].fillna(df_cabin["Cabin_Class"])
df["Cabin_Class"] = le.fit_transform(df["Cabin_Class"])
df["Cabin_Class"].value_counts()

5    760
2    127
3    117
1    103
4    100
6     67
0     34
7      1
Name: Cabin_Class, dtype: int64


In [57]:
df.drop("Cabin", axis=1, inplace=True)


# Splitting the data for training

In [58]:
df

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Surname,Prefix,Is_boy,Family_Size,Is_alone,Ticket_Class,Cabin_Class
1,0,3,1,22,1,0,7.2500,2,88,2,0,2,0,2,5
2,1,1,0,38,1,0,71.2833,0,88,3,0,2,0,14,2
3,1,3,0,26,0,0,7.9250,2,88,1,0,1,1,31,4
4,1,1,0,35,1,0,53.1000,2,88,3,0,2,0,36,2
5,0,3,1,35,0,0,8.0500,2,84,2,0,1,1,36,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,0,3,1,30,0,0,8.0500,2,88,2,0,1,1,2,5
1306,0,1,0,39,0,0,108.9000,0,88,3,0,1,1,14,2
1307,0,3,1,38,0,0,7.2500,2,88,2,0,1,1,28,5
1308,0,3,1,34,0,0,8.0500,2,28,2,0,1,1,36,5



In [59]:
y = df["Survived"]
X = df.drop("Survived", axis=1)
X.shape, y.shape

((1309, 14), (1309,))


In [60]:
X_test = X[X.index.isin(test.index)]

X = X[X.index.isin(train.index)]
y = y[y.index.isin(train.index)].astype(int)

X.shape, y.shape, X_test.shape

((891, 14), (891,), (418, 14))


In [61]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pclass        891 non-null    int64  
 1   Sex           891 non-null    int32  
 2   Age           891 non-null    int32  
 3   SibSp         891 non-null    int64  
 4   Parch         891 non-null    int64  
 5   Fare          891 non-null    float64
 6   Embarked      891 non-null    int32  
 7   Surname       891 non-null    int32  
 8   Prefix        891 non-null    int32  
 9   Is_boy        891 non-null    int32  
 10  Family_Size   891 non-null    int64  
 11  Is_alone      891 non-null    int64  
 12  Ticket_Class  891 non-null    int32  
 13  Cabin_Class   891 non-null    int32  
dtypes: float64(1), int32(8), int64(5)
memory usage: 76.6 KB



## Splitting into folds and building the ensemble

In [62]:
skf = StratifiedKFold(n_splits=5)


In [63]:
ensemble = []

for i, (train_index, val_index) in enumerate(skf.split(X, y)):
    X_fold_train, X_fold_val = X.iloc[train_index], X.iloc[val_index]
    y_fold_train, y_fold_val = y.iloc[train_index], y.iloc[val_index]

    model = CatBoostClassifier(
        eval_metric="Accuracy", one_hot_max_size=4, iterations=100, random_seed=i
    )

    model.fit(
        Pool(X_fold_train, y_fold_train),
        eval_set=Pool(X_fold_val, y_fold_val),
        verbose=False,
    )

    ensemble.append(model)
    print(model.best_score_)

{'learn': {'Accuracy': 0.8932584269662921, 'Logloss': 0.3041403988731532}, 'validation': {'Accuracy': 0.8547486033519553, 'Logloss': 0.3950184792262716}}
{'learn': {'Accuracy': 0.9046283309957924, 'Logloss': 0.2926385635881539}, 'validation': {'Accuracy': 0.8539325842696629, 'Logloss': 0.42595238700774535}}
{'learn': {'Accuracy': 0.8779803646563815, 'Logloss': 0.31785196858848497}, 'validation': {'Accuracy': 0.8820224719101124, 'Logloss': 0.3343120390260318}}
{'learn': {'Accuracy': 0.8990182328190743, 'Logloss': 0.29889468183195056}, 'validation': {'Accuracy': 0.8426966292134831, 'Logloss': 0.3772663242824674}}
{'learn': {'Accuracy': 0.8835904628330996, 'Logloss': 0.31836158204555415}, 'validation': {'Accuracy': 0.8932584269662921, 'Logloss': 0.3130150534290677}}
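The five per-fold validation accuracies printed above are easier to compare as a mean and a spread. The snippet below copies the values from that output and summarizes them with the standard library; the variable name is an assumption.

```python
from statistics import mean, stdev

# Validation accuracy of each fold, copied from the output above.
validation_accuracy = [
    0.8547486033519553,
    0.8539325842696629,
    0.8820224719101124,
    0.8426966292134831,
    0.8932584269662921,
]

print(f"mean={mean(validation_accuracy):.4f} std={stdev(validation_accuracy):.4f}")
```

These figures are likely to be optimistic: each fold's validation set is also passed as `eval_set`, and `best_score_` reports the best value reached on it during training, so the same data both selects and scores the model.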



In [64]:
models_avrg = sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
models_avrg

<catboost.core.CatBoost at 0x2da1b9ed100>
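Conceptually, blending the fold models with equal weights amounts to averaging their raw outputs before turning them into a class. The sketch below illustrates that arithmetic for one passenger. It assumes the models emit raw log-odds and that the blended model's raw score is the weighted sum of the individual raw scores. The scores are invented, and none of the function names come from CatBoost.

```python
import math


def blend_raw_scores(raw_scores, weights):
    """Weighted sum of per-model raw scores for one sample."""
    return sum(weight * score for weight, score in zip(weights, raw_scores))


def sigmoid(x):
    """Convert a raw log-odds score into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical log-odds from five fold models for a single passenger.
raw_scores = [1.2, -0.4, 0.7, 0.1, -0.3]
weights = [1.0 / len(raw_scores)] * len(raw_scores)

blended = blend_raw_scores(raw_scores, weights)
probability = sigmoid(blended)
label = int(probability > 0.5)

print(blended, probability, label)
```

Averaging in log-odds space is not the same as averaging probabilities or taking a majority vote, so the blended label can differ from what most individual models would predict.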


In [65]:
models_avrg.get_feature_importance()

array([ 8.51130151, 42.85706891,  7.4885993 ,  2.19388614,  0.64145053,
        3.09017294,  2.99091119,  0.88531264,  6.15068699,  1.14115244,
        3.17274078,  0.28745825,  2.5367527 , 18.05250568])


In [66]:
pd.DataFrame(
    {"Column": X_test.columns, "Score": models_avrg.get_feature_importance(),}
).sort_values(by="Score", ascending=False)

,Column,Score
1,Sex,42.857069
13,Cabin_Class,18.052506
0,Pclass,8.511302
2,Age,7.488599
8,Prefix,6.150687
10,Family_Size,3.172741
5,Fare,3.090173
6,Embarked,2.990911
12,Ticket_Class,2.536753
3,SibSp,2.193886
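The ranking above can be reproduced without pandas by pairing the column names with the importance array. The values below are copied from the outputs in this notebook. The check that they sum to 100 reflects these particular numbers and is not a claim about every importance type.

```python
columns = [
    "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
    "Surname", "Prefix", "Is_boy", "Family_Size", "Is_alone",
    "Ticket_Class", "Cabin_Class",
]
scores = [
    8.51130151, 42.85706891, 7.4885993, 2.19388614, 0.64145053,
    3.09017294, 2.99091119, 0.88531264, 6.15068699, 1.14115244,
    3.17274078, 0.28745825, 2.5367527, 18.05250568,
]

# Sort (name, score) pairs from most to least important.
ranked = sorted(zip(columns, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked[:5]:
    print(f"{name:<12} {score:6.2f}")
print(f"total = {sum(scores):.2f}")
```

Sex and Cabin_Class together account for roughly 61 of the 100 points. Since Cabin_Class was itself imputed for most passengers, its high rank deserves some caution.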



# Predicting the results

In [67]:
y_preds_avrg = to_classifier(models_avrg).predict(X_test)  # 0.78468
y_preds_avrg

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,


In [68]:
submission = pd.DataFrame(
    {"PassengerId": X_test.index, "Survived": y_preds_avrg}
).set_index("PassengerId")
submission

PassengerId,Survived
892,0
893,0
894,0
895,0
896,0
...,...
1305,0
1306,1
1307,0
1308,0



In [69]:
(submission["Survived"] == submission_X["Survived"]).value_counts()

True     388
False     30
Name: Survived, dtype: int64
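The comparison above counts how often the two submissions agree. The same measure can be written as a small function. The example applies it to the first five predictions of each submission as shown in this notebook, and then restates the overall counts as a rate. `agreement_rate` is a hypothetical helper name.

```python
def agreement_rate(first, second):
    """Fraction of positions where two prediction lists hold the same label."""
    if len(first) != len(second):
        raise ValueError("prediction lists must have the same length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# PassengerId 892-896: baseline (submission_X) vs. ensemble (submission).
baseline_head = [0, 1, 0, 0, 1]
ensemble_head = [0, 0, 0, 0, 0]
print(agreement_rate(baseline_head, ensemble_head))

# Overall agreement from the counts above: 388 matches out of 418 rows.
print(round(388 / 418, 3))
```

Agreement between two submissions says nothing about which one is more accurate. It only shows how many predictions would change when switching between them.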


In [70]:
submission.value_counts()

Survived
0           284
1           134
dtype: int64


In [71]:
submission.to_csv("../data/titanic/submission.csv")


In [72]:
submission_X.to_csv("../data/titanic/submission_X.csv")
