# Titanic with CatBoost

## Questions
1) What type of boosting is it?
    * Gradient boosting on decision trees. "It provides a gradient boosting framework which among other features attempts to solve for Categorical features using a permutation driven alternative compared to the classical algorithm." -Wikipedia
2) Is encoding needed for categorical values?
    * No. Columns declared through the `cat_features` parameter are encoded internally by CatBoost.
3) Can it handle missing values (NaN/None etc.)?
    * Yes, for numeric features. The `nan_mode` parameter decides whether NaN raises an error (`Forbidden`) or is treated as the minimum (`Min`, the default) or the maximum (`Max`).
4) How long does it take to train a model?
    * Roughly 0.7 to 2 seconds on the plain Titanic data with default settings.
5) How many hyperparameters are there, and which are the most important?
    * There appear to be about 120 hyperparameters (a rough count taken from the constructor signature). The most important are:
        * iterations
        * learning_rate
        * depth
        * subsample
        * colsample_bylevel (an alias of `rsm`)
        * min_data_in_leaf
6) How well does it work on larger datasets?
    * Approximate progression, n copies of the Titanic data => ~seconds of training time:
        * 1 => 2
        * 27 => 14
        * 100 => 17
        * 500 => 39

#### Importing libraries

In [58]:
from inspect import signature

import catboost as cb
import pandas as pd
from sklearn.model_selection import train_test_split

#### Importing data

In [59]:
titanic_data = pd.read_csv('train_titanic.csv')

In [60]:
titanic_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [61]:
# Keep the target and the features used here; the remaining columns are dropped
cleaned_titanic = titanic_data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]

In [62]:
# Encode Sex as 1 (male) / 0 (female)
mapping = {'male': 1, 'female': 0}
numeric_titanic = cleaned_titanic.assign(Sex=cleaned_titanic['Sex'].map(mapping))

In [63]:
numeric_titanic.head()

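|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 |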


In [64]:
# Separate the features from the target column
X = numeric_titanic.drop(columns='Survived')
y = numeric_titanic['Survived']

In [65]:
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [66]:
# Default hyperparameters: 1000 iterations, learning rate chosen automatically
model = cb.CatBoostClassifier(train_dir=None)

In [67]:
model.fit(X_train, y_train)

Learning rate set to 0.008417
0:	learn: 0.6870664	total: 2.18ms	remaining: 2.18s
1:	learn: 0.6818590	total: 3.96ms	remaining: 1.98s
2:	learn: 0.6764998	total: 6.15ms	remaining: 2.04s
3:	learn: 0.6716459	total: 7.78ms	remaining: 1.94s
4:	learn: 0.6665112	total: 9.79ms	remaining: 1.95s
5:	learn: 0.6620379	total: 11.2ms	remaining: 1.85s
6:	learn: 0.6573529	total: 13.2ms	remaining: 1.87s
7:	learn: 0.6528825	total: 15.6ms	remaining: 1.93s
8:	learn: 0.6484189	total: 17.4ms	remaining: 1.92s
9:	learn: 0.6435296	total: 19.4ms	remaining: 1.92s
10:	learn: 0.6409343	total: 20.6ms	remaining: 1.85s
11:	learn: 0.6363886	total: 22.3ms	remaining: 1.83s
12:	learn: 0.6324095	total: 23.5ms	remaining: 1.79s
13:	learn: 0.6281663	total: 24.9ms	remaining: 1.75s
14:	learn: 0.6237409	total: 26.3ms	remaining: 1.73s
15:	learn: 0.6211418	total: 27.3ms	remaining: 1.68s
16:	learn: 0.6175598	total: 28.4ms	remaining: 1.64s
17:	learn: 0.6137877	total: 30.3ms	remaining: 1.65s
18:	learn: 0.6099868	total: 32ms	remaining: 

<catboost.core.CatBoostClassifier at 0x270a3bfad90>
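
The log above reports elapsed and remaining time per iteration. To get a single wall-clock figure for question 4, the call can be wrapped in a timer. The following is a minimal sketch using only the standard library; `timed` is a hypothetical helper and not part of the notebook or of CatBoost.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs on its own;
# in the notebook the call would be timed(model.fit, X_train, y_train).
total, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.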

In [68]:
def multiply_df(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Stack n identical copies of df to simulate a larger dataset."""
    return pd.concat([df for _ in range(n)], ignore_index=True)

In [69]:
mega_titanic = multiply_df(numeric_titanic, 500)

In [70]:
display(len(numeric_titanic))
display(len(mega_titanic))

891

445500

In [71]:
# Note: X and y are reassigned here and now refer to the enlarged data
X = mega_titanic.drop(columns='Survived')
y = mega_titanic['Survived']
mega_X_train, mega_X_test, mega_y_train, mega_y_test = train_test_split(X, y, test_size=0.3, random_state=42)
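
One caveat about this split: `multiply_df` stacks identical copies of the same 891 rows, so a random split of the enlarged frame places copies of the same passenger on both sides. That does not matter for the timing comparison, but a test score computed on `mega_X_test` would be optimistic. The sketch below uses plain Python and made-up row ids (not the Titanic data) to show how much overlap such a split produces.

```python
import random

def overlap_after_duplication(n_rows, n_copies, test_size=0.3, seed=42):
    """Fraction of distinct test-set rows that also occur in the training set."""
    # Every original row id appears n_copies times, as after multiply_df.
    ids = [row_id for _ in range(n_copies) for row_id in range(n_rows)]
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_size)
    test_ids, train_ids = set(ids[:n_test]), set(ids[n_test:])
    return len(test_ids & train_ids) / len(test_ids)

# Without duplication the two sets are disjoint.
print(overlap_after_duplication(n_rows=891, n_copies=1))
# With many copies, practically every test row has a twin in the training set.
print(overlap_after_duplication(n_rows=891, n_copies=50))
```

Splitting the original 891 rows first and duplicating each part afterwards would keep the two sets disjoint.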

In [72]:
mega_model = cb.CatBoostClassifier(train_dir=None)

In [73]:
mega_model.fit(mega_X_train, mega_y_train)

Learning rate set to 0.119632
0:	learn: 0.5929909	total: 44.8ms	remaining: 44.7s
1:	learn: 0.5135915	total: 81ms	remaining: 40.4s
2:	learn: 0.4663576	total: 120ms	remaining: 39.9s
3:	learn: 0.4300151	total: 161ms	remaining: 40s
4:	learn: 0.4068412	total: 196ms	remaining: 38.9s
5:	learn: 0.3819751	total: 237ms	remaining: 39.2s
6:	learn: 0.3656641	total: 280ms	remaining: 39.8s
7:	learn: 0.3556397	total: 319ms	remaining: 39.6s
8:	learn: 0.3472892	total: 360ms	remaining: 39.6s
9:	learn: 0.3378216	total: 407ms	remaining: 40.3s
10:	learn: 0.3270989	total: 446ms	remaining: 40.1s
11:	learn: 0.3185473	total: 487ms	remaining: 40.1s
12:	learn: 0.3135475	total: 525ms	remaining: 39.8s
13:	learn: 0.3091274	total: 562ms	remaining: 39.6s
14:	learn: 0.3024793	total: 602ms	remaining: 39.5s
15:	learn: 0.2970708	total: 640ms	remaining: 39.4s
16:	learn: 0.2923765	total: 681ms	remaining: 39.4s
17:	learn: 0.2859441	total: 720ms	remaining: 39.3s
18:	learn: 0.2769539	total: 761ms	remaining: 39.3s
19:	learn: 0.

<catboost.core.CatBoostClassifier at 0x270a9998e90>
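
The timings listed under question 6 can be put in proportion with a little arithmetic. The sketch below reuses only the approximate numbers from that list and the 891 rows per copy reported earlier; it compares the growth in training time with the growth in data size. Row counts are totals before the 70/30 split.

```python
ROWS_PER_COPY = 891  # len(numeric_titanic) in the notebook

# n copies of the Titanic data -> approximate training seconds (question 6)
timings = {1: 2, 27: 14, 100: 17, 500: 39}

base_seconds = timings[1]
for n_copies, seconds in timings.items():
    rows = n_copies * ROWS_PER_COPY
    slowdown = seconds / base_seconds
    print(f"{rows:>7} rows: {seconds:>2} s, {slowdown:.1f}x the time for {n_copies}x the data")
```

With these figures, 500 times the data costs about 19.5 times the training time, so growth is clearly sublinear in this range. Because the copies are identical rows, the result may not carry over to a genuinely larger dataset.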

In [86]:
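# Inspect the constructor signature to estimate the number of hyperparameters
catboost_sig = signature(cb.CatBoostClassifier.__init__)
display(catboost_sig)
# Rough count: comma-separated entries in the signature text (this includes self)
len(str(catboost_sig).split(','))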

<Signature (self, iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function=None, border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None, od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, best_model_min_trees=None, verbose=None, silent=None, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, target_border=None, classes_count=None, class_weights=None, auto_class_weights=None, class_names=None, one_hot_max_size=None, random_strength=None, random_score_type=None, name=None, ignored_features=None, train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None, bagging_temperature

120
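
Counting commas in the signature text is only an approximation: it includes `self` and would miscount any default value that itself contains a comma. A sturdier variant counts the `Parameter` objects directly. The sketch below demonstrates the difference on `StandInClassifier`, a hypothetical class invented for this example so that it runs without CatBoost installed.

```python
from inspect import Parameter, signature

class StandInClassifier:
    """Hypothetical stand-in; not a CatBoost class."""
    def __init__(self, iterations=None, depth=None, bounds=(0, 1), **kwargs):
        pass

sig = signature(StandInClassifier.__init__)

# Splitting the text on commas also splits inside the tuple default (0, 1).
comma_count = len(str(sig).split(','))

# Counting Parameter objects is exact; 'self' and '**kwargs' are filtered out.
named = [
    p for p in sig.parameters.values()
    if p.name != 'self' and p.kind is not Parameter.VAR_KEYWORD
]
print(comma_count, len(named))  # prints "6 3"
```

The same filter can be applied to `signature(cb.CatBoostClassifier.__init__)`. All defaults visible in the output above are `None`, so the comma-based 120 is probably close, though it counts `self` as well.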