# Modelling

## Overview
In this notebook we will try to model our data using catboost and see how well we can do. The f1 macro score is specified as the metric we should use for evaluation. Bayesian hyperparameter tuning is used.

---
### Load libraries and data

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, make_scorer

import catboost as cb
from hyperopt import hp, tpe, fmin

In [34]:
train = pd.read_csv('data/final_train.csv')

Drop ID columns and split training data into X and y

In [35]:
train.drop(['Id', 'idhogar'], axis=1, inplace=True)

In [36]:
X = train.drop('Target', axis=1)
y = train.Target

# take one from y so values are 0-3, instead of 1-4 (this is required for catboost)
y -= 1
X.fillna(X.meaneduc.mean(), inplace=True)

Catboost will preform better when the columns with categorical values are explicitly passed to it.

In [37]:
cate_cols = [X.columns.get_loc(col) for col in X.columns if train[col].dtype != 'float64']

Split into training and validation data

In [38]:
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2)

Objective function for Hyperopt

In [41]:
def objective(params):
    cat_params = {'depth':10, 'l2_leaf_reg':9, 'learning_rate':0.15}
    cat_params['depth'] = int(params['depth'])
    
    clf = cb.CatBoostClassifier(loss_function='MultiClass',
                                eval_metric='MultiClass',
                                use_best_model=True,
                                iterations=100,
                                verbose=False, 
                                classes_count=4,
                                learning_rate=cat_params['learning_rate'],
                                depth=cat_params['depth'],
                                l2_leaf_reg=cat_params['l2_leaf_reg']
                                )
    clf.fit(train_X, train_y, eval_set=(val_X, val_y), use_best_model=True, cat_features=cate_cols)
    
    pred_y = clf.predict(val_X)
    score = f1_score(val_y, pred_y, average='macro')
    print('params: ', params, ' -- score: ', score)
    return score

In [42]:
space = {'depth': hp.quniform('depth', 3, 8, 1),
         'l2_leaf_rag': hp.quniform('l2_leaf_rag', 3, 8, 1)}

best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50)

params:  {'depth': 7.0, 'l2_leaf_rag': 7.0}  -- score:  0.7526395291032701
params:  {'depth': 7.0, 'l2_leaf_rag': 5.0}  -- score:  0.7573407146406984
params:  {'depth': 4.0, 'l2_leaf_rag': 6.0}  -- score:  0.5947449433904906
params:  {'depth': 8.0, 'l2_leaf_rag': 5.0}  -- score:  0.7459777833323072
params:  {'depth': 5.0, 'l2_leaf_rag': 4.0}  -- score:  0.6932658505232344
params:  {'depth': 7.0, 'l2_leaf_rag': 5.0}  -- score:  0.7308379103128568
params:  {'depth': 6.0, 'l2_leaf_rag': 5.0}  -- score:  0.7324158245874087
params:  {'depth': 6.0, 'l2_leaf_rag': 8.0}  -- score:  0.7087100657644466
params:  {'depth': 7.0, 'l2_leaf_rag': 4.0}  -- score:  0.7675390969952225
params:  {'depth': 8.0, 'l2_leaf_rag': 8.0}  -- score:  0.7656404669001636
params:  {'depth': 7.0, 'l2_leaf_rag': 3.0}  -- score:  0.7602884379759443
params:  {'depth': 8.0, 'l2_leaf_rag': 4.0}  -- score:  0.7430856801847904
params:  {'depth': 6.0, 'l2_leaf_rag': 4.0}  -- score:  0.7488231192644822
params:  {'depth': 6.0, '