## Optimization
This part walks through hyperparameter tuning of the gradient boosting pipeline. At the end, the test scores of the baseline and the tuned model are compared.

In [1]:
# import libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings(action='ignore')

# read data
df = pd.read_csv('data/train.csv')

# drop irrelevant columns
df.drop(columns = ["Name", "PassengerId", "Cabin", "Ticket"], inplace=True)

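# handling missing values (assign back instead of chained inplace calls,
# which newer pandas versions warn about or silently ignore)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna('N/A')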

# separating target and features
X = df.drop(columns = ["Survived"])
y = df.Survived

In [2]:
# split train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

# label encoding for categorical columns
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# oversampling the minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1)

# classifier model
from sklearn.ensemble import GradientBoostingClassifier

# pipeline
from imblearn.pipeline import Pipeline

# importing function to display scores
import data_preparation as dp

In [3]:
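# work on copies so the original split stays untouched
X_train_labeled = X_train.copy()
X_test_labeled = X_test.copy()

# encode categorical columns: fit on the training split only,
# then apply the same mapping to the test split
col = ["Sex", "Embarked"]
for c in col:
    X_train_labeled[c] = le.fit_transform(X_train[c].astype('str'))
    X_test_labeled[c] = le.transform(X_test[c].astype('str'))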
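
`LabelEncoder.transform` raises an error when the test split contains a category that was not present during fitting, and a single `le` object only remembers the mapping of the last column it was fitted on. One possible alternative, shown here as a minimal sketch on made-up rows rather than the notebook's actual method, is `OrdinalEncoder`, which encodes several columns at once and can map unseen categories to a placeholder. The toy frames and the `-1` placeholder are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the train/test splits (assumption: not the real data).
train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["S", "N/A"]})  # 'N/A' never appears in train

cat_cols = ["Sex", "Embarked"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = train.copy()
test_enc = test.copy()
train_enc[cat_cols] = encoder.fit_transform(train[cat_cols])  # fit on train only
test_enc[cat_cols] = encoder.transform(test[cat_cols])        # reuse the mapping

print(test_enc)
```

The fitted `encoder.categories_` attribute keeps one category list per column, so the mapping of every encoded column stays available after fitting.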

In [4]:
pipe = Pipeline(steps=[('smote', smote),
                       ('gbc', GradientBoostingClassifier(random_state=1))])

pipe.fit(X_train_labeled, y_train)

Pipeline(steps=[('smote', SMOTE(random_state=1)),
                ('gbc', GradientBoostingClassifier(random_state=1))])

In [5]:
dp.scores(X_train_labeled, y_train, X_test_labeled, y_test, pipe)

Pipeline(steps=[('smote', SMOTE(random_state=1)),
                ('gbc', GradientBoostingClassifier(random_state=1))])

CV score:     83.03%
X-test score: 81.61%
RMSE:         0.4288

Train score
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       421
           1       0.90      0.84      0.87       247

    accuracy                           0.91       668
   macro avg       0.90      0.89      0.90       668
weighted avg       0.91      0.91      0.90       668



X-test score

              precision    recall  f1-score   support

           0       0.80      0.91      0.85       128
           1       0.86      0.68      0.76        95

    accuracy                           0.82       223
   macro avg       0.83      0.80      0.81       223
weighted avg       0.82      0.82      0.81       223
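
For 0/1 class labels, every wrong prediction contributes a squared error of exactly 1, so the RMSE reduces to the square root of the error rate. The reported RMSE is consistent with that relationship, although whether `dp.scores` computes it this way is an assumption, since the helper's source is not shown. A small check using the rounded test accuracy from the output above:

```python
import math

def rmse_from_accuracy(accuracy: float) -> float:
    """RMSE of 0/1 predictions, given the fraction of correct predictions."""
    error_rate = 1.0 - accuracy
    return math.sqrt(error_rate)

# Rounded test accuracy reported for the baseline pipeline.
baseline_rmse = rmse_from_accuracy(0.8161)
print(f"{baseline_rmse:.4f}")
```

Because the two numbers carry the same information for binary labels, the RMSE adds no evidence beyond the accuracy here.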





In [6]:
from sklearn.model_selection import GridSearchCV

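# instantiate grid search for hyperparameter tuning
# note: loss='deviance' (old alias of 'log_loss') and max_features='auto'
# (same as 'sqrt' for this classifier) are left out; both were deprecated
# in scikit-learn 1.1 and removed in 1.3
param = {'gbc__loss': ['log_loss', 'exponential'],
         'gbc__n_estimators':  [100, 200, 300],
         'gbc__min_samples_split': [2, 3, 4, 5],
         'gbc__min_samples_leaf': [1, 2, 3, 4, 5],
         'gbc__max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
         'gbc__max_features': [None, 'sqrt', 'log2'],
        }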
gs = GridSearchCV(estimator=pipe,
                  param_grid=param,
                  cv=3)

gs.fit(X_train_labeled, y_train)
gs.best_params_

{'gbc__loss': 'exponential',
 'gbc__max_depth': 5,
 'gbc__max_features': None,
 'gbc__min_samples_leaf': 1,
 'gbc__min_samples_split': 3,
 'gbc__n_estimators': 100}
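
An exhaustive grid search trains one model per parameter combination per fold, so the cost grows multiplicatively with every value added to the grid. The helper below is a small sketch for estimating that cost before launching a search. The function name `count_fits` and the reduced example grid are hypothetical and chosen only for illustration.

```python
from math import prod

def count_fits(param_grid: dict, cv: int) -> int:
    """Number of model fits an exhaustive grid search performs (excluding the final refit)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv

# Reduced illustrative grid (assumption: not the full grid used in the notebook).
example_grid = {
    "gbc__n_estimators": [100, 200, 300],
    "gbc__max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "gbc__min_samples_leaf": [1, 2, 3, 4, 5],
}

total_fits = count_fits(example_grid, cv=3)  # 3 * 8 * 5 candidates, 3 folds each
print(total_fits)
```

If the count is too large for the available time, `RandomizedSearchCV` from scikit-learn samples a fixed number of candidates from the same parameter space.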

In [9]:
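# checking score with tuned model
# (max_features=None, min_samples_leaf=1 and n_estimators=100 from the
# search result are the estimator defaults, so they are not repeated here)
pipe_tuned = Pipeline(steps=[('smote', smote),
                             ('gbc', GradientBoostingClassifier(random_state=1,
                                                                loss='exponential',
                                                                max_depth=5,
                                                                min_samples_split=3))])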

pipe_tuned.fit(X_train_labeled, y_train)
dp.scores(X_train_labeled, y_train, X_test_labeled, y_test, pipe_tuned)

Pipeline(steps=[('smote', SMOTE(random_state=1)),
                ('gbc',
                 GradientBoostingClassifier(loss='exponential', max_depth=5,
                                            min_samples_split=3,
                                            random_state=1))])

CV score:     83.23%
X-test score: 77.58%
RMSE:         0.4735

Train score
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       421
           1       0.97      0.94      0.96       247

    accuracy                           0.97       668
   macro avg       0.97      0.96      0.97       668
weighted avg       0.97      0.97      0.97       668



X-test score

              precision    recall  f1-score   support

           0       0.76      0.88      0.82       128
           1       0.80      0.63      0.71        95

    accuracy                           0.78       223
   macro avg       0.78      0.76      0.76       223
weighted avg       0.78      
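
Copying the winning values into a new pipeline by hand works, but it is easy to mistype or forget one. A fitted `GridSearchCV` already exposes them: `best_estimator_` is the pipeline refitted on the full training data (with the default `refit=True`), and `best_params_` can be passed to `set_params`. The sketch below uses synthetic data and a `StandardScaler` step as a placeholder for SMOTE, so it runs with scikit-learn alone. Both are assumptions, not the notebook's setup.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic set used above).
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# StandardScaler is only a placeholder for the SMOTE step.
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("gbc", GradientBoostingClassifier(random_state=1, n_estimators=20)),
])

demo_grid = {"gbc__max_depth": [2, 3], "gbc__min_samples_split": [2, 3]}
demo_gs = GridSearchCV(estimator=demo_pipe, param_grid=demo_grid, cv=3)
demo_gs.fit(X_tr, y_tr)

# Option 1: the search already refitted the best pipeline on all of X_tr.
tuned = demo_gs.best_estimator_

# Option 2: copy the winning values onto a fresh, unfitted pipeline.
manual = clone(demo_pipe).set_params(**demo_gs.best_params_)
manual.fit(X_tr, y_tr)

test_acc = tuned.score(X_te, y_te)
print(demo_gs.best_params_, round(test_acc, 4))
```

Either option keeps the evaluated model in sync with whatever the search actually selected.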

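Tuning did not help on held-out data: test accuracy dropped from 81.61% to 77.58%, while training accuracy rose from 0.91 to 0.97 and the CV score barely moved (83.03% to 83.23%). This pattern suggests the tuned model overfits the training set, so we will stick with the baseline model.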

In [19]:
# feature importances of the fitted GBC step (taken from the tuned pipeline)
feat_impt = pd.DataFrame({'Features': X_train_labeled.columns,
                          'FI_score': pipe_tuned.named_steps['gbc'].feature_importances_})

In [20]:
feat_impt.sort_values(by='FI_score', ascending=False)

   Features  FI_score
1       Sex  0.381585
2       Age  0.233097
5      Fare  0.218864
0    Pclass  0.071828
3     SibSp  0.052864
6  Embarked  0.029648
4     Parch  0.012114
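
The same kind of ranking can be built more compactly as a pandas `Series` indexed by feature name. The sketch below runs on synthetic data with made-up column names (`f0` to `f3`), so the printed numbers are unrelated to the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with hypothetical feature names (assumption: illustration only).
cols = ["f0", "f1", "f2", "f3"]
X_arr, y_demo = make_classification(n_samples=200, n_features=4, n_informative=2,
                                    n_redundant=0, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=cols)

model = GradientBoostingClassifier(random_state=1, n_estimators=20)
model.fit(X_demo, y_demo)

# One importance per feature, labelled by column name and sorted descending.
fi = (pd.Series(model.feature_importances_, index=X_demo.columns, name="FI_score")
        .sort_values(ascending=False))
print(fi)
```

For a classifier wrapped in a pipeline, the fitted step can be reached by name, for example `pipe.named_steps['gbc']`.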


Sex, age, and fare have the highest importance scores (computed from `pipe_tuned`). Sex and fare stood out in both the classification analysis and the heatmap, but the classifier seems to weigh age more heavily than the heatmap suggested.