# Spaceship Titanic

## Training XGBoost Classifier

## Table of Contents
- [Spaceship Titanic](#spaceship-titanic)
- [Training XGBoost Classifier](#training-xgboost-classifier)
- [Table of Contents](#table-of-contents)
- [Config](#config)
- [Dependencies](#dependencies)
- [Data Extraction](#data-extraction)
- [Hyper Parameter Tuning](#hyper-parameter-tuning)
- [Test Model](#test-model)
- [Train Model](#train-model)
- [Save Model](#save-model)
- [Conclusions](#conclusions)

### Config

Set the paths to the transformed datasets and to the saved model.

In [147]:
transformed_dataset_directory = "../transformed-data"
transformed_training_X_dataset_directory = f"{transformed_dataset_directory}/train_X.csv"
transformed_training_y_dataset_directory = f"{transformed_dataset_directory}/train_y.csv"

models_directory = "../models"
model_save_path = f"{models_directory}/xgb_classifier.json"
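
The model is later written to `model_save_path`, which assumes the `../models` directory already exists; on a fresh checkout the save step can fail if it does not. A minimal sketch of creating the directory up front, using only `pathlib`. The helper name `ensure_parent_directory` is hypothetical and not part of this notebook:

```python
from pathlib import Path


def ensure_parent_directory(file_path: str) -> Path:
    """Create the parent directory of file_path if it is missing (hypothetical helper)."""
    parent = Path(file_path).parent
    # parents=True creates intermediate directories; exist_ok=True makes repeat calls harmless.
    parent.mkdir(parents=True, exist_ok=True)
    return parent
```

Calling `ensure_parent_directory(model_save_path)` once in this section would be enough, since the call does nothing when the directory is already there.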

Control hyper parameter tuning.

>NOTE: If `do_hyper_parameter_tuning` is `False`, set `hyper_parameters` to the best hyper parameters found by an earlier tuning run.

In [148]:
do_hyper_parameter_tuning = False
hyper_parameters = {'colsample_bytree': 0.8, 'eval_metric': 'error', 'learning_rate': 0.2, 'max_depth': 6, 'n_estimators': 14, 'use_label_encoder': False}
# {'colsample_bytree': 0.7, 'eval_metric': 'error', 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'use_label_encoder': False}
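
The note above implies a simple contract: when tuning is skipped, `hyper_parameters` must already hold a usable set of values, because an empty dictionary would silently fall back to the library defaults. One way to make that explicit is a small guard. `check_tuning_config` is a hypothetical helper, not something this notebook defines:

```python
def check_tuning_config(do_tuning: bool, params: dict) -> None:
    """Raise early if tuning is skipped but no hyper parameters were supplied."""
    if not do_tuning and not params:
        raise ValueError(
            "do_hyper_parameter_tuning is False, so hyper_parameters must not be empty"
        )


# Mirrors part of the configuration used in this notebook; passes silently.
check_tuning_config(False, {"learning_rate": 0.2, "max_depth": 6, "n_estimators": 14})
```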

### Dependencies

In [149]:
%conda install pandas numpy matplotlib seaborn xgboost scikit-learn

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [150]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

### Data Extraction

In [151]:
X = pd.read_csv(transformed_training_X_dataset_directory)
X.head()

Unnamed: 0,RoomService,Spa,VRDeck,Cabin_2,HomePlanet_Earth,HomePlanet_Europa,CryoSleep_False,CryoSleep_True,Cabin_1_A,Cabin_1_B,Cabin_1_C,Cabin_1_D,Cabin_1_E,Cabin_1_F,Cabin_1_G,Cabin_1_T,Cabin_3_P,Cabin_3_S
0,0.0,0.0,0.0,0,0,1,1,0,0,1,0,0,0,0,0,0,1,0
1,109.0,549.0,44.0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1
2,43.0,6715.0,49.0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1
3,0.0,3329.0,193.0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1
4,303.0,565.0,2.0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1


In [152]:
y = pd.read_csv(transformed_training_y_dataset_directory)
y.head()

Unnamed: 0,Transported
0,False
1,True
2,False
3,False
4,True


#### Train/Test Split

In [153]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13
)
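
With a fractional `test_size`, scikit-learn (to the best of my understanding of its behaviour) rounds the test share up to a whole number of rows and assigns the remainder to training. The arithmetic can be sketched without scikit-learn. The total of 8693 rows is an assumption, since the notebook never prints the size of `X`; it is chosen because it reproduces the 7823 training rows reported by `X_train.info()`:

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 8693 is an assumed row count, not a value stated in this notebook.
print(split_sizes(8693, 0.1))
```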

In [154]:
X_train.info()  # info() prints its own report; wrapping it in print() also emits "None"

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7823 entries, 2897 to 338
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   RoomService        7823 non-null   float64
 1   Spa                7823 non-null   float64
 2   VRDeck             7823 non-null   float64
 3   Cabin_2            7823 non-null   int64  
 4   HomePlanet_Earth   7823 non-null   int64  
 5   HomePlanet_Europa  7823 non-null   int64  
 6   CryoSleep_False    7823 non-null   int64  
 7   CryoSleep_True     7823 non-null   int64  
 8   Cabin_1_A          7823 non-null   int64  
 9   Cabin_1_B          7823 non-null   int64  
 10  Cabin_1_C          7823 non-null   int64  
 11  Cabin_1_D          7823 non-null   int64  
 12  Cabin_1_E          7823 non-null   int64  
 13  Cabin_1_F          7823 non-null   int64  
 14  Cabin_1_G          7823 non-null   int64  
 15  Cabin_1_T          7823 non-null   int64  
 16  Cabin_3_P          782

In [155]:
X_train.head()

Unnamed: 0,RoomService,Spa,VRDeck,Cabin_2,HomePlanet_Earth,HomePlanet_Europa,CryoSleep_False,CryoSleep_True,Cabin_1_A,Cabin_1_B,Cabin_1_C,Cabin_1_D,Cabin_1_E,Cabin_1_F,Cabin_1_G,Cabin_1_T,Cabin_3_P,Cabin_3_S
2897,0.0,0.0,0.0,492,1,0,0,1,0,0,0,0,0,0,1,0,0,1
1783,0.0,78.0,5063.0,72,0,1,1,0,0,0,1,0,0,0,0,0,0,1
2467,0.0,0.0,0.0,95,0,1,0,1,0,0,1,0,0,0,0,0,0,1
6545,0.0,0.0,0.0,1432,0,0,0,1,0,0,0,0,0,1,0,0,1,0
5623,441.0,471.0,0.0,1140,0,0,1,0,0,0,0,0,0,1,0,0,0,1


In [156]:
y_train.head()

Unnamed: 0,Transported
2897,True
1783,True
2467,True
6545,True
5623,False


### Hyper Parameter Tuning

Define `hyper_parameters_to_search`, the grid of candidate values, and run a cross-validated grid search to find the best-scoring combination of hyper parameters.

>NOTE: Larger grids take longer to search. Once a good set of hyper parameters is found, set `do_hyper_parameter_tuning` to `False` and copy `hyper_parameter_grid_search.best_params_` into `hyper_parameters` in the [Config](#config) section.

In [157]:
if do_hyper_parameter_tuning:
    xgb_classifier = XGBClassifier()

    # Narrower alternatives for a finer search:
    # 'n_estimators': range(10, 20),
    # 'max_depth': range(1, 20),

    hyper_parameters_to_search = {
        'n_estimators': [1, 5, 10, 50, 100, 500, 1000],
        'max_depth': [1, 5, 10, 50, 100, 500, 1000],
        'learning_rate': [.1, .2, .3, .4, .5, .6, .7, .8, .9],
        'colsample_bytree': [.7, .8, .9, 1],
        'eval_metric': ['error'],
        'use_label_encoder': [False],
    }

    # Exhaustive search over the grid with 3-fold cross-validation.
    hyper_parameter_grid_search = GridSearchCV(
        estimator=xgb_classifier,
        param_grid=hyper_parameters_to_search,
        cv=3,
        n_jobs=1,
        verbose=4,
        return_train_score=True,
    )

    hyper_parameter_grid_search.fit(X_train, y_train)

    hyper_parameters = hyper_parameter_grid_search.best_params_
    print("Best hyper parameters found were: ")
    print(hyper_parameters)
    # Print the held-out score; a bare expression inside the if block is not displayed.
    print(hyper_parameter_grid_search.score(X_test, y_test))
else:
    print("Hyper parameter tuning skipped...")

Hyper parameter tuning skipped...
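
To put the note about processing time in concrete terms, the number of model fits for the grid above can be counted before launching the search. This sketch uses only the standard library and copies the grid from the tuning cell; `count_grid_fits` is a hypothetical helper:

```python
import math


def count_grid_fits(param_grid: dict, cv_folds: int) -> int:
    """Number of fits an exhaustive grid search performs during cross-validation."""
    n_candidates = math.prod(len(values) for values in param_grid.values())
    return n_candidates * cv_folds


hyper_parameters_to_search = {
    'n_estimators': [1, 5, 10, 50, 100, 500, 1000],
    'max_depth': [1, 5, 10, 50, 100, 500, 1000],
    'learning_rate': [.1, .2, .3, .4, .5, .6, .7, .8, .9],
    'colsample_bytree': [.7, .8, .9, 1],
    'eval_metric': ['error'],
    'use_label_encoder': [False],
}

print(count_grid_fits(hyper_parameters_to_search, cv_folds=3))
```

The grid holds 1764 candidates, so 3-fold cross-validation means 5292 fits, and with `n_jobs = 1` they run one after another. Narrowing the ranges, as in the commented-out alternatives, shrinks this count quickly.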


### Test Model
Train the model on the `X_train` and `y_train` split and evaluate it on the held-out test split, using the mean squared error (MSE) as the metric.

In [158]:
xgb_train_test_classifier = XGBClassifier(**hyper_parameters)
xgb_train_test_classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8,
              enable_categorical=False, eval_metric='error', gamma=0, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.2, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=14, n_jobs=8, num_parallel_tree=1, predictor='auto',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [159]:

mse = mean_squared_error(y_test, xgb_train_test_classifier.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))

The mean squared error (MSE) on test set: 0.2000
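
For binary labels each squared error is either 0 or 1, so the MSE is the same number as the misclassification rate, and accuracy is one minus that. A small self-contained check on made-up labels (the values below are illustrative, not taken from the dataset):

```python
def mean_squared_error_binary(y_true, y_pred):
    """MSE for 0/1 (or False/True) labels, computed by hand."""
    return sum((int(t) - int(p)) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    return sum(bool(t) != bool(p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative labels only: 2 of the 10 predictions are wrong.
y_true = [True, False, True, True, False, False, True, False, True, True]
y_pred = [True, False, False, True, False, True, True, False, True, True]

mse = mean_squared_error_binary(y_true, y_pred)
assert mse == error_rate(y_true, y_pred)
print(f"MSE = error rate = {mse:.4f}, accuracy = {1 - mse:.4f}")
```

By the same reasoning, the 0.2000 reported above corresponds to an accuracy of about 0.80. `sklearn.metrics.accuracy_score` would report that figure directly.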


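### Train Model
Train the final model on the entire training set. The MSE printed by this cell is computed on the same data the model was fit on, so it is an optimistic estimate of performance on unseen data.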

In [160]:
xgb_classifier = XGBClassifier(**hyper_parameters)
xgb_classifier.fit(X, y)

mse = mean_squared_error(y, xgb_classifier.predict(X))
print("The mean squared error (MSE) on training set: {:.4f}".format(mse))

The mean squared error (MSE) on training set: 0.1830


### Save Model

In [161]:
xgb_classifier.save_model(fname=model_save_path)

### Conclusions

Best result:
- The mean squared error (MSE) on test set: 0.2000
- The mean squared error (MSE) on training set: 0.1830