# Spaceship Titanic

## Training XGBoost Classifier

## Table of Contents
- [Spaceship Titanic](#spaceship-titanic)
- [Training XGBoost Classifier](#training-xgboost-classifier)
- [Table of Contents](#table-of-contents)
- [Config](#config)
- [Dependencies](#dependencies)
- [Data Extraction](#data-extraction)
- [Hyper Parameter Tuning](#hyper-parameter-tuning)
- [Train Model](#train-model)
- [Conclusions](#conclusions)

### Config

Set the paths to the transformed training data files.

In [None]:
transformed_dataset_directory = "../transformed-data"
transformed_training_X_dataset_directory = f"{transformed_dataset_directory}/train_X.csv"
transformed_training_y_dataset_directory = f"{transformed_dataset_directory}/train_y.csv"

Control hyper parameter tuning.

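>NOTE: If `do_hyper_parameter_tuning` is `False`, put the best hyper parameters found in an earlier run into `hyper_parameters`. If it is left empty, the model is trained with the `XGBClassifier` defaults.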

In [None]:
do_hyper_parameter_tuning = True
hyper_parameters = {}

### Dependencies

In [None]:
%conda install pandas numpy matplotlib seaborn xgboost scikit-learn

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

### Data Extraction

In [None]:
X = pd.read_csv(transformed_training_X_dataset_directory)
X.head()

In [None]:
y = pd.read_csv(transformed_training_y_dataset_directory)
y.head()

#### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13
)

In [None]:
# info() prints its summary itself and returns None, so no print() is needed.
X_train.info()

In [None]:
X_train.head()

In [None]:
y_train.head()

### Hyper Parameter Tuning

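Define `hyper_parameters_to_search`, the grid of candidate values. `GridSearchCV` fits every combination in the grid and keeps the best-scoring one.

>NOTE: Wider ranges take longer to process, because the number of fits is the product of the number of values per parameter, multiplied by the number of cross-validation folds. Once an optimal set of hyper parameters is found, set `do_hyper_parameter_tuning` to `False` and `hyper_parameters` to the value of `hyper_parameter_grid_search.best_params_` in the [Config](#config) section.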
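
To get a feel for the cost before starting the search, the number of model fits can be counted up front. The sketch below copies the grid used in this section and relies only on the standard library. `cv_folds`, `n_combinations`, and `total_fits` are names introduced here for illustration; `cv_folds` mirrors the `cv = 3` argument passed to `GridSearchCV`.

```python
import math

# Copy of the grid searched in this section.
hyper_parameters_to_search = {
    'n_estimators': range(5, 15),
    'max_depth': range(1, 15),
    'learning_rate': [.1, .2, .3, .4, .5, .6],
    'colsample_bytree': [.7, .8, .9, 1],
    'eval_metric': ['error'],
    'use_label_encoder': [False],
}
cv_folds = 3  # illustrative name; mirrors cv = 3 in the search cell

# One candidate per combination of values, one fit per candidate per fold.
n_combinations = math.prod(len(values) for values in hyper_parameters_to_search.values())
total_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also fits the best combination once more on the full training set after the search finishes.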

In [None]:
if do_hyper_parameter_tuning:
    xgb_classifier = XGBClassifier()

    # Grid of candidate values; every combination is fitted once per CV fold.
    hyper_parameters_to_search = {
        'n_estimators': range(5, 15),
        'max_depth': range(1, 15),
        'learning_rate': [.1, .2, .3, .4, .5, .6],
        'colsample_bytree': [.7, .8, .9, 1],
        'eval_metric': ['error'],
        'use_label_encoder': [False]
    }

    hyper_parameter_grid_search = GridSearchCV(
        estimator=xgb_classifier,
        param_grid=hyper_parameters_to_search,
        cv=3, n_jobs=1, verbose=0, return_train_score=True
    )

    hyper_parameter_grid_search.fit(X_train, y_train)

    hyper_parameters = hyper_parameter_grid_search.best_params_
    print("Best hyper parameters found were: ")
    print(hyper_parameters)
    # score() uses the refitted best estimator; for a classifier this is accuracy.
    # It must be printed explicitly because it sits inside the if block.
    test_accuracy = hyper_parameter_grid_search.score(X_test, y_test)
    print("Accuracy on test set: {:.4f}".format(test_accuracy))
else:
    print("Hyper parameter tuning skipped...")

### Train Model

In [None]:
xgb_classifier = XGBClassifier(**hyper_parameters)
xgb_classifier.fit(X_train, y_train)

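In [None]:
# Store the result under a new name so the imported mean_squared_error
# function is not overwritten and the cell can be re-run.
test_mse = mean_squared_error(y_test, xgb_classifier.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(test_mse))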
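
When the class labels are encoded as 0 and 1, the MSE printed above is the same number as the misclassification rate (1 minus accuracy): each squared difference is 0 for a correct prediction and 1 for a wrong one. The sketch below shows this with made-up labels. `y_true` and `y_pred` are illustrative lists, not values from the dataset, and the 0/1 encoding of the target is an assumption.

```python
# Illustrative labels only; not taken from the Spaceship Titanic data.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

n = len(y_true)

# Mean of squared differences between true and predicted labels.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print("MSE:", mse)
print("1 - accuracy:", 1 - accuracy)
```

Reading the metric as an error rate can make it easier to compare with the accuracy that `GridSearchCV.score` reports for a classifier.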

### Conclusions