# Spaceship Titanic

## Training XGBoost Classifier

## Table of Contents
- [Spaceship Titanic](#spaceship-titanic)
- [Training XGBoost Classifier](#training-xgboost-classifier)
- [Table of Contents](#table-of-contents)
- [Config](#config)
- [Dependencies](#dependencies)
- [Data Extraction](#data-extraction)
- [Hyper Parameter Tuning](#hyper-parameter-tuning)
- [Conclusions](#conclusions)

### Config

Set the paths to the transformed training data.

In [1]:
transformed_dataset_directory = "../transformed-data"
transformed_training_X_dataset_directory = f"{transformed_dataset_directory}/train_X.csv"
transformed_training_y_dataset_directory = f"{transformed_dataset_directory}/train_y.csv"

Control hyper parameter tuning.

>NOTE: If `do_hyper_parameter_tuning` is `False`, supply the best hyper parameters in `hyper_parameters`.

In [2]:
do_hyper_parameter_tuning = True
hyper_parameters = {}
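
As an illustration of the note above, the configuration with tuning switched off might look like the sketch below. The values are placeholders picked from within the ranges searched later in this notebook; they are assumptions for illustration, not the result of an actual search.

```python
# Sketch of the Config cell with tuning disabled.
# The values are illustrative placeholders, NOT tuned results from this notebook.
do_hyper_parameter_tuning = False
hyper_parameters = {
    "n_estimators": 10,       # placeholder
    "max_depth": 5,           # placeholder
    "learning_rate": 0.3,     # placeholder
    "colsample_bytree": 0.8,  # placeholder
    "eval_metric": "error",
}
```

Because the dictionary is later unpacked with `XGBClassifier(**hyper_parameters)`, its keys need to be valid `XGBClassifier` parameters.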

### Dependencies

In [3]:
%conda install pandas numpy matplotlib seaborn xgboost scikit-learn

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [4]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

### Data Extraction

In [5]:
X = pd.read_csv(transformed_training_X_dataset_directory)
X.head()

Unnamed: 0.1,Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0,39.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,0,0,1,1,0
1,1,24.0,109.0,9.0,25.0,549.0,44.0,1,0,0,1,0,0,0,1,1,0
2,2,58.0,43.0,3576.0,0.0,6715.0,49.0,0,1,0,1,0,0,0,1,0,1
3,3,33.0,0.0,1283.0,371.0,3329.0,193.0,0,1,0,1,0,0,0,1,1,0
4,4,16.0,303.0,70.0,151.0,565.0,2.0,1,0,0,1,0,0,0,1,1,0


In [6]:
y = pd.read_csv(transformed_training_y_dataset_directory)
y.head()

Unnamed: 0.1,Unnamed: 0,Transported
0,0,False
1,1,True
2,2,False
3,3,False
4,4,True
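
Both frames carry an `Unnamed: 0` column. pandas produces this name when a CSV was written with its index and then read back without `index_col`. The sketch below reproduces the effect on a small in-memory CSV and shows one possible way to avoid it. The CSV text is made up for illustration, and it is an assumption that the transformed files were written this way.

```python
import io

import pandas as pd

# Stand-in for train_y.csv: DataFrame.to_csv() stores the index as an
# unnamed first column. The rows are made up for illustration.
csv_text = ",Transported\n0,False\n1,True\n2,False\n"

# Default read: the stored index comes back as a data column named "Unnamed: 0".
y_default = pd.read_csv(io.StringIO(csv_text))
print(list(y_default.columns))  # ['Unnamed: 0', 'Transported']

# index_col=0 restores it as the index, leaving only the real column.
y_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(y_indexed.columns))  # ['Transported']
```

Reading `X` the same way would also keep the row number out of the feature set, where it carries no information about the passenger.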


#### Train/Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13
)

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7823 entries, 2897 to 338
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Unnamed: 0                 7823 non-null   int64  
 1   Age                        7823 non-null   float64
 2   RoomService                7823 non-null   float64
 3   FoodCourt                  7823 non-null   float64
 4   ShoppingMall               7823 non-null   float64
 5   Spa                        7823 non-null   float64
 6   VRDeck                     7823 non-null   float64
 7   HomePlanet_Earth           7823 non-null   int64  
 8   HomePlanet_Europa          7823 non-null   int64  
 9   HomePlanet_Mars            7823 non-null   int64  
 10  CryoSleep_False            7823 non-null   int64  
 11  CryoSleep_True             7823 non-null   int64  
 12  Destination_55 Cancri e    7823 non-null   int64  
 13  Destination_PSO J318.5-22  7823 non-null   int

In [11]:
X_train.head()

Unnamed: 0.1,Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
2897,2897,21.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,1,1,0
1783,1783,39.0,0.0,10153.0,1.0,78.0,5063.0,0,1,0,1,0,1,0,0,1,0
2467,2467,26.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,0,1,0
6545,6545,7.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1,0,0,1,1,0
5623,5623,27.0,441.0,0.0,397.0,471.0,0.0,0,0,1,1,0,0,0,1,1,0


In [12]:
y_train.head()

Unnamed: 0.1,Unnamed: 0,Transported
2897,2897,True
1783,1783,True
2467,2467,True
6545,6545,True
5623,5623,False


### Hyper Parameter Tuning

Define `hyper_parameters_to_search`, the grid of candidate values that `GridSearchCV` searches exhaustively for the best-scoring combination of hyper parameters.

>NOTE: Wider ranges take longer, because every added value multiplies the number of models to fit. Once an optimal set of hyper parameters is found, set `do_hyper_parameter_tuning` to `False` and `hyper_parameters` to the values printed from `hyper_parameter_grid_search.best_params_` in the [Config](#config) section.
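
To make the cost concrete, the size of the grid used in the next cell can be counted before running it. The sketch below copies the value lists from that cell and multiplies their lengths; `cv_folds` is a name introduced here for the `cv = 3` argument.

```python
from math import prod

# Same value lists as the grid in the tuning cell.
hyper_parameters_to_search = {
    "n_estimators": range(5, 15),
    "max_depth": range(1, 15),
    "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "colsample_bytree": [0.7, 0.8, 0.9, 1],
    "eval_metric": ["error"],
    "use_label_encoder": [False],
}
cv_folds = 3  # matches cv = 3 in the GridSearchCV call

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in hyper_parameters_to_search.values())
n_fits = n_candidates * cv_folds

print(n_candidates)  # 3360
print(n_fits)        # 10080
```

With `n_jobs = 1` these fits run one after another, and `GridSearchCV` adds one final refit on the full training set by default.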

In [10]:
if do_hyper_parameter_tuning:
    xgb_classifier = XGBClassifier()

    # Candidate values; GridSearchCV tries every combination.
    hyper_parameters_to_search = {
        'n_estimators': range(5, 15),
        'max_depth': range(1, 15),
        'learning_rate': [.1, .2, .3, .4, .5, .6],
        'colsample_bytree': [.7, .8, .9, 1],
        'eval_metric': ['error'],
        'use_label_encoder': [False],
    }

    hyper_parameter_grid_search = GridSearchCV(
        estimator=xgb_classifier,
        param_grid=hyper_parameters_to_search,
        cv=3,
        n_jobs=1,
        verbose=0,
        return_train_score=True,
    )

    hyper_parameter_grid_search.fit(X_train, y_train)

    hyper_parameters = hyper_parameter_grid_search.best_params_
    print("Best hyper parameters found were: ")
    print(hyper_parameters)
    # Print the score explicitly: a bare expression inside an `if` block is not displayed.
    print(hyper_parameter_grid_search.score(X_test, y_test))
else:
    print("Hyper parameter tuning skipped...")

Traceback (most recent call last):
  File "C:\Users\blake\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\blake\anaconda3\lib\site-packages\xgboost\core.py", line 506, in inner_f
    return f(**kwargs)
  File "C:\Users\blake\anaconda3\lib\site-packages\xgboost\sklearn.py", line 1199, in fit
    raise ValueError(label_encoding_check_error)
ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1].

KeyboardInterrupt: 
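
The `ValueError` in the traceback above is raised when the labels passed to `XGBClassifier.fit` are not exactly the integers 0 to `num_class - 1`. A likely cause here, judging by the `y.head()` output, is that `y` holds two columns, the stored index (`Unnamed: 0`) and `Transported`, so the index values are treated as labels as well. The sketch below shows one possible fix on a made-up frame with the same two columns: keep only `Transported` and cast it to integers. The frame contents and the name `y_train_labels` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for y_train, with the same two columns as the y.head() output.
y_train = pd.DataFrame(
    {
        "Unnamed: 0": [2897, 1783, 2467, 5623],
        "Transported": [True, True, True, False],
    },
    index=[2897, 1783, 2467, 5623],
)

# Keep only the target column and encode False/True as 0/1.
y_train_labels = y_train["Transported"].astype(int)

print(sorted(y_train_labels.unique().tolist()))  # [0, 1]
```

With labels in this form, the search would be called as `hyper_parameter_grid_search.fit(X_train, y_train_labels)`. The same conversion would apply to the later `xgb_classifier.fit` call and to `y_test` when scoring.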

#### Train Model

In [None]:
xgb_classifier = XGBClassifier(**hyper_parameters)
xgb_classifier.fit(X_train, y_train)

In [None]:

# Use a separate name so the imported mean_squared_error function is not overwritten.
test_mse = mean_squared_error(y_test, xgb_classifier.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(test_mse))
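
For 0/1 class labels, the MSE has a direct reading: each wrong prediction contributes a squared error of 1 and each correct one contributes 0, so the MSE equals the misclassification rate, that is, 1 minus accuracy. The sketch below checks this on made-up labels and predictions; it does not use the notebook's data.

```python
# Made-up 0/1 labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

n = len(y_true)

# Mean of squared differences: 1 for each mismatch, 0 for each match.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(mse)           # 0.25
print(1 - accuracy)  # 0.25
```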

### Conclusions