# Titanic - Predictive Modeling!
In our last session, we performed feature engineering on the raw Titanic dataset and exported the cleaned data to CSV. We are now ready to start our predictive modeling process!

## Notebook Setup

In [60]:
# Importing the necessary Python libraries
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading in the cleaned datasets
X = pd.read_csv('../data/cleaned_train.csv')
y = pd.read_csv('../data/y.csv')

In [3]:
# Viewing the first few rows of the X dataset
X.head()

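| | Pclass | Sex | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Embarked_nan | age_imputed_child | age_imputed_teen | age_imputed_young_adult | age_imputed_adult | age_imputed_elder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 1 | 0 | 7.25 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 1 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3 | 0 | 0 | 0 | 8.05 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |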


In [4]:
# Viewing the first few rows of the y dataset
y.head()

| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |


## Data Separation
When working with a training dataset, it is a good idea to hold out a portion of the data so that we can validate the model against rows it never saw during training. In the cell below, we will use Scikit-Learn's `train_test_split` function to hold out 20% of the rows as a validation set and keep the remaining 80% for training.

In [28]:
# Splitting the datasets between training and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [29]:
# Viewing the first few rows of the X_train dataset
X_train.head()

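| | Pclass | Sex | SibSp | Parch | Fare | Embarked_S | Embarked_C | Embarked_Q | Embarked_nan | age_imputed_child | age_imputed_teen | age_imputed_young_adult | age_imputed_adult | age_imputed_elder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | 1 | 0 | 0 | 0 | 28.5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 733 | 2 | 0 | 0 | 0 | 13.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 382 | 3 | 0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 704 | 3 | 0 | 1 | 0 | 7.8542 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 813 | 3 | 1 | 4 | 2 | 31.275 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |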


In [30]:
# Viewing the first few rows of the y_train dataset
y_train.head()

| | Survived |
|---|---|
| 331 | 0 |
| 733 | 0 |
| 382 | 0 |
| 704 | 0 |
| 813 | 0 |


In [10]:
# Checking to see that the 20% split worked properly
len(X_val) / len(X)

0.20089786756453423
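
The ratio above is slightly over 0.2 because the validation set has to contain a whole number of rows. Here is a minimal sketch of that arithmetic, assuming the 891-row Titanic training set (the row count is inferred from the ratio above, not stated in this notebook) and a stand-in DataFrame named `demo` in place of the real data:

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 891 rows is an assumption inferred from the ratio above
n_rows = 891
demo = pd.DataFrame({'feature': np.arange(n_rows)})

# Same split settings as the notebook
demo_train, demo_val = train_test_split(demo, test_size=0.2, random_state=42)

# For a float test_size, the validation size is rounded up to a whole row
expected_val_rows = math.ceil(0.2 * n_rows)

print(len(demo_train), len(demo_val), expected_val_rows)
print(len(demo_val) / n_rows)
```

Since 0.2 × 891 = 178.2, the split rounds up to 179 validation rows, and 179 / 891 gives the ratio printed above.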

## Hyperparameter Tuning

In [32]:
# Instantiating a Random Forest Classifier object
rfc_gridsearch = RandomForestClassifier()

In [33]:
# Defining the parameter grid for hyperparameter tuning
params = {'n_estimators': [10, 50, 100],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 5],
          'max_depth': [10, 20, 50]
         }

In [36]:
# Instantiating the GridSearchCV object
hyperparameter_tuner = GridSearchCV(estimator = rfc_gridsearch,
                                    param_grid = params)

In [39]:
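# Running the hyperparameter tuning job
# y_train is a single-column DataFrame; ravel() passes the 1-D array that Scikit-Learn expects
hyperparameter_tuner.fit(X_train, y_train.values.ravel())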

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [10, 20, 50],
                         'min_samples_leaf': [1, 2, 5],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [10, 50, 100]})

In [40]:
# Viewing the best parameters from the hyperparameter tuning job
hyperparameter_tuner.best_params_

{'max_depth': 20,
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 50}
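
With the settings used here, `GridSearchCV` scores each of the 81 parameter combinations (3 × 3 × 3 × 3) with 5-fold cross-validation and the classifier's default accuracy score, then refits the best combination on all of the data passed to `fit`. The sketch below writes those defaults out explicitly and shows where the cross-validated score lives. It runs on synthetic data from `make_classification` with a smaller grid so that it finishes quickly; the `demo_` names and the fixed `random_state` values are illustrative assumptions, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels
demo_X, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)

# A reduced grid (2 x 2 = 4 combinations) to keep the example fast
demo_params = {'n_estimators': [10, 20],
               'max_depth': [3, 5]}

demo_tuner = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                          param_grid=demo_params,
                          cv=5,                # the default, written out explicitly
                          scoring='accuracy')  # matches the default scorer for classifiers
demo_tuner.fit(demo_X, demo_y)

# Best combination and its mean cross-validated accuracy
print(demo_tuner.best_params_)
print(demo_tuner.best_score_)

# One entry of results per parameter combination
print(len(demo_tuner.cv_results_['params']))
```

Looking at `best_score_` alongside `best_params_` helps judge whether the winning combination is meaningfully better than the alternatives or only marginally so.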

## Model Training

In [41]:
# Instantiating a new Random Forest Classifier object
rfc_model = RandomForestClassifier(n_estimators = 50,
                                   max_depth = 20,
                                   min_samples_split = 10,
                                   min_samples_leaf = 2)

In [42]:
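# Performing the model training
# y_train is a single-column DataFrame; ravel() passes the 1-D array that Scikit-Learn expects
rfc_model.fit(X_train, y_train.values.ravel())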

RandomForestClassifier(max_depth=20, min_samples_leaf=2, min_samples_split=10,
                       n_estimators=50)
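
Copying the tuned values by hand works, but the copy can drift out of sync if the grid search is rerun and picks different values (the forests here are not seeded, so that can happen). One possible alternative is to unpack `best_params_` directly, or to reuse `best_estimator_`, which `GridSearchCV` has already refit on the full training data when `refit=True` (the default). The sketch uses synthetic data and illustrative `demo_` names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels
demo_X, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)

demo_tuner = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                          param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]})
demo_tuner.fit(demo_X, demo_y)

# Option 1: build a fresh model from the winning parameters
demo_model = RandomForestClassifier(**demo_tuner.best_params_, random_state=0)
demo_model.fit(demo_X, demo_y)

# Option 2: reuse the model that GridSearchCV already refit
demo_best = demo_tuner.best_estimator_

# Both models carry the same tuned settings
print(demo_model.get_params()['max_depth'], demo_best.get_params()['max_depth'])
print(demo_model.get_params()['n_estimators'], demo_best.get_params()['n_estimators'])
```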

## Model Validation

In [45]:
# Getting predictions on the X_val dataset using the trained RFC model
val_preds = rfc_model.predict(X_val)

In [52]:
# Getting the metrics with the validation dataset
val_accuracy = accuracy_score(y_val, val_preds)
val_roc_auc = roc_auc_score(y_val, val_preds)
val_confusion_matrix = confusion_matrix(y_val, val_preds)
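
`roc_auc_score` is normally given a continuous score, such as the predicted probability of the positive class, so that it can evaluate every decision threshold. When it receives hard 0/1 predictions, as in the cell above, it evaluates only a single threshold. Here is a minimal sketch of the probability-based variant, on synthetic data with illustrative `demo_` names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and labels
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)
demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

demo_model = RandomForestClassifier(n_estimators=50, random_state=0)
demo_model.fit(demo_X_train, demo_y_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1)
demo_val_probs = demo_model.predict_proba(demo_X_val)[:, 1]
demo_val_labels = demo_model.predict(demo_X_val)

auc_from_probs = roc_auc_score(demo_y_val, demo_val_probs)
auc_from_labels = roc_auc_score(demo_y_val, demo_val_labels)

print(f'ROC-AUC from probabilities: {auc_from_probs}')
print(f'ROC-AUC from hard labels: {auc_from_labels}')
```

In this notebook, the equivalent call would presumably be `rfc_model.predict_proba(X_val)[:, 1]`; the resulting score is not shown here because it was not part of the original run.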

In [59]:
# Printing out the validation metrics
print(f'Accuracy Score: {val_accuracy}')
print(f'ROC-AUC Score: {val_roc_auc}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.8156424581005587
ROC-AUC Score: 0.800965250965251
Confusion Matrix: 
[[93 12]
 [21 53]]
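
Both scores can be recovered from the confusion matrix, which makes for a useful sanity check. In Scikit-Learn's layout, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. Because the ROC-AUC above was computed from hard 0/1 predictions, it reduces to the average of the true positive rate and the true negative rate. The sketch below redoes that arithmetic with the numbers printed above:

```python
import numpy as np

# Confusion matrix copied from the validation output above
cm = np.array([[93, 12],
               [21, 53]])
tn, fp, fn, tp = cm.ravel()

# Accuracy: correct predictions over all predictions
accuracy = (tn + tp) / cm.sum()

# True negative rate and true positive rate (recall for each class)
tnr = tn / (tn + fp)
tpr = tp / (tp + fn)

# With hard labels, ROC-AUC equals the mean of the two rates
auc_from_labels = (tpr + tnr) / 2

print(accuracy)
print(auc_from_labels)
```

These reproduce the reported accuracy of about 0.8156 (146 correct out of 179) and the reported ROC-AUC of about 0.8010.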


## Saving out a Simple Model

In [63]:
# Saving the RFC model to a pickle
with open('../models/rfc_model.pkl', 'wb') as f:
    pickle.dump(rfc_model, f)

## Loading our Trained Model

In [65]:
# Loading in the RFC model from serialized file
with open('../models/rfc_model.pkl', 'rb') as f:
    rfc_loaded_model = pickle.load(f)

In [66]:
# Getting predictions with the loaded model
loaded_preds = rfc_loaded_model.predict(X_val)

In [67]:
# Recomputing the metrics with the loaded model's predictions
val_accuracy = accuracy_score(y_val, loaded_preds)
val_roc_auc = roc_auc_score(y_val, loaded_preds)
val_confusion_matrix = confusion_matrix(y_val, loaded_preds)

print(f'Accuracy Score: {val_accuracy}')
print(f'ROC-AUC Score: {val_roc_auc}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.8156424581005587
ROC-AUC Score: 0.800965250965251
Confusion Matrix: 
[[93 12]
 [21 53]]
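
Matching metrics suggest that the model survived the round trip; comparing the prediction arrays directly is a stricter check. Here is a minimal sketch, using a temporary directory and synthetic data rather than this notebook's `../models/` path (the `demo_` names are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features and labels
demo_X, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(demo_X, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')

    # Serialize the fitted model, then read it back
    with open(model_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(model_path, 'rb') as f:
        demo_loaded_model = pickle.load(f)

# A faithful round trip yields identical predictions
round_trip_ok = np.array_equal(demo_model.predict(demo_X),
                               demo_loaded_model.predict(demo_X))
print(round_trip_ok)
```

Loading a pickle can execute arbitrary code, so only unpickle files from a source you trust, and where possible load them with the same Scikit-Learn version that wrote them.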
