# Compare Model Results and Select the Final Model

In this section, we will do the following:
1. Evaluate all the saved models on the validation set.
2. Select the best model based on performance on the validation set.
3. Evaluate the best model on the holdout test set.

In [65]:
# Import the relevant packages
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time

In [134]:
# Read the validation and test data
X_val = pd.read_csv('./data/val_features.csv')
y_val = pd.read_csv('./data/val_labels.csv')

X_test = pd.read_csv('./data/test_features.csv')
y_test = pd.read_csv('./data/test_labels.csv')

In [136]:
X_val.head()

Unnamed: 0.1,Unnamed: 0,Pclass,Sex,Age,Fare,Family_cnt,Cabin_ind
0,849,1,1,29.699118,89.1042,1,1
1,331,1,0,45.5,28.5,0,1
2,260,3,0,29.699118,7.75,0,0
3,316,2,1,24.0,26.0,1,0
4,292,2,0,36.0,12.875,0,1


In [138]:
y_val.head()

Unnamed: 0,Survived
0,1
1,0
2,0
3,1
4,0


In [140]:
X_test.head()

Unnamed: 0.1,Unnamed: 0,Pclass,Sex,Age,Fare,Family_cnt,Cabin_ind
0,424,3,0,18.0,20.2125,2,0
1,837,3,0,29.699118,8.05,0,0
2,525,3,0,40.5,7.75,0,0
3,328,3,1,31.0,20.525,2,0
4,70,2,0,32.0,10.5,0,0


In [142]:
y_test.head()

Unnamed: 0,Survived
0,0
1,0
2,0
3,1
4,0
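The `Unnamed: 0` column in the feature previews is probably a row index that was written out when the CSV files were saved. This is an assumption, since the code that saved the files is not shown here. The sketch below uses a small in-memory frame (made-up values, not the real data) to show how such a column appears and how `index_col=0` avoids it.

```python
import io

import pandas as pd

# Toy frame standing in for the real features; the index mimics original row ids.
features = pd.DataFrame({"Pclass": [1, 3], "Fare": [89.1042, 7.75]}, index=[849, 260])

buffer = io.StringIO()
features.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra_column = pd.read_csv(buffer)  # index comes back as a data column

buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)  # first column restored as the index

print(list(with_extra_column.columns))  # ['Unnamed: 0', 'Pclass', 'Fare']
print(list(with_index.columns))         # ['Pclass', 'Fare']
```

If the saved models were fit with that column present, it has to stay in `X_val` and `X_test` for `predict` to work, so any cleanup would belong upstream where the data is split and the models are trained.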


In [155]:
# Read the saved best models
gb_mdl = joblib.load('./data/models/GB_model.pkl')
rf_mdl = joblib.load('./data/models/RF_model.pkl')
stacked_mdl = joblib.load('./data/models/stacked_model.pkl')

In [157]:
# Helper that prints accuracy, precision, recall, and prediction latency for a fitted model
def evaluate_model(model, features, labels):
    # Time only the predict call, so latency excludes the metric computation
    start = time()
    pred = model.predict(features)
    end = time()
    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)
    latency_ms = round((end - start) * 1000, 1)
    print(f'{type(model).__name__} -- Accuracy: {accuracy} / Precision: {precision} / '
          f'Recall: {recall} / Latency: {latency_ms}ms')
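As a rough illustration of what the helper measures, here is a dependency-free sketch that computes the same three metrics from confusion counts and returns them as a dict rather than printing them. The names `confusion_counts`, `evaluate_to_dict`, and `ThresholdModel` are hypothetical, the toy data is invented, and the hand-written metrics only stand in for the scikit-learn functions used above. It also assumes plain lists of 0/1 labels rather than DataFrames.

```python
from time import perf_counter


def confusion_counts(labels, preds):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate_to_dict(model, features, labels):
    """Return accuracy, precision, recall, and latency as a dict."""
    start = perf_counter()  # monotonic clock intended for short intervals
    preds = list(model.predict(features))
    latency_ms = (perf_counter() - start) * 1000

    tp, fp, fn, tn = confusion_counts(labels, preds)
    return {
        "model": type(model).__name__,
        "accuracy": round((tp + tn) / len(labels), 3),
        # Guard against division by zero when nothing is predicted or labeled positive
        "precision": round(tp / (tp + fp), 3) if tp + fp else 0.0,
        "recall": round(tp / (tp + fn), 3) if tp + fn else 0.0,
        "latency_ms": round(latency_ms, 1),
    }


class ThresholdModel:
    """Hypothetical stand-in for a fitted classifier: predicts 1 above 0.5."""

    def predict(self, features):
        return [1 if x > 0.5 else 0 for x in features]


toy_features = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_labels = [1, 0, 0, 1, 1]
result = evaluate_to_dict(ThresholdModel(), toy_features, toy_labels)
print(result)
```

Returning a dict makes it easy to collect one row per model and compare them side by side, at the cost of a separate step to format the output.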

**Evaluate the best models on the validation set**

In [160]:
# Evaluate each saved model on the validation set
for mdl in [gb_mdl, rf_mdl, stacked_mdl]:
    evaluate_model(mdl, X_val, y_val)

GradientBoostingClassifier -- Accuracy: 0.809 / Precision: 0.804 / Recall: 0.631 / Latency: 3.0ms
RandomForestClassifier -- Accuracy: 0.798 / Precision: 0.784 / Recall: 0.615 / Latency: 3.9ms
StackingClassifier -- Accuracy: 0.809 / Precision: 0.792 / Recall: 0.646 / Latency: 8.0ms
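To make the comparison explicit, the printed validation figures can be put into a small structure and queried per metric. This is only a sketch: the numbers are copied from the output above, while `validation_results` and `best_models` are hypothetical names introduced for illustration.

```python
# Validation metrics copied from the printed output above.
validation_results = [
    {"model": "GradientBoostingClassifier", "accuracy": 0.809,
     "precision": 0.804, "recall": 0.631, "latency_ms": 3.0},
    {"model": "RandomForestClassifier", "accuracy": 0.798,
     "precision": 0.784, "recall": 0.615, "latency_ms": 3.9},
    {"model": "StackingClassifier", "accuracy": 0.809,
     "precision": 0.792, "recall": 0.646, "latency_ms": 8.0},
]


def best_models(results, metric, higher_is_better=True):
    """Return every model that attains the best value of a metric (ties included)."""
    values = [row[metric] for row in results]
    target = max(values) if higher_is_better else min(values)
    return [row["model"] for row in results if row[metric] == target]


for metric in ("accuracy", "precision", "recall"):
    print(metric, best_models(validation_results, metric))
print("latency_ms", best_models(validation_results, "latency_ms", higher_is_better=False))
```

Returning a list rather than a single name keeps the accuracy tie between the two top models visible instead of silently picking one.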


**Considerations for model selection**
* Accuracy: GradientBoostingClassifier and StackingClassifier tie at 0.809, slightly ahead of RandomForestClassifier at 0.798. Accuracy should not be the only criterion, though. The Titanic classes are imbalanced (non-survivors outnumber survivors), and the cost of a false negative (a missed survivor) may matter more than the overall hit rate.
* Precision: GradientBoostingClassifier has the highest precision (0.804), followed by StackingClassifier (0.792) and RandomForestClassifier (0.784). Precision is the share of predicted survivors who actually survived, so it matters most when false positives (predicting a non-survivor as a survivor) are costly.
* Recall: StackingClassifier has the highest recall (0.646), ahead of GradientBoostingClassifier (0.631) and RandomForestClassifier (0.615). Recall is the share of actual survivors the model identifies, so it matters most when capturing as many survivors as possible is critical, even at the cost of more false positives.
* Latency: GradientBoostingClassifier is the fastest (3.0ms), followed by RandomForestClassifier (3.9ms) and StackingClassifier (8.0ms). Latency is a concern for real-time applications or large datasets; on a dataset as small as Titanic, a gap of a few milliseconds matters far less than predictive performance.

**Conclusion**

If the primary goal is to minimize false positives (passengers incorrectly predicted as survivors), GradientBoostingClassifier is the best choice: it has the highest precision (0.804) and ties for the highest accuracy (0.809).

If the goal is to capture as many survivors as possible, even at the cost of some false positives, StackingClassifier is the best choice because of its higher recall (0.646 versus 0.631).

Best model (overall): GradientBoostingClassifier. It matches StackingClassifier on accuracy, has the highest precision and the lowest latency (3.0ms), and gives up only 0.015 in recall.
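One common way to weigh precision against recall in a single number is the F1 score, their harmonic mean. It is not used in the analysis above, so the sketch below is an optional cross-check under that assumption. It works from the rounded validation figures, which makes the third decimal approximate, and `f1_from` and `validation_pr` are hypothetical names.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the validation output.
validation_pr = {
    "GradientBoostingClassifier": (0.804, 0.631),
    "RandomForestClassifier": (0.784, 0.615),
    "StackingClassifier": (0.792, 0.646),
}

f1_by_model = {
    name: round(f1_from(precision, recall), 3)
    for name, (precision, recall) in validation_pr.items()
}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1}")
```

On these figures the two top models come out within about 0.005 of each other, with the stacked model marginally ahead. That reinforces the point that the choice depends mainly on whether precision or recall matters more.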

**Evaluate the best model on the test set**

In [165]:
# Evaluate the selected model on the holdout test set
evaluate_model(gb_mdl, X_test, y_test)

GradientBoostingClassifier -- Accuracy: 0.799 / Precision: 0.845 / Recall: 0.645 / Latency: 3.0ms


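The GradientBoostingClassifier above is chosen as the final model for predicting whether a passenger aboard the Titanic survived. Its test-set results (accuracy 0.799, precision 0.845, recall 0.645) are close to its validation results (0.809, 0.804, 0.631), which suggests that it generalizes to unseen data.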
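The validation and test figures for GradientBoostingClassifier can be compared directly. The sketch below computes the differences and adds a rough binomial standard error for the test accuracy as a sense of scale. The test-set size is not stated in this section, so `N_TEST_PLACEHOLDER` is a hypothetical value to be replaced with `len(y_test)`; the function name is also an assumption.

```python
from math import sqrt

# Metrics copied from the validation and test outputs for GradientBoostingClassifier.
gb_validation = {"accuracy": 0.809, "precision": 0.804, "recall": 0.631}
gb_test = {"accuracy": 0.799, "precision": 0.845, "recall": 0.645}

# Positive values mean the model scored higher on the test set.
deltas = {metric: round(gb_test[metric] - gb_validation[metric], 3)
          for metric in gb_validation}


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples predictions."""
    return sqrt(accuracy * (1 - accuracy) / n_samples)


N_TEST_PLACEHOLDER = 200  # hypothetical size; replace with len(y_test)
se = accuracy_standard_error(gb_test["accuracy"], N_TEST_PLACEHOLDER)

print(deltas)
print(f"approx. standard error of test accuracy: {se:.3f}")
```

With a few hundred test rows the standard error is on the order of a few percentage points, so an accuracy gap of 0.010 between validation and test is well within what sampling noise alone could produce.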