# Titanic Passenger Survival Prediction

## Project Overview

This project predicts whether a passenger survived the sinking of the Titanic from features such as ticket class, age, gender, and family relations aboard the ship. The dataset records these details for each passenger, which makes it suitable for classification models. It is one of the best-known datasets in data science and serves here to demonstrate machine learning classification techniques.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/c/titanic/data

## Data Dictionary

The dataset contains the following columns:

- **survival**: Survival (0 = No, 1 = Yes)
- **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **sex**: Gender of the passenger
- **age**: Age of the passenger in years
- **sibsp**: Number of siblings/spouses aboard the Titanic
- **parch**: Number of parents/children aboard the Titanic
- **ticket**: Ticket number
- **fare**: Passenger fare
- **cabin**: Cabin number
- **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### Variable Notes

- **pclass**: A proxy for socio-economic status (SES)
  - 1st = Upper class
  - 2nd = Middle class
  - 3rd = Lower class

- **age**: Age is fractional if less than 1. If the age is estimated, it is in the form of `xx.5`.

- **sibsp**: Number of siblings/spouses aboard the Titanic.
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

- **parch**: Number of parents/children aboard the Titanic.
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children traveled only with a nanny, therefore `parch=0` for them.
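
The preview under Load Data shows engineered columns (`FamilySize`, `Sex_male`, `Embarked_Q`, `Embarked_S`) that are not in this data dictionary. The sketch below shows one way such columns could be derived from `sibsp`, `parch`, `sex`, and `embarked`. It is an assumption about the preprocessing, not the author's confirmed code, and the four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical raw rows; column names follow the Kaggle training file.
raw = pd.DataFrame(
    {
        "SibSp": [1, 1, 0, 0],
        "Parch": [0, 0, 0, 2],
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "S", "Q"],
    }
)

# Family size = the passenger plus siblings/spouses plus parents/children.
raw["FamilySize"] = raw["SibSp"] + raw["Parch"] + 1

# One-hot encode the categorical columns and drop the first level of each,
# which leaves Sex_male, Embarked_Q and Embarked_S.
encoded = pd.get_dummies(
    raw, columns=["Sex", "Embarked"], drop_first=True, dtype=int
)

print(encoded)
```

Dropping the first level avoids a redundant column: a passenger who is neither `Embarked_Q` nor `Embarked_S` must have boarded at Cherbourg.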

## Objective

The goal of this project is to build a classification model that predicts the survival of passengers aboard the Titanic based on the provided features. The project includes data exploration, feature engineering, model building, and evaluation of classification models.

### Problem Statement

- **Model Training**: Fit a classifier to the dataset so that it learns the patterns that distinguish survivors from non-survivors.
- **Model Evaluation**: Measure the model's performance with accuracy, precision, recall, and F1 score.
- **Model Optimization**: Use cross-validation and hyperparameter tuning to find a better-performing model.
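
As a minimal sketch of what the four evaluation metrics measure, the function below computes them from the counts of true/false positives and negatives. The helper name `binary_metrics` and the example labels are hypothetical; the notebook itself uses the scikit-learn metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # wrongly predicted survivor
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed survivor

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = survived, 0 = did not survive.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```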

### Load Libraries

In [14]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model and Evaluation Metrics
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

### Settings

In [3]:
# warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "titanic_en.csv")

### Load Data

In [4]:
df = pd.read_csv(csv_path)

In [5]:
# Check data
df.head()

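| Unnamed: 0 | Survived | Pclass | Age | SibSp | Parch | Fare | FamilySize | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 2 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |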
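
The `Unnamed: 0` column in this preview looks like a row index that was written out when the CSV was saved. That is an assumption, but if it holds, the column carries no information about survival and would otherwise be passed to the model as a feature. A small sketch of how pandas produces such a column, and how `index_col=0` avoids it, using an in-memory CSV with a subset of the columns:

```python
import io

import pandas as pd

# In-memory stand-in for a CSV saved with its index (note the empty first header).
csv_text = """,Survived,Pclass,Age
0,0,3,22.0
1,1,1,38.0
2,1,3,26.0
"""

# Default read: the unnamed index column becomes a regular "Unnamed: 0" column.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 treats that column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(clean.columns))
```

An equivalent fix after loading would be `df.drop(columns="Unnamed: 0")`.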


### Preprocessing

In [6]:
# Separate the input features and the target for supervised learning
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [7]:
# Split into train and test data (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
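
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify=y` so that both splits keep roughly the same share of survivors. A sketch on made-up labels (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive class.
y_demo = np.array([1] * 40 + [0] * 60)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```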

In [8]:
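# Standardize the features so that they share a common scale

# Define scaler; fit on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Note: train_evaluate() below fits on the unscaled X_train/X_test, so these scaled
# arrays are not used; tree-based models such as XGBoost do not require scaling.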
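
To illustrate what the scaler does, here is a NumPy-only sketch of standardization: the mean and standard deviation come from the training rows only and are then reused for the test rows. The numbers are made-up age and fare values, not the real split.

```python
import numpy as np

# Made-up (age, fare) rows standing in for the train and test splits.
train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
test = np.array([[35.0, 8.05]])

# Statistics are estimated from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6))  # approximately zero per column
print(train_scaled.std(axis=0).round(6))   # approximately one per column
```

Reusing the training statistics on the test rows keeps information about the test set out of the preprocessing step.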

### Model Training and Evaluation

In [36]:
# Define a function to train the model and print the evaluation metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Prediction on train and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print Evaluation Metrics
    print("=" * 60)
    print("Training Evaluation")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): 0.2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred):0.2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred):0.2f}")
    print(f"F1: {f1_score(y_train, y_train_pred):0.2f}")
    print("=" * 60)
    print("Testing Evaluation")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): 0.2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred):0.2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred):0.2f}")
    print(f"F1: {f1_score(y_test, y_test_pred):0.2f}")

In [37]:
# Try with XGBoost Classifier
xgbc = XGBClassifier()
train_evaluate(xgbc)

Training Evaluation
Accuracy:  0.97
Precision: 0.98
Recall: 0.93
F1: 0.95
Testing Evaluation
Accuracy:  0.79
Precision: 0.74
Recall: 0.76
F1: 0.75


### Insights

The model's performance on the training data is very high, with an accuracy of **97%**. This suggests that the model is doing an excellent job of predicting the outcomes for the training dataset. The precision of **98%** indicates that most of the predicted survivors are indeed survivors (very few false positives), and the recall of **93%** suggests that the model is able to correctly identify most of the true survivors (few false negatives). The F1-score of **0.95**, which is the harmonic mean of precision and recall, also shows a strong balance between them.

On the testing data, the model’s accuracy drops to **79%**, which is still reasonably good but significantly lower than the training accuracy. The precision (**0.74**) and recall (**0.76**) also show a drop compared to the training set. This indicates that the model is not as effective when dealing with unseen data and may be struggling to generalize the patterns it learned during training.

- **Precision**: The value of 0.74 means that when the model predicts someone will survive, **74%** of the time, they actually did survive. The model makes more false positive errors compared to the training data.
- **Recall**: The value of **0.76** means that the model is able to identify **76%** of the actual survivors. It’s missing some survivors (false negatives), meaning it's not as comprehensive in its predictions.
- **F1-Score**: This is a balance between precision and recall. While the model is still performing decently, the drop from **0.95** in training to **0.75** in testing reflects the gap in performance, which is a sign of **overfitting**.
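
The reported F1 values can be checked from the rounded precision and recall, since F1 is their harmonic mean. A short worked calculation (the helper name `f1_from` is hypothetical):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values reported above for the untuned model.
train_f1 = f1_from(0.98, 0.93)
test_f1 = f1_from(0.74, 0.76)

print(f"Train F1: {train_f1:.2f}")
print(f"Test F1:  {test_f1:.2f}")
```

Both results agree with the reported 0.95 and 0.75 after rounding to two decimals.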

### Model Optimization

In [25]:
# Cross Validation: use k-fold cross-validation to get a more reliable estimate of how the
# model performs on unseen data than a single train/test split provides.

kf = KFold(n_splits=5, shuffle=True, random_state=42)
xgbc_cv = XGBClassifier()
scores = cross_val_score(xgbc_cv, X, y, cv=kf)

print(f"Mean Cross Validation Score: {scores.mean(): 0.2f}")

Mean Cross Validation Score:  0.81


### Insights

- The score of **0.81** is the mean accuracy over the five folds: each fold is held out once while the model trains on the other four.
- The mean cross-validation score is close to the test-set accuracy (**0.79**), which indicates that the single train/test split gives a representative estimate of performance on unseen data rather than a lucky or unlucky one.
- This agreement does not rule out **overfitting**: the gap between training accuracy (**0.97**) and held-out accuracy (about **0.80**) remains. A cross-validation score much higher or lower than the test accuracy would instead point to an unrepresentative split or to **data leakage**.
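
The mean alone hides how much the folds differ. A minimal sketch that also reports the per-fold scores and their standard deviation, using synthetic data and a `DecisionTreeClassifier` as a stand-in for the XGBoost model (both are assumptions made to keep the example self-contained):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# One accuracy score per fold.
scores = cross_val_score(clf, X_demo, y_demo, cv=kf)

print(f"Fold scores: {scores.round(2)}")
print(f"Mean: {scores.mean():.2f}, std: {scores.std():.2f}")
```

A small standard deviation relative to the mean supports reading the mean as a stable estimate.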

In [19]:
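# Function to tune the hyperparameters
def tune_hyperparameter(model, param_dict):
    # Define GridSearchCV
    gscv = GridSearchCV(
        model,
        param_grid=param_dict,
        cv=5,
        verbose=1
    )

    # Fit the model for every hyperparameter combination.
    # Note: fitting on the full X, y lets the test rows influence the search;
    # fitting on X_train, y_train would keep the test set untouched.
    gscv.fit(X, y)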

    # Print best score
    print(f"Best Score: {gscv.best_score_: 0.2f}")

    # Return best hyperparameter set
    best_params = gscv.best_params_
    return best_params
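
`tune_hyperparameter` fits the grid search on the full `X`, `y`, so the rows later used for testing also take part in selecting the hyperparameters. A leakage-free variant would search on the training split only and score the refitted best model on the held-out split. The sketch below shows that pattern on synthetic data with a `DecisionTreeClassifier` stand-in; the data, estimator, and grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
)

# Only the training split is seen during the search.
search.fit(X_tr, y_tr)

# The best model (refitted on the training split) is scored on unseen rows.
held_out_accuracy = search.score(X_te, y_te)

print(search.best_params_)
print(f"Held-out accuracy: {held_out_accuracy:.2f}")
```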

In [39]:
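# Define hyperparameter grid for XGBClassifier
# Note: "labmda" is a misspelling of "lambda" (alias of reg_lambda). It is kept
# here because the recorded output below was produced with it; as written, the
# grid most likely leaves L2 regularization at its default.
param_dict = {
    "n_estimators": [500],
    "max_depth": [8, 10],
    "min_child_weight": [3, 5],
    "colsample_bytree": [0.5, 1.0],
    "alpha": [0, 1, 2],
    "labmda": [0, 1],
    "gamma": [0, 0.1, 1.0]
}

# Define XGBoost classifier
xgbc_ht = XGBClassifier()

# Hyperparameter tuning to get the best hyperparameters
best_params = tune_hyperparameter(xgbc_ht, param_dict)
print(best_params)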

Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best Score:  0.85
{'alpha': 1, 'colsample_bytree': 1.0, 'gamma': 0, 'labmda': 0, 'max_depth': 8, 'min_child_weight': 3, 'n_estimators': 500}
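
The "144 candidates, totalling 720 fits" line follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A quick check with the standard library (keys copied verbatim from the grid above):

```python
from itertools import product

param_dict = {
    "n_estimators": [500],
    "max_depth": [8, 10],
    "min_child_weight": [3, 5],
    "colsample_bytree": [0.5, 1.0],
    "alpha": [0, 1, 2],
    "labmda": [0, 1],
    "gamma": [0, 0.1, 1.0],
}

# Every combination of one value per hyperparameter is one candidate.
n_candidates = len(list(product(*param_dict.values())))

# With cv=5, each candidate is trained and scored five times.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

The count grows multiplicatively, so adding a second value for `n_estimators` would double the search to 288 candidates and 1440 fits.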


In [40]:
# Build with best parameter
model = XGBClassifier(**best_params)
train_evaluate(model)

Training Evaluation
Accuracy:  0.92
Precision: 0.94
Recall: 0.84
F1: 0.89
Testing Evaluation
Accuracy:  0.82
Precision: 0.81
Recall: 0.73
F1: 0.77


### Conclusion

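The results after hyperparameter tuning show a smaller gap between training and testing performance than the untuned model, which points to less overfitting. One caveat: the grid search was fitted on the full dataset (`X`, `y`), so the test rows influenced the choice of hyperparameters and the test metrics below are likely somewhat optimistic. The changes break down as follows: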

#### Training Performance:

The model still performs well on the training data, but every training metric is lower than in the previous evaluation (accuracy **97%**, precision **98%**, recall **93%**). Combined with the higher test accuracy, this suggests the model is now less overfitted and generalizes better.

- **Precision (0.94)**: When the model predicts a survivor, it is correct 94% of the time (down from 98%).
- **Recall (0.84)**: 84% of the actual survivors are correctly identified (down from 93%); the remaining 16% are missed (false negatives).
- **F1-Score (0.89)**: The harmonic mean of precision and recall remains high (down from 0.95).

#### Testing Performance:
  
After hyperparameter tuning, the testing performance has improved slightly compared to the previous evaluation:

- **Accuracy** increased from **79%** to **82%**, which is a sign that the model is better at generalizing to unseen data.
- **Precision** improved to **0.81** (previously **0.74**), meaning that the model makes fewer false positive errors on the test set.
- **Recall** dropped slightly from **0.76** to **0.73**, meaning the model misses a few more actual positives (survivors) compared to before.
- **F1-Score** increased to **0.77** (from **0.75**), indicating an overall improvement in the balance between precision and recall on the test set.
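
One compact way to summarize the effect of tuning is the gap between training and testing accuracy, computed here from the rounded figures reported in this notebook:

```python
# Rounded accuracies reported above.
baseline = {"train": 0.97, "test": 0.79}
tuned = {"train": 0.92, "test": 0.82}

# A smaller train/test gap suggests less overfitting.
baseline_gap = baseline["train"] - baseline["test"]
tuned_gap = tuned["train"] - tuned["test"]

print(f"Gap before tuning: {baseline_gap:.2f}")
print(f"Gap after tuning:  {tuned_gap:.2f}")
```

The gap shrinks from 0.18 to 0.10, consistent with the reading that the tuned model overfits less.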