# Titanic Passenger Survival Prediction

## Project Overview

This project focuses on predicting whether a passenger survived the sinking of the Titanic based on various features like ticket class, age, gender, and family relations aboard the ship. The dataset provides detailed information about each passenger, enabling the use of classification models to predict survival outcomes. This project demonstrates the use of machine learning classification techniques on one of the most famous datasets in the field of data science.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/c/titanic/data

## Data Dictionary

The dataset contains the following columns:

- **survival**: Survival (0 = No, 1 = Yes)
- **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **sex**: Gender of the passenger
- **age**: Age of the passenger in years
- **sibsp**: Number of siblings/spouses aboard the Titanic
- **parch**: Number of parents/children aboard the Titanic
- **ticket**: Ticket number
- **fare**: Passenger fare
- **cabin**: Cabin number
- **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### Variable Notes

- **pclass**: A proxy for socio-economic status (SES)
  - 1st = Upper class
  - 2nd = Middle class
  - 3rd = Lower class

- **age**: Age is fractional if less than 1. If the age is estimated, it is in the form of `xx.5`.

- **sibsp**: Number of siblings/spouses aboard the Titanic.
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

- **parch**: Number of parents/children aboard the Titanic.
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children traveled only with a nanny, therefore `parch=0` for them.

## Objective

The goal of this project is to build a classification model that predicts the survival of passengers aboard the Titanic based on the provided features. The project includes data exploration, feature engineering, model building, and evaluation of classification models.

### Problem Statement

- **Model Training**: Train a classifier on the dataset so that it learns the patterns that distinguish survivors from non-survivors.
- **Model Evaluation**: Evaluate the model using several metrics: accuracy, precision, recall, and F1 score.
- **Model Optimization**: Use cross-validation and hyperparameter tuning to find a better-performing model.

### Load Libraries

In [1]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Model and Evaluation Metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Model Optimization
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

### Settings

In [2]:
# warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "titanic_en.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check data
df.head()

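| Unnamed: 0 | Survived | Pclass | Age | SibSp | Parch | Fare | FamilySize | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 2 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |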
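
The loaded file is already preprocessed: `FamilySize`, `Sex_male`, `Embarked_Q`, and `Embarked_S` do not appear in the data dictionary, and the notebook does not show how they were produced. The following is a minimal sketch of one plausible derivation, assuming `FamilySize = SibSp + Parch + 1` (which is consistent with the rows shown) and one-hot encoding with the first category dropped. The `raw` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw rows using the Kaggle column names
raw = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Assumption: family size = the passenger + siblings/spouses + parents/children
raw["FamilySize"] = raw["SibSp"] + raw["Parch"] + 1

# Declare the full category sets so every dummy column exists,
# even when a category (here "Q") is absent from the sample
raw["Sex"] = pd.Categorical(raw["Sex"], categories=["female", "male"])
raw["Embarked"] = pd.Categorical(raw["Embarked"], categories=["C", "Q", "S"])

# One-hot encode and drop the first category of each column ("female", "C")
encoded = pd.get_dummies(raw, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Dropping the first category avoids a redundant column: a passenger with `Embarked_Q = 0` and `Embarked_S = 0` must have embarked at Cherbourg.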


### Preprocessing

In [5]:
# Separate the input features (X) from the target (y) for supervised learning.
# Note: the "Unnamed: 0" index column from the CSV stays in X as a feature.
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [6]:
# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Standardize the features so that they are all on the same scale

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Model Training and Evaluation

In [8]:
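# Define a function to train the model and print the evaluation metrics
def train_evaluate(model):
    # Train the model.
    # Note: this fits on the unscaled X_train; the scaled arrays
    # X_train_s / X_test_s are not used by this function.
    model.fit(X_train, y_train)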

    # Prediction on train and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print Evaluation Metrics
    print("=" * 60)
    print("Training Evaluation")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): 0.2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred):0.2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred):0.2f}")
    print(f"F1: {f1_score(y_train, y_train_pred):0.2f}")
    print("=" * 60)
    print("Testing Evaluation")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): 0.2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred):0.2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred):0.2f}")
    print(f"F1: {f1_score(y_test, y_test_pred):0.2f}")

In [9]:
# Try with SVM Classifier
svc = SVC()
train_evaluate(svc)

Training Evaluation
Accuracy:  0.68
Precision: 0.71
Recall: 0.25
F1: 0.37
Testing Evaluation
Accuracy:  0.66
Precision: 0.76
Recall: 0.26
F1: 0.38


### Insights

The similar results between the training and testing sets suggest that the model is not severely overfitting, but the overall low accuracy and F1 score hint at underfitting. The model may not be complex enough to capture the underlying relationships in the data.

#### Training Metrics:

- **Accuracy (0.68)**: The model correctly classifies **68%** of the training data, which is modest and suggests that the model is not capturing the complexity of the data very well.
- **Precision (0.71)**: Precision is the share of passengers predicted as survivors (positive class) who actually survived. At **0.71 (71%)**, most of the model's positive predictions are correct, so it makes relatively few false positive errors.
- **Recall (0.25)**: The recall (also known as sensitivity) is quite low at **0.25 (25%)**. This means the model only identifies **25%** of actual survivors, missing **75%** of them (many false negatives). The model is not doing well at finding all the actual survivors.
- **F1 Score (0.37)**: The F1 score is the harmonic mean of precision and recall, balancing both metrics. The F1 score of **0.37** suggests that while precision is decent, the poor recall significantly drags down the overall performance.

#### Testing Metrics:

- **Accuracy (0.66)**: The testing accuracy is **66%**, slightly lower than the training accuracy of **68%**. This suggests that the model's generalization to unseen data is fairly consistent but still not particularly strong.
- **Precision (0.76)**: On the test set, precision rises to **0.76**, meaning a slightly larger share of predicted survivors actually survived, with fewer false positives (i.e., non-survivors wrongly predicted as survivors).
- **Recall (0.26)**: However, recall remains very low at **26%**, similar to the training recall. The model continues to miss many actual survivors, indicating a large number of false negatives.
- **F1 Score (0.38)**: The F1 score remains low at **0.38**, similar to the training score, because the recall is poor despite a decent precision.
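
The four metrics discussed above all follow from the counts of true and false positives and negatives. The sketch below computes them by hand on a small set of hypothetical labels (not taken from the dataset), then checks that the reported training precision and recall imply the reported F1 score.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors flagged as survivors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors correctly rejected

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical labels: the model finds only 1 of the 4 actual survivors
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(f"Accuracy: {acc:0.2f}, Precision: {prec:0.2f}, Recall: {rec:0.2f}, F1: {f1:0.2f}")

# Harmonic mean of the reported training precision (0.71) and recall (0.25)
f1_reported = 2 * 0.71 * 0.25 / (0.71 + 0.25)
print(f"F1 from reported values: {f1_reported:0.2f}")
```

Because F1 is a harmonic mean, it stays close to the smaller of the two inputs, which is why a recall of 0.25 pulls the score down to 0.37 even with a precision of 0.71.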

#### Possible Causes of Low Performance:

1. **Class Imbalance:**

   The Titanic dataset typically has fewer survivors than non-survivors. This imbalance can make it harder for the SVM classifier to learn to identify the minority class (survivors).

2. **Model Complexity:**

   The SVM model is used with its default settings and is not tuned for this problem. The choice of kernel, the regularization parameters, and the lack of feature scaling could all be limiting performance, since SVMs are sensitive to these aspects. Although scaled arrays (`X_train_s`, `X_test_s`) are computed during preprocessing, `train_evaluate` fits the model on the unscaled `X_train`.
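
Both possible causes can be addressed in the model definition itself. The following is a minimal sketch, not the notebook's method: it wraps the scaler and the classifier in a scikit-learn `Pipeline` so that scaling is always applied, and it sets `class_weight="balanced"` to penalize errors on the minority class more heavily. The synthetic data from `make_classification` is a stand-in for the Titanic features, and the 62/38 class split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in data (assumption: ~38% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.62, 0.38], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data and reuses it at predict time
scaled_svc = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
scaled_svc.fit(X_tr, y_tr)

y_pred = scaled_svc.predict(X_te)
print(f"Precision: {precision_score(y_te, y_pred):0.2f}")
print(f"Recall: {recall_score(y_te, y_pred):0.2f}")
```

With a pipeline, the same object can be passed to any helper that calls `fit` and `predict`, so the scaled features cannot be skipped by accident.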

### Model Optimization

In [10]:
# Function to tune the hyperparameters
def tune_hyperparameter(model, param_dict):
    # Define GridSearchCV
    gscv = GridSearchCV(
        model,
        param_grid=param_dict,
        cv=5,
        verbose=1
    )

    # Fit the model for every hyperparameter combination.
    # Note: the search runs on the full dataset (X, y), which includes the test split.
    gscv.fit(X, y)

    # Print best score
    print(f"Best Score: {gscv.best_score_: 0.2f}")

    # Return best hyperparameter set
    best_params = gscv.best_params_
    return best_params

In [11]:
# Define hyperparameter dictionary for SVC
param_dict = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'], 
    'gamma': [0.1, 1, 10]
}

# Define SVC
svc_ht = SVC()

# Hyperparameter tuning to get the best hyperparameters
best_params = tune_hyperparameter(svc_ht, param_dict)
print(best_params)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Best Score:  0.79
{'C': 0.1, 'gamma': 0.1, 'kernel': 'linear'}
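
Two details of this search are worth noting: `gamma` has no effect when `kernel='linear'`, so several of the 18 candidates are duplicates, and the search is fit on the full dataset rather than the training split. The sketch below shows one possible variant under stated assumptions: it uses a list of grids so that `gamma` is only varied for the RBF kernel, searches on the training split only, and scores by F1. The synthetic data and the pipeline are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: replaces the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix
pipe = make_pipeline(StandardScaler(), SVC())

# One grid per kernel: gamma is only searched where it matters
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)  # the test split is never seen during the search

print(search.best_params_)
print(f"Held-out F1: {search.score(X_te, y_te):0.2f}")
```

This grid has 12 candidates (3 linear plus 9 RBF) instead of 18, and the held-out score is an estimate on data the search did not use.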


In [12]:
# Build with best parameter
model = SVC(**best_params)
train_evaluate(model)

Training Evaluation
Accuracy:  0.79
Precision: 0.74
Recall: 0.68
F1: 0.71
Testing Evaluation
Accuracy:  0.78
Precision: 0.75
Recall: 0.70
F1: 0.73


### Conclusion

After hyperparameter tuning, the performance of the SVM classifier on the Titanic survival dataset improved significantly, with a better balance between precision and recall and consistent performance across the training and testing sets. The evaluation metrics are discussed below.

#### Training Metrics:

- **Accuracy (0.79)**: The model correctly classifies **79%** of the training data, a considerable improvement from the previous accuracy of **68%**. This suggests the model is learning the patterns in the training data much better after hyperparameter tuning.
- **Precision (0.74)**: Precision has increased slightly from **0.71 to 0.74**. This means that **74%** of the passengers the model predicts as survivors are actual survivors, so false positives remain relatively few.
- **Recall (0.68)**: The recall has increased significantly from **0.25 to 0.68**, meaning the model now correctly identifies **68%** of actual survivors in the training data. This is a big improvement and shows the model is now detecting more true positives and missing fewer actual survivors.
- **F1 Score (0.71)**: The F1 score, which balances precision and recall, is now **0.71**, much higher than before (**0.37**). This indicates that the model has a better balance between precision and recall, and is performing well on the training data.

#### Testing Metrics:

- **Accuracy (0.78)**: The testing accuracy is **78%**, very close to the training accuracy of **79%**. This is a good sign of the model generalizing well to unseen data, with consistent performance on both the training and testing sets.
- **Precision (0.75)**: On the test set, precision is **0.75**, which means that **75%** of passengers predicted as survivors actually survived. This is marginally higher than the training precision (**0.74**) and about the same as the untuned model's test precision (**0.76**), so the model remains reliable in its positive predictions.
- **Recall (0.70)**: The recall on the test set is **0.70**, meaning the model correctly identifies **70%** of actual survivors in the test data. This is close to the training recall of **0.68**, indicating consistent performance. It shows that the model is now much better at detecting true positives (survivors) compared to the previous results.
- **F1 Score (0.73)**: The F1 score on the test set is **0.73**, up from **0.38** before tuning, showing a good balance between precision and recall. Its closeness to the training F1 score (**0.71**) indicates consistent performance on seen and unseen data.

#### Summary

1. **Balanced Performance**: The model now has a much better balance between precision and recall, with F1 scores of **0.71** (training) and **0.73** (testing). This shows that hyperparameter tuning has significantly improved its ability to predict survivors without missing too many or making too many false predictions.
2. **Improved Generalization**: The near-equal performance on the training and testing sets suggests that the model generalizes to unseen data without obvious overfitting. One caveat: the grid search was fit on the full dataset (`X`, `y`), which includes the test split, so the test metrics may be somewhat optimistic.
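
A single train/test split gives only one estimate of generalization. As a hedged sketch of how that claim could be checked further, the example below runs 5-fold cross-validation with `KFold` and `cross_val_score` (both imported in the notebook but not used) on the tuned settings `C=0.1`, `kernel='linear'`. The synthetic data and the scaling pipeline are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: replaces the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Tuned settings reported by the grid search, wrapped with a scaler
model = make_pipeline(StandardScaler(), SVC(C=0.1, kernel="linear"))

# Shuffle before splitting so each fold is a random sample of the rows
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="f1")

print(f"F1 per fold: {scores.round(2)}")
print(f"Mean F1: {scores.mean():0.2f} (+/- {scores.std():0.2f})")
```

A small spread across folds would support the generalization claim more strongly than one split can.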
