<div align="center"><h1><b>Global Changemakers Study Group</b></h1></div>

<div align="center"><h2><b>Practical Session</b></h2></div>

# Titanic Dataset: Building Classification Models, Hyperparameter Tuning, and Model Validation

## Introduction

In this practical presentation session, we will explore the process of building and validating classification models using the Titanic dataset. The objective is to predict the survival of passengers based on various features such as age, gender, class, and other relevant attributes. This session will guide you through the steps of model building, hyperparameter tuning, and model validation using three different machine learning models: Support Vector Machine (SVM), Random Forest Classifier, and Feedforward Neural Network.

## Objectives
- **Data Cleaning**: Create a function that takes in the raw data and returns a cleaned DataFrame.
- **Data Scaling**: Scale the data using sklearn's `StandardScaler`.
- **Model Building**: Construct three classification models: SVM, Random Forest, and Feedforward Neural Network using `scikit-learn` and `keras` libraries.
- **Initial Evaluation with Train-Test Split**: Perform an initial evaluation by splitting the dataset into training and testing sets.
- **Hyperparameter Tuning with Cross-Validation**: Optimize the models using Grid Search with Cross-Validation to tune hyperparameters.
- **Final Evaluation**: Assess the tuned models with cross-validation on the training set to estimate how well they generalize.

## Libraries

The following libraries will be used in this session:

- `numpy`: For numerical computations.
- `pandas`: For data manipulation and analysis.
- `scikit-learn`: For building and evaluating machine learning models.
- `keras`: For constructing and training neural networks.

## Dataset Overview

The pre-cleaned Titanic dataset provides information on the passengers of the Titanic ship. Key features include:

- `Ticket`: Ticket number. (Removed after data cleaning)
- `Name`: Passenger's name. (Removed after data cleaning)
- `PassengerId`: Unique identifier for each passenger. (Removed after data cleaning)
- `Cabin`: Cabin number. (Removed after data cleaning)
- `Survived`: Survival indicator (0 = No, 1 = Yes).
- `Pclass`: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd).
- `Sex`: Gender of the passenger.
- `Age`: Age of the passenger.
- `Fare`: Passenger fare.
- `SibSp`: Number of siblings/spouses aboard the Titanic.
- `Parch`: Number of parents/children aboard the Titanic.
- `Family_Size`: Number of family members aboard with the passenger (not counting the passenger). Created by adding `Parch` and `SibSp`.
- `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
- `Title`: one-hot encoded `Title` for each passenger. Created from the passenger names in the original dataset.

## Key Steps in the Workflow

1. **Load and Clean the Dataset**:
    - We will load the dataset and create a function that cleans it so that it is ready for model building and evaluation. Preprocessing steps such as handling missing values and encoding categorical variables are done within the function.

2. **Scale Data**:
    - After splitting the data, we will standardize the numeric features with `StandardScaler`, fitting the scaler on the training split only, to assess whether scaling improves performance.

3. **Model Building**:
    - We will build three classification models: SVM, Random Forest, and Feedforward Neural Network using the `scikit-learn` and `keras` libraries.

4. **Initial Evaluation of Models**:
    - We will perform an initial evaluation by splitting the dataset into training and validation sets. This will give us a quick overview of the model performance on unseen data.

5. **Hyperparameter Tuning with Cross-Validation**:
    - To optimize our models, we will use Grid Search with Cross-Validation to tune the hyperparameters. This step selects the set of hyperparameters with the best cross-validated performance.

6. **Final Evaluation**:
    - After tuning the hyperparameters, the models are refit with the best parameters on the entire training set. The final evaluation uses cross-validation on the training set to assess the models' performance and generalizability.



## Metrics for Evaluation

We will compare the models' performance based on various metrics such as accuracy, precision, recall, and F1-score. These metrics will help us understand the strengths and weaknesses of each model in predicting passenger survival.
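
As a minimal sketch of how these four metrics relate, the block below computes them from the counts of a binary confusion matrix. The helper name `classification_metrics` is hypothetical, and the counts are not printed anywhere in this notebook: they were inferred from the SVM validation scores reported in section 4 (179 validation rows), so treat them as an illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: of the passengers predicted to survive, how many did?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the passengers who survived, how many were found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts inferred from the SVM validation scores in section 4 (an assumption)
metrics = classification_metrics(tp=56, fp=13, fn=18, tn=92)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8268
# precision: 0.8116
# recall: 0.7568
# f1: 0.7832
```

Precision and recall pull in different directions: a model that predicts "survived" less often tends to gain precision and lose recall, which is why F1 is reported alongside them.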

This comprehensive approach to model building and validation aims to ensure that our models are not only well-tuned but also generalize effectively to unseen data, providing robust and reliable predictions.


# Table of Contents:

- **1. [Load Dependencies](#load-dependencies)**

- **2. [Load Data, Create a function to clean the Data, Separate dataset and Execute train-test split](#load-clean-separate-data)**
  - **2.1. [Load Data](#load)**
  - **2.2. [Create function](#clean)**
  - **2.3. [Separate Data](#separate-scale)**

- **3. [Build Models](#Model-Building)**

- **4. [Initial Evaluation of Models](#Evaluation)**

- **5. [Tuning Model Hyperparameters](#Hyperparameter-tuning)**

- **6. [Final Evaluation of Models](#Evaluation)**

- **7. [Conclusion](#Conclusion)**

<a id="load-dependencies"></a>

# 1. Load dependencies
As usual, the first step in a Python data project is to import the packages we will use during the session.

In [1]:
import warnings
warnings.filterwarnings('ignore')

# run the code below if you do not have `scikeras`
!pip install scikeras

# 'scikeras' is a library that extends the capabilities of scikit-learn by providing compatibility between scikit-learn and Keras, a popular deep learning framework.
# It offers utilities to seamlessly integrate Keras models into scikit-learn workflows.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,  cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasClassifier


<a id="load-clean-separate-data"></a>

# 2. Load Data, Create a function to clean the Data, Separate dataset and Execute train-test split
The Titanic dataset is imported and loaded into a Pandas DataFrame for manipulation and analysis. The target variable contains missing values, which need to be handled since machine learning models cannot accept them. The dataset is separated into train (with non-null target values) and test (with null target values) variables. The train data is used for training models, while the test data is used for making predictions after training. The train data is then split into training and validation sets following an 80-20 rule to ensure the model generalizes well and the evaluation metrics are reliable.

<a id="load"></a>
## 2.1. Load Data
We start by importing the Titanic dataset, which is stored in a CSV file format. Loading the data into a Pandas DataFrame allows us to easily manipulate and analyze it using Python.

In [2]:
train = pd.read_csv("https://raw.githubusercontent.com/WoDauKuro/ALX-DS-Projects/main/Titanic%20survival%20prediction/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/WoDauKuro/ALX-DS-Projects/main/Titanic%20survival%20prediction/test.csv")
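# Concatenate `train` and `test` so both are cleaned together
# (the frame is named `df_scaled` throughout, although scaling happens later)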
df_scaled = pd.concat([train, test], axis=0, sort=True)

In [4]:
# Check the information of df
df_scaled.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1046 non-null   float64
 1   Cabin        295 non-null    object 
 2   Embarked     1307 non-null   object 
 3   Fare         1308 non-null   float64
 4   Name         1309 non-null   object 
 5   Parch        1309 non-null   int64  
 6   PassengerId  1309 non-null   int64  
 7   Pclass       1309 non-null   int64  
 8   Sex          1309 non-null   object 
 9   SibSp        1309 non-null   int64  
 10  Survived     891 non-null    float64
 11  Ticket       1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB


<a id="clean"></a>
## 2.2. Create function
After loading the data, you will observe several problems with it: missing values, unimportant columns, categorical values, and so on. To handle these problems, we will create a function that takes in the dataframe, cleans it, and returns the cleaned dataframe.

In [6]:
def titan_sweep(df_scaled):
    # create new Title column
    # (raw string, so that `\.` is read as a regex escape and not a string escape)
    df_scaled['Title'] = df_scaled['Name'].str.extract(r'([A-Za-z]+)\.', expand=True)

    # replace rare titles with more common ones
    mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr',
           'Don': 'Mr', 'Mme': 'Mrs', 'Jonkheer': 'Mr', 'Lady': 'Mrs',
           'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
    df_scaled.replace({'Title': mapping}, inplace=True)

    # impute missing Age values using median of Title groups
    title_ages = dict(df_scaled.groupby('Title')['Age'].median())

    # create a column of the median ages
    df_scaled['age_med'] = df_scaled['Title'].apply(lambda x: title_ages[x])

    # replace all missing ages with the value in this column
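    # (assign the result back; in-place `fillna` on a selected column is deprecated in recent pandas)
    df_scaled['Age'] = df_scaled['Age'].fillna(df_scaled['age_med'])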
    df_scaled.drop(columns='age_med', inplace=True)

    # impute missing Fare values using median of Pclass groups
    class_fares = dict(df_scaled.groupby('Pclass')['Fare'].median())

    # create a column of the median fares
    df_scaled['fare_med'] = df_scaled['Pclass'].apply(lambda x: class_fares[x])

    # replace all missing fares with the value in this column
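    # (assign the result back; in-place `fillna` on a selected column is deprecated in recent pandas)
    df_scaled['Fare'] = df_scaled['Fare'].fillna(df_scaled['fare_med'])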
    df_scaled.drop(columns='fare_med', inplace=True)

    # Impute missing "embarked" value
    # (`bfill()` replaces the deprecated `fillna(method='backfill')`)
    df_scaled['Embarked'] = df_scaled['Embarked'].bfill()

    #  create Family_Size column
    df_scaled['Family_Size'] = df_scaled['Parch'] + df_scaled['SibSp']

    # convert to category dtype and then encode with numbers using `.cat.codes` attribute
    df_scaled['Sex'] = df_scaled['Sex'].astype('category').cat.codes

    # Encode other categorical variables using one-hot encoding
    categorical = ['Embarked', 'Title']
    for item in categorical:
        df_scaled = pd.concat([df_scaled,
                        pd.get_dummies(df_scaled[item], prefix=item)], axis=1)
        df_scaled.drop(columns=item, inplace=True)

    # drop the variables we won't be using
    df_scaled.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

    return df_scaled

# Implement the function
df_scaled = titan_sweep(df_scaled)
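
To make the title-extraction step easier to follow, here is a minimal sketch that applies the same regular expression with Python's standard `re` module instead of pandas. The helper `extract_title` and the constant names are hypothetical, `RARE_TITLES` is only a subset of the `mapping` dictionary above, and the names are sample strings in the dataset's `Surname, Title. Given names` format.

```python
import re

# Same pattern as in titan_sweep: a run of letters immediately followed by a period
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

# Subset of the rare-title mapping used in titan_sweep (illustrative only)
RARE_TITLES = {'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss', 'Countess': 'Mrs'}


def extract_title(name):
    """Return the first title found in a passenger name, mapped to a common title."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return None
    title = match.group(1)
    return RARE_TITLES.get(title, title)


names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Sagesser, Mlle. Emma",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(name) for name in names]
print(titles)  # ['Mr', 'Miss', 'Miss', 'Mrs']
```

Like `str.extract`, `search` returns the first match, so a name is assigned the first word that is followed by a period.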

In [7]:
# Check the information of the cleaned df
df_scaled.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Age           1309 non-null   float64
 1   Fare          1309 non-null   float64
 2   Parch         1309 non-null   int64  
 3   Pclass        1309 non-null   int64  
 4   Sex           1309 non-null   int8   
 5   SibSp         1309 non-null   int64  
 6   Survived      891 non-null    float64
 7   Family_Size   1309 non-null   int64  
 8   Embarked_C    1309 non-null   bool   
 9   Embarked_Q    1309 non-null   bool   
 10  Embarked_S    1309 non-null   bool   
 11  Title_Dr      1309 non-null   bool   
 12  Title_Master  1309 non-null   bool   
 13  Title_Miss    1309 non-null   bool   
 14  Title_Mr      1309 non-null   bool   
 15  Title_Mrs     1309 non-null   bool   
 16  Title_Rev     1309 non-null   bool   
dtypes: bool(9), float64(3), int64(4), int8(1)
memory usage: 94.6 KB


<a id="separate-scale"></a>
## 2.3. Separate Data
After cleaning our data, we still observe missing values in the `Survived` column. We left that column as is because it is our target variable: imputing it would make the model learn from labels that do not represent the true outcomes.
Since machine learning models do not accept missing values, we need to remove or handle rows with missing values in the target variable before proceeding.

A good approach to handling this issue is to separate the dataframe into `train` and `test` variables. The `train` variable will contain the data with non-null values in the target variable and will be used for training the machine learning models. The `test` variable will contain the features for rows where the target variable is null and will be used for prediction after the model is trained. This approach ensures that we have a complete target variable for training while still being able to make predictions for the rows with missing target values.

With the dataset separated into `train` and `test` variables, the next step is to split `train` into training and validation sets. We adhere to the 80-20 rule, where approximately 80% of the data is used for training and 20% for validation. This split helps us assess how well the model generalizes to new, unseen data and makes our evaluation metrics more reliable.


## Data Scaling in Model Building

Data scaling, also known as feature scaling or normalization, is a crucial preprocessing step in machine learning. It involves transforming the numerical features of a dataset to a similar scale. Here we use standardization, which rescales each feature to zero mean and unit variance.

### Importance of Data Scaling in Model Building:

1. Ensures that all features contribute equally to the result.
2. Helps algorithms converge faster.
3. Prevents features with larger ranges from dominating the learning process.


With the importance of data scaling in mind, let us now scale our data and see how our models perform.


**NOTE:**
Data leakage is a serious issue as it undermines the validity of the model evaluation.

Data leakage occurs when information from outside the training dataset is used to create the model, which can lead to overly optimistic performance estimates and a model that fails to generalize to new, unseen data.

Data Leakage can occur when information from the test set leaks into the training set, often due to improper data splitting or preprocessing.

Example: Scaling the entire dataset before splitting into train and test sets can cause the test set information to influence the training process.

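**Always split your dataset into training, validation, and test sets before fitting any preprocessing step that learns from the data, such as a scaler. Fit it on the training set only, then apply it to the validation and test sets.**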
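
The leakage example above can be sketched without any libraries. The block below standardizes a feature with statistics fitted on the training values only, then shows how the statistics shift when validation values are (incorrectly) included. The helper names `fit_scaler` and `transform` and the fare values are made up for illustration; `pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    mean = fmean(values)
    std = pstdev(values)
    # Guard against constant features, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(value - mean) / std for value in values]


# Hypothetical fares, for illustration only
train_fares = [7.25, 8.05, 13.0, 26.55, 71.28]
val_fares = [8.46, 512.33]

# Correct: fit on the training split only, then reuse the statistics
mean, std = fit_scaler(train_fares)
train_scaled = transform(train_fares, mean, std)
val_scaled = transform(val_fares, mean, std)

# Leaky: the validation values influence the statistics used for training
leaky_mean, leaky_std = fit_scaler(train_fares + val_fares)

print(f"train-only mean={mean:.2f}, std={std:.2f}")
print(f"leaky      mean={leaky_mean:.2f}, std={leaky_std:.2f}")
```

With the correct approach, an unusually large validation value simply lands far from zero after scaling; with the leaky approach, it changes how every training row is scaled.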


In [8]:
# Separate labelled rows (`train`) from unlabelled rows (`X_test`); features and target are split below
train = df_scaled[pd.notnull(df_scaled['Survived'])]
X_test = df_scaled[pd.isnull(df_scaled['Survived'])].drop(['Survived'], axis=1)

col_to_scale = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Family_Size']

# Execute train-test split
X_train, X_val, y_train, y_val = train_test_split(train.drop(['Survived'], axis=1), train['Survived'], test_size=0.2, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the training data
X_train[col_to_scale] = scaler.fit_transform(X_train[col_to_scale])

# Transform the validation data
X_val[col_to_scale] = scaler.transform(X_val[col_to_scale])

# Transform the test data
X_test[col_to_scale] = scaler.transform(X_test[col_to_scale])

# Now X_train, X_val, y_train, y_val, and X_test are ready for modeling

<a id="Model-Building"></a>

# 3. Build Models
In this step we construct the predictive models. The feedforward neural network is built differently from the SVM and random forest models, so we first write a function that builds and compiles the network, then define all three models side by side for comparison.


In [9]:
# Define a builder function for the Keras model
def create_ffnn():
    """
    Constructs a Feedforward Neural Network (FFNN) model using Keras Sequential API.

    Returns:
    model : Sequential
        A compiled FFNN model ready for training and evaluation.

    Notes:
    - This function defines a simple FFNN architecture with three fully connected layers.
    - The input layer size is taken from the number of columns in the global `X_train`, which must be defined before this function is called.
    - The hidden layers use Rectified Linear Unit (ReLU) activation function for introducing non-linearity.
    - The output layer uses a sigmoid activation function, suitable for binary classification tasks.
    - The model is compiled with binary cross-entropy loss function, Adam optimizer, and accuracy metric.
    """
    model = Sequential()
    model.add(Dense(12, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Define models
svm_model = SVC()
rf_model = RandomForestClassifier()
# `model=` is the current SciKeras name for the builder argument (`build_fn` is deprecated)
ffnn_model = KerasClassifier(model=create_ffnn, epochs=100, batch_size=10, verbose=0)
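
To get a feel for the size of the network defined in `create_ffnn`, the sketch below counts its trainable parameters by hand. The function name `dense_param_count` is hypothetical, and the input width of 16 is an assumption derived from the cleaned dataframe (17 columns minus the `Survived` target).

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by the width of each Dense layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 16 input features -> Dense(12) -> Dense(8) -> Dense(1)
layer_sizes = [16, 12, 8, 1]
total_params = dense_param_count(layer_sizes)
print(total_params)  # 317
```

A few hundred parameters is small next to the roughly 700 training rows, which is one reason such a compact architecture is a reasonable starting point here.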

<a id="Evaluation"></a>

# 4. Initial Evaluation of Models
The initial evaluation with a train-test split is our first step in assessing the models. We fit the Support Vector Machine, Random Forest, and Feedforward Neural Network on the training split, using the settings defined in section 3, and score their predictions on the validation split with accuracy, precision, recall, and F1-score. The results give a baseline that guides the hyperparameter tuning in the next section.


In [10]:
# Initial Evaluation with Train-Test Split
print("Initial Evaluation with Train-Test Split")
for model_name, model in [("Support Vector Machine", svm_model), ("Random Forest", rf_model), ("Feedforward Neural Network", ffnn_model)]:
    model.fit(X_train, y_train)
    y_pred_val = model.predict(X_val)
    print(f"Model: {model_name}")
    print("Accuracy: ", accuracy_score(y_val, y_pred_val))
    print("Precision: ", precision_score(y_val, y_pred_val))
    print("Recall: ", recall_score(y_val, y_pred_val))
    print("F1-score: ", f1_score(y_val, y_pred_val))
    print("\n")

Initial Evaluation with Train-Test Split
Model: Support Vector Machine
Accuracy:  0.8268156424581006
Precision:  0.8115942028985508
Recall:  0.7567567567567568
F1-score:  0.7832167832167832


Model: Random Forest
Accuracy:  0.8379888268156425
Precision:  0.8082191780821918
Recall:  0.7972972972972973
F1-score:  0.8027210884353742


Model: Feedforward Neural Network
Accuracy:  0.8268156424581006
Precision:  0.8412698412698413
Recall:  0.7162162162162162
F1-score:  0.7737226277372263




<a id="Hyperparameter-tuning"></a>

# 5. Tuning Model Hyperparameters
In this step we combine cross-validation with hyperparameter tuning on the training set. We define a parameter grid for each algorithm (Support Vector Machine, Random Forest, and Feedforward Neural Network) and use `GridSearchCV` to evaluate every combination in the grid with 5-fold cross-validation, scored by accuracy. The combination with the highest mean cross-validated accuracy is then refit on the whole training set and exposed as `best_estimator_`.

In [11]:
# Hyperparameter Tuning using Cross-Validation on Training Set
print("Cross-Validation with Hyperparameter Tuning on Training Set")

# Define parameter grids
svm_param_grid = {'C': [0.1, 1, 10, 15], 'kernel': ['linear', 'rbf'], 'gamma': [5, 10, 30, 50]}
rf_param_grid = {'n_estimators': [50, 100, 200, 350], 'max_depth': [5, 10, 20, 35]}
ffnn_param_grid = {'epochs': [50, 100, 200, 350], 'batch_size': [10, 20, 30, 40]}

# Define GridSearchCV
svm_grid_search = GridSearchCV(svm_model, svm_param_grid, scoring='accuracy', cv=5)
rf_grid_search = GridSearchCV(rf_model, rf_param_grid, scoring='accuracy', cv=5)
ffnn_grid_search = GridSearchCV(ffnn_model, ffnn_param_grid, scoring='accuracy', cv=5)

# Fit models
svm_grid_search.fit(X_train, y_train)
rf_grid_search.fit(X_train, y_train)
ffnn_grid_search.fit(X_train, y_train)

# Get best models
best_svm_model = svm_grid_search.best_estimator_
best_rf_model = rf_grid_search.best_estimator_
best_ffnn_model = ffnn_grid_search.best_estimator_

print("Best Parameters:")
print("Support Vector Machine: ", svm_grid_search.best_params_)
print("\n")
print("Random Forest: ", rf_grid_search.best_params_)
print("\n")
print("Feedforward Neural Network: ", ffnn_grid_search.best_params_)

Cross-Validation with Hyperparameter Tuning on Training Set




Best Parameters:
Support Vector Machine:  {'C': 1, 'gamma': 5, 'kernel': 'linear'}
Random Forest:  {'max_depth': 5, 'n_estimators': 100}
Feedforward Neural Network:  {'batch_size': 10, 'epochs': 50}
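
To show what the exhaustive search costs, the sketch below expands the SVM grid from the cell above and counts the model fits that 5-fold cross-validation requires. The helpers `expand_grid` and `kfold_sizes` are hypothetical, the fold sizes come from a plain (non-stratified) split, and the 712 training rows are an assumption derived from the 80% split of the 891 labelled rows.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


def kfold_sizes(n_samples, n_folds):
    """Sizes of the held-out folds when n_samples rows are split into n_folds parts."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


svm_param_grid = {'C': [0.1, 1, 10, 15],
                  'kernel': ['linear', 'rbf'],
                  'gamma': [5, 10, 30, 50]}

combos = expand_grid(svm_param_grid)
n_folds = 5
total_fits = len(combos) * n_folds
fold_sizes = kfold_sizes(712, n_folds)

# `gamma` has no effect on a linear kernel, so those combinations are duplicates
distinct = {tuple(sorted((k, v) for k, v in combo.items()
                         if not (combo['kernel'] == 'linear' and k == 'gamma')))
            for combo in combos}

print(len(combos), total_fits, fold_sizes, len(distinct))
# 32 160 [143, 143, 142, 142, 142] 20
```

Only 20 of the 32 SVM combinations are actually different models, which also explains why the reported best parameters pair a linear kernel with the first `gamma` value in the grid: that value is not used by the linear kernel.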


<a id="Evaluation"></a>

# 6. Final Evaluation of Models
### How we usually Evaluate models
We usually evaluate a model directly by comparing its predictions on unseen data with the true labels.
### What we will do
After tuning the Support Vector Machine, Random Forest, and Feedforward Neural Network on the training set, we assess each tuned model with 5-fold cross-validation on that same training set and average the scores over the folds. Averaging over several folds gives a more stable estimate than a single split. Because the hyperparameters were selected on these same folds, the averaged scores are slightly optimistic; scoring the tuned models on the held-out validation set (`X_val`, `y_val`) gives an independent check.
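
As a hedged alternative to calling `cross_val_score` once per metric, the sketch below uses scikit-learn's `cross_validate` to collect several metrics in a single pass over the folds. It is not the method used in this notebook: the data is synthetic (`make_classification`), and wrapping the scaler and the model in a pipeline is a swapped-in technique that refits the scaler inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (an assumption, for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is refit on the training part of every fold, so no fold
# statistics leak into the part used for scoring
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(pipeline, X, y, cv=5, scoring=scoring)

# cross_validate stores each metric under the key "test_<metric name>"
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}
for name, value in mean_scores.items():
    print(f"{name}: {value:.4f}")
```

Collecting all metrics from one set of fits matters most for the neural network, where each additional cross-validation run means retraining the model five more times.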






In [12]:
# Final evaluation with cross-validation on the training set after hyperparameter tuning
print("Final Evaluation using cross validation")
for model_name, model in [("Support Vector Machine", best_svm_model), ("Random Forest", best_rf_model), ("Feedforward Neural Network", best_ffnn_model)]:
    acc_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    prec_score = cross_val_score(model, X_train, y_train, cv=5, scoring='precision_macro').mean()
    reca_score = cross_val_score(model, X_train, y_train, cv=5, scoring='recall_macro').mean()
    # named `f1` so that the imported `f1_score` function is not overwritten
    f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro').mean()
    print(f"Accuracy score is: {model_name}: {acc_score:.4f}")
    print(f"Precision score is: {model_name}: {prec_score:.4f}")
    print(f"Recall score is: {model_name}: {reca_score:.4f}")
    print(f"F1 score is: {model_name}: {f1:.4f}")
    print("\n")


Final Evaluation on Test Set
Accuracy score is: Support Vector Machine: 0.8300
Precision score is: Support Vector Machine: 0.8229
Recall score is: Support Vector Machine: 0.8111
F1 score is: Support Vector Machine: 0.8157
Accuracy score is: Random Forest: 0.8371
Precision score is: Random Forest: 0.8355
Recall score is: Random Forest: 0.8098
F1 score is: Random Forest: 0.8125
Accuracy score is: Feedforward Neural Network: 0.8328
Precision score is: Feedforward Neural Network: 0.8238
Recall score is: Feedforward Neural Network: 0.7974
F1 score is: Feedforward Neural Network: 0.8092


<a id="Conclusion"></a>

# 7. Conclusion
Scaling the data using `StandardScaler` had a noticeable impact on model performance. Compared with the same workflow run on unscaled data (those baseline results are not reproduced in this notebook), accuracy, precision, recall, and F1-score improved for all models, with the SVM and Random Forest models showing the largest gains. On the scaled data, the tuned models reach 5-fold cross-validated accuracies of 0.8300 (SVM), 0.8371 (Random Forest), and 0.8328 (Feedforward Neural Network). Note that section 4 reports precision, recall, and F1 for the positive class on the validation split, while section 6 reports macro averages over cross-validation folds, so those two sets of numbers are not directly comparable.
Overall, this project demonstrates that standardizing the features of a dataset can lead to better model performance and more reliable predictions, and it underlines the value of preprocessing steps such as scaling in the machine learning workflow.