# Titanic Survival Prediction

## Introduction
This notebook demonstrates how to use several machine learning algorithms to predict survival on the Titanic from passenger features. The data is the Titanic dataset described on Kaggle, loaded here through seaborn's `load_dataset('titanic')`. The notebook covers:

- **Data Loading and Preprocessing**: Loading the dataset, handling missing values, and encoding categorical features.
- **Model Definition**: Defining a range of classification models from `scikit-learn`.
- **Hyperparameter Tuning**: Using `GridSearchCV` to optimize hyperparameters for each model.
- **Model Evaluation**: Assessing the performance of each model based on accuracy and other relevant metrics.
- **Best Model Identification**: Determining the best model based on its performance on the test set.

## About Dataset

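The [Titanic Dataset](https://www.kaggle.com/c/titanic) contains information about passengers on the Titanic and their survival status. The dataset includes variables such as the following (the seaborn copy loaded in this notebook uses lowercase column names, e.g. `pclass`):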

- **Pclass**: The passenger's class (1st, 2nd, or 3rd), which can be a proxy for socio-economic status.
- **Sex**: The gender of the passenger (male or female).
- **Age**: The age of the passenger, which can impact survival chances.
- **SibSp**: The number of siblings or spouses aboard the Titanic.
- **Parch**: The number of parents or children aboard the Titanic.
- **Fare**: The fare paid for the ticket, which can be indicative of the passenger's class and wealth.
- **Embarked**: The port where the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton).

## Objective

The objective of this notebook is to build and evaluate machine learning models to predict whether a passenger survived or not based on the provided features. We will train and test various classification algorithms and evaluate their performance in terms of accuracy and other relevant metrics. The aim is to identify the best-performing model that can accurately predict survival and provide insights into which factors are most influential in determining a passenger's survival.

## Case Study Outline

1. **Exploratory Data Analysis (EDA)**
   - Data profiling
   - Visualization of key features

2. **Data Preprocessing**
   - Handling missing values
   - Encoding categorical variables

3. **Model Building, Hyperparameter Tuning & Evaluation**

   i. Model Building
      - Random Forest
      - Support Vector Machine (SVM)
      - K-Nearest Neighbors (KNN)
      - Logistic Regression
      - Decision Tree
      - Gradient Boosting
      - AdaBoost
      - Gaussian Naive Bayes (GaussianNB)
      - Multi-layer Perceptron (MLPClassifier)

   ii. Hyperparameter Tuning
      - Grid Search
      - Random Search

   iii. Model Evaluation
      - Train-test split
      - Cross-validation
      - Performance metrics (Accuracy, Precision, Recall, F1 Score)

Let's begin by loading the dataset and exploring its contents.


## Exploratory Data Analysis (EDA)
In this section, we will perform exploratory data analysis (EDA) to understand the structure and characteristics of the dataset. We will examine the features, their data types, and the distribution of values to gain insights into the data. This will help us identify any missing values, outliers, or patterns that may be relevant for building our predictive models.
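
Before generating a full profiling report, a few pandas calls already cover the checks described above. The following is a minimal sketch on a toy frame; the values are hypothetical stand-ins, not real passenger records, and only the column names mirror the seaborn dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not real passengers)
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 2, 1, 3],
    "sex": ["male", "female", "female", "male", "female", "male"],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "embarked": ["S", "C", "S", None, "S", "Q"],
})

# Data types of each column
dtypes = toy.dtypes

# Number of missing values per column
missing = toy.isna().sum()

# Mean of the 0/1 target within each group, i.e. the survival rate by sex
survival_by_sex = toy.groupby("sex")["survived"].mean()

print(dtypes)
print(missing)
print(survival_by_sex)
```

The same three calls can be applied to `data` once it is loaded; the grouped mean is a quick way to see whether a categorical feature separates the two classes.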

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
# pandas_profiling has been renamed ydata_profiling (pip install ydata-profiling)
from ydata_profiling import ProfileReport
import warnings

# suppress warnings
warnings.filterwarnings('ignore')

# load data
data = sns.load_dataset('titanic')


In [2]:
# Generate the profile report
profile = ProfileReport(data, title='Titanic Survival Prediction Report', explorative=True)

# Display the profiling report
profile.to_notebook_iframe()

## Data Preprocessing
In this section, we will preprocess the data by handling missing values and encoding categorical variables. This step is essential for preparing the dataset for training machine learning models. We will use appropriate techniques to address missing values and convert categorical variables into a suitable format for model training.

In [3]:
# Data preprocessing
# Fill missing values in 'age' with the median.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
data['age'] = data['age'].fillna(data['age'].median())

# Fill missing values in 'embarked' with the mode
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])

# Convert 'deck' to categorical and add an 'Unknown' category for missing decks
data['deck'] = data['deck'].astype('category')
data['deck'] = data['deck'].cat.add_categories('Unknown')
data['deck'] = data['deck'].fillna('Unknown')

# Encode categorical features
data['sex'] = data['sex'].map({'male': 0, 'female': 1})
data['embarked'] = data['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
data['deck'] = data['deck'].cat.codes  # Convert 'deck' to numeric codes
# 'alive' duplicates the target 'survived', so it is encoded here but never used as a feature
data['alive'] = data['alive'].map({'yes': 1, 'no': 0})
data['alone'] = data['alone'].astype(int)  # Convert 'alone' to int

# Define features and target
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'deck', 'alone']
X = data[features]
y = data['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
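
The imputation and encoding steps above can also be written as a scikit-learn `ColumnTransformer`, which learns the fill values from whatever data it is fitted on (for example, the training split only). This is a sketch of an alternative, not the approach used in this notebook: it one-hot encodes the categorical columns instead of mapping them to integers, and the toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the same kinds of gaps as the Titanic data (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "embarked": ["S", "C", "S", np.nan, "S", "Q"],
})

numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the column median
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X_toy = preprocess.fit_transform(toy)

# Median learned for 'age' from the non-missing values 22, 26, 35, 54
age_median = preprocess.named_transformers_["num"].statistics_[0]
print(X_toy.shape, age_median)
```

One-hot encoding avoids implying an order between ports (C < Q < S), which matters mostly for distance-based and linear models; tree-based models are usually less sensitive to the choice.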


## Model Building and Hyperparameter Tuning to Find the Best Model
In this section, we will define a range of classification models and tune their hyperparameters using `GridSearchCV`. For each model, the grid search selects hyperparameters by 5-fold cross-validation on the training data, and the tuned model is then scored on the held-out test set. The goal is to identify the best-performing model based on test accuracy, with precision, recall, and F1-score reported alongside.
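
To make the mechanics concrete, here is a minimal sketch of a single grid search on synthetic data from `make_classification` (not the Titanic data). It also wraps the SVM in a pipeline with `StandardScaler`, which is an assumption of this sketch and not something the notebook does; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the real features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold
pipe = make_pipeline(StandardScaler(), SVC())

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_          # mean CV accuracy of the best setting
test_accuracy = search.score(X_te, y_te)  # accuracy of the refitted best model

print(search.best_params_)
print(cv_accuracy, test_accuracy)
```

`best_score_` comes from the training folds only, so it can be used to compare models while the test set is kept for a final check.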

In [4]:
# Define models and their hyperparameters
models = {
    'RandomForest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=10000),
    'DecisionTree': DecisionTreeClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'GaussianNB': GaussianNB(),
    'MLPClassifier': MLPClassifier(max_iter=10000)
}

param_grids = {
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    },
    'KNN': {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance']
    },
    'LogisticRegression': {
        'C': [0.1, 1, 10],
        'solver': ['liblinear', 'lbfgs']
    },
    'DecisionTree': {
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 10, 20]
    },
    'GradientBoosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1, 1],
        'max_depth': [3, 5, 7]
    },
    'AdaBoost': {
        'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.1, 1]
    },
    'GaussianNB': {},  # Empty grid: GaussianNB is fit with its defaults (var_smoothing is left untuned)
    'MLPClassifier': {
        'hidden_layer_sizes': [(100,), (50, 50)],
        'activation': ['tanh', 'relu'],
        'solver': ['adam', 'lbfgs'],
        'alpha': [0.0001, 0.05]
    }
}

# Function to perform model tuning and evaluation
def tune_and_evaluate(models, param_grids, X_train, y_train, X_test, y_test):
    best_model = None
    best_score = 0
    best_model_name = ""
    model_scores = []
    for model_name, model in models.items():
        print(f"Tuning {model_name}...")
        grid_search = GridSearchCV(model, param_grids[model_name], cv=5, scoring='accuracy')
        grid_search.fit(X_train, y_train)
        
        best_estimator = grid_search.best_estimator_
        predictions = best_estimator.predict(X_test)
        score = accuracy_score(y_test, predictions)
        
        print(f"Best parameters for {model_name}: {grid_search.best_params_}")
        print(f"Accuracy for {model_name}: {score}")
        print(classification_report(y_test, predictions))
        
        model_scores.append((model_name, score))
        
        if score > best_score:
            best_score = score
            best_model = best_estimator
            best_model_name = model_name
            
    print(f"\nBest Model: {best_model_name} with Accuracy: {best_score}")
    
    # Plotting model accuracies (sorted from highest to lowest)
    model_scores = sorted(model_scores, key=lambda x: x[1], reverse=True)
    # Use new names so the `models` argument is not shadowed
    model_names, scores = zip(*model_scores)
    plt.figure(figsize=(12, 6))
    bars = plt.bar(model_names, scores, color=sns.color_palette("husl", len(model_names)))
    
    # Add accuracy numbers on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2.0, height + 0.02, f'{height:.2f}', 
                 ha='center', va='bottom', fontsize=10)
    
    plt.xlabel('Model')
    plt.ylabel('Accuracy')
    plt.title('Model Accuracy Comparison')
    plt.xticks(rotation=45)
    plt.show()
    
    return best_model_name, best_model, best_score
    
# Run the tuning and evaluation
best_model_name, best_model, best_score = tune_and_evaluate(models, param_grids, X_train, y_train, X_test, y_test)

Tuning RandomForest...
Best parameters for RandomForest: {'max_depth': 10, 'n_estimators': 200}
Accuracy for RandomForest: 0.8044692737430168
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       105
           1       0.79      0.72      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.80       179
weighted avg       0.80      0.80      0.80       179

Tuning SVM...
Best parameters for SVM: {'C': 0.1, 'kernel': 'linear'}
Accuracy for SVM: 0.7877094972067039
              precision    recall  f1-score   support

           0       0.80      0.85      0.82       105
           1       0.76      0.70      0.73        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179

Tuning KNN...
Best parameters for KNN: {'n_neighbors': 3, 'weights': 'distance'}
Accuracy for KNN

## Conclusion

In this analysis, we evaluated and tuned several classification models to determine the best performer for our dataset. Each model's performance was assessed using metrics such as accuracy, precision, recall, and F1-score. The following models were tuned and evaluated:

### Model Performance Summary

- **Random Forest**
  - **Best Parameters**: `{'max_depth': 10, 'n_estimators': 200}`
  - **Accuracy**: 0.804
  - **Precision**: 0.81 (Class 0), 0.79 (Class 1)
  - **Recall**: 0.87 (Class 0), 0.72 (Class 1)
  - **F1-Score**: 0.84 (Class 0), 0.75 (Class 1)

- **Support Vector Machine (SVM)**
  - **Best Parameters**: `{'C': 0.1, 'kernel': 'linear'}`
  - **Accuracy**: 0.788
  - **Precision**: 0.80 (Class 0), 0.76 (Class 1)
  - **Recall**: 0.85 (Class 0), 0.70 (Class 1)
  - **F1-Score**: 0.82 (Class 0), 0.73 (Class 1)

- **K-Nearest Neighbors (KNN)**
  - **Best Parameters**: `{'n_neighbors': 3, 'weights': 'distance'}`
  - **Accuracy**: 0.715
  - **Precision**: 0.72 (Class 0), 0.69 (Class 1)
  - **Recall**: 0.83 (Class 0), 0.55 (Class 1)
  - **F1-Score**: 0.77 (Class 0), 0.62 (Class 1)

- **Logistic Regression**
  - **Best Parameters**: `{'C': 1, 'solver': 'liblinear'}`
  - **Accuracy**: 0.799
  - **Precision**: 0.81 (Class 0), 0.78 (Class 1)
  - **Recall**: 0.86 (Class 0), 0.72 (Class 1)
  - **F1-Score**: 0.83 (Class 0), 0.75 (Class 1)

- **Decision Tree**
  - **Best Parameters**: `{'max_depth': 20, 'min_samples_split': 20}`
  - **Accuracy**: 0.810
  - **Precision**: 0.82 (Class 0), 0.79 (Class 1)
  - **Recall**: 0.87 (Class 0), 0.73 (Class 1)
  - **F1-Score**: 0.84 (Class 0), 0.76 (Class 1)

- **Gradient Boosting**
  - **Best Parameters**: `{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}`
  - **Accuracy**: 0.816
  - **Precision**: 0.83 (Class 0), 0.80 (Class 1)
  - **Recall**: 0.87 (Class 0), 0.74 (Class 1)
  - **F1-Score**: 0.85 (Class 0), 0.77 (Class 1)

- **AdaBoost**
  - **Best Parameters**: `{'learning_rate': 1, 'n_estimators': 50}`
  - **Accuracy**: 0.799
  - **Precision**: 0.83 (Class 0), 0.76 (Class 1)
  - **Recall**: 0.83 (Class 0), 0.76 (Class 1)
  - **F1-Score**: 0.83 (Class 0), 0.76 (Class 1)

- **Gaussian Naive Bayes**
  - **Best Parameters**: `{}`
  - **Accuracy**: 0.760
  - **Precision**: 0.80 (Class 0), 0.71 (Class 1)
  - **Recall**: 0.79 (Class 0), 0.72 (Class 1)
  - **F1-Score**: 0.79 (Class 0), 0.71 (Class 1)

- **MLP Classifier**
  - **Best Parameters**: `{'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'solver': 'adam'}`
  - **Accuracy**: 0.788
  - **Precision**: 0.81 (Class 0), 0.76 (Class 1)
  - **Recall**: 0.84 (Class 0), 0.72 (Class 1)
  - **F1-Score**: 0.82 (Class 0), 0.74 (Class 1)
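
The per-class figures listed above all derive from a confusion matrix. As a worked example, the counts below are reconstructed from the Random Forest report (class supports of 105 and 74, class 0 recall of 0.87, accuracy of 0.804); they are inferred from those rounded figures and are not printed by the notebook itself.

```python
# Confusion-matrix counts inferred from the Random Forest report (an assumption,
# reconstructed from the reported support, recall, and accuracy)
tn, fp = 91, 14   # actual class 0 (did not survive): 105 passengers
fn, tp = 21, 53   # actual class 1 (survived): 74 passengers

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 metrics: "positive" means survived
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 metrics: the same formulas with the roles of the classes swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.3f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The lower class 1 recall means that, under these counts, 21 of the 74 survivors in the test set were predicted as not surviving.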

### Best Model

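The **Gradient Boosting** model had the highest test accuracy, **0.816**, narrowly ahead of Decision Tree (0.810) and Random Forest (0.804). It also had the highest F1-scores for both classes (0.85 and 0.77), although AdaBoost achieved slightly higher recall on class 1 (0.76 versus 0.74). With 179 passengers in the test set, each gap among the top three models amounts to roughly one passenger, so the ranking should be read as tentative. Because the winner was chosen by test-set accuracy, 0.816 is also likely a slightly optimistic estimate of performance on new data.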
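
A rough way to gauge how much weight the accuracy differences can bear is a normal-approximation confidence interval for each test accuracy. In the sketch below, the counts of correct predictions are an assumption obtained by multiplying each reported accuracy by the 179 test passengers and rounding; the dictionary and variable names are illustrative.

```python
import math

n_test = 179  # size of the test set, from the classification reports

# Correct predictions implied by the reported accuracies (assumed, via rounding)
correct = {"GradientBoosting": 146, "DecisionTree": 145, "RandomForest": 144}

intervals = {}
for name, k in correct.items():
    p = k / n_test
    # Standard error of a proportion under the binomial model
    se = math.sqrt(p * (1 - p) / n_test)
    # Approximate 95% interval (normal approximation)
    intervals[name] = (p - 1.96 * se, p + 1.96 * se)
    low, high = intervals[name]
    print(f"{name}: accuracy={p:.3f}, approx. 95% CI=({low:.3f}, {high:.3f})")
```

The intervals overlap almost entirely, which suggests that a single 179-passenger split cannot reliably separate the top three models. The models were scored on the same passengers, so a paired comparison or repeated cross-validation would give a sharper answer than these independent intervals.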

Future work could involve further hyperparameter tuning, feature engineering, and exploring additional models to enhance the predictive performance and robustness of the analysis.

## Acknowledgments

I would like to express my gratitude to the following:

- **Kaggle**: For providing the dataset which made this analysis possible.
- **Scikit-Learn Documentation**: For valuable tools and resources.
- **Open Source Libraries**: Such as scikit-learn, matplotlib, and seaborn for their valuable contributions.

Thank you for your support and contributions.

## Author

- [Ahmad Bin Sadiq](https://www.linkedin.com/in/ahmadbinsadiq/)
- **Email:** ahmadbinsadiq@gmail.com

---