In [7]:
# Load the dataset
import pandas as pd

train_data = pd.read_csv('titanic/train.csv')
test_data = pd.read_csv('titanic/test.csv')

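# Get basic info and check for missing values
train_data.info()
# Note: in a notebook cell only the last expression is displayed,
# so the describe() summary below is computed but not shown
train_data.describe()
train_data.isnull().sum()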


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [8]:
import pandas as pd
from pycaret.classification import setup, compare_models, pull

# PyCaret requires the target column to be defined, here it's 'Survived'
# Drop irrelevant columns (like 'PassengerId', 'Name', etc.) for simplicity
titanic_data = train_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
#titanic_data = train_data.copy()

# Handle missing values (simple imputation for demonstration)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

# Convert categorical columns to strings for PyCaret
categorical_cols = ['Sex', 'Embarked']
for col in categorical_cols:
    titanic_data[col] = titanic_data[col].astype(str)

# PyCaret setup
clf_setup = setup(data=titanic_data, target='Survived', session_id=42, verbose=False)

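# Compare models with cross-validation; with n_select > 1 this returns
# a list of trained models (best first), not a single model
best_model = compare_models(n_select=16)  # Request up to the top 16 models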

# Display the comparison results
model_comparison = pull()  # Pull comparison DataFrame
print(model_comparison)

# Optionally save the results for analysis
model_comparison.to_csv('titanic_model_comparison.csv', index=False)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8219,0.8623,0.7074,0.8122,0.7526,0.6149,0.6217,0.046
lightgbm,Light Gradient Boosting Machine,0.8218,0.8649,0.737,0.789,0.7584,0.6181,0.6223,0.218
rf,Random Forest Classifier,0.8108,0.848,0.7286,0.7739,0.7472,0.5966,0.6005,0.068
ada,Ada Boost Classifier,0.8074,0.8326,0.733,0.7647,0.743,0.5901,0.596,0.041
lr,Logistic Regression,0.7978,0.8528,0.7076,0.7565,0.7275,0.5677,0.5719,0.033
et,Extra Trees Classifier,0.7946,0.8298,0.7036,0.7552,0.7243,0.5615,0.5662,0.056
nb,Naive Bayes,0.7882,0.8198,0.6911,0.7448,0.7134,0.5464,0.5507,0.026
ridge,Ridge Classifier,0.7882,0.8536,0.6824,0.7467,0.7106,0.5447,0.5482,0.024
lda,Linear Discriminant Analysis,0.7882,0.8536,0.6824,0.7467,0.7106,0.5447,0.5482,0.025
dt,Decision Tree Classifier,0.7803,0.7657,0.7203,0.7104,0.7131,0.5354,0.5377,0.024


                                    Model  Accuracy     AUC  Recall   Prec.  \
gbc          Gradient Boosting Classifier    0.8219  0.8623  0.7074  0.8122   
lightgbm  Light Gradient Boosting Machine    0.8218  0.8649  0.7370  0.7890   
rf               Random Forest Classifier    0.8108  0.8480  0.7286  0.7739   
ada                  Ada Boost Classifier    0.8074  0.8326  0.7330  0.7647   
lr                    Logistic Regression    0.7978  0.8528  0.7076  0.7565   
et                 Extra Trees Classifier    0.7946  0.8298  0.7036  0.7552   
nb                            Naive Bayes    0.7882  0.8198  0.6911  0.7448   
ridge                    Ridge Classifier    0.7882  0.8536  0.6824  0.7467   
lda          Linear Discriminant Analysis    0.7882  0.8536  0.6824  0.7467   
dt               Decision Tree Classifier    0.7803  0.7657  0.7203  0.7104   
qda       Quadratic Discriminant Analysis    0.7481  0.8150  0.7020  0.6772   
knn                K Neighbors Classifier    0.7095 

# Titanic Model Comparison Results

This summary evaluates the machine learning models that PyCaret's `compare_models` trained on the Titanic dataset, with `Survived` as the target. Each row represents one model, and the columns report its cross-validated performance metrics and training time.

## Column Descriptions

| Column Name | Description |
| --- | --- |
| **Model** | The name of the machine learning algorithm used. |
| **Accuracy** | The proportion of correctly classified instances out of the total. |
| **AUC** | Area Under the Receiver Operating Characteristic (ROC) Curve. This measures the model's ability to distinguish between classes (higher is better). |
| **Recall** | The proportion of actual positives (here, passengers who survived) that the model correctly identifies. Also called sensitivity. |
| **Prec.** | Precision: The proportion of true positives out of all predicted positives. A higher precision means fewer false positives. |
| **F1** | The harmonic mean of Precision and Recall, balancing both metrics. |
| **Kappa** | Cohen's Kappa: Measures the agreement between predicted and actual classes, adjusted for chance. |
| **MCC** | Matthews Correlation Coefficient: A balanced measure considering all confusion matrix categories (true/false positives/negatives). |
| **TT (Sec)** | Training Time in seconds: The time taken to train the model. |
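
To make the definitions above concrete, the following minimal sketch computes each threshold-based metric from the four confusion-matrix counts. The counts (3 true positives, 1 false negative, 1 false positive, 5 true negatives) and the helper name `classification_metrics` are made up for illustration; they do not come from the Titanic run. AUC is left out because it needs predicted scores rather than counts.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Hypothetical helper: metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or prediction group).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient uses all four cells of the matrix
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Prec.": precision,
        "F1": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }

# Illustrative counts only
metrics = classification_metrics(tp=3, fn=1, fp=1, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case accuracy is 0.8 while Kappa and MCC are both about 0.58, which shows how the chance-corrected metrics sit below raw accuracy.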

## Top 5 Models

The table below highlights the performance of the top 5 models ranked by **Accuracy**:

| Model | Accuracy | AUC | Recall | Precision | F1 | Kappa | MCC | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gradient Boosting Classifier | 0.8219 | 0.8623 | 0.7074 | 0.8122 | 0.7526 | 0.6149 | 0.6217 | 0.046 |
| Light Gradient Boosting Machine | 0.8218 | 0.8649 | 0.7370 | 0.7890 | 0.7584 | 0.6181 | 0.6223 | 0.218 |
| Random Forest Classifier | 0.8108 | 0.8480 | 0.7286 | 0.7739 | 0.7472 | 0.5966 | 0.6005 | 0.068 |
| Ada Boost Classifier | 0.8074 | 0.8326 | 0.7330 | 0.7647 | 0.7430 | 0.5901 | 0.5960 | 0.041 |
| Logistic Regression | 0.7978 | 0.8528 | 0.7076 | 0.7565 | 0.7275 | 0.5677 | 0.5719 | 0.033 |

___

## Key Observations

1.  **Gradient Boosting Classifier** has the highest accuracy (82.19%) and precision (0.8122), with a competitive AUC (0.8623), making it a strong choice overall.
2.  **Light Gradient Boosting Machine** is effectively tied on accuracy (82.18%) and leads on AUC (0.8649), Recall (73.70%), F1, Kappa, and MCC.
3.  **Logistic Regression** is the simplest of the top 5 models, yet it reaches an accuracy of 79.78% and an AUC of 0.8528, higher than the AUC of Random Forest (0.8480) and Ada Boost (0.8326).
4.  Higher **MCC** and **Kappa** values generally indicate better overall predictive performance; both metrics rank the top 5 models in the same order, with Light GBM marginally ahead of Gradient Boosting.
5.  **Training Time (TT)** varies among models: in this run Gradient Boosting (0.046 s) trained nearly five times faster than Light GBM (0.218 s).
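
The observations above depend on which metric is used for ranking. As a small sketch, the snippet below re-ranks the top 5 rows by each metric, using values copied from the table. The `top5` dictionary and the `best_by` helper are illustrative names and are not part of PyCaret.

```python
# Values copied from the "Top 5 Models" table
top5 = {
    "gbc":      {"Accuracy": 0.8219, "AUC": 0.8623, "Recall": 0.7074, "Prec.": 0.8122,
                 "F1": 0.7526, "Kappa": 0.6149, "MCC": 0.6217, "TT (Sec)": 0.046},
    "lightgbm": {"Accuracy": 0.8218, "AUC": 0.8649, "Recall": 0.7370, "Prec.": 0.7890,
                 "F1": 0.7584, "Kappa": 0.6181, "MCC": 0.6223, "TT (Sec)": 0.218},
    "rf":       {"Accuracy": 0.8108, "AUC": 0.8480, "Recall": 0.7286, "Prec.": 0.7739,
                 "F1": 0.7472, "Kappa": 0.5966, "MCC": 0.6005, "TT (Sec)": 0.068},
    "ada":      {"Accuracy": 0.8074, "AUC": 0.8326, "Recall": 0.7330, "Prec.": 0.7647,
                 "F1": 0.7430, "Kappa": 0.5901, "MCC": 0.5960, "TT (Sec)": 0.041},
    "lr":       {"Accuracy": 0.7978, "AUC": 0.8528, "Recall": 0.7076, "Prec.": 0.7565,
                 "F1": 0.7275, "Kappa": 0.5677, "MCC": 0.5719, "TT (Sec)": 0.033},
}

def best_by(metric, lower_is_better=False):
    """Return the model ID with the best value for the given metric."""
    pick = min if lower_is_better else max
    return pick(top5, key=lambda model: top5[model][metric])

# Higher is better for every score column
for metric in ["Accuracy", "AUC", "Recall", "Prec.", "F1", "Kappa", "MCC"]:
    print(f"Best by {metric}: {best_by(metric)}")

# Lower is better for training time
print(f"Fastest to train: {best_by('TT (Sec)', lower_is_better=True)}")

# Ratio of Light GBM to Gradient Boosting training time
speedup = top5["lightgbm"]["TT (Sec)"] / top5["gbc"]["TT (Sec)"]
print(f"Light GBM took {speedup:.1f}x as long to train as Gradient Boosting")
```

Which model counts as "best" therefore depends on whether accuracy, ranking quality (AUC), sensitivity to survivors (Recall), or training cost matters most for the task.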