🛳️ Titanic Survival Prediction

Problem Statement

The aim of this project is to use machine learning to predict whether a passenger survived the Titanic disaster.
The prediction is made using basic passenger information such as age, gender, passenger class, and family details.

This is a binary classification problem, which means the model predicts one of two possible outcomes:

1 → Passenger Survived

0 → Passenger Did Not Survive

By analyzing the given data, the model learns patterns that help determine a passenger's chances of survival.
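To make the 0/1 framing concrete, here is a minimal sketch of how a probabilistic classifier such as logistic regression turns a score into a class label. The weights, bias, and feature values are made-up numbers for illustration only; they are not learned from the Titanic data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label); label is 1 when probability >= threshold."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(score)
    return prob, int(prob >= threshold)

# Hypothetical weights for the features [is_female, pclass] (illustration only)
weights = [2.5, -0.9]
bias = 0.3

prob_a, label_a = predict_label([1, 1], weights, bias)  # female, 1st class
prob_b, label_b = predict_label([0, 3], weights, bias)  # male, 3rd class
print(f"{prob_a:.3f} -> {label_a}")
print(f"{prob_b:.3f} -> {label_b}")
```

The 0.5 threshold is a convention rather than a requirement; lowering it trades more predicted survivors for more false positives.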

Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

print("Imports successful!!")

Dataset Overview

Dataset Source: Kaggle – Titanic: Machine Learning from Disaster

Total Records: 891 passengers

Target Variable: Survived

Key Features:

Pclass – Passenger class

Sex – Gender

Age – Age of passenger

SibSp – Siblings/Spouses aboard

Parch – Parents/Children aboard

Fare – Ticket fare

Embarked – Port of embarkation

In [None]:
# load the dataset
df = pd.read_csv("D:\\Project\\My Project\\ML Projects\\Titanic Survival Prediction\\Titanic-Dataset.csv")
print("Dataset loaded successfully!!")
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

Handling Missing Values

In [None]:
# drop 'Cabin' if it exists (no error if absent)
df.drop(columns=['Cabin'], errors='ignore', inplace=True)   # inplace=True modifies the existing df directly instead of returning a new copy

In [None]:
# fill missing 'Age' values with the median age of the passenger's 'Sex' and 'Pclass' group
df['Age'] = df.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
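The group-wise median fill can be checked on a tiny example. The sketch below uses made-up rows, not the real dataset, and adds one assumption that the original cell does not include: a fallback to the overall median in case a whole group has no known ages.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with two missing ages
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned back to every row
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing entries with their own group's median
toy["Age"] = toy["Age"].fillna(group_median)

# Fallback (an addition, not in the original cell): a group with no known
# ages has a NaN median, so any remaining gaps get the overall median
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

Here the missing male third-class age becomes 25.0 (the median of 20 and 30) and the missing female first-class age becomes 45.0 (the median of 40 and 50).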

In [None]:
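# fill missing values in 'Embarked' with the mode (the most frequent port)
# assign the result back: recent pandas versions warn against fillna(inplace=True) on a selected column
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.isnull().sum()   # check again for missing values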

Encoding Categorical Columns (Sex and Embarked)


In [None]:
print(df["Sex"].value_counts())
print(df['Embarked'].value_counts())

In [None]:
df.replace({'Sex': {'male': 0, 'female': 1},'Embarked': {'S': 0, 'C': 1, 'Q': 2}}, inplace=True)    # replace categorical columns with numerical values
df.head()
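As a minimal sketch of the same encoding step on made-up rows, the version below uses `Series.map` and then checks that no label was left unmapped. It also shows one-hot encoding for `Embarked` as an alternative, since the integer codes 0, 1, 2 imply an ordering between ports that does not really exist. The toy frame and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

sex_map = {"male": 0, "female": 1}
embarked_map = {"S": 0, "C": 1, "Q": 2}

# assign() returns a new frame and leaves the original untouched
encoded = toy.assign(
    Sex=toy["Sex"].map(sex_map),
    Embarked=toy["Embarked"].map(embarked_map),
)

# map() turns any label missing from the dictionary into NaN, so count those
unmapped = int(encoded[["Sex", "Embarked"]].isna().sum().sum())
print("unmapped values:", unmapped)

# Alternative for Embarked: one indicator column per port, no implied order
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(encoded)
print(one_hot.columns.tolist())
```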

Data Analysis

In [None]:
# get the statistical measures of the data
df.describe()

In [None]:
# finding the number of survivors and non-survivors
df['Survived'].value_counts()

In [None]:
sns.set()
plt.figure(figsize=(8, 6))

In [None]:
# making a count plot for 'Survived' column
sns.countplot(x='Survived', data=df)

In [None]:
df["Sex"].value_counts()    # number of males (0) and females (1) after encoding

In [None]:
# making a count plot for the 'Sex' column
sns.countplot(x='Sex', data=df)

In [None]:
# number of survivors based on gender
sns.countplot(x='Sex', hue='Survived', data=df)

In [None]:
# pclass count plot
sns.countplot(x='Pclass', data=df) # pclass = passenger class

In [None]:
sns.countplot(x='Pclass', hue='Survived', data=df)
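The count plots show raw totals; survival rates per group put the same comparison on a common scale. Because `Survived` is coded 0/1, its mean within a group is the fraction of survivors. The sketch below uses made-up rows rather than the real dataset.

```python
import pandas as pd

# Toy data (made up for illustration); Sex is already encoded as 0 = male, 1 = female
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 3, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column = share of rows equal to 1 in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the real data, the equivalent calls would be `df.groupby('Sex')['Survived'].mean()` and `df.groupby('Pclass')['Survived'].mean()`.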

Separating Features and Target

In [None]:
# separating the features and target
X = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Survived']) # identifier-like columns are not used for prediction; 'Survived' is the target
Y = df['Survived'] # contains the survival status (0 or 1)

In [None]:
print(X)
print(Y)

Splitting the Data into Training Data & Test Data

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2) # 20% data for testing and 80% for training, random_state=2 for reproducibility

In [None]:
# check the x and y shapes
print(X.shape, Y.shape)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # check the shape of the split data 
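The split above is purely random. One optional variation, not used in the original cell, is to pass `stratify` so that the training and test sets keep the same class ratio as the full data. The sketch below demonstrates this on synthetic labels, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 60 negatives and 40 positives (illustration only)
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 60/40 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

On the notebook's variables, the equivalent call would add `stratify=Y` to the existing `train_test_split` arguments.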

Model Training: Logistic Regression

In [None]:
model = LogisticRegression()    # create an instance of Logistic Regression model
model.fit(x_train, y_train)    # training the Logistic Regression model with training data
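With default settings, `LogisticRegression()` may emit a `ConvergenceWarning` when features sit on very different scales, for example `Fare` next to `Pclass`. One common remedy, shown here as a sketch on synthetic data rather than as the notebook's method, is to standardize the features in a pipeline and raise `max_iter`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 Titanic features (illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)
# Put one feature on a much larger scale to mimic an unscaled column
X_demo[:, 0] *= 100

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit()
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

train_acc = clf.score(X_demo, y_demo)
print("training accuracy:", train_acc)
```

Keeping the scaler inside the pipeline means that calling `clf.fit(x_train, y_train)` would learn the scaling from the training rows only, which avoids leaking test-set statistics into the model.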

In [None]:
# checking the accuracy on training data
x_train_prediction = model.predict(x_train)
print(x_train_prediction)

In [None]:
training_data_accuracy = accuracy_score(y_train, x_train_prediction) # calculating the accuracy on training data
print("Accuracy on training data : ", training_data_accuracy)

In [None]:
# accuracy on test data
x_test_prediction = model.predict(x_test)
print(x_test_prediction)

In [None]:
test_data_accuracy = accuracy_score(y_test, x_test_prediction) # calculating the accuracy on test data
print("Accuracy on test data : ", test_data_accuracy)

In [None]:
# use the predictions already computed in x_test_prediction
print(classification_report(y_test, x_test_prediction))
print("Confusion Matrix:\n", confusion_matrix(y_test, x_test_prediction))
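To read the report and the matrix together, it helps to derive the headline metrics by hand. scikit-learn lays out a binary confusion matrix with actual classes in rows and predicted classes in columns, so it reads `[[TN, FP], [FN, TP]]`. The counts below are hypothetical, not the notebook's output.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[50, 10],
      [20, 20]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # of predicted survivors, how many survived
recall = tp / (tp + fn)               # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

These are the same quantities that `classification_report` lists for class 1, which makes the hand calculation a quick sanity check on the printed report.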


Final Result

Best Model: Gradient Boosting Classifier

Final Accuracy: ~84%

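Key Insight: Feature engineering and ensemble models can improve performance over a baseline model such as the Logistic Regression trained above. The cells in this notebook show only that baseline; the Gradient Boosting run behind the ~84% figure is not included here.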
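The result above names a Gradient Boosting Classifier, which the cells shown here do not train. As a minimal sketch of how such a comparison could be run, the code below scores a logistic regression baseline and a gradient boosting model with 5-fold cross-validation. It uses synthetic data, so the numbers it prints say nothing about the Titanic result; the dictionary names and settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustration only, not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 folds for each candidate model
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Comparing models on cross-validated scores rather than a single test split reduces the chance that one lucky split decides which model looks best.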

Conclusion

This project demonstrates an end-to-end machine learning pipeline, including data preprocessing, exploratory analysis, model training, and evaluation.
The same workflow applies to other binary classification tasks on tabular data.