# **Objective**



1. Use the Titanic dataset to build a model that predicts whether a passenger survived. This is a classic beginner project with readily available data.
2. The dataset contains information about individual passengers, such as age, gender, ticket class, fare, cabin, and whether or not they survived.

# Downloading the dataset

In [None]:
%pip install kaggle



In [None]:
!mkdir -p ~/.kaggle
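The Kaggle CLI reads an API token from `~/.kaggle/kaggle.json`, so the directory created above also needs that file before the download can authenticate. The following is a minimal sketch of writing the token from Python. The helper name `write_kaggle_token` and the placeholder username and key are assumptions for illustration; the real values come from your Kaggle account settings.

```python
import json
import stat
import tempfile
from pathlib import Path


def write_kaggle_token(username: str, key: str, home: Path) -> Path:
    """Write a kaggle.json token under <home>/.kaggle with owner-only permissions."""
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    token_path = kaggle_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Restrict the file to the owner so the token is not readable by other users.
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Demonstration against a temporary directory so no real token is overwritten.
demo_home = Path(tempfile.mkdtemp())
demo_token = write_kaggle_token("your-username", "your-api-key", demo_home)
print(demo_token.name)
```

In a real session, `home` would be `Path.home()` and the two strings would be replaced with your own credentials.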

In [None]:
! kaggle datasets download brendan45774/test-file

Downloading test-file.zip to /content
  0% 0.00/11.2k [00:00<?, ?B/s]
100% 11.2k/11.2k [00:00<00:00, 31.5MB/s]


In [None]:
!unzip /content/test-file.zip -d /content/

Archive:  /content/test-file.zip
  inflating: /content/tested.csv     


# Importing Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

# Reading & Understanding the data

In [None]:
data = pd.read_csv('/content/tested.csv')

In [None]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


# Checking for missing data

In [None]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

# Data Cleanup

## The Age, Fare, and Cabin columns contain null values

### Null values in the Age and Fare columns are filled with the median rather than the mean, because the median is less sensitive to outliers

In [None]:
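columns = ['Age', 'Fare']
for col in columns:
    # Assign the result back; inplace=True on a column selection may not update `data` in newer pandas versions
    data[col] = data[col].fillna(data[col].median())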
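To see why the median is the safer fill value when outliers are present, here is a small sketch with made-up fares (not values from the dataset). A single extreme fare pulls the mean far above the typical value, while the median stays in the middle of the observed fares.

```python
import numpy as np
import pandas as pd

# Toy fares with one extreme value and one missing entry (made-up numbers).
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0, np.nan])

mean_fill = fares.mean()      # pulled upward by the 500.0 outlier
median_fill = fares.median()  # middle of the observed values

# Fill the missing entry with the median, as the notebook does.
filled = fares.fillna(median_fill)
print(mean_fill, median_fill, filled.iloc[-1])
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the five observed fares only.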

### Null values in the Cabin column are filled with the placeholder 'NA'

In [None]:
data['Cabin'] = data['Cabin'].fillna('NA')

### Next, check the number of duplicate rows in the dataset

In [None]:
data.duplicated().sum()

0

# Feature Engineering

In [None]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,NA,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,NA,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,NA,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,NA,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,NA,S


In [None]:
data.Name.head(20)

0                                      Kelly, Mr. James
1                      Wilkes, Mrs. James (Ellen Needs)
2                             Myles, Mr. Thomas Francis
3                                      Wirz, Mr. Albert
4          Hirvonen, Mrs. Alexander (Helga E Lindqvist)
5                            Svensson, Mr. Johan Cervin
6                                  Connolly, Miss. Kate
7                          Caldwell, Mr. Albert Francis
8             Abrahim, Mrs. Joseph (Sophie Halaut Easu)
9                               Davies, Mr. John Samuel
10                                     Ilieff, Mr. Ylio
11                           Jones, Mr. Charles Cresson
12        Snyder, Mrs. John Pillsbury (Nelle Stevenson)
13                                 Howard, Mr. Benjamin
14    Chaffee, Mrs. Herbert Fuller (Carrie Constance...
15        del Carlo, Mrs. Sebastiano (Argenia Genovesi)
16                                    Keane, Mr. Daniel
17                                    Assaf, Mr.

### Creating a new Title feature from the Name column, using the text between the comma and the first period

In [None]:
data['Title'] = data['Name'].str.extract(r',\s(.*?)\.')
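The pattern `,\s(.*?)\.` matches a comma and a whitespace character, then lazily captures everything up to the first period. A minimal sketch with the standard `re` module shows the same pattern applied to a few names from the output above; the helper name `extract_title` is an assumption for illustration.

```python
import re

# Same pattern as the notebook: comma, whitespace, lazy capture, literal period.
TITLE_PATTERN = re.compile(r",\s(.*?)\.")


def extract_title(name):
    """Return the title found in a 'Surname, Title. Given names' string, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


names = [
    "Kelly, Mr. James",
    "Connolly, Miss. Kate",
    "Wilkes, Mrs. James (Ellen Needs)",
]
titles = [extract_title(n) for n in names]
print(titles)
```

The lazy quantifier `.*?` matters here: a greedy `.*` would run to the last period in the string instead of the first one after the title.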

In [None]:
data['Title'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Ms', 'Col', 'Rev', 'Dr', 'Dona'],
      dtype=object)

In [None]:
data['Title'] = data['Title'].replace('Ms', 'Miss')
data['Title'] = data['Title'].replace('Dona', 'Mrs')
data['Title'] = data['Title'].replace(['Col', 'Rev', 'Dr'], 'Gen')

In [None]:
data['Title'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Gen'], dtype=object)

### Creating an Age_Group feature by binning the Age column

In [None]:
bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", 'Elderly']
data['Age_Group'] = pd.cut(data['Age'], bins = bins, labels = labels)

In [None]:
data['Age_Group'] = data['Age_Group'].astype('object')
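By default `pd.cut` builds intervals that are closed on the right, so an age of exactly 17 falls in "Children" and an age of exactly 32 falls in "Young". The sketch below applies the same bins and labels to a few hand-picked boundary ages (made-up values) to make that visible.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 17, 32, 45, 50, np.inf]
labels = ["Children", "Young", "Mid-Aged", "Senior-Adult", "Elderly"]

# Boundary ages chosen to show which side of each edge they land on.
ages = pd.Series([17.0, 17.5, 32.0, 45.0, 50.0, 62.0])
groups = pd.cut(ages, bins=bins, labels=labels).astype("object")

for age, group in zip(ages, groups):
    print(age, group)
```

Passing `right=False` to `pd.cut` would instead close the intervals on the left, which shifts every boundary age into the next group.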

### Dropping unnecessary columns

In [None]:
data.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace = True)

### Combining the SibSp and Parch columns into a new Family column

In [None]:
data['Family'] = data['SibSp'] + data['Parch']

In [None]:
data.drop(['SibSp', 'Parch'], axis = 1, inplace = True)

In [None]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked,Title,Age_Group,Family
0,0,3,male,34.5,7.8292,NA,Q,Mr,Mid-Aged,0
1,1,3,female,47.0,7.0,NA,S,Mrs,Senior-Adult,1
2,0,2,male,62.0,9.6875,NA,Q,Mr,Elderly,0
3,0,3,male,27.0,8.6625,NA,S,Mr,Young,0
4,1,3,female,22.0,12.2875,NA,S,Mrs,Young,2


# Exploratory Data Analysis

In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Survived,418.0,0.363636,0.481622,0.0,0.0,0.0,1.0,1.0
Pclass,418.0,2.26555,0.841838,1.0,1.0,3.0,3.0,3.0
Age,418.0,29.599282,12.70377,0.17,23.0,27.0,35.75,76.0
Fare,418.0,35.576535,55.850103,0.0,7.8958,14.4542,31.471875,512.3292
Family,418.0,0.839713,1.519072,0.0,0.0,0.0,1.0,10.0


In [None]:
data.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
Sex,418,2,male,266
Cabin,418,77,NA,327
Embarked,418,3,S,270
Title,418,5,Mr,240
Age_Group,418,5,Young,257


In [None]:
survival_count = data['Survived'].value_counts()
fig = px.pie(data, names = survival_count.index,  values = survival_count.values,
             title = f'Distribution of Survived', hole=0.2, color_discrete_sequence = px.colors.qualitative.Prism)
fig.update_traces(textinfo='percent+label')
fig.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
pclass_count = data.Pclass.value_counts()
fig = px.pie(data, names= pclass_count.index, values = pclass_count.values, title=f'Distribution of Pclass',
             hole=0.2, color_discrete_sequence= px.colors.qualitative.Prism)
fig.update_traces(textinfo='percent+label')
fig.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
sex_count = data.Sex.value_counts()
fig = px.pie(data, names= sex_count.index, values = sex_count.values, title=f'Distribution of Sex',
             hole=0.2, color_discrete_sequence= px.colors.qualitative.Prism)
fig.update_traces(textinfo='percent+label')
fig.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
fig_age = px.histogram(data, x='Age', nbins=30, histnorm='probability density')
fig_age.update_traces(marker=dict(color='#420152'), selector=dict(type='histogram'))
fig_age.update_layout(title='Distribution of Age', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Age', yaxis_title='Probability Density', bargap=0.02, plot_bgcolor = 'white')
fig_age.show()

In [None]:
fig_fare = px.histogram(data, x='Fare', nbins=30, histnorm='probability density')
fig_fare.update_traces(marker=dict(color='#420152'), selector=dict(type='histogram'))
fig_fare.update_layout(title='Distribution of Fare', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Fare', yaxis_title='Probability Density', bargap=0.02, plot_bgcolor = 'white')
fig_fare.show()

In [None]:
embarked_count = data.Embarked.value_counts()
fig = px.pie(data, names= embarked_count.index, values = embarked_count.values, title=f'Distribution of Embarked',
             hole=0.2, color_discrete_sequence= px.colors.qualitative.Prism)
fig.update_traces(textinfo='percent+label')
fig.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
title_count = data.Title.value_counts()
fig = px.pie(data, names= title_count.index, values = title_count.values, title=f'Distribution of Title',
             hole=0.2, color_discrete_sequence= px.colors.qualitative.Prism)
fig.update_traces(textinfo='percent+label')
fig.update_layout(legend_title_text='Categories:', legend=dict(orientation="h", yanchor="bottom", y=1.02))
fig.show()

In [None]:
fig = px.histogram(data, x = 'Pclass', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.qualitative.Prism)
fig.update_layout(title = 'Survival according to passenger classes', plot_bgcolor = 'white')
fig.show()

In [None]:
fig = px.histogram(data, x = 'Sex', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.qualitative.Prism)
fig.update_layout(title = 'Survival according to gender', plot_bgcolor = 'white')
fig.show()

In [None]:
fig = px.histogram(data, x = 'Age_Group', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.qualitative.Prism)
fig.update_layout(title = 'Survival according to age groups', plot_bgcolor = 'white')
fig.show()

In [None]:
fig = px.histogram(data, x = 'Family', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.qualitative.Prism)
fig.update_layout(title = 'Survival according to number of family members', plot_bgcolor = 'white')
fig.show()

In [None]:
fig = px.histogram(data, x = 'Embarked', barmode = 'group', color = 'Survived', color_discrete_sequence=px.colors.sequential.Viridis)
fig.update_layout(title = 'Survival according to embarked', plot_bgcolor = 'white')
fig.show()

# **Based on the above analysis:**

## 1. Pclass = 1 had the lowest mortality rate, whereas Pclass = 3 had the highest mortality rate.

## 2. Pclass = 3 passengers make up the largest share of the dataset, and a notably high proportion of passengers are male.

## 3. No males survived, while all females survived.

## 4. The Young age group had the highest number of fatalities, whereas Elderly passengers had a relatively better survival rate.

## 5. Passengers with fewer family members were more likely to survive.

## 6. Queenstown passengers had a high survival rate, whereas Southampton recorded the highest number of casualties.
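The findings above mix counts (number of fatalities) with rates (mortality rate), and a large group can have the most fatalities without having the worst rate. A minimal sketch of putting both side by side with `groupby`, using made-up rows rather than the notebook's data:

```python
import pandas as pd

# Toy passengers (made-up values) with a class label and a 0/1 outcome.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate; the count gives the group size.
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survival_rate="mean",
)
print(summary)
```

The same call on the notebook's `data`, grouped by `Sex`, `Age_Group`, `Family`, or `Embarked`, would give the numbers behind each chart.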







# Data Preprocessing

In [None]:
encoder = LabelEncoder()
cols = ['Sex', 'Age_Group', 'Cabin', 'Embarked', 'Title']

for col in cols:
    data[col] = encoder.fit_transform(data[col])
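The loop above refits a single `LabelEncoder` for every column, so after it finishes the encoder only remembers the classes of the last column. If the integer codes need to be mapped back to their labels later, one option is to keep one encoder per column. This is a sketch on made-up rows; the dictionary name `encoders` is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (made-up rows) with two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "S"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted.
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(list(encoders["Sex"].classes_))
print(list(encoders["Sex"].inverse_transform([0, 1])))
```

`LabelEncoder` assigns codes in sorted order of the labels, which is why `female` becomes 0 and `male` becomes 1 here.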

In [None]:
X = data.drop('Survived', axis = 1)
y = data['Survived']

## Synthetic Minority Oversampling Technique (SMOTE)

### SMOTE addresses class imbalance by generating synthetic examples of the minority class until the class distribution is balanced.
### It selects minority-class instances, finds their nearest neighbors, and creates new instances by interpolating between them.

In [None]:
smote = SMOTE(random_state = 42)
X_balanced, y_balanced = smote.fit_resample(X, y)
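The interpolation step described above can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the `imblearn` implementation: the points are made up, and the helper name `synthesize` and the choice of `k=2` neighbours are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def synthesize(points, k, rng):
    """Create one synthetic point by interpolating toward a random near neighbour."""
    base = points[rng.integers(len(points))]
    distances = np.linalg.norm(points - base, axis=1)
    # Position 0 of the sorted order is the base point itself (distance 0), so skip it.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # position along the segment, in [0, 1)
    return base + gap * (neighbour - base)


synthetic = np.array([synthesize(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point is built from existing rows, oversampling after the train/test split, on the training portion only, would keep interpolated copies of test rows out of the training data.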

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size = 0.3, random_state = 42)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
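The cell above fits the scaler on the training split and only applies it to the test split. A small sketch with made-up numbers shows what that means in practice: the mean and standard deviation come from the training rows alone, and the test row is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature: three training rows and one test row.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

The scaled test value lies outside the range of the scaled training values, which is expected: the test row was never used to compute the statistics.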

# Model Building

## 1. Logistic Regression

In [None]:
model1 = LogisticRegression()

model1.fit(X_train_scaled, y_train)

model1_pred = model1.predict(X_test_scaled)

In [None]:
print('The classification report of Logistic Regression : ', '\n\n\n', classification_report(y_test, model1_pred))

The classification report of Logistic Regression :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160



## 2. Random Forest

In [None]:
model2 = RandomForestClassifier(random_state = 42)  # fixed seed so the fit is reproducible

model2.fit(X_train_scaled, y_train)

model2_pred = model2.predict(X_test_scaled)

In [None]:
print('The classification report of Random Forest : ', '\n\n\n', classification_report(y_test, model2_pred))

The classification report of Random Forest :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160



## 3. Gradient Boosting Classifier

In [None]:
model3 = GradientBoostingClassifier(random_state = 42)  # fixed seed so the fit is reproducible

model3.fit(X_train_scaled, y_train)

model3_pred = model3.predict(X_test_scaled)

In [None]:
print('The classification report of Gradient Boosting Classifier : ', '\n\n\n', classification_report(y_test, model3_pred))

The classification report of Gradient Boosting Classifier :  


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        78
           1       1.00      1.00      1.00        82

    accuracy                           1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160
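`cross_val_score` is imported at the top of the notebook but never used. The sketch below shows how a single train/test split might be complemented with k-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption, not the Titanic file) and wraps the scaler and the model in a pipeline so that each fold fits its own scaler.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaling lives inside the pipeline, so it is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a steadier estimate than one split, since it does not depend on which rows happened to land in the test set.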



# Conclusion & Key Findings:

## The analysis of the Titanic dataset produced several noteworthy findings. To handle missing data, null values in the Age and Fare columns were filled with the median because of the presence of outliers, and null values in the Cabin column were labeled "NA". New features (Title, Age_Group, and Family) were engineered to give a richer view of passenger demographics.

## The analysis also showed that Pclass 3 had the highest mortality rate, that no males survived, and that all females survived. Because Sex alone separates survivors from non-survivors in this file, the perfect scores of all three models follow directly from that relationship. Family size appeared to influence survival, and passengers embarking from Queenstown had a higher survival rate than those departing from Southampton.
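A quick way to check whether a single column separates the classes, as Sex does in this dataset, is a contingency table. The sketch below uses `pd.crosstab` on made-up rows; the variable name `perfectly_separated` is an assumption for illustration.

```python
import pandas as pd

# Toy rows (made-up values) where the outcome follows Sex exactly.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1],
})

table = pd.crosstab(toy["Sex"], toy["Survived"])

# The column separates the classes when every row has exactly one non-zero cell.
perfectly_separated = bool(((table > 0).sum(axis=1) == 1).all())

print(table)
print(perfectly_separated)
```

When this check comes back true for a feature, a score of 1.00 says more about the data than about the model.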


---

