<a href="https://colab.research.google.com/github/funkypro/Titanic-Survival-Predictor/blob/main/Titanic_Survival_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Problem:
In 1912, the RMS Titanic sank. While there was an element of luck, some groups of people were more likely to survive than others. Our goal is to build a predictive model that answers the question: “What sorts of people were more likely to survive?”

In [14]:
import pandas as pd
import seaborn as sns

In [15]:
# Load the dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

print(df.head())
# print(df.info())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


# **Data Cleaning**

In [16]:
# Count the missing values in each column
df.isnull().sum()

# Age has 177 nulls, Cabin has 687, and Embarked has 2

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
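Raw null counts are easier to judge as a share of the rows. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression, `df.isnull().mean()`, should work on the real frame.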


In [17]:
# Drop the Cabin column: it is mostly missing (687 nulls) and is not used here.
# Fill missing Age values with the median age.
# Fill missing Embarked values with the most frequent value (the mode).

df.drop('Cabin', axis=1, inplace=True)

df['Age'] = df['Age'].fillna(df['Age'].median())

# mode() returns a Series, so take its first element to get a scalar fill value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
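`mode()` returns a Series rather than a single value, and `fillna` aligns a Series on index labels. A minimal sketch on a hypothetical toy column (not the real data) shows the difference between passing the Series and passing its first element:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: the missing values sit at index 2 and 3
embarked = pd.Series(["S", "C", np.nan, np.nan, "S"])

# mode() returns a Series; its entry at label 0 is the most frequent value
mode_series = embarked.mode()

# Filling with a Series aligns on index labels, so only a gap at label 0 could be filled
filled_with_series = embarked.fillna(mode_series)

# Filling with the scalar fills every gap
filled_with_scalar = embarked.fillna(mode_series[0])

print(filled_with_series.isnull().sum())  # gaps remaining after the Series fill
print(filled_with_scalar.isnull().sum())  # gaps remaining after the scalar fill
```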



In [18]:
# Map 'male' to 0 and 'female' to 1
df['Sex'] = df['Sex'].map({'male':0, 'female':1})

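# One-hot encode 'Embarked' as dummy variables; drop_first=True drops the first category (C), which becomes the baseline
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Drop identifier and free-text columns that the model cannot use directly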
df.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)
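To see what `drop_first=True` does, here is a small sketch on a hypothetical four-row frame. The dropped category is represented by all remaining dummy columns being `False`:

```python
import pandas as pd

# Hypothetical toy frame with the three port codes
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are sorted (C, Q, S); drop_first removes the first one (C)
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids a redundant column, since the third port can always be inferred from the other two.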

In [15]:
df.head()

   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
0         0       3    0  22.0      1      0   7.2500       False        True
1         1       1    1  38.0      1      0  71.2833       False       False
2         1       3    1  26.0      0      0   7.9250       False        True
3         1       1    1  35.0      1      0  53.1000       False        True
4         0       3    0  35.0      0      0   8.0500       False        True


# **Training the Model**

We use a train-test split: 20% of the data (the "test set") is hidden from the model during training and used as a final exam afterwards.

In [19]:
# Splitting the Data
from sklearn.model_selection import train_test_split

# X = features, y = target
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
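The sketch below shows how the split sizes work out and how the optional `stratify` argument keeps the class balance similar in both parts. It uses random stand-in data; the row count of 891 and the class balance are assumptions made for illustration, and `stratify` is a variation that the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 891 rows (assumed), arbitrary class balance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(891, 3))
y_toy = (rng.random(891) < 0.4).astype(int)

# stratify=y_toy keeps the share of each class roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))      # number of rows in each part
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```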

In [9]:
X_test.head()

     Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_Q  Embarked_S
709       3    0  28.0      1      1  15.2458       False       False
439       2    0  31.0      0      0  10.5000       False        True
840       3    0  20.0      0      0   7.9250       False        True
720       2    1   6.0      0      1  33.0000       False        True
39        3    1  14.0      1      0  11.2417       False       False


In [20]:
# Training the Decision Tree

from sklearn.tree import DecisionTreeClassifier

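# Initialize the model; max_depth=3 keeps the tree shallow
model = DecisionTreeClassifier(max_depth=3)

# Fit on the training set, then predict on the held-out test set
model.fit(X_train, y_train)
prediction = model.predict(X_test)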
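A shallow tree can be read as a handful of if/else rules. As a sketch, scikit-learn's `export_text` prints those rules. The toy data below is hypothetical and built so that the label depends only on `Sex`; the names `toy_X`, `toy_y` and `toy_model` are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label equals the 'Sex' column
toy_X = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [3, 1, 2, 3, 1, 2, 3, 1],
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1, 0, 1])

toy_model = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_model.fit(toy_X, toy_y)

# export_text renders the fitted tree as indented text rules
rules = export_text(toy_model, feature_names=list(toy_X.columns))
print(rules)
```

Calling `export_text(model, feature_names=list(X.columns))` on the fitted Titanic model should print its rules in the same way.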



In [23]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 79.89%
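Accuracy is the share of correct predictions, so it helps to compare it with a majority-class baseline and to see where the errors fall. A minimal sketch on hypothetical labels (the arrays below are made up, not the model's real predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Accuracy = correct predictions / all predictions
acc = accuracy_score(y_true, y_pred)

# Baseline: always predict the more common class
baseline = max(y_true.mean(), 1 - y_true.mean())

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.2f}, baseline: {baseline:.2f}")
print(cm)
```

With the real model, the same calls would take `y_test` and `prediction`.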


In [22]:
# Finding the strongest predictor

import pandas as pd
import matplotlib.pyplot as plt

importances = model.feature_importances_

feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)

      Feature  Importance
1         Sex    0.605737
0      Pclass    0.209536
2         Age    0.075353
5        Fare    0.061240
3       SibSp    0.048135
4       Parch    0.000000
6  Embarked_Q    0.000000
7  Embarked_S    0.000000
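The importances sum to 1, so they can be read as shares of the tree's total impurity reduction. A bar chart makes the ranking easier to see. The sketch below copies the values from the printed table; the name `importance_df` and the chart styling are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the printed table above
importance_df = pd.DataFrame({
    "Feature": ["Sex", "Pclass", "Age", "Fare", "SibSp",
                "Parch", "Embarked_Q", "Embarked_S"],
    "Importance": [0.605737, 0.209536, 0.075353, 0.061240, 0.048135,
                   0.0, 0.0, 0.0],
})

# Sort ascending so the largest bar ends up at the top of a horizontal chart
plot_df = importance_df.sort_values(by="Importance")

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(plot_df["Feature"], plot_df["Importance"])
ax.set_xlabel("Importance")
ax.set_title("Decision tree feature importances")
fig.tight_layout()
plt.show()
```

In the notebook, the same plot can be drawn directly from `feature_importance_df`.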
