# **EDA Exercise**

**Overview**

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.

The training set has 891 examples, with 11 input variables plus the target variable (Survived). The dataset is hosted online and can be loaded directly with pandas read_csv().

**Feature Description:**

PassengerId - this is just a generated Id of each passenger

Pclass - which class did the passenger ride in - first, second or third

Name

Sex - male or female

Age

SibSp - number of the passenger's siblings or spouses aboard the ship

Parch - number of the passenger's parents or children aboard the ship

Ticket - ticket number

Fare - ticket price

Cabin

Embarked - port of embarkation

Survived - did the passenger survive the sinking of the Titanic?


**Objective:** The broader objective is to build a model that would predict the survival probability of a person, given their basic features. In this exercise, you only need to focus on the Exploratory Data Analysis step.

**Important note:** If all the options of a question are correct, you only need to choose the last option stating All of the above.

**Steps to be performed:**

Load libraries

Load the dataset. Dataset Link: https://raw.githubusercontent.com/dphi-official/First_ML_Model/master/titanic.csv

Observe the first 5 rows of the data

In [None]:
# Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
rms_titanic_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/First_ML_Model/master/titanic.csv")
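
A quick sanity check after loading can confirm that the expected columns arrived. The sketch below is only illustrative: it reads a three-row in-memory sample (values copied from the head() output of this notebook, with several columns left out) instead of the real URL, and the names `sample_csv`, `expected_columns` and `missing_columns` are hypothetical.

```python
import io

import pandas as pd

# Three-row sample standing in for the downloaded file (assumption: a subset
# of the real columns, values taken from the first rows shown by head()).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22.0,7.25,S\n"
    "2,1,1,female,38.0,71.2833,C\n"
    "3,1,3,female,26.0,7.925,S\n"
)
sample = pd.read_csv(sample_csv)

# Columns the later questions rely on; anything missing would show up here.
expected_columns = ["Survived", "Pclass", "Sex", "Age", "Fare", "Embarked"]
missing_columns = [col for col in expected_columns if col not in sample.columns]

print(sample.shape)
print(missing_columns)
```

The same check could be run on rms_titanic_data once it has been downloaded.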

**Initial Review of Data**

In [None]:
# Observe the first 5 rows of the data
rms_titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
rms_titanic_data.shape

(891, 12)

In [None]:
rms_titanic_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


**Question 1:**

**Write the appropriate code to find answers to the following questions:**

Select the correct statement about the titanic dataset


*   The 'Fare' feature has 0 missing values
*   The no. of male passengers are more than female passengers
*   All of the above

In [None]:
# Solution 1
# Check if column "Fare" has no missing values

rms_titanic_data.isnull().sum()        # Output Satisfies the first option "The 'Fare' feature has 0 missing values"

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
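
Counts of missing values are easier to compare when shown next to the share of rows they represent. The following is a minimal sketch on a toy frame with made-up values (not the real dataset); `toy`, `missing_count` and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values: two of four ages are missing, no fares are.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isna() gives a True/False frame; sum() counts the True values per column
# and mean() gives the fraction of rows that are missing.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

Applied to the output above, the 177 missing Age values out of 891 rows correspond to roughly 19.9% of the data, while Fare has no gaps.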


In [None]:
# Check if male passengers > female passengers
rms_titanic_data['Sex'].value_counts()          # Output Satisfies the second option "The no. of male passengers are more than female passengers"

Sex,count
male,577
female,314


In [None]:
if rms_titanic_data['Sex'].value_counts()["male"] > rms_titanic_data['Sex'].value_counts()["female"]:
  print("The no. of male passengers are more than female passengers")

The no. of male passengers are more than female passengers


**Answer to Question 1 is "All of the above"**

**Question 2:**

**Write the appropriate code to find answers to the following questions:**

What is the proportion of passengers who survived?

Note: In this question, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1 rounded to 2 decimal places

*   0.38
*   0.39
*   0.40
*   0.41

In [None]:
# Solution 2:
# Count the number of occurrences of each unique value
# in the target variable "Survived"
rms_titanic_data['Survived'].value_counts()


Survived,count
0,549
1,342


In [None]:
# METHOD 1
# Get the total number of passengers
total_passengers = rms_titanic_data['Survived'].value_counts().sum()

# Get the number of passengers that survived
passengers_survived = rms_titanic_data['Survived'].value_counts()[1]

# Calculate the proportion of passengers who survived
proportion_survived = passengers_survived / total_passengers

print(f"The proportion of passengers who survived is: {proportion_survived:.2f}")

The proportion of passengers who survived is: 0.38


In [None]:
# METHOD 2
# Get the number of passengers that survived
passengers_survived = rms_titanic_data['Survived'] == 1

# Calculate the proportion of passengers who survived
proportion_survived = passengers_survived.mean()
print(f"The proportion of passengers who survived is: {proportion_survived:.2f}")

The proportion of passengers who survived is: 0.38
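
As a cross-check of the arithmetic, the proportion can also be read straight from value_counts(normalize=True). The sketch below rebuilds a stand-in Survived column from the counts printed above (549 zeros and 342 ones) rather than loading the dataset; the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for the Survived column, rebuilt from the value_counts() output:
# 549 passengers with Survived == 0 and 342 with Survived == 1.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns shares instead of counts.
proportions = survived.value_counts(normalize=True)
proportion_survived = round(proportions[1], 2)

print(proportions)
print(proportion_survived)
```

The result, 342 / 891, is also the mean of the Survived column reported by describe() (0.383838).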


**Answer to Question 2 is "0.38"**

**Question 3:**

**Write the appropriate code to find answers to the following questions:**

What is the median Fare of the passengers?

Note: Write your answer up to 4 decimal places

*   14.4542
*   13.4542
*   32.2042
*   None of the above


In [None]:
median_fare = rms_titanic_data['Fare'].median()
print(f"the median Fare of the passengers is: {median_fare:.4f}")

the median Fare of the passengers is: 14.4542
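
The describe() output earlier lists a mean Fare of about 32.20 against this median of 14.4542, which suggests that a few very high fares pull the mean upwards. A small sketch of that effect, using only the five fares visible in the head() output (so the numbers differ from the full dataset):

```python
import pandas as pd

# The five Fare values shown in the first rows of the dataset.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

# The median is the middle value after sorting; the mean is pulled up by
# the two large fares.
median_fare = fares.median()
mean_fare = fares.mean()

print(f"median: {median_fare:.4f}")
print(f"mean: {mean_fare:.4f}")
```

Here the median is 8.05 while the mean is close to 29.52, the same direction of gap seen in the full data.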


**Answer to Question 3 is "14.4542"**

**Question 4:**

**Write the appropriate code to find answers to the following questions:**

Select the correct option:

*   Percentage of women survived was more than percentage of men survived
*   It looks like first-class passengers were given priority to survive
*   It looks like Children were given priority to survive
*   All of the above


In [None]:
women_count = rms_titanic_data[rms_titanic_data['Sex'] == 'female'].shape[0]
#men_count = rms_titanic_data[rms_titanic_data['Sex'] == 'male'].shape
print(women_count)
#print(men_count)

314


In [None]:
# Solution 4:
# To check if the percentage of women survived was more than percentage of men survived

# METHOD 1
women_count = rms_titanic_data[rms_titanic_data['Sex'] == 'female'].shape[0]
men_count = rms_titanic_data[rms_titanic_data['Sex'] == 'male'].shape[0]

women_survived_count = rms_titanic_data[rms_titanic_data['Sex'] == 'female']['Survived'].sum()
men_survived_count = rms_titanic_data[rms_titanic_data["Sex"] == "male"]["Survived"].sum()


print(f"Number of women survived: {women_survived_count}")
print(f"Number of men survived: {men_survived_count}")

percentage_women_survived = (women_survived_count / women_count) * 100
percentage_men_survived = (men_survived_count / men_count) * 100

print(f"Percentage of women survived: {percentage_women_survived:.2f}%")
print(f"Percentage of men survived: {percentage_men_survived:.2f}%")

if percentage_women_survived  > percentage_men_survived :
  print("Percentage of women survived was more than percentage of men survived")   # This will satisfy option 1
else:
  print("Percentage of women survived was not more than percentage of men survived")



Number of women survived: 233
Number of men survived: 109
Percentage of women survived: 74.20%
Percentage of men survived: 18.89%
Percentage of women survived was more than percentage of men survived


In [None]:
# METHOD 2: Improved Code

# Split the data by sex (these are DataFrames, not counts)
women = rms_titanic_data[rms_titanic_data["Sex"] == "female"]
men = rms_titanic_data[rms_titanic_data["Sex"] == "male"]

# Calculate the survival percentage for each group
percentage_women_survived2 = women["Survived"].mean() * 100
percentage_men_survived2 = men["Survived"].mean() * 100

print(f"Percentage of women survived: {percentage_women_survived2:.2f}%")
print(f"Percentage of men survived: {percentage_men_survived2:.2f}%")

# Check if the percentage of women who survived is more than the men
if percentage_women_survived2 > percentage_men_survived2:
    print("Percentage of women survived was more than percentage of men survived")     # This will satisfy option 1
else:
    print("Percentage of women survived was not more than percentage of men survived")

Percentage of women survived: 74.20%
Percentage of men survived: 18.89%
Percentage of women survived was more than percentage of men survived
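
Both methods filter the data once per group. A groupby can compute the same rates in a single step; the sketch below is one possible alternative, run on a stand-in frame rebuilt from the counts printed above (314 women with 233 survivors, 577 men with 109 survivors) rather than on the downloaded data.

```python
import pandas as pd

# Stand-in frame rebuilt from the counts in the outputs above.
df = pd.DataFrame({
    "Sex": ["female"] * 314 + ["male"] * 577,
    "Survived": [1] * 233 + [0] * (314 - 233) + [1] * 109 + [0] * (577 - 109),
})

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_sex = df.groupby("Sex")["Survived"].mean().mul(100).round(2)

print(rate_by_sex)
```

On the real DataFrame the equivalent call would be rms_titanic_data.groupby("Sex")["Survived"].mean().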


In [None]:
# Solution 4 (continued)
# Check whether first-class passengers appear to have been given priority to survive.
# The mean of the 0/1 Survived column is a survival rate (a proportion), not a count.
first_class_survivor = rms_titanic_data[rms_titanic_data["Pclass"] == 1]["Survived"].mean()
print(f"Survival rate of first-class passengers: {first_class_survivor:.2f}")
second_class_survivor = rms_titanic_data[rms_titanic_data["Pclass"] == 2]["Survived"].mean()
print(f"Survival rate of second-class passengers: {second_class_survivor:.2f}")
third_class_survivor = rms_titanic_data[rms_titanic_data["Pclass"] == 3]["Survived"].mean()
print(f"Survival rate of third-class passengers: {third_class_survivor:.2f}")

# Check for priority to survive
if first_class_survivor > second_class_survivor and first_class_survivor > third_class_survivor:  # This will satisfy option 2
    print("First-class passengers were given priority to survive")
else:
    print("First-class passengers were not given priority to survive")

Survival rate of first-class passengers: 0.63
Survival rate of second-class passengers: 0.47
Survival rate of third-class passengers: 0.24
First-class passengers were given priority to survive


In [None]:
# Solution 4 (continued)
# Check if it looks like Children were given priority to survive

# Define age groups
children = rms_titanic_data[rms_titanic_data['Age'] <= 12]
teenagers = rms_titanic_data[(rms_titanic_data['Age'] > 12) & (rms_titanic_data['Age'] <= 19)]
adults = rms_titanic_data[rms_titanic_data['Age'] > 19]

# Calculate the proportion of each age group who survived
children_survivors = children['Survived'].mean() if not children.empty else 0     # "if not children.empty else 0" is used for safeguard, in case there are no children in the data.
teenagers_survivors = teenagers["Survived"].mean() if not teenagers.empty else 0
adults_survivors = adults["Survived"].mean() if not adults.empty else 0

print(f"Children survival rate (Age <= 12): {children_survivors:.2f}")
print(f"Teenagers survival rate (Age 13-19): {teenagers_survivors:.2f}")
print(f"Adults survival rate (Age > 19): {adults_survivors:.2f}")

# Check for priority to survive
if (children_survivors > teenagers_survivors) and (children_survivors > adults_survivors):      # This will satisfy option 3
    print("Children were given priority to survive")
else:
    print("Children were not given priority to survive")



Children survival rate (Age <= 12): 0.58
Teenagers survival rate (Age 13-19): 0.41
Adults survival rate (Age > 19): 0.38
Children were given priority to survive
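
The three age groups above are built with separate boolean filters. pd.cut can assign every row to a group in one pass; the following is a sketch under the same cut-offs (12 and 19), using a toy frame with invented ages and outcomes, so the rates it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one row has a missing age.
toy = pd.DataFrame({
    "Age": [4, 12, 13, 19, 20, 45, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# Bins are right-inclusive by default: (0, 12], (12, 19], (19, inf).
# An age of exactly 0 would fall outside the first bin.
bins = [0, 12, 19, np.inf]
labels = ["child", "teenager", "adult"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows with a missing age get no group and drop out of the groupby.
rate_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate_by_group)
```

As with the filters above, passengers with a missing Age are left out of every group, which matters here because 177 ages are missing.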


**Answer to Question 4 is "All of the above"**

**Question 5:**

**Write the appropriate code to find answers to the following questions:**

Create a subset of the data containing only the observations for which the passenger survived. Call the new dataset survived_passengers.

How many of the survived passengers had embarked from 'Southampton' i.e. 'S'?

*   644
*   217
*   168
*   77

In [None]:
# Method 1: Following all instructions
# Create a subset of the passengers who survived
survived_passengers = rms_titanic_data[rms_titanic_data["Survived"] == 1]

# Get the number of survived passengers who embarked from Southampton
survived_passengers_southampton = survived_passengers[survived_passengers["Embarked"] == "S"].shape[0]

print(f"Number of survived passengers who embarked from Southampton (S): {survived_passengers_southampton}")


Number of survived passengers who embarked from Southampton (S): 217


In [None]:
# Method 2: Note that this method does not follow the Question's instruction
# Get the number of survived passengers who embarked from Southampton using .loc
survived_southampton_count = rms_titanic_data.loc[(rms_titanic_data['Survived'] == 1) & (rms_titanic_data['Embarked'] == 'S')].shape[0]

print(f"Number of survived passengers who embarked from Southampton (S): {survived_southampton_count}")

Number of survived passengers who embarked from Southampton (S): 217
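
A cross-tabulation gives the same count and shows the other ports alongside it. This is a minimal sketch on a toy frame with invented rows (one port is missing on purpose); `toy`, `survived_from_s` and `table` are hypothetical names.

```python
import pandas as pd

# Toy data (invented), including one passenger with no recorded port.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 1],
    "Embarked": ["S", "C", "S", "S", "Q", None],
})

# Combined boolean mask: True only for survivors who embarked at 'S'.
survived_from_s = int(((toy["Survived"] == 1) & (toy["Embarked"] == "S")).sum())

# Rows are ports, columns are Survived values; missing ports are dropped.
table = pd.crosstab(toy["Embarked"], toy["Survived"])

print(survived_from_s)
print(table)
```

On the real data, the cell at row 'S' and column 1 of such a table would be expected to match the 217 found above.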


**Answer to Question 5 is "217"**

**Question 6:**

**Write the appropriate code to find answers to the following questions:**

Five highest fares of the passengers (not unique):

*   [512.3292, 512.3292, 512.3292, 263.0, 263.0]  
*   [510.3292, 512.3292, 512.3292, 263.0, 263.0]
*   [512.3292, 512.3292, 512.3292, 263.0, 256.0]
*   [512.3292, 520.3292, 512.3292, 263.0, 263.0]


In [None]:
# Get the five highest fares (duplicates allowed)
top_five_fare = rms_titanic_data.sort_values(by="Fare", ascending=False).head()
top_five_fare_list = top_five_fare["Fare"].tolist()

print(top_five_fare_list)

[512.3292, 512.3292, 512.3292, 263.0, 263.0]


In [None]:
# Get the five highest fares (duplicates allowed) using chained methods
top_five_fare_chained = rms_titanic_data.sort_values(by="Fare", ascending=False)["Fare"].head().tolist()

print(top_five_fare_chained)

[512.3292, 512.3292, 512.3292, 263.0, 263.0]
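
Series.nlargest is another way to get the top values without sorting the whole DataFrame, and it keeps duplicates. The sketch below uses a short list of fares assembled from values that appear in this notebook's outputs, not the full Fare column.

```python
import pandas as pd

# A handful of fares seen in the outputs above, in arbitrary order.
fares = pd.Series(
    [7.25, 512.3292, 263.0, 71.2833, 512.3292, 263.0, 512.3292, 53.1],
    name="Fare",
)

# nlargest(5) returns the five largest values in descending order,
# repeating a value as often as it occurs.
top_five = fares.nlargest(5).tolist()

print(top_five)
```

On the real data the call would be rms_titanic_data["Fare"].nlargest(5).tolist().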


**Answer to Question 6 is "[512.3292, 512.3292, 512.3292, 263.0, 263.0]"**

**Question 7:**

**Write the appropriate code to find answers to the following questions:**

Median age of the passengers is:

*   27.0
*   28.0
*   29.0
*   30.0

In [None]:
# To compute the Median age of the passengers
median_age = rms_titanic_data["Age"].median()
print(f"The median age of the passengers is: {median_age:.1f}")

The median age of the passengers is: 28.0


**Answer to Question 7 is "28.0"**

**Question 8:**

**Write the appropriate code to find answers to the following questions:**

Select the correct statement:

*   There are 891 unique values in the Name column
*   There are 714 unique values in the Name column

In [None]:
# Get the number of unique values in the Name column
unique_names_count = rms_titanic_data["Name"].nunique()

print(f"Total number of unique names: {unique_names_count}")

Total number of unique names: 891
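
Since 891 unique names equals the number of rows, every name in the dataset occurs exactly once. A short sketch of how that could be checked directly, using made-up placeholder names (one repeated on purpose) rather than the real column:

```python
import pandas as pd

# Placeholder names (hypothetical); "Passenger B" appears twice.
names = pd.Series(["Passenger A", "Passenger B", "Passenger C", "Passenger B"])

unique_count = names.nunique()           # number of distinct values
all_unique = names.is_unique             # True only if nothing repeats
repeat_count = int(names.duplicated().sum())  # rows that repeat an earlier value

print(unique_count, all_unique, repeat_count)
```

For the real Name column, is_unique would be expected to be True given the count above.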


**Answer to Question 8 is "There are 891 unique values in the Name column"**

**Question 9:**

**Write the appropriate code to find answers to the following questions:**

Most of the passengers have _____ siblings/spouses.

*   5
*   1
*   0
*   2

In [None]:
# Get the answer to: Most of the passengers have _____ siblings/spouses.
rms_titanic_data["SibSp"].value_counts()

SibSp,count
0,608
1,209
2,28
4,18
3,16
8,7
5,5
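
The most frequent value can also be extracted programmatically instead of being read off the table. The sketch below rebuilds a stand-in SibSp column from the counts printed above (it does not load the dataset); `sibsp_counts` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd

# value -> count, copied from the value_counts() output above.
sibsp_counts = {0: 608, 1: 209, 2: 28, 4: 18, 3: 16, 8: 7, 5: 5}

# Repeat each value as many times as it occurs to rebuild the column.
sibsp = pd.Series(
    np.repeat(list(sibsp_counts.keys()), list(sibsp_counts.values())),
    name="SibSp",
)

most_common = sibsp.value_counts().idxmax()  # label with the highest count
mode_value = sibsp.mode()[0]                 # same answer via mode()
share_alone = (sibsp == 0).mean()            # fraction with no sibling/spouse

print(most_common, mode_value, round(share_alone, 3))
```

The rebuilt column has 891 rows and a mean of about 0.523, matching the SibSp mean in the describe() output, and roughly 68% of passengers have a SibSp of 0.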


**Answer to Question 9 is "0"**

**Question 10:**

**Write the appropriate code to find answers to the following questions:**

Which of the following features plays an important role in the survival of the passengers?

*   Name
*   Age
*   Ticket

In [None]:
# To determine which feature among "Name", "Age", and "Ticket" played an important role in the survival of the passengers
# Selecting the columns, including 'Survived' for correlation analysis
comparism_features = rms_titanic_data[["Name", "Age", "Ticket", "Survived"]]
print(comparism_features.head())

# Note: Selecting columns does not directly answer the question of feature importance.
# To determine feature importance, further analysis or modeling is required.
# For example, you could analyze survival rates by age groups, or explore patterns in names or tickets.

                                                Name   Age            Ticket  \
0                            Braund, Mr. Owen Harris  22.0         A/5 21171   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0          PC 17599   
2                             Heikkinen, Miss. Laina  26.0  STON/O2. 3101282   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0            113803   
4                           Allen, Mr. William Henry  35.0            373450   

   Survived  
0         0  
1         1  
2         1  
3         1  
4         0  


In [None]:
# Calculate correlation matrix for numeric columns
# Exclude non-numeric columns like 'Name' and 'Ticket'
numeric_comparism_features = comparism_features.select_dtypes(include=['number'])
display(numeric_comparism_features.corr())

Unnamed: 0,Age,Survived
Age,1.0,-0.077221
Survived,-0.077221,1.0
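
A Pearson correlation close to zero, like the -0.077 above, only says that there is little linear association; it does not rule out other kinds of relationship. The toy example below illustrates the general point with made-up numbers. Whether it applies to Age and Survived is an assumption that would need checking, for example with the age-group survival rates from Question 4.

```python
import pandas as pd

# Made-up data: y depends on x exactly, but not linearly.
x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2

# Pearson correlation measures linear association only.
r = x.corr(y)

print(r)
```

Here y is fully determined by x, yet the correlation is zero because the relationship is symmetric around x = 0.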


In [None]:
# Take the mean of the Age column
age_mean_before = rms_titanic_data['Age'].mean().round(2)
age_mean_before

np.float64(29.7)

In [None]:
# Fill the missing values in the Age column with the mean of the Age column
age_mean_before_filled = rms_titanic_data['Age'].fillna(age_mean_before)
age_mean_before_filled.info()

<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
891 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB


In [None]:
age_mean_after = age_mean_before_filled.mean().round(2)
age_mean_after

np.float64(29.7)
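
The mean is unchanged after filling because every gap was replaced by the mean itself. What does change is the spread: the filled values sit exactly at the centre, so the standard deviation shrinks. A minimal sketch with made-up numbers (not the Age column):

```python
import numpy as np
import pandas as pd

# Made-up values: three known, two missing.
values = pd.Series([10.0, 20.0, 30.0, np.nan, np.nan])

mean_before = values.mean()   # NaN values are skipped by default
std_before = values.std()     # sample standard deviation (ddof=1)

filled = values.fillna(mean_before)
mean_after = filled.mean()
std_after = filled.std()

print(mean_before, mean_after)
print(round(std_before, 4), round(std_after, 4))
```

In this toy case the mean stays at 20 while the standard deviation drops from 10 to about 7.07. The same effect would be expected, to some degree, for Age, where 177 of 891 values are filled.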

In [None]:
# Most frequent value (mode) of the Embarked column
rms_titanic_data['Embarked'].value_counts()


Embarked,count
S,644
C,168
Q,77


In [None]:
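# Fill the missing Embarked values with the most frequent port ('S').
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may leave the DataFrame unchanged.
rms_titanic_data['Embarked'] = rms_titanic_data['Embarked'].fillna('S')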

In [None]:
rms_titanic_data['Embarked'].value_counts()

Embarked,count
S,646
C,168
Q,77
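
The count for 'S' rises from 644 to 646, so two passengers had no recorded port. Instead of hard-coding 'S', the fill value can be taken from the data with mode(). The sketch below shows this on a stand-in column rebuilt from the counts above plus two missing entries; the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for the Embarked column: counts from the first value_counts()
# output, plus the two missing entries implied by the change to 646.
embarked = pd.Series(
    ["S"] * 644 + ["C"] * 168 + ["Q"] * 77 + [None] * 2,
    name="Embarked",
)

# mode() ignores missing values and returns a Series; take its first entry.
most_frequent_port = embarked.mode()[0]
embarked_filled = embarked.fillna(most_frequent_port)

print(most_frequent_port)
print(embarked_filled.value_counts())
```

This keeps the imputation correct even if the most frequent port were to differ in another version of the data.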
