# Titanic Dataset: Exploratory Data Analysis

This is part 1 of my project on the famous Titanic dataset from Kaggle. In this part, we are going to do an exploratory data analysis to answer the following questions:
1. Who were the passengers on the Titanic? We will look at class, age, gender, etc.
2. How many people travelled alone? How big were the families on average?
3. What was the most popular port of embarkation?
4. What decks did passengers of different classes occupy?
5. And, most importantly, what were the factors that affected the chances of survival?

Let's start, shall we?

## Import Libraries

First, we import the libraries needed for the analysis: pandas and NumPy to manipulate the dataset, and matplotlib and seaborn for data visualisation.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Import Dataset

The Titanic datasets are downloaded from Kaggle. There are two of them, and we are going to use the training dataset, since it includes the dependent variable (whether a passenger survived or not) that we need for the exploratory data analysis.

In [3]:
dataset = pd.read_csv('datasets/titanic_train.csv')

## Quick Overview of the Dataset

In [5]:
dataset.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We can see that the dataset has the following columns:

PassengerId - a unique identifier for each passenger

Survived - a binary value that shows whether a person survived (1) or not (0)

Pclass - the ticket class of the passenger (1 is the highest, 3 is the lowest)

Name, Sex, Age - self-explanatory

SibSp - number of siblings and spouses travelling with the passenger (weird combination, I know!)

Parch - number of parents and children travelling with the passenger

Ticket - ticket number

Fare - the price of the ticket

Cabin - cabin number of the passenger

Embarked - the port of embarkation (S - Southampton, C - Cherbourg, Q - Queenstown)
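
The single-letter port codes can be hard to read in tables and plots. Here is a minimal sketch of turning them into full port names, using the mapping from the description above. The `sample` frame and the `EmbarkedPort` column name are made up for illustration; in the notebook the same call would be applied to `dataset`.

```python
import pandas as pd

# Codes and port names as listed in the column description above.
PORT_NAMES = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Tiny stand-in frame (hypothetical); in the notebook this would be `dataset`.
sample = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", None]})

# map() looks each code up in the dictionary; missing codes stay missing.
sample["EmbarkedPort"] = sample["Embarked"].map(PORT_NAMES)

print(sample["EmbarkedPort"].value_counts())
```

Keeping the original `Embarked` column and adding a new one means the short codes are still available if we need them later.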


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The info() function gives us a great way to look at the dataset and to see how complete it is. We can see that the majority of the columns have all their entries. However, the Cabin column has the majority of its entries missing: only 204 of 891 are present, so about 77% are missing. Moreover, around a fifth of the entries in the Age column (177 of 891) are missing, and Embarked is missing two.
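
To put numbers on how complete each column is, one option is to compute the share of missing values directly. The sketch below is only an illustration: `sample` is a small stand-in built from the five rows shown by `head()` above, and the non-null counts in the second half are copied from the `info()` output.

```python
import pandas as pd

# The five rows shown by dataset.head() above, reduced to three columns.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# isna() marks missing cells; the mean of those booleans is the missing share.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share.round(2))

# The same calculation from the non-null counts reported by info().
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}
for column, count in non_null_counts.items():
    print(f"{column}: {1 - count / total_rows:.1%} missing")
```

On the full data, the equivalent call would be `dataset.isna().mean()`.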

In [10]:
dataset.describe().round(2)

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 257.35 | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |


The describe() function is useful for getting a sense of the numerical columns and how the data is distributed. In particular, it shows us that the oldest person on the Titanic (or in this dataset, to be precise) was 80 years old, whereas the youngest was 0.42 years old (around 5 months). The most expensive fare was 512.33 pounds, whereas the average fare was 32.20 pounds. Do you think a person who paid such an expensive fare survived?
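
The arithmetic behind these readings is easy to check. The snippet below is just a quick sketch using values copied from the describe() output above; the variable names are my own.

```python
# Values copied from the describe() output above.
min_age_years = 0.42
mean_fare = 32.20
median_fare = 14.45   # the 50% row
max_fare = 512.33

# Convert the lowest age from years to months.
age_in_months = min_age_years * 12
print(f"Youngest passenger: about {age_in_months:.0f} months old")

# Compare the mean fare with the median and the maximum.
print(f"Mean fare is {mean_fare / median_fare:.1f}x the median fare")
print(f"Highest fare is {max_fare / mean_fare:.1f}x the mean fare")
```

A mean well above the median suggests that a few very expensive tickets pull the average up, which is worth keeping in mind when we look at fares later.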

In [18]:
dataset.loc[dataset['Fare'] == 512.3292, 'Survived']

258    1
679    1
737    1
Name: Survived, dtype: int64

In fact, all three passengers who paid the highest fare survived. And they say money can't solve everything!
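
Typing the fare in as a float literal works here, but it depends on knowing the exact unrounded value. A slightly more robust sketch, assuming we simply want everyone who paid the maximum fare, compares against the column maximum instead. The `sample` frame below is a stand-in built from the five `head()` rows above, so its result differs from the full dataset.

```python
import pandas as pd

# Rows from dataset.head() above; on the real data this would be `dataset`.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Compare against the column maximum rather than a typed-in float literal.
top_fare = sample["Fare"].max()
top_payers = sample.loc[sample["Fare"] == top_fare, ["PassengerId", "Survived", "Fare"]]

print(top_payers)
```

Because the maximum is taken from the same column, the comparison matches exactly, and every passenger tied for the highest fare is returned.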

## A Closer Look at the Passengers of the Titanic

Now let's try to answer the first question. What do we know about the passengers of the Titanic?