In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# To display all columns in the dataset.
pd.set_option('display.max_columns', None)

In [2]:
data = pd.read_csv('datas/titanic.csv')
data.head()


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
data.shape

(1309, 14)

In [4]:
data.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [8]:
data.isnull().mean().mul(100.).round(2)

pclass        0.00
survived      0.00
name          0.00
sex           0.00
age          20.09
sibsp         0.00
parch         0.00
ticket        0.00
fare          0.08
cabin        77.46
embarked      0.15
boat         62.87
body         90.76
home.dest    43.09
dtype: float64

About 20% of the values in the variable age are missing.

About 77% of the values are missing in the variable cabin, which records the cabin in which the passenger was traveling.

Only 0.15% of the values are missing in the variable embarked (the port from which the passenger boarded the Titanic).
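The counts and percentages shown in the two cells above can also be collected in a single table. The following is a minimal sketch: the helper `missing_summary` and the `toy` dataframe are hypothetical names introduced for illustration, and the toy values are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean().mul(100).round(2),
    })
    # Keep only the columns with at least one missing value,
    # most affected column first.
    affected = summary[summary["n_missing"] > 0]
    return affected.sort_values("pct_missing", ascending=False)


# Hypothetical toy dataframe, used only to illustrate the helper.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B5", np.nan, np.nan, np.nan],
    "sex": ["female", "male", "female", "male"],
})

print(missing_summary(toy))
```

On the real data, the same call would be `missing_summary(data)`.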


## Mechanisms of Missing Data
#### **Missing data Not At Random (MNAR)**

The missing values of the variables age and cabin were introduced systematically. For many of those who did not survive, the age or the cabin remains unknown, whereas the people who survived could have been asked for that information.

Can we infer this by looking at the data?

If the data are MNAR, we would expect a greater proportion of missing values among the people who did not survive.

Let's have a look.

In [9]:
# Let's create a binary variable that indicates 
# if the value of cabin is missing.

data['cabin_null'] = np.where(data['cabin'].isnull(), 1, 0)

In [11]:
# Let's evaluate the percentage of missing values in
# cabin for the people who survived vs the non-survivors.

# The variable Survived takes the value 1 if the passenger
# survived, or 0 otherwise.

# Group data by Survived vs Non-Survived
# and find the percentage of NaN for Cabin.

data.groupby(['survived'])['cabin_null'].mean().mul(100)

survived
0    87.391842
1    61.400000
Name: cabin_null, dtype: float64

In [15]:
# Let's do the same for the variable age:

data['age'].isna().groupby(data['survived']).mean().mul(100)

survived
0    23.485785
1    14.600000
Name: age, dtype: float64

We observe more missing data for the people who did not survive: about 87% of non-survivors have no cabin recorded, versus 61% of survivors, and about 23% of non-survivors have no age recorded, versus 15% of survivors. The analysis therefore suggests that there was a systematic loss of data: people who did not survive had more missing information. Presumably, the method chosen to gather the information contributed to the generation of this missing data.
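The two comparisons above follow the same pattern, so they can be run for several variables at once. The sketch below assumes a hypothetical helper, `missing_rate_by_group`, and a small made-up dataframe, `toy`, whose values are illustrative only and not taken from the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_rate_by_group(df, columns, by):
    """Percentage of missing values in `columns` for each level of `by`."""
    # 1 where the value is missing, 0 otherwise; the group mean of
    # these flags is the fraction of missing values in that group.
    flags = df[columns].isnull().astype(int)
    return flags.groupby(df[by]).mean().mul(100)


# Hypothetical toy dataframe: 4 non-survivors followed by 4 survivors.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "age": [np.nan, 22.0, np.nan, 40.0, 30.0, np.nan, 25.0, 35.0],
    "cabin": [np.nan, np.nan, np.nan, "C22", "B5", np.nan, "E12", np.nan],
})

rates = missing_rate_by_group(toy, ["age", "cabin"], by="survived")
print(rates)
```

On the real data, the equivalent call would be `missing_rate_by_group(data, ['age', 'cabin'], by='survived')`.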

#### **Missing data Completely At Random (MCAR)**

In [None]:
# In the titanic dataset, there are also missing values
# for the variable Embarked.

# Let's have a look.

# Let's slice the dataframe to show only the observations
# with missing values for Embarked.

data[data['embarked'].isnull()]


The two women with a missing value for embarked were traveling together. Miss Icard was the maid of Mrs. Stone.

A priori, there does not seem to be an indication that the missing information in the variable "embarked" is dependent on any other variable, and the fact that these women survived means that they could have been asked for this information.

It is very likely the values were lost at the time of building the dataset.

If these values are MCAR, the likelihood of data missing for these two women is the same as the likelihood of data missing for any other person on the Titanic. Of course, this will be hard, if possible at all, to prove.
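The difference between the two mechanisms can be illustrated with simulated data. In the sketch below, all names and probabilities are assumptions chosen for illustration (a 20% uniform loss for the MCAR case, and 40% vs. 10% for the case that depends on survival), and nothing is taken from the Titanic file. Under MCAR, the missing rate should be roughly the same in both groups, whereas under survival-dependent missingness it should differ clearly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n = 10_000

# Simulated survival indicator: 0 = did not survive, 1 = survived.
sim = pd.DataFrame({"survived": rng.integers(0, 2, size=n)})

# MCAR: every row has the same (assumed) 20% chance of being missing.
sim["mcar_null"] = (rng.random(n) < 0.20).astype(int)

# Missingness that depends on survival: non-survivors lose the value
# more often (assumed 40%) than survivors (assumed 10%).
p_missing = np.where(sim["survived"] == 0, 0.40, 0.10)
sim["dependent_null"] = (rng.random(n) < p_missing).astype(int)

# Percentage of missing values per survival group, for each mechanism.
rates = sim.groupby("survived")[["mcar_null", "dependent_null"]].mean().mul(100)
print(rates.round(1))
```

Similar rates across groups are consistent with MCAR but do not prove it, since the missingness could still depend on a variable that was not examined.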
