# Titanic Dataset

## Overview

This project analyses the Titanic data set and tries to find factors that increased the chance of surviving the sinking of the ship. We start with an exploration of the data. We then define a metric for the survival chance and present and discuss the results.

## Table of contents

* [Exploration](#Exploring-the-data)
* [Analysis](#Analysis:-Chance-to-survive)
* [Results](#Results)
* [Conclusion](#Conclusion)



## Exploring the data

First, the data is loaded and some basic information is printed to get an idea of the data set. Additional background on the columns was taken from https://www.kaggle.com/c/titanic/data.

In [None]:
import numpy as np
import pandas as pd
#import csv

data = pd.read_csv("titanic-data.csv")

data.head()

In [None]:
data.describe()   

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(data["Fare"], bins=40)
plt.title("Histogram of passengers' fares")
plt.xlabel("Fare")
plt.ylabel("Number")
plt.show()

In [None]:
# The "Age" column is missing for some of the passengers.
# We need to exclude those to plot a histogram.
plt.hist(data["Age"][data["Age"].notnull()])
plt.title("Histogram of passengers' ages")
plt.xlabel("Age")
plt.ylabel("Number")
plt.show()

To get an idea if the age of a passenger had some relation with his or her chance to survive, we add a new "AgeClass" column.

In [None]:
import math

def classify_age(age):
    if not math.isnan(age):  # an age of 0 is valid and must not count as unknown
        # Use integer division.
        return int(age) // 10
    else:
        return -1  # instead of NaN
    
data["AgeClass"] = data["Age"].apply(classify_age)

data.head()
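
The same binning can be sketched without a Python-level function by using floor division on the whole column. This is an alternative formulation, not the approach used above; the sample ages are made up, and `ages` and `age_class` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up sample ages, including a missing value (not taken from the data set).
ages = pd.Series([0.42, 9, 10, 35.5, 80, np.nan])

# Floor division by 10 gives the decade; NaN stays NaN and is then mapped to -1.
age_class = (ages // 10).fillna(-1).astype(int)

print(age_class.tolist())
```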

## Analysis: Chance to survive 

In the following we will analyse the Titanic dataset to find factors which increased the chance to survive the tragedy and those which did not have an effect.

To that end, I define the "chance of survival" as the probability of survival, i.e. the share of survivors in the population or a subgroup thereof:

$$ P_{survive}(group) = \frac{\textrm{number of survivors in } group}{\textrm{total number of persons in } group}$$
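
As a minimal sketch of this definition, the following uses a small made-up table (not the Titanic data) to show that the survival share of a group equals the mean of its 0/1 `Survived` values. The column names mirror the data set; the rows and the names `toy`, `share` and `share_via_mean` are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic data set.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# Share of survivors per group: survivors in the group / persons in the group.
grouped = toy.groupby("Sex")["Survived"]
share = grouped.sum() / grouped.size()

# For a 0/1 column, the same share is simply the group mean.
share_via_mean = grouped.mean()

print(share)
```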


In [None]:
population_size = len(data)
num_survived = len(data[data["Survived"] == 1])
p_survival = num_survived / population_size
print("{} of {} survived, this corresponds to a survival chance of {:.3}"
     .format(num_survived, population_size, p_survival))


In the Titanic data set, the chance to survive over the whole population is $P_{survive}(total)=0.384$.

Let's compare how different factors influence this chance. First, we define a function that calculates the survival probability for each group defined by a factor.

In [None]:
def compute_survival_probabilty(criterion):
    grouped = data.groupby(criterion)
    # Select "Survived" before summing so that non-numeric columns are never aggregated.
    return grouped["Survived"].sum() / grouped.size()

Furthermore, we define a function that creates a bar plot of the survival probability for each value of a factor.

In [None]:
def show_probability_for_criterion(criterion):
    
    def get_range_from_group_index(keys):
        if not isinstance(keys[0], str):
            start = min(keys)
            end = max(keys) + 1
            return np.arange(start, end)
        return np.arange(len(keys))
    
    title = "Survival rate by {}".format(criterion)
    pp = compute_survival_probabilty(criterion)
    #print(title)
    #print(pp)
    keys = list(data.groupby(criterion).groups.keys())
    x = get_range_from_group_index(keys)
    plt.bar(x, pp)
    plt.xticks(x, keys)
    plt.title(title)
    plt.xlabel(criterion)
    plt.ylabel(r"$P_{survive}$")
    axes = plt.gca()
    axes.set_ylim([0, 1])
    

## Results

In the following, the data is grouped by sex, ticket class, age and port of embarkation, and the respective chance to survive is shown.

### Sex

In [None]:
show_probability_for_criterion("Sex")

Sex seems to have by far the largest effect on the survival rate: women had a much better chance of surviving the sinking of the ship. This is probably due to the "women and children first" protocol. In addition, the crew of the Titanic was not used to the rescue procedure and too few passengers boarded the lifeboats, so fewer men were able to get a spot in one of them.

### Ticket class

In [None]:
show_probability_for_criterion("Pclass")

The ticket class has an effect on the survival rate, too. Passengers from the first class survived the incident more than twice as often as those from the third class, who have the lowest survival rate of all classes.

### Age

As described above, the age of passengers has been categorized into bins of 10 years each. The value "-1" represents all passengers whose age is unknown.

In [None]:
show_probability_for_criterion("AgeClass")

The survival rate is 1.0 (100%) for passengers aged 80 or above and 0.0 for those between 70 and 79. These "definite" numbers are due to the small number of individuals in these groups:

In [None]:
data.groupby("AgeClass").size()
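
To read rates and group sizes side by side, both can be computed in one aggregation. The following is a small sketch on made-up values, not the Titanic data; `toy` and `summary` are illustrative names.

```python
import pandas as pd

# Made-up toy data: one group with four persons (3) and two single-person groups (7, 8).
toy = pd.DataFrame({
    "AgeClass": [3, 3, 3, 3, 7, 8],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate; the count is the group size.
summary = toy.groupby("AgeClass")["Survived"].agg(["mean", "count"])
summary.columns = ["rate", "size"]

print(summary)
```

A rate of exactly 0.0 or 1.0 next to a size of 1 is then immediately recognisable as a small-sample effect.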

Apart from these, the figures show that the survival rate for children below the age of 10 was better than average, and that the rate for passengers of unknown age was the worst.

The former observation can be explained by the "women and children first" protocol. The latter may be related to the passenger class distribution of this group. I suspect there are more third-class passengers within this group, but this is not part of this investigation.

For all other age groups, the survival rate is close to the average.

### Port of embarkation

In [None]:
show_probability_for_criterion("Embarked")

The embarkation port seems to have an effect, too. Passengers who embarked at Cherbourg had a higher chance to survive than the average, and their survival rate was about 66% higher than that of the passengers who embarked at Southampton. While this is unexpected at first, there might be a relation between the port and the ticket class. From what I know, Southampton used to be a large port for common people migrating to the U.S., and maybe the maiden voyage of the Titanic attracted many upper-class French passengers to join the trip at Cherbourg.

Indeed, a look at the following figures shows this relation:

In [None]:
def portname(p):
    if p == "C":
        return "Cherbourg"
    elif p == "Q":
        return "Queenstown"
    else:
        return "Southampton"
    
    
def passenger_class(p):
    return "Class {}".format(p) 


port_and_class = data.groupby("Pclass").apply(lambda x: x.groupby("Embarked").size())
total_embarked = port_and_class.sum()
port_and_class_shares = port_and_class.div(total_embarked)
for key in port_and_class_shares:
    plt.figure()
    plt.pie(port_and_class_shares[key], labels=port_and_class_shares[key].keys().map(passenger_class))
    plt.title(portname(key))

The share of first-class passengers is much higher for those embarking at Cherbourg than for the other ports. Thus, the port of embarkation and the ticket class are not independent. This explains why passengers from Cherbourg had a higher survival rate.
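
The class composition per port can also be tabulated directly. The following sketch uses `pandas.crosstab` with `normalize="columns"` on made-up rows (not the Titanic data), so that each port's column sums to 1. It is an alternative to the nested `groupby` above, and `toy` and `shares` are illustrative names.

```python
import pandas as pd

# Made-up toy data: three passengers from port "C", five from port "S".
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 1, 3, 3, 3, 2],
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "S"],
})

# Rows are ticket classes, columns are ports; each column is normalized to sum to 1.
shares = pd.crosstab(toy["Pclass"], toy["Embarked"], normalize="columns")

print(shares)
```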

It would be interesting to find out which of these two factors really caused the higher survival chance. I doubt that the port has a causal effect. But thinking about a proper A/B test setup goes beyond the scope of this project.

#### Survival per port and class
Some further exploration with Pandas, breaking the survival rate down by port and class at the same time:

In [None]:
grouped_survival = data.groupby(["Embarked", "Pclass"])["Survived"].agg(["sum", "count"])
grouped_survival["share"] = grouped_survival["sum"] / grouped_survival["count"]
# rename() needs columns=...; without it the index labels are targeted and nothing changes.
grouped_survival.rename(columns={"sum": "survived", "count": "total"}, inplace=True)
grouped_survival

Interestingly, the survival rate for people embarking at Queenstown is higher than for Southampton, although the share of third-class passengers is higher.
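
To compare ports within the same ticket class, such a breakdown can be pivoted into a wide table with one column per class. The sketch below runs on made-up rows, not the Titanic data; `toy` and `rates` are illustrative names.

```python
import pandas as pd

# Made-up toy data; port "Q" has no first-class passengers here.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "Q", "S", "S", "S", "S"],
    "Pclass":   [1, 3, 3, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Survival rate per (port, class), then one column per ticket class.
# Combinations that do not occur in the data show up as NaN.
rates = toy.groupby(["Embarked", "Pclass"])["Survived"].mean().unstack("Pclass")

print(rates)
```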

## Conclusion

We have investigated the relation between different factors and the chance to survive the sinking of the Titanic. We have shown that sex, ticket class, age and port of embarkation are correlated with the survival rate. The figures also indicate a correlation between ticket class and port of embarkation.

### Limitations and possible extensions

The data provided by Udacity for this project is only part of the Kaggle data. Kaggle uses the data for a machine learning task and therefore splits it into a training and a test set; Udacity provided the former as input. However, it is unclear where the data originates from. According to Wikipedia, the total number of passengers and crew aboard the RMS Titanic is estimated at 2,224. Neither Kaggle nor Udacity provides any information about how the present data sets (1309 entries for Kaggle, 891 of which are in the Udacity data) were sampled from this population. The sampling could have been biased, and thus we cannot tell whether the results of this investigation can be generalized.
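
For scale, the sizes quoted above can be put in relation to the estimated number of people aboard. This is plain arithmetic on the figures from the text; the variable names are illustrative.

```python
estimated_aboard = 2224   # estimate cited from Wikipedia in the text
kaggle_total = 1309       # full Kaggle data (training + test set)
udacity_total = 891       # training set provided by Udacity

# Fraction of the estimated total that each data set covers.
kaggle_share = kaggle_total / estimated_aboard
udacity_share = udacity_total / estimated_aboard

print("Kaggle data covers {:.1%} of the estimated total".format(kaggle_share))
print("Udacity data covers {:.1%} of the estimated total".format(udacity_share))
```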

The Udacity data corresponds to only the training set of the Kaggle data. The test set contains another 418 entries. This project could be extended by repeating the analysis on the union of both the training and the test data. 

Also, further investigation on the relation between port of embarkation and ticket class is not part of this project.



## Sources

* [Pandas documentation, e.g. on groupby()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)
* [Matplotlib documentation, e.g. on pie charts](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.pie.html)
* [Wikipedia: RMS Titanic](https://en.wikipedia.org/wiki/RMS_Titanic)
* [Stack Overflow, e.g. "How can I add a table of contents to an ipython notebook?"](https://stackoverflow.com/questions/21151450/how-can-i-add-a-table-of-contents-to-an-ipython-notebook)