# Titanic Survival

#### Grading:


- Code: 90 pts
- Markdown Documentation: 10 pts


We are going to study the survival rate of passengers on the Titanic and which variables affected survival.

Load the dataset in `titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

## Imports.


In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## Pandas Display Options.

In [2]:
# Publish a Table Schema representation of DataFrames for frontends that support it.
pd.set_option('display.html.table_schema', True)
# Max columns and rows to display.
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 8)

In [3]:
from IPython.display import HTML
#HTML(filename='../data/titanic.html')

In [4]:
# Reading .xls files requires xlrd: pip install xlrd
t_file = pd.ExcelFile('../data/titanic.xls')
t_df = t_file.parse("titanic", header=None)
t_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
| 1 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.338 | B5 | S | 2 | | St Louis, MO |
| 2 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11 | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | 135 | Montreal, PQ / Chesterville, ON |
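Because the sheet is parsed with `header=None`, the column names end up in row 0 and every column is addressed by position (0 for `pclass`, 1 for `survived`, 3 for `sex`, 4 for `age`). One possible alternative, sketched below on a tiny made-up frame rather than the real file, is to promote that first row to column labels and convert the numeric columns. The frame `raw` and its values are illustrative assumptions; parsing the sheet without `header=None` should give named columns directly, since pandas treats the first row as the header by default.

```python
import pandas as pd

# Tiny made-up stand-in for the sheet as parsed with header=None:
# row 0 holds the column names, the remaining rows hold the data.
raw = pd.DataFrame([
    ["pclass", "survived", "sex", "age"],
    [1, 1, "female", 29],
    [1, 0, "male", 30],
])

# Promote row 0 to column labels and keep only the data rows.
named = raw.iloc[1:].copy()
named.columns = raw.iloc[0].tolist()
named = named.reset_index(drop=True)

# The columns are still object dtype after the promotion,
# so convert the numeric ones explicitly.
for col in ("pclass", "survived", "age"):
    named[col] = pd.to_numeric(named[col])

print(named.dtypes)
```

The rest of this notebook keeps the positional column labels, so this sketch is only a pointer to a more readable layout.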


### Women and children first?

***1. Use the `groupby` method to calculate the proportion of passengers that survived by sex. (10 pts)***

# Survival Rate by Gender

In [5]:
# Work on a copy of the data rows (row 0 holds the column names) so that t_df stays untouched.
df_titanic = t_df.loc[1:, :].copy()

In [6]:
# For loop to iterate over both female/male arguments.
for gender in ("male", "female"):
    # Group by iteration of gender.
    df_gender = df_titanic.groupby(by = 3).get_group(gender)
    # Get the number of rows for total sample size.
    total_gender = df_gender.shape[0]
    # Get the number of survivors from the sample.
    total_survived = df_gender.groupby(by = 1).get_group(1).shape[0]
    # Calculate survival rate.
    survival_rate = total_survived / total_gender
    print(gender[0].upper()  + gender[1:], "passenger survival rate: {:.2%}".format(survival_rate))

Male passenger survival rate: 19.10%
Female passenger survival rate: 72.75%
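The same proportion can be read off a single `groupby` call, because the mean of a 0/1 column within a group is the share of 1s. The sketch below uses a small made-up frame with named, numeric columns (an assumption; the notebook's frame uses positional labels and object dtype), so the printed numbers are not Titanic results.

```python
import pandas as pd

# Made-up toy data, not the Titanic file: 1 = survived, 0 = did not survive.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female"],
    "survived": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column within each group is the proportion of 1s.
rate_by_sex = toy.groupby("sex")["survived"].mean()

for sex, rate in rate_by_sex.items():
    print("{} passenger survival rate: {:.2%}".format(sex.capitalize(), rate))
```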


***2. Calculate the same proportion, but by class and sex. (10 pts)***

In [7]:
# Work on a copy of the data rows (row 0 holds the column names) so that t_df stays untouched.
df_titanic = t_df.loc[1:, :].copy()

# Method to Calculate Gender and Pclass Survival Rate.

In [8]:
# Prints, for each pclass, the survivors in a single-sex subset as a share of that whole subset.
def dispaly_survival_2(_df_gender):
    # Finding the gender type.
    gender_type = _df_gender.iloc[:, 3].unique()[0]
    # Finding the number of rows, i.e. the total size.
    total_size = _df_gender.shape[0]
    # Iterating over every pclass.
    for pclass in (1, 2, 3):
        # Count the survivors in this pclass and divide by the size of the whole
        # single-sex subset (not by the number of passengers in this pclass).
        survival_rate = _df_gender.groupby(by = 0).get_group(pclass)\
            .groupby(by = 1).get_group(1).shape[0] / total_size
        print("{} and pclass {} survival rate : {:.2%}"\
            .format(gender_type[0].upper() + gender_type[1:], pclass, survival_rate))

# Survival Rate by Male and PClass

In [9]:
dispaly_survival_2(t_df.groupby(by = 3).get_group("male"))

Male and pclass 1 survival rate : 7.24%
Male and pclass 2 survival rate : 2.97%
Male and pclass 3 survival rate : 8.90%


# Survival Rate by Female and PClass

In [10]:
dispaly_survival_2(t_df.groupby(by = 3).get_group("female"))

Female and pclass 1 survival rate : 29.83%
Female and pclass 2 survival rate : 20.17%
Female and pclass 3 survival rate : 22.75%
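The function above divides each class's survivors by the number of passengers of that sex, which is why the three percentages for a sex add up to that sex's overall rate (7.24% + 2.97% + 8.90% ≈ 19.10% for males). A within-class survival rate would divide by the number of passengers of that sex in that class instead. The sketch below contrasts the two denominators on a small made-up frame with named columns; the data and the variable names are illustrative assumptions, not results from the Titanic file.

```python
import pandas as pd

# Made-up toy data, not the Titanic file.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 1, 2, 2, 2],
    "sex":      ["male"] * 4 + ["female"] * 4,
    "survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Survivors counted per (sex, pclass) cell.
survivors = toy.groupby(["sex", "pclass"])["survived"].sum()

# Denominator A: everyone of that sex, broadcast over the "sex" index level.
# This mirrors what the function above computes.
sex_totals = toy.groupby("sex")["survived"].count()
share_of_sex = survivors.div(sex_totals, level="sex")

# Denominator B: everyone of that sex in that class (within-class rate).
rate_within_class = toy.groupby(["sex", "pclass"])["survived"].mean()

print(pd.DataFrame({"share_of_sex": share_of_sex,
                    "rate_within_class": rate_within_class}))
```

Which denominator is appropriate depends on how "proportion by class and sex" is read; the within-class rate is the one that compares survival chances between classes.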


***3. Create age categories: children (under 14 years), adolescents (14-20), adults (21-64), and seniors (65+), and calculate survival proportions by age category, class and sex. (20 pts)***

In [11]:
# Work on a copy of the data rows (row 0 holds the column names) so that t_df stays untouched.
df_titanic = t_df.loc[1:, :].copy()

# Method to Calculate Age, Gender, and Pclass Survival Rate.

In [12]:
# Takes an age-filtered subset of the data and prints, for each sex and pclass,
# the survivors as a share of all passengers of that sex in the age group.
def display_survival_3(_df_age_group, _age_type):
    # For male/female iterators.
    for sex in ("male", "female"):
        # Get gendered subset.
        df_sex = _df_age_group.groupby(by = 3).get_group(sex)
        # Calculate total subset size.
        total_people = df_sex.shape[0]
        # Separating the male/female portions of data with header.
        print("*" * 15, sex[0].upper() + sex[1:], _age_type, "*" * 15)
        # For pclass iterators.
        for pclass in (1, 2, 3):
            # Try to calculate survival rate.
            try:
                # Group by the pclass and get current pclass iterator.
                df_pclass = df_sex.groupby(by = 0).get_group(pclass)
                # Get the total survived for this pclass.
                total_survived = df_pclass[df_pclass.loc[:, 1] == 1].shape[0]
                # Divide total survived by total gendered subset.
                proportion_survived = total_survived / total_people
                print(sex[0].upper() + sex[1:], _age_type, "and pclass", pclass,
                      "survival rate: {:.2%}".format(proportion_survived))
            # get_group raises KeyError when this pclass has no rows in the subset.
            except KeyError:
                print(sex[0].upper() + sex[1:], _age_type, "and pclass", pclass,
                      "survival rate: data does not exist")
        print()

# Child Survival Proportions

In [13]:
display_survival_3(df_titanic[df_titanic.loc[:, 4] < 14], "child")

*************** Male child ***************
Male child and pclass 1 survival rate: 9.43%
Male child and pclass 2 survival rate: 20.75%
Male child and pclass 3 survival rate: 22.64%

*************** Female child ***************
Female child and pclass 1 survival rate: 0.00%
Female child and pclass 2 survival rate: 30.43%
Female child and pclass 3 survival rate: 32.61%



# Adolescent Survival Proportions

In [14]:
# Keep rows with 14 <= age < 21 using a single boolean mask.
display_survival_3(df_titanic[(df_titanic.loc[:, 4] >= 14) &
                              (df_titanic.loc[:, 4] < 21)], "adolescent")

*************** Male adolescent ***************
Male adolescent and pclass 1 survival rate: 1.15%
Male adolescent and pclass 2 survival rate: 2.30%
Male adolescent and pclass 3 survival rate: 9.20%

*************** Female adolescent ***************
Female adolescent and pclass 1 survival rate: 23.81%
Female adolescent and pclass 2 survival rate: 19.05%
Female adolescent and pclass 3 survival rate: 30.16%



# Adult Survival Proportions

In [15]:
# Keep rows with 21 <= age < 65 using a single boolean mask.
display_survival_3(df_titanic[(df_titanic.loc[:, 4] >= 21) &
                              (df_titanic.loc[:, 4] < 65)], "adult")

*************** Male adult ***************
Male adult and pclass 1 survival rate: 9.09%
Male adult and pclass 2 survival rate: 1.98%
Male adult and pclass 3 survival rate: 7.71%

*************** Female adult ***************
Female adult and pclass 1 survival rate: 40.29%
Female adult and pclass 2 survival rate: 23.74%
Female adult and pclass 3 survival rate: 13.67%



# Senior Survival Proportions

In [16]:
display_survival_3(df_titanic[df_titanic.loc[:, 4] >= 65], "senior")

*************** Male senior ***************
Male senior and pclass 1 survival rate: 8.33%
Male senior and pclass 2 survival rate: 0.00%
Male senior and pclass 3 survival rate: 0.00%

*************** Female senior ***************
Female senior and pclass 1 survival rate: 100.00%
Female senior and pclass 2 survival rate: data does not exist
Female senior and pclass 3 survival rate: data does not exist
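The four age filters above can also be expressed as one categorical column built with `pd.cut`, which makes the bin edges explicit and lets a single `groupby` produce every cell at once. This is a minimal sketch on made-up data with named columns (both are assumptions); adding `"pclass"` to the grouping keys would extend it to the full age, class and sex breakdown. Reporting the group size next to the rate helps flag cells that rest on very few passengers.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not the Titanic file.
toy = pd.DataFrame({
    "age":      [2, 13, 14, 20, 21, 40, 64, 65, 70, np.nan],
    "sex":      ["female", "male", "female", "male", "female",
                 "male", "female", "male", "female", "male"],
    "survived": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Left-closed bins: [0, 14), [14, 21), [21, 65), [65, inf).
bins = [0, 14, 21, 65, np.inf]
labels = ["child", "adolescent", "adult", "senior"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# mean = survival rate within the cell, size = passengers in the cell.
# Rows with a missing age get no category and drop out of the groupby.
summary = (toy.groupby(["age_group", "sex"], observed=True)["survived"]
              .agg(["mean", "size"]))
print(summary)
```

Note that this computes within-cell rates, whereas the function above divides by the size of the sex-and-age subset, so the two sets of numbers are not directly comparable.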

