# Titanic Data Evaluation

I will use the Titanic data set to ask whether survival was random, or dependent on any external criteria. In particular I will be focusing on three factors:
<ol>
<li>Was the <b>Passenger Class</b> a factor in the likelihood of survival?</li>
<li>Was <b>Gender</b> a factor in survival?</li>
<li>Was <b>Age</b> a factor in survival?</li>
</ol>

To do this, I will use the data from the provided "titanic-data.csv" file, which contains information on a sample of 891 of the 2,224 people on board.

## Procedure

### Step 1: Review the raw data
In order to answer the question, I will first extract the data from the csv file and examine it to determine the information it contains, and the type of data present.

### Step 2: Clean the data
The next step will be to clean the data by removing any irrelevant criteria which may be present, as well as combining data into groups to help reveal any patterns.

### Step 3: Perform analysis to answer the question
Once the dataset has been properly formatted, I will begin the analysis to determine whether survival was by chance or was affected by any of the three factors identified in my question.

<b>FIRST:</b>
I will perform a high-level review of the data to determine a baseline of overall survival.

<b>SECOND:</b>
I will then examine whether Class had an impact on survival rate.

<b>THIRD:</b>
I will analyze the data to see if gender was a factor.

<b>FOURTH:</b>
I will look at the effect of age (in terms of age group) on survival rate.


In [None]:
#Render plots inline
%matplotlib inline

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for all graphs
sns.set_style("dark")

In [None]:
# Read in the dataset and create a dataframe called "titanic_raw_data"
titanic_raw_data = pd.read_csv('titanic-data.csv')

In [None]:
# Print the first few records to review data format and categories
titanic_raw_data.head()

In [None]:
# Print the last few records
titanic_raw_data.tail()

In [None]:
# Descriptive statistics on the dataset:
titanic_raw_data.describe()

In [None]:
# Understand data types contained in the set
titanic_raw_data.dtypes

## Description of the raw data:

The dataset contains the following information:
<ol>
<li><b>PassengerId:</b> An index for all of the passengers running from 1 to 891, indicating 891 total</li>
<li><b>Survived:</b> A simple indicator for whether the passenger survived ("1") or died ("0")</li>
<li><b>Pclass:</b> An indicator of whether the passenger was booked as 1st Class ("1"), 2nd Class ("2"), or 3rd Class ("3")</li>
<li><b>Name:</b> The passenger's full name</li>
<li><b>Sex:</b> The passenger's gender</li>
<li><b>Age:</b> The passenger's age in years</li>
<li><b>SibSp:</b> The number of siblings/spouses the passenger had</li>
<li><b>Parch:</b> The number of parents/children the passenger had</li>
<li><b>Ticket:</b> The passenger's ticket number</li>
<li><b>Fare:</b> The actual fare paid by the passenger</li>
<li><b>Cabin:</b> The passenger's assigned cabin number</li>
<li><b>Embarked:</b> The port from which the passenger embarked on the journey</li>
</ol>

After reviewing this list I have determined that the <b>Name</b>, <b>SibSp</b>, <b>Parch</b>, <b>Ticket</b>, <b>Fare</b>, <b>Cabin</b>, and <b>Embarked</b> data are all irrelevant in the context of the question I wish to answer, and will remove them from the dataset:

In [None]:
# Remove unnecessary columns
titanic_trimmed_data = titanic_raw_data.drop(['Name','Ticket','Cabin','Fare','Embarked', 'Parch', 'SibSp'], axis=1)

In [None]:
# Confirm the columns were removed by checking the first few rows
titanic_trimmed_data.head()

At this point I will also create a new column called <b>Agegroup</b> to aggregate the passengers into one of three categories according to their age:
<ol>
<li><b>Child:</b> Containing passengers from 0 - 15 years of age</li>
<li><b>Young:</b> Containing passengers from 16 - 25 years of age</li>
<li><b>Mature:</b> Containing passengers 26+ years of age</li>
</ol>

I've chosen to use three broad age categories for convenience, and to facilitate a general analysis of the data.  The cutoff ages were selected based on a combination of broad societal standards, as well as biological maturity.  
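As a quick check of how these cutoffs behave at the boundaries, here is a minimal sketch using the same bin edges and labels as the Agegroup column. The sample ages are hypothetical values picked to sit on or near the edges; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges (not real passengers)
sample_ages = pd.Series([0.42, 15, 16, 25, 26, 80, None])

# Same edges and labels as the Agegroup column.
# By default pd.cut builds right-closed intervals: (0, 15], (15, 25], (25, 100]
sample_groups = pd.cut(sample_ages, [0, 15, 25, 100],
                       labels=["Child", "Young", "Mature"])

# Print each age next to the group it was assigned to
for age, group in zip(sample_ages, sample_groups):
    print(age, group)
```

With these edges an age of exactly 15 lands in Child and 25 lands in Young, while a missing age stays missing rather than being assigned a group. An age of exactly 0 would also come back as missing, because the left edge of the first interval is open.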

In [None]:
# Create new column called Agegroup with Child: 0 - 15, Young: 16 - 25, Mature: 26+

titanic_trimmed_data['Agegroup'] = pd.cut(titanic_trimmed_data['Age'], [0, 15, 25, 100], labels=["Child", "Young", "Mature"])
titanic_data = titanic_trimmed_data

I will confirm the new column by looking at the column headers and data types:

In [None]:
# Confirm column addition by checking the first few rows
titanic_data.head()

In [None]:
# Check data types
titanic_data.dtypes

The new <b>Agegroup</b> column needs to have its data type changed from "category" to "object":

In [None]:
# Convert 'Agegroup' dtype from category to object
titanic_data['Agegroup'] = titanic_data['Agegroup'].astype(object)
titanic_data_age_cleaned = titanic_data['Agegroup']
print(titanic_data.dtypes)

I also want to have a dataset in which the Survived categories are the strings "Deceased" and "Survived" instead of 0 and 1.  So I will create a copy of the titanic_data set as titanic_data_s to hold the conversion.

Now we can begin our detailed look at the impact of our three identified criteria, <b>Class</b>, <b>Gender</b>, and <b>Age</b>.

# Q1: Was Passenger Class a factor in survival?
As both the graph and the table above indicated, there is some evidence that Passenger Class may have played a part in survival rate, with 1st and 2nd class passenger survival being disproportionately higher than 3rd class passengers.

We will begin by generating a table that breaks down survival by passenger class:

In [None]:
survival_class_group = titanic_data_s.groupby(['Survived', 'Pclass'])
survival_class_group.describe()

The table above contains a considerable amount of information. We'll condense this by first summarizing the total number of passengers in each Passenger Class.

In [None]:
# Titanic passenger count by Pclass
# Note: count() tallies every passenger in the class, not only survivors
titanic_class_survival_count = titanic_data.groupby(['Pclass'])[['Survived']].count()
titanic_class_survival_count.plot(kind='bar', legend=False).set_ylabel('Passenger Count')
titanic_class_survival_count

Next we'll summarize Survival Rate by Class:

In [None]:
titanic_class_survival = titanic_data.groupby(['Pclass'])[['Survived']].mean().unstack().plot(kind='bar').set_ylabel('Survival Rate')
titanic_class_survival = titanic_data.groupby(['Pclass'])[['Survived']].mean()
titanic_class_survival

The counts above are passenger totals, since count() tallies every row in each class. From these we can see that although 3rd Class held more passengers than 1st and 2nd Class combined (491 compared to 216 and 184, respectively), the survival rate strongly indicates that Passenger Class was a major contributing factor in the likelihood of survival: 63% of 1st Class and 47% of 2nd Class passengers survived, while only about 1/4 of 3rd Class passengers did.
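To keep group size, survivor count, and survival rate from being confused with one another, the three can be computed side by side. The following is a minimal sketch on a hypothetical ten-row frame (the `toy` data and the column names `passengers`, `survivors`, and `rate` are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical miniature dataset: 3 first-class, 2 second-class, 5 third-class rows
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# count -> rows in the group, sum -> survivors (Survived is 0/1), mean -> survival rate
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    rate="mean",
)
print(summary)
```

In this toy frame the third class has the most passengers but the lowest rate, which is the same pattern of "largest group, lowest rate" discussed above.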

In order to determine whether class had an effect on passenger survival, we will run the chi-square test to check whether our observed results are statistically different from the expected results. Given that we have previously determined the overall survival rate, we can use it to calculate the expected number of survivors in each class and compare that to the observed number.

For this test:
<blockquote><b>H<sub>0</sub>:</b> There is no difference between the observed and expected survival for each class<br/>
<b>H<sub>1</sub>: </b> The observed survival is statistically different from the expected survival for class</blockquote>
Once again, we will employ <b>&#945;</b> = 0.05
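As a worked illustration of the "expected" side of this test, the sketch below multiplies the number of passengers in each class (216, 184, and 491) by the overall survival rate of roughly 38.4%. The variable names are my own, and the rounded rate means the figures are approximate.

```python
# Passengers per class and the rounded overall survival rate reported in this analysis
class_totals = {1: 216, 2: 184, 3: 491}
overall_rate = 0.384

# Under the null hypothesis every class survives at the overall rate,
# so the expected number of survivors is simply total * rate
expected_survivors = {
    pclass: total * overall_rate for pclass, total in class_totals.items()
}

for pclass, expected in sorted(expected_survivors.items()):
    print("Class %d: about %.1f expected survivors" % (pclass, expected))
```

The chi-square test then asks whether the observed survivor counts differ from these expected values by more than chance would explain.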

In [None]:
# Run the chi_sq function
chi_sq(titanic_data, ('Pclass'))

The chi-square test for Class gives p = 1.71 × 10<sup>-14</sup>, which is well below the significance level of 0.05, meaning we can reject the null hypothesis.  This test also confirms the t-test results, and allows us to conclude that Class was indeed a factor in survival rate.

# Q2: Was Gender a factor in survival?
While the initial evaluation of all survivors indicated that Passenger Class was a factor in survival, later confirmed through deeper testing, there was little information regarding whether women survived at a higher rate than men.

In order to consider whether gender had an effect, we will first compare the total number of male vs female passengers:

In [None]:
# Titanic passenger count by gender
# Note: count() tallies every passenger of each gender, not only survivors
titanic_gender_survival_count = titanic_data.groupby(['Sex'])[['Survived']].count()
titanic_gender_survival_count.plot(kind='bar', legend=False).set_ylabel('Passenger Count')
titanic_gender_survival_count

And next do the same to compare male vs female survival rate:

In [None]:
# Titanic Gender Survival Rate
titanic_gender_survival = titanic_data.groupby(['Sex'])[['Survived']].mean().unstack().plot(kind='bar').set_ylabel('Survival Rate')
titanic_gender_survival = titanic_data.groupby(['Sex'])[['Survived']].mean()
titanic_gender_survival

Much like the case for Passenger Class, we see here that the total number of male passengers is nearly double that of female passengers, yet the survival rates are drastically different, with females surviving at nearly 4 times the rate of males.

The numbers lead us to believe that gender was indeed a factor in survival.  This is well illustrated in the pie-graphs below.



### Male Survival

In [None]:
# Copy titanic_data
titanic_data_s = titanic_data.copy()

# Convert the Survived categories and confirm
# (assign the result back instead of using inplace=True on a column selection)
titanic_data_s['Survived'] = titanic_data_s['Survived'].replace({0:"Deceased", 1:"Survived"})
titanic_data_s.head()

In [None]:
titanic_data_s.dtypes

In [None]:
# Check to see that titanic_data is unchanged
titanic_data.head()

In [None]:
titanic_data.dtypes

In [None]:
# Run descriptive statistics on the entire dataset to see if manipulation had any effect on the data 
titanic_data_s.describe()

Looking at the data, we see that there are 177 passengers for whom Age is missing.  This will be important when we evaluate whether the passenger's age had any impact on their likelihood of survival, but for now we will just take note of this.

Finally, I will create a function <b>chi_sq</b> to test the impact of the three criteria we highlighted.

In [None]:
import scipy.stats  # needed for chisquare; not imported in the setup cell

def chi_sq(data_set, criteria):
    titanic_chi = data_set.copy()

    # Overall survival rate for the data set
    titanic_survived = titanic_chi['Survived'].mean()

    # Observed and expected survivor counts per group
    grouped = titanic_chi.groupby(criteria)['Survived']
    observed = grouped.sum()
    total = grouped.count()
    expected = total * titanic_survived
    proportion = observed / total

    # Collect the results in a data frame
    allData = pd.concat([observed, expected, total, proportion], axis=1)
    allData.columns = ["Observed", "Expected", "Total", "Observed Rate"]
    allData['Expected Rate'] = titanic_survived

    # Chi-square goodness-of-fit test on the survivor counts
    print(scipy.stats.chisquare(observed, expected))
    return allData
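
To make the statistic behind this function less of a black box, here is a minimal hand-computed sketch of a chi-square goodness-of-fit test using only the standard library. The observed and expected counts are hypothetical, and the closed-form p-value used here holds only for exactly two degrees of freedom (three groups), so this is an illustration rather than a replacement for the scipy call.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three groups (not Titanic data)
observed = [30, 20, 10]
expected = [20.0, 20.0, 20.0]

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

# With exactly 2 degrees of freedom the chi-square survival function
# reduces to exp(-x / 2), so no statistics library is needed here
p_value = math.exp(-statistic / 2)

print(statistic, degrees_of_freedom, p_value)
```

Groups whose observed count sits far from the expected count contribute most to the statistic, which is why one strongly deviating group can drive the whole test to significance.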
   

# General Survival Review
At this point I want to get an idea of the data set. First, let's check a breakdown of passengers according to age:

In [None]:
# A look at the age breakdown for all passengers
age_hist=titanic_data['Age'].plot.hist(bins=range(0, 86, 5), legend=True, alpha=0.8)
age_hist.set_ylabel('Passengers')
age_hist.set_xlabel('Age')
age_hist.set_title('Titanic Passenger Ages')
pd.DataFrame(titanic_data['Age'].describe())

As this shows, most of the passengers on the Titanic were between 20 and 40 years of age. Next, let's see the overall survival rate for all passengers.

In [None]:
# Determine the ratio of total survivors and fatalities
overall_titanic_survival = titanic_data_s.groupby('Survived')
pid_df = overall_titanic_survival.agg({'PassengerId' : 'size'})
pid_df / len(titanic_data)

From this, we see that only about 38.4% of the passengers survived.

In [None]:
# Comparison of survival according to passenger age to compare with overall ages

age_comp = titanic_data_s.groupby('Survived')['Age'].plot.hist(bins=range(0, 86, 5), legend=True, alpha=0.8)
plt.xlabel('Age')
plt.title('Survived vs Died by Age')
pd.DataFrame(overall_titanic_survival['Age'].describe())

Projecting the survival-by-age histogram on top of the overall age histogram shows a similar breakdown, with most survivors being between 20 and 40 years of age.  However, passengers younger than 20 seem to have suffered very few fatalities.  This leads us to suspect that younger passengers survived at a higher rate, which will be examined further later on.

In [None]:
# Male Survival Pie Chart

labels = 'Survived', 'Deceased'
# Compute the male survival rate once and reuse it for both slices
male_rate = titanic_data.groupby('Sex')['Survived'].mean()['male']
sizes = [male_rate, 1.0 - male_rate]
colors = ['lightblue', 'pink']
explode = (0.1, 0) # only "explode" survivors

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%')
plt.axis('equal')
plt.title('Male Survival')

### Female Survival

In [None]:
#Female Survival Pie Chart
labels = 'Survived', 'Deceased'
sizes = [titanic_data.groupby('Sex')['Survived'].mean()['female'], 1.0 - titanic_data.groupby('Sex')['Survived'].mean()['female']]
colors = ['lightblue', 'pink']
explode = (0.1, 0) # only "explode" survivors

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%')
plt.axis('equal')
plt.title('Female Survival')

These charts fully illustrate the degree to which females were more likely to survive than men, with approximately 75% of the females surviving compared to less than 20% of men surviving.

In order to confirm the impact of gender we again run the chi-square test to check whether our observed results are statistically different from the expected results.

For this test:
<blockquote><b>H<sub>0</sub>:</b> There is no difference between the observed and expected survival based on gender<br/>
<b>H<sub>1</sub>: </b> Survival by gender is not equal</blockquote>
Once again, we will employ <b>&#945;</b> = 0.05

In [None]:
# Run the chi_sq function
chi_sq(titanic_data, ('Sex'))

The chi-square test for Gender gives p = 3.97 × 10<sup>-37</sup>, which is well below the significance level of 0.05, meaning we can confidently reject the null hypothesis.  This test also confirms the t-test results, and allows us to conclude that Gender was a major factor in survival rate.

# Q3: Was age a factor in survival?

So far we have seen that both passenger class and gender had an impact on the likelihood of surviving.  Our last question will focus on the second part of the famous "Women and Children first!" nautical saying by examining whether passenger age was a factor in survival rate.


Again, we start by first summarizing the survival data according to age.

Remember that we had previously created an Age Group category for us to aggregate the passengers into three categories: 
<ol>
<li><b>Child:</b> 0 - 15 years old</li>
<li><b>Young:</b> 16 - 25 years old</li>
<li><b>Mature:</b> 26+ years old</li>
</ol>

In [None]:
# Clean Age data to remove blank ages
titanic_data_age_cleaned = titanic_trimmed_data.dropna()

# Confirm removal of NaN
titanic_data_age_cleaned.describe()

The new count total of 714 confirms that we no longer have passengers without any age data. 

At this point we have removed all of the passengers without any age data, and confirmed that there has been no change in the results, so we can proceed.

Now we'll look at both passenger counts and survival rate for each Age Group.

In [None]:
# Titanic passenger count by Age Group
# Note: count() tallies every passenger in the group, not only survivors
titanic_age_survival_count = titanic_data_age_cleaned.groupby(['Agegroup'])[['Survived']].count()
titanic_age_survival_count.plot(kind='bar', legend=False).set_ylabel('Passenger Count')
titanic_age_survival_count

In [None]:
# Titanic Age Group survival Rate
titanic_age_survival = titanic_data_age_cleaned.groupby(['Agegroup'])[['Survived']].mean().unstack().plot(kind='bar').set_ylabel('Survival Rate')
titanic_age_survival = titanic_data_age_cleaned.groupby(['Agegroup'])[['Survived']].mean()
titanic_age_survival

We can see that the total number of Child (0 - 15 years) passengers was much lower than that of either Young (16 - 25) or Mature (26+) passengers. This is not surprising: as our passenger breakdown above showed, most passengers in our data set were 20+ years old, so in absolute terms we would expect more survivors from those groups.

However, we also see that roughly 59% of passengers aged 0 - 15 years (Child group) survived, 34.4% of passengers aged 16 - 25 (Young group) survived, and 40% of passengers aged 26+ (Mature group) survived. This is compared to the overall survival rate of 38.4% (obtained from the review of overall data).

Even though this information does appear to support the hypothesis that age impacted survival rate, we will begin to confirm this by running the chi-square test.

For this test:
<blockquote><b>H<sub>0</sub>:</b> There is no difference between the observed and expected survival for each age group<br/>
<b>H<sub>1</sub>: </b> The observed survival is statistically different from the expected survival for age group</blockquote>
Once again, we will employ <b>&#945;</b> = 0.05

In [None]:
# Run the chi_sq function
chi_sq(titanic_data, ('Agegroup'))

The chi-square results (p = 0.005) show that indeed the Child group (0 - 15y) did survive at a rate statistically greater than expected.  However, these results also show that both Young and Mature were very close to the expected rate of survival, and that the greater overall number of Young and Mature survivors was due to the greater number of passengers in those groups.

# Other Observations
At this point it may be of interest to check survival rates when we combine multiple criteria to see whether one has a greater influence over another, or if there are any potential correlative relationships.

To do so, we will evaluate Class and Gender, Age Group and Gender, and Class and Age Group.

## Survival according to Class and Gender

In [None]:
# Break down total survival by Class and Gender
titanic_c_g_survival=titanic_data.groupby(['Sex', 'Pclass'])[['Survived']].mean().unstack().plot(kind='bar').set_ylabel('Survival Rate')
titanic_c_g_survival=titanic_data.groupby(['Sex', 'Pclass'])[['Survived']].mean()
titanic_c_g_survival

We can see that females, regardless of passenger class, were much more likely to survive than males, although the survival rate of 1st Class male passengers approaches that of 3rd Class female passengers.
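
For comparing two criteria at once, a wide table with one factor on each axis can be easier to read than a stacked index. Below is a minimal sketch using `pivot_table` on a hypothetical eight-row frame; the `toy` data is illustrative only and does not reproduce the Titanic rates.

```python
import pandas as pd

# Hypothetical miniature dataset with two passengers per gender/class combination
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are genders, columns are classes, and each cell is the mean of the
# 0/1 Survived flag, i.e. the survival rate for that combination
rates = toy.pivot_table(values="Survived", index="Sex",
                        columns="Pclass", aggfunc="mean")
print(rates)
```

Reading across a row shows the effect of class within one gender, and reading down a column shows the effect of gender within one class.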

## Survival according to Age Group and Gender

In [None]:
# Break down total survival by Age Group and Gender
titanic_a_g_survival=titanic_data.groupby(['Agegroup', 'Sex'])[['Survived']].mean().unstack().plot(kind='bar').set_ylabel('Survival Rate')
titanic_a_g_survival=titanic_data.groupby(['Agegroup', 'Sex'])[['Survived']].mean()
titanic_a_g_survival

In this case we again see that females have a higher likelihood of surviving than males, although the difference in the Child Age Group is small (65% survival rate for females, and 52.5% for males).

## Survival by Age Group and Class