# Titanic Survivability
## Introduction
<p>On 15 April 1912, the Titanic, the largest ship of its time, sank in the North Atlantic Ocean after hitting an iceberg. Of the estimated 2,224 people on board, only 705 survived. The lifeboats, although limited, had room for 1,178 people, yet fewer than that were saved.</p>

## Questions
<p>How likely was a passenger to survive the tragedy?</p>

- If you were rich, were you more likely to be prioritized?
- "Women and Children First": did your age or gender influence your chances of survival?
    
### Objectives
<p>This study analyzes the likelihood of survival of passengers on board the Titanic. The analysis is divided into Demographics and Socioeconomic Status. The former is based on Gender and Age, and the latter on Ticket Class and Fare.</p>

### Variables
Dependent Variable: Whether the passenger survived or not. <br>
Independent Variables: 1. Gender 2. Age 3. Ticket Class 4. Fare. <br>
Null Hypothesis: The likelihood of surviving the event is not influenced by demographics and socioeconomic status. <br>
Alternative Hypothesis: The likelihood of survival is influenced by the demographics and socioeconomic status of the passengers.
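
The study does not say which statistical test is used to evaluate the null hypothesis. As one possible approach for a categorical variable such as Gender, a chi-square test of independence compares the observed survival counts with the counts expected if survival were independent of the group. The sketch below computes the statistic by hand with NumPy. The function name `chi_square_statistic` and the counts in `observed` are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np


def chi_square_statistic(observed):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    total = observed.sum()
    # Counts expected in each cell if the row and column variables were independent.
    expected = row_totals * col_totals / total
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof


# Hypothetical counts (NOT Titanic data): rows = two groups, columns = died / survived.
observed = [[30, 10],
            [10, 30]]
statistic, dof = chi_square_statistic(observed)
print(statistic, dof)
```

A statistic that is large relative to the chi-square distribution with `dof` degrees of freedom would count as evidence against the null hypothesis.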

## Data Wrangling
### Data Acquisition
<p>The data provided lists 891 of the 2,224 passengers on board, with the corresponding information for each. Below is the Data Dictionary of the data set from [Kaggle](https://www.kaggle.com/c/titanic/data).</p>

- survival: Survival (0 = No, 1 = Yes)
- pclass: Ticket class (1st = Upper, 2nd = Middle, 3rd = Lower)
- sex: Sex
- Age: Age in years (Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5)
- sibsp: # of siblings / spouse aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister, Spouse = husband, wife (mistresses and fiancés were ignored))
- parch: # of parents / children aboard the Titanic (Parent = mother, father, Child = daughter, son, stepdaughter, stepson, Some children travelled only with a nanny, therefore parch=0 for them.)
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [4]:
import pandas as pd
import numpy as np

titanic_df = pd.read_csv('titanic-data.csv') # Read the CSV file and store it in the titanic_df variable.

### Data Cleaning
<p>Once the file is loaded, I check whether there are duplicate values in any column that could affect the analysis. I also look for inconsistent values or data types, and for missing values, that may affect the investigation.</p>

In [5]:
# PassengerId and Name must be unique, so I check each of these columns for duplicate values.
# There are no duplicates in either column.
print(titanic_df.duplicated('PassengerId').sum())
print(titanic_df.duplicated('Name').sum())
# I also check whether the Ticket number is unique. It turns out that it is not unique for each passenger.
# This seems odd, so I take note and will come back to it if needed.
print(titanic_df.duplicated('Ticket').sum())

0
0
210
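
One possible explanation for the 210 repeated ticket numbers is that several passengers travelled on a shared ticket. A minimal sketch of how such groups could be inspected, using a small hand-made frame (`toy_df` is illustrative and not the real data) with the same `Ticket` column name:

```python
import pandas as pd

# Toy data for illustration only (not the real Titanic records).
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Ticket': ['T1', 'T2', 'T1', 'T3', 'T1'],
})

# duplicated() flags every repeat after the first occurrence of a value.
n_duplicates = toy_df.duplicated('Ticket').sum()

# Count passengers per ticket and keep only tickets held by more than one passenger.
ticket_counts = toy_df.groupby('Ticket').size()
shared_tickets = ticket_counts[ticket_counts > 1]

print(n_duplicates)
print(shared_tickets)
```

Note that `duplicated()` counts repeats rather than tickets, so a ticket held by three passengers contributes two to the total.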


In [6]:
# I check the data type of each column for any inconsistencies. It seems odd to have Age as a float64,
# so I investigate next by printing some rows with a non-whole-number age.
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
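
Fractional ages are one reason for `Age` being a `float64`. Another likely reason is that pandas represents a missing value in a default numeric column as `NaN`, which is a floating-point value, so a single missing entry makes the whole column float. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

whole_ages = pd.Series([22, 38, 26])        # whole numbers, nothing missing
with_missing = pd.Series([22, 38, np.nan])  # same idea, but one value is missing

# The first Series keeps an integer dtype; the second is promoted to float64.
print(whole_ages.dtype)
print(with_missing.dtype)
```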

In [7]:
# There are 7 entries with an age below 1. Looking at their names, I see the title "Master"
# (given to boys) or "Miss". In these cases, they were babies under the age of 1.
non_whole = titanic_df['Age'] < 1
titanic_df[non_whole]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
305,306,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
831,832,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S


In [8]:
# I check the head of the data set.
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
# and the tail.
titanic_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [10]:
# Looking at the Age column, I see that there are some empty fields. I check how many there are.
missing_age = titanic_df['Age'].isnull()
print('There are {} logs with their Age not specified.'.format(missing_age.sum()))

There are 177 logs with their Age not specified.
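
The same check can be run over every column at once, which would also show whether columns other than `Age` have gaps. A sketch on a small made-up frame (`toy_df` and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```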



### Women and Children First
<p>The 177 records without a specified age will affect the analysis if the second question is answered based on Age alone. I attempt to limit the discrepancy created by the missing ages by rephrasing the question to compare the survivability of women and children with that of adult male passengers.</p>

In [15]:
titanic_df['womChil'] = 0 # Create a column that flags women and children.
women = titanic_df['Sex'] == 'female' # Criterion: all female passengers.
child = titanic_df['Age'] < 19 # Criterion: all passengers under the age of 19.
# For passengers with a missing age, I identify the boys among the male passengers by looking
# for the title 'Master' in their names, a title given to young boys.
masters = titanic_df['Name'].str.contains('Master') # Criterion: all male children.
# Use .loc so the assignment is made on titanic_df itself rather than on a copy of a slice.
titanic_df.loc[women | child | masters, 'womChil'] = 1 # Set 1 (yes) where any criterion is met.
women_children = titanic_df.groupby('womChil') # Group passengers by the women-and-children flag.
women_children_survived = women_children['Survived'].sum().loc[1] # Survivors in the flagged group.
print("Of the 342 passengers that survived from the sample of 891 on the data provided, {} are "
      "women and children.".format(women_children_survived))

Of the 342 passengers that survived from the sample of 891 on the data provided, 259 are women and children.
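
The count of 259 says how many survivors were women or children, but not how likely each group was to survive. Since `Survived` is coded 0/1, its mean within a group is that group's survival rate. A sketch with made-up rows (the `womChil` column name follows the cell above; the values are illustrative only):

```python
import pandas as pd

# Toy data for illustration only (not the real Titanic records).
toy_df = pd.DataFrame({
    'womChil':  [1, 1, 1, 1, 0, 0, 0, 0],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
survival_rate = toy_df.groupby('womChil')['Survived'].mean()
print(survival_rate)
```

Comparing rates rather than raw counts avoids favoring whichever group happens to be larger in the sample.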


In [None]:
# Up to this point, it had not occurred to me to check how many passengers survived among the
# 891 names in the data provided. A quick check shows that 342 of the 891 passengers survived.
survived = titanic_df['Survived']
survived.sum()

In [None]:
# Knowing that 83 of the survivors do not fit the criteria of women and children, I double-check
# the remaining survivors that do not have an age specified to see if there are clues that
# distinguish them as women or children.
non_womChil = titanic_df['womChil'] == 0
titanic_df[non_womChil & (titanic_df['Survived'] == 1) & missing_age].sort_values('Pclass')

In [None]:
# I quickly check the values with describe to find out if there are more inconsistencies.
titanic_df.describe()

In [None]:
# I check whether Sex has missing values. It seems that everything is in order.
male = titanic_df['Sex'] == 'male'
female = titanic_df['Sex'] == 'female'
total_sex = female.sum() + male.sum()
"There are {} males and {} females on board. Total of {} people".format(male.sum(),
                                                                female.sum(), total_sex)

In [None]:
# I decide to use only the PassengerId, Survived, Pclass, Sex, Age and Fare columns for my analysis.
# I remove the rest of the columns, which I do not need.
titanic_df_neat = titanic_df[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'Fare']].copy()
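
With the columns chosen for the analysis, survival rates could be broken down by Ticket Class and Gender together. One possible way is `pivot_table`, sketched here on made-up rows (`toy_df` is illustrative and not the real data; only the column names match the data set):

```python
import pandas as pd

# Toy data for illustration only (not the real Titanic records).
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are ticket classes, columns are genders, cells are mean survival (the rate).
rates = toy_df.pivot_table(index='Pclass', columns='Sex',
                           values='Survived', aggfunc='mean')
print(rates)
```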