# Project 2 / Titanic-Lusitania Sleuthing

## Step 1: Background

#### Descriptions

The Titanic dataset shows the available data on each passenger aboard the RMS Titanic, which sank in the North Atlantic Ocean in April 1912. This includes their passenger class, name, sex, age, family, port of embarkation and other relevant information. The data also shows their outcome from the sinking of the Titanic, whether they survived or died. 

The Lusitania dataset has more in-depth information on the people aboard the RMS Lusitania, which was torpedoed and sunk by a German submarine in May 1915, during WWI. Unlike the Titanic dataset, it also covers the crew, broken down by department, such as Engineering Crew, Deck Crew, and Victualling Crew.

#### Question: How did the passengers of each vessel prioritize safety amidst a catastrophe?

## Step 2: Import Packages and Data

In [1]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
%matplotlib inline



#### 1. Read the data

In [2]:
path = "data/titanic.csv"

titanic = pd.read_csv(path)
titanic.head(3)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


In [3]:
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'ticket', u'fare', u'cabin', u'embarked', u'boat', u'body',
       u'home.dest'],
      dtype='object')

#### 2. Check that the age column doesn't have any unreasonable values 

In [4]:
titanic.describe()



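| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 0.381971 | 29.881138 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413493 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | NaN |
| 50% | 3.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | NaN |
| 75% | 3.0 | 1.0 | NaN | 1.0 | 0.0 | NaN | NaN |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |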
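
In the summary above, the minimum age of 0.17 and the maximum of 80 are both plausible for passengers. A more explicit check is to filter for ages outside a chosen range. The sketch below is only an illustration: the helper name `find_unreasonable_ages`, the 0 to 120 bounds, and the toy DataFrame are assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_unreasonable_ages(df, column="age", low=0, high=120):
    """Return the rows whose age is present but falls outside [low, high].

    The bounds are illustrative assumptions, not values from the notebook.
    """
    ages = df[column]
    # Missing ages are not "unreasonable"; they are handled separately.
    mask = ages.notnull() & ((ages < low) | (ages > high))
    return df[mask]


# Toy data standing in for the real file (values invented for illustration).
toy = pd.DataFrame({"age": [29.0, 0.17, np.nan, 80.0, -3.0, 150.0]})
suspect = find_unreasonable_ages(toy)
print(suspect)
```

On the real data, the same call would be `find_unreasonable_ages(titanic)`; an empty result would indicate that no recorded age falls outside the chosen bounds.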


#### 3. Check for missing values.  How do you know that a value is missing?

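Yes, there are missing values. pandas marks them as `NaN` ("not a number"), and the `count` row of `describe()` reveals them: `age` has only 1046 non-null values out of 1309 rows.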
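
Counting the missing entries per column makes the gaps explicit. The following is a minimal sketch on a made-up DataFrame (the values and the variable names `missing_counts` and `missing_share` are assumptions for illustration); `isnull()` is the same pandas method the notebook uses later on the `age` column.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values invented for illustration).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "sex": ["female", "male", "female", "male"],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_counts = toy.isnull().sum()
# The mean of the boolean mask is the fraction of rows that are missing.
missing_share = toy.isnull().mean()

# Show only the columns that actually have gaps.
print(missing_counts[missing_counts > 0])
```

Applied to the real data, `titanic.isnull().sum()` would list the number of missing entries in every column at once.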

#### 4. Does it make sense to guess at the value?

We should try to make an educated guess at the values. If we merely drop the NaNs, then we're deleting relevant data, which may compromise the analysis.

## Step 3: Data Imputation

#### 5. Well let’s say that it does... You likely noticed that Age has some missing values. How many are missing?

In [5]:
print "Missing age values:", titanic['age'].isnull().sum()

Missing age values: 263


#### 6. For the age of the passengers, how would you guess at the missing values using the other data present in the CSV?

In [6]:
print "Mean age:", titanic['age'].mean()
print "Median age:", titanic['age'].median()

Mean age: 29.8811376673
Median age: 28.0
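
The overall mean and median give one number for everyone. A guess that uses the other columns could instead fill each missing age with the median of passengers sharing the same sex and class. The sketch below shows that idea on toy data; the function name `impute_age_by_group`, the choice of `sex` and `pclass` as grouping columns, and the sample values are assumptions, not the notebook's own method.

```python
import numpy as np
import pandas as pd


def impute_age_by_group(df, group_cols=("sex", "pclass"), column="age"):
    """Fill missing ages with the median age of each (sex, pclass) group.

    Returns a copy; the input frame is left untouched.
    """
    out = df.copy()
    # transform("median") returns a Series aligned with the original rows,
    # holding each row's group median (NaNs are skipped when computing it).
    group_median = out.groupby(list(group_cols))[column].transform("median")
    out[column] = out[column].fillna(group_median)
    # Fall back to the overall median if an entire group had no ages.
    out[column] = out[column].fillna(df[column].median())
    return out


# Toy data (values invented for illustration).
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})
filled = impute_age_by_group(toy)
print(filled)
```

Group medians are used rather than means because they are less sensitive to a few very old or very young passengers in a group.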


## Step 4: Group Statistics

#### 7. Are there any groups that were especially adversely affected in the Titanic wreck? 

Men fared worst overall, and men in second and third class worst of all: each of those two groups had roughly a 15% survival rate, compared with about 34% for first-class men.

In [7]:
pd.pivot_table(titanic, index=['sex', 'pclass'], values=['age', 'survived'])

| sex | pclass | age | survived |
|---|---|---|---|
| female | 1 | 37.037594 | 0.965278 |
| female | 2 | 27.499223 | 0.886792 |
| female | 3 | 22.185329 | 0.490741 |
| male | 1 | 41.029272 | 0.340782 |
| male | 2 | 30.81538 | 0.146199 |
| male | 3 | 25.962264 | 0.15213 |


#### 8. Survival Rate by Sex
Females ultimately had a much higher survival rate than males.

In [8]:
titanic.groupby(by='sex')['survived'].mean()

sex
female    0.727468
male      0.190985
Name: survived, dtype: float64

#### 9. Survival Rate by Passenger Class
There appears to be a strong relationship between class and survival rate: first-class had the highest survival rate, then second, then third.

In [9]:
titanic.groupby(by='pclass')['survived'].mean()

pclass
1    0.619195
2    0.429603
3    0.255289
Name: survived, dtype: float64

## Step 5: Comparing the Titanic with Lusitania

In [10]:
# Import data
path1 = "data/lusitania.csv"
lusitania = pd.read_csv(path1)
lusitania.head()

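| | Family name | Title | Personal name | Fate | Age | Department/Class | Passenger/Crew | Citizenship | Position | Status | ... | Country | Lifeboat | Rescue Vessel | Body No. | Ticket No. | Cabin No. | Traveling Companions and other notes | Value | Adult/Minor | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAMERON | Mr. | Charles W. | Lost | 38 | Band | Crew | British | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | Adult | Male |
| 1 | CARR-JONES | Mr. | E. | Lost | 37 | Band | Crew | British | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | Adult | Male |
| 2 | DRAKEFORD | Mr. | Edward | Saved | 30 | Band | Crew | British | Violin | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | Adult | Male |
| 3 | HAWKINS | Mr. | Handel | Saved | 25 | Band | Crew | British | Cello | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | Adult | Male |
| 4 | HEMINGWAY | Mr. | John William | Saved | 27 | Band | Crew | British | Double Bass | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | Adult | Male |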


In [11]:
lusitania.columns

Index([u'Family name', u'Title', u'Personal name', u'Fate', u'Age',
       u'Department/Class', u'Passenger/Crew', u'Citizenship', u'Position',
       u'Status', u'City', u'County', u'State', u'Country', u'Lifeboat',
       u'Rescue Vessel', u'Body No.', u'Ticket No.', u'Cabin No.',
       u'Traveling Companions and other notes', u'Value', u'Adult/Minor',
       u'Sex'],
      dtype='object')

In [12]:
lusitania.shape

(1961, 23)

## Step 6: Data Cleaning

In [13]:
# Drop unwanted columns
lusitania.drop(['Title', 'Personal name', 'Position', 'City', 
                'County', 'Traveling Companions and other notes', 
                'Lifeboat', 'Rescue Vessel'], axis=1, inplace=True)

#### 10. Create a function to help clean the data

In [14]:
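def clean_age_column(age):
    """Convert a raw Lusitania age entry to a number where possible."""
    # Non-string values (already numeric, or NaN) pass through unchanged.
    # Note: basestring exists only in Python 2.
    if not isinstance(age, basestring):
        return age
    # "<n>-months": convert months to a fraction of a year
    case_ = re.search(r'(\d+)-months', age)
    if case_:
        return int(case_.group(1)) / 12.
    # "<a> or <b>": two candidate ages, take the midpoint
    case_ = re.search(r'(\d+) or (\d+)', age)
    if case_:
        c1, c2 = case_.groups()
        return (int(c1) + int(c2)) / 2.
    # "<a> (<b>?)": stated age plus an uncertain alternative, take the midpoint
    case_ = re.search(r'(\d+) \((\d+)\?\)', age)
    if case_:
        c1, c2 = case_.groups()
        return (int(c1) + int(c2)) / 2.
    # "<n>?" or "<n> ?": uncertain age, keep the stated number
    case_ = re.search(r'(\d+)\s?\?', age)
    if case_:
        return int(case_.group(1))
    # "<n> (?)": uncertain age, keep the stated number
    case_ = re.search(r'(\d+) \(\?\)', age)
    if case_:
        return int(case_.group(1))
    # Remaining special cases
    if age == 'Infant':
        return 1
    elif age == '2_':
        # Second digit is unreadable; treated as 25
        return 25
    elif age == '?':
        return np.nan
    else:
        # Plain numeric strings are converted later by astype(float)
        return age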

#### 11. Clean the Age column

In [15]:
# Remove the spaces
lusitania.columns = [x.strip().replace(' ', '') for x in lusitania.columns]

In [16]:
cleaned_age = lusitania['Age'].apply(clean_age_column)
cleaned_age.unique()

array(['38', '37', '30', '25', '27', '48', nan, '24', 19, '57', '50', '56',
       '41', '19', '33', '29', '18', '20', '21', '26', '17', '58', '47',
       '54', '35', '43', '59', '53', '44', '51', '40', '49', '42', '32',
       '31', '34', '22', '45', '36', 29, '52', '23', '60', '28', '16',
       '46', '15', '39', 63, '55', '64', 53, 0.75, '6', '9', '14', '10',
       '12', '62', '5', '8', '65', '68', '76', '61', '63', 0.25, '1.5',
       '2.5', 1.5, '3', '2', 25, 0.6666666666666666, '4', 1.25,
       1.1666666666666667, 1, 0.5, 49.0, 48.0, 22, 1.4166666666666667,
       1.0833333333333333, 0.16666666666666666, 0.4166666666666667, '11',
       61.5, 57.0, 38.0, 27.0, 31.5, 23.5, '7', 42, 31, 0.8333333333333334,
       '70', 62, '13', 0.9166666666666666, 30, 34, '1', 43.0, '67', '73',
       '72', '4.25', '69', 26, 54, 21, 16], dtype=object)

In [17]:
lusitania['Age'] = cleaned_age.astype(float)

In [18]:
print lusitania['Age'].describe()

count    1307.000000
mean       32.298202
std        14.365629
min         0.166667
25%              NaN
50%              NaN
75%              NaN
max        76.000000
Name: Age, dtype: float64


#### 12. Are there any missing values in the Lusitania dataset?

Yes. Age is missing for 654 of the 1,961 records (about a third), which is too many to impute without risking a skewed analysis, so we leave age out.

In [19]:
print "Lusitania missing age values:", lusitania['Age'].isnull().sum()

Lusitania missing age values: 654


#### 13. Are there any groups that were especially adversely affected in the Lusitania wreck? 

The data show no evidence of an association between `Sex` and `Fate`. The chi-square test of independence gives a p-value of 0.2658, which is above 0.05, so we fail to reject the null hypothesis that fate was independent of sex.

In [20]:
# Note: the sparse 'Not on board' and 'Saved (died from trauma)' categories
# should ideally be collapsed or dropped before running the test
table = pd.crosstab(lusitania['Sex'],lusitania['Fate'])
table

| Sex (rows) / Fate (columns) | Lost | Not on board | Saved | Saved (died from trauma) |
|---|---|---|---|---|
| Female | 325 | 1 | 191 | 1 |
| Male | 868 | 0 | 572 | 3 |


In [21]:
chi, p, df, expected = chi2_contingency(table)
print "p-value:", p

p-value: 0.265833818484
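
The crosstab includes two categories with almost no entries, which makes the chi-square approximation less reliable. The sketch below reruns the test on a 2×2 table built from the counts shown above. How to collapse the sparse categories is an assumption made here, not something the notebook specifies: "Not on board" is dropped, and "Saved (died from trauma)" is counted as "Lost" because those people did not survive. The variable names `raw`, `collapsed`, and `survival_rate` are likewise illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the Sex x Fate crosstab.
raw = pd.DataFrame(
    {
        "Lost": [325, 868],
        "Not on board": [1, 0],
        "Saved": [191, 572],
        "Saved (died from trauma)": [1, 3],
    },
    index=["Female", "Male"],
)

# Assumption: drop "Not on board" and count trauma deaths as "Lost".
collapsed = pd.DataFrame({
    "Lost": raw["Lost"] + raw["Saved (died from trauma)"],
    "Saved": raw["Saved"],
})

# For a 2x2 table, chi2_contingency applies Yates' continuity
# correction by default.
chi2, p, dof, expected = chi2_contingency(collapsed.values)

# Share of each sex that survived under this collapsing.
survival_rate = collapsed["Saved"] / collapsed.sum(axis=1)

print("p-value: {:.4f}".format(p))
print(survival_rate)
```

Under this collapsing, about 37% of women and 40% of men were saved, and the test still does not reject independence at the 0.05 level.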


## Conclusion

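On the Titanic, the survival rates are consistent with women being given priority: about 73% of women survived versus 19% of men, and first-class passengers fared better than those in second and third class. The Lusitania data show no such pattern: the chi-square test finds no statistically significant association between sex and fate (p ≈ 0.27). One possible explanation is the nature of the disaster. The Lusitania had just been torpedoed by a German submarine and sank in about 18 minutes, which may have put those aboard into a panicked state of emergency. Crew and passengers may have had a "save yourself" mentality, and may not have taken additional steps to save women and children.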