# Project 2:  Lucid Titanic Sleuthing

#### Category descriptions:
(from Kaggle description) https://www.kaggle.com/c/titanic/data  

**Pclass** is a proxy for socio-economic status (SES)  
- 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

**Age** is in Years; Fractional if Age less than One (1)
- If the Age is Estimated, it is in the form xx.5

The following are the definitions used for sibsp and parch:
- **Sibling**:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
- **Spouse**:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
- **Parent**:   Mother or Father of Passenger Aboard Titanic
- **Child**:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

## Part 1: Developing an understanding of the data

#### Based on the description of the data you read in the readme describe in your own words this data.

This data describes various information about a subset of the passengers of the Titanic:
- demographic data (e.g., age, sex, marital status)
- relationships with other passengers (e.g., sibling, spouse, parent, child)
- logistical details (e.g., ticket number, port of embarkation, cabin)

However, one field matters most for our purposes:
- survival (1=survived, 0=did not)





#### Based on our conceptual understanding of the columns in this data set, what is a reasonable range of values for the Sex, Age, SibSp, and Parch columns?

Range of values for:  
- Sex: male or female | a binary category in this data set
- Age: 0 to 90 | ages are in years, and it is unlikely anyone aboard was over 90
- SibSp: 0 to 5 | number of accompanying siblings or spouses; 0 for passengers travelling without any
- Parch: 0 to 5 | number of accompanying parents or children; 0 for passengers travelling without any
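
A quick way to test expectations like these is to count the values that fall outside each range. The sketch below is only illustrative: the toy rows, the `expected` bounds (with a lower bound of 0 assumed for SibSp and Parch, since a passenger can travel alone), and the helper `out_of_range` are all made up for this example rather than taken from the assignment.

```python
import pandas as pd

# Toy rows standing in for the real file; the values are made up for illustration.
sample = pd.DataFrame({
    'sex':   ['female', 'male', 'male', 'female'],
    'age':   [29.0, 0.92, None, 130.0],   # 130 is deliberately out of range
    'sibsp': [0, 1, 0, 1],
    'parch': [0, 2, 0, -1],               # -1 is deliberately out of range
})

# Assumed bounds for each numeric column (guesses, not facts about the data).
expected = {'age': (0, 90), 'sibsp': (0, 5), 'parch': (0, 5)}

def out_of_range(frame, bounds):
    """Count non-missing values outside [low, high] for each column."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = frame[col].dropna()
        counts[col] = int((~values.between(low, high)).sum())
    return counts

# Count sex labels that are neither 'male' nor 'female'.
bad_sex = int((~sample['sex'].isin(['male', 'female'])).sum())
violations = out_of_range(sample, expected)
print(bad_sex, violations)
```

Missing values are dropped before the comparison, so they are not reported as range violations; they are handled separately later on.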

### Open the data in Sublime Text. Is there anything that jumps out at you?

#### Observations:
- There are missing values
- There seem to be more missing values for the lower classes, especially the 3rd class passengers
- Although roughly 2,200 people (passengers and crew) were aboard the Titanic, this data has only 1309 entries
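
The impression that lower classes have more gaps can be checked numerically once the data is loaded. The sketch below shows one possible way, using a made-up toy frame rather than the real file, so its numbers say nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'age':    [29.0, 40.0, np.nan, np.nan, 22.0, np.nan],
    'cabin':  ['B5', 'C22', np.nan, np.nan, np.nan, np.nan],
})

# isnull() marks each missing cell; averaging the 0/1 flags within each
# class gives the share of missing values per class and column.
flags = toy[['age', 'cabin']].isnull().astype(int)
missing_by_class = flags.groupby(toy['pclass']).mean()
print(missing_by_class)
```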

## Part 2: reading the data in

#### Now read the data into a Pandas DataFrame

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('assets/titanic.csv')
# according to df.shape, there are 14 columns and 1309 records
df.head(2)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


#### Check that the age column doesn't have any unreasonable values 

In [9]:
# check min and max values for age
print(df['age'].min())
print(df['age'].max())

0.17
80.0


#### Check for missing values.  How do you know that a value is missing?

In [8]:
# we can do this with isnull:
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [7]:
# or we can do it with the info method:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.2+ KB


#### Does it make sense to guess at the missing values?

It might be possible for age at least: appellations (titles) appear in most names, and they generally indicate whether someone was a child or an adult. We could fill in the mean age for each title.

Cabin, Boat, and Body are each missing far too many values for imputation to be useful.
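
To make "far too many" concrete, the missing counts reported by `isnull().sum()` above can be turned into percentages of the 1309 rows. This is a small sketch that reuses only those reported counts:

```python
# Missing counts copied from the isnull().sum() output above; 1309 rows in total.
n_rows = 1309
missing = {'age': 263, 'fare': 1, 'cabin': 1014, 'embarked': 2,
           'boat': 823, 'body': 1188, 'home.dest': 564}

# Percentage of rows missing per column, largest first.
share = sorted(((col, round(100.0 * n / n_rows, 1)) for col, n in missing.items()),
               key=lambda pair: pair[1], reverse=True)
for col, pct in share:
    print('{:<10}{:>6}%'.format(col, pct))
```

On the loaded DataFrame, `df.isnull().mean()` should give the same shares directly as fractions.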

## Part 3: data imputation

#### Well let’s say that it does... You likely noticed that Age has some missing values. How many are missing?

There are 263 null values in the Age column.

#### For the Age of the passengers ... how would you guess at the missing values using the other data present in the CSV?

For the missing age values, I group the passengers by their appellation (title), take the mean age for each title, and use that mean as the replacement for the missing values in that group.
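
As a compact alternative to splitting each name by hand, the title can also be pulled out with a regular expression. This is only a sketch: the first two names come from the data preview above, the third is an illustrative example of a multi-word title, and the pattern assumes every name has the form "Surname, Title. Given names".

```python
import pandas as pd

names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',  # illustrative
])

# Capture everything between the first comma and the next period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```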

In [10]:
# determine the different appellations (titles) used in passenger names and how often each occurs

apellation = []  # list of unique appellations
apell_count = {} # count of each unique appellation

for i in df['name']:
    t = i.split(',')
    u = t[1].strip()
    v = u.split('.')
    w = v[0]
    apell_count[w] = apell_count.get(w, 0) + 1  # update our dictionary count
    if w not in apellation:
        apellation.append(w)  # add uniques to list

print(apellation)
print(len(apellation))
list(apell_count.items())


['Miss', 'Master', 'Mr', 'Mrs', 'Col', 'Mme', 'Dr', 'Major', 'Capt', 'Lady', 'Sir', 'Mlle', 'Dona', 'Jonkheer', 'the Countess', 'Don', 'Rev', 'Ms']
18


[('Sir', 1),
 ('Major', 2),
 ('the Countess', 1),
 ('Don', 1),
 ('Mlle', 2),
 ('Capt', 1),
 ('Dr', 8),
 ('Lady', 1),
 ('Rev', 8),
 ('Mrs', 197),
 ('Dona', 1),
 ('Jonkheer', 1),
 ('Master', 61),
 ('Ms', 2),
 ('Mr', 757),
 ('Mme', 1),
 ('Miss', 260),
 ('Col', 4)]

So there are:
- 18 unique appellations
- many are one-offs, but we have a fairly large sample size for 'Mr' (adult male), 'Mrs' (married female), 'Master' (young male), and 'Miss' (unmarried female, often but not always young).

In [11]:
# wrap the appellation extraction steps in a function so we can apply it to every name
def find_appel(i):
    t = i.split(',')
    u = t[1].strip()
    v = u.split('.')
    w = v[0]
    return w

# add an appellation column to our dataframe
df['apell'] = df['name'].apply(find_appel)
df.head(1)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,apell
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO",Miss


In [12]:
# create dataframe grouped by appellation with associated counts for name
# and age and add a mean age column
df_apell = df.groupby("apell")[['name', 'age']].count()  # a list of columns needs double brackets
df_apell['mean_age'] = df.groupby("apell")['age'].mean() # adds the mean_age column
df_apell.head(15)

apell,name,age,mean_age
Capt,1,1,70.0
Col,4,4,54.0
Don,1,1,40.0
Dona,1,1,39.0
Dr,8,7,43.571429
Jonkheer,1,1,38.0
Lady,1,1,48.0
Major,2,2,48.5
Master,61,53,5.482642
Miss,260,210,21.774238


#### Observations:
- We now have a mean age for each category of appellation. To be more accurate, we should probably condense these further (e.g., combine "Don" with "Mr", and perhaps "Ms", "Mme", and "Mlle" with "Miss").
- Comparing the name and age counts also shows how many ages are missing in each category (e.g., 8 of 61 for "Master" and 50 of 260 for "Miss").
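
Condensing the rare titles could look something like the sketch below. The mapping `title_map` is hypothetical and simply follows the grouping suggested above; which titles belong together is a judgment call (for example, "Mme" could arguably go with "Mrs" instead).

```python
import pandas as pd

# Hypothetical mapping from rare titles to the large groups.
title_map = {'Don': 'Mr', 'Ms': 'Miss', 'Mme': 'Miss', 'Mlle': 'Miss'}

titles = pd.Series(['Mr', 'Don', 'Ms', 'Mlle', 'Mme', 'Dr', 'Master'])

# replace() swaps exact matches and leaves every other title untouched.
condensed = titles.replace(title_map)
print(condensed.tolist())
```

Applied to the real data, the same call on the `apell` column would let the rare titles borrow the mean age of a much larger group.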

In [13]:
# use our calculation from above (with a slight modification in syntax for the fillna function),
# to fill in the NaN values in the age column
df['age'] = df['age'].fillna(df.groupby("apell")['age'].transform("mean"))
df.age.isnull().sum()  # check that we've filled in all NaN values

0
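
To see what the `fillna` / `transform` combination does, here is a minimal sketch on made-up data. It also adds a fallback to the overall mean for any title with no known ages at all; that fallback is an assumption for the toy example (here the title 'Sir'), and the result above suggests it was not needed for the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'Sir' has no known age, so its group mean is NaN.
toy = pd.DataFrame({
    'apell': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Sir'],
    'age':   [30.0, 40.0, np.nan, 20.0, np.nan, np.nan],
})

# Mean age within each title, aligned back to the original rows.
group_mean = toy.groupby('apell')['age'].transform('mean')

# Fill from the title mean first, then fall back to the overall mean.
filled = toy['age'].fillna(group_mean).fillna(toy['age'].mean())
print(filled.tolist())
```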

## Part 4: Group Statistics

#### Are there any groups that were especially adversely affected in the Titanic wreck? (justify your response numerically)

In [14]:
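# count passengers by sex and survival; 'body' is only a column to count over,
# since len counts every row in a group, missing values included
df_sr = df.pivot_table('body', index=['sex'], columns = ['survived'], aggfunc=len)
df_sr['survival_rate'] = df_sr[1] / (df_sr[0] + df_sr[1])
df_sr.sort_values(by='survival_rate', ascending=False)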

sex,survived=0,survived=1,survival_rate
female,127.0,339.0,0.727468
male,682.0,161.0,0.190985


Overall, males fared far worse than females: about 19% of males survived, versus 73% of females.
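
Because `survived` is coded 0/1, the same rates can also be obtained as a group mean, without building a pivot table first. The sketch below rebuilds one row per passenger from the counts in the table above to show that the two approaches agree; the reconstruction is only for illustration, and on the real data the last step alone would be enough.

```python
import numpy as np
import pandas as pd

# (sex, survived, count) taken from the table above.
counts = [('female', 0, 127), ('female', 1, 339), ('male', 0, 682), ('male', 1, 161)]
repeats = [c[2] for c in counts]
rows = pd.DataFrame({
    'sex':      np.repeat([c[0] for c in counts], repeats),
    'survived': np.repeat([c[1] for c in counts], repeats),
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate = rows.groupby('sex')['survived'].mean()
print(rate.round(6))
```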

In [15]:
df_sr2 = df.pivot_table('body', index=['pclass'], columns = ['survived'], aggfunc=len)
df_sr2['survival_rate'] = df_sr2[1] / (df_sr2[0] + df_sr2[1])
df_sr2.sort_values(by='survival_rate', ascending=False)

pclass,survived=0,survived=1,survival_rate
1,123.0,200.0,0.619195
2,158.0,119.0,0.429603
3,528.0,181.0,0.255289


Overall, 3rd class passengers died at a much higher rate than the other two classes: only about 26% survived, versus 43% in 2nd class and 62% in 1st class.

In [16]:
df_sr3 = df.pivot_table('body', index=['sex', 'pclass'], columns = ['survived'], aggfunc=len)
df_sr3['survival_rate'] = df_sr3[1] / (df_sr3[0] + df_sr3[1])
df_sr3.sort_values(by='survival_rate', ascending=False)

sex,pclass,survived=0,survived=1,survival_rate
female,1,5.0,139.0,0.965278
female,2,12.0,94.0,0.886792
female,3,110.0,106.0,0.490741
male,1,118.0,61.0,0.340782
male,3,418.0,75.0,0.15213
male,2,146.0,25.0,0.146199


However, once we divide further, we see that the terrible survival rate for 3rd class was driven mostly by the males. 3rd class females still survived 49% of the time, versus 15% for their male counterparts.

Interestingly, 2nd class males had a slightly worse survival rate (14.6%) than even 3rd class males (15.2%).

In [17]:
# add an age-group column, split at age 18
df['age_group'] = np.where(df['age']<18, 'under_18', '18_and_over') 

In [18]:
df_sr4 = df.pivot_table('body', index=['pclass', 'age_group'], columns = ['survived'], aggfunc=len)
df_sr4['survival_rate'] = df_sr4[1] / (df_sr4[0] + df_sr4[1])
df_sr4

pclass,age_group,survived=0,survived=1,survival_rate
1,18_and_over,121.0,187.0,0.607143
1,under_18,2.0,13.0,0.866667
2,18_and_over,154.0,90.0,0.368852
2,under_18,4.0,29.0,0.878788
3,18_and_over,456.0,139.0,0.233613
3,under_18,72.0,42.0,0.368421


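Regardless of class, passengers under 18 fared better than adults, although the gap was much narrower in 3rd class (37% versus 23%) than in 1st class (87% versus 61%) or 2nd class (88% versus 37%). Note that these groups include the ages imputed in Part 3.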
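
If more than two age groups are wanted, `pd.cut` is a possible alternative to `np.where`. The bin edges and labels below are illustrative choices, not part of the assignment.

```python
import pandas as pd

ages = pd.Series([0.92, 17.0, 18.0, 29.0, 64.0, 80.0])

# right=False makes each bin include its left edge, so 18 falls in '18_to_59'.
age_band = pd.cut(ages, bins=[0, 18, 60, 120], right=False,
                  labels=['under_18', '18_to_59', '60_and_over'])
print(age_band.tolist())
```

The resulting column can be used in `index=[...]` of a pivot table in the same way as `age_group`.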

#### Are there any groups that outperformed the survival of the latter group? (justify your response numerically)


#### Survival rates summary
- Males were far more likely to die than females (19% versus 73% survival)
- The only female group with more deaths than survivors was the 3rd class (110 versus 106)
- 3rd class passengers were more likely to perish than 1st or 2nd class passengers
- The worst survival rate was for 2nd class males (15%)
- The best survival rate was for 1st class females (97%)

## Part 5:  Comparative Statistics:  Lusitania

In [19]:
l = pd.read_csv('assets/lusitania.csv')
l.head()

Unnamed: 0,Family name,Title,Personal name,Fate,Age,Department/Class,Passenger/Crew,Citizenship,Position,Status,...,Country,Lifeboat,Rescue Vessel,Body No.,Ticket No.,Cabin No.,Traveling Companions and other notes,Value,Adult/Minor,Sex
0,CAMERON,Mr.,Charles W.,Lost,38,Band,Crew,British,,,...,,,,,,,,1,Adult,Male
1,CARR-JONES,Mr.,E.,Lost,37,Band,Crew,British,,,...,,,,,,,,1,Adult,Male
2,DRAKEFORD,Mr.,Edward,Saved,30,Band,Crew,British,Violin,,...,,,,,,,,1,Adult,Male
3,HAWKINS,Mr.,Handel,Saved,25,Band,Crew,British,Cello,,...,,,,,,,,1,Adult,Male
4,HEMINGWAY,Mr.,John William,Saved,27,Band,Crew,British,Double Bass,,...,,,,,,,,1,Adult,Male


#### Are there any groups that were especially adversely affected in the Lusitania wreck? (justify your response numerically)

In [21]:
# create dataframe of just 1st/2nd/3rd class passengers
lc = l[(l['Department/Class'] == 'Saloon') | (l['Department/Class'] == 'Third') | (l['Department/Class'] == 'Second')]
lc.head(2)

Unnamed: 0,Family name,Title,Personal name,Fate,Age,Department/Class,Passenger/Crew,Citizenship,Position,Status,...,Country,Lifeboat,Rescue Vessel,Body No.,Ticket No.,Cabin No.,Traveling Companions and other notes,Value,Adult/Minor,Sex
387,ADAMS,Mr.,William McMillan,Saved,19,Saloon,Passenger,USA,,Single,...,England,17,,,46102,D 45,Arthur Adams (father),1,Adult,Male
388,ADAMS,Mr.,Arthur Henry,Lost,46,Saloon,Passenger,USA,,Married,...,England,17,,,46102,D 37,William Adams (son),1,Adult,Male


In [22]:
# from just the passengers, keep only 'Saved' or 'Lost'
lf = lc[(lc['Fate'] == 'Lost') | (lc['Fate'] == 'Saved')]
lf.head(2)

Unnamed: 0,Family name,Title,Personal name,Fate,Age,Department/Class,Passenger/Crew,Citizenship,Position,Status,...,Country,Lifeboat,Rescue Vessel,Body No.,Ticket No.,Cabin No.,Traveling Companions and other notes,Value,Adult/Minor,Sex
387,ADAMS,Mr.,William McMillan,Saved,19,Saloon,Passenger,USA,,Single,...,England,17,,,46102,D 45,Arthur Adams (father),1,Adult,Male
388,ADAMS,Mr.,Arthur Henry,Lost,46,Saloon,Passenger,USA,,Married,...,England,17,,,46102,D 37,William Adams (son),1,Adult,Male


In [23]:
# create pivot table for fate based on sex and class
lf2 = lf.pivot_table('State', index=['Sex', 'Department/Class'], columns=['Fate'], aggfunc=len)
# add column for survival rate percentage
lf2['survival_rate'] = lf2['Saved'] / (lf2['Lost'] + lf2['Saved'])
lf2

Sex,Department/Class,Lost,Saved,survival_rate
Female,Saloon,56,34,0.377778
Female,Second,185,108,0.368601
Female,Third,68,40,0.37037
Male,Saloon,121,78,0.39196
Male,Second,187,119,0.388889
Male,Third,168,94,0.358779


From this breakdown it appears that, unlike on the Titanic, survival rates among the passengers were about the same across all classes and both sexes (roughly 36% to 39%).
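
The same proportions can also be produced in one step with `pd.crosstab` and `normalize='index'`. The sketch below rebuilds one row per passenger from the counts in the table above purely for illustration; the column name `Class` is a shortened stand-in for `Department/Class`.

```python
import pandas as pd

# (sex, class, lost, saved) copied from the table above.
table = [('Female', 'Saloon', 56, 34), ('Female', 'Second', 185, 108),
         ('Female', 'Third', 68, 40), ('Male', 'Saloon', 121, 78),
         ('Male', 'Second', 187, 119), ('Male', 'Third', 168, 94)]

# Expand the counts back to one row per passenger.
records = []
for sex, cls, lost, saved in table:
    records += [(sex, cls, 'Lost')] * lost + [(sex, cls, 'Saved')] * saved
people = pd.DataFrame(records, columns=['Sex', 'Class', 'Fate'])

# normalize='index' turns each row of counts into proportions that sum to 1.
rates = pd.crosstab([people['Sex'], people['Class']], people['Fate'], normalize='index')
print(rates['Saved'].round(3))
```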

In [24]:
l_all = l[(l['Fate'] == 'Lost') | (l['Fate'] == 'Saved')]
l_all_f = l_all.pivot_table('State', index=['Sex', 'Department/Class'], columns=['Fate'], aggfunc=len)
l_all_f['survival_rate'] = l_all_f['Saved'] / (l_all_f['Lost'] + l_all_f['Saved'])

l_all_f

Sex,Department/Class,Lost,Saved,survival_rate
Female,Saloon,56.0,34.0,0.377778
Female,Second,185.0,108.0,0.368601
Female,Third,68.0,40.0,0.37037
Female,Victualling,16.0,9.0,0.36
Male,Band,2.0,3.0,0.6
Male,Deck,32.0,37.0,0.536232
Male,Engineering,201.0,112.0,0.357827
Male,Saloon,121.0,78.0,0.39196
Male,Second,187.0,119.0,0.388889
Male,Stowaway,3.0,,


#### Are there any groups that outperformed the survival of the latter group? (justify your response numerically)



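It seems that the best chance of survival on the Lusitania was in certain crew departments, most notably the band (60%, although this is a very small sample), the deck crew (54%), and the "victualling" servers (46%; the female victualling row shown above is lower, at 36%). Engineering crew (36%), by contrast, fared no better than the passengers.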

#### What does the group-wise survival rate imply about circumstances during these two accidents?

The Lusitania was torpedoed by a German U-boat and sank in about 20 minutes, as opposed to the roughly 2.5 hours it took the Titanic to sink.

Since the attack took place in the afternoon, perhaps this very short interval between the strike and the sinking allowed only those who were close to the decks (passengers and the crew attending them) to escape.

Also, the memory of the Titanic (which sank 3 years earlier) was probably still in people's minds and perhaps guided their behavior accordingly, i.e., not so much "women and children first" anymore.
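
One way to put a number on this contrast is the gap between female and male survival rates on each ship. The sketch below uses only counts copied from the pivot tables above (Lusitania passengers summed over Saloon, Second, and Third class); the helper names `rate` and `sex_gap` are made up for this example.

```python
# Counts copied from the pivot tables above, as (survived, died).
titanic = {'female': (339, 127), 'male': (161, 682)}

# Lusitania passengers only, summed over Saloon, Second and Third class.
lusitania = {'female': (34 + 108 + 40, 56 + 185 + 68),
             'male':   (78 + 119 + 94, 121 + 187 + 168)}

def rate(saved, lost):
    """Share of the group that survived."""
    return float(saved) / (saved + lost)

def sex_gap(ship):
    """Female survival rate minus male survival rate."""
    return rate(*ship['female']) - rate(*ship['male'])

print(round(sex_gap(titanic), 3), round(sex_gap(lusitania), 3))
```

On these counts the gap is about 54 percentage points on the Titanic and under 1 point on the Lusitania.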