# Women and children first?

## Preliminaries

In [1]:
# Run this cell to start.
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Load the OKpy test library and tests.
from client.api.notebook import Notebook
ok = Notebook('titanic.ok')

Assignment: titanic
OK, version v1.18.1



The tests in this notebook usually do not test if you have the *right* answer,
but only if you have the *right sort* of answer.  *Be careful* -- the tests
could pass, but your answer could still be wrong.

## Background

We are going to look at the details of who was lost, and who survived, in the sinking of the RMS Titanic.

We first read the dataset containing information about the passengers and crew
who were on the RMS Titanic when it sank.

The data file is `titanic_stlearn.csv`.

See the [Titanic dataset page](https://github.com/matthew-brett/datasets/tree/master/titanic) for more detail.

You might also want to look at [Encyclopedia
Titanica](https://www.encyclopedia-titanica.org/titanic-statistics.html) for
more background.

In [2]:
titanic = pd.read_csv('titanic_stlearn.csv')
titanic.head()

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,sibsp,parch,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,Southampton,United States,5547.0,7.11,0.0,0.0,no
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,Southampton,United States,2673.0,20.05,0.0,2.0,no
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,Southampton,United States,2673.0,20.05,1.0,1.0,no
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,Southampton,England,2673.0,20.05,1.0,1.0,yes
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,Southampton,Norway,348125.0,7.13,0.0,0.0,yes


In [3]:
# Test you are on the right track.
_ = ok.grade('q_01_titanic')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 3
    Failed: 0
[ooooooooook] 100.0% passed



This data file contains the following columns:

* `name`: a string with the name of the passenger.
* `gender`: a string with one of two labels: "male" and "female".
* `age`: a numeric value with the person's age on the day of the sinking. The
  age of babies (under 12 months) is given as a fraction of one year, rounded
  to the nearest month (2 months = 2/12 = 0.1667).
* `class`: a string specifying the class for passengers: "1st", "2nd", "3rd";
  or the type of service aboard for crew members. See below for discussion of
  passengers, crew and the crew service types.
* `embarked`: a string with the person's port of embarkation, one of:
  "Belfast", "Cherbourg", "Queenstown" or "Southampton".
* `country`: a string with the person's home country.
* `ticketno`: a numeric value specifying the person's ticket number (NA for crew
  members, also see below).
* `fare`: a numeric value with the ticket price (NA for crew members, musicians
  and employees of the shipyard company, also see below).
* `sibsp`: an integer specifying the number of siblings/spouses aboard; adopted
  from Vanderbilt data set (see below).  Always NA for crew, sometimes NA for
  passengers.
* `parch`: an ordered factor specifying the number of parents/children aboard;
  adopted from Vanderbilt data set (see below).  Always NA for crew, sometimes
  NA for passengers.
* `survived`: a string with one of two labels: "no" and "yes". It specifies
  whether the person survived the sinking.
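
One way to check descriptions like these against the data is to look at the number of missing (NA) values in each column.  The sketch below uses a tiny made-up data frame, so that it runs on its own; the names and values are hypothetical, and are not rows from the data file.  With the real data, you could call the same methods on `titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, standing in for the real `titanic` data frame.
toy = pd.DataFrame({
    'name': ['Passenger A', 'Crew B', 'Passenger C'],
    'class': ['3rd', 'deck crew', '1st'],
    'fare': [7.11, np.nan, 30.0],
    'sibsp': [0.0, np.nan, np.nan],
})

# Number of missing (NA) values in each column.
n_missing = toy.isna().sum()
print(toy.dtypes)
print(n_missing)
```

Here the made-up crew member has NA for both `fare` and `sibsp`, and one made-up passenger has NA for `sibsp`, matching the pattern described in the list above.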

## Women and children first

The RMS Titanic sank on 15th April 1912. A standard rule of evacuation at the
time was [Women and Children
First](https://en.wikipedia.org/wiki/Women_and_children_first).  Wikipedia
claims that the original suggestion for this rule was from a French passenger
of a ship in danger, in 1840.

How strictly was that rule applied in the evacuation of the Titanic?

Use `pd.crosstab` to create a new data frame that is a cross-tabulation of the
values in the `gender` column, and the values in the `survived` column.  Store
this cross-tabulation in the variable `gender_by_survived`.  It should contain
four counts, one for `female` passengers who were lost (`no`), one for `female`
and `yes` and so on.

In [4]:
gender_by_survived = pd.crosstab(titanic['gender'], titanic['survived'])
# Show the table in the notebook
gender_by_survived

survived,no,yes
gender,,
female,130,359
male,1366,352


In [5]:
# Check you are on the right track.
_ = ok.grade('q_02_gender_by_survived')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed



These counts are useful, but even more useful would be *proportions* of women
who were lost and who survived.  Investigate the keyword arguments to
`pd.crosstab` to create a new data frame `gender_by_survived_p` that shows the
proportions of men and women who survived.  For example, there should be a
value for `female` and `no` that is the number of `female` passengers who were
lost, divided by the total number of `female` passengers.  That is, you want
the proportions across the *rows*.
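
As a minimal sketch of what "proportions across the rows" means, the example below uses a small made-up data frame (hypothetical values, not the Titanic data).  It shows the `normalize` keyword argument of `pd.crosstab`, and checks that dividing each row of the counts by its row total gives the same answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
toy = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'female', 'male', 'male'],
    'survived': ['yes', 'yes', 'yes', 'no', 'no', 'yes'],
})

# Counts for each combination of gender and survived.
counts = pd.crosstab(toy['gender'], toy['survived'])
# normalize='index' divides each row by its row total.
props = pd.crosstab(toy['gender'], toy['survived'], normalize='index')
# The same calculation by hand: divide each row by its sum.
by_hand = counts.div(counts.sum(axis='columns'), axis='index')
print(props)
```

Each row of `props` adds up to 1, because the values are proportions of the people in that row.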

In [6]:
def proportion_by_row(df):
    # Divide each row by its row total, to give proportions across the row.
    return df.div(df.sum(axis=1), axis='rows')

In [7]:
gender_by_survived_p = proportion_by_row(gender_by_survived)
# Show the table in the notebook
gender_by_survived_p

survived,no,yes
gender,,
female,0.265849,0.734151
male,0.795111,0.204889


In [8]:
# Check you are on the right track.
_ = ok.grade('q_03_gender_by_survived_p')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed



This should look like pretty convincing evidence that the crew largely followed
the "women" part of the "Women and children first" rule.  Next we investigate
the "children" part.

We need a Series that allows us to categorize the passenger as a `male`, a
`female` or a `child`.

First we make a new series `mwc` (Man Woman Child) that has a copy of the data
from the `gender` column.

In [9]:
# Run this cell.
mwc = titanic['gender'].copy()
mwc.head()

0      male
1      male
2      male
3    female
4    female
Name: gender, dtype: object

Now your turn.  Make a Boolean series named `is_child` that has True for rows
where the passenger's `age` was less than 15, and False otherwise.  Use
`is_child` to set the corresponding elements in `mwc` to have the value
`child`.

After you have done this, the `mwc` Series should have a `child` value for rows
in `titanic` where the person's age was less than 15, otherwise have a `male`
value for male adult passengers or a `female` value for female adult
passengers.
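
One detail worth keeping in mind is what happens to rows with a missing age.  The sketch below uses a small made-up data frame (hypothetical values) to show that comparing a missing value with `< 15` gives False, so a person with a missing age keeps their original `male` or `female` label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'age': [40.0, 9.0, np.nan, 14.0],
})

labels = toy['gender'].copy()
# Comparing NaN with < 15 gives False, so the third row is not a child.
is_child = toy['age'] < 15
# Set the label to 'child' where is_child is True.
labels[is_child] = 'child'
print(labels.tolist())
```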

In [10]:
#- Your code here.
is_child = titanic['age'] < 15
mwc[is_child] = 'child'
# Show the unique values and counts for the "mwc" Series.
mwc.value_counts()

male      1651
female     432
child      124
Name: gender, dtype: int64

In [11]:
_ = ok.grade('q_04_mwc')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed



Create a cross-tabulation data frame called `mwc_by_survived_p` that has the
proportions of children, females and males that were saved and lost.  The
*proportion* of children saved is the number of children saved divided by the
total number of children.  Your `mwc_by_survived_p` data frame should have, for
example, a row corresponding to `child`, that has two values: the proportion
of children that were lost and the proportion of children who were saved.

In [12]:
mwc_by_survived_p = proportion_by_row(pd.crosstab(mwc, titanic['survived']))
mwc_by_survived_p

survived,no,yes
gender,,
child,0.475806,0.524194
female,0.243056,0.756944
male,0.806784,0.193216


In [13]:
_ = ok.grade('q_05_mwc_p')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed



## Being at the front of the plane

Another well-known factor is passenger class: passengers in higher classes
were more likely to survive.

The problem we have at the moment is that the `class` column in this dataset is a mix of things:

In [14]:
# Run this cell.
titanic['class'].value_counts()

3rd                 709
victualling crew    431
1st                 324
engineering crew    324
2nd                 284
restaurant staff     69
deck crew            66
Name: class, dtype: int64

The `class` column contains "1st", "2nd", "3rd" for some people, but it has job
titles for others, such as "deck crew".

Worse than that, some of the people in "1st" and "2nd" class were closer to
being crew than passengers.  For example, there were [8
musicians](https://en.wikipedia.org/wiki/Musicians_of_the_RMS_Titanic), who
were all listed as "2nd" class passengers. There were [9 members of the Guarantee
Group](https://en.wikipedia.org/wiki/Crew_of_the_RMS_Titanic#Guarantee_group)
on board, whose job was to monitor the ship and fix any problems that arose on
her maiden voyage.  They also have passenger classes listed as "1st" or "2nd".

We would like to be able to classify the people (rows) in the dataset as one of the following:

* Genuine First class passenger: "1st".
* Genuine Second class passenger: "2nd".
* Genuine Third class passenger: "3rd".
* Musician: "musician".
* Guarantee group: "guarantee".
* Deck crew: "deck".
* Engineering crew: "engineering".
* Victualling crew or restaurant staff: "catering".

That is, we need a new Series, maybe called `roles`, with one element per row
in the dataset, that has one of these string labels, classifying the person in
the corresponding row. For example, the first five people in the dataset are
genuine Third Class passengers, so the first five elements in `roles` would be
"3rd".

Much of the information we need is in the `class` column of `titanic` - but we
have more work to do, especially for the musicians and the guarantee group.

One way of doing this task is to use a *recoding function*.  You saw one of
these in action in your "stop and search" homework.  In the homework, the
function was applied to the values of a Series (and therefore, a column of a
data frame), and, when used with `apply`, returned a Series.

Here we need to use information from multiple columns in the person's row to
classify them, so we need to take a different approach.   We need to `apply` a
function to the whole data frame, to return our new Series `roles`.

Here is an example of how to do this.  The function below is a *row recoding
function*.  It accepts a *row* as its argument, and returns a value.

In this case, the function returns "adult" for a row where the person's age was
15 or more, and otherwise (for persons with age < 15) returns "female child"
for "female" persons and "male child" otherwise.

In [15]:
# Run this cell to create example row classification function

def classify_mf_child(row):
    if row.loc['age'] >= 15:
        return 'adult'
    if row.loc['gender'] == 'female':
        return 'female child'
    return 'male child'

To see the function in action, let's classify the first row of `titanic`:

In [16]:
classify_mf_child(titanic.iloc[0])

'adult'

Classify the second row:

In [17]:
classify_mf_child(titanic.iloc[1])

'male child'

Then we can `apply` this function to the whole data frame, to return a classification for each row in the data frame:

In [18]:
mf_child = titanic.apply(classify_mf_child, axis='columns')
mf_child.head()

0         adult
1    male child
2         adult
3         adult
4         adult
dtype: object

Notice the `axis='columns'` keyword argument.  This tells Pandas to send the
function one *row* at a time (working *across the columns*).  It's also
possible to send the function one *column* at a time (working *across the
rows*), and that is what `apply` does by default, if you don't specify
`axis='columns'`.
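
To make the difference concrete, here is a small sketch using a made-up data frame and a made-up function called `total` (both are hypothetical, for illustration only).  The same function gives one result per row with `axis='columns'`, and one result per column with the default.

```python
import pandas as pd

# Hypothetical toy data frame with two numeric columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def total(values):
    # `values` is a Series: one row or one column, depending on `axis`.
    return values.sum()

# One result per row (working across the columns).
row_totals = toy.apply(total, axis='columns')
# Default: one result per column (working across the rows).
col_totals = toy.apply(total)
print(row_totals.tolist())
print(col_totals.to_dict())
```

`row_totals` has three values, one for each row of `toy`; `col_totals` has two values, one for each of the columns `a` and `b`.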

Your job is to write a row classification function, like `classify_mf_child`
above, that accepts a single row, and returns the correct string corresponding
to that row, from the list above (from "1st", "2nd", "3rd", "musician", etc).

In order to do this, investigate the `titanic` data set, and have a look at the
links above that have more information on the musicians and the Guarantee
Group.  See if you can find information online and in the dataset rows that is
distinctive enough to identify the 8 musicians, 9 members of the Guarantee
Group, and so on.

*Hint 1* To test if a string contains another string, you can use the `in` operator like this:

In [19]:
a = 'Bah humbug'
'humbug' in a

True

*Hint 2* To test for a missing value, use `pd.isna()` like this:

In [20]:
pd.isna(np.nan)

True
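
The two hints can be combined inside a row function.  The sketch below is only an illustration of the mechanics: the data frame, the names in it, and the function `describe_row` are all made up, and the labels it returns are not the labels you need for this question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; names and values are invented for illustration.
toy = pd.DataFrame({
    'name': ['Smith, Mr. John', 'Jones, Miss. Ann',
             'Brown, Mr. Tom', 'Gray, Miss. Eve'],
    'fare': [7.25, np.nan, 26.0, 10.0],
})

def describe_row(row):
    # Hint 2: test for a missing value with pd.isna.
    if pd.isna(row.loc['fare']):
        return 'no fare'
    # Hint 1: test whether one string contains another with `in`.
    if 'Mr.' in row.loc['name']:
        return 'mister with fare'
    return 'other with fare'

labels = toy.apply(describe_row, axis='columns')
print(labels.tolist())
```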

In [22]:
# Names of the 8 musicians and the 9 members of the Guarantee Group.
musicians = [
    'Hartley, Mr. Wallace Henry', 'Brailey, Mr. William Theodore Ronald',
    'Bricoux, Mr. Roger Marie', 'Taylor, Mr. Percy Cornelius',
    'Woodward, Mr. John Wesley', 'Hume, Mr. John Law',
    'Clarke, Mr. John Frederick Preston', 'Krins, Mr. Georges Alexandre']
guarantee_group = [
    'Andrews, Mr. Thomas', 'Campbell, Mr. William Henry',
    'Chisholm, Mr. Roderick Robert Crispin', 'Cunningham, Mr. Alfred Fleming',
    'Frost, Mr. Anthony Wood', 'Knight, Mr. Robert', 'Parkes, Mr. Francis',
    'Parr, Mr. William Henry Marsh', 'Watson, Mr. Ennis Hastings']

def classify_role(row):
    tic_class = row.loc['class']
    name = row.loc['name']
    if name in musicians:
        return 'musician'
    if name in guarantee_group:
        return 'guarantee'
    if tic_class in ['victualling crew', 'restaurant staff']:
        return 'catering'
    if tic_class == 'deck crew':
        return 'deck'
    if tic_class == 'engineering crew':
        return 'engineering'
    # Otherwise a genuine passenger: '1st', '2nd' or '3rd'.
    return tic_class

The next cell tests if you are on the right track with your function:

In [23]:
print(classify_role(titanic.iloc[0]))  # Should show '3rd'
print(classify_role(titanic.iloc[6]))  # Should show '2nd'
print(classify_role(titanic.iloc[-1]))  # Should show 'catering'
print(classify_role(titanic.iloc[-3]))  # Should show 'engineering'
print(classify_role(titanic.iloc[-4]))  # Should show 'catering'
print(classify_role(titanic.iloc[-5]))  # Should show 'deck'
is_brailey = titanic['name'].str.startswith('Brailey')
print(classify_role(titanic[is_brailey].iloc[0]))  # Should show 'musician'
is_andrews = titanic['name'] == 'Andrews, Mr. Thomas'
print(classify_role(titanic[is_andrews].iloc[0]))  # Should show 'guarantee'

3rd
2nd
catering
engineering
catering
deck
musician
guarantee


In [24]:
# This test runs the tests above, and some extra besides.

_ = ok.grade('q_06_classify_role')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 3
    Failed: 0
[ooooooooook] 100.0% passed



`apply` your function to the `titanic` data frame to make a new Series, then
use this Series to create a new data frame `role_by_survived_p` that is a
cross-tabulation of the *proportion* of *male* passengers with each role, that
survived or were lost. For example, `role_by_survived_p` will have a row
corresponding to "catering", with two values, where one value will be the
proportion of *male* catering staff that survived, and the other will be the
proportion of male catering staff that were lost.
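
As a sketch of the steps involved, the example below uses made-up data (the values and the `toy_roles` Series are hypothetical).  It selects the rows for males with a Boolean Series, cross-tabulates the role labels against `survived` for those rows only, then divides each row by its total to get proportions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
toy = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'male'],
    'survived': ['no', 'yes', 'yes', 'no', 'no'],
})
# Hypothetical role labels, one per row of `toy`.
toy_roles = pd.Series(['deck', 'deck', '1st', '1st', '1st'])

# True for rows where the person is male.
is_male = toy['gender'] == 'male'
# Counts for males only: role labels by survived.
counts = pd.crosstab(toy_roles[is_male], toy.loc[is_male, 'survived'])
# Proportions across each row.
props = counts.div(counts.sum(axis='columns'), axis='index')
print(props)
```

Using the same Boolean Series to select from both the role labels and the `survived` column keeps the two sets of values matched row for row.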

In [25]:
roles = titanic.apply(classify_role, axis='columns')
roles.head()

0    3rd
1    3rd
2    3rd
3    3rd
4    3rd
dtype: object

In [26]:
len(roles)

2207

In [27]:
roles[titanic['gender']=='male'].shape

(1718,)

In [28]:
is_male = titanic['gender'] == 'male'
role_by_survived_p = proportion_by_row(
    pd.crosstab(roles[is_male], titanic[is_male]['survived']))
role_by_survived_p

survived,no,yes
row_0,,
1st,0.649718,0.350282
2nd,0.853659,0.146341
3rd,0.84787,0.15213
catering,0.838574,0.161426
deck,0.348485,0.651515
engineering,0.780864,0.219136
guarantee,1.0,0.0
musician,1.0,0.0


## Done

You're finished with the assignment!  Be sure to...

- **run all the tests** (the next cell has a shortcut for that),
- **Save and Checkpoint** from the "File" menu.
- Finally, **restart** the kernel for this notebook, and **run all the cells**,
  to check that the notebook still works without errors.  Use the
  "Kernel" menu, and choose "Restart and run all".  If you find any
  problems, go back and fix them, save the notebook, and restart / run
  all again, before submitting.  When you do this, you make sure that
  we, your humble markers, will be able to mark your notebook.

In [29]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 3
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------
