### The Lusitania

The RMS Lusitania was a British ocean liner that was torpedoed near the Irish coast by a German U-boat on the afternoon of May 7th, 1915. The ship sank in 18 minutes, and 1198 passengers and crew died in the disaster.

![The Lusitania](lusitania.jpeg)

Q: Open up the raw csv file ("lusitania_02.csv") in your editor of choice. Have a look at the data.

Q: Try to load the csv file "lusitania_02.csv" from the data directory into a Pandas dataframe. Do not forget to load the libraries you need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline

In [None]:
df_lusitania = pd.read_csv('data/lusitania_02.csv', sep = ";")

Q: Use the head() method to have a look at the first entries and make some notes about the data.

In [None]:
df_lusitania.head(10)

Q: Using some of the built-in Pandas methods get to know the data in the Lusitania file. Compared to what we found out about the Titanic, do you see differences? What is the status of the data? Is it tidy? Are there many missing values? Use the Titanic Notebook as a guide (02_pandas_titanic.ipynb).

In [None]:
df_lusitania.shape

In [None]:
df_lusitania.describe()

Q: If you used the describe() method, explain the result [just one column]. Which dataframe method do you find most useful to get a grip on the data contained in the Lusitania csv file? Use that method below and explain your answer in a separate cell.

In [None]:
df_lusitania.info()
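The non-null counts that info() prints can also be turned into an explicit count of missing values per column. The cell below is a minimal sketch on a small made-up dataframe (the `toy` values are hypothetical, only the column names are borrowed from the Lusitania file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Lusitania columns
toy = pd.DataFrame({
    "Fate": ["Lost", "Saved", "Lost", "Saved"],
    "Age": ["38", np.nan, "22?", "Infant"],
    "Lifeboat": [np.nan, "15", np.nan, np.nan],
})

missing_count = toy.isna().sum()    # number of missing entries per column
missing_share = toy.isna().mean()   # fraction of missing entries per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```

Running the same two lines on df_lusitania should make it easier to decide which columns are too sparse to keep.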

Q: Based on your insights so far, after exploring the data, first describe in some cells below what you need to do in order to gain insight into the survival rate of the passengers.

Cleaning:

- Many columns are of no use because they lack data, and can be discarded: Position, County, Lifeboat, Rescue Vessel, Body No., Ticket No., Cabin No.;
- There are some columns that are incomplete, but perhaps useful (we must take a closer look): Status, Travelling companions;
- There is a column that is complete, but has puzzling values: Value;
- And there is the Age column, that is important, but incomplete and not in numerical format.

Q: Try to implement (parts of) your plan in the cells below.

In [None]:
# Throw away some of the columns that lack data, are not easily repaired and will probably not be used in the
# analysis of the main question: Survival rate of the passengers.

In [None]:
# drop() returns a new dataframe, so assign the result back to keep the change
df_lusitania = df_lusitania.drop(['Position', 'County', 'Lifeboat', 'Rescue Vessel', 'Body No.', 'Ticket No.', 'Cabin No.', 'City', 'State', 'Country'], axis = 1)

In [None]:
# Inspect the status column (not very informative)
df_lusitania['Status']

In [None]:
# If your first try was like the one above, then use the command below
# Second try: List the unique values of the status column
df_lusitania.Status.unique()

The travelling companions column is interesting because its header is "Traveling Companions and other notes". Spaces in a column name are a big *no-no*, because they make some basic, useful commands (such as attribute access) unusable:

In [None]:
# Attribute access does not work for a name with spaces: this line raises a SyntaxError
df_lusitania.Traveling Companions and other notes.unique()
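Bracket indexing still works for a column name with spaces, so the column can be inspected even before it is renamed. A minimal sketch on a made-up dataframe (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same awkward column header
toy = pd.DataFrame({
    "Traveling Companions and other notes": ["wife", "alone", "wife"],
})

# Bracket indexing accepts any column name, including names with spaces
values = toy["Traveling Companions and other notes"].unique()
print(values)
```

A shorter name is still more convenient for everyday use.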

In [None]:
# Let's rename that column:
df_lusitania = df_lusitania.rename(columns={"Traveling Companions and other notes": "Notes"})

In [None]:
# Now we can ask:
df_lusitania.Notes.unique()

A lot of information here. If we were to prepare some extra columns for this dataset (like the ones that we found in the Titanic dataset: sibsp and parch) we could use this information. For now, we will leave them in.

Q: That leaves us with the Age column. We have to resolve similar problems as with the Age column of the Titanic dataset: We have strings instead of numbers and we have missing values.

Re-use the code from the Titanic dataset to clean the Lusitania Age column.

In [None]:
pd.to_numeric(df_lusitania.Age)

Q: In case that code throws an error, try to formulate a way to get out of the mess. Or, if you are in a hurry, read in the cleaned datafile: lusitania_02.csv

I opened up my editor of choice (Emacs) and ran:
M-x occur[RET]
[0-9]\ \?;[RET]

This gave me a list of 10 matches that I repaired by hand. The result is a new version of the csv file: lusitania_02.csv. Running the command above again gave me another error. Some rows (c. 35) of the Age column contain values like '14-months', '03-months' or '3-6-months ?'. I changed these entries into '2', '1', and '1'. Then there were entries like: 22?, 22 or 34?, Infant, etc.
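The same repairs could also be attempted in code instead of by hand. The cell below is only a sketch of one possible approach, not the procedure used to produce the cleaned file: `clean_age` is a hypothetical helper, and its rules (round months up to whole years, keep the first number of an uncertain entry, leave entries without digits missing) are assumptions chosen to match the manual edits described above.

```python
import math
import re

import numpy as np
import pandas as pd


def clean_age(value):
    """Return an age in years as a float, or NaN when no number is found."""
    if pd.isna(value):
        return np.nan
    text = str(value).lower()
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return np.nan  # e.g. 'Infant': left missing, to be filled later
    if "month" in text:
        # Assumption: take the largest month count and round up to whole years
        return float(math.ceil(max(numbers) / 12))
    # Assumption: for uncertain entries such as '22?' keep the first number
    return numbers[0]


# Made-up sample of the kinds of entries mentioned above
raw = pd.Series(["38", "14-months", "03-months", "3-6-months ?", "22?", "Infant", None])
ages = raw.apply(clean_age)
print(ages.tolist())
```

With rules like these, '14-months' becomes 2 and both '03-months' and '3-6-months ?' become 1, in line with the hand-made corrections. Whether the first number is the right choice for an uncertain entry is a judgment call.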

Q: After some cleaning, we are now ready to fill in the missing data:

In [None]:
df_lusitania['Age'] = df_lusitania['Age'].fillna(df_lusitania['Age'].median())

Q: Run a quick check to see whether all 'Age' slots are filled with the command:

In [None]:
df_lusitania.info()
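A more direct check than reading the info() output is to count the remaining missing values. A minimal sketch on made-up ages (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing entries
toy = pd.DataFrame({"Age": [30.0, np.nan, 2.0, 45.0, np.nan]})

median_age = toy["Age"].median()             # missing values are skipped
toy["Age"] = toy["Age"].fillna(median_age)   # assign back instead of using inplace

remaining = toy["Age"].isna().sum()
print(median_age, remaining)
```

If `remaining` is 0 on the real dataframe, every 'Age' slot is filled.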

Q: Use the Pandas crosstab() function to compare the 'Age' column with the 'Fate' column:

In [None]:
pd.crosstab(df_lusitania.Age, df_lusitania.Fate)

Q: This is a similar problem to the one we had with the Titanic dataset. Can we use a similar solution?

In [None]:
figure = plt.figure(figsize=(13,8))
plt.hist([df_lusitania[df_lusitania['Fate']=='Saved']['Age'],df_lusitania[df_lusitania['Fate']=='Lost']['Age']],
          stacked = True, color = ['g','r'],
          bins = 30,label = ['Saved','Lost'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
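A crosstab on the raw ages gives one row per distinct age, which is hard to read. As an alternative to the histogram, the ages can first be grouped into bands. The cell below is a sketch on made-up data; the bin edges and labels are assumptions, and only the 'Saved' and 'Lost' values are taken from the Fate column used above.

```python
import pandas as pd

# Hypothetical toy data: ages and fates are made up for illustration
toy = pd.DataFrame({
    "Age": [1, 8, 16, 25, 31, 47, 62, 35, 70],
    "Fate": ["Saved", "Lost", "Saved", "Saved", "Lost", "Lost", "Lost", "Saved", "Lost"],
})

# Assumed age bands; each band includes its upper edge
bins = [0, 12, 18, 40, 60, 120]
labels = ["0-12", "13-18", "19-40", "41-60", "61+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# normalize="index" turns the counts in each row into shares that sum to 1
rates = pd.crosstab(toy["AgeGroup"], toy["Fate"], normalize="index")
print(rates)
```

The 'Saved' column of `rates` can then be read as a survival rate per age band.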

Q: Compare your visualization with the one we got from the Titanic data. Can you formulate any hypotheses as to why they differ, if they differ in certain areas?

Q: What would your next action be? (Code is perfect, but text is also ok:-)

In [None]:
pd.crosstab(df_lusitania.Sex, df_lusitania.Fate)
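The counts in this crosstab can be turned into a survival rate per group, which is easier to compare with the Titanic results. A minimal sketch on made-up data (the 'Male' and 'Female' labels are an assumption about how the Sex column is coded):

```python
import pandas as pd

# Hypothetical toy data; only the 'Saved'/'Lost' coding is taken from the notebook
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Male"],
    "Fate": ["Lost", "Saved", "Lost", "Lost", "Saved", "Lost"],
})

survived = toy["Fate"].eq("Saved")                  # True where the passenger was saved
rate_by_sex = survived.groupby(toy["Sex"]).mean()   # share of True values per group
print(rate_by_sex)
```

The mean of a boolean column is the fraction of True values, so each entry of `rate_by_sex` is the share of that group that was saved.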