### Analyzing Titanic Data: Scenario

![The Titanic](titanic.jpeg)

### Introduction
- About the disaster
- Set a working directory
- Import the data files we are going to work with

#### The Titanic
So what do we know already? Most of you will probably have seen the movie by James Cameron (starring Kate Winslet and Leonardo DiCaprio, 1997) which is one of several movies made with the disaster as subject. So you will be aware of the following:
![The Movie](iu.jpeg)
- the disaster took place during the night of April 14th, 1912 when the ship hit an iceberg on her maiden voyage out of Southampton (UK) to New York (US) via Cherbourg (FR) and Queenstown (IRE);
- the loss of lives was 1501 out of a total of 2207 passengers and crew;
- there was a shortage of lifeboats and some of the lifeboats were set afloat whilst not fully occupied.

There are various ways of looking at the disaster. Apart from the movie he directed, James Cameron also made a computer-generated simulation of the actual sinking of the ship, based on the distribution of the debris:

<a href="https://www.youtube.com/watch?v=FSGeskFzE0s">James Cameron: How the Titanic sank</a>

#### The data
We are going to work with the data files that were prepared for the Kaggle competition, an online machine learning competition. So, we are cheating big time here: out there in the wild, working as a data scientist, you will have to accumulate the data yourself, often relying on various sources, which means a lot of cleaning and "harmonizing". These preparations usually take a lot of time. Estimates vary, but it is commonly thought that 60-80% of the work with data is spent on acquiring the data and preparing it for analysis.

Re-use of data, where possible, is therefore a big thing. In the R community this sparked the idea of "tidy data": a method to prepare data in such a way that it can be fruitfully used and more easily shared.

#### Get to know the data

Load and explore the data. In our case we are going to work with two files in CSV format: A so-called training set and a test set.

I usually load the file(s) in my editor of choice and poke around a little bit:

In [1]:
!aquamacs /Users/peter/Documents/bootcamps/data/titanic/train.csv

Time spent getting acquainted with the data you are working with is time well spent. For example, we see that the Name column contains some extras: titles (Mr., Mrs., Miss., Master, Rev., Don., etc.) as well as the maiden names of married female passengers (passenger 152, for example, is Mrs. Thomas Pearce = Edith Wearne). Whenever we see two adjacent column separators (here ",,"), we know we are dealing with missing data! And we might start to wonder what some of the column headers mean: SibSp (the number of SIBlings / SPouses aboard) or Parch (the number of PARents / CHildren aboard).
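
The observation about adjacent separators can be checked in Pandas as well. The following is a minimal sketch, assuming three rows typed in by hand in the raw layout of train.csv (the variable names `raw`, `sample` and `missing_per_column` are only illustrative):

```python
import io
import pandas as pd

# Three rows in the raw CSV layout of train.csv.
# Note the ",," in the first two rows: the Cabin field is empty there.
raw = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
'''

sample = pd.read_csv(io.StringIO(raw))

# Empty fields are read as NaN; isna().sum() counts them per column.
missing_per_column = sample.isna().sum()

# Show only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```

Run against the full training file, the same `isna().sum()` call should give a quick overview of which columns need attention.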

Ok, let's dive in.

In [2]:
# Import some libraries we are going to work with
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
# Read in the Kaggle datafiles
df_train = pd.read_csv("~/Documents/bootcamps/data/titanic/train.csv", sep=",")
df_test = pd.read_csv("~/Documents/bootcamps/data/titanic/test.csv", sep=",")

Now that we have our dataset as a Pandas dataframe, we can use the attributes and methods that Pandas provides on the "dataframe object":
- [dataframe].shape: Presents the dimensions of the dataframe as a tuple (an attribute, so no parentheses)
- [dataframe].head(): Shows the first 5 rows of the dataframe
- [dataframe].describe(): Generates descriptive statistics for the numeric columns
- [dataframe].info(): Generates an overview of the columns and the datatypes of their values

Let's give them a try:

In [4]:
df_train.shape

(891, 12)

In [5]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
df_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


### Is the data tidy?
Before we dive into the Titanic dataset, we will have a look at the idea of "tidy data". Tidy data is a concept introduced by Hadley Wickham of R fame. It projects concepts from the world of relational databases onto the world of datasets.

[pandas tidy data](http://localhost:8888/notebooks/Documents/bootcamps/Titanic/pandas_tidy.ipynb)

- Is it tidy? Look at the values, observations (rows) and variables (columns). Is any data missing?
- How is the data structured?
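
To make the idea of "one variable per column, one observation per row" concrete, here is a small sketch of turning an untidy table into a tidy one with `melt()`. The table `wide` and its counts are made up for illustration and are not Titanic figures:

```python
import pandas as pd

# Hypothetical "wide" table: one column per sex, so the variable "Sex"
# is hidden in the column headers. The counts are made up.
wide = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "female": [10, 20, 30],
    "male": [40, 50, 60],
})

# melt() moves the headers into a "Sex" column, giving one row
# per (Pclass, Sex) observation.
tidy = wide.melt(id_vars="Pclass", var_name="Sex", value_name="Count")
print(tidy)
```

The Kaggle training file is already close to this shape: every passenger is a row and every variable has its own column.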

Since we have the luxury of access to two prepared datasets, a training set and a test set, let's see if we can "learn" from the training set to predict the fate of the people in the test set. Survived: 1 or 0?

Since a lot of people (almost 62% of the training set) died, we can predict that the majority of people in the test set did not survive the disaster either. Of course, our test file does not contain a "Survived" column, because that is what we are going to predict from our training set. So we add a "Survived" column to our test dataframe and fill it with all zeros (0 = died).

Not bad for a first try: we will probably have around 60% correct predictions.
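
The all-zeros baseline could look roughly like the sketch below. It assumes a tiny stand-in for each dataframe: `train_sample` holds the Survived values of the five rows shown by `df_train.head()`, and `test_sample` uses illustrative passenger ids.

```python
import pandas as pd

# Survived values of the five rows shown by df_train.head().
train_sample = pd.DataFrame({"Survived": [0, 1, 1, 1, 0]})

# Stand-in for df_test, which has no "Survived" column (ids are illustrative).
test_sample = pd.DataFrame({"PassengerId": [892, 893, 894]})

# Baseline prediction: nobody survived.
test_sample["Survived"] = 0

# On labelled data, the accuracy of this rule equals the share of
# rows with Survived == 0.
baseline_accuracy = (train_sample["Survived"] == 0).mean()
print(baseline_accuracy)
```

On these five rows the share is only 0.4, but on the full training set the mean of Survived reported by `describe()` is 0.383838, so the same expression gives about 0.616.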

What about the old adage: "Women and children first"?

Let's have a look at women versus men.
Were we to fill in 1's for all women (and 0's for all men), we would hit around 75% correct predictions.
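
A sketch of this comparison, using only the Sex and Survived values of the five rows shown by `df_train.head()` (far too few rows to draw conclusions from, but enough to show the mechanics):

```python
import pandas as pd

# Sex and Survived of the five rows shown by df_train.head().
train_sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Survival rate per sex.
survival_by_sex = train_sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# Rule: predict 1 (survived) for women and 0 (died) for men.
predicted = (train_sample["Sex"] == "female").astype(int)

# Share of rows where the rule matches the known outcome.
accuracy = (predicted == train_sample["Survived"]).mean()
print(accuracy)
```

The rule happens to be right for all five sample rows; on the full training set the score will be lower.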

And then have a look at the Age variable.
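
One possible way to look at Age is to bin it into groups. The sketch below uses illustrative ages (including the minimum 0.42 and maximum 80 from the `describe()` output, plus one missing value); the bin edges and the labels are assumptions, not something the dataset prescribes.

```python
import pandas as pd

# Illustrative ages; None mimics the missing Age values in the dataset.
ages = pd.Series([0.42, 22.0, 38.0, 80.0, None], name="Age")

# Assumed bins: (0, 16] child, (16, 60] adult, (60, 100] senior.
age_group = pd.cut(ages, bins=[0, 16, 60, 100],
                   labels=["child", "adult", "senior"])

# dropna=False keeps the passengers without a known age visible.
print(age_group.value_counts(dropna=False))
```

Missing ages stay missing after binning, so they have to be handled separately before the column can be used for predictions.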

Then the Pclass variable.
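
The same grouping approach works for passenger class. A minimal sketch, again limited to the five rows shown by `df_train.head()`:

```python
import pandas as pd

# Pclass and Survived of the five rows shown by df_train.head().
train_sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# Survival rate per passenger class.
survival_by_class = train_sample.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

Replacing `train_sample` with `df_train` should give the survival rate for each of the three classes.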

#### Feature Engineering

What we are doing here is called "Feature Engineering". This is where humans easily outshine computers. It is an area of creativity, ingenuity, and domain knowledge. Of course, being able to use computers to aggregate large amounts of data, tidy it and easily prepare visualizations is a big plus.

We can do something with the titles in the name strings of the passengers.
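
Extracting the title could be sketched as follows, assuming every name follows the "Surname, Title. Given names" pattern seen in the first rows (the regular expression and the column name `Title` are my own choices):

```python
import pandas as pd

# Names as shown by df_train.head().
names = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
]})

# Capture the text between the comma and the first full stop.
names["Title"] = names["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(names["Title"].value_counts())
```

Names that do not match the pattern end up as NaN, which makes them easy to spot.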

Or, work on passengers travelling together.
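
For passengers travelling together, one simple option is to combine SibSp and Parch into a family size. This is a sketch; the feature names `FamilySize` and `IsAlone` are hypothetical:

```python
import pandas as pd

# SibSp and Parch of the five rows shown by df_train.head().
train_sample = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Siblings/spouses + parents/children + the passenger themselves.
train_sample["FamilySize"] = train_sample["SibSp"] + train_sample["Parch"] + 1

# Flag passengers travelling without any family aboard.
train_sample["IsAlone"] = (train_sample["FamilySize"] == 1).astype(int)
print(train_sample)
```

Passengers sharing a Ticket value could be grouped in a similar way, but that is left out here.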

We can shift our emphasis a bit from analyzing to learning: Training models.