### Titanic

There are three contexts here:

    - Jupyter notebooks. We touched upon these during the introduction. A notebook is a kind of playground, especially suited to computing with data, because it brings together data, code, and results (visualisations) in one environment.
    - So-called "Big Data": data analysis with computers opens up new possibilities for economic and econometric practice and research.
    - The data we are going to work with: the Titanic disaster. There are various technical ways to look at it. Below is a link to a computer-generated simulation of the actual sinking of the Titanic, made by the team of James Cameron.
    
<a href="https://www.youtube.com/watch?v=FSGeskFzE0s">James Cameron: How the Titanic sank</a>

We are going to look at the data about the persons on board with the help of the computer and Python as our programming environment.

### What do we know already?

You have probably seen the movie Titanic (with Kate Winslet and Leonardo DiCaprio, 1997), one of several movies made about the disaster. So you probably know the following:

    - The disaster took place during the night of April 14, 1912 when the ship hit an iceberg
      on her maiden voyage from Southampton (UK) to New York (US) via Cherbourg (FR) and Queenstown (IRE).
    - The loss of lives was 1501, out of a total of 2207 passengers and crew;
    - There was a shortage of life-boats.

### What do we want to know?

You may already have questions and want to use the data to find answers, or you may want to play with the data first in order to come up with questions.

Both approaches suggest the following steps:

    - Explore the data (load it, look at it)
    - Clean the data (missing values, splitting columns, etc.)
    - Plot (try to visualize correlations, insights, ...)
    - Assumptions (try to formulate hypotheses, rinse and repeat)

### Explore the data

Mind you, we are cheating big time here! We start with existing datasets that allow us to quickly load and explore data, as we will see shortly. If you are a data scientist out in the wild, you will probably get assignments *without* any accompanying datasets. A large proportion of your time will be spent on acquiring data (searching, scraping), cleaning data, and combining data from various sources (a lot of tweaking and cleaning).

In [None]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
# Set up an environment to be able to explore the data
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
matplotlib.style.use('ggplot')
#pd.options.display.max_columns = 100
#pd.options.display.max_rows = 100

Our source is a csv ("comma separated values") file: a plain text file in which each chunk of information is separated by a comma. That means we can open the file in a text editor and look at its contents:

In [None]:
!aquamacs /Users/peter/Documents/bootcamps/data/titanic/train.csv
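
The editor command above only works on a machine that has that editor and that path. As an alternative, here is a minimal sketch of peeking at the first lines of a csv file from Python itself. The helper name `peek` and the three-line sample are made up for illustration; the sample stands in for the real file.

```python
import io

# Hypothetical three-line sample standing in for the real train.csv
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)


def peek(handle, n=5):
    """Return the first n lines of an open text file, without trailing newlines."""
    lines = []
    for i, line in enumerate(handle):
        if i >= n:
            break
        lines.append(line.rstrip("\n"))
    return lines


# Header line plus the first data row
first_lines = peek(sample_csv, n=2)
for line in first_lines:
    print(line)
```

With a real file, the same helper could be called inside `with open(path) as handle:`, where `path` points at the csv file.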

In [None]:
df_train = pd.read_csv('../data/titanic/train.csv', sep=",")

Now that we have our data in a dataframe (a table whose columns, the variables, can each hold a different datatype), we can take several quick looks at the contents using methods of the dataframe object:

    - head (what is there? shows the first five rows of the dataset with the header values, if any)
    - info (how many total entries? what are the columns and their types? how many non-null values per column?)
    - describe (summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles, maximum)

In [None]:
df_test = pd.read_csv('../data/titanic/test.csv', sep=",")

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()
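
`info()` reports the non-null count per column. The complementary view, the number of missing values per column, can be sketched as follows. The toy frame and its values are made up; only the column names mirror the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
})

# isna() marks missing cells as True; summing the Booleans counts them per column
missing = toy.isna().sum()
print(missing)
```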

In [None]:
# We can select the values of a column using the iloc indexer (square brackets) and the column position (counting from 0)
df_train.iloc[:,1]

This notation with the "dangling comma" is a reminder of the provenance of the Pandas library: R data frames. There, [,1] denotes a column and [1,] denotes a row.
Hence the following use of the iloc indexer, which selects the rows up to (but not including) position 1, that is, the first row:

In [None]:
df_train.iloc[:1,]

In [None]:
# And because iloc[:,1] returns the Survived column, we can also use its label together with the loc indexer:
df_train.loc[:,'Survived']
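
The difference between selecting by position (`iloc`) and by label (`loc`) can be checked on a small example. This is a sketch on a made-up three-row frame, not on the real data:

```python
import pandas as pd

# Made-up frame; only the column names follow the Titanic data
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

by_position = toy.iloc[:, 1]       # all rows, second column (positions start at 0)
by_label = toy.loc[:, "Survived"]  # all rows, column selected by its label
first_row = toy.iloc[:1, :]        # rows before position 1, all columns

print(by_position.equals(by_label))
print(first_row.shape)
```

Selecting a single column gives back a Series, while a row slice such as `iloc[:1, :]` gives back a dataframe.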

In [None]:
# We can slice columns with the same kind of subscript notation we saw when working with lists
# For example, we take the first 10 values of the column Survived
# Column names are also available as attributes of the dataframe object (case-sensitive): df_train.Survived[0:10] also works
df_train['Survived'][0:10]

We see that the contents of the survived column are floats, which is true, but they actually function as Boolean types: True (1.0) and False (0.0).
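
A minimal sketch of what "functioning as Booleans" means in practice, using made-up values in the same 1.0 / 0.0 encoding:

```python
import pandas as pd

# Made-up values in the same 1.0 / 0.0 encoding as the Survived column
survived = pd.Series([1.0, 0.0, 0.0, 1.0, 0.0], name="Survived")

as_bool = survived.astype(bool)   # 1.0 becomes True, 0.0 becomes False
n_survivors = int(as_bool.sum())  # True counts as 1 when summed
rate = survived.mean()            # the mean of a 0/1 column is the share of 1s

print(n_survivors, rate)
```

One caveat: a missing value (NaN) converts to True with `astype(bool)`, so a column with gaps would need to be cleaned first.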

Not many people survived: df.info() seems to suggest that 486 people were registered in boats out of a total of 1309 passengers, which would mean that roughly 37% survived. Who actually survived, and is survival related to age, sex, and travel class?

We can prepare quick sneak previews by slicing the data on several columns (note 1: we have to pass the columns we are interested in as a list; note 2: by selecting the first ten passengers we are looking at first class passengers):

In [None]:
df_train[['Survived','Age','Sex','Pclass']][0:10]

Let's have a quick look at the third class. We can use the opposite of the head command (which is aptly called "tail") or we can use a negative slice:

In [None]:
df_train[['Survived','Age','Sex','Pclass']][-10:]
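
The negative slice and the `tail` method mentioned above should give the same rows. A small check on a made-up frame of twenty rows:

```python
import pandas as pd

# Made-up frame of 20 rows, ordered by class like the preview above
toy = pd.DataFrame({
    "Survived": [1, 0] * 10,
    "Pclass": [1] * 5 + [2] * 5 + [3] * 10,
})

cols = ["Survived", "Pclass"]
last_by_slice = toy[cols][-10:]    # negative slice: the last ten rows
last_by_tail = toy[cols].tail(10)  # tail(10): also the last ten rows

print(last_by_slice.equals(last_by_tail))
```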

Right, we might have something here. Let's get to work.

    - Correlations: let's focus on Survived in relation to sex, age, and class.
    - Completions: age data is missing for some passengers (1046 non-null data points), and we might want to "factorize" this column, that is, group the ages into categories.
    - Corrections: we might want to drop certain columns that we are not going to use in our initial analysis: Ticket, Cabin, PassengerId.
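
The completions and corrections could look roughly like this. The sketch works on a made-up frame. Filling the gaps with the median age is an assumption chosen for illustration and only one of several options; it is not necessarily the approach taken later.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names follow the Titanic data
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Cabin": [None, "C85", None, "C123"],
})

# Corrections: drop the columns we do not use in the initial analysis
slim = toy.drop(columns=["Ticket", "Cabin", "PassengerId"])

# Completions: count the gaps, then fill them with the median age
# (median imputation is an assumption, one of several options)
n_missing_age = int(slim["Age"].isna().sum())
slim["Age"] = slim["Age"].fillna(slim["Age"].median())

print(n_missing_age)
print(slim)
```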

From here we can choose several paths to explore the data a bit more. For example, we can use the pandas function crosstab to make a cross tabulation of gender and survival:

In [None]:
pd.crosstab(df_train.Sex, df_train.Survived)
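
Raw counts are hard to compare when the groups differ in size. As a sketch on made-up data, the `normalize="index"` argument of `pd.crosstab` turns each row into proportions:

```python
import pandas as pd

# Made-up data: four men and two women
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

counts = pd.crosstab(toy.Sex, toy.Survived)                      # raw counts
shares = pd.crosstab(toy.Sex, toy.Survived, normalize="index")   # each row sums to 1

print(counts)
print(shares)
```

In the `shares` table, the column labelled 1 reads directly as the survival rate per group.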

We can try to do a quick crosstab between Survived and Age, but this, of course, blows up, because almost every distinct age gets its own row. To do something sensible with age, we need to clean it up and "factorize" it.

In [None]:
pd.crosstab(df_train.Age, df_train.Survived)
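
One way to "factorize" age is to cut it into bins with `pd.cut` before tabulating. The sketch below uses made-up ages. The bin edges and labels are assumptions chosen for illustration, not categories defined by the dataset.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
toy = pd.DataFrame({
    "Age": [4.0, 15.0, 30.0, 45.0, 70.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Hypothetical bin edges and labels; intervals are closed on the right, e.g. (0, 12]
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows with a missing age fall into no bin and are left out of the table
table = pd.crosstab(toy["AgeGroup"], toy["Survived"])
print(table)
```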

In [None]:
# We can use the method unique() to get hold of unique values within a column:
df_train.Embarked.unique()
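
`unique()` lists the distinct values but not how often each occurs. A sketch on made-up values of how `value_counts` complements it:

```python
import numpy as np
import pandas as pd

# Made-up values in the style of the Embarked column, with one gap
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"], name="Embarked")

ports = embarked.unique()                     # distinct values, NaN included
counts = embarked.value_counts(dropna=False)  # frequency per value, NaN kept

print(ports)
print(counts)
```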

At the beginning we also loaded a test set. Both sets come from the file "Titanic3.csv", which we divided at random into a training set and a test set. What can we learn from the training set in order to predict the survival of the persons in the test set?

Add a column to the test df and fill in with all zeros.

Construct a new dataframe of two columns: PassengerId, Survived

Write that selection to a file.
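
The three steps above can be sketched as follows. The frame is a made-up stand-in for `df_test`, and the output goes to an in-memory buffer; with real data one would pass a file name of one's own choosing to `to_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for df_test
df_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Step 1: add a Survived column filled with zeros
df_test["Survived"] = 0

# Step 2: build a new dataframe with just the two columns
submission = df_test[["PassengerId", "Survived"]]

# Step 3: write the selection out as csv (here to an in-memory buffer)
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index of the dataframe out of the file, so only the two named columns are written.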

### Selected bibliography

#### About the RMS Titanic

- Encyclopedia Titanica. Titanic Facts, History and Biography: https://www.encyclopedia-titanica.org/

#### Economics, Machine Learning, and the Titanic

- Bruno S. Frey, David A. Savage, and Benno Torgler, Behavior under Extreme Conditions: The Titanic Disaster, in: Journal of Economic Perspectives, vol. 25, number 1, Winter 2011, pp. 209-222, DOI: http://dx.doi.org/10.1257/jep.25.1.209

- Hal R. Varian, Big Data: New Tricks for Econometrics, in: Journal of Economic Perspectives, vol. 28, number 2, Spring 2014, pp. 3-28 (Titanic, pp. 7-12), DOI: http://dx.doi.org/10.1257/jep.28.2.3

- Sendhil Mullainathan and Jann Spiess, Machine Learning: An Applied Econometric Approach, in: Journal of Economic Perspectives, vol. 31, number 2, Spring 2017, pp. 87-106, DOI: http://dx.doi.org/10.1257/jep.31.2.87

- Francis X. Diebold, All of Machine Learning in One Expression, retrieved from the blog "No Hesitations" (posted: Monday, January 9, 2017) on 20-09-2017 at: https://fxdiebold.blogspot.nl/2017/01/all-of-machine-learning-in-one.html

#### Statistics and computing (R programming language)

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2nd ed., New York, Springer Science and Business Media, 2009, http://statweb.stanford.edu/~tibs/ElemStatLearn/

- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, New York, Springer Science and Business Media, 2013, http://www-bcf.usc.edu/~gareth/ISL/index.html (Python code for some chapters of the book can be found here: https://github.com/JWarmenhoven/ISLR-python)