### Titanic

There are three contexts here:

    - Jupyter notebooks. We touched upon these during the introduction. A notebook is a kind of playground that is especially suited to computing with data, because it brings data, code and results (visualisations) together in one environment.
    - So-called "Big Data" and computer-based data analysis, which open up new possibilities for economic and econometric practice and research.
    - The data we are going to work with: the Titanic disaster. There are various technical ways to look at it. Below is a link to a computer-generated simulation of the sinking of the Titanic, made by James Cameron's team.

<a href="https://www.youtube.com/watch?v=FSGeskFzE0s">James Cameron: How the Titanic sank</a>

We are going to look at the data about the persons on board with the help of the computer, using Python as our programming environment.

### What do we know already?

You have probably seen the movie Titanic (1997, with Kate Winslet and Leonardo DiCaprio), one of several movies made about the disaster. So you probably know the following:

    - The disaster took place during the night of April 14, 1912, when the ship hit an iceberg
      on her maiden voyage;
    - Of the 2207 people on board, 1501 lost their lives;
    - There was a shortage of lifeboats.

### What do we want to know?

You may already have some questions and want to use the data to find answers, or you may prefer to play with the data first in order to come up with questions.

Both approaches suggest the following steps:

    - Explore the data (load it, look at it)
    - Clean the data (missing values, splitting columns, etc.)
    - Plot (try to visualize correlations, insights, ...)
    - Assumptions (try to formulate hypotheses, rinse and repeat)
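
The steps above can be sketched on a tiny, made-up table. The column names mimic the Titanic data, but the values in `toy` are invented for illustration, and filling missing ages with the median is only one of several possible cleaning choices.

```python
import pandas as pd

# Made-up sample, NOT the real Titanic data.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "age": [29.0, None, 40.0, 18.0, 62.0],
    "survived": [1, 0, 0, 1, 0],
})

# Explore: look at the first rows.
print(toy.head())

# Clean: replace the missing age with the median of the known ages.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Summarise (the basis for a plot): share of survivors per sex.
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)
```

A summary like `rate_by_sex` is what you would typically hand to a plotting call, and any pattern it shows is a candidate hypothesis to test on the full data.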

### Explore the data

Mind you, we are cheating big time here! We start with existing datasets that allow us to load and explore data quickly, as we will see shortly. If you are a data scientist out in the wild, you will probably get assignments *without* any accompanying datasets. A large proportion of your time will then be spent on acquiring data (searching, scraping), cleaning it, and combining data from various sources (which takes a lot of tweaking and cleaning).

In [9]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
# Set up an environment to be able to explore the data
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
matplotlib.style.use('ggplot')
#pd.options.display.max_columns = 100
#pd.options.display.max_rows = 100

A CSV ("comma-separated values") file is a plain text file in which each piece of information is separated by a delimiter, traditionally a comma. Because it is plain text, we can open the file in a text editor and look at the contents of our source file:

In [15]:
!aquamacs /Users/peter/Documents/bootcamps/data/titanic/titanic3.csv
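
If no suitable text editor is at hand, the first few lines of the file can also be read from Python itself. The sketch below is one way to do that. The helper `peek` and the file `sample.csv` are hypothetical names used for illustration only, and the sample content is made up.

```python
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=3):
    """Return the first n lines of a text file without reading all of it."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Write a small made-up file so the example is self-contained.
sample = Path(tempfile.mkdtemp()) / "sample.csv"
sample.write_text("name;age\nAlice;30\nBob;25\nCarol;41\n", encoding="utf-8")

first_lines = peek(sample, n=2)
print(first_lines)
```

Looking at the raw lines like this shows which delimiter the file actually uses before it is loaded into pandas.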

In [10]:
df = pd.read_csv('../data/titanic/titanic3.csv', sep=";")
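
The `sep=";"` argument tells pandas that the fields are separated by semicolons rather than commas. A minimal sketch of why this matters, using a made-up two-row text instead of the real file:

```python
import io

import pandas as pd

# Made-up semicolon-delimited text standing in for a file on disk.
text = "name;age\nAlice;30\nBob;25\n"

# With the default separator (a comma) each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(text))

# With the matching separator the two columns are split as intended.
right = pd.read_csv(io.StringIO(text), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If a freshly loaded dataframe has only one column, a mismatched separator is usually the first thing to check.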

Now that we have our data in a dataframe (a table whose columns, the variables, can each hold a different data type), we can have a quick look at the contents with the head method of the dataframe:

In [13]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29,0.0,0.0,24160,2113375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,9167,1.0,2.0,113781,1515500,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2,1.0,2.0,113781,1515500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30,1.0,2.0,113781,1515500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1.0,2.0,113781,1515500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
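
Beyond the first rows, it often helps to check the data type of each column and how many values are missing. The sketch below does this on a small invented table (`people` is not the Titanic data). The same calls should work on `df`.

```python
import pandas as pd

# Invented table for illustration only.
people = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [29.0, None, 2.0],
    "survived": [1, 0, 0],
})

# Data type of every column.
column_types = people.dtypes
print(column_types)

# Number of missing values per column.
missing = people.isna().sum()
print(missing)
```

Columns with an unexpected type or many missing values are natural starting points for the cleaning step.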


### Selected bibliography

#### Economics, Machine Learning, and the Titanic

- Bruno S. Frey, David A. Savage, and Benno Torgler, Behavior under Extreme Conditions: The Titanic Disaster, in: Journal of Economic Perspectives, vol. 25, number 1, Winter 2011, pp. 209-222, DOI: http://dx.doi.org/10.1257/jep.25.1.209

- Hal R. Varian, Big Data: New Tricks for Econometrics, in: Journal of Economic Perspectives, vol. 28, number 2, Spring 2014, pp. 3-28 (Titanic, pp. 7-12), DOI: http://dx.doi.org/10.1257/jep.28.2.3

- Sendhil Mullainathan and Jann Spiess, Machine Learning: An Applied Econometric Approach, in: Journal of Economic Perspectives, vol. 31, number 2, Spring 2017, pp. 87-106, DOI: http://dx.doi.org/10.1257/jep.31.2.87

- Francis X. Diebold, All of Machine Learning in One Expression, retrieved from the blog "No Hesitations" (posted: Monday, January 9, 2017) on 20-09-2017 at: https://fxdiebold.blogspot.nl/2017/01/all-of-machine-learning-in-one.html

#### Statistics and computing (R programming language)

- 