# Predicting survival for passengers on the Titanic

In these exercises you will perform statistical analyses and train predictive models to show what kinds of people were most likely to get off the Titanic alive. It is well known that gender and age played a big role (in line with the [women-and-children-first principle](http://en.wikipedia.org/wiki/Women_and_children_first)), as did passenger class.

In [1]:
%matplotlib inline
import sklearn
import pandas as pd
import numpy as np

In [2]:
titanic = pd.read_csv("titanic_train.csv", index_col=0)
D_train = titanic.copy()
labels = titanic.Survived.values

# Remove class attribute
del D_train['Survived']

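# Deal with missing values: drop the sparse Cabin column and
# fill the remaining gaps with the most frequent value of each column.
# Assigning the result back avoids calling fillna(inplace=True) on a
# column accessed through the frame, which newer pandas versions warn about.
del D_train['Cabin']
D_train['Embarked'] = D_train['Embarked'].fillna(D_train['Embarked'].mode()[0])
D_train['Age'] = D_train['Age'].fillna(D_train['Age'].mode()[0])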

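We'll start by creating a function that converts a pandas `DataFrame` to a 2-dimensional feature matrix, returning the matrix and the vectorizer used to create it. Note that `DictVectorizer` returns a SciPy sparse matrix by default; call `X.toarray()` if you need a dense `numpy` array.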

In [None]:
from sklearn.feature_extraction import DictVectorizer
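def vectorize_data_frame(data_frame):
    """Turn a DataFrame into a feature matrix and return it with the fitted vectorizer."""
    vectorizer = DictVectorizer()
    # One dict per row; string values are one-hot encoded, numbers are kept as-is
    list_of_dicts = data_frame.to_dict('records')
    X = vectorizer.fit_transform(list_of_dicts)

    return X, vectorizer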

X, vectorizer = vectorize_data_frame(D_train)
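
To see what the vectorizer does to individual columns, here is a minimal sketch on a hypothetical three-row data frame (the values are made up for illustration and are not taken from `titanic_train.csv`). It uses the same `DictVectorizer` call pattern as `vectorize_data_frame` above.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data, not taken from titanic_train.csv
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
    "Fare": [7.25, 71.28, 13.0],
})

toy_vectorizer = DictVectorizer()
X_toy = toy_vectorizer.fit_transform(toy.to_dict("records"))

# String-valued columns are expanded into one indicator column per value;
# numeric columns are passed through unchanged.
print(toy_vectorizer.feature_names_)
print(X_toy.toarray())
```

Because every distinct string becomes its own column, a text column with many distinct values will add many columns to the matrix.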

### Training with cross validation

When you have a small data set, cross-validation is often preferable to a single train/test split. It also reduces the risk that an unlucky split leaves you with an unrepresentative test set, since in the end every training example is used for testing exactly once.

In sklearn you have several methods for performing cross-validation at your disposal. The two most common methods are:

- The function `cross_val_score` from the `model_selection` module (called `cross_validation` in scikit-learn versions before 0.18). If you are only interested in a single performance figure, like accuracy or F1, this is the way to go.
- A cross-validation generator, also available in the `model_selection` module. Used in a `for` loop, its `split` method yields a tuple of training set indices and test set indices on each iteration.

In [None]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
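
The two approaches can be sketched as follows. This is an illustration only: it uses the `sklearn.model_selection` module of current scikit-learn releases, synthetic data from `make_classification` as a stand-in for the Titanic features, and accuracy as the metric. The names `X_demo`, `y_demo` and `clf` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; replace with your own feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Approach 1: one score per fold from a single call
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())

# Approach 2: a generator of (train_indices, test_indices) pairs
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X_demo, y_demo):
    print(len(train_idx), len(test_idx))
```

When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds, so both approaches split the data in the same spirit.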

**Exercise** Compute the average F1 score using a logistic regression classifier and 5-fold stratified cross-validation.

In [None]:
# Your code below


**Exercise** Use the `StratifiedKFold` cross-validation strategy to get predicted labels for the entire training set. Do this in a `for` loop. Set $k=5$ and compute the F1 score.

Hint: Start by initializing a `numpy` array to hold the predicted classes for the whole training set. It is good practice to fill the array with a value that cannot be mistaken for a class, e.g. **-1**, so that you will notice if the `for` loop failed to update some of the values.

In [None]:
# Your code here
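
The sentinel pattern from the hint could look roughly like this. It is one possible approach rather than the reference solution, and it runs on synthetic stand-in data (`X_demo`, `y_demo` are placeholder names) so that it is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; replace with your own feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# -1 is not a valid class here, so untouched entries are easy to spot
predicted = np.full(len(y_demo), -1)

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_clf = LogisticRegression(max_iter=1000)
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    predicted[test_idx] = fold_clf.predict(X_demo[test_idx])

# Every sample belongs to exactly one test fold, so no -1 should remain
assert (predicted != -1).all()
print(f1_score(y_demo, predicted))
```

Note that an F1 score computed once over the pooled predictions is not necessarily identical to the average of the per-fold F1 scores.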


### Creating new attributes based on existing ones

Two of the attributes in the dataset conflate different measures. According to the documentation, *SibSp* is the number of siblings *and* spouses, and *Parch* denotes the number of parents *and* children. In most cases, though, you would have either siblings *or* spouses, and either parents *or* children. Here we try to create non-conflated versions of the attributes by assuming that if you are a child, *SibSp* means siblings, and if you are an adult, the attribute refers to spouses. By similar logic, *Parch* means parents for children and children for adults.

```
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
```

**Exercise** Add five new attributes to the training set `D_train` as indicated below:

* `Is_child`: True when the person is younger than 16 years
* `Sib`: Number of siblings
* `Sp`: Number of spouses
* `Par`: Number of parents
* `Ch`: Number of children

In [None]:
# Your code here
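
A minimal sketch of the splitting rule described above, shown on a hypothetical four-row frame rather than on `D_train`. It assumes the column names `Age`, `SibSp` and `Parch`, and uses `np.where` to pick a value depending on `Is_child`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not from the real dataset
toy = pd.DataFrame({
    "Age": [8.0, 35.0, 15.0, 40.0],
    "SibSp": [2, 1, 0, 0],
    "Parch": [2, 3, 1, 0],
})

toy["Is_child"] = toy["Age"] < 16
# Children: SibSp counts siblings, Parch counts parents.
# Adults:   SibSp counts spouses,  Parch counts children.
toy["Sib"] = np.where(toy["Is_child"], toy["SibSp"], 0)
toy["Sp"] = np.where(toy["Is_child"], 0, toy["SibSp"])
toy["Par"] = np.where(toy["Is_child"], toy["Parch"], 0)
toy["Ch"] = np.where(toy["Is_child"], 0, toy["Parch"])
print(toy)
```

Keep in mind that missing ages were filled in earlier, so for those passengers `Is_child` reflects the imputed value rather than a recorded age.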


**Exercise** Compute the F1 score using 5-fold cross-validation for the augmented dataset.

In [None]:
# Your code here


### Binning age groups

In this exercise you should bin the values of the `Age` column into 10-year spans.

**Exercise** Start by making histograms of the age distribution, with separate histograms for individuals with `Survived=True` and `Survived=False`. Each bin should correspond to a single year.

In [None]:
# Your code here

**Exercise** Create a new column `Age_group` binning the values of `Age` into ten year ranges. The values of this column should be strings of the form:

````
00-09
10-19
20-29
...
````

Output the contents of the new column at the end of the cell.

In [None]:
# Your code here
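
One way to build labels of this form is sketched below on a handful of made-up ages. The variable names are placeholders; `pd.cut` with explicit labels would be an alternative.

```python
import pandas as pd

# Made-up example ages, including fractional values
ages = pd.Series([0.42, 9.0, 10.0, 24.0, 38.5, 80.0], name="Age")

# Lower edge of each ten-year span: 0, 10, 20, ...
decade_start = (ages // 10 * 10).astype(int)

# Zero-padded labels such as "00-09" and "20-29"
age_group = decade_start.map(lambda d: f"{d:02d}-{d + 9:02d}")
print(age_group.tolist())
```

Zero-padding the labels means that sorting them as strings also sorts them by age.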

**Exercise** Make a table displaying the mean survival rate in the different age groups. You can also try to look at survival rates for more specific subgroups by grouping on multiple columns at the same time, e.g. `Age_group` and `Sex`, or `Age_group` and `Sib`.

In [None]:
# Your code here
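
Grouping on one or several columns can be sketched with `groupby`, here on a hypothetical six-row frame. Since `Survived` was deleted from `D_train` at the top of the notebook, on the real data you would need to group a frame that still contains it (for example by adding `labels` back as a column).

```python
import pandas as pd

# Hypothetical toy rows, not from the real dataset
toy = pd.DataFrame({
    "Age_group": ["00-09", "00-09", "20-29", "20-29", "20-29", "20-29"],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 column is the survival rate within each group
by_age = toy.groupby("Age_group")["Survived"].mean()
print(by_age)

# Grouping on two columns; unstack() turns Sex into table columns
by_age_and_sex = toy.groupby(["Age_group", "Sex"])["Survived"].mean().unstack()
print(by_age_and_sex)
```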

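**Exercise** Run cross-validation with the same parameters as in the previous cross-validation exercise (5 folds, F1 score), now with the `Age_group` column included. Also try removing the original `Age` attribute to see if that makes any difference.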

In [None]:
# Your code here

### Optional: Undoing previous preprocessing

The dataset includes a column with the number of the cabin, if available, which we removed due to sparsity. However, this might be throwing away valuable information. Encoded in the cabin number is the location of the cabin, something which conceivably has a great influence on a person's survival chances.

![Plan of Titanic](http://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Titanic_side_plan_1911.png/1280px-Titanic_side_plan_1911.png)

As an optional exercise you can extract this information into new columns and see if it improves model performance. Here are a couple of suggestions:

* Create columns `At_deck_X` (for decks A to G), which are True if the cabin is on that floor, and False otherwise. Pandas has useful vectorized string functions which make this easy to do. For instance, `titanic.Cabin.str.contains("A")` gives back True for the instances in which the Cabin column contains an "A".
* Create columns `At_deck_X_or_below`, which is True if the cabin is on the given floor or any lower floor.
* Create a column `Deck` that maps the deck letters ("A" through "G") to numbers. If you do this you have to be careful about what you put in as missing values.

In [None]:
# Your code here
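
As a starting point, the suggestions above might be sketched like this on a few made-up cabin strings. The sketch assumes that the first character of the cabin string is the deck letter, which may not hold for every row of the real data, and it picks 0 as an arbitrary code for "deck unknown".

```python
import numpy as np
import pandas as pd

# Made-up cabin values, including missing ones
cabin = pd.Series(["C85", np.nan, "E46", "B57 B59", np.nan, "G6"], name="Cabin")

# Assumption: the first character is the deck letter; missing stays NaN
deck_letter = cabin.str[0]

# One boolean column per deck; missing cabins become False on every deck
at_deck = pd.DataFrame({
    f"At_deck_{d}": (deck_letter == d) for d in "ABCDEFG"
})

# Map A..G to 1..7 and use 0 as an explicit "unknown" code (an arbitrary choice)
deck_number = deck_letter.map({d: i for i, d in enumerate("ABCDEFG", start=1)})
deck_number = deck_number.fillna(0).astype(int)
print(deck_number.tolist())
```

Whatever value you choose for unknown decks, a linear model will treat it as a point on the same numeric scale as the real decks, which is why the choice deserves some care.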