### Data Science, Week 2

# Predicting who will survive on the Titanic

__Goal__: Brushing up on Python skills and getting to know machine learning tools


## The Titanic competition on Kaggle

This is one of the continually renewed tutorial competitions on [Kaggle](https://www.kaggle.com/c/titanic). I encourage you to get a Kaggle account and try your hand at this. We'll use the same dataset for our little exercise here.

## Data

Your task is to create the **most generalizable** model from datasets of different sizes. 

We will use 3 partitions of the [Titanic](https://www.kaggle.com/c/titanic) dataset of size 100, 400, and 891. This is a fairly straightforward binary classification task with the goal of predicting *who survived* when the Titanic sank.

All datasets have the following columns: 

| Variable | Definition                                 | Key                                            | 
|:----------|:------------------------------------------|:-----------------------------------------------|
| Survived | Survival                                   | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| SibSp    | # of siblings / spouses aboard the Titanic |                                                |
| Parch    | # of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                              |                                                |
| Fare     | Passenger fare                             |                                                |
| Cabin    | Cabin number                               |                                                |
| Embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

There are some peculiarities in the data: some columns contain missing values, some are redundant, and some might only be useful with feature engineering [- what is that?].

## A toy analysis - to be improved

The following shows a simple example of fitting a logistic regression model to the data with 400 training examples.

- Run the code and discuss ways to improve it

In [None]:
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef, accuracy_score, roc_auc_score

In [None]:
data_folder = "data" # set to the name of the folder where you keep the data
test = pd.read_csv(os.path.join(data_folder, "test.csv"))
train = pd.read_csv(os.path.join(data_folder, "train_400.csv"))

In [None]:
# Let's take a quick look at the data
train.head()

In [None]:
# Are there missing values in the train set?
train.isnull().sum()

In [None]:
# What about the test set?
test.isnull().sum()

In [None]:
# So - lots of missing age values. Let's fill them with the mean value of the column [- is this a good idea!?]
train["Age"] = train["Age"].fillna(train["Age"].mean())
test["Age"] = test["Age"].fillna(train["Age"].mean())

# 1 missing Fare in test, filled with the mean
test["Fare"] = test["Fare"].fillna(train["Fare"].mean())

# This procedure is called 'mean imputation' - can you think of (possibly) better ways to impute the missing values?
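
One possible answer to the question above is to use the median, or a group-wise median, instead of the overall mean. The following is only a minimal sketch on a small made-up DataFrame (not the Titanic files); the names `toy_train`, `age_filled_median`, and `age_filled_group`, as well as the choice of grouping by `Pclass` and `Sex`, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data (NOT real Titanic values), just to show the mechanics
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, 40.0, np.nan, 20.0, 22.0, np.nan],
})

# Option 1: median imputation (less sensitive to outliers than the mean)
age_median = toy_train["Age"].median()
age_filled_median = toy_train["Age"].fillna(age_median)

# Option 2: group-wise median, here per (Pclass, Sex) group
group_median = toy_train.groupby(["Pclass", "Sex"])["Age"].transform("median")
age_filled_group = toy_train["Age"].fillna(group_median)

print(age_filled_median.tolist())
print(age_filled_group.tolist())
```

Whichever statistic you pick, it should be computed on the training data only and then reused for the test data, as the cell above does with the mean.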

In [None]:
# Let's see if it worked
train.isnull().sum()

In [None]:
# sklearn does not like columns with categorical values.
# Let's make them binary dummy variables instead.
train = pd.get_dummies(train, columns=["Pclass", "Embarked", "Sex"])
test = pd.get_dummies(test, columns=["Pclass", "Embarked", "Sex"])
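
Encoding train and test separately can give the two sets different columns if a category happens to appear in only one of them. A minimal sketch of one way to guard against that, using made-up data and the hypothetical names `toy_train_enc` and `toy_test_enc`, is to reindex the test columns to match the training columns:

```python
import pandas as pd

# Made-up data: the test set only contains one of the three ports
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.0, 13.0]})

toy_train_enc = pd.get_dummies(toy_train, columns=["Embarked"])
toy_test_enc = pd.get_dummies(toy_test, columns=["Embarked"])

# Add dummy columns that are missing from test (filled with 0)
# and put the columns in the same order as in train
toy_test_enc = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(list(toy_train_enc.columns))
print(list(toy_test_enc.columns))
```

If the columns already match, the `reindex` call leaves the data unchanged, so it is cheap to keep as a safety net.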

In [None]:
train.head()

In [None]:
# The Ticket, PassengerId, Name, and Cabin columns seem like they might be problematic.
# Let's check how many unique values they contain.
print(f"N. of rows: {len(train)}")

for col in ["Ticket", "PassengerId", "Name", "Cabin"]:
    print(f"Proportion of unique values in {col}: {len(train[col].unique()) / len(train)}")

In [None]:
# PassengerId, Name, and Ticket are with few exceptions unique for each individual and thus unusable for predictions.
# Cabin has a lot of missing values and a lot of unique values - so we're dropping the columns.

uninformative_cols = ["PassengerId", "Name", "Ticket", "Cabin"]
train = train.drop(columns=uninformative_cols)
test = test.drop(columns=uninformative_cols)

# Could Cabin be made informative with some feature engineering?

In [None]:
# Here we create a logistic regression model with the remaining columns (except the outcome 'Survived') as predictors
model = LogisticRegression(max_iter=256)
# Make subset of training data containing everything except the outcome
X = train.loc[:, train.columns != "Survived"]
# Make subset containing only the outcome
Y = train["Survived"]

In [None]:
# Fit model on training data
model.fit(X, Y)
# See how well the model does on the training data
yhat = model.predict(X)

# If you get a warning, try to adjust the code so it disappears!

In [None]:
print(f"Accuracy on train data: {accuracy_score(Y, yhat)}")
confusion_matrix(Y, yhat)

# Read up on the accuracy score in the sklearn documentation. What are its limitations? Which alternatives are there?
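
As one illustration of a limitation of accuracy: when the classes are imbalanced, a model that always predicts the majority class can still look good. The sketch below uses made-up labels (not model output from this notebook) and compares accuracy with two alternatives from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# Made-up labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of the per-class recalls
mcc = matthews_corrcoef(y_true, y_pred)            # correlation between truth and prediction

print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}, MCC: {mcc:.2f}")
```

Accuracy rewards this constant prediction, while balanced accuracy and the Matthews correlation coefficient show that it carries no information about the positive class.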

In [None]:
# Test the model on the testing set
X_test = test.loc[:, test.columns != "Survived"]
Y_test = test["Survived"]

In [None]:
yhat_test = model.predict(X_test)
print(f"Accuracy on test data: {accuracy_score(Y_test, yhat_test)}")
confusion_matrix(Y_test, yhat_test)

## Improvements - by you!

Very much to our surprise, the toy analysis has been very impressive! Why are we surprised? We expected that we would fare much worse on the test set than on our training set. Was that luck or skill? Try to find out by varying your test set. Once you've done that, it's time to play around with various possible improvements of the model.

__Exercises__:

Discuss:

- How can you get a better estimate of the out-of-sample performance?
- How can you reduce overfitting to the training data?
- Do you need different strategies for model creation for the different sizes of dataset?
    - If so, what would you do differently?

Code:

- For each partition (i.e. each dataset) create several different models which you expect to generalize well. Evaluate them on the training sample using some form of model selection (cross-validated performance, held-out data, information criteria etc.) and choose one to test on the testing set. Your goal is to create the best-performing, most well-calibrated model, i.e. training performance should be close to testing performance (and performance should of course be high!). 
- See what performance you can get on the small datasets with clever optimization and regularization.
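
As a starting point for the regularization exercise, here is a minimal sketch of tuning the regularization strength of a logistic regression with cross-validation. It runs on synthetic data from `make_classification` as a stand-in for a small partition; the grid of `C` values, the scaling step, and the names `X_small`, `y_small`, and `search` are assumptions for illustration, not a recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small training partition (100 rows)
X_small, y_small = make_classification(
    n_samples=100, n_features=10, n_informative=4, random_state=1
)

# Scale the predictors so the L2 penalty treats them comparably
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In LogisticRegression, a smaller C means stronger regularization
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation for each candidate value of C
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_small, y_small)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Because the selection uses only the training partition, the test set stays untouched until you have chosen your final model.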

For next time:

- In your study groups, prepare a 3-5 min presentation on something interesting about your solution: Did you create some cool functions for preprocessing, test an exciting new type of model, set everything up to be run from the command line, or perhaps you're just really excited about the performance of your model. No need for slideshows, but feel free to show code.

---

Tips to get started:
- Visualization can often be a good way to get started: How are the variables distributed? Are any of them highly correlated?
- Instead of training and testing on the whole training data, implement a form of cross-validation ([sklearn makes it easy](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics))
- Remember ridge regularization from last week? [Try it](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
- Check out a list of models in sklearn [here](https://scikit-learn.org/stable/supervised_learning.html)
- Lost or out of ideas? Take some inspiration from entries in the [original Kaggle challenge](https://www.kaggle.com/c/titanic/code)
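
To make the cross-validation and ridge tips concrete, here is a minimal sketch using `cross_val_score`. It uses synthetic data from `make_classification` in place of the preprocessed Titanic predictors; the names `X_toy` and `y_toy`, the number of folds, and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed predictors and the outcome
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("ridge classifier", RidgeClassifier(alpha=1.0)),
]

for name, clf in candidates:
    # One accuracy score per fold, each computed on data the model was not fit on
    scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a single train/test split could vary by chance.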

Things to try:
- You might be able to get more information out of the predictors if you do some feature engineering. Perhaps add a column indicating whether the person had a cabin or not, or one calculating the family size?
- Calculating information criteria is not entirely straightforward in sk-learn. [This tutorial](https://machinelearningmastery.com/probabilistic-model-selection-measures/) might help
- The outcome (survival) is not completely balanced. Over-/under-sampling might help
- Ensemble models often generalize better than single models. Try one of the methods [here](https://scikit-learn.org/stable/modules/ensemble.html)
- Don't feel restricted to sk-learn. Feel free to make a Bayesian model in [PyMC3](https://github.com/pymc-devs/pymc3) or any other framework you want
- High-performance interpretable models are all the rage right now. [Try one of them!](https://github.com/interpretml/interpret) 
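
The feature engineering suggestion from the list above could be sketched roughly as follows. The rows are made up, the column names `SibSp`, `Parch`, and `Cabin` are assumed to match the CSV files, and `HasCabin` and `FamilySize` are hypothetical names for the new columns.

```python
import numpy as np
import pandas as pd

# Made-up rows (NOT real passengers), just to show the mechanics
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Cabin": ["C85", np.nan, np.nan],
})

# 1 if a cabin is recorded for the passenger, 0 if the value is missing
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy[["HasCabin", "FamilySize"]])
```

In the toy analysis, a column like `HasCabin` would have to be created before `Cabin` is dropped, and the same transformation applied to both the training and the test data.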
