# Titanic dataset
We have a dataset that contains characteristics of the people aboard the Titanic on its first and last voyage. These characteristics include age, sex, passenger class and port of embarkation. In addition, we have complete information on who survived and who perished in the disaster. Our goal is to load the data and get it into shape for model fitting.

## Handling missing data
Missing data is one of the most basic issues a dataset can have. Many statistical models cannot handle missing data, so in most cases we need to either fill in the missing values or remove the records that contain them.
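
As a minimal sketch of the two options (remove or fill in), the snippet below uses a small made-up DataFrame rather than the Titanic file. The column names and values are invented for illustration, and the median is just one possible fill value.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                    'Fare': [7.25, 71.28, 8.05, 53.10]})

# Count the missing values per column
missing_per_column = toy.isna().sum()

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill in the missing ages, here with the median of the known ages
filled = toy.fillna({'Age': toy['Age'].median()})

print(missing_per_column)
print(len(toy), 'rows before dropping,', len(dropped), 'rows after')
print(filled)
```

Removing rows is simple but throws away the information in the other columns; filling in keeps every record at the cost of inventing values.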

## Cleaning data
Data scientists spend most of their time cleaning up data and preprocessing it before feeding it to a model of choice. This preprocessing is also called feature engineering. We transform the raw data (people's characteristics) into well-defined features that our model can deal with.

# 1 Imports, load data and first look

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [None]:
titanic_data = pd.read_csv('./data/train.csv')  # loads the CSV into a DataFrame with default settings

In [None]:
titanic_data.head()  # this shows the first 5 rows of the DataFrame

In [None]:
print(titanic_data.describe())  # this prints basic statistics of the DataFrame

## 1.1 Impute missing data (preprocessing)
Let's take a look at the number of missing values per column.

In [None]:
titanic_data.info()  # info() prints its summary itself; we see that Age and Cabin have a lot of missing values

As a strategy to deal with missing values we will fill them with values that do not alter the average of the age distribution. To this end we will 'impute' the missing age values with the mean of the 'Age' column.

In [None]:
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())
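
The claim that mean imputation leaves the average unchanged can be checked on a small example. The sketch below uses made-up ages, not the Titanic data. It also shows a side effect worth knowing: the spread of the column shrinks, because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries (illustration only)
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Same pattern as above: fill missing values with the column mean
imputed = ages.fillna(ages.mean())

# pandas skips NaN by default when computing mean and std
mean_before, mean_after = ages.mean(), imputed.mean()
std_before, std_after = ages.std(), imputed.std()

print('mean before/after:', mean_before, mean_after)
print('std before/after: ', std_before, std_after)
```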

## 1.2 Fill missing values in the port of embarkation
The values in the 'Embarked' column and their meanings are: C = Cherbourg, Q = Queenstown, S = Southampton.

In [None]:
print(titanic_data['Embarked'].describe())

# Exercise
We see that `'S'` (Southampton) is the port where most people embarked. Fill the missing values in the 'Embarked' column with this most common port.

# Exercise
Change the 'Embarked' column into a categorical variable (see the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html)).

We need to create a dummy variable for each of our categories.

In [None]:
dummies = pd.get_dummies(titanic_data['Embarked'])
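
To see what `pd.get_dummies` produces, here is a small sketch on a made-up Series (not the real column). The `prefix` argument is optional and used here only to make the column names self-explanatory. Note that, with default settings, a missing value ends up as a row without any dummy set, which is one reason to fill the missing ports first.

```python
import pandas as pd

# Made-up ports, including one missing value (illustration only)
toy_ports = pd.Series(['S', 'C', None, 'S', 'Q'], name='Embarked')

# One indicator column per category; missing values get no indicator by default
toy_dummies = pd.get_dummies(toy_ports, prefix='Embarked')

print(toy_dummies)
# Number of indicators set per row: 1 for known ports, 0 for the missing one
print(toy_dummies.sum(axis=1))
```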

# 2 Preprocessing
We started preprocessing by imputing the missing values in the 'Age' column. Now we will transform the raw data further to prepare it for the model.

## 2.1 Convert sex to binary variable
There are no missing values in the 'Sex' column, and the dataset records every passenger as either male or female, so a single boolean column is enough to encode it.

In [None]:
titanic_data['binary_sex'] = titanic_data['Sex'] == 'male'  # make a new column with binary sex

## 2.2 Split the data into a train and test set
To get an idea of how our model performs, we train it on one part of our feature data and test it on the other, unseen part. This is supported by the scikit-learn `train_test_split` function.

In [None]:
X = titanic_data[['Age', 'binary_sex', 'Pclass']]
y = titanic_data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
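
What the split does is easiest to see on a tiny example. The sketch below uses made-up data with a `toy_` prefix so it does not overwrite the variables above. With `test_size=0.2`, two of the ten rows go to the test set, and features and labels stay aligned by index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up samples (illustration only)
toy_X = pd.DataFrame({'feature': np.arange(10)})
toy_y = pd.Series([0, 1] * 5)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

print('train rows:', len(toy_X_train), 'test rows:', len(toy_X_test))
# No row appears in both sets
print('overlap:', set(toy_X_train.index) & set(toy_X_test.index))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the notebook gives the same split.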

# Exercise
Change the definition of `X` in section 2.2 so that `X` contains our dummy variables (`dummies`) besides 'Age', 'binary_sex' and 'Pclass'.
Take a look at [pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and make sure you do not append the dummies as new rows, but concatenate them along the column axis (axis=1).
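
As a hint, the difference between the two axes can be sketched with two small made-up frames (these are not the exercise's data). With `axis=1` the frames are placed side by side; with the default `axis=0` they are stacked as new rows and the non-matching columns are padded with missing values.

```python
import pandas as pd

# Two made-up frames with the same index (illustration only)
left = pd.DataFrame({'Age': [22.0, 38.0]})
right = pd.DataFrame({'C': [0, 1], 'S': [1, 0]})

# axis=1: columns are added, rows are matched on the index
side_by_side = pd.concat([left, right], axis=1)

# default axis=0: rows are appended, missing cells become NaN
stacked = pd.concat([left, right])

print(side_by_side.shape)
print(stacked.shape, 'with', int(stacked.isna().sum().sum()), 'missing cells')
```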

# 3 Fit the model

In [None]:
from sklearn import tree
from sklearn.metrics import accuracy_score

## 3.1 Train the model on the training set
Here we train a scikit-learn `DecisionTreeClassifier` on the training set by calling the `fit` method of the classifier model.

In [None]:
decision_tree_classifier = tree.DecisionTreeClassifier(max_depth=3)
decision_tree_classifier.fit(X_train, y_train)
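
A fitted tree is just a set of threshold rules, which scikit-learn can print with `export_text`. The sketch below fits a depth-1 tree on made-up ages and labels (not the Titanic data) to show what `fit` learns; the same call should work on the classifier above if you pass it the column names of `X_train`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: in this toy set only the two youngest survive (illustration only)
toy_X = pd.DataFrame({'Age': [5, 8, 30, 45, 60, 70]})
toy_y = [1, 1, 0, 0, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(toy_X, toy_y)

# Human-readable version of the learned split
rules = export_text(toy_tree, feature_names=['Age'])
print(rules)

# Predict for two new, made-up passengers aged 6 and 50
toy_predictions = toy_tree.predict(pd.DataFrame({'Age': [6, 50]}))
print(toy_predictions)
```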

## 3.2 Predict the survival of the test set
We predict the survival outcome for the persons in the test set by calling the `predict` method of the model.

In [None]:
y_predict = decision_tree_classifier.predict(X_test)

## 3.3 Evaluate model performance
Our model is a binary classifier; it predicts either survived or perished. Binary classifiers can be evaluated with a number of metrics, such as accuracy.

In [None]:
print('accuracy', accuracy_score(y_test, y_predict))

This means that, on the test set, we can predict with ~78% accuracy whether someone survived.
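
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this by hand on made-up labels (not the model output above) and also prints a confusion matrix, one of the other common ways to look at a binary classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (illustration only)
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = (toy_true == toy_pred).mean()
library_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
toy_confusion = confusion_matrix(toy_true, toy_pred)

print('manual:', manual_accuracy, 'sklearn:', library_accuracy)
print(toy_confusion)
```

The off-diagonal cells of the confusion matrix separate the two kinds of mistakes (predicting survival for someone who perished and the reverse), which a single accuracy number hides.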

# Exercise
Fit the model with and without the Embarked variable. What is your conclusion?