# Classification With the Titanic Dataset

In this notebook you will practice a supervised learning problem with the Titanic dataset. You will try to predict whether a particular passenger survived based on other data about that passenger, such as age, sex, and fare. To explore this dataset further, see the Titanic competition on Kaggle:

https://www.kaggle.com/c/titanic

## Imports

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Load and prepare the data

In [None]:
raw_data = sns.load_dataset('titanic')

In [None]:
raw_data.head()

In [None]:
raw_data.info()

## 2. Cleaning data

Clean the raw data in place by doing the following:

* Fill missing values in the `age` column with its mean.
* Fill missing values in the `embark_town` column with `'Unknown'` (the tests below expect two such rows).
* Convert bool columns to ints.
* Create a new integer-valued `child` column that is `1` when the `who` column is `child` and `0` otherwise.
* Drop the `alive`, `deck`, `embarked` and `adult_male` columns.
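The cleaning steps can be sketched on a small stand-in frame. The `toy` frame and its values below are hypothetical and only mimic a few Titanic columns; the sketch also fills missing towns with `'Unknown'`, since the tests below expect that value. It is one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few Titanic columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 4.0, 30.0],
    "alone": [True, False, False, True],
    "who": ["man", "woman", "child", "man"],
    "embark_town": ["Southampton", None, "Cherbourg", "Queenstown"],
    "deck": ["C", None, None, "E"],
})

# Fill missing ages with the mean of the observed ages.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Convert every bool column to int (True -> 1, False -> 0).
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})

# Flag children with a 0/1 integer column.
toy["child"] = (toy["who"] == "child").astype(int)

# Label missing towns explicitly instead of leaving them empty.
toy["embark_town"] = toy["embark_town"].fillna("Unknown")

# Drop a column that is no longer needed.
toy = toy.drop(columns=["deck"])
```

Note that `fillna` and `drop` return new objects by default, so the results are assigned back to the same names.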

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Get the following tests to pass:

In [None]:
assert raw_data.pclass.isnull().value_counts()[False]==891
assert list(raw_data.sex.unique())==['male', 'female']
assert raw_data.age.isnull().value_counts()[False]==891
assert raw_data.sibsp.isnull().value_counts()[False]==891
assert raw_data.parch.isnull().value_counts()[False]==891
assert raw_data.fare.isnull().value_counts()[False]==891
assert list(raw_data['class'].unique())==['Third', 'First', 'Second']
towns = raw_data.embark_town.value_counts()
assert towns['Southampton']==644
assert towns['Cherbourg']==168
assert towns['Queenstown']==77
assert towns['Unknown']==2
a = raw_data.alone.value_counts()
assert a[0]==354
assert a[1]==537
cc = raw_data.child.value_counts()
assert cc[0]==808
assert cc[1]==83

assert list(raw_data.columns)==['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'class',
       'who', 'embark_town', 'alone', 'child']

Here is a summary of the cleaned data:

In [None]:
raw_data.info()

## 3. Features

Create a feature `DataFrame` named `X` with the numerical columns from the cleaned dataset, leaving out the target column `survived`:
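As a rough illustration, numerical columns can be picked by dtype and the target removed by name. The `toy` frame below is a hypothetical stand-in; listing the column names by hand works just as well.

```python
import pandas as pd

# Hypothetical toy frame with a target, two numeric features and one text column.
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 2],
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.28, 13.0],
})

# Find numeric columns by dtype, then remove the target explicitly.
numeric_cols = toy.select_dtypes(include="number").columns.drop("survived")
X_toy = toy[numeric_cols]
```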

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert list(X.columns)==['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone', 'child']

Use `pandas.get_dummies` to one-hot-encode the `sex`, `class` and `embark_town` columns. Use the `drop_first` argument to drop one of the one-hot encoded columns for each of them. Use `pandas.concat` to concatenate the new columns to `X`:
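A minimal sketch of the encoding pattern on hypothetical toy data follows. Passing a single `Series` to `get_dummies` names the new columns after the category values without a prefix; `dtype=int` is an assumption made here so the output is 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are for illustration only.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 13.0, 8.05],
    "sex": ["male", "female", "female", "male"],
    "town": ["Southampton", "Cherbourg", "Queenstown", "Southampton"],
})
X_toy = toy[["fare"]]

# drop_first=True removes the first category (in sorted order) of each column.
sex_dummies = pd.get_dummies(toy["sex"], drop_first=True, dtype=int)
town_dummies = pd.get_dummies(toy["town"], drop_first=True, dtype=int)

# axis=1 appends the encoded columns side by side, aligned on the row index.
X_toy = pd.concat([X_toy, sex_dummies, town_dummies], axis=1)
```

Dropping one column per variable avoids redundant features: a passenger who is not `male` must be `female`, so the second column carries no extra information.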

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert list(X.columns)==['pclass', 'age', 'sibsp', 'parch', 'fare', 'alone', 'child', 'male',
       'Second', 'Third', 'Queenstown', 'Southampton', 'Unknown']

Here are the final features we will use:

In [None]:
X.head()

## 4. Target

Create the target vector, `y`, from the `survived` column:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
yc = y.value_counts()
assert yc[0]==549
assert yc[1]==342

## Train/test split

Use `sklearn.model_selection.train_test_split` (the older `sklearn.cross_validation` module was removed in scikit-learn 0.20) to split your data into a training and test set. Save the resulting data in the variables:

* `Xtrain`
* `Xtest`
* `ytrain`
* `ytest`

Use 70% of the data for training and set `random_state=0`:
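The call pattern can be sketched on synthetic arrays, assuming a recent scikit-learn where `train_test_split` is imported from `sklearn.model_selection`. The array names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each, and alternating labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# train_size=0.7 keeps 70% of the rows for training; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0
)
```

The function returns the pieces in the order features-train, features-test, target-train, target-test.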

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert Xtrain.shape==(623,13)
assert ytrain.shape==(623,)
assert Xtest.shape==(268,13)
assert ytest.shape==(268,)

## Gaussian Naive-Bayes classifier

Perform the following steps with the `sklearn.naive_bayes.GaussianNB` classifier:

1. Instantiate the model class
2. Fit the model with the training data
3. Use the model to make predictions about the test data
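The three steps follow the usual scikit-learn estimator pattern. Here is a sketch on hypothetical, well-separated one-dimensional data, not on the Titanic features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: class 0 clusters near 0, class 1 near 5.
X_train_toy = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_test_toy = np.array([[0.1], [5.1]])

model = GaussianNB()                     # 1. instantiate
model.fit(X_train_toy, y_train_toy)      # 2. fit on the training data
y_pred_toy = model.predict(X_test_toy)   # 3. predict labels for unseen data
```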

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Compute the accuracy of the model:
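Accuracy is the fraction of predictions that match the true labels. One way to compute it is `sklearn.metrics.accuracy_score`, shown here on hypothetical label lists:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: three of the four predictions are correct.
y_true_toy = [0, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
```

A fitted classifier's `score(X, y)` method reports the same quantity without a separate `predict` call.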

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Random forest classifier

Perform the following steps with `sklearn.ensemble.RandomForestClassifier`:

1. Instantiate the model class with `random_state=0`
2. Fit the model with the training data
3. Use the model to make predictions about the test data
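A sketch of the same pattern with a random forest, using hypothetical synthetic clusters rather than the Titanic features. Note that `random_state` is passed when the model is instantiated, before fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: two well-separated 2-D clusters, 30 points each.
rng = np.random.default_rng(0)
X_train_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 2)),
    rng.normal(10.0, 1.0, size=(30, 2)),
])
y_train_toy = np.array([0] * 30 + [1] * 30)
X_test_toy = np.array([[0.0, 0.0], [10.0, 10.0]])

# random_state makes the bootstrap samples and feature choices repeatable.
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train_toy, y_train_toy)
y_pred_toy = forest.predict(X_test_toy)
```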

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Compute the accuracy of the model:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Use `sklearn.model_selection.cross_val_score` to perform k-fold cross-validation (`k=10`) with this model:
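A sketch of the call on hypothetical synthetic data follows. The `cv=10` argument requests ten folds, and the function refits a fresh copy of the model for each fold. `n_estimators=20` is an arbitrary choice here to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: two well-separated 2-D clusters, 50 points each.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=20, random_state=0)

# Returns one score per fold; for classifiers the default score is accuracy.
scores = cross_val_score(forest, X_toy, y_toy, cv=10)
```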

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Compute the average accuracy and its standard deviation:
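The per-fold scores come back as a NumPy array, so its `mean` and `std` methods apply directly. The scores below are made-up values for illustration:

```python
import numpy as np

# Hypothetical per-fold accuracies.
scores_toy = np.array([0.80, 0.85, 0.90, 0.75, 0.95])

mean_acc = scores_toy.mean()
# NumPy's default is the population standard deviation (ddof=0);
# pass ddof=1 for the sample standard deviation instead.
std_acc = scores_toy.std()
```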

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Use `sklearn.metrics.confusion_matrix` and Seaborn's `heatmap` to display the confusion matrix for this model:
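A minimal sketch with hypothetical labels is shown below. In the matrix returned by `confusion_matrix`, rows correspond to true classes and columns to predicted classes. The heatmap options are one reasonable choice, not the only one.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical labels: one class-0 sample and one class-1 sample are misclassified.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Note the argument order: true labels first, predictions second.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# annot=True writes the counts into the cells; fmt="d" formats them as integers.
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt="d", cbar=False, ax=ax)
ax.set_xlabel("predicted label")
ax.set_ylabel("true label")
```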

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Use the `feature_importances_` attribute of the model to create a `DataFrame` that has two columns:

1. `feature`: the names of the features
2. `importance`: the importance of each feature

Sort by the feature importances.
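One way to build such a table, sketched on hypothetical data with one informative column (`signal`) and one random column (`noise`). Sorting in descending order is an assumption made here so the most important feature comes first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: "signal" separates the classes, "noise" does not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(10.0, 1.0, 50)]),
    "noise": rng.normal(0.0, 1.0, 100),
})
y_toy = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(random_state=0).fit(toy, y_toy)

# feature_importances_ is ordered like the training columns.
importances = pd.DataFrame({
    "feature": toy.columns,
    "importance": forest.feature_importances_,
}).sort_values("importance", ascending=False)
```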

In [None]:
# YOUR CODE HERE
raise NotImplementedError()