# Part 1b: Predicting survival of Titanic passengers with decision trees
**DUE September 17th 2018**

## Introduction

The code for this project consists of several Python files, some of
which you will need to read and understand in order to complete the
assignment, and some of which you can ignore.

### Files You'll Edit

``assignment_1b.ipynb``: Your edited copy of this notebook, which covers Part 1b of the assignment

``features.py``: Simple feature engineering function



### Files You Might Want to Look At
  
``binary.py``: Our generic interface for binary classifiers (actually
works for regression and other types of classification, too).

``datasets.py``: Where a handful of test data sets are stored.

``util.py``: A handful of useful utility functions: these will
undoubtedly be helpful to you, so take a look!

``runClassifier.py``: A few wrappers for doing useful things with
classifiers, like training them, generating learning curves, etc.

``mlGraphics.py``: A few useful plotting commands

``data/*``: all of the datasets we'll use.

### What to Submit

You will hand in all of the files listed above under "Files
You'll Edit". You will also have to answer the written questions in this
notebook, denoted **Q#:**, in the corresponding cells denoted **A#:**.

#### Autograding

Your code will be autograded for technical correctness. Please **do
not** change the names of any provided functions or classes within the
code, or you will wreak havoc on the autograder. However, the
correctness of your implementation -- not the autograder's output --
will be the final judge of your score.  If necessary, we will review
and grade assignments individually to ensure that you receive due
credit for your work.

## A quick look at the data

In `data/` you will find the following files:
- `titanic_train.csv`
- `titanic_test.csv`

Let's take a look at the CSV file using the [Pandas] package and import other packages we think we will need.

In [None]:
import pandas as pd
import dt
import features
import runClassifier
import numpy as np
from sklearn.model_selection import train_test_split

Pandas lets us read CSVs easily and manipulate the data with ease. So let's take a look at the data!

In [None]:
train_df = pd.read_csv('data/titanic_train.csv')
train_df.head()

Each passenger is identified by a unique `PassengerId` and is labeled with whether or not they survived the Titanic accident. We also have some simple information about each passenger. In each binary column, 1 signifies True and 0 signifies False. Since the decision tree we have implemented is quite simple and can split only on binary features (either 1 or 0), we have preprocessed the data and already binarized some features for you. They are as follows:
- `HighClassTicket`: Signifies whether or not the passenger bought a ticket with some extra perks
- `IsOld`: Signifies whether or not the passenger is older than 22
- `hasLargeFamily`: Signifies whether the passenger had more than 4 other family members on board
- `isSingle`: Signifies whether the passenger had no other family members on board
- `hadNiceCabin`: Signifies whether the passenger purchased an upgraded cabin
- `isAristocrat`: Signifies whether the passenger had an aristocratic title in their name (e.g. Sir, Lord, Duchess, etc.)

However, you still have to do some **feature engineering** and binarize the remaining columns.

Unfortunately, the simple decision tree that we implemented does not know how to find partitions in features that are strings or features that are continuous, so such columns must be converted to binary values first. Binarizing the `Sex` feature is simple. However, you will have to figure out a reasonable threshold for binarizing the `Fare` feature.
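
As a minimal sketch of what binarizing means in practice, the example below shows how two of the preprocessed features listed above could have been derived. The raw columns `Age` and `CabinType`, and their values, are hypothetical and exist only for illustration; they are not columns of the provided CSV files, and this is not the required `binarize_features` implementation.

```python
import pandas as pd

# Hypothetical raw data: one continuous column and one string column.
raw = pd.DataFrame({
    "Age": [4, 22, 35, 58],
    "CabinType": ["standard", "upgraded", "standard", "upgraded"],
})

binarized = pd.DataFrame({
    # Continuous feature: compare against a threshold, then cast bool -> int.
    "IsOld": (raw["Age"] > 22).astype(int),
    # String feature: compare against one category, then cast bool -> int.
    "hadNiceCabin": (raw["CabinType"] == "upgraded").astype(int),
})

print(binarized)
```

A comparison yields a boolean Series, and `.astype(int)` turns it into the 0/1 values the tree can split on.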

Do some data analysis below to find a reasonable threshold. **Plot a chart** and **explain** why you chose that threshold.
(Hint: Use histograms, analyze the survival rates, and make a reasonable guess. Or find the threshold that minimizes impurity!)
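
To make the impurity hint concrete, here is a rough sketch of a threshold search on made-up numbers. It assumes Gini impurity as the impurity measure (the course's decision tree may use a different one), and the helper names `gini` and `best_threshold` as well as the toy values are hypothetical, not part of the provided code or data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value > threshold`."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_t, best_score = None, float("inf")
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        # Weight each side's impurity by the fraction of samples it holds.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up values, NOT taken from the Titanic data.
toy_fare = [5.0, 7.0, 8.0, 30.0, 50.0, 80.0]
toy_survived = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(toy_fare, toy_survived)
print(threshold, impurity)
```

On real data the minimum is rarely zero, so it is worth plotting the impurity against the candidate thresholds rather than trusting a single number.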


Also, **complete** the `binarize_features` function in `features.py`. This function should return a Pandas dataframe with the same number of columns, with the `Fare` and `Sex` columns binarized to `int`s.

In [None]:
# TODO: Insert your analysis here. Add more cells if you need to!

**Q1:** Why did you threshold/binarize the `Fare` feature at that value?

**A1:** (TODO: Your answer here...)

In [None]:
train_df = features.binarize_features(train_df, 25)
train_df.head()

Although there is a test CSV, we won't always have access to labels in our test data. Instead, we hold out a portion (20%) of our training data to measure how well the trained model generalizes.

In [None]:
X, y = train_df.iloc[:,2:].values, train_df.iloc[:,1].values
# train_test_split returns (X_train, X_test, y_train, y_test), in that order
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=422)
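
Note that scikit-learn's `train_test_split` returns its outputs in the order: training features, held-out features, training labels, held-out labels. A quick way to check the unpacking is to run it on toy arrays (the names `X_toy`, `y_toy`, and the `_tr`/`_ho` suffixes are illustrative only) and inspect the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and one label per sample.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Order of the returned arrays: X train, X held-out, y train, y held-out.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_tr.shape, X_ho.shape, y_tr.shape, y_ho.shape)
# → (8, 2) (2, 2) (8,) (2,)
```

If the feature arrays come back one-dimensional or the label arrays two-dimensional, the variables were unpacked in the wrong order.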

Using `hyperparameterCurve` in `runClassifier.py`, **plot** a chart of tree depth vs. accuracy and **choose** the best tree depth.
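
The general idea behind such a curve is to train one tree per candidate depth and compare accuracy on the held-out data. The sketch below illustrates that loop using scikit-learn's `DecisionTreeClassifier` on synthetic binary data as a stand-in; it does not use the course's `dt.DT` class or `hyperparameterCurve`, whose signatures are not shown here, and the data and variable names are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary features; the label is a simple boolean function of three of them.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 2, size=(200, 6))
y_toy = (X_toy[:, 0] & X_toy[:, 1]) | X_toy[:, 2]

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

depths = list(range(1, 7))
holdout_acc = []
for depth in depths:
    # Train a depth-limited tree and score it on data it has not seen.
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    holdout_acc.append(clf.score(X_ho, y_ho))

# Pick the depth with the highest held-out accuracy (the shallowest one on ties).
best_depth = depths[int(np.argmax(holdout_acc))]
print(list(zip(depths, holdout_acc)), best_depth)
```

Preferring the shallowest depth among ties is a design choice: a smaller tree with the same held-out accuracy is less likely to be fitting noise.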

In [None]:
# TODO: Insert code and analysis here

**Q2:** According to your analysis, what is the best tree depth? Why?

**A2:** (TODO: Your answer here...)

Now let's retrain on all the data...

In [None]:
dt = dt.DT({'maxDepth': #insert value here })
dt.train(X, y)

**Q3:** Why would we want to retrain a decision tree on all the data (`X` and `y`) and not just `X_train` and `y_train`?

**A3:** (TODO: Your answer here...)

We can now test our decision tree on the test data!

In [None]:
test_df = pd.read_csv('data/titanic_test.csv')
test_df = features.binarize_features(test_df, #insert value here )
X_test, y_test = test_df.iloc[:,2:].values, test_df.iloc[:,1].values

In [None]:
y_predicted = dt.predictAll(X_test)
acc = np.mean(y_test == y_predicted)
print("Test accuracy:", acc)