## Pipeline: Tune hyperparameters

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will tune the hyperparameters for the basic model we fit in the last section.

### Read in data & create train/validation/test set

![Tune Hyperparameters](img/tune_hyperparameters.png)

_Welcome back to lesson five. We're going to build on our last lesson: we will still use cross-validation on the training set, but we're going to add one more layer. We're going to run a grid search to find the optimal hyperparameter settings for our model._

_In addition to `RandomForestClassifier` and `train_test_split`, you'll notice that the `cross_val_score` we used before is now replaced by `GridSearchCV`. `GridSearchCV` is essentially a wrapper around cross-validation: it runs a grid search and scores every candidate hyperparameter setting with cross-validation._

_So now we'll import our data and create our training, validation, and test sets._

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

titanic = pd.read_csv('../titanic_cleaned.csv')

features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

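# Hold out 40% of the data, then split that hold-out in half: 60% train, 20% validation, 20% test.
X_train, X_holdout, y_train, y_holdout = train_test_split(features, labels, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=42)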
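
As a quick sanity check on the two-step split above, the sketch below repeats the same two calls on a small synthetic frame and prints the resulting proportions. The `demo` DataFrame and its column names are made up for illustration; they are not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, two features, a binary label.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'feature_a': rng.normal(size=100),
    'feature_b': rng.normal(size=100),
    'label': rng.integers(0, 2, size=100),
})
demo_features = demo.drop('label', axis=1)
demo_labels = demo['label']

# Step 1: hold out 40% of the rows.
Xd_train, Xd_hold, yd_train, yd_hold = train_test_split(
    demo_features, demo_labels, test_size=0.4, random_state=42)
# Step 2: split the held-out rows in half (validation and test).
Xd_val, Xd_test, yd_val, yd_test = train_test_split(
    Xd_hold, yd_hold, test_size=0.5, random_state=42)

split_sizes = {'train': len(Xd_train), 'val': len(Xd_val), 'test': len(Xd_test)}
for name, size in split_sizes.items():
    print('{}: {} rows ({:.0%})'.format(name, size, size / len(demo)))
```

The proportions should come out at roughly 60/20/20, which is what `test_size=0.4` followed by `test_size=0.5` is meant to produce.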

### Hyperparameter tuning

![Hyperparameters](img/hyperparameters.png)

_I wrote a quick little function here for us to use to print the results. I'm not going to go through it in detail, but in essence, for every hyperparameter combination it prints the average accuracy score (remember, there will be 5 accuracy scores, one for each fold) and a +/- range of two standard deviations of those scores. This will give us the information we need to select the optimal hyperparameter settings._

In [2]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
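
For reference, `cv_results_` is a plain dictionary of arrays, so it can also be loaded into a pandas DataFrame to sort or filter. This is a minimal sketch, assuming a small synthetic dataset from `make_classification` and a deliberately tiny grid; the names `X_demo`, `demo_grid`, `demo_cv`, and `results_table` are illustrative and not part of this lesson.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A tiny grid so the example runs quickly.
demo_grid = {'n_estimators': [5, 20], 'max_depth': [2, None]}
demo_cv = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid, cv=5)
demo_cv.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first.
results_table = pd.DataFrame(demo_cv.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results_table.to_string(index=False))
```

The `mean_test_score` and `std_test_score` columns are the same values that `print_results` reads; the table form is just easier to sort when the grid gets larger.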

_Now, we're using `RandomForestClassifier` again. This course isn't really meant to go in depth on the individual algorithms, but essentially a Random Forest is a collection of decision trees. Take the decision tree in the slide above: a Random Forest is what you get if you build 5 or 10 or 100 of those and have them all work together to determine the final prediction. Each individual decision tree is fit on some subset of the data and some subset of the features. It's not critical that you understand this in depth at this point._
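
To make the "collection of decision trees" idea concrete, here is a small sketch on synthetic data (`X_toy` and `y_toy` are made up for illustration). It relies on two things scikit-learn documents: a fitted forest exposes its trees in `estimators_`, and its predicted class probabilities are the average of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Titanic dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X_toy, y_toy)

# The fitted forest keeps its individual decision trees in `estimators_`.
print('number of trees:', len(forest.estimators_))

# Average the class probabilities predicted by each tree ...
tree_probas = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0)
# ... and compare with what the forest itself reports.
matches_forest = np.allclose(tree_probas, forest.predict_proba(X_toy))
print('averaged tree probabilities match the forest:', matches_forest)
```

So "working together" here means averaging the trees' probability estimates and predicting the class with the highest average, rather than each tree casting a single hard vote.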

_With that background, we need to define our parameter search region. There are two hyperparameters we want to tune:_
1. _Number of estimators (`n_estimators`): **how many** individual decision trees we want to build within our Random Forest_
2. _Max depth (`max_depth`): we mentioned this in the slide above, and it dictates how deep each of the individual decision trees can go_

_So let's say we want to test using 5, 50, and 100 individual decision trees, and we want to test a max depth of 2, 10, 20, and `None` (no depth limit)._

_Then we just construct a `GridSearchCV` object: we pass in our model, then our parameter dictionary, and lastly we tell it that we want 5 folds. We store that as `cv`. If you've used `scikit-learn` before, you'll notice that the model training API is the same for any type of model: you store the object and then you call `.fit()`. So we will do that with `cv`, passing in `X_train` and `y_train`._

_Lastly, just call the function I wrote above: `print_results(cv)`_

_Now, what this is doing under the hood is taking each parameter combination (3 levels of estimators and 4 levels of max depth, so 12 total combinations) and running 5-fold cross-validation for each one. So essentially it's building 60 models under the hood: one model for each fold, 5 folds for each combination, 12 total combinations. By default, `GridSearchCV` then refits one more model on the full training set using the best combination._
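
The arithmetic above can be checked with scikit-learn's `ParameterGrid`, which enumerates the same combinations a grid search iterates over. This is only a counting sketch; the variable names `grid`, `n_folds`, `combinations`, and `n_cv_fits` are illustrative, and no models are trained.

```python
from sklearn.model_selection import ParameterGrid

# Same search region as described in this section.
grid = {
    'n_estimators': [5, 50, 100],
    'max_depth': [2, 10, 20, None],
}
n_folds = 5

# Every (n_estimators, max_depth) pair the grid search will try.
combinations = list(ParameterGrid(grid))
# One model is fit per combination per fold.
n_cv_fits = len(combinations) * n_folds

print('combinations:', len(combinations))    # 12
print('cross-validation fits:', n_cv_fits)   # 60
```

This count is worth keeping in mind when widening the grid: each extra value for one hyperparameter multiplies the number of fits, so the cost grows quickly.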

_OK, now let's take a look at the results. I just want to highlight again that these hyperparameters control how the Random Forest classifier fits to the data, so they determine where the model lands on the bias-variance trade-off and whether it is overfitting or underfitting._

In [3]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 100],
    'max_depth': [2, 10, 20, None]
}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(X_train, y_train)

print_results(cv)

BEST PARAMS: {'max_depth': 10, 'n_estimators': 100}

0.747 (+/-0.124) for {'max_depth': 2, 'n_estimators': 5}
0.794 (+/-0.123) for {'max_depth': 2, 'n_estimators': 50}
0.8 (+/-0.122) for {'max_depth': 2, 'n_estimators': 100}
0.794 (+/-0.049) for {'max_depth': 10, 'n_estimators': 5}
0.82 (+/-0.039) for {'max_depth': 10, 'n_estimators': 50}
0.831 (+/-0.064) for {'max_depth': 10, 'n_estimators': 100}
0.809 (+/-0.073) for {'max_depth': 20, 'n_estimators': 5}
0.805 (+/-0.034) for {'max_depth': 20, 'n_estimators': 50}
0.811 (+/-0.047) for {'max_depth': 20, 'n_estimators': 100}
0.803 (+/-0.033) for {'max_depth': None, 'n_estimators': 5}
0.813 (+/-0.052) for {'max_depth': None, 'n_estimators': 50}
0.816 (+/-0.024) for {'max_depth': None, 'n_estimators': 100}


