<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Gridsearching Hyperparameters


---

<img src="gridsearch_meme.png" style="width: 500px;">

### Learning Objectives
- Understand what the terms gridsearch and hyperparameter refer to.
- Understand how to manually build a gridsearching procedure.
- Apply sklearn's `GridSearchCV` object with basketball data to optimize a KNN model.
- Practice using and evaluating attributes of the gridsearch object.
- Understand the pitfalls of searching large hyperparameter spaces.
- Practice the gridsearch procedure independently optimizing regularized logistic regression.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Big Picture: What part of the modelling process are we focusing on this morning?

---

As we looked at yesterday, one general approach you can use for modelling questions would look like this:

- STEP ONE: Cleaning, descriptive stats, correlation heatmap, plots & visualisations. Find baseline accuracy if it's a classifier.

- STEP TWO: Set up predictor matrix (X) and target array (y).  **Dummify if necessary**.

- STEP THREE: Train/test split and StandardScaler( )

- STEP FOUR: Use cross-validation to **optimise the hyperparameters for your model**. You might try different types of models at this stage as well, and you might use GridSearchCV (or any of the other CVs like RidgeCV).

- STEP FIVE: Once you're happy with your hyperparameters, **fit your model on your whole training data** and test it on your whole testing data.

- STEP SIX: Then you might want to **evaluate the performance** of the model (R2 score, accuracy, confusion matrix, etc); find the actual predictions that your model is providing and store them in a dataframe; plot your predictions against your actual target variable to visualise how well your model performed; investigate feature importance with .coef_ if you have a parametric model.

This morning, we're learning more about STEP FOUR.

<a id='intro'></a>

## What is "Gridsearching"? What are "hyperparameters"?

---

Models often have **specifications that can be set**. For example, when we choose a linear regression, we may decide to **add a penalty to the loss function** such as the Ridge or the Lasso. Those penalties require the **regularization strength**, alpha, to be set. 

**Settings like these, which we choose ourselves rather than learn from the data, are called hyperparameters.**

Hyperparameters are different from the parameters of the model that result from a fit, such as the coefficients. The hyperparameters are **set prior to the fit** and determine the behavior of the model.
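To make the distinction concrete, here is a minimal sketch using a Ridge regression on made-up data (the toy arrays and the variable names `X_toy`, `y_toy` and `ridge` are assumptions for illustration, not part of the basketball lesson). The regularization strength `alpha` is chosen before the fit, while `coef_` only exists once the model has been fit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (made up for illustration): 50 rows, 3 predictors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# alpha is a hyperparameter: we choose it BEFORE fitting
ridge = Ridge(alpha=10.0)
print('hyperparameter alpha:', ridge.get_params()['alpha'])

# coef_ holds the learned parameters: it is only created by .fit()
print('has coef_ before fit?', hasattr(ridge, 'coef_'))
ridge.fit(X_toy, y_toy)
print('learned coefficients:', ridge.coef_)
```

Changing `alpha` and refitting would give different coefficients, which is exactly why we need a way to choose it.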

There is often more than one hyperparameter to set for a model. For example, in the KNN algorithm:
- We have a hyperparameter to set the **number of neighbors**.
- We also have a hyperparameter for the weights: **uniform or distance**?

We want to know the *optimal* hyperparameter settings, the set that results in the best model evaluation. 

**Searching for the optimal set of hyperparameters by trying every combination of candidate values is called gridsearching.**

Gridsearching gets its name from the fact that we are searching over a **"grid" of hyperparameter values**. For example, imagine the `n_neighbors` hyperparameter on the x-axis and `weights` on the y-axis: we need to test every point on the grid.

**Gridsearching uses cross-validation internally to evaluate the performance of each set of hyperparameters.** More on this later.
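As a rough illustration of what "every point on the grid" means, the sketch below lists all combinations for two small, made-up lists of candidate values (the lists and the names `n_neighbors_options`, `weights_options` and `grid` are assumptions for illustration). It only uses the standard library.

```python
from itertools import product

# Candidate values for each hyperparameter (chosen arbitrarily for illustration)
n_neighbors_options = [1, 3, 5]
weights_options = ['uniform', 'distance']

# Every (n_neighbors, weights) pair is one point on the grid
grid = [{'n_neighbors': k, 'weights': w}
        for k, w in product(n_neighbors_options, weights_options)]

for point in grid:
    print(point)

print(len(grid), 'combinations to evaluate')
```

Each of these combinations would then be scored with cross-validation, and the best-scoring one kept.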

<a id='basketball-data'></a>

## Basketball data

---

To explore the process of gridsearching over sets of hyperparameters, we will use some basketball data. The data below has statistics for 4 different seasons of NBA basketball: 2013-2016.
- This data includes aggregate statistics for each game.
- Within each game, the statistics are aggregated across all players.
- Scraped from http://www.basketball-reference.com

Many of the columns in the dataset represent, for example, the mean of a statistic across the last 10 games. Non-target statistics are for *prior* games: they do not include information about player performance in the current game.

**We are interested in predicting whether the home team will win the game or not.** This is a classification problem.


### Load the data and create the target and predictor matrix
- The target will be a binary column of whether the home team wins.
- The predictors should be numeric statistics columns.

Exclude these columns from the predictor matrix:

    ['GameId','GameDate','GameTime','HostName',
     'GuestName','total_score','total_line','game_line',
     'winner','loser','host_wins','Season']


### STEP ONE: Cleaning, descriptive stats, correlation heatmap, plots & visualisations. Find baseline accuracy if it's a classifier.

In [5]:
data = pd.read_csv('./datasets/basketball_data.csv')

In [6]:
data.head()

Unnamed: 0,Season,GameId,GameDate,GameTime,HostName,GuestName,total_score,total_line,game_line,Host_HostRank,...,gPTS_avg10,gTS%_avg10,g3PAR_avg10,gFTr_avg10,gDRB%_avg10,gTRB%_avg10,gAST%_avg10,gSTL%_avg10,gBLK%_avg10,gDRtg_avg10
0,2013,201212090LAL,2012-12-09,6:30 pm,Los Angeles Lakers,Utah Jazz,227.0,207.5,7.5,13,...,99.0,0.5206,0.223,0.2981,69.22,50.05,61.57,8.63,10.31,110.87
1,2013,201212100PHI,2012-12-10,7:00 pm,Philadelphia 76ers,Detroit Pistons,201.0,186.5,5.5,13,...,90.3,0.5077,0.2144,0.3095,71.46,49.48,59.83,6.48,9.46,107.91
2,2013,201212100HOU,2012-12-10,7:00 pm,Houston Rockets,San Antonio Spurs,240.0,212.0,-7.0,12,...,108.0,0.5915,0.2743,0.2518,74.26,50.99,61.82,8.3,6.85,101.41
3,2013,201212110BRK,2012-12-11,7:00 pm,Brooklyn Nets,New York Knicks,197.0,195.5,-3.5,12,...,100.3,0.5473,0.3595,0.2544,74.23,47.88,52.07,9.31,7.64,109.24
4,2013,201212110DET,2012-12-11,7:30 pm,Detroit Pistons,Denver Nuggets,195.0,203.5,-4.5,11,...,101.1,0.5605,0.2173,0.3177,68.45,50.4,56.33,7.67,7.83,114.86


In [7]:
data.columns

Index(['Season', 'GameId', 'GameDate', 'GameTime', 'HostName', 'GuestName',
       'total_score', 'total_line', 'game_line', 'Host_HostRank',
       'Host_GameRank', 'Guest_GuestRank', 'Guest_GameRank', 'host_win_count',
       'host_lose_count', 'guest_win_count', 'guest_lose_count', 'game_behind',
       'winner', 'loser', 'host_place_streak', 'guest_place_streak',
       'hq1_avg10', 'hq2_avg10', 'hq3_avg10', 'hq4_avg10', 'hPace_avg10',
       'heFG%_avg10', 'hTOV%_avg10', 'hORB%_avg10', 'hFT/FGA_avg10',
       'hORtg_avg10', 'hFG_avg10', 'hFGA_avg10', 'hFG%_avg10', 'h3P_avg10',
       'h3PA_avg10', 'h3P%_avg10', 'hFT_avg10', 'hFTA_avg10', 'hFT%_avg10',
       'hORB_avg10', 'hDRB_avg10', 'hTRB_avg10', 'hAST_avg10', 'hSTL_avg10',
       'hBLK_avg10', 'hTOV_avg10', 'hPF_avg10', 'hPTS_avg10', 'hTS%_avg10',
       'h3PAR_avg10', 'hFTr_avg10', 'hDRB%_avg10', 'hTRB%_avg10',
       'hAST%_avg10', 'hSTL%_avg10', 'hBLK%_avg10', 'hDRtg_avg10', 'gq1_avg10',
       'gq2_avg10', 'gq3_avg10', 'gq

In [8]:
data.shape

(3768, 96)

In [9]:
data.Season.value_counts()

2014    998
2016    985
2015    984
2013    801
Name: Season, dtype: int64

In [10]:
data.winner.head()

0             Utah Jazz
1    Philadelphia 76ers
2     San Antonio Spurs
3       New York Knicks
4        Denver Nuggets
Name: winner, dtype: object

In [11]:
#let's create a column called 'host_wins' which will indicate
#whether the host team won the game or not
data['host_wins'] = (data['HostName'] == data['winner']).astype(int)

In [12]:
#let's look at the baseline accuracy for this data
baseline = data['host_wins'].value_counts(normalize=True) 
baseline

1    0.594214
0    0.405786
Name: host_wins, dtype: float64
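The baseline accuracy is simply the share of the most common class: it is the accuracy you would get by always predicting that class. Here is a small sketch on a made-up target column (the series `toy_target` is an assumption for illustration, not the basketball data).

```python
import pandas as pd

# Made-up target: 1 = home win, 0 = home loss
toy_target = pd.Series([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

# Share of each class
class_shares = toy_target.value_counts(normalize=True)

# The baseline is the share of the majority class
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print('Always predicting', majority_class, 'gives accuracy', baseline_accuracy)
```

Any model worth keeping should beat this number on data it has not seen.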

At this point, we could definitely do a bit more EDA!  We could...
- use .describe() to find some descriptive statistics
- create some plots & visualisations to better understand the shapes and relationships in our data
- use a correlation heatmap to help us find variables that could predict 'host_wins'

...but we're going to skip that for this lesson!  Be aware that if you're undertaking your own data investigation, **skipping EDA is a bad idea**. (Obviously.)  

### STEP TWO: Set up predictor matrix (X) and target array (y).  Dummify if necessary.

In [13]:
predictors = [c for c in data.columns if c not in ['GameId','GameDate','GameTime','HostName',
                                                   'GuestName','total_score','total_line','game_line',
                                                   'winner','loser','host_wins','Season']]
X = data[predictors]
y = data['host_wins']

In [14]:
#let's check to make sure we don't have any categorical predictors:
#if we don't have any categorical predictors, then we DON'T have to dummify
#(remember, it's normally fine to have a categorical target variable)
X.head()

Unnamed: 0,Host_HostRank,Host_GameRank,Guest_GuestRank,Guest_GameRank,host_win_count,host_lose_count,guest_win_count,guest_lose_count,game_behind,host_place_streak,...,gPTS_avg10,gTS%_avg10,g3PAR_avg10,gFTr_avg10,gDRB%_avg10,gTRB%_avg10,gAST%_avg10,gSTL%_avg10,gBLK%_avg10,gDRtg_avg10
0,13,21,13,22,9,11,11,10,-1.5,1,...,99.0,0.5206,0.223,0.2981,69.22,50.05,61.57,8.63,10.31,110.87
1,13,21,13,23,11,9,7,15,5.0,1,...,90.3,0.5077,0.2144,0.3095,71.46,49.48,59.83,6.48,9.46,107.91
2,12,20,13,22,9,10,17,4,-7.0,2,...,108.0,0.5915,0.2743,0.2518,74.26,50.99,61.82,8.3,6.85,101.41
3,12,20,13,21,11,8,15,5,-3.5,4,...,100.3,0.5473,0.3595,0.2544,74.23,47.88,52.07,9.31,7.64,109.24
4,11,24,16,22,7,16,10,11,-4.0,1,...,101.1,0.5605,0.2173,0.3177,68.45,50.4,56.33,7.67,7.83,114.86


### STEP THREE: Train/test split and StandardScaler( ).  In this case, instead of using train_test_split, we're going to use the most recent season as our testing data (2016 data), and previous seasons as our training data.

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
data['Season'].value_counts()

2014    998
2016    985
2015    984
2013    801
Name: Season, dtype: int64

In [28]:
#create your training and testing sets
mask = (data['Season'] == 2016)
X_train = X[~mask]
X_test = X[mask]



In [29]:

y_train = y[~mask]
y_test = y[mask]
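The same season-based split can be sketched on a tiny made-up dataframe (the frame `toy` and its values are assumptions for illustration). The key idea is that one boolean mask is reused for both the predictors and the target, so the rows always stay aligned.

```python
import pandas as pd

# Tiny made-up dataset with a Season column
toy = pd.DataFrame({'Season': [2013, 2014, 2015, 2016, 2016],
                    'feature': [0.1, 0.4, 0.3, 0.9, 0.7],
                    'host_wins': [1, 0, 1, 1, 0]})

# The most recent season is held out as the test set
is_test = toy['Season'] == 2016

X_tr, X_te = toy.loc[~is_test, ['feature']], toy.loc[is_test, ['feature']]
y_tr, y_te = toy.loc[~is_test, 'host_wins'], toy.loc[is_test, 'host_wins']

# Predictors and target share the same row index in each split
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
```

Splitting by time like this avoids training on games that happened after the ones we are trying to predict.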

In [30]:
#standardise your predictor matrices
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)
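Notice that we call `fit_transform` on the training data but only `transform` on the testing data. A minimal sketch with made-up numbers (the arrays `train` and `test` are assumptions for illustration) shows that the mean and standard deviation come from the training data alone and are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train mean and std

print('mean learned from train:', scaler.mean_)
print('std learned from train:', scaler.scale_)
print('scaled test value:', test_scaled)
```

If we fit the scaler on the test data as well, information about the test set would leak into our preprocessing.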

### STEP FOUR: Use cross-validation to optimise the hyperparameters for your model. You might try different types of models at this stage as well, and you might use GridSearchCV (or any of the other CVs like RidgeCV).

Below we can fit a default `KNeighborsClassifier` to predict 'host_wins'.

We can use cross-validation with our training data to see how well it performs.

In [31]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [33]:
#set up a default KNN model and cross-validate it on the training data
#use 5 cross-validation folds
knn = KNeighborsClassifier()
cross_val_score(knn, X_train_ss, y_train, cv=5)

In [34]:
#find the mean for your cross_val_scores
knn_cv_accuracy = cross_val_score(knn, X_train_ss, y_train, cv=5).mean()
print('Mean cross-validated accuracy for default knn:',knn_cv_accuracy)
print('Baseline accuracy:',baseline)

Our default KNN performs quite poorly in cross-validation on the training data. But what if we **changed the number of neighbors? The weighting? The distance metric?**

These are all **hyperparameters of the KNN**. How would we do this manually? We would need to use the training data to find the set of hyperparameters that performs best, and then use this set of hyperparameters to fit the final model and score it on the testing set.

#### Gridsearch pseudocode for our KNN

```python
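# neighbors_to_test, weightings_to_test and distance_metrics_to_test
# are lists of candidate values that you define beforehand
accuracies = {}
for k in neighbors_to_test:
    for w in weightings_to_test:
        for d in distance_metrics_to_test:
            hyperparam_set = (k, w, d)
            # build a KNN with this particular combination
            knn = KNeighborsClassifier(n_neighbors=k, weights=w, metric=d)
            # score it with 5-fold cross-validation on the training data only
            cv_accuracies = cross_val_score(knn, X_train, y_train, cv=5)
            accuracies[hyperparam_set] = np.mean(cv_accuracies)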
```

In the pseudocode above, we would find the key in the dictionary (a hyperparameter set) that has the largest value (mean cross-validated accuracy).
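Here is a runnable version of the manual procedure, as a sketch on synthetic data rather than the basketball data (the dataset from `make_classification`, the candidate values and the names `X_demo`, `y_demo` and `best_set` are all assumptions for illustration). It ends by picking the dictionary key with the highest mean cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data standing in for a real training set
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

accuracies = {}
for k in [3, 7, 15]:
    for w in ['uniform', 'distance']:
        knn_demo = KNeighborsClassifier(n_neighbors=k, weights=w)
        # mean accuracy across 5 cross-validation folds
        accuracies[(k, w)] = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

# the hyperparameter set with the largest mean cross-validated accuracy
best_set = max(accuracies, key=accuracies.get)
print('best (n_neighbors, weights):', best_set)
print('mean cv accuracy:', accuracies[best_set])
```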

#### Using `GridSearchCV`

This would be an annoying process to have to do manually. Luckily sklearn comes with a convenience class for performing gridsearch:

```python
from sklearn.model_selection import GridSearchCV
```

The `GridSearchCV` has a handful of important arguments:

| Argument | Description |
| --- | ---|
| **`estimator`** | The sklearn instance of the model to fit on |
| **`param_grid`** | A dictionary where keys are hyperparameters for the model and values are lists of values to test |
| **`cv`** | The number of internal cross-validation folds to run for each set of hyperparameters |
| **`n_jobs`** | How many cores to use on your computer to run the folds (-1 means use all cores) |
| **`verbose`** | How much output to display (0 is none, 1 is limited, 2 is printouts for every internal fit) |


Below is an example for how one might set up the gridsearch for our KNN:

```python
knn_parameters = {
    'n_neighbors':[1,3,5,7,9],
    'weights':['uniform','distance']
}

knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), knn_parameters, cv=4, verbose=1)
knn_gridsearcher.fit(X_train, y_train)
```

**Try out the sklearn gridsearch below on the training data.**

In [35]:
from sklearn.model_selection import GridSearchCV

In [36]:
knn_params = {
    'n_neighbors': [5,9,15,25,40,50,60],
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(), 
                              knn_params, 
                              n_jobs=-1, cv=5, verbose=1)

knn_gridsearch.fit(X_train_ss, y_train)

Fitting 5 folds for each of 28 candidates, totalling 140 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   35.5s
[Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed:  1.7min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_neighbors': [5, 9, 15, 25, 40, 50, 60], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

Great!  We've fit our GridSearch.
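The "28 candidates, totalling 140 fits" line in the output above comes from simple multiplication, and it is worth being able to work this out before you launch a search. The sketch below reproduces the arithmetic for the grid used above (the variable names `cv_folds`, `n_candidates` and `total_fits` are just for illustration).

```python
from math import prod

param_grid = {
    'n_neighbors': [5, 9, 15, 25, 40, 50, 60],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}
cv_folds = 5

# one candidate per combination of values: 7 * 2 * 2
n_candidates = prod(len(values) for values in param_grid.values())

# each candidate is fit once per cross-validation fold
total_fits = n_candidates * cv_folds

print(n_candidates, 'candidates,', total_fits, 'fits')

# adding one more hyperparameter with 10 candidate values multiplies the work by 10
print('with an extra 10-value hyperparameter:', total_fits * 10, 'fits')
```

This multiplication is the main pitfall of large hyperparameter spaces: every extra hyperparameter multiplies the number of fits, so search time grows very quickly.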

#### Examining the results of `GridSearchCV`

Once the gridsearch has fit (this can take a while!) we can pull out a variety of information and useful objects from the gridsearch object, stored as attributes:

| Property | Use |
| --- | ---|
| **`results.param_grid`** | Displays parameters searched over. |
| **`results.best_score_`** | Best mean cross-validated score achieved. |
| **`results.best_estimator_`** | Reference to model with best score.  Is usable / callable. |
| **`results.best_params_`** | The parameters that have been found to perform with the best score. |
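| **`results.cv_results_`** | Dictionary of scores, fit times and ranks for every parameter combination tested. (Replaces the older `grid_scores_`, which is no longer available in recent sklearn versions.) |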
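A convenient way to inspect `cv_results_` is to load it into a dataframe. The sketch below does this for a small gridsearch on synthetic data (the data, the grid and the names `demo_search` and `results_df` are assumptions for illustration, not the basketball search).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a deliberately small grid
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_search = GridSearchCV(KNeighborsClassifier(),
                           {'n_neighbors': [3, 7], 'weights': ['uniform', 'distance']},
                           cv=3)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dictionary of arrays, one entry per candidate
results_df = pd.DataFrame(demo_search.cv_results_)
columns_of_interest = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df[columns_of_interest].sort_values('rank_test_score'))
```

The row ranked first corresponds to `best_params_`, and its `mean_test_score` is the value reported by `best_score_`.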

**Print out the best score found in the search.**

In [38]:
#print out the best mean cross-validated accuracy from the gridsearch
#hopefully this should be much better than our default mean cross-validated accuracy 
knn_gridsearch.best_score_

0.6266618756737333

In [None]:
print('Mean cross-validated accuracy for default knn:',knn_cv_accuracy)
print('Baseline accuracy:',baseline)

**Print out the set of hyperparameters that achieved the best score.**

In [41]:
#print out your best hyperparameters
best_model = knn_gridsearch.best_estimator_

knn_gridsearch.best_params_

{'metric': 'manhattan', 'n_neighbors': 50, 'weights': 'uniform'}

In [42]:
best_model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
           metric_params=None, n_jobs=1, n_neighbors=50, p=2,
           weights='uniform')

### STEP FIVE: Once you're happy with your hyperparameters, fit your model on your whole training data and test it on your whole testing data.  (When you use a gridsearch's `.best_estimator_`, it will already have fit a model with the best hyperparameters on your training data, so all you have to do is score it on your testing data.)

**Assign the best fit model (`best_estimator_`) to a variable and score it on the test data.**

Compare this model to the baseline accuracy and your default KNN.

In [43]:
#assign your best_estimator_ to the variable, then use .score( ) on your testing data
best_knn = knn_gridsearch.best_estimator_
best_knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
           metric_params=None, n_jobs=1, n_neighbors=50, p=2,
           weights='uniform')
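Scoring the best estimator on held-out data could look like the sketch below. It uses synthetic data and a random train/test split purely for illustration (the names `X_tr`, `X_te`, `demo_search` and `best_demo` are assumptions); in the lesson you would use `X_test_ss` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with a held-out test set
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Search on the training data only
demo_search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 7, 15]}, cv=5)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refit on all of X_tr
best_demo = demo_search.best_estimator_
test_accuracy = best_demo.score(X_te, y_te)

print('best params:', demo_search.best_params_)
print('test accuracy:', test_accuracy)
```

Calling `.score()` on the gridsearch object itself gives the same number, because it delegates to the refit best estimator.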

### STEP SIX: Then you might want to evaluate the performance of the model (R2 score, accuracy, confusion matrix, etc); find the actual predictions that your model is providing and store them in a dataframe; plot your predictions against your actual target variable to visualise how well your model performed; investigate feature importance with .coef_ if you have a parametric model.

In [44]:
#there's lots of stuff you can do to follow up and investigate when you've found your best model,
#but let's just look at some different ways to assess a classifier now using what you learned yesterday:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

#predict on the test set, then cross-tabulate actual vs predicted classes
#labels=[1, 0] puts home wins first so the rows/columns match the names below
predictions = best_knn.predict(X_test_ss)
confusion = confusion_matrix(y_test, predictions, labels=[1, 0])
pd.DataFrame(confusion, 
             columns=['predicted_home_win','predicted_home_loss'], 
             index=['actual_home_win','actual_home_loss'])

In [None]:
#recall is concerned with how much you catch all the positives:
#it's more important with cancer detection tests, for example: 
#     "Let's make sure we RECALL all the patients with positives to do further testing!" 
#     "Yes!  But let's make sure we're SENSITIVE about how we tell them they might have cancer."
#it's the true positives, divided by all the actual positives


In [None]:
#let's check that:
recall_score(y_test, predictions)

In [None]:
#precision is concerned with how precisely you predict positives; in other words,
#when you predict a positive, are you pretty sure about that prediction?
#it's more important with a spam filter, for example:
#    "Whhaaattttt. An email from the Home Office went to my spam filter?!"
#    "Spam filters are great, but they damn sure better be PRECISE about what emails they think are spam!"
#it's the true positives, divided by all the predicted positives (true positives + false positives)
501/(501+255)

In [None]:
#let's check that:
precision_score(y_test, predictions)

In [None]:
#the F1 score is the harmonic mean of these two metrics:
f1_score(y_test, predictions)
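To see how these three metrics relate to the confusion matrix, here is a small worked sketch on made-up labels (the lists `y_true` and `y_pred` are assumptions for illustration). It computes each metric by hand from the counts and compares the result to sklearn's functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up actual and predicted classes (1 = positive)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right?
recall = tp / (tp + fn)      # of the actual positives, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('by hand:', precision, recall, f1)
print('sklearn:', precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```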

**Coding checkout (in pairs, 8 min):** How can you modify the recall or the precision?

In [45]:
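#A: change the probability threshold used to turn predicted probabilities into classes
#(lowering it below 0.5 predicts more home wins, which raises recall but usually lowers precision)
predictions_probas = best_knn.predict_proba(X_test_ss)
predictions_probas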
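One way to trade precision against recall is to move the probability threshold at which we call something a positive. The sketch below uses made-up labels and probabilities (`y_true`, `proba_positive` and the helper `precision_recall_at` are hypothetical names for illustration) to show the effect of lowering the threshold from 0.5 to 0.3.

```python
import numpy as np

# Made-up actual classes and predicted probabilities of the positive class
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
proba_positive = np.array([0.9, 0.6, 0.4, 0.45, 0.2, 0.35, 0.4, 0.1])

def precision_recall_at(threshold):
    """Classify as positive when the probability is at or above the threshold."""
    y_pred = (proba_positive >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in [0.5, 0.3]:
    precision, recall = precision_recall_at(threshold)
    print('threshold', threshold, 'precision', precision, 'recall', recall)
```

In this toy example the lower threshold catches every positive (higher recall) at the cost of more false alarms (lower precision).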

<a id='practice'></a>

## Independent practice: gridsearch regularization penalties with logistic regression

---

Logistic regression models can also apply the Lasso and Ridge penalties. The `LogisticRegression` class takes these regularization-relevant hyperparameters:

| Argument | Description |
| --- | ---|
| **`penalty`** | `'l1'` for Lasso, `'l2'` for Ridge |
| **`solver`** | Must support the chosen penalty. `'liblinear'` handles both `'l1'` and `'l2'` (`'saga'` does too). |
| **`C`** | The inverse of the regularization strength, equivalent to `1./alpha`. Smaller `C` means stronger regularization. |
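Because `C` is the inverse of the regularization strength, small values of `C` penalise the coefficients heavily. A rough sketch on synthetic data (the dataset, the three `C` values and the names `X_demo`, `y_demo` and `zero_counts` are assumptions for illustration) counts how many coefficients the Lasso penalty sets exactly to zero at each strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 predictors, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for C in [0.001, 0.1, 10.0]:
    # 'liblinear' is one of the solvers that supports the l1 penalty
    lasso_lr = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    lasso_lr.fit(X_demo, y_demo)
    zero_counts[C] = int(np.sum(lasso_lr.coef_[0] == 0))

# number of coefficients set exactly to zero at each value of C
print(zero_counts)
```

When gridsearching, candidate values for `C` are usually spaced on a log scale, for example with `np.logspace`, because its effect spans several orders of magnitude.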

**You should:**
1. Fit and validate the accuracy of a default logistic regression on the basketball data.
2. Perform a gridsearch over different regularization strengths and Lasso and Ridge penalties.
3. Compare the accuracy on the test set of your optimized logistic regression to the baseline accuracy and the default model.
4. Look at the best parameters found. What was chosen? What does this suggest about our data?
5. Look at the coefficients and associated predictors for your optimized model. What appear to be the most important predictors of winning the game?


In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
#set up a logistic regression model, 
#and find its cross-validated scores for your training data
#you should get accuracies higher than 0.6, which is better than KNN!
lr = 
cross_val_score()

In [None]:
#find the mean of those cross-validated scores:


In [None]:
# Set up the parameters. 
# Use a list with 'l1' and 'l2' for the penalties,
# Use a list with 'liblinear' for the solver,
# Use a logspace from -3 to 0, with 50 different values

# fill the dictionary of parameters
gs_params = {'penalty':?,
             'solver':?,
             'C':?}

#create your gridsearch object
lr_gridsearch = GridSearchCV()


In [None]:
#fit your gridsearch object on your training data
lr_gridsearch.

In [None]:
# find the best mean cross-validated score that your gridsearch found:
# (this should be better than the mean cross-validated score for your default logistic regression above)
lr_gridsearch.

In [None]:
# find the best hyperparameters that your gridsearch found:
lr_gridsearch.

In [None]:
# assign the best estimator to a variable:
best_lr = 

In [None]:
# score your best estimator on the testing data:


### Let's analyse the feature importances

In [None]:
# create a dataframe to look at the coefficients
coef_df = pd.DataFrame({'coef': best_lr.coef_[0],
                        'feature': X.columns,
                        'abs_coef': np.abs(best_lr.coef_[0])})

coef_df.head()

In [None]:
# sort by absolute value of coefficient (magnitude)
coef_df.sort_values('abs_coef', ascending=False, inplace=True)
coef_df.head()

## KEY TAKEAWAYS:

- You always want to use your training data to search for your best hyperparameters! You can do this with GridSearchCV, or with other sklearn objects like RidgeCV, LassoCV, ElasticNetCV, or LogisticRegressionCV.  


- You instantiate GridSearchCV with:
    - a model
    - a dictionary of that model's hyperparameters and the values to try for each
    - the number of cross-validation folds you want it to perform (`cv=`)
    - how many cores to use on your computer for this job (`n_jobs=`)
    - whether you want your model to give you some print-outs as it works (`verbose=`)
    

- Once you've instantiated the GridSearch object, you can fit it on your training data


- Once it's finished searching, you can access some useful attributes:
    - `.best_score_`, to find the mean cross-validated score of the best estimator
    - `.best_params_`, to find the best hyperparameters 
    - `.best_estimator_`, which you will assign to a variable in order to use `.score()`, `.predict()`, `.coef_`, etc

## PROGRESS CHECKPOINT:

- EVERYONE should be able to:
    - give an example of a hyperparameter
    - explain the basic intuition behind how GridSearch works
    - copy and paste the GridSearching from this lesson and adapt it to a different lab to find the best hyperparameters for a model



- MANY of you should be able to:
    -  use your best estimator from the gridsearch to evaluate model effectiveness and investigate feature importance
    -  come up with reasonable parameter dictionaries by yourself for KNN and Logistic Regression
    -  create a confusion matrix using your predictions from your best estimator, and understand how to interpret it


- SOME of you should be able to:
    - look at the documentation for GridSearchCV on sklearn's website, and investigate the different options that are available (for example, the attribute `.cv_results_`)
    - create a ROC curve and a Precision-Recall curve for your predictions, and be able to interpret them
    