![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 3: Resampling Methods
# Assessment


# 1. Introduction

In this assessment, you will use a grid search to find the best $\alpha$ ($\alpha=C=1/\lambda$) and the best combination of NO:YES class_weight for logistic regression with Ridge (L2) regularization and 5-fold cross-validation (5-CV).

**Note on nomenclature:** training, validation and test sets.

* The training set, used to train the model 
* The validation set, used to evaluate model performance and adjust model parameters accordingly
* The test set, used for final model evaluation.


## 1.1. Aims of the Exercise:
 1. To become familiar with using a validation set to find the best parameters of a model. Remember that these parameters are set by the user rather than learned from the data.
 2. To become familiar with grid search: the most commonly used method for tuning parameters is a grid search, which entails testing many combinations of the parameters of interest.
 3. To become familiar with k-CV and grid search
 4. To become familiar with Python pipelines

 
It aligns with all the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click in each cell. In this document, each cell can be a "Code" cell or a "text-Markdown" cell. To choose between these two options, go to the combo-box above. 
 3. If you want to save your notebook, please make sure you press "the floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, please go to Cell->All Output->Clear


# 2. Load the unstandardized hospital data (including dummy variables).

In [None]:
import sys
print(sys.version)
#For this notebook to work, Python must be 3.6.4 or 3.6.5

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
hospital = pd.read_csv('data/diabetes/Data_Class_Dummies.csv', sep=',')


In [None]:
# Sanity Check:
display(hospital[:][:5])
hospital.shape

## 2.1. Split the data into features and response (readmission).
**Note that we are taking the *values* of readmission, so y is an *array*, NOT a DataFrame**. This saves us from having to use the '.values.ravel()' method whenever we are using the response variable.

In [None]:
X = hospital.drop(['readmission'], axis = 1)
# Select the column as a Series so that y is a 1-D array of shape (n_samples,)
y = hospital['readmission'].values

In [None]:
display(X[:][:5])
display(y[:][:5])

# 3. Grid

We have previously created logistic models using regularized logistic regression, and at other times class weighting. Now, we will combine these two methods. Further, we wish to improve the model's generalization performance by tuning its parameters. The most commonly used method for tuning parameters is a grid search, which entails testing many combinations of the parameters of interest.<p>
    For our case, we have two primary parameters that we would like to tune:
* C, ($C=\alpha=1/\lambda$)
* class_weight, the class weights.
<p>
    
Let's say we want to try C = 0.001, 0.01, 0.1, 1, 10, 100. For the class weights, we want to try class_weight = {'NO':0.1, 'YES':0.9}, {'NO':0.2, 'YES':0.8}, {'NO':0.3, 'YES':0.7}, {'NO':0.4, 'YES':0.6}, and {'NO':0.5, 'YES':0.5}. Once the response is binarised (Section 3.1), the keys 'NO' and 'YES' become 0 and 1. Note that class_weight = {'NO':0.5, 'YES':0.5} corresponds to no class weighting, as the weightings are equal. As there are 6 values of C and 5 of class weight, there are 6 times 5 = 30 total combinations of C and class weight.<p>
    We will choose L2 regularization (ridge) for this problem. As we are using a grid, and later a grid in combination with cross validation, we have to keep *computational complexity* in mind. Logistic regression has no closed form solution under either penalty, so it is always fitted iteratively. The L2 penalty squares the beta coefficients, which makes it smooth and differentiable everywhere, so standard gradient-based solvers handle it efficiently. The L1 penalty involves an absolute value, which is not differentiable at zero, so it must rely on specialised solvers (in the lasso case, coordinate descent). For this reason, L1 is typically more expensive to fit, and L2 is the faster choice here.
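
To make the role of C concrete, here is a minimal sketch that fits L2-regularised logistic regression for each candidate value of C and records the size of the fitted coefficients. It uses synthetic data from `make_classification` rather than the hospital data, and the names `X_demo`, `y_demo` and `coef_norms` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data (NOT the hospital data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    # The L2 penalty is the default for LogisticRegression
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_demo, y_demo)
    # Euclidean norm of the fitted coefficients
    coef_norms[C] = np.linalg.norm(model.coef_)

for C, norm in coef_norms.items():
    print("C = {:>7}: ||beta|| = {:.3f}".format(C, norm))
```

Because $C=1/\lambda$, a small C means strong regularisation and coefficients shrunk towards zero, while a large C approaches the unregularised fit.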





## 3.1. Binarise The Response -  Readmission
We will use the **F1 score** to evaluate the model, as accuracy is not well suited to this problem. As always when using the F1 score, we have to binarise the response. Let's do this now. Just keep in mind that 0 corresponds to a 'NO' response, whilst 1 corresponds to a 'YES' response.
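
The following sketch illustrates why accuracy can mislead when the classes are imbalanced. It assumes a hypothetical response with 90 'NO' (0) and 10 'YES' (1) cases and a trivial classifier that always predicts 'NO'; the labels are made up for illustration only.

```python
import warnings
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 90 'NO' (0) and 10 'YES' (1)
y_true_demo = np.array([0] * 90 + [1] * 10)
# A trivial classifier that always predicts 'NO'
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)

# F1 for the 'YES' class is undefined when 'YES' is never predicted;
# scikit-learn sets it to 0 and emits a warning, which we silence here.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    f1_yes = f1_score(y_true_demo, y_pred_demo)
    f1_macro = f1_score(y_true_demo, y_pred_demo, average='macro')

print("Accuracy:           {:.2f}".format(acc))
print("F1 ('YES' class):   {:.2f}".format(f1_yes))
print("F1 (macro average): {:.2f}".format(f1_macro))
```

The accuracy looks respectable even though not a single readmission is detected, whereas the F1 score exposes the problem.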

In [None]:
y_binary = [0 if x =='NO' else 1 for x in y]

# Sanity Check
print('Readmission (original y): ', hospital['readmission'][:10].values.ravel())
print('y after binary conversion: ', y_binary[:10])

## 3.2. Validation Set
Up to this point, our training and performance methods have only used two sets: the training set, and the test set. Now that we are tuning parameters based on the performance of the model on a set other than the one the model was trained on, we need a third set. This is because we must reserve the test set purely for model evaluation at the end of our analysis, and we cannot use it for parameter tuning. If we use the test set to tune parameters, it is no longer 'unseen data' and so cannot be considered an independent data set for performance evaluation. The new set we create is called the *validation set*. Hence, our data is now partitioned into 3 components:
* The training set, used to train the model 
* The validation set, used to evaluate model performance and adjust model parameters accordingly
* The test set, used for final model evaluation.
<p>

This may seem as though we are giving up data to train the model with, but this is not so serious. After we select the best parameters to train the model, we can retrain the model using those parameters on the union of the training and validation set, and then evaluate on the test set, as before.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Split the whole dataset into a train+validation set, and a test set.

<div class="alert alert-block alert-success">**Start Activity 1**</div>

### <font color='blue'> Question 1a: Split the data into train+validation set and test set (random_state=0 and test_size=0.2) </font>
<p><font color='green'> Tip: You can find the code in page 262 Book 1 </font></p>

In [None]:
#split data into 1) train+validation set and 2) test set 
# Write Python code here:
X_train_val, X_test, y_train_val, y_test = ...

Split the train+validation set into a train and validation set.

### <font color='blue'> Question 1b: Split the train+validation data into train and validation sets (random_state=1 and test_size=0.2) </font>
<p><font color='green'> Tip: You can find the code in page 263 Book 1 </font></p>

In [None]:
# split train+validation set into 1a) training and 1b) validation sets
# Write Python code here:
X_train, X_val, y_train, y_val = ...

### <font color='blue'> Question 1c: Why do we split the data into 1) train+validation set and 2) test set first, and then split the train+validation set into 1a) training and 1b) validation sets afterwards? </font>
<p><font color='green'> Tip: You can read the answers in section Grid Search of Book 1 (pages 260 onwards)</font></p>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 1**</div>

In [None]:
print("Size of training set: {} \nSize of validation set: {} \nSize of test set:"
" {}\n".format(X_train.shape[0], X_val.shape[0], X_test.shape[0])) 

## 3.3. Standardise the training and validation features
Let's use the training set to fit the scaler, then transform the validation set. Later, once we have tuned our parameters, we will use the combined training and validation set to fit the scaler, then transform the test set.
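
As a small illustration of this rule, the sketch below fits a `StandardScaler` on a training portion only and applies it to a held-out portion. The data are random numbers generated for the example, and all names (`X_demo`, `X_tr`, `X_va`, `scaler_demo`) are assumptions that do not refer to the hospital data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random illustrative features (NOT the hospital data)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))

X_tr, X_va = train_test_split(X_demo, test_size=0.2, random_state=1)

# Learn the mean and standard deviation from the training portion only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
# Reuse the training statistics for the held-out portion
X_va_s = scaler_demo.transform(X_va)

print("Training means after scaling:  ", X_tr_s.mean(axis=0).round(3))
print("Validation means after scaling:", X_va_s.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The validation columns are only approximately centred, because no information from the validation set was used to compute the scaling.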

In [None]:
scaler = StandardScaler()

# fit the scaler to the training data ONLY
scaler.fit(X_train)

# standardize the training data
X_train_scaled = scaler.transform(X_train)

# standardize the validation data 
X_val_scaled = scaler.transform(X_val)

# Convert to pandas DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))
X_val_scaled = pd.DataFrame(X_val_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))

In [None]:
# Sanity Check:
display(X_train_scaled[:][:5])
display(X_val_scaled[:][:5])

## 3.4. Find the best parameters - use a grid
Here we will use the F1 score. We use nested for loops to implement our grid search. In the outer for loop, we iterate over each value of 'class_weight' that we are interested in. Then, for each of these class_weight values, we iterate over each of the 'C' values we are interested in. Within this inner for loop, we train the logistic regression model with that choice of parameters, determine the F1 score on the validation set, and update best_f1_score and best_parameters if this latest model is the best performing model thus far.
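
For reference, the set of combinations visited by such nested loops can also be enumerated with scikit-learn's `ParameterGrid`. This is only a sketch for counting and inspecting the combinations; `demo_grid` and `combos` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# The same candidate values that the nested loops iterate over
demo_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'class_weight': [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7},
                     {0: 0.4, 1: 0.6}, {0: 0.5, 1: 0.5}],
}

# Every (C, class_weight) pair, one dictionary per combination
combos = list(ParameterGrid(demo_grid))

print("Number of combinations:", len(combos))
print("First combination:", combos[0])
```

Each dictionary in `combos` could be passed to `LogisticRegression(**params)`, which is what the two loops do by hand.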

In [None]:
# See Python code in page 262 of Book 1

best_f1_score = 0

# weights
for class_weight in [{0:0.1, 1:0.9}, {0:0.2, 1:0.8}, {0:0.3, 1:0.7}, {0:0.4, 1:0.6}, {0:0.5, 1:0.5}]:
    # C=alpha values
    for C in [0.001, 0.01, 0.1, 1, 10, 100]: 
        # for each combination of parameters, train a model
        Log_Reg = LogisticRegression(C = C, penalty = 'l2', class_weight = class_weight)
        Log_Reg.fit(X_train_scaled, y_train)
        
        # evaluate the model on the validation set (using the F1 measure)
        
        #predictions 
        y_pred = Log_Reg.predict(X_val_scaled)
        # Compute f1 score
        score = f1_score(y_true = y_val, y_pred = y_pred, average = 'macro')
        
        # If we got a better score, store the score and parameters 
        if score > best_f1_score:
                best_f1_score = score
                best_parameters = {'C': C, 'class_weight': class_weight}
                

When using the F1 score, you will sometimes receive the warning message <font color = red> "UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples"</font>. This means that, for some of the models, at least one class label was never predicted for the validation set. In such a situation, the F1 score of that class is set to zero, which drags down the macro-averaged score. This is not an issue for us, as we are not interested in models that only predict a single class label (likely the 'NO' label).

In [None]:
print(best_parameters)

<div class="alert alert-block alert-success">**Start Activity 2**</div>

### <font color='blue'> Question 2: What did we do in the last two pieces of code? </font>
<p><font color='green'> Tip: For help, read section Grid Search of Book 1 (page 260 onwards) </font></p>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 2**</div>

## 3.5. Rebuild a model on the combined training and validation set, and evaluate it on the test set
Now that we have the best parameter combination, we want to rebuild a model on the combined training and validation set, and evaluate it on the test set. First, we need to standardise again. We use the combined training and validation set to fit the scaler, then transform both the training+validation set and test set.

In [None]:
scaler = StandardScaler()

# fit the scaler to the training+validation data 
scaler.fit(X_train_val)

# standardize the training+validation data
X_train_val_scaled = scaler.transform(X_train_val)

# standardize the test data 
X_test_scaled = scaler.transform(X_test)

# Convert to pandas DataFrames
X_train_val_scaled = pd.DataFrame(X_train_val_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))
X_test_scaled = pd.DataFrame(X_test_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))

In [None]:
# Sanity Check:
display(X_train_val_scaled[:][:5])
print(X_train_val_scaled.shape)
display(X_test_scaled[:][:5])
print(X_test_scaled.shape)

In [None]:
Log_Reg = LogisticRegression(C = best_parameters['C'], penalty = 'l2', class_weight = best_parameters['class_weight'])
Log_Reg.fit(X_train_val_scaled, y_train_val)

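# predict on the standardised test features, since the model was fitted on standardised data
y_pred = Log_Reg.predict(X_test_scaled)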
test_f1_score = f1_score(y_true = y_test, y_pred = y_pred, average = 'macro')

In [None]:
print("Best f1 score on validation set: {:.2f}".format(best_f1_score)) 
print("Best parameters: ", best_parameters)
print("Test set f1 score with best parameters: {:.2f}".format(test_f1_score))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = y_test , y_pred = y_pred, labels=None, sample_weight=None)

<font color = red> **The test set performance is poor, both in the F1 score and in the confusion matrix. We now have the opposite problem of predicting too many 'YES' cases. Perhaps we should use an alternative performance metric. Alternatively, the issue could be that the winning parameters were only good for this particular validation set, and do not perform well for other subsets of the data. This is the flaw of the training-validation-test paradigm, and it is addressed by using cross-validation.** </font><p>
Read about the available scoring parameters [here](http://scikit-learn.org/stable/modules/model_evaluation.html). We will try to use **balanced accuracy**.

## 3.6 Using 'Balanced Accuracy' as the performance metric

Balanced accuracy, $\phi$, is defined as the arithmetic mean of the class-specific accuracies:
$$ \phi := {1\over2}(\pi^+ + \pi^-) ,$$
where $\pi^+ = {TP\over TP+FN}$ is the accuracy on the positive class (readmission = YES), also known as sensitivity or recall, and $\pi^- = {TN\over TN+FP}$ is the accuracy on the negative class (readmission = NO), also known as specificity. If the classifier performs equally well on both classes, then balanced accuracy reduces to regular accuracy. However, balanced accuracy penalises classifiers that perform differently for each class, for example by favouring the majority class.<p>
    Now, the way we will calculate balanced accuracy in Python is via the confusion matrix, whose rows are the true classes and whose columns are the predicted classes.
* TN is the first entry of the first column. FN is the second entry of the first column.
* FP is the first entry of the second column. TP is the second entry of the second column.
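
The sketch below reads the four counts off a scikit-learn confusion matrix for a tiny, made-up example and computes balanced accuracy under the standard definition, the mean of sensitivity and specificity. As a cross-check it compares the result with the macro-averaged recall, which is the same quantity. The label vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels (0 = 'NO', 1 = 'YES')
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm_demo.ravel()

sensitivity = tp / (tp + fn)   # accuracy on the positive class
specificity = tn / (tn + fp)   # accuracy on the negative class
bacc = (sensitivity + specificity) / 2

# Macro-averaged recall is the mean of the per-class recalls
macro_recall = recall_score(y_true_demo, y_pred_demo, average='macro')

print(cm_demo)
print("Balanced accuracy: {:.4f}".format(bacc))
print("Macro recall:      {:.4f}".format(macro_recall))
```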

In [None]:
#from sklearn.metrics import balanced_accuracy_score

The documentation lists a 'balanced_accuracy_score' (available from scikit-learn 0.20 onwards), but we cannot import it in the version used for this notebook. Hence, we will have to compute it manually from the confusion matrix.

In [None]:
best_BACC = 0


for class_weight in [{0:0.1, 1:0.9}, {0:0.2, 1:0.8}, {0:0.3, 1:0.7}, {0:0.4, 1:0.6}, {0:0.5, 1:0.5}]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]: # for each combination of parameters, train a model
        Log_Reg = LogisticRegression(C = C, penalty = 'l2', class_weight = class_weight)
        Log_Reg.fit(X_train_scaled, y_train)
        
        # evaluate the model on the validation set (using BACC)
        
        #predictions 
        y_pred = Log_Reg.predict(X_val_scaled)
        # Compute BACC score: rows of cm are true classes, columns are predicted classes
        cm = confusion_matrix(y_true = y_val, y_pred = y_pred)
        acc_pos = cm[1][1]/(cm[1][1] + cm[1][0])  # TP/(TP + FN)
        acc_neg = cm[0][0]/(cm[0][0] + cm[0][1])  # TN/(TN + FP)
        BACC = (acc_pos + acc_neg)/2
        
        # if we got a better score, store the score and parameters 
        if BACC > best_BACC:
                best_BACC = BACC
                best_parameters = {'C': C, 'class_weight': class_weight}
                

In [None]:
best_parameters

## 3.7. Rebuild a model on the combined training and validation set, and evaluate it on the test set
Now that we have the best parameter combination, we want to rebuild a model on the combined training and validation set, and evaluate it on the test set. First, we need to standardise again. We use the combined training and validation set to fit the scaler, then transform both the training+validation set and test set.

In [None]:
scaler = StandardScaler()

# fit the scaler to the training+validation data
scaler.fit(X_train_val)

# standardize the training+validation data
X_train_val_scaled = scaler.transform(X_train_val)

# standardize the test data 
X_test_scaled = scaler.transform(X_test)

# Convert to pandas DataFrames
X_train_val_scaled = pd.DataFrame(X_train_val_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))
X_test_scaled = pd.DataFrame(X_test_scaled, columns = list(hospital.drop('readmission', axis = 1).columns.values))

In [None]:
Log_Reg = LogisticRegression(C = best_parameters['C'], penalty = 'l2', class_weight = best_parameters['class_weight'])
Log_Reg.fit(X_train_val_scaled, y_train_val)

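# predict on the standardised test features, since the model was fitted on standardised data
y_pred = Log_Reg.predict(X_test_scaled)
cm = confusion_matrix(y_true = y_test, y_pred = y_pred)
acc_pos = cm[1][1]/(cm[1][1] + cm[1][0])  # TP/(TP + FN)
acc_neg = cm[0][0]/(cm[0][0] + cm[0][1])  # TN/(TN + FP)
test_BACC = (acc_pos + acc_neg)/2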


In [None]:
print("Best BACC score on validation set: {:.2f}".format(best_BACC)) 
print("Best parameters: ", best_parameters)
print("Test set BACC score with best parameters: {:.2f}".format(test_BACC))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true = y_test , y_pred = y_pred, labels=None, sample_weight=None)
print(cm)

<font color = red> **This is an improved performance.** </font><p>

There are still some issues with the model. Notice that the best balanced accuracy on the validation set is much larger than the balanced accuracy on the test set. This means we again have the problem that the winning parameters were only good for this particular validation set, and do not perform well for other subsets of the data. We will now use grid search with cross-validation to address this problem.

# 4. Grid Search with Cross Validation
We want to utilise the benefits of cross validation with the grid search. We will seek to find the model with the best macro averaged F1 score by using cross validation. We will use the "GridSearchCV" class from sklearn.<p>
    As in exercise 1, we will have to use a pipeline in order to also standardize the features for each iteration of the cross validation.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

## 4.1. Define the Pipeline we will use.
First define the scaler we will use, and the estimator. As before, choose the scaler to be StandardScaler(), and the estimator to be L2 Logistic Regression.

To read more about Pipelines, go to Chapter 6, Book 1

In [None]:
Scaler = StandardScaler()
Log_Reg = LogisticRegression(penalty = 'l2')

pipe = Pipeline([('Transform', Scaler), ('Estimator', Log_Reg)])

Define the parameter grid. This is the 2-dimensional range we wish to draw parameter values from. Call this parameter grid param_grid.<p>
    **Important:** as we are using a pipeline, there are two processes that are executed for each iteration of the cross validation: first the standardisation, then the fitting of the logistic model. This means we have to indicate which of these processes each parameter belongs to. Without this, the pipeline cannot tell that class_weight is meant for the logistic model rather than for StandardScaler, and it will reject the parameter as invalid. Notice above that we have named our logistic model 'Estimator'. This means we can designate its parameters by naming them in the parameter grid as the step name, two underscores, and the parameter name. For example, we tell the computer that C is meant for the logistic regression estimator by defining it as 'Estimator__C' in the param_grid.
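
The naming convention can be checked directly on a pipeline object. The sketch below builds a pipeline with the same step names as above and lists the parameter names it exposes; `demo_pipe` and `lr_params` are illustrative names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([('Transform', StandardScaler()),
                      ('Estimator', LogisticRegression())])

# Parameters of each step are exposed as '<step name>__<parameter name>'
lr_params = sorted(name for name in demo_pipe.get_params()
                   if name.startswith('Estimator__'))
print(lr_params)

# Setting a parameter through the pipeline reaches the right step
demo_pipe.set_params(Estimator__C=0.1, Estimator__class_weight={0: 0.2, 1: 0.8})
print(demo_pipe.named_steps['Estimator'].C)
print(demo_pipe.named_steps['Estimator'].class_weight)
```

`GridSearchCV` uses the same `set_params` mechanism, which is why the keys of the parameter grid must carry the step-name prefix.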

## 4.2. Define the parameter grid

In [None]:
param_grid = {'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'Estimator__class_weight': [{0:0.1, 1:0.9}, {0:0.2, 1:0.8}, {0:0.3, 1:0.7}, {0:0.4, 1:0.6}, {0:0.5, 1:0.5}]}
              

print("Parameter grid:")
print("class_weight: {}".format(param_grid['Estimator__class_weight']))
print("C: {}".format(param_grid['Estimator__C']))

Now initialise the GridSearchCV class by passing it the pipeline we have created, *pipe*, our parameter grid, *param_grid*, and specifying how many folds we would like. We must consider the computational complexity of the algorithm, so we can't set cv too high. We choose 5 folds. We also specify scoring = 'f1_macro' to designate that we would like to use the macro-averaged F1 measure.

In [None]:
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring = 'f1_macro')

Recall that when we tune parameters we need validation data in addition to the training and test sets. With cross-validation we no longer set aside a single validation set. Instead, we split the data into two sets, the training set and the test set. We pass the training set into GridSearchCV, which repeatedly carves validation folds out of it during cross-validation, and we reserve the test set for final model evaluation.
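
When `cv` is an integer and the estimator is a classifier, scikit-learn's documentation states that `GridSearchCV` uses stratified folds. The sketch below shows what 5 stratified folds look like on a hypothetical imbalanced response with 90 'NO' (0) and 10 'YES' (1) cases; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced response: 90 'NO' (0) and 10 'YES' (1)
y_demo = np.array([0] * 90 + [1] * 10)
# Placeholder features; only the labels matter for the fold assignment
X_demo = np.zeros((len(y_demo), 1))

skf = StratifiedKFold(n_splits=5)

fold_summary = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # (training size, validation size, number of 'YES' cases in the validation fold)
    fold_summary.append((len(train_idx), len(val_idx), int(y_demo[val_idx].sum())))

for i, (n_train, n_val, n_yes) in enumerate(fold_summary, start=1):
    print("Fold {}: train = {}, validation = {}, 'YES' in validation = {}".format(
        i, n_train, n_val, n_yes))
```

Every observation in the training set is used for validation exactly once, and each fold keeps the class proportions of the full training set.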

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, random_state=0, stratify = y_binary, test_size = 0.2)

## 4.3. Find the best parameters
Now train the grid_search object. Note that grid_search behaves similarly to other classifiers, in the sense that we can use the methods fit, predict, and score with it. However, when we use fit, it performs the cross-validated grid search we designed during its initialisation.

In [None]:
# It takes a while to run ...
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation f1 score: {:.4f}".format(grid_search.best_score_))

## 4.4. Visualise the grid results
The results from the cross validated grid search are stored in cv_results_
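
To show the layout of `cv_results_` without waiting for the full search, here is a minimal sketch on synthetic data with a deliberately small grid. The data set, the grid values and the names `demo_search`, `demo_results` and `demo_scores` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced illustrative data (NOT the hospital data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

demo_grid = {'C': [0.01, 1, 100],
             'class_weight': [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}]}

demo_search = GridSearchCV(LogisticRegression(max_iter=1000),
                           param_grid=demo_grid, cv=5, scoring='f1_macro')
demo_search.fit(X_demo, y_demo)

# One row per parameter combination
demo_results = pd.DataFrame(demo_search.cv_results_)
print(demo_results[['param_C', 'param_class_weight',
                    'mean_test_score', 'rank_test_score']])

# Parameter names are sorted and the last one varies fastest, so rows are
# grouped by C: reshape to (number of C values, number of class_weight values)
demo_scores = demo_results['mean_test_score'].values.reshape(3, 2)
print(demo_scores)
```

The row order is what makes a reshape to (number of C values, number of class_weight values) valid: all class weights for the first C come first, then all class weights for the second C, and so on.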

In [None]:
#!pip install mglearn
import mglearn

In [None]:
import warnings; warnings.simplefilter('ignore') #prevent warnings

# convert results to DataFrame
results = pd.DataFrame(grid_search.cv_results_) 
# show the first few rows 
display(results[:][:5])

In [None]:
scores = np.array(results.mean_test_score)
scores = scores.reshape(6, 5) 
# reshape: first index = number of values of C, second index = number of values of dict

# Take transpose because we want class_weight on the y axis, so we can more easily see the tick labels
scores = np.transpose(scores)

print(scores)

In [None]:
# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, 
                      ylabel='class_weight', 
                      yticklabels=param_grid['Estimator__class_weight'], 
                      xlabel='C', 
                      xticklabels=param_grid['Estimator__C'], 
                      cmap="viridis")

* This is very informative. We can see a clear horizontal stripe pattern in the heatmap. There are two explanations:
    * Firstly, we have not specified an appropriate range for C, meaning that C might have an effect but not on the scale that we have specified. 
    * Secondly, it could be that C simply is not important for this problem.<p>

Given that we have specified a relatively large range for C (C ranges from 0.001 to 100), we are inclined to think that the latter option is correct. To assure ourselves of this, we will search the grid again, but this time allow C to take on a large range in its values: C = [0.00001, 1, and 10000]. That is, we will increase the range of C to even more extreme values, just to confirm that C is not of importance.<p>
    
* We can also clearly see that the F1 score increases as we move towards class_weight = {0:0.2, 1:0.8} from both directions. This raises the question: is the maximum at class_weight = {0:0.2, 1:0.8}, or somewhere between class_weight = {0:0.1, 1:0.9} and class_weight = {0:0.3, 1:0.7}? Let's see. 
   

## 4.5. Fine tuning our search

In [None]:
param_grid = {'Estimator__class_weight': [{0:0.12, 1:0.88}, {0:0.14, 1:0.86}, {0:0.16, 1:0.84},
                                          {0:0.18, 1:0.82}, {0:0.2, 1:0.8}, {0:0.22, 1:0.78},
                                          {0:0.24, 1:0.76}, {0:0.26, 1:0.74}, {0:0.28, 1:0.72}],
              'Estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}

In [None]:
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring = 'f1_macro')

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation f1 score: {:.4f}".format(grid_search.best_score_))

In [None]:
# convert results to DataFrame
results = pd.DataFrame(grid_search.cv_results_) 
# show the first few rows 
display(results[:][:8])

In [None]:
scores = np.array(results.mean_test_score)
scores = scores.reshape(6, 9) 
# reshape: first index = number of values of C, second index = number of values of dict

# Take transpose because we want class_weight on the y axis, so we can more easily see the tick labels
scores = np.transpose(scores)

print(scores)

In [None]:
# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, 
                      ylabel='class_weight', 
                      yticklabels=param_grid['Estimator__class_weight'], 
                      xlabel='C', 
                      xticklabels=param_grid['Estimator__C'], 
                      cmap="viridis",
                      fmt = "%0.2f")

* As before, there is very little variation with respect to C. Hence, we narrow our search to C = [0.1, 1, 10].
* We can also see that the maximum is somewhere between class_weight = {0:0.14, 1:0.86} and class_weight = {0:0.2, 1:0.8}. So, let's focus our attention on these values.

## 4.6. Further Fine Tuning


In [None]:
param_grid = {'Estimator__class_weight': [{0:0.15, 1:0.85}, {0:0.16, 1:0.84},
                                          {0:0.17, 1:0.83}, {0:0.18, 1:0.82}, {0:0.19, 1:0.81}],
              'Estimator__C': [0.1, 1, 10]}

In [None]:
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring = 'f1_macro')

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation f1 score: {:.4f}".format(grid_search.best_score_))

In [None]:
# convert results to DataFrame
results = pd.DataFrame(grid_search.cv_results_) 
# show the first few rows 
display(results[:][:3])

In [None]:
scores = np.array(results.mean_test_score)
scores = scores.reshape(3, 5) 
# reshape: first index = number of values of C, second index = number of values of dict

# Take transpose because we want class_weight on the y axis, so we can more easily see the tick labels
scores = np.transpose(scores)

print(scores)

In [None]:
# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, 
                      ylabel='class_weight', 
                      yticklabels=param_grid['Estimator__class_weight'], 
                      xlabel='C', 
                      xticklabels=param_grid['Estimator__C'], 
                      cmap="viridis",
                      fmt = "%0.3f")

We are content that we've found the optimal parameters:<p>
C = 1<p>
    class_weight = {0:0.17, 1:0.83}.

## 4.7. Evaluate on the test set
Recall that up to this point in Section 4 we have not used the test set; only the training set was used for tuning the parameters. Now we will use a confusion matrix and the macro-averaged F1 score to evaluate the model.

From page 266 of Book 1:

"Fitting the GridSearchCV object not only searches for the best parameters, but also
automatically fits a new model on the whole training dataset with the parameters that
yielded the best cross-validation performance".

<div class="alert alert-block alert-success">**Start Activity 3**</div>

### <font color='blue'> Question 3: Calculate confusion matrix and F1 score </font>
<p><font color='green'> Tip: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html </font></p>

In [1]:
# Type Python code here
# Confusion Matrix

In [2]:
# Type Python code here
# F1-score

<b> Write your thoughts here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 3**</div>

Notice this time the cross validation f1 score is much closer to the final test set f1 score. This is due to the increased stability of cross-validation in comparison with just the training-validation-test paradigm.