![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 3: Model Evaluation and Improvement

#####################################################################################

Double-click to write down your name and surname.

**Name:**


**Surname:**

**Honour Pledge** <p>
    
    
Declaration: <p>
    
    
I declare that this assessment item is my own work, except where acknowledged, and has not been submitted for academic credit elsewhere or previously, or produced independently of this course (e.g. for a third party such as your place of employment) and acknowledge that the assessor of this item may, for the purpose of assessing this item: 

    a. Reproduce this assessment item and provide a copy to another member of the University; and/or 
    b. Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking). 

#####################################################################################

# Assessment


Tuning parameters with Grid + cross-validation: GridSearchCV <font color=green>**(Step 6 of the ML work-flow)**</font>

Pipelines.

The test set will be kept in a "safe box" and used only once we have found the best parameters and the best model with GridSearchCV.


# 1. Introduction

In this assessment, you will use a grid search to find the best $\alpha$ ($\alpha=C=1/\lambda$) and the best B:M class_weight combination for logistic regression with Ridge (L2) regularization and 5-fold cross-validation (5-CV).

**NB Nomenclature:**
Training, Validation and Test Set. 

* The training set, used to train the model
* The validation set, used to evaluate model performance and adjust hyper-parameters accordingly (for example, the alpha for Ridge Regression). Therefore, the validation set is used as an intermediate step. 
* The test set, used for final model evaluation. Book 2 uses the term "validation set" for what we call "test set" in Book 1.
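
As a minimal sketch of the three roles above, the cell below carves a toy dataset into training, validation and test sets with two calls to `train_test_split`. The data, the variable names and the 60/20/20 proportions are illustrative assumptions, not part of the assessment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 60 + [1] * 40)

# Step 1: hold out the test set (20% of all samples).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split the remainder into training and validation sets
# (25% of the remaining 80 samples = 20 validation samples).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

With k-fold cross-validation, the validation role is played in turn by each fold of the training data, so no separate validation split is needed.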


## 1.1. Aims of the Exercise:
 1. To become familiar with a validation set to find the best hyper-parameters of a model. Remember that the hyper-parameters are defined by the user.
 2. To become familiar with a grid search: the most commonly used method for tuning parameters is via a grid search, which entails testing many combinations of the parameters of interest.
 3. To become familiar with k-CV and grid search
 4. To become familiar with Python pipelines

 
It aligns with all of the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all of the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click in each cell. In the document, each cell can be a "Code" cell or "text-Markdown" cell. To choose between these two options, go to the combo-box above. 
 3. If you want to save your notebook, please make sure you press the "floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, please go to Cell->All Output->Clear

# 2. Load dataset

In [None]:
import sys
print(sys.version)


import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
dataframe = pd.read_csv('data/breast-cancer-wisconsin-data/data.csv', sep=',')

In [None]:
# Sanity Check:
display(dataframe.head())
dataframe.shape

# 3. Grid Search with Cross-Validation: GridSearchCV

The most commonly used method for tuning parameters is via a grid search, which entails testing many combinations of the parameters of interest.<p>
    
 We want to utilise the benefits of cross-validation with the grid search. We will seek to find the model with the best accuracy by using cross-validation. We will use the "GridSearchCV" class from sklearn.
 
 http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
    
We have two primary hyper-parameters that we would like to tune:
* C, ($C=\alpha=1/\lambda$)
* class_weight, the class weights.
<p>
    
Let's say we want to try:

1. C = 0.001, 0.01, 0.1, 1, 10, 100.

2. class_weight = 'balanced', {'B':0.1, 'M':0.9}, {'B':0.2, 'M':0.8}, {'B':0.3, 'M':0.7}, {'B':0.4, 'M':0.6}, and {'B':0.5, 'M':0.5}.


Note that class_weight = {'B':0.5, 'M':0.5} corresponds to no class weighting, as the weightings are equal. As there are 6 cases of C, and 6 of class weight, there are 6 times 6 = 36 total combinations of C and class weight.<p>
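
The count of 36 can be checked with a short sketch that enumerates every pairing of the values listed above. The variable names are illustrative assumptions; this only counts combinations and does not build the grid object used later.

```python
from itertools import product

# Candidate values taken from the lists above.
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
class_weights = [
    'balanced',
    {'B': 0.1, 'M': 0.9},
    {'B': 0.2, 'M': 0.8},
    {'B': 0.3, 'M': 0.7},
    {'B': 0.4, 'M': 0.6},
    {'B': 0.5, 'M': 0.5},
]

# Every (C, class_weight) pairing is one candidate model.
combinations = list(product(C_values, class_weights))
n_folds = 5

print(len(combinations))            # number of candidate models
print(len(combinations) * n_folds)  # model fits needed for 5-fold CV
```

With 5-fold cross-validation each candidate is fitted once per fold, so the search performs 180 fits, plus one final refit if the best model is retrained on the whole training set.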
    
We will choose **L2 regularization (ridge) for this problem**. As we are using a grid, and later a grid in combination with cross-validation, we have to keep *computational complexity* in mind. The L2 penalty is the sum of the squared beta coefficients, which is smooth and differentiable; for linear regression it even yields a closed-form solution. Logistic regression has no closed-form solution under either penalty, but with L2 the objective stays smooth, so standard gradient-based solvers converge quickly. The L1 penalty involves an absolute value, which is not differentiable at zero, so it must rely on specialised iterative methods (in the lasso case, coordinate descent) and is typically more expensive computationally. This means L2 will usually be faster to fit.
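
To illustrate the closed-form idea, the sketch below solves ridge for *linear* regression with plain matrix algebra, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, and checks that the gradient of the penalised loss vanishes there. The toy data and the value of `lam` are assumptions for illustration. Penalised logistic regression, which this assessment uses, is always fitted iteratively, so this shows the intuition rather than what the solver actually does.

```python
import numpy as np

# Toy linear-regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
lam = 1.0  # regularisation strength (lambda)

# Closed-form ridge solution: solve (X^T X + lam * I) beta = X^T y.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Gradient of ||y - X b||^2 + lam * ||b||^2 evaluated at beta.
grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta

print(beta)
print(np.allclose(grad, 0, atol=1e-8))
```

No such one-line solve exists once the squared penalty is replaced by absolute values, which is why lasso needs an iterative method such as coordinate descent.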

### <font color='blue'> Question 1: Split the whole dataset into a train and a test set (20% of the total). Keep the test set aside (hidden inside a box as we mentioned in the videos) until the very end (15 marks)</font>

 <font color='green'> *NB*: We stratify in order to keep the same proportion of each class in the different splits.</font>
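
As a hedged illustration of what stratification does, the sketch below splits a toy label vector (70 'B' and 30 'M', invented for this example) and compares the share of 'M' in each part. It uses made-up data rather than the assessment dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 70% benign, 30% malignant.
y_toy = np.array(['B'] * 70 + ['M'] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks the split to preserve the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print("share of M in train:", np.mean(y_tr == 'M'))
print("share of M in test: ", np.mean(y_te == 'M'))
```

Without `stratify`, the class shares of the two parts can drift apart by chance, which matters most for small or imbalanced datasets.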

In [None]:
# Q1: Write Python Code Here


## 3.1. Define the Pipeline

As in exercise 1, we will have to use a pipeline in order to also standardize the features for each iteration of the cross-validation.


### <font color='blue'> Question 2: Define the scaler we will use, and the estimator. As before, choose the scaler ("Transform") to be StandardScaler(), and the estimator ("Estimator") to be L2 Logistic Regression. (5 marks)</font>

To read about Pipelines:
1. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
2. https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976


In [None]:
# Q2: Write Python Code Here



## 3.2. Define the parameter grid

### <font color='blue'> Question 3: Define the parameter grid. This is the 2-dimensional range we wish to draw parameter values from. Call this object 'param_grid'. (20 marks)</font>

 **Important:** as we are using a pipeline, there are two processes that are executed for each iteration of the cross-validation. First, the standardisation, then the fitting of the logistic model. This means we have to indicate which of these processes our specified parameters should be used for. That is, the computer may try to fit class_weight into StandardScaler if we forget to tell it not to. Notice above that we have named our logistic model 'Estimator'. This means we can designate its parameters by naming the hyper-parameters in parameter grid "Estimator__'parameter_name'". **For example, we tell the computer that C is meant for the logistic regression estimator by defining it as 'Estimator__C' in the param_grid.**
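
The `step__parameter` naming rule can be seen in a small sketch. The step names `'scale'` and `'clf'` below are hypothetical and chosen only for this illustration; the same pattern applies to whatever names your own pipeline uses.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Demo pipeline with hypothetical step names 'scale' and 'clf'.
demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('clf', LogisticRegression())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
all_params = demo_pipe.get_params()
print('clf__C' in all_params)             # parameter of the classifier step
print('scale__with_mean' in all_params)   # parameter of the scaler step

# The same names are used to set a value on the right step.
demo_pipe.set_params(clf__C=0.1)
print(demo_pipe.named_steps['clf'].C)
```

Calling `get_params()` on a pipeline is a quick way to list the exact keys that a parameter grid may use.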

In [None]:
# Q3: Write Python Code Here


### <font color='blue'> Question 4: Initialise the GridSearchCV class by passing it the pipeline we have created, *pipe*, our parameter grid, *param_grid*, and specifying how many folds we would like. We must consider the computational complexity of the algorithm, so we can't set cv too high. We choose 5 folds. (5 marks)</font>

In [None]:
# Q4: Write Python Code Here


### <font color='blue'> Question 5: What score (performance metric) are we using in the GridSearchCV above? If we used the f1 score instead, what command or commands should be included? - (10 marks)</font>

<b> Write your thoughts here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

In [None]:
# Q5: Write Python code here
grid_search_f1 = 

## 3.3. Find the best parameters

Now train the grid_search object. Note that grid_search behaves similarly to other classifiers, in the sense that we can use the methods fit, predict and score with it. When we use fit, it performs the grid cross-validation we designed during its initialisation.
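
A minimal sketch of this classifier-like behaviour, on synthetic data from `make_classification` with a deliberately tiny one-parameter grid (all names and values here are illustrative assumptions, not the grid of this assessment):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (hypothetical), split into train and test parts.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# A tiny grid: three values of C, 3-fold cross-validation.
demo_search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1, 10]}, cv=3)

demo_search.fit(X_tr, y_tr)            # runs the cross-validated search
preds = demo_search.predict(X_te)      # predicts like any fitted classifier
test_score = demo_search.score(X_te, y_te)

print(preds[:5])
print(test_score)
```
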

### <font color='blue'> Question 6: Fit the GridSearchCV you created before and show the best parameters and score out of all the folds - (15 marks)</font>

In [None]:
# Q6: Write Python code here


# It takes a while to run ...

## 3.4. Visualise the grid results
The results from the cross-validated grid search are stored in cv_results_
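
The rows of `cv_results_` follow the order in which the grid enumerates its candidates: parameter names are sorted, and the last name varies fastest. The sketch below shows that order on a toy 2 x 3 grid using `ParameterGrid`. The `'w1'`, `'w2'`, `'w3'` values are placeholders rather than valid class weights, and the scores are stand-in numbers.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Toy grid (hypothetical): 2 values of C, 3 placeholder class_weight labels.
toy_grid = {'C': [1, 10], 'class_weight': ['w1', 'w2', 'w3']}

# Candidates are listed with C varying slowest and class_weight fastest.
candidates = list(ParameterGrid(toy_grid))
for i, params in enumerate(candidates):
    print(i, params)

# Stand-in scores 0..5, one per candidate, in the order printed above.
toy_scores = np.arange(len(candidates)).reshape(2, 3)  # rows = C, columns = class_weight
print(toy_scores.T)                                    # rows = class_weight, columns = C
```

This is why a flat vector of mean scores can be reshaped to (number of C values, number of class_weight settings) and then transposed for plotting.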

In [None]:
#!pip install mglearn
import mglearn

In [None]:
import warnings; warnings.simplefilter('ignore') #prevent warnings

# convert results to DataFrame
results = pd.DataFrame(grid_search.cv_results_) 
# show the first few rows 
display(results.head())

In [None]:
scores = np.array(results.mean_test_score)
scores = scores.reshape(6, 6) 
# reshape: first axis = values of C, second axis = class_weight settings

# Take transpose because we want class_weight on the y axis, so we can more easily see the tick labels
scores = np.transpose(scores)

print(scores)

In [None]:
# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, 
                      ylabel='class_weight', 
                      yticklabels=param_grid['Estimator__class_weight'], 
                      xlabel='C', 
                      xticklabels=param_grid['Estimator__C'], 
                      cmap="viridis")

### <font color='blue'> Question 7: Interpret the heatmap - (10 marks)</font>

<b> Write your thoughts here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

## 3.5. Evaluate the performance of the resulting model

Recall that up to this point we have not used the test set: only the training set was used for tuning the parameters. Now we will use a confusion matrix and the average f1 score to evaluate the model on the test set.


"Fitting the GridSearchCV object not only searches for the best parameters, but also
automatically fits a new model on the whole training dataset with the parameters that
yielded the best cross-validation performance".
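
The refitting behaviour described in the quote can be checked with a small sketch on synthetic data (the data and the grid are illustrative assumptions). With the default `refit=True`, the refitted model is available as `best_estimator_`, and calling `predict` on the search object delegates to it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# refit=True is the default: the best candidate is refitted on all data passed to fit().
demo_search = GridSearchCV(LogisticRegression(), {'C': [0.01, 1, 100]}, cv=3)
demo_search.fit(X_demo, y_demo)

# The refitted model carries the winning parameter value ...
same_C = demo_search.best_estimator_.C == demo_search.best_params_['C']
# ... and predictions from the search object come from that model.
same_preds = np.array_equal(demo_search.predict(X_demo),
                            demo_search.best_estimator_.predict(X_demo))

print(same_C, same_preds)
```
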

### <font color='blue'> Question 8: Calculate confusion matrix, accuracy, recall, precision and f1 score. Comment on the results below - (20 marks)</font>
<p><font color='green'> Tip: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html </font></p>

In [None]:
# Q8: Type Python code here


<b> Write your thoughts here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################