![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 3: Resampling Methods
# Exercise 1: Cross-Validation


# 1. Introduction



## 1.1. Aims of the Exercise:
 1. To become familiar with Cross-validation

 
It aligns with all the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.

**NB Nomenclature:**
Training, validation and test sets. 

* The training set, used to train the model
* The validation set, used to evaluate model performance and adjust model parameters accordingly (for example, the alpha for Ridge Regression). Therefore, the validation set is used as an intermediate step. 
* The test set, used for final model evaluation. Book 2 uses the term "validation set" for what we call "test set" in Book 1.
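
To make the three roles concrete, here is a minimal sketch of one way to obtain a training, validation and test set by calling `train_test_split` twice. The data are synthetic and the names `X_demo` and `y_demo` are our own assumptions, not part of the hospital data; the 60/20/20 proportions are only an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real data set: 100 records, 3 features, binary outcome.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# Step 1: hold out 20% of the records as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Step 2: split the remainder into training (75%) and validation (25%) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print("train / validation / test sizes:", len(X_train), len(X_val), len(X_test))
```

The validation set would be used to compare parameter values, and the test set would be touched only once, at the very end.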


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left-hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click on a cell. In the document, each cell can be a "Code" cell or a "text-Markdown" cell. To choose between these two options, use the combo-box above. 
 3. If you want to save your notebook, please make sure you press the "floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, please go to Cell->All Output->Clear


# 2. Load the standardized training and test data, and the hospital data

**Unzip the data inside the folder data/diabetes/CSV_data.zip and place it inside the folder data/diabetes/**

In [None]:
import sys
print(sys.version)
#For this notebook to work, Python must be 3.6.4 or 3.6.5

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
hospital = pd.read_csv('data/diabetes/Data_Class_Dummies.csv', sep=',')

In [None]:
# Sanity Check:
display(hospital.head())
hospital.shape

# 3. Cross Validation
To this point, we have restricted ourselves to the training-test approach for <b>model performance evaluation</b>. As you are aware, the training-test approach involves partitioning the data into two sets, one for training and one for testing. The advantage of the training-test approach is that it is simple to understand and implement. However, it does have some drawbacks:
* Estimates of test error can be highly variable, with dependence on which records are selected for training and testing.
* The performance of a model tends to improve as we provide it with more training data. As we are only using a portion of the available data for training (the rest is used for testing), the model will likely perform worse than if we had trained it using all the data. This means that the measured test accuracy may be an underestimate of the true test accuracy of the model trained using all the data.
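
The first drawback can be seen in a small experiment. The sketch below uses a synthetic data set (not the hospital data) and repeats the training-test split with five different random seeds; `X_demo`, `y_demo` and the choice of model are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative stand-in).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

split_accuracies = []
for seed in range(5):
    # Same data, same model, only the random partition changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
    split_accuracies.append(model.score(X_te, y_te))

print("Test accuracy for each split:", np.round(split_accuracies, 3))
print("Spread (max - min): {:.3f}".format(max(split_accuracies) - min(split_accuracies)))
```

Any spread between the five values comes purely from which records happened to land in the test set.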

Here, we introduce another, related method of model performance evaluation, known as "k-fold cross-validation". In k-fold cross-validation, the original sample is randomly partitioned into k subsamples of (approximately) equal size, called **folds**. One fold is designated as the test set, and the remaining k-1 folds are used as training data. Here is where the method differs from the training-test approach. Previously, we would train a single model, evaluate it on the test set, and then finish. Now we train k models, each with a different test fold, and then average the k performance scores to yield our final performance estimate. It is common to choose k as either 5 or 10.<p>
To illustrate, consider the case of 5-fold cross-validation. Here, we split the data into 5 equally sized folds. Let's call them fold_1, fold_2, fold_3, fold_4, and fold_5. Then, we choose one of the folds as the first test fold, and the remaining 4 folds are the training folds. Let's say we choose fold_1 as the first test fold, so we have fold_2, fold_3, fold_4, and fold_5 for training. Now, train and evaluate the first model using these folds. Let's call this model and its results model_1.<p> Next, we swap the test fold fold_1 with a training fold that has not previously been used for testing. Let's say we swap fold_1 with fold_2. This means we have fold_2 as our new test fold, and we have fold_1, fold_3, fold_4, and fold_5 for training. Now, train and evaluate the second model and call it model_2.<p>
Repeat this process until each of the 5 folds has been used for testing exactly once. We now have 5 models and their associated performance evaluation results (F1 score, accuracy, ...): model_1, model_2, model_3, model_4, and model_5. We then average these results. 
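
The procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data, assuming plain (non-stratified) `KFold` with shuffling; the names `X_demo`, `y_demo` and `fold_scores` are ours.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification problem (illustrative stand-in).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for fold_number, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    # A fresh model is trained on the 4 training folds ...
    model = LogisticRegression(solver='liblinear')
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ... and evaluated on the single held-out test fold.
    score = model.score(X_demo[test_idx], y_demo[test_idx])
    fold_scores.append(score)
    print("model_{}: test fold size = {}, accuracy = {:.3f}".format(
        fold_number, len(test_idx), score))

print("Average accuracy over the 5 folds: {:.3f}".format(np.mean(fold_scores)))
```

Every record appears in a test fold exactly once, so all the data contribute to the final averaged estimate.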

## 3.1. Cross Validation: standard logistic regression with class weights
Here we will demonstrate 10-fold cross-validation for our hospital data, to evaluate the performance of standard logistic regression with class weights. To do so, we use "cross_val_score" (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). This function takes as inputs the model we wish to fit, the features we will use as predictors, and the response variable (readmission). The output is an array of model performance scores, one per fold. <p> 
    **Note** that the cross-validation used is a special kind known as **stratified** k-fold cross-validation. **Stratified** means that each fold keeps approximately the same proportion of YES:NO cases as the original data. 
    
From scikit-learn website: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
See the description of the "cv" parameter:
**For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.**
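
To see what stratification changes, the sketch below counts the YES cases that fall in each test fold under `KFold` and under `StratifiedKFold`. The response is a made-up, deliberately sorted vector with 90 NO (0) and 10 YES (1) cases; it is an assumption for illustration and has nothing to do with the hospital data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced response, sorted by class: 90 NO (0) then 10 YES (1).
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # the feature values are irrelevant for the split itself

# Number of YES cases in each of the 5 test folds, without and with stratification.
plain_counts = [int(y_demo[test_idx].sum())
                for _, test_idx in KFold(n_splits=5).split(X_demo)]
strat_counts = [int(y_demo[test_idx].sum())
                for _, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo)]

print("YES cases per test fold, KFold:          ", plain_counts)
print("YES cases per test fold, StratifiedKFold:", strat_counts)
```

With this sorted toy response, the unshuffled `KFold` places all the YES cases in the last fold, whereas `StratifiedKFold` spreads them evenly so each test fold mirrors the overall 9:1 ratio.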

**Note:**
Be mindful that cross-validation is not a way to build a model, it is only a method to evaluate the *performance* of a given model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

### 3.1.1. Splitting the feature variables from the response

In [None]:
display(hospital.head())
hospital.shape

In [None]:
# Split features
X = hospital.drop('readmission', axis = 1)
display(X.head())
X.shape

In [None]:
y = hospital['readmission'][:]
display(y.head())
y.shape

### 3.1.2. Binarise the response:

In [None]:
# Sanity Checks:
print('******************************************')
#print(y)
print('y - NO values =', sum(i =='NO' for i in y))
print('y - YES values =', sum(i =='YES' for i in y))
print('******************************************\n')

# Create y_binary
y_binary = [0 if x=='NO' else 1 for x in y]


# Sanity Check
print('A few elements of y: ', y[:12].values)
print('Corresponding elements of y_binary: ', y_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y)
print('y_binary - 0 values =', sum(i ==0 for i in y_binary))
print('y_binary - 1 values =', sum(i ==1 for i in y_binary))
print('******************************************')

### 3.1.3. Defining the class weight dictionary, cross-validating the model

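Do you remember last week's assessment? We gave some weights to the classes 'YES' and 'NO' in order to correct for the imbalance in the data set. We are going to do the same here, with: class_weight_dict={0:0.1, 1:0.9}

In this case, we will use the "y_binary" vector. Therefore, the classes in our dictionary will be 0 and 1 (instead of 'NO' and 'YES', which we would have used with "y"="readmission").

The default penalty is the L2 norm (as in Ridge), but feel free to change it to the L1 norm (as in Lasso); depending on your scikit-learn version, the L1 penalty may require solver='liblinear'. In addition, feel free to change the value of $C$. Note that in scikit-learn $C$ is the *inverse* of the regularization strength (roughly $C = 1/\alpha$), so smaller values of $C$ mean stronger regularization.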
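
The effect of $C$ can be checked directly. The following sketch fits the same class-weighted, L2-penalised logistic regression for three values of $C$ on a synthetic imbalanced data set and records the size of the fitted coefficient vector; the data, the values of $C$ and the name `coef_norms` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced problem: roughly 90% class 0 and 10% class 1.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, penalty='l2', solver='liblinear',
                               class_weight={0: 0.1, 1: 0.9})
    model.fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector: smaller means more shrinkage.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C = {:>6}: coefficient norm = {:.3f}".format(C, coef_norms[C]))
```

The coefficients should shrink towards zero as $C$ decreases, which is the opposite direction to the alpha of Ridge regression.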

In [None]:
class_weight_dict={0:0.1, 1:0.9}
Log_Reg = LogisticRegression(C = 0.05, penalty = 'l2', class_weight = class_weight_dict)

In [None]:
# Be patient, it takes some time :-):
from sklearn.model_selection import cross_val_score 
scores = cross_val_score(Log_Reg, X, y_binary, cv = 10)

In [None]:
print("Cross-validation accuracy scores: {}".format(scores))
print("\nAverage cross-validation accuracy: {:.4f}".format(scores.mean()))

### 3.1.5. F1 Score 
The default performance metric used by cross_val_score for a classifier is accuracy. However, we know from previous experience that accuracy is not well suited to this imbalanced problem. Let's use the F1 score instead. We can tell cross_val_score to use this metric by setting: scoring = 'f1'. With this setting, the F1 score is computed for the positive class only (readmission = YES, coded as 1 in y_binary).

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
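
A small worked example shows why accuracy can mislead on imbalanced data. The labels and predictions below are invented for illustration (90 NO and 10 YES cases); they are not results from the hospital data.

```python
import warnings
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical truth: 90 NO (0) followed by 10 YES (1).
y_true = [0] * 90 + [1] * 10

# Classifier A always predicts NO.
y_always_no = [0] * 100
# Classifier B: 80 correct NO, 10 false alarms, 6 YES found, 4 YES missed.
y_some_yes = [0] * 80 + [1] * 10 + [1] * 6 + [0] * 4

with warnings.catch_warnings():
    # scikit-learn may warn that F1 is ill-defined when no YES is predicted.
    warnings.simplefilter("ignore")
    results = {
        "always NO": (accuracy_score(y_true, y_always_no),
                      f1_score(y_true, y_always_no)),
        "finds some YES": (accuracy_score(y_true, y_some_yes),
                           f1_score(y_true, y_some_yes)),
    }

for name, (acc, f1) in results.items():
    print("{:<15} accuracy = {:.2f}, F1 (YES class) = {:.2f}".format(name, acc, f1))
```

The classifier that never predicts a readmission has the higher accuracy (0.90 against 0.86) but an F1 score of 0, while the classifier that finds 6 of the 10 YES cases reaches an F1 score of about 0.46.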

In [None]:
f1_scores = cross_val_score(Log_Reg, X, y_binary, cv = 10, scoring  = 'f1')

In [None]:
print("Cross-validation f1 scores (positive class): {}".format(f1_scores))
print("Average cross-validation f1 score (positive class): {:.4f}".format(f1_scores.mean()))

### 3.1.6. F1 Score - Macro Averaged

Now let's use the unweighted average of the per-class F1 scores. We can tell cross_val_score to use this metric by setting: scoring = 'f1_macro'.

**Note:** scoring = 'f1_macro' corresponds to calling f1_score with average='macro': the F1 score is computed separately for the NO class and the YES class, and the two values are then averaged without weighting by class size. This metric is useful, as it ensures that both classes are taken into consideration. We will use this averaged F1 score.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
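
The relationship between the per-class F1 scores, the binary F1 score and the macro average can be verified on a toy example. The labels and predictions are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical truth and predictions: 90 NO (0) and 10 YES (1) cases.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 80 + [1] * 10 + [1] * 6 + [0] * 4

# average=None returns one F1 score per class, in label order [NO, YES].
per_class_f1 = f1_score(y_true, y_pred, average=None)
# The default (average='binary') reports the positive class (label 1) only.
binary_f1 = f1_score(y_true, y_pred)
# average='macro' is the plain mean of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average='macro')

print("F1 per class [NO, YES]:", np.round(per_class_f1, 4))
print("Binary F1 (YES only):   {:.4f}".format(binary_f1))
print("Macro F1:               {:.4f}".format(macro_f1))
print("Mean of per-class F1:   {:.4f}".format(per_class_f1.mean()))
```

The last two printed values coincide, and the binary F1 score equals the second entry of the per-class array.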

In [None]:
f1_scores = cross_val_score(Log_Reg, X, y_binary, cv = 10, scoring  = 'f1_macro')

In [None]:
print("Cross-validation macro-averaged f1 scores: {}".format(f1_scores))
print("Average cross-validation macro-averaged f1 score: {:.4f}".format(f1_scores.mean()))

<div class="alert alert-block alert-success">**Start Activity 1**</div>

### <font color='blue'> Question 1:  Describe all the steps that we followed in section 3.1. Be concise. </font>

<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)

Step 1:

Step 2:
...
#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 1**</div>

## 3.2. Cross Validation: logistic regression with lasso regularization
Here we will demonstrate 10-fold cross-validation for our hospital data, to evaluate the performance of logistic regression using L1 regularization. Again, we will use "cross_val_score".<p>
    However, there is a difficulty in this problem. Recall that for regularization to be meaningful, we require the features to be standardised, and that we must fit the standardiser on the *training set only*, and then apply it to the test set. With cross-validation, the training and test sets change at each iteration. This means we must refit the scaler on the new training set in each iteration. We will achieve this by using a **pipeline**.<p> 
        **Pipelines** allow us to sequentially apply a list of transforms followed by a final estimator. In our case, we wish to apply the standardisation transformation, and then the logistic regression estimator.<p>
            If we pass a pipeline to cross_val_score, then the StandardScaler will estimate the parameters for centering and rescaling to unit variance on the training folds only. When evaluating the pipeline on the test fold, the StandardScaler will use the stored means and standard deviations: it subtracts the training mean from the test set and divides the result by the training standard deviation. So even in the pipeline, the StandardScaler will not use the test set in any way to determine the mean and variance of the data.
  

Read about Pipelines in Chapter 6 of Book 1 "Building Pipelines". <p>
You can read more about Pipelines in:
    http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
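
The claim that the scaler inside a pipeline only ever sees the training data can be checked on a single training-test split. This is a minimal sketch on synthetic data; `demo_pipe`, `X_demo` and `y_demo` are our own names, and the step names simply mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, shifted and rescaled so that standardisation matters.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo = X_demo * 10 + 50
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_pipe = Pipeline([
    ('Transform', StandardScaler()),
    ('Estimator', LogisticRegression(C=0.05, penalty='l1', solver='liblinear')),
])
# Fitting the pipeline fits the scaler and then the model, on the training data only.
demo_pipe.fit(X_tr, y_tr)

fitted_scaler = demo_pipe.named_steps['Transform']
print("Scaler means equal training-set means:",
      np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("Scaler means equal full-data means:   ",
      np.allclose(fitted_scaler.mean_, X_demo.mean(axis=0)))

# Scoring reuses the stored training means and standard deviations on the test data.
test_accuracy = demo_pipe.score(X_te, y_te)
print("Test accuracy: {:.3f}".format(test_accuracy))
```

Inside cross_val_score, this fit-on-training, apply-to-test sequence is repeated once for every fold.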

In [None]:
from sklearn.pipeline import Pipeline 

### 3.2.1. Define our pipeline and then use cross validation

In [None]:
Scaler = StandardScaler()
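# solver='liblinear' supports the L1 penalty (the default solver in recent scikit-learn versions does not)
Log_Reg = LogisticRegression(C = 0.05, penalty = 'l1', solver = 'liblinear')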

pipe = Pipeline([('Transform', Scaler), ('Estimator', Log_Reg)])

In [None]:
scores = cross_val_score(pipe, X, y, cv = 10)

In [None]:
print("Cross-validation accuracy scores:\n {}".format(scores))
print("Average cross-validation score: {:.4f}".format(scores.mean()))

### 3.2.2. F1 Score
As before, let's use the f1 score. Remember, when using f1 score, we need to use the binarised version of the response variable.

<div class="alert alert-block alert-success">**Start Activity 2**</div>

### <font color='blue'> Question 2a:  Write the code that computes the 10-fold cross-validated F1 scores using the pipeline ("pipe")  </font>

Documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [None]:
# Write Python code here:


### <font color='blue'> Question 2b:  Print the 10 values of the F1-score here. Any comments? </font>

In [None]:
# Write Python code here:


<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 2**</div>

<div class="alert alert-block alert-success">**Start Activity 3**</div>

### <font color='blue'> Question 3:  Describe all the steps that we followed in section 3.2. Be concise. </font>

<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)

Step 1:

Step 2:
...
#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 3**</div>