![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 3: Resampling Methods
# Exercise 1: Bootstrapping (Oversampling the Minority Class) and Cross-Validation


# 1. Introduction



## 1.1. Aims of the Exercise:
 1. To become familiar with Bootstrapping
 2. To become familiar with Cross-validation

 
It aligns with all the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left-hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click in each cell. In this document, each cell can be a "Code" cell or a "text-Markdown" cell. To choose between these two options, go to the combo-box above. 
 3. If you want to save your notebook, please make sure you press "the floppy disk" icon button above. 
 4. To clear the content of all cells and restart the notebook, go to Cell->All Output->Clear


# 2. Load the standardized training and test data, and the hospital data.

Unzip the file data/diabetes/CSV_data.zip and place its contents inside the folder data/diabetes/

In [None]:
import sys
print(sys.version)
#For this notebook to work, Python must be 3.6.4 or 3.6.5

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
hospital = pd.read_csv('data/diabetes/Data_Class_Dummies.csv', sep=',')
train_standardized_data = pd.read_csv('data/diabetes/train_standardized_data.csv', sep=',')
test_standardized_data = pd.read_csv('data/diabetes/test_standardized_data.csv', sep=',')

In [None]:
# Sanity Check:
display(hospital[:][:5])
hospital.shape

In [None]:
# Sanity Check:
display(train_standardized_data[:][:5])

In [None]:
# Sanity Check:
display(test_standardized_data[:][:5])

## 2.1. Split the training and test data into features and response.

In [None]:
X_train_standardized = train_standardized_data.drop(['readmission'], axis = 1)
y_train = train_standardized_data[['readmission']].values

In [None]:
X_test_standardized = test_standardized_data.drop(['readmission'], axis = 1)
y_test = test_standardized_data[['readmission']].values

In [None]:
print(X_train_standardized.shape)
print(X_test_standardized.shape)

## 2.2. Binarise response
We will be using the F1 score at various points in this exercise. So, let's create a binary version of the training and test response vectors we have created.

* **Training response:**

In [None]:
# Sanity Checks:
print('******************************************')
#print(y_train)
print('y_train - NO values =', sum(i =='NO' for i in y_train))
print('y_train - YES values =', sum(i =='YES' for i in y_train))
print('******************************************\n')

# Create y_train_binary
y_train_binary = [0 if x=='NO' else 1 for x in y_train]


# Sanity Check
print('A few elements of y_train: ', y_train[:12].ravel())
print('Corresponding elements of y_train_binary: ', y_train_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y_train)
print('y_train_binary - 0 values =', sum(i ==0 for i in y_train_binary))
print('y_train_binary - 1 values =', sum(i ==1 for i in y_train_binary))
print('******************************************')

* **Test response:**

In [None]:
# Sanity Checks:
print('******************************************')
#print(y_test)
print('y_test - NO values =', sum(i =='NO' for i in y_test))
print('y_test - YES values =', sum(i =='YES' for i in y_test))
print('******************************************\n')

# Create y_test_binary
y_test_binary = [0 if x=='NO' else 1 for x in y_test]


# Sanity Check
print('A few elements of y_test: ', y_test[:12].ravel())
print('Corresponding elements of y_test_binary: ', y_test_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y_test)
print('y_test_binary - 0 values =', sum(i ==0 for i in y_test_binary))
print('y_test_binary - 1 values =', sum(i ==1 for i in y_test_binary))
print('******************************************')

**Alternative fast method of finding unique values and counts within a numpy array:**

In [None]:
print(np.unique(y_test, return_counts = True))

In [None]:
print(np.unique(y_test_binary, return_counts = True))

# 3. Bootstrapping
Bootstrapping is a broad term referring to any statistical method that utilises **random sampling with replacement**.<p>
**Bootstrap Sample:** Say we have *n* data points in our sample. A bootstrap sample of this data set is generated by drawing a data point with replacement exactly *n* times. The result is a sample of the same size as the original, but with duplicates. How many duplicates? Well, on *average* approximately 1/3 of the original data points will be excluded from our bootstrap sample. That is, approximately 2/3 of the original data will be included.<p>
    Proof: Suppose the original data contains n observations. The probability that a particular observation is not chosen in a single draw is $1 - {1\over n}$, so the probability that it is not chosen in any of the n draws is $(1 - {1\over n})^n$. This is the probability that the observation does not appear in a bootstrap sample. You may recall that Euler's number, $e$, is defined as $e := \lim_{n\to\infty}{(1 + {1\over n})^n}$; by the same argument, $\lim_{n\to\infty}{(1 - {1\over n})^n} = e^{-1}$. Hence, for a relatively large sample size, the probability that the observation does not appear in a bootstrap sample is approximately ${1\over e} \approx 0.37$, which is roughly $1\over 3$. 
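
The limit in the proof can be checked numerically. The following is a minimal sketch, assuming a hypothetical sample size of 10,000 and an arbitrary seed; it draws bootstrap index samples and measures the fraction of original records that never appear.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 10_000                       # hypothetical sample size
B = 200                          # number of bootstrap samples to simulate

excluded = []
for _ in range(B):
    # Draw n row indices with replacement: one bootstrap sample
    idx = rng.integers(0, n, size=n)
    # Fraction of the original rows that never appear in this sample
    excluded.append(1 - np.unique(idx).size / n)

excluded_fraction = float(np.mean(excluded))
print('Simulated fraction excluded:', round(excluded_fraction, 4))
print('(1 - 1/n)^n                :', round((1 - 1 / n) ** n, 4))
print('1/e                        :', round(1 / np.e, 4))
```

The three printed values should agree closely, which is why "about one third excluded" is a reasonable rule of thumb.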

<p> **Example:** A classic example of bootstrapping is to determine the variance of a test statistic (such as the mean). The method involves first obtaining a dataset of size $n$. Then, for some large number $B$, generate $B$ new datasets of size $n$ by repeatedly sampling with replacement from the original dataset. That is, we generate $B$ bootstrap samples. Then, for each of these $B$ bootstrap samples, we compute the test statistic (for example, the mean). This will give us an empirical sampling distribution of the test statistic, which provides us with the variance. Crucially, this method gives us the variance without any need of a formula.
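
The example above can be sketched in a few lines. This is an illustration on simulated data (the values, seed and choice of $B$ are assumptions, not part of the hospital data set); it estimates the variance of the sample mean by bootstrapping and compares it with the textbook formula $s^2/n$.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
data = rng.normal(loc=50, scale=10, size=200)    # hypothetical observed sample
n = data.size
B = 2000                                         # number of bootstrap samples

boot_means = np.empty(B)
for b in range(B):
    # Resample n values with replacement and record the statistic of interest
    sample = rng.choice(data, size=n, replace=True)
    boot_means[b] = sample.mean()

# Empirical variance of the bootstrap distribution of the mean
boot_var = boot_means.var(ddof=1)
# Analytical estimate of the variance of the mean, for comparison
formula_var = data.var(ddof=1) / n

print('Bootstrap variance of the mean:', boot_var)
print('Formula s^2 / n               :', formula_var)
```

For the mean the formula is available, so the two numbers can be compared; for statistics such as the median the same loop works unchanged even though no simple formula exists.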

# 4. Oversampling minority class, readmission = YES.

Here, our goal is to address the imbalanced response problem by evening the class label distribution. In this first method, we will <b>'oversample the minority class'</b>. This means we will generate extra data points corresponding to the 'YES' class label. The method for doing so is outlined in [this website](https://blog.dominodatalab.com/imbalanced-datasets/).<p>
    The name of the method is Synthetic Minority Oversampling TEchnique (SMOTE). In short, SMOTE involves:
* Randomly sample a data point from the minority group (readmission = YES, in our case).
* For some choice of k, compute the k nearest neighbours of this point.
* Add new synthetic points somewhere on the line segments between the chosen point and its k nearest neighbours.
* Repeat process until the classes are even.

As you can see, the SMOTE method combines bootstrapping and k-nearest neighbour to synthetically create additional observations of the minority class, in this case readmission = 'YES'. However, it does so in such a way to ensure that each new synthetic child sample is never an exact duplicate of its parents.<p>
![alt text](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/smote.png 'SMOTE Visualisation')
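
To make the interpolation step concrete, here is a simplified sketch of the idea in plain NumPy. It is not the imbalanced-learn implementation; the toy minority points, the value of k and the number of synthetic points are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
X_min = rng.normal(size=(20, 2))        # hypothetical minority-class points
k = 5                                   # number of nearest neighbours
n_new = 30                              # number of synthetic points to create

# Pairwise Euclidean distances between minority points
diffs = X_min[:, None, :] - X_min[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))
np.fill_diagonal(dists, np.inf)         # a point is not its own neighbour
neighbours = np.argsort(dists, axis=1)[:, :k]

synthetic = np.empty((n_new, X_min.shape[1]))
for i in range(n_new):
    p = rng.integers(0, len(X_min))             # randomly chosen minority point
    q = neighbours[p, rng.integers(0, k)]       # one of its k nearest neighbours
    gap = rng.random()                          # position along the segment, in [0, 1)
    synthetic[i] = X_min[p] + gap * (X_min[q] - X_min[p])

print(synthetic[:3])
```

Each synthetic point lies on the segment between two real minority points, so it stays inside the region already occupied by the minority class.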


## 4.1. Resample the minority class

Load the package 'imbalanced-learn'. This contains several useful resampling functions. May require installation of 'msgpack'.<p>
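    The function we will use is from imblearn.over_sampling, and is called SMOTE. Read more about its details and parameter choices [here](http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html). Note that in recent releases of imbalanced-learn the `ratio` argument has been renamed `sampling_strategy` and the `fit_sample()` method has been renamed `fit_resample()`; if the names used in this exercise raise an error, try the newer ones.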

In [None]:
!pip install msgpack
# Alternative: type 'pip install msgpack' in Anaconda prompt (Windows) or the command line (Mac/Linux)

# Install 'imbalanced-learn'
!pip install -U imbalanced-learn



In [None]:
from imblearn.over_sampling import SMOTE
import warnings; warnings.simplefilter('ignore')

If you obtain a message that says "ModuleNotFoundError: No module named 'imblearn'", try this:
conda install -c glemaitre imbalanced-learn (run the line below)

In [None]:
#!conda install -c glemaitre imbalanced-learn

<div class="alert alert-block alert-success">**Start Activity 1**</div>

### <font color='blue'> Question 1a:  Write the SMOTE() function. Leave all arguments as default except random_state=0 and ratio. Which option would you choose for ratio?  </font>

Documentation:
http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

In [None]:
# Write Python code here
# We will name our model 'smote'

smote = SMOTE()


### <font color='blue'> Question 1b:  What does the SMOTE() function do?  </font>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

### <font color='blue'> Question 1c:  Write the fit_sample() function.   </font>

Documentation:
http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

In [None]:
# Write Python code here

X_train_standardized_smote, y_train_smote = 


### <font color='blue'> Question 1d:  What does the fit_sample() function do?  </font>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 1**</div>

In [None]:
print(train_standardized_data['readmission'].value_counts())

In [None]:
# For even ratio, need to add the following number of records to the YES class:
# Index by class label rather than by position, which is ambiguous for a label-indexed Series
class_counts = train_standardized_data['readmission'].value_counts()
print(class_counts['NO'] - class_counts['YES'])

In [None]:
print(np.unique(y_train_smote, return_counts = True))

As we can see, after performing the SMOTE algorithm, the NO and YES classes now contain the same number of cases, 47,769 each.

## 4.2. Train logistic model

<div class="alert alert-block alert-success">**Start Activity 2**</div>

### <font color='blue'> Question 2a: Fit a logistic regression model with L1-norm regularization (Lasso)  </font>

In [None]:
# Type Python code here
from ...

Log_Reg = ...

### <font color='blue'> Question 2b: Print the beta coefficients  </font>

In [None]:
# Type Python code here

# Beta Coefficients



### <font color='blue'> Question 2c: Show the name of the columns whose beta coefficients are different from zero  </font>
<p><font color='green'> Tip: You can find the code in Chapter2 - Exercise 03 </font></p>

In [None]:
# Type Python code here



<div class="alert alert-block alert-warning">**End Activity 2**</div>

In [None]:
# Predictions 
from sklearn import metrics
y_pred= Log_Reg.predict(X_test_standardized)

# Use score method to get accuracy of model
score = Log_Reg.score(X_test_standardized, y_test_binary)
print('Accuracy: {}'.format(score))

## 4.3. Evaluating the model using F1 Score
We will use the average F1 score between YES and NO.

F1 scores need 'YES' to be 1 and 'NO' to be 0.

![alt text](images/F1score.PNG)

<div class="alert alert-block alert-success">**Start Activity 3**</div>

### <font color='blue'> Question 3: Calculate the F1-score  </font>

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [None]:
#Type Python code here:


<div class="alert alert-block alert-warning">**End Activity 3**</div>

The <b>classification_report</b> function produces one line per class (here, "YES" and
"NO", or "1" and "0") and returns precision, recall, and f1-score with this class as the positive class.

Before, we assumed the minority “1” class was the positive class. If we change the
positive class to “0,” we can see from the output of <b>classification_report</b>
that we obtain an f1-score of 0.75.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test_binary, y_pred))

Let's check the macro-averaged f1 score. The macro-averaged f1-score is the unweighted average of the f1-scores of the two classes:

In [None]:
metrics.f1_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro') 
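
To see what the macro average does, the calculation can be reproduced by hand. The following sketch uses a small set of made-up labels and predictions (the `_toy` names are hypothetical and unrelated to the hospital data): it computes the f1-score with each class in turn treated as the positive class, then takes the plain mean.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

def f1_for_class(y_true, y_hat, label):
    """f1-score when `label` is treated as the positive class."""
    tp = np.sum((y_hat == label) & (y_true == label))
    fp = np.sum((y_hat == label) & (y_true != label))
    fn = np.sum((y_hat != label) & (y_true == label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1_no = f1_for_class(y_true_toy, y_pred_toy, 0)
f1_yes = f1_for_class(y_true_toy, y_pred_toy, 1)
macro_f1 = (f1_no + f1_yes) / 2    # unweighted mean over the two classes

print('f1 (NO) :', round(f1_no, 4))
print('f1 (YES):', round(f1_yes, 4))
print('macro f1:', round(macro_f1, 4))
```

Because each class contributes equally to the mean, a poor score on the minority class pulls the macro average down regardless of how few minority records there are.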

## 4.4. Receiver Operating Characteristic (ROC): TPR and FPR

### 4.4.1. Probability associated with each prediction
We need to determine the probability of each record in the test set being a 'YES', or equivalently a 1 as we have converted the response into a binary variable.

In [None]:
# Predicted probability that each test record belongs to class 1 (YES)
y_pred_proba = Log_Reg.predict_proba(X_test_standardized)[:,1]

print(y_pred_proba[:5])
print(y_pred[:5])

### 4.4.2. Determining the fpr, tpr at each threshold value
Now that we have the probabilities associated with each prediction, we know exactly which records are predicted YES and NO for each choice of decision threshold. Hence, we can determine the false positive rate (fpr) and true positive rate (tpr) for each threshold value.

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test_binary, y_pred_proba)
print(fpr[:5])
print(tpr[:5])
print(thresholds[:5])
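
The relationship between a threshold and its (fpr, tpr) pair can be sketched by hand. The labels and probabilities below are made up for illustration, and `rates_at` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical labels and predicted probabilities of class 1
y_toy = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

def rates_at(threshold, y, p):
    """Return (fpr, tpr) when records with p >= threshold are predicted as 1."""
    pred = (p >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tpr = tp / np.sum(y == 1)    # share of actual positives that were caught
    fpr = fp / np.sum(y == 0)    # share of actual negatives wrongly flagged
    return fpr, tpr

for t in (0.3, 0.5, 0.7):
    fpr_t, tpr_t = rates_at(t, y_toy, p_toy)
    print('threshold = {}: fpr = {}, tpr = {}'.format(t, fpr_t, tpr_t))
```

Lowering the threshold flags more records as YES, so both rates rise together; the ROC curve traces these pairs over all thresholds.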

### 4.4.3. Plotting The ROC Curve

In [None]:
df = pd.DataFrame()
df['fpr'] = fpr
df['tpr'] = tpr
# Sanity check 
display(df[:][:5])


In [None]:
fpr, tpr,_= metrics.roc_curve(y_true = y_test_binary, y_score = y_pred_proba)

from plotnine import *

p = ggplot(mapping = aes(x = 'fpr', y = 'tpr'), data = df)
p += geom_line(color = 'red')
p += geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'blue')
p += labs(title = 'ROC Curve', x = 'fpr', y = 'tpr')
p += theme_bw()

print(p)

### 4.4.4. Area under the ROC curve (AUC)
Note that AUC = 0.5 corresponds to random assignment.

In [None]:
print(metrics.roc_auc_score(y_true = y_test_binary, y_score = y_pred_proba))

## 4.5. Computing optimal threshold

In [None]:
# index of pair that maximises tpr - fpr
ind_max = np.argmax(tpr - fpr)
print(ind_max)

In [None]:
# threshold value that maximises the tpr - fpr
optimal_thresh = thresholds[ind_max]
print(optimal_thresh)
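
Once a threshold has been chosen, it can replace the default cut-off of 0.5 when turning probabilities into class labels. The sketch below repeats the "maximise tpr - fpr" search on made-up data and then applies the result; all `_toy` names and values are assumptions for illustration.

```python
import numpy as np

# Hypothetical labels and predicted probabilities of class 1
y_toy = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

candidates = np.unique(p_toy)       # every distinct probability is a candidate threshold
j_values = []
for t in candidates:
    pred = p_toy >= t
    tpr_t = np.mean(pred[y_toy == 1])   # fraction of positives predicted positive
    fpr_t = np.mean(pred[y_toy == 0])   # fraction of negatives predicted positive
    j_values.append(tpr_t - fpr_t)

best_t = candidates[int(np.argmax(j_values))]
# Classify using the chosen threshold instead of the default 0.5
y_pred_at_best = (p_toy >= best_t).astype(int)

print('Best threshold:', best_t)
print('Predictions   :', y_pred_at_best)
```

With the notebook's own variables, the equivalent step would be comparing `y_pred_proba` against `optimal_thresh`.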

## 4.6. Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion = metrics.confusion_matrix(y_test_binary, y_pred)
print("Confusion matrix:\n{}".format(confusion))


# 5. Performance Metrics
Here, we recap some performance metrics and introduce functions we can use to evaluate the model.

## 5.1. classification_report
This provides a summary of the precision, recall, f1-score and support. 

<div class="alert alert-block alert-success">**Start Activity 4**</div>

### <font color='blue'> Question 4a:  Write the function that calculates the classification_report()  </font>

Documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [None]:
# Write Python code here:


### <font color='blue'> Question 4b:  Define precision and recall  </font>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 4**</div>

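The average reported is a prevalence-weighted average, meaning that the majority class has a larger influence than the minority class. This is not useful for us, and in the version of scikit-learn this notebook was written for there is no parameter to change it. Also, the table is only in text form, meaning we can't access the individual elements as if it were an array. This would make further analysis tedious, as we'd have to enter the numbers manually for everything. (Newer scikit-learn releases also print a macro average and accept output_dict=True, which returns the report as a dictionary.)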

## 5.2. Confusion Matrix

![alt text](images/Confusion_Matrix.PNG)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_true = y_test_binary, y_pred = y_pred)
print("Confusion matrix:\n{}".format(cm))
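
The layout of the printed matrix is easy to misread, so here is a small sketch that builds a confusion matrix by hand on made-up labels, using the same convention as scikit-learn (rows are true classes, columns are predicted classes). The `_toy` names are hypothetical.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Entry [i, j] counts records whose true class is i and predicted class is j
cm_toy = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true_toy, y_pred_toy):
    cm_toy[t, p] += 1

# Reading the matrix row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print('TN = {}, FP = {}, FN = {}, TP = {}'.format(tn, fp, fn, tp))
```

With class 1 (YES) as the positive class, the top-left entry is TN and the bottom-right entry is TP.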

## 5.3. Balanced Accuracy
Balanced accuracy, $\phi$, is defined as the arithmetic mean of the class-specific accuracies:
$$ \phi := {1\over2}(\pi^+ + \pi^-) ,$$
where $\pi^+ = {TP\over TP+FN}$ is the accuracy on the positive class (readmission = YES), that is, the proportion of actual YES cases that are predicted correctly, and $\pi^- = {TN\over TN+FP}$ is the accuracy on the negative class (readmission = NO). If the classifier performs equally well for both classes, then balanced accuracy reduces to regular accuracy. However, balanced accuracy penalises classifiers that perform differently for each class.<p>
    Now, the way we will calculate balanced accuracy in Python is via the confusion matrix. In the matrix returned by scikit-learn, rows correspond to the true class and columns to the predicted class:
* TN is the first entry of the first column. FN is the second entry of the first column.
* FP is the first entry of the second column. TP is the second entry of the second column.

In [None]:
print(np.unique(y_pred, return_counts = True))

In [None]:
cm = confusion_matrix(y_true = y_test_binary, y_pred = y_pred)
print("Confusion matrix:\n{}".format(cm))

In [None]:
# Accuracy of YES: TP / (TP + FN)
acc_pos = cm[1][1]/(cm[1][1] + cm[1][0])
print(acc_pos)

In [None]:
# Accuracy of NO: TN / (TN + FP)
acc_neg = cm[0][0]/(cm[0][0] + cm[0][1])
print(acc_neg)

In [None]:
# Balanced Accuracy
BACC = (acc_pos + acc_neg)/2
print(BACC)

## 5.4. Precision
pos_label specifies which class label we wish to calculate precision for.<p>
average = 'macro' indicates that we wish to compute the unweighted mean for all classes. 


In [None]:
from sklearn.metrics import precision_score

In [None]:
print(precision_score(y_true = y_test_binary, y_pred = y_pred, pos_label = 1))

In [None]:
print(precision_score(y_true = y_test_binary, y_pred = y_pred, pos_label = 0))

In [None]:
print(precision_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro'))

## 5.5. Recall

In [None]:
from sklearn.metrics import recall_score

In [None]:
print(recall_score(y_true = y_test_binary, y_pred = y_pred, pos_label = 1))

In [None]:
print(recall_score(y_true = y_test_binary, y_pred = y_pred, pos_label = 0))

In [None]:
print(recall_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro'))

## 5.6. precision_recall_fscore_support
This is a convenient way of computing the precision, recall, fscore and support. The output is a tuple of arrays, the first array containing the precision of NO and YES, the second containing the recall of NO and YES, and the third containing the f1 score of NO and YES. The fourth array is the support of NO and YES, which is simply the number of occurrences of each class label.

In [None]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
some_metrics = precision_recall_fscore_support(y_true = y_test_binary, y_pred = y_pred)
some_metrics

In [None]:
# Precision of NO
some_metrics[0][0]

In [None]:
# Recall of YES
some_metrics[1][1]

By setting average='macro', we obtain an un-weighted mean of precision, recall, and fscore:

In [None]:
average_metrics = precision_recall_fscore_support(y_true = y_test_binary, y_pred = y_pred, average = 'macro')
print(average_metrics)

In [None]:
print('Average Precision: {}'.format(average_metrics[0]))
print('Average Recall: {}'.format(average_metrics[1]))
print('Average f1 score: {}'.format(average_metrics[2]))

## 5.7. F1 Score - Unweighted mean
If we set average='macro', we obtain the unweighted mean of f1 score between the NO and YES class. This metric is useful, as it ensures that both classes are taken into consideration. We will use this averaged F1 score.

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro')

# 6. Cross Validation
To this point, we have restricted ourselves to the training-test approach for <b>model performance evaluation</b>. As you are aware, the training-test approach involves partitioning the data into two sets, one for training and one for testing. The advantage of the training-test approach is that it is simple to understand and implement. However, it does have some drawbacks:
* Estimates of test error can be highly variable, depending on which records are selected for training and testing.
* The performance of a model tends to improve as we provide it with more training data. As we are only using a portion of the data available to us for training the model (the rest we are using for testing), the model will likely perform worse than if we had trained it using all the data. This means that the test accuracy may be an underestimate of the true test accuracy for the model trained using all the data.

Here, we will introduce to you another, related method of model performance evaluation, known as "k-fold cross-validation". In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples, called **folds**. Of the k folds, a single fold is designated as the test set, and the remaining k-1 folds are used as training data. Now, here is where the method differs from the training-test approach. Previously, we would train a single model on k-1 folds, evaluate on the test fold, and then finish. Instead, now we train k models, each with a different test fold, and then average the performance to yield our final accuracy. It is common to choose k as either 5 or 10.<p>
        To illustrate, consider the case of 5-fold cross-validation. Here, we split the data into 5 equally sized folds. Let's call them fold_1, fold_2, fold_3, fold_4, and fold_5. Then, choose one of the folds as the first test fold, and the remaining 4 folds are the training folds. Let's say we choose fold_1 for the first test fold, so we have fold_2, fold_3, fold_4, and fold_5 for training. Now, train and evaluate the first model using these folds. Let's call this model and its results model_1.<p> Now, we swap the test fold fold_1 with a training fold that has not previously been used for testing. Let's say we swap fold_1 with fold_2. This means we have fold_2 as our new test fold, and we have fold_1, fold_3, fold_4, and fold_5 for training. Now, train and evaluate the second model and call it model_2.<p>
            Repeat this process until we have used all 5 of our folds for testing exactly once. We would now have 5 models and their associated performance evaluation results (either F1 score, accuracy,...), model_1, model_2, model_3, model_4, and model_5. We then average these results. 
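
The fold rotation described above can be sketched with NumPy alone. This is an illustration of the bookkeeping only (no model is trained); the number of records and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
n_records = 23                      # hypothetical number of records
k = 5

# Shuffle the row indices, then cut them into k roughly equal folds
shuffled = rng.permutation(n_records)
folds = np.array_split(shuffled, k)

splits = []
for i, test_idx in enumerate(folds, start=1):
    # Every fold except the current one is used for training
    train_idx = np.concatenate([f for j, f in enumerate(folds, start=1) if j != i])
    splits.append((train_idx, test_idx))
    print('model_{}: {} training rows, {} test rows'.format(i, train_idx.size, test_idx.size))
```

Each record lands in exactly one test fold, so every record is used for testing once and for training k-1 times. In practice, a model would be fitted on `train_idx` and scored on `test_idx` inside the loop, and the k scores averaged.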

## 6.1. Cross Validation: standard logistic regression with class weights
Here we will demonstrate 10-fold cross-validation for our hospital data, to evaluate the performance of standard logistic regression with class weights. To do so, we use "cross_val_score" (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). This method takes as inputs the model we wish to fit, the features we will use as predictors, and the response variable (readmission). The output is an array of model performance scores. <p> 
    **Note** that the cross validation used is a special kind known as **stratified** k-fold cross-validation. Stratified means that each fold maintains the proportion of YES:NO cases as in the original data. 
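
To illustrate what stratification does, the following sketch splits a made-up imbalanced response with scikit-learn's `StratifiedKFold` and prints the proportion of YES cases in each test fold. The toy data (80 NO, 20 YES) and the seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced response: 80 NO (0) and 20 YES (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))   # placeholder features; only the labels drive the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_yes_fractions = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Proportion of YES cases in this test fold
    fold_yes_fractions.append(float(y_toy[test_idx].mean()))

print('Overall YES fraction :', y_toy.mean())
print('YES fraction per fold:', fold_yes_fractions)
```

Every fold keeps the overall 20% share of YES cases, which matters for imbalanced data: a purely random split could leave a fold with very few minority records.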
    
    

**Note:**
Be mindful that cross-validation is not a way to build a model, it is only a method to evaluate the *performance* of a given model.

In [None]:
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

### 6.1.1. Splitting the feature variables from the response

In [None]:
display(hospital[:][:5])
hospital.shape

In [None]:
# Split features
X = hospital.drop('readmission', axis = 1)
display(X[:][:5])
X.shape

In [None]:
y = hospital['readmission'][:]
display(y[:][:5])
y.shape

### 6.1.2. Binarise the response:

In [None]:
# Sanity Checks:
print('******************************************')
#print(y)
print('y - NO values =', sum(i =='NO' for i in y))
print('y - YES values =', sum(i =='YES' for i in y))
print('******************************************\n')

# Create y_binary
y_binary = [0 if x=='NO' else 1 for x in y]


# Sanity Check
print('A few elements of y: ', y[:12].values)
print('Corresponding elements of y_binary: ', y_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y)
print('y_binary - 0 values =', sum(i ==0 for i in y_binary))
print('y_binary - 1 values =', sum(i ==1 for i in y_binary))
print('******************************************')

### 6.1.3. Defining the class weight dictionary, cross-validating the model

In [None]:
class_weight_dict={0:0.1, 1:0.9}
Log_Reg = LogisticRegression(C = 1e50, penalty = 'l2', class_weight = class_weight_dict)

# Be patient, it takes some time :-):
scores = cross_val_score(Log_Reg, X, y_binary, cv = 10)

In [None]:
print("Cross-validation accuracy scores: {}".format(scores))
print("\n Average cross-validation accuracy scores: {:.4f}".format(scores.mean()))

### 6.1.4. F1 Score - Macro Averaged
The default performance metric used by cross_val_score is accuracy. However, we know from previous experience that accuracy is not particularly suited to this problem. Let's use an unweighted average of f1 score. We can tell cross_val_score to use this metric by setting: scoring = 'f1_macro'.

In [None]:
f1_scores = cross_val_score(Log_Reg, X, y_binary, cv = 10, scoring  = 'f1_macro')

In [None]:
print("Cross-validation unweighted mean f1 scores: {}".format(f1_scores))
print("Average cross-validation unweighted mean f1 score: {:.4f}".format(f1_scores.mean()))

## 6.2. Cross Validation: logistic regression with lasso regularization
Here we will demonstrate 10-fold cross-validation for our hospital data, to evaluate the performance of logistic regression using L1 regularization. Again, we will use "cross_val_score".<p>
    However, there is a difficulty in this problem. Recall that for regularization to be meaningful, we require the features to be standardised, and that we must fit the standardiser on the *training set only*, and then apply it to the test set. With cross-validation, the training and test sets change at each iteration. This means we must fit the scaler for each new training set in each iteration. We will achieve this by using a **pipeline**.<p> 
        **Pipelines** allow us to sequentially perform a list of transforms and a final estimator. In our case, we wish to apply the standardisation transformation, and then the logistic regression estimator.<p>
            If we use a pipeline to make a cross-validated estimator using cross_val_score, then the StandardScaler will estimate the parameters for centering and rescaling to unit variance only on the training folds. When evaluating the pipeline on the test fold, the StandardScaler will use the stored means and standard deviations and subtract the train mean from the test set and divide the result by the train standard deviation. So even in the pipeline, the StandardScaler will not use the test set in any way to determine mean and variance of the data.
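
The behaviour described above can be checked on synthetic data. The sketch below is an illustration under stated assumptions (data from `make_classification`, 5 folds, fixed seeds, `solver='liblinear'` for the L1 penalty): it compares `cross_val_score` on a pipeline with a manual loop that fits the scaler on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the hospital features and response
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def make_model():
    # Fixed random_state so that both approaches fit identical models
    return LogisticRegression(C=0.05, penalty='l1', solver='liblinear', random_state=0)

# Approach 1: let the pipeline handle scaling inside each fold
pipe_toy = Pipeline([('Transform', StandardScaler()), ('Estimator', make_model())])
pipe_scores = cross_val_score(pipe_toy, X_toy, y_toy, cv=cv)

# Approach 2: the same steps written out by hand
manual_scores = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    scaler = StandardScaler().fit(X_toy[train_idx])        # statistics from training folds only
    model = make_model().fit(scaler.transform(X_toy[train_idx]), y_toy[train_idx])
    # The test fold is transformed with the training means and standard deviations
    manual_scores.append(model.score(scaler.transform(X_toy[test_idx]), y_toy[test_idx]))

print('Pipeline scores:', pipe_scores)
print('Manual scores  :', np.array(manual_scores))
```

The two sets of scores should match, showing that the pipeline is a compact way of writing the leak-free loop rather than a different procedure.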
  

In [None]:
from sklearn.pipeline import Pipeline 

### 6.2.1. Define our pipeline, and then use cross validation

In [None]:
Scaler = StandardScaler()
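# solver = 'liblinear' supports the L1 penalty; the default solver in newer scikit-learn versions (lbfgs) does not
Log_Reg = LogisticRegression(C = 0.05, penalty = 'l1', solver = 'liblinear')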

pipe = Pipeline([('Transform', Scaler), ('Estimator', Log_Reg)])

In [None]:
scores = cross_val_score(pipe, X, y, cv = 10)

In [None]:
print("Cross-validation accuracy scores:\n {}".format(scores))
print("Average cross-validation score: {:.4f}".format(scores.mean()))

### 6.2.2. F1 Score
As before, let's use the f1 score. Remember, when using f1 score, we need to use the binarised version of the response variable.

<div class="alert alert-block alert-success">**Start Activity 5**</div>

### <font color='blue'> Question 5a:  Write the function that calculates the 10-CV using the F1-score and Pipe  </font>

Documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [None]:
# Write Python code here:


### <font color='blue'> Question 5b:  Print the 10 values of the F1-score here: </font>

In [None]:
# Write Python code here:


<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 5**</div>

<div class="alert alert-block alert-success">**Start Activity 6**</div>

### <font color='blue'> Question 6:  Describe all the steps that we followed in this exercise. Be concise. </font>

<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)

Step 1:

Step 2:
...
#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 6**</div>