![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 3: Resampling Methods - Optional Exercise
# Exercise 2: Bootstrapping (Under Sampling Majority Class)


# 1. Introduction

In this exercise, we will present a method to under-sample the majority class. The exercise is optional.


## 1.1. Aims of the Exercise:
 1. To become familiar with a method to under-sample the majority class

 
It aligns with all the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.



## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click in each cell. In this document, each cell can be a "Code" cell or a "text-Markdown" cell. To choose between these two options, go to the combo-box above. 
 3. If you want to save your notebook, please make sure you press "the floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, please go to Cell->All Output->Clear


# 2. Load the standardized training and test data, and the hospital data.

In [None]:
import sys
print(sys.version)
#For this notebook to work, Python must be 3.6.4 or 3.6.5

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
hospital = pd.read_csv('data/diabetes/Data_Class_Dummies.csv', sep=',')
train_standardized_data = pd.read_csv('data/diabetes/train_standardized_data.csv', sep=',')
test_standardized_data = pd.read_csv('data/diabetes/test_standardized_data.csv', sep=',')


## 2.1. Split the training and test data into features and response.

In [None]:
X_train_standardized = train_standardized_data.drop(['readmission'], axis = 1)
y_train = train_standardized_data[['readmission']].values

In [None]:
X_test_standardized = test_standardized_data.drop(['readmission'], axis = 1)
y_test = test_standardized_data[['readmission']].values

In [None]:
print(X_train_standardized.shape)
print(X_test_standardized.shape)

## 2.2. Binarise response
We will be using the F1 score at various points in this exercise, so let's create a binary response from the training and test response vectors we have created.

* **Training response:**

In [None]:
# Sanity Checks:
print('******************************************')
#print(y_train)
print('y_train - NO values =', sum(i =='NO' for i in y_train))
print('y_train - YES values =', sum(i =='YES' for i in y_train))
print('******************************************\n')

# Create y_train_binary
y_train_binary = [0 if x=='NO' else 1 for x in y_train]


# Sanity Check
print('A few elements of y_train: ', y_train[:12].ravel())
print('Corresponding elements of y_train_binary: ', y_train_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y_train)
print('y_train_binary - 0 values =', sum(i ==0 for i in y_train_binary))
print('y_train_binary - 1 values =', sum(i ==1 for i in y_train_binary))
print('******************************************')

* **Test response:**

In [None]:
# Sanity Checks:
print('******************************************')
#print(y_test)
print('y_test - NO values =', sum(i =='NO' for i in y_test))
print('y_test - YES values =', sum(i =='YES' for i in y_test))
print('******************************************\n')

# Create y_test_binary
y_test_binary = [0 if x=='NO' else 1 for x in y_test]


# Sanity Check
print('A few elements of y_test: ', y_test[:12].ravel())
print('Corresponding elements of y_test_binary: ', y_test_binary[:12])

# Sanity Checks:
print('\n******************************************')
#print(y_test)
print('y_test_binary - 0 values =', sum(i ==0 for i in y_test_binary))
print('y_test_binary - 1 values =', sum(i ==1 for i in y_test_binary))
print('******************************************')

# 3. Under sampling majority class, readmission = NO.
Recall from Chapter 3 Exercise 1 that we used the SMOTE algorithm to make the class value counts equal. Now, we will under sample the majority class. As before, the goal is to rebalance the proportion of class labels. This time, rather than creating new data points from the minority class, we will remove points from the majority class.<p>
A naive approach would be to remove points at random. However, we can instead choose to remove the points that make prediction difficult, namely those that lie very close to points of the opposite class. A useful notion here is that of *Tomek Links*: pairs of points (A, B) such that A and B are each other's nearest neighbour and have opposing class labels. Please see [this website](https://blog.dominodatalab.com/imbalanced-datasets/) for a nice explanation of how the method works.<p>
![alt text](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/tomek.png?v=2 "Tomek Links Visualisation")<p>
It is the NO half of these *Tomek Links* that we will remove.<p>
**Note**: In the case of the SMOTE algorithm, we over sample the minority class in order to make the ratio of NO to YES *exactly equal*. For this under sampling algorithm, we only remove the NO records *that are part of Tomek links*, and no further records. This means that the class counts will generally not be equal.
        

![alt text](http://contrib.scikit-learn.org/imbalanced-learn/stable/_images/sphx_glr_plot_tomek_links_001.png "Tomek Links Visualisation")
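
To make the definition concrete, the following is a minimal sketch of how Tomek links could be found on a tiny data set using plain NumPy. The helper `find_tomek_links` and the toy arrays `X_toy` and `y_toy` are hypothetical and exist only for illustration; this is not the implementation used by `imblearn`, which is what the exercise relies on.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point must not be its own neighbour
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links

# Hypothetical 1-D toy data: 0 = NO (majority), 1 = YES (minority)
X_toy = np.array([[0.0], [1.2], [2.0], [2.4], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 0])

links_toy = find_tomek_links(X_toy, y_toy)
print(links_toy)

# Remove only the NO half of every link
no_in_links = [i for pair in links_toy for i in pair if y_toy[i] == 0]
keep = np.ones(len(y_toy), dtype=bool)
keep[no_in_links] = False
X_toy_tl, y_toy_tl = X_toy[keep], y_toy[keep]
print(y_toy_tl)
```

In this toy example two NO records are removed and both YES records are kept, which mirrors the behaviour described above: the class counts move closer together but are not forced to be equal.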

## 3.1. Deleting points of class label NO.

In [None]:
from imblearn.under_sampling import TomekLinks

**Note: The TomekLinks method is computationally expensive, because it needs to compute the distances between every pair of points in the data set. Please allow ~10 minutes for the process to finish.**

In [None]:
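# Note: this cell uses the older imbalanced-learn API. In later releases,
# fit_sample was renamed fit_resample and return_indices was removed.
tLinks = TomekLinks(return_indices = False)
X_train_standardized_tl, y_train_tl = tLinks.fit_sample(X_train_standardized, y_train_binary)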

In [None]:
print(train_standardized_data['readmission'].value_counts())

For an even ratio, we would need to remove the following number of records from the NO class:

In [None]:
print(train_standardized_data['readmission'].value_counts()['NO']-train_standardized_data['readmission'].value_counts()['YES'])

Number of NO class records actually removed:

In [None]:
print(train_standardized_data['readmission'].value_counts()['NO'] - np.unique(y_train_tl, return_counts = True)[1][0])

The algorithm says that no YES class records are removed. Let's check:

In [None]:
print(train_standardized_data['readmission'].value_counts()['YES'] - np.unique(y_train_tl, return_counts = True)[1][1])

In [None]:
print(np.unique(y_train_tl, return_counts = True))

## 3.2. Train logistic model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# The L1 penalty needs a solver that supports it, such as liblinear
Log_Reg = LogisticRegression(C = 1/20, penalty = 'l1', solver = 'liblinear').fit(X_train_standardized_tl, y_train_tl.ravel())

$\beta$ Coefficients:

In [None]:
print(np.round(Log_Reg.coef_, 3))

In [None]:
# Model predictions on test set
y_pred= Log_Reg.predict(X_test_standardized)

# Use score method to get accuracy of model
score = Log_Reg.score(X_test_standardized, y_test_binary)
print(score)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_true = y_test_binary, y_pred = y_pred)
print("Confusion matrix:\n{}".format(cm))

The Tomek links method did not perform very well in this case. This could be because only a small portion of the NO records were removed, so the class imbalance is still a major problem for the model.

## 3.3. Evaluating the model using unweighted mean of F1 Score.

In [None]:
from sklearn.metrics import f1_score

In [None]:
print(f1_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro'))
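
As an illustration of what `average = 'macro'` means, the sketch below computes the F1 score of each class separately and then takes their unweighted mean. The helper `macro_f1` and the toy labels are hypothetical, and the code assumes binary labels coded as 0 and 1.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in (0, 1):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Hypothetical imbalanced toy data: 8 NO records, 2 YES records
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts NO is 80% accurate here
y_pred_all_no = [0] * 10

print(macro_f1(y_true_toy, y_pred_all_no))
```

The always-NO model scores about 0.44 despite its 80% accuracy, because its F1 score on the YES class is zero. This is why the macro-averaged F1 score is more informative than accuracy on imbalanced data.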

## 3.4. Receiver Operating Characteristic (ROC): TPR and FPR

### 3.4.1. Probability associated with each prediction
We need to determine the probability of each record in the test set being a 'YES', or equivalently a 1 as we have converted the response into a binary variable.

In [None]:
# Predicted probability of each test record being class 1 (YES)
y_pred_proba = Log_Reg.predict_proba(X_test_standardized)[:,1]

print(y_pred_proba[:5])
print(y_pred[:5])

### 3.4.2. Determining the fpr, tpr at each threshold value
Now that we have the probabilities associated with each prediction, we know exactly which records are predicted YES and NO for each choice of decision threshold. Hence, we can determine the false positive rate (fpr) and true positive rate (tpr) for each threshold value.
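
The following sketch shows, on a small hypothetical example, how a single (fpr, tpr) pair follows from one threshold. The helper `rates_at_threshold` and the toy arrays are assumptions made for illustration; a record is predicted YES when its probability is at least the threshold.

```python
import numpy as np

def rates_at_threshold(y_true, y_score, threshold):
    """Return (fpr, tpr) when records with score >= threshold are predicted YES."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    return float(fp / (fp + tn)), float(tp / (tp + fn))

# Hypothetical true labels and predicted probabilities of YES
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
y_score_toy = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

for t in (0.2, 0.5, 0.8):
    print(t, rates_at_threshold(y_true_toy, y_score_toy, t))
```

Lowering the threshold predicts YES more often, so both rates rise together. The ROC curve is the set of these pairs over all thresholds.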

In [None]:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test_binary, y_pred_proba)
print(fpr[:4])
print(tpr[:4])
print(thresholds[:4])

### 3.4.3. Plotting The ROC Curve

In [None]:
df = pd.DataFrame()
df['fpr'] = fpr
df['tpr'] = tpr
# Sanity check 
display(df.head())

In [None]:
from plotnine import *
import warnings; warnings.simplefilter('ignore')

# ROC curve from the fpr and tpr columns of df, with the diagonal of a random classifier for reference
p = ggplot(data = df, mapping = aes(x = 'fpr', y = 'tpr'))
p += geom_line(color = 'red')
p += geom_abline(intercept = 0, slope = 1, linetype = 'dashed', colour = 'blue')
p += labs(title = 'ROC Curve', x = 'fpr', y = 'tpr')
p += theme_bw()

print(p)

### 3.4.4. Area under the ROC curve (AUC)
Note that AUC = 0.5 corresponds to random assignment.

In [None]:
print(metrics.roc_auc_score(y_true = y_test_binary, y_score = y_pred_proba))
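
One way to interpret the AUC is as the probability that a randomly chosen YES record receives a higher predicted probability than a randomly chosen NO record. The sketch below checks this on hypothetical toy data; the helper `auc_by_pairs` is an illustrative assumption and not part of scikit-learn.

```python
import numpy as np

def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (YES, NO) pairs in which the YES record scores higher."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # One entry per (YES, NO) pair
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correctly ordered pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true_toy = np.array([0, 0, 1, 0, 1, 1])
y_score_toy = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])
print(auc_by_pairs(y_true_toy, y_score_toy))

# A model that gives every record the same score cannot order any pair
print(auc_by_pairs(y_true_toy, np.full(6, 0.5)))
```

In the first case 8 of the 9 pairs are ordered correctly, giving an AUC of about 0.89. In the second case every pair is tied, giving 0.5, the value associated with random assignment.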

## 3.5. Computing optimal threshold

In [None]:
# index of pair that maximises tpr - fpr
ind_max = np.argmax(tpr - fpr)
print(ind_max)

In [None]:
# threshold value that maximises the tpr - fpr
optimal_thresh = thresholds[ind_max]
print(optimal_thresh)
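
Once a threshold has been chosen, it can replace the default cut-off of 0.5 when turning probabilities into class predictions. The sketch below illustrates this on hypothetical toy data, choosing the threshold that maximises tpr - fpr among the observed scores. All names ending in `_toy`, as well as `candidates` and `j_values`, are assumptions made for this example.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of YES
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 1])
y_score_toy = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9])

# Every observed score is a candidate threshold
candidates = np.unique(y_score_toy)[::-1]
j_values = []
for t in candidates:
    pred = y_score_toy >= t
    tpr_t = np.sum(pred & (y_true_toy == 1)) / np.sum(y_true_toy == 1)
    fpr_t = np.sum(pred & (y_true_toy == 0)) / np.sum(y_true_toy == 0)
    j_values.append(tpr_t - fpr_t)

best_thresh_toy = candidates[int(np.argmax(j_values))]
print(best_thresh_toy)

# Predictions with the chosen threshold versus the default 0.5
y_pred_best_toy = (y_score_toy >= best_thresh_toy).astype(int)
y_pred_default_toy = (y_score_toy >= 0.5).astype(int)
print(y_pred_best_toy)
print(y_pred_default_toy)
```

Here the chosen threshold of 0.7 removes the one false positive that the default threshold produces, at no cost in true positives.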

# 4. Using Tomek Links AND SMOTE
Recall that SMOTE synthetically generates minority class records whilst avoiding duplications. Here, we will use this method after we have removed all the Tomek Link majority records. Our hope is that the removal of Tomek Links will complement the SMOTE method.
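
As a reminder of the core idea, the sketch below generates one synthetic minority record by interpolating between a minority point and its nearest minority neighbour. It is a deliberately simplified illustration: the array `minority_toy` is hypothetical, and the actual algorithm picks at random among the `k_neighbors` nearest minority neighbours rather than always using the single closest one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority class records with two features
minority_toy = np.array([[1.0, 1.0],
                         [2.0, 1.5],
                         [1.5, 3.0]])

# Pick a minority record and find its nearest minority neighbour
i = 0
dist = np.linalg.norm(minority_toy - minority_toy[i], axis=1)
dist[i] = np.inf  # a record is not its own neighbour
nn = int(np.argmin(dist))

# The synthetic record lies at a random position on the segment between the two
gap = rng.random()
synthetic_toy = minority_toy[i] + gap * (minority_toy[nn] - minority_toy[i])
print(nn, synthetic_toy)
```

Because the new record lies between two existing minority records, it is not an exact duplicate of either of them.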

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
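# Note: ratio and kind belong to the older imbalanced-learn API. Later releases
# use sampling_strategy instead of ratio, drop kind, and rename fit_sample to fit_resample.
smote = SMOTE(k_neighbors = 5, ratio = 'minority', random_state = 0, kind = "regular")
X_train_standardized_tl_smote, y_train_tl_smote = smote.fit_sample(X_train_standardized_tl, y_train_tl)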

In [None]:
np.unique(y_train_tl_smote, return_counts = True)

In [None]:
# The L1 penalty needs a solver that supports it, such as liblinear
Log_Reg = LogisticRegression(C = 1/20, penalty = 'l1', solver = 'liblinear').fit(X_train_standardized_tl_smote, y_train_tl_smote.ravel())

In [None]:
np.round(Log_Reg.coef_, 3)

In [None]:
# Predictions 
y_pred= Log_Reg.predict(X_test_standardized)

# Use score method to get accuracy of model
score = Log_Reg.score(X_test_standardized, y_test_binary)
print(score)

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_true = y_test_binary, y_pred = y_pred)
print(cm)

In [None]:
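# Rows of cm are the true classes, columns are the predicted classes
# Accuracy on YES records (sensitivity): TP / (TP + FN)
acc_pos = cm[1][1]/(cm[1][1] + cm[1][0])
print(acc_pos)

# Accuracy on NO records (specificity): TN / (TN + FP)
acc_neg = cm[0][0]/(cm[0][0] + cm[0][1])
print(acc_neg)

# Balanced Accuracy
BACC = (acc_pos + acc_neg)/2
print(BACC)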

In [None]:
print(f1_score(y_true = y_test_binary, y_pred = y_pred, average = 'macro'))

We see that there is only a small improvement over the results obtained with SMOTE alone, suggesting that removing the Tomek Links does not influence the prediction problem a great deal.

<div class="alert alert-block alert-success">**Start Activity 1**</div>

### <font color='blue'> Question 1: Describe the steps that we follow in this exercise </font></p>

<b> Write your answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

<div class="alert alert-block alert-warning">**End Activity 1**</div>