<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-4/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification

In the previous exercise we worked on a regression problem. Let us now look at classification. We will start with the simpler binary classification problem, where we predict one of only two classes (e.g. 0 or 1, False or True, Negative or Positive).



## Binary Classification

### Dataset

We will use a relatively small dataset, the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset: https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. Based on the features extracted, each sample is classified as  malignant (labelled as 0) or benign (labelled as 1).

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

cancer = load_breast_cancer()

cancer.keys()

In [None]:
# print the text label of the target
print(cancer.target_names)

# print out the unique values of the target
print(np.unique(cancer.target))

**Question 1**

Based on the output of the above cell, what do the target labels 0 and 1 correspond to?


<details>
 <summary>Click here for answer</summary>
  label 0 refers to malignant, and label 1 refers to benign
</details>


Let's create our X (features) and y (label) based on the dictionary keys above.

In [None]:
X, y = cancer["data"], cancer["target"]

print(X.shape)
print(y.shape)

We can see from the shape of X, that there are 30 different features used to classify the sample. 

Let's see how many are labelled as malignant and as benign.

In [None]:
num_malignant = np.sum(y==0)
num_benign = np.sum(y==1)
print('malignant = {}'.format(num_malignant))
print('benign = {}'.format(num_benign))

Split the data into a training set and a test set (80:20 ratio), shuffling the data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=True, random_state=42)


**Question 2**

Create a binary classifier capable of distinguishing between malignant and benign breast mass samples.

* Use Logistic Regression and train it on the whole training set. (use liblinear as solver and 42 as random_state)
* Use the trained classifier to predict the test set 
* Calculate the accuracy score 


<details>
    <summary>Click here for answer</summary>
    
``` python
    
lr_clf = LogisticRegression(solver='liblinear', random_state=42)
lr_clf.fit(X_train, y_train)
y_pred = lr_clf.predict(X_test)
    
```

</details>

In [None]:
# import the logistic regressor 

from sklearn.linear_model import LogisticRegression

# instantiate a logistic regressor using liblinear as solver

lr_clf = None

# train on the X_train 


# make prediction on test set



In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

Our accuracy on the chosen test set seems quite decent. But how do we know this is not simply because we were lucky enough to pick an 'easy' test set? Since our test set is pretty small, it may not be an accurate reflection of how well our model performs. A better way is to use cross-validation.
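To make the idea concrete, here is a minimal sketch of how k-fold cross-validation partitions the data. The six-sample toy array is hypothetical and used only for illustration. Note that scikit-learn's cross-validation helpers stratify the folds by default for classifiers, whereas plain `KFold` is used here to keep the splits easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 6 samples, 1 feature each
X_toy = np.arange(6).reshape(-1, 1)

# 3 folds, no shuffling: each sample lands in the validation fold exactly once
kf = KFold(n_splits=3)
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in kf.split(X_toy)]

for train_idx, val_idx in folds:
    print("train on", train_idx, "-> validate on", val_idx)
```

Each fold is held out once while the model is trained on the remaining folds, so every sample contributes to exactly one validation score.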

### Measuring Accuracy using Cross-Validation


**Question 3**

Evaluate the **accuracy** of the model using cross-validation on the **train** data set with the `cross_val_score()` function, with 3 folds. What do you observe? 

<details><summary>Click here for answer</summary>
    
``` python 
    
cross_val_score(lr_clf, X_train, y_train, cv=3, scoring="accuracy")
    
```
<br/>
There are 3 different accuracy scores reported, one for each fold. 
The accuracy score also differs for each fold
    
</details>

In [None]:
from sklearn.model_selection import cross_val_score

# Complete your code here



### Confusion Matrix

**Question 4**

A much better way to understand how a trained classifier performs is to look at the confusion matrix.


*   Generate a set of predictions using `cross_val_predict()` on the train data set
*   Compute the confusion matrix using the `confusion_matrix()` function 

<details><summary>Click here for answer</summary>

``` python
    
y_train_pred = cross_val_predict(lr_clf, X_train, y_train, cv=3)
    
confusion_matrix(y_train, y_train_pred) 
    
```
</details>

In [None]:
from sklearn.model_selection import cross_val_predict

# Complete your code here 




In [None]:
from sklearn.metrics import confusion_matrix

# complete your code here 



A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal.

In [None]:
y_train_perfect_predictions = y_train  # pretend we reached perfection

confusion_matrix(y_train, y_train_perfect_predictions)

**Question 5**

How do we know which row of the confusion matrix corresponds to which label? 

*Hint*: use the `classes_` attribute of the classifier.

<details><summary>Click here for answer</summary>

``` python
    
lr_clf.classes_
    
```
<br/>  
The order of the classes in `classes_` is the order in which the confusion matrix lists them: the 1st row (and column) corresponds to the 1st class, the 2nd row (and column) to the 2nd class, and so on.

</details>

In [None]:
# Complete your code here



### Precision and Recall

**Question 6**

From the confusion matrix above, compute the precision, recall and F1 score **manually** using the following formulas:

- `recall = TP/(TP+FN)`
- `precision = TP/(TP+FP)`
- `F1 = 2*precision*recall/(precision + recall)`

Since we are doing breast cancer detection, a **positive** test result means malignant, while a **negative** result means benign.
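As a quick check of the formulas, the sketch below applies them to a set of hypothetical counts. The values of `TP`, `FN` and `FP` are made up for illustration and are not taken from the exercise data.

```python
# Hypothetical counts for illustration only (not taken from the exercise data)
TP, FN, FP = 8, 2, 4

recall = TP / (TP + FN)        # share of actual positives that were found
precision = TP / (TP + FP)     # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"recall={recall:.3f}, precision={precision:.3f}, f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.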

<details><summary>Click here for answer</summary>
    
Earlier on, we have determined that 0 is malignant and 1 is benign. So, our positive label is 0 and negative label is 1. 
    
From the confusion matrix, we can obtain the following: 
- TP = 156
- FN = 13
- FP = 10 
- TN = 276

Now we can calculate recall, precision, and f1 easily: 

- recall = 156/(156+13) = 0.92
- precision = 156/(156+10) = 0.94
- f1 = 0.93

</details>

In [None]:
# Complete your code here 

recall = None 

precision = None

f1 = None


**Question 7**

Now use scikit-learn's metric functions to compute recall, precision and F1 score, and compare the values with those computed manually:
- recall_score()
- precision_score()
- f1_score()

Are they the same? If your values differ from those computed by scikit-learn, why?


<details><summary>Click here for answer</summary>
    

``` python
    
recall_score(y_train, y_train_pred)
precision_score(y_train, y_train_pred)
f1_score(y_train, y_train_pred)
    
```
<br/>
If you use the code above, you will get results different from what you computed earlier.

This is because, by default, these functions treat 1 as the positive label and 0 as the negative label. To change the default, use *pos_label* to specify which label is considered positive:
    
``` python
    
recall_score(y_train, y_train_pred, pos_label=0)
precision_score(y_train, y_train_pred, pos_label=0)
f1_score(y_train, y_train_pred, pos_label=0)
   
```
</details>
 

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score

# Complete your code here 



There is another useful function in scikit-learn called `classification_report()` that shows all the metrics at a glance.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred))

Note that we have different precision and recall scores for each class (0 and 1).

**Question 8**

Also note that the classification_report shows two slightly different averages for precision, recall and f1: the macro average and the weighted average. What is the difference between the two? You can refer to this [link](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) for info. Manually calculate the macro and weighted averages to check your understanding.


<details><summary>Click here for answer</summary>
    
```python
    
macro_average_recall = (0.923 + 0.965)/2 
weighted_average_recall = (0.923*169 + 0.965*286)/455
    
```
<br/>

Note: Here we use more decimal places than what was shown in the classification table to obtain the final numbers. 
    
</details>
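The difference between the two averages can also be seen on a small made-up case. The per-class recalls and supports below are hypothetical; with scikit-learn, the same quantities can be obtained through the `average="macro"` and `average="weighted"` options of `recall_score()`.

```python
import numpy as np

# Hypothetical per-class recall and support (number of true samples per class)
recall_per_class = np.array([0.75, 0.50])
support = np.array([4, 2])

# Macro average: every class counts equally, regardless of its size
macro_avg = recall_per_class.mean()

# Weighted average: each class is weighted by its support
weighted_avg = np.average(recall_per_class, weights=support)

print(f"macro={macro_avg:.3f}, weighted={weighted_avg:.3f}")
```

When the classes are imbalanced, the weighted average is pulled toward the score of the larger class, while the macro average is not.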

In [None]:
# complete your code here 


macro_average_recall = None 
weighted_average_recall = None


### Precision and Recall tradeoff

The confusion matrix and the classification report provide a very detailed analysis of
a particular set of predictions. However, the predictions themselves already threw
away a lot of information that is contained in the model. 

Most classifiers provide a `decision_function()` or a `predict_proba()` method to
assess degrees of certainty about predictions. Making predictions can be seen as
thresholding the output of decision_function or predict_proba at a certain fixed
point—in binary classification we use 0 for the decision function and 0.5 for
predict_proba.

In logistic regression, we can use the `decision_function()` method to compute the scores.   
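The two thresholds mentioned above (0 and 0.5) describe the same decision rule for logistic regression, since the predicted probability is the sigmoid of the decision score. The sketch below checks this on synthetic data; the names `X_demo`, `y_demo` and `clf` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_demo, y_demo)

scores = clf.decision_function(X_demo)    # raw linear score, thresholded at 0
proba = clf.predict_proba(X_demo)[:, 1]   # probability of class 1, thresholded at 0.5

# For binary logistic regression, the probability is the sigmoid of the score
sigmoid = 1.0 / (1.0 + np.exp(-scores))
same_values = np.allclose(proba, sigmoid)

# A positive score corresponds to predicting class 1
same_rule = ((scores > 0) == (clf.predict(X_demo) == 1)).all()
print(same_values, same_rule)
```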

In [None]:
sample_X = X[20]
sample_y = y[20]

y_score = lr_clf.decision_function([sample_X])
print(y_score)

With threshold = 0, the prediction (of positive case, i.e. 1) is correct.

In [None]:
threshold = 0
y_some_X_pred = (y_score > threshold)
print(y_some_X_pred == sample_y)

With the threshold set at 6, the prediction (of the positive case, i.e. 1) is wrong. In other words, we fail to detect the positive case (lower recall).

In [None]:
threshold = 6
y_some_data_pred = (y_score > threshold)
print(y_some_data_pred == sample_y)

A higher threshold decreases recall and (usually) increases precision. Conversely, a lower threshold increases recall at the expense of precision. To decide which threshold to use, get the scores of all instances in the training set by asking the `cross_val_predict()` function to return decision scores instead of predictions.
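The trade-off can be seen on a handful of made-up scores. In this sketch, `y_true` and `scores` are hypothetical values chosen for illustration, and `precision_recall_at` is a helper defined here, not a scikit-learn function.

```python
import numpy as np

# Hypothetical labels and decision scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 3.0])

def precision_recall_at(threshold):
    """Precision and recall of the rule `score > threshold` (positive class = 1)."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

# Evaluate a low, a middle and a high threshold
results = {t: precision_recall_at(t) for t in (-1.5, 0.0, 2.0)}
for t, (p, r) in results.items():
    print(f"threshold={t:+.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy case, precision rises and recall falls as the threshold goes up.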

Perform cross validation to get the scores for all instances.

In [None]:
y_scores = cross_val_predict(lr_clf, X_train, y_train, cv=3,
                             method="decision_function")

Compute precision and recall for all possible thresholds using the precision_recall_curve function.

In [None]:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16) 
    plt.xlabel("Threshold", fontsize=16)        
    plt.grid(True)                                           

plt.figure(figsize=(8, 4))                      
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

plt.show()

If we threshold the cross-validated scores at 0, we get the same predictions as the earlier `cross_val_predict()` call that returned class labels.

In [None]:
(y_train_pred == (y_scores > 0)).all()

Another way to select a good precision/recall trade-off is to plot precision directly against recall.

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    #plt.axis([-5, 10, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 4))
plot_precision_vs_recall(precisions, recalls)

plt.show()

Suppose we want to aim for 98% precision or better. Compute the corresponding threshold value.
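One common way to find such a threshold relies on the fact that `np.argmax` on a boolean array returns the index of the first `True`. The toy arrays below are hypothetical stand-ins for the output of `precision_recall_curve`.

```python
import numpy as np

# Hypothetical curve values; precision_recall_curve returns one more
# precision value than thresholds, which this toy example mirrors
precisions_toy = np.array([0.50, 0.90, 0.97, 0.985, 0.99, 1.00])
thresholds_toy = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.argmax on a boolean array returns the index of the first True,
# i.e. the first position where precision reaches the target
first_idx = np.argmax(precisions_toy >= 0.98)
chosen_threshold = thresholds_toy[first_idx]
print(first_idx, chosen_threshold)
```

Be aware that if no entry meets the target, `np.argmax` returns 0, so it is worth checking that the selected precision really is at or above the target.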

In [None]:
threshold_98_precision = thresholds[np.argmax(precisions >= 0.98)]
threshold_98_precision

In [None]:
y_train_pred_98 = (y_scores >= threshold_98_precision)

Compute the precision and recall score

In [None]:
precision_score(y_train, y_train_pred_98)

In [None]:
recall_score(y_train, y_train_pred_98)

### ROC Curves

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is similar to the precision/recall curve, but it plots the true positive rate (recall) against the false positive rate.
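Each point on a ROC curve is one (FPR, TPR) pair obtained at a particular threshold. The sketch below computes a few such pairs by hand on hypothetical labels and scores; `tpr_fpr_at` is a helper defined here for illustration only.

```python
import numpy as np

# Hypothetical labels and scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 3.0])

def tpr_fpr_at(threshold):
    """True positive rate and false positive rate for `score > threshold`."""
    y_pred = scores > threshold
    tpr = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)  # TP / (TP + FN)
    fpr = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)  # FP / (FP + TN)
    return tpr, fpr

# From a strict threshold to a lenient one
roc_points = [tpr_fpr_at(t) for t in (2.0, 0.0, -1.5)]
for tpr, fpr in roc_points:
    print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold moves the point up and to the right: more positives are caught, but more negatives are flagged as well.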

**Question 9**

Compute the True positive rate (TPR), False positive rate (FPR) for various thresholds using the `roc_curve()` function.

<details><summary>Click here for answer</summary>

```python
    
fpr, tpr, thresholds = roc_curve(y_train, y_scores)
    
```
</details>

In [None]:
from sklearn.metrics import roc_curve

# Complete your code here 



In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)                                            

plt.figure(figsize=(8, 6))                        
plot_roc_curve(fpr, tpr)        
plt.show()

The higher the recall (TPR), the more false positives (FPR) the classifier produces. The dashed line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

**Question 10**

Compute the area under the curve (AUC) using `roc_auc_score()`

<details><summary>Click here for answer</summary>

```python
    
roc_auc_score(y_train, y_scores)
    
```
</details>

In [None]:
from sklearn.metrics import roc_auc_score

# Complete your code here 



**Question 11**

We are finally done with our binary classification... Wait a minute! Did we just compute all the evaluation metrics on the ***training set***?! Isn't that bad practice? Don't we need to use the ***test set*** to evaluate how good our model is?

Why?

<details><summary>Click here for answer</summary>

We evaluate our model on the test set only after we are satisfied with its performance on a validation set; model fine-tuning should be done against the validation set, not the test set. In our case the training set is quite small (only 455 samples), so if we set aside a separate validation set, the remaining training set would be too small. That is why we use cross-validation on the training set to evaluate our model.
    
</details>
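Once model selection is finished, a final one-off check on the held-out test set might look like the sketch below. It reloads the data so that it runs on its own, and the variable names (`X_bc`, `X_tr`, `final_clf`, and so on) are chosen here only to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Same data and split settings as used earlier in this exercise
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, train_size=0.8, shuffle=True, random_state=42)

# Fit on the full training set, then use the test set exactly once
final_clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_tr, y_tr)
y_te_pred = final_clf.predict(X_te)

test_accuracy = accuracy_score(y_te, y_te_pred)
test_recall_malignant = recall_score(y_te, y_te_pred, pos_label=0)  # malignant = 0
print(test_accuracy, test_recall_malignant)
```

If the test results lead to further tuning, the test set stops being an unbiased estimate, which is why this step is kept for the very end.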

## Multiclass classification

We will now look at multi-class classification. The dataset we are going to use is the UCI ML hand-written digits dataset: https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where each class refers to a digit. Each digit is an 8x8 image.

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.keys())

**Question 12**

Now create the X (the features) and y (the label) from the digits dataset.  X is a np.array of 64 pixel values, while y is the label e.g. 0, 1, 2, 3, .. 9.

<details><summary>Click here for answer</summary>
    
```python
    
X = digits['data']
y = digits['target']

```
</details>

In [None]:
# Complete your code here 



Let's plot the image of a particular digit to visualize it.  Before plotting, we need to reshape the 64 numbers into 8 x 8 image arrays so that it can be plotted.

In [None]:
# let's choose any one of the rows and plot it
some_digit = X[100]

# print out the corresponding label
print('digit is {}'.format(y[100]))

# reshape it to 8 x 8 image
some_digit_image = some_digit.reshape(8, 8)

plt.imshow(some_digit_image, cmap = mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

**Question 13**

Split the data into train and test set, and randomly shuffle the data.


<details><summary>Click here for answer</summary>

```python
    
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=True, random_state=42)

```
</details>

In [None]:
## Complete your code here

X_train, X_test, y_train, y_test = None, None, None, None



Multiclass classifiers distinguish between more than two classes. Scikit-learn detects when you try to use a binary classification algorithm for a multiclass classification task and automatically runs one-versus-all (OvA), also known as one-versus-rest (OvR).

**Question 14**

Use Logistic Regression to train using the training set, and make a prediction of the chosen digit (`some_digit`). Is the prediction correct?

<details><summary>Click here for answer</summary>

```python

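lr_clf = LogisticRegression(solver='liblinear', random_state=42)
lr_clf.fit(X_train, y_train)

# predict the digit chosen earlier and compare it with its true label y[100]
lr_clf.predict([some_digit])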
    
```
</details>

In [None]:
# Complete the code here



Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image and selected the class with the highest score.  
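The one-versus-rest strategy can be made visible with scikit-learn's `OneVsRestClassifier` wrapper. This is a sketch rather than a view of the internals: it assumes the wrapper mirrors what the liblinear solver does, and it trains on the full digits data for brevity. The names `X_dig`, `y_dig` and `ovr_clf` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_dig, y_dig = load_digits(return_X_y=True)

# Explicit one-versus-rest wrapper around a binary logistic regression
ovr_clf = OneVsRestClassifier(
    LogisticRegression(solver="liblinear", random_state=42))
ovr_clf.fit(X_dig, y_dig)

# One fitted binary classifier per digit class
n_binary = len(ovr_clf.estimators_)

# One score per class for each sample; the predicted class has the highest score
sample_scores = ovr_clf.decision_function(X_dig[:5])
picked = ovr_clf.classes_[np.argmax(sample_scores, axis=1)]
agrees = (picked == ovr_clf.predict(X_dig[:5])).all()
print(n_binary, sample_scores.shape, agrees)
```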

**Question 15**

Compute the scores for `some_digit` using the `decision_function()` method to return 10 scores, one per class.

<details><summary>Click here for answer</summary>

```python
    
some_digit_scores = lr_clf.decision_function([some_digit])
    
```
</details>

In [None]:
# complete the code here



The highest score is the one corresponding to the correct class.

In [None]:
index = np.argmax(some_digit_scores)
print(index)

In [None]:
lr_clf.classes_[index]

**Question 16**

Use `cross_val_score()` to evaluate the classifier

<details><summary>Click here for answer</summary>
    
```python 
    
cross_val_score(lr_clf, X_train, y_train, cv=3, scoring="accuracy")
    
```
</details>  

In [None]:
# Complete your code here 



**Question 17**

Compute the confusion matrix of the classifier. From the confusion matrix, which two digits tend to be confused with each other?

<details><summary>Click here for answer</summary>
    
```python 

y_train_pred = cross_val_predict(lr_clf, X_train, y_train, cv=3)
confusion_matrix(y_train, y_train_pred)
    
```
<br/>
1 and 8 are confused with each other. 
    
</details>  

In [None]:
# Complete your code here



**Question 18**

Print out the classification_report.  

<details><summary>Click here for answer</summary>
    
```python 

print(classification_report(y_train, y_train_pred))
    
```
</details>  


In [None]:
# Complete your code here 

