# 03 - Classification

The goal of this exercise is to develop an understanding of how to train a binary classifier and how to measure its performance with different metrics.

<div class="alert alert-block alert-info">
To solve this notebook you need the knowledge from the previous notebook. If you have problems solving it, take another look at last week's notebook.
    
It's also recommended to read chapter 3 of the book in advance.
</div>

**Task**: In this exercise, we want to use pictures of banknotes to identify whether they are forged or not.

In [None]:
# Run this cell to import the following modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<h2 style="color:blue" align="left">Banknote Authentication Data Set</h2>

For this we use a publicly available dataset called [_banknote authentication Data Set_](https://archive.ics.uci.edu/ml/datasets/banknote+authentication#). This dataset was created by two researchers at the University of Applied Sciences in Ostwestfalen-Lippe. They took 1372 images of genuine and forged banknote-like specimens with an industrial camera usually used for print inspection and applied a wavelet transform to them. They then extracted 4 features from the images:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)

The last column shows whether the banknote is genuine or fake:

- class 0 is genuine / authentic
- class 1 is fake / forgery


In [None]:
dataset = pd.read_csv('dataset/data_banknote_authentication.csv')
dataset.head()

<div class="alert alert-block alert-success"><b>Task</b><br> 
Find a way to count how many fake and how many real banknotes are in the dataset.
</div>

In [None]:
# Write Your Code Here


<h2 style="color:blue" align="left">Data Preprocessing</h2>

### Train Test Split

After inspecting the data, we can split it into a training set and a test set. For this we use the built-in scikit-learn function `train_test_split`.
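To illustrate the idea behind such a split, here is a minimal pure-Python sketch. The helper `shuffle_split` and the toy `samples` list are hypothetical and only show the principle (shuffle with a fixed seed, then cut off a fraction as test set). It does not reproduce the exact shuffling that `train_test_split` performs.

```python
import random


def shuffle_split(samples, test_fraction, seed):
    """Shuffle a copy of the samples and use the first test_fraction as test set."""
    rng = random.Random(seed)      # a fixed seed makes the split reproducible
    shuffled = samples[:]          # copy, so the caller's list stays untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: ten sample ids instead of real banknote features.
samples = list(range(10))
train, test = shuffle_split(samples, test_fraction=0.2, seed=42)
```

Calling the helper again with the same seed returns the same split, which is the role `random_state` plays in the task below.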

In [None]:
from sklearn.model_selection import train_test_split

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the train_test_split function to split up the dataset. The test set should consist of 20% of the total data. Save the results in the variables X_train, X_test, y_train and y_test. Set the variable random_state = 42.
</div>

In [None]:
X, y = [], []
X_train, X_test, y_train, y_test = [], [], [], []
# Write Your Code Here


### Scale the data

Because the data is given in different units, we must scale it. For this we use the built-in class `StandardScaler` of scikit-learn.
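As a rough sketch of what standardization does, the following pure-Python example scales a single hypothetical feature column by hand. The values and the helper `standardize` are made up for illustration. The important point is that the mean and standard deviation are estimated on the training data only and then reused for the test data.

```python
from statistics import mean, pstdev

# Hypothetical toy feature column (not taken from the banknote dataset).
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# "Fit": estimate the statistics on the training data only.
mu = mean(train_values)
sigma = pstdev(train_values)  # population standard deviation


def standardize(values, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]


# "Transform": apply the same training statistics to both sets.
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics.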

In [None]:
from sklearn.preprocessing import StandardScaler

<div class="alert alert-block alert-success"><b>Task</b><br>
Scale the training and test set with the StandardScaler sc. Save the results in the variables X_train_scaled and X_test_scaled.
</div>

In [None]:
sc = StandardScaler()
X_train_scaled, X_test_scaled = [], []
# Write Your Code Here


<h2 style="color:blue" align="left">Train and Evaluate your Model</h2>

Now that we're done with the preprocessing, we can start to train and evaluate the model. In this exercise, we will use the `SGDClassifier` of scikit-learn. By default, this class fits a linear support vector machine (SVM) with stochastic gradient descent (SGD) learning. These terms will be discussed in the next lectures.

In [None]:
from sklearn.linear_model import SGDClassifier

In [None]:
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train_scaled, y_train)

### Accuracy with Cross Validation

After fitting the model to the data, we can start to evaluate it. For this we use the cross-validation score, like in the book.
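To see what k-fold cross-validation does with the data, here is a minimal sketch that only generates the index sets. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds. For classifiers, scikit-learn's `cross_val_score` uses stratified folds by default, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


# Toy example: 7 samples split into 3 folds.
folds = list(k_fold_indices(n_samples=7, n_folds=3))
```

Each sample ends up in the validation part exactly once, so the model is always scored on data it was not trained on in that round.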

In [None]:
from sklearn.model_selection import cross_val_score

<div class="alert alert-block alert-success"><b>Task</b><br>
Evaluate the model using the cross_val_score function of scikit-learn. Perform a 3-fold cross-validation on your training set.
</div>

In [None]:
# Write Your Code Here


Wow, every fold reaches an accuracy of more than 97%. Since the training set contains a nearly equal number of fake and real banknotes, this is a really good result. But as we've learned in the book, there are more metrics to measure the performance of your model.

### Make some Predictions

To apply these metrics to your model, we need to make some predictions and save them for further investigation. For this we use the function `cross_val_predict` of scikit-learn.

In [None]:
from sklearn.model_selection import cross_val_predict

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the function cross_val_predict to predict the data of your training set using 3 folds. Save the result in the variable y_train_pred.
</div>

In [None]:
y_train_pred = []
# Write Your Code Here


### Confusion Matrix

Once we have the predicted values `y_train_pred`, we can compare them with the true values `y_train` and plot a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
tickLabels = ['authentic', 'fake']
ax = sns.heatmap(confusion_matrix(y_train, y_train_pred), annot=True, fmt="d", yticklabels=tickLabels, xticklabels=tickLabels)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12);

The confusion matrix has the following layout.

$ \text{confusion matrix} =
\begin{pmatrix}
TN & FP \\
FN & TP
\end{pmatrix}
$

On the upper left, we have the True Negatives (_TN_). These are the values that were correctly predicted as negative. Because negative ($0$) is the class _authentic_, this cell shows the number of authentic banknotes that were correctly predicted as authentic. To the right are the False Positives (_FP_). These are the authentic banknotes that were wrongly labeled as positive ($1$), i.e. fake.

On the lower left, you can see the False Negatives (_FN_), i.e. the fake banknotes that were wrongly predicted as authentic. On the lower right is the number of fake banknotes that were correctly predicted as fake (_TP_).

All in all, this is a very good result: the diagonal holds only large numbers, and only 20 predictions are wrong.
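To make the four cells concrete, the following sketch counts them by hand for a small set of hypothetical labels (the lists `y_true` and `y_pred` are made up and unrelated to the notebook's results). It arranges the counts in the same layout as above.

```python
# Hypothetical toy labels: 0 = authentic, 1 = fake.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # authentic, predicted authentic
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # authentic, predicted fake
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fake, predicted authentic
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fake, predicted fake

# Rows are the actual classes, columns the predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
```

The four counts always add up to the number of samples, and the correct predictions sit on the diagonal.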

### Precision and Recall

Precision and recall are two more metrics to measure the performance of your model. They can be computed directly from the values in the confusion matrix.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

#### Precision

The precision describes what proportion of positive identifications was actually correct.

It can be computed as follows:

$\text{precision} = \frac{TP}{TP+FP}$
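As a small worked example of this formula, the sketch below computes the precision from hypothetical counts. The helper `precision` and the numbers are assumptions for illustration, and returning 0.0 when nothing was predicted positive is a convention chosen here.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    if tp + fp == 0:
        return 0.0  # assumption: report 0.0 when nothing was predicted positive
    return tp / (tp + fp)


# Hypothetical counts: 30 fakes flagged correctly, 10 authentic notes flagged wrongly.
p = precision(tp=30, fp=10)  # 30 / 40 = 0.75
```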

You can also use the built-in scikit-learn function `precision_score`.

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the function precision_score to calculate the precision of your model.
</div>

In [None]:
# Write Your Code Here


#### Recall

The recall describes what proportion of actual positives was identified correctly.

It can be computed as follows:

$\text{recall} = \frac{TP}{TP+FN}$
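The following sketch applies this formula directly to a pair of hypothetical label lists (made up for illustration, with 1 = fake), counting only the samples that are actually positive.

```python
# Hypothetical toy labels: 1 = fake, 0 = authentic.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Only actual positives matter for recall.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)  # 3 of the 4 fakes were found
```

Note that the wrongly flagged authentic banknote (the last sample) does not affect the recall at all.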

You can also use the built-in scikit-learn function `recall_score`.

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the function recall_score to calculate the recall of your model.
</div>

In [None]:
# Write Your Code Here


The F1 score is the harmonic mean of precision and recall. scikit-learn has a built-in function for this, too.
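To show why the harmonic mean is used, here is a minimal sketch with a hypothetical helper `f1` and made-up precision and recall values. The harmonic mean is pulled towards the smaller of the two values, so a model only gets a high F1 score if both are high.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0.0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


balanced = f1(0.8, 0.8)    # both values equal, so F1 matches them
lopsided = f1(1.0, 0.25)   # 0.4, well below the arithmetic mean of 0.625
```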

In [None]:
f1_score(y_train, y_train_pred)

### Make Adjustments to your Model

The goal of your classifier is to not accept any fake banknotes. Because of the precision/recall trade-off, this means that we will mark some real banknotes as fake. Since the fake banknotes are labeled with $1$, we don't want any false negatives (_FN_), so we want high _recall_.
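The trade-off can be sketched with a handful of hypothetical decision scores. The lists and the helper `recall_and_precision` below are made up for illustration and are not the output of the notebook's model. Lowering the decision threshold turns more samples into positives, which raises the recall and, in this toy case, lowers the precision.

```python
# Hypothetical labels (1 = fake) and decision scores, sorted by score.
y_true = [0, 0, 1, 0, 0, 1, 1, 1]
y_scores = [-2.0, -1.5, -0.6, -0.4, 0.2, 0.8, 1.5, 2.4]


def recall_and_precision(y_true, y_scores, threshold):
    """Predict 1 for every score >= threshold, then compute both metrics."""
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


default = recall_and_precision(y_true, y_scores, threshold=0.0)   # one fake is missed
lowered = recall_and_precision(y_true, y_scores, threshold=-0.6)  # every fake is caught
```

With the lowered threshold no fake banknote slips through, but two authentic ones are now flagged instead of one.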


In [None]:
from sklearn.metrics import precision_recall_curve

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the function precision_recall_curve to find a threshold so that your model won't accept any fake banknotes. Save the threshold in the variable threshhold_highest_recall. If you are not sure how to do it, have another look at the third chapter of the book.
</div>

In [None]:
threshhold_highest_recall = 0
y_scores = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3, method="decision_function")
# Write Your Code Here


You can use this threshold to make new predictions with your model.

In [None]:
y_pred_recall = (y_scores >= threshhold_highest_recall)
confusion_matrix(y_train, y_pred_recall)

Hey, as we can see, there are no False Negatives anymore. Great!

But there are also more False Positives: we marked 26 authentic banknotes as fake. This leads to a lower precision.

In [None]:
precision_score(y_train, y_pred_recall)

### ROC-Curve and AUC-Score

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

- True Positive Rate (TPR), where $\text{TPR} = \frac{TP}{TP+FN}$, also known as recall, and
- False Positive Rate (FPR), where $\text{FPR} = \frac{FP}{FP+TN}$.

scikit-learn provides a built-in function called `roc_curve` to compute the rates at the different thresholds.
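To illustrate how a single point of the ROC curve arises, the sketch below computes the FPR and TPR by hand for a few thresholds. The data and the helper `roc_point` are hypothetical. The thresholds are picked by hand here and are not the exact set of thresholds that `roc_curve` selects.

```python
def roc_point(y_true, y_scores, threshold):
    """Return (FPR, TPR) when predicting 1 for every score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical toy data with three authentic (0) and three fake (1) samples.
y_true = [0, 0, 1, 0, 1, 1]
y_scores = [-1.0, -0.5, -0.2, 0.3, 0.6, 1.2]

# From a very strict threshold (nothing flagged) to a very lax one (all flagged).
points = [roc_point(y_true, y_scores, t) for t in (2.0, 0.5, -0.3, -2.0)]
```

The strictest threshold gives the point (0, 0) and the most lenient one gives (1, 1). The thresholds in between trace out the curve.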

In [None]:
from sklearn.metrics import roc_curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_scores)

The following code plots the graph.

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1.01])  
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16)
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)  
    plt.grid(True)

plt.figure(figsize=(8, 6))                    
plot_roc_curve(fpr, tpr)

plt.show()

As we can see, the curve is nearly perfect, because it runs very close to the upper left corner. The area under the curve is called the AUC score and can be computed with `roc_auc_score`.
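As a rough illustration of what "area under the curve" means, the following sketch integrates a ROC curve with the trapezoidal rule. The helper `trapezoid_auc` and the curve points are hypothetical. It expects the points to be sorted by increasing FPR.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear curve, with fpr sorted in increasing order."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical ROC points of a good, but not perfect, classifier.
good = trapezoid_auc([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])

# Reference cases: random guessing (the diagonal) and a perfect classifier.
random_guess = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
perfect = trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
```

The diagonal yields an area of 0.5 and a perfect classifier yields 1.0, so the closer the curve runs to the upper left corner, the closer the score gets to 1.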

In [None]:
from sklearn.metrics import roc_auc_score

<div class="alert alert-block alert-success"><b>Task</b><br>
Use the function roc_auc_score to compute the AUC-Score.
</div>

In [None]:
# Write Your Code Here


Finally, test your model with the test set.

<div class="alert alert-block alert-success"><b>Task</b><br>
Evaluate accuracy, precision and recall for the test set. 
</div>

In [None]:
# Write Your Code Here