# Exercise #2 - Evaluation

In the previous exercise you implemented a number of basic functions and combined them into a neural network model that can predict a value. But how can we judge if the output is any good? We need to evaluate the output by comparing it to a ground truth using metric (and loss) functions, which we will implement in this exercise.

First import the NumPy library again.

In [None]:
import numpy as np

<img style="float: right;" src="figures/precisionrecall.png" width="315" height="573">

## Metrics

There are [several ways](https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers) to check how well your model performs. We will first inspect some metrics.

The most common metrics in use are:

- *Binary accuracy*: How many of all the predictions are correct.

        (tp + tn) / (tp + fp + tn + fn)

- *Precision*: How many of the positive predictions are correct.

        tp / (tp + fp)
    
- *Recall*: How many of the actual positive ground truths are predicted correctly.

        tp / (tp + fn)

Implement the `get_metrics` function that, given the ground truth (`y_true`) and the predictions (`y_pred`), returns the accuracy, precision and recall values.

**Note:** The variables `tp`, `fp`, `fn` and `tn` have already been computed for you. Each is an integer count of the elements in that category. For example, `fp` is the number of false positives.

Take care to avoid division by zero: a denominator can be zero, for example when no element is predicted positive.
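One possible way to guard a division is sketched below. The helper name `safe_divide` and its `fallback` parameter are hypothetical and not part of the exercise code; the fallback of `0.0` is an assumption chosen to match the expected outputs further down.

```python
def safe_divide(numerator, denominator, fallback=0.0):
    """Return numerator / denominator, or `fallback` if the denominator is zero."""
    if denominator == 0:
        return fallback
    return numerator / denominator


print(safe_divide(4, 5))  # 0.8
print(safe_divide(0, 0))  # 0.0 (fallback instead of a division error)
```

The same effect can be reached with an inline conditional expression; which form you use is a matter of taste.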

In [None]:
def get_metrics(y_true, y_pred):
    y_true = (y_true >= 0.5)
    y_pred = (y_pred >= 0.5)

    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)

    #### BEGIN IMPLEMENTATION ####
    accuracy = 
    precision = 
    recall = 
    #### END IMPLEMENTATION ####
    return accuracy, precision, recall

Let's check the implementation. We give it 10 predictions which are not all correct.

In [None]:
y_true = np.array([[1.0], [1.0], [0.0], [1.0], [0.0], [0.0], [0.0], [1.0], [1.0], [0.0]])
y_pred = np.array([[0.6], [0.4], [0.2], [0.7], [0.1], [0.2], [0.5], [0.9], [0.8], [0.6]])
print(get_metrics(y_true, y_pred))

The output should be `(0.7, 0.6666666666666666, 0.8)`. In other words, an accuracy of 70%, a precision of 67%, and a recall of 80%.
* 7 out of 10 predictions (70%) match the ground truth.
* Of the 6 elements that are predicted positive, only 4 (67%) are actually positive. So we have 2 false positives.
* Of the 5 elements that are actually positive, only 4 (80%) are predicted positive. One is missed, i.e. a false negative.
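To double-check these counts by hand, the following sketch recomputes the four statistics for the same ten samples. It uses flat arrays instead of column vectors purely for readability; that is a simplification, not a requirement.

```python
import numpy as np

# Same ten samples as above, thresholded at 0.5 like in get_metrics.
y_true = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]) >= 0.5
y_pred = np.array([0.6, 0.4, 0.2, 0.7, 0.1, 0.2, 0.5, 0.9, 0.8, 0.6]) >= 0.5

tp = int(np.sum(y_pred & y_true))    # predicted positive, actually positive
fp = int(np.sum(y_pred & ~y_true))   # predicted positive, actually negative
fn = int(np.sum(~y_pred & y_true))   # predicted negative, actually positive
tn = int(np.sum(~y_pred & ~y_true))  # predicted negative, actually negative

print(tp, fp, fn, tn)  # 4 2 1 3
```

Note that a prediction of exactly `0.5` counts as positive because the threshold uses `>=`.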

Now we can also see what happens if your data is not well balanced. Say we have a data set with mainly negative samples and only a few positive samples. If we always predict a negative value (e.g. always output 0), then we get a very high accuracy (most samples are predicted correctly), but precision and recall will be terrible. Precision is strictly speaking undefined here, because nothing is predicted positive; the expected output below treats it as 0.

In [None]:
y_true = np.array([[1.0], [1.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]])
y_pred = np.zeros_like(y_true)
print(get_metrics(y_true, y_pred))

The output should be `(0.8, 0.0, 0.0)`.

Conversely, if we have a dataset with mainly positive samples and we always predict a positive value, then both accuracy and precision will be high, and recall will be perfect. All seems right, but it actually isn't: the model has learned nothing. So always pay attention to the distribution of your data. Ideally it is well balanced.

In [None]:
y_true = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0]])
y_pred = np.ones_like(y_true)
print(get_metrics(y_true, y_pred))

The output should be `(0.8, 0.8, 1.0)`.
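A quick way to spot such an imbalance is to look at the fraction of positive labels before trusting the accuracy. The helper `positive_fraction` below is a hypothetical name used only for this sketch.

```python
import numpy as np


def positive_fraction(y_true, threshold=0.5):
    """Fraction of samples whose ground truth is positive."""
    return float(np.mean(y_true >= threshold))


y_true = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0], [1.0]])
print(positive_fraction(y_true))  # 0.8
```

A model that always predicts positive reaches an accuracy equal to this fraction, so an accuracy of 0.8 on this data is no better than a constant guess.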

## Loss function

Another way to check if a neural network is performing well (and more useful during training) is the loss function.

We have built a binary classifier model, so we are going to implement the **binary cross entropy loss** function to determine how well it fits to the ground truth. This function is defined as follows.

$$\mathop{BinaryCrossEntropyLoss}(y, \hat{y}) = -{( y \cdot \log{\hat{y}} + (1 - y) \cdot \log{(1 - \hat{y})})}$$

Where $y$ is our ground truth and $\hat{y}$ is the prediction. The following plot shows the behavior of the loss function in the two cases of the ground truth.

![min log](figures/minlog.png "-log")

In other words, if the ground truth is a `1` and we predict a value closer to `1` then the loss will go down to zero. If we predict a value closer to `0` (i.e. the opposite of the ground truth), then the loss will go up to infinity.
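The same behavior can be seen numerically. The sketch below evaluates the two branches of the loss, $-\log{\hat{y}}$ for a ground truth of `1` and $-\log{(1 - \hat{y})}$ for a ground truth of `0`, at a few arbitrarily chosen predictions.

```python
import numpy as np

# A few example predictions, from "almost certainly 1" to "almost certainly 0".
predictions = np.array([0.99, 0.9, 0.5, 0.1, 0.01])

loss_if_true_is_1 = -np.log(predictions)        # branch used when y = 1
loss_if_true_is_0 = -np.log(1.0 - predictions)  # branch used when y = 0

for p, l1, l0 in zip(predictions, loss_if_true_is_1, loss_if_true_is_0):
    print(f"y_hat={p:.2f}  loss(y=1)={l1:.3f}  loss(y=0)={l0:.3f}")
```

The loss for `y=1` grows as the prediction moves toward `0`, and the loss for `y=0` mirrors it.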

This loss value will be computed for every sample in the batch, but the optimization routine used during training expects a single scalar value. We can **reduce** the array of losses to a single scalar value by simply taking the average, which gives us the final loss value.

$$\mathop{L}(y, \hat{y}) = \frac{1}{N} \sum_{n=1}^{N} \mathop{BinaryCrossEntropyLoss}(y_n, \hat{y}_n)$$

Where $N$ is the number of samples in the batch.
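As a small illustration of the reduction step alone, the sketch below averages three per-sample loss values. The numbers are made up for this example and do not come from the exercise data.

```python
import numpy as np

# Hypothetical per-sample losses for a batch of N = 3 samples (column vector).
losses = np.array([[0.1], [0.2], [0.6]])

# Reduce the whole batch to one scalar by averaging over all samples.
loss = np.mean(losses)
print(loss)
```

The result is a single floating-point number regardless of the batch size, which is what the optimization routine needs.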

Let's implement them both in the function below.

**Hint:**
You will need [np.log](https://numpy.org/doc/stable/reference/generated/numpy.log.html) and [np.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html).
And remember that the `*` operator is an element-wise multiplication (mapped to [np.multiply](https://numpy.org/doc/stable/reference/generated/numpy.multiply.html)), which you _do_ need to use here instead of matrix multiplications.
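The difference between the two kinds of multiplication is easy to see on two small column vectors. The values below are arbitrary and serve only as an illustration.

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])
b = np.array([[4.0], [5.0], [6.0]])

# Element-wise product: same shape as the inputs, one value per sample.
elementwise = a * b
print(elementwise.shape)  # (3, 1)

# Matrix multiplication of two (3, 1) arrays is not defined and raises an error.
matmul_failed = False
try:
    a @ b
except ValueError:
    matmul_failed = True
print(matmul_failed)  # True
```

Since the loss is computed per sample, the element-wise form is the one that keeps one loss value per row.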

**Note:**
* `y_true` represents the ground truth $y$
* `y_pred` represents the predicted value(s), i.e. the output $\hat{y}$ from the network.

In [None]:
def binary_cross_entropy_loss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-9, 1-1e-9) # clip to prevent log of 0.
    #### BEGIN IMPLEMENTATION ####
    # Compute Binary Cross Entropy Loss for every sample
    losses = 
    # Reduce to a single value
    loss = 
    #### END IMPLEMENTATION ####
    return loss
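The `np.clip` call at the top of the function matters because a prediction of exactly `0` or `1` would make one of the logarithms infinite. The following sketch shows the effect on a few boundary values.

```python
import numpy as np

y_pred = np.array([0.0, 0.5, 1.0])

# Without clipping, log(0) evaluates to -inf (NumPy also emits a warning,
# which is silenced here to keep the output clean).
with np.errstate(divide="ignore"):
    raw_logs = np.log(y_pred)

# With the same clipping bounds as in the function above, all values stay finite.
clipped = np.clip(y_pred, 1e-9, 1 - 1e-9)
clipped_logs = np.log(clipped)

print(np.isfinite(raw_logs).all())      # False
print(np.isfinite(clipped_logs).all())  # True
```

The clipping slightly changes predictions at the extremes, which caps the loss of a single sample at roughly `-log(1e-9)` instead of infinity.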

Time for a test. If we provide this function with a prediction that is far from the ground truth it should give a relatively high loss value.

In [None]:
y_true = np.array([[1.0], [1.0], [0.0], [1.0], [0.0], [0.0]])
y_pred = np.array([[0.1], [0.2], [0.7], [0.1], [0.9], [0.8]])
loss = binary_cross_entropy_loss(y_true, y_pred)
print(loss)

The output should be `1.8884339846960454`.

But if we give it a prediction close to the ground truth the loss value should be low.

In [None]:
y_true = np.array([[1.0], [1.0], [0.0], [1.0], [0.0], [0.0]])
y_pred = np.array([[0.9], [0.8], [0.3], [0.9], [0.1], [0.2]])
loss = binary_cross_entropy_loss(y_true, y_pred)
print(loss)

The output should be `0.18650726559010514`.

## Evaluation

Time to combine this with the model we created in the previous exercise. You should now create a function that computes the loss of a model given some input (`x`) and the ground truth (`y_true`).

In [None]:
def evaluate(model, x, y_true):
    #### BEGIN IMPLEMENTATION ####
    y_pred = 
    loss = 
    #### END IMPLEMENTATION ####
    return loss

### Data

Let's import some real data to test your implementation. We are going to use this set throughout the rest of the exercises. The dataset that we are going to use is the [Breast Cancer Prediction Dataset](https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset) from the public Kaggle datasets library.

> Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.
>
> This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

This dataset is a table of 569 rows and 6 columns. The first 5 columns are the results of certain measurements. The last column is the final diagnosis, where a `1` means that it is in fact a malignant tumor (i.e. a tumor that may invade its surrounding tissue or spread around the body) and a `0` means it is benign.

So let's import the data set. The helper function splits this data into two sets, a training set and a validation set. This will come in handy later on. For now we will only use the validation set.

In [None]:
from siouxdnn import load_data
X_train, Y_train, X_val, Y_val = load_data()
print('training set', X_train.shape, Y_train.shape)
print('validation set', X_val.shape, Y_val.shape)

### Test

If all is implemented well, we should now be able to compute the loss of our previously implemented model on the validation dataset. We use a reference implementation of `Model` that is exactly the same as your implementation.

In [None]:
from siouxdnn import reset_seed, Model
reset_seed()
model = Model()
loss = evaluate(model, X_val, Y_val)
print(loss)

The output should be `0.8438193071051783`. If it is not, check your implementation of `binary_cross_entropy_loss` or `evaluate`.

## Done

We are now done with exercise #2. We can now quickly determine how well a model performs using metrics and loss functions. We will use these in the next exercise to actually train the model.