# Evaluating Models

Main goals:

* Select appropriate metrics for evaluating different models, given different data
* Explore visualisation methods for evaluation

We will work with `sklearn` and the usual data science libraries, which we import now:

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

## Task 1: Is this fish ready?

Imagine you have a small fish farm, where you breed a variety of fish. You manually inspect the fish to see when they are ready for sale. This involves picking up a fish, looking at it very closely and making a decision based on your many years of training and experience in fish appraisal.

The sad fact is, you hate fish and it's a fairly time-consuming task. Could you just automatically weigh the fish and use that as a proxy for your skills? Surely it can't be that simple?

## Load the data

You've collected data on the last ~150 fish you evaluated. You recorded the species, weight, some physical measurements, how much you think the fish is worth, and whether you think it is ready for sale.

In [None]:
data = pd.read_csv('data/fish.csv')

data.head()

## Exploring the task

You are hoping to predict the binary outcome recorded in the `Ready` column, using only the `Weight` column.

* Using `pandas`, find the distribution of the outcomes and represent it:
    * Numerically (with the `.value_counts()` method of a DataFrame column)
    * Visually (the output of `.value_counts()` has a `.plot()` method)

* Is this a classification or a regression task?
* What issues do you notice with the data?

In [None]:
print(data['Ready'].value_counts())

data['Ready'].value_counts().plot(kind='bar');

# Your thoughts here...


We will use a simple Naive Bayes model for predicting `Ready` from `Weight`. 

The data will be split into two sets: 75% for training the model and the remaining 25% for evaluating it.

Splitting the data this way gives an idea of how well the model generalises to unseen data.
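
To make the idea of a hold-out split concrete, here is a minimal standard-library sketch of what a 75/25 split does conceptually. It is not how `train_test_split` is implemented; the helper name `simple_split` and the toy weights are assumptions made up for illustration.

```python
import random

def simple_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of `items` and hold out `test_fraction` of them (hypothetical helper)."""
    rng = random.Random(seed)          # seeded so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items are held out; the rest are used for training
    return shuffled[n_test:], shuffled[:n_test]

toy_weights = list(range(100, 120))    # 20 made-up fish weights
train, test = simple_split(toy_weights)
print(len(train), len(test))
```

The model never sees the held-out items during fitting, which is what makes the evaluation on them a fair estimate of performance on new fish.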

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

x = data[['Weight']]
y = data['Ready']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

model = GaussianNB()
model.fit(x_train, y_train)

predictions = model.predict(x_test)

print(predictions)

## Base functions for evaluating the classifier

Most metrics for classifiers are some combination of true/false negatives/positives.

Implement functions for these, which take in two lists (truth and prediction) containing True and False booleans.
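
Before writing them, it may help to see the four outcomes on a toy example. The sketch below tallies every (truth, prediction) pair in one pass with `collections.Counter`, rather than using four separate functions; the six labels are made up.

```python
from collections import Counter

# Made-up ground truth and predictions for six fish
truth = [True, True, False, False, True, False]
preds = [True, False, False, True, True, False]

# Count how often each (truth, prediction) pair occurs
pair_counts = Counter(zip(truth, preds))

outcomes = {
    "TP": pair_counts[(True, True)],    # truth True, prediction True
    "FN": pair_counts[(True, False)],   # truth True, prediction False
    "FP": pair_counts[(False, True)],   # truth False, prediction True
    "TN": pair_counts[(False, False)],  # truth False, prediction False
}
print(outcomes)
```

Every example falls into exactly one of the four outcomes, so the counts always sum to the number of predictions.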

In [None]:
def get_tp(ground_truth, predictions):
    # True positive: both ground truth and prediction are True
    tp = 0
    # Your code here...
    return tp

def get_tn(ground_truth, predictions):
    # True negative: both ground truth and prediction are False
    tn = 0
    # Your code here...
    return tn

def get_fp(ground_truth, predictions):
    # False positive: ground truth is False but prediction is True
    fp = 0
    # Your code here...
    return fp

def get_fn(ground_truth, predictions):
    # False negative: ground truth is True but prediction is False
    fn = 0
    # Your code here...
    return fn



## Compound functions for evaluating the classifier

Some evaluation measures are just combinations of the output of the above functions:

* Accuracy: $\frac{TP + TN}{TP+FP+TN+FN}$

* Precision: $\frac{TP}{TP+FP}$

* Recall: $\frac{TP}{TP+FN}$

And F1 is just a combination of the output of *those* functions:

* F1 Score = $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
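
As a worked example of the arithmetic, the sketch below plugs made-up counts (not taken from the fish data) into the four formulas above.

```python
# Made-up counts for illustration (not from the fish data)
tp, fp, tn, fn = 30, 10, 55, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, accuracy is 0.85, precision is 0.75, recall is about 0.857 and F1 is 0.8. F1 sits between precision and recall, closer to the lower of the two.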

Implement these now.

In [None]:
def get_accuracy(tp, fp, tn, fn):
    # Your code here...
    return accuracy

def get_precision(tp, fp):
    # Your code here...
    return precision

def get_recall(tp, fn):
    # Your code here...
    return recall

def get_f1(precision, recall):
    # Your code here...
    return f1

Use the eight functions now, on the `y_test` from the dataset and the `predictions` the model made.

In [None]:
# Your code here...


How would you summarise these results to someone who wasn't familiar with these measures?

In [None]:
# Your thoughts here...


Double check your results against those calculated by functions in `sklearn.metrics`.

If they aren't the same, then check you have implemented the four base functions accurately and that you haven't accidentally typed "fp" instead of "tp" somewhere!

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy: ", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions))
print("Recall:   ", recall_score(y_test, predictions))
print("F1:       ", f1_score(y_test, predictions))

## More detailed evaluation

The functions you have implemented only look at the overall picture. However, it's often more useful to look at per-class performance.

We won't implement this here. Instead, we will use `sklearn.metrics`.

A very useful function for performing lots of evaluation is the classification report.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

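This generates many of the results you calculated, but with a few differences. It reports P/R/F1 for each class separately, followed by the macro and weighted averages. For a single-label task like this one it shows the overall accuracy rather than a micro average, since the two are equal.

The macro average is the unweighted mean of the per-class scores. The weighted average is the mean of the per-class scores, weighted by the number of true examples of each class in the data (the support).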
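
To see how the averaging schemes differ, here is a small sketch that computes macro, weighted and micro precision by hand. The per-class counts are made up, and the class names are only labels for the example.

```python
# Made-up per-class counts: (true positives, false positives, false negatives)
counts = {
    "ready": (40, 10, 5),
    "not ready": (5, 5, 10),
}

precision = {c: tp / (tp + fp) for c, (tp, fp, fn) in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}  # true examples per class
total = sum(support.values())

# Macro: plain mean of the per-class precisions
macro = sum(precision.values()) / len(precision)
# Weighted: per-class precisions weighted by support
weighted = sum(precision[c] * support[c] for c in counts) / total
# Micro: pool the counts across classes, then compute precision once
micro = sum(tp for tp, _, _ in counts.values()) / sum(tp + fp for tp, fp, _ in counts.values())

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

Here the small "not ready" class has poor precision (0.5), which drags the macro average down to 0.65, while the weighted (0.725) and micro (0.75) averages are dominated by the larger class.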

If you want control over how averages are calculated, you can do this with the `sklearn.metrics.precision_recall_fscore_support` function.

This doesn't return such a pretty table, though!

In [None]:
from sklearn.metrics import precision_recall_fscore_support

print(precision_recall_fscore_support(y_test, predictions, average=None))

print(precision_recall_fscore_support(y_test, predictions, average='micro'))

In [None]:
# Generate per-class statistics
print("Per-class stats:", *(f"{i}:({x[0]:.2f}, {x[1]:.2f})" for i, x in zip(['p', 'r', 'f1'], precision_recall_fscore_support(y_test, predictions, average=None))))
# Generate micro average
print("Micro average:", *(f"{i}:{x:.2f}" for i, x in zip(['p', 'r', 'f1'], precision_recall_fscore_support(y_test, predictions, average='micro'))))
# Generate macro average
print("Macro average:", *(f"{i}:{x:.2f}" for i, x in zip(['p', 'r', 'f1'], precision_recall_fscore_support(y_test, predictions, average='macro'))))
# Generate weighted average
print("Weighted average:", *(f"{i}:{x:.2f}" for i, x in zip(['p', 'r', 'f1'], precision_recall_fscore_support(y_test, predictions, average='weighted'))))

How would you characterise the model's per-class performance? Where are the strengths and weaknesses?

In [None]:
# Your thoughts here...


If you were a fish farmer, would you be more concerned with classifying unready fish as ready? Or ready fish as unready?

In [None]:
# Your thoughts here...


Let's say you are an especially conscientious fish farmer, so you are more concerned with making sure you don't accidentally classify unready fish as ready.

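Recall that the "1" in F1 score is the weight given to recall relative to precision: in general, $F_\beta$ treats recall as $\beta$ times as important as precision. So F0.5 would favour precision twice as much as recall, and F1.5 would favour recall 1.5 times as much as precision.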
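
The general formula is $F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$. The sketch below applies it to made-up precision and recall values; `f_beta` is a hypothetical helper written for this example, not an `sklearn` function.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0                     # avoid dividing by zero when both terms vanish
    return (1 + b2) * precision * recall / denominator

p, r = 0.9, 0.6                        # made-up values: high precision, lower recall
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

Because precision is the higher of the two here, the score falls as beta increases and recall counts for more.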

Use the `beta` argument of `precision_recall_fscore_support` to compare a range of weights from 0.1 to 2.0, to see how F score changes for your predictions. Use the `macro` average.

In [None]:
# Your code here...


Below are the values of P, R and F-beta for a range of beta values. As you can see, P and R stay constant while the F score changes: it equals P when beta is 0 and approaches R as beta grows.

In [None]:
x_vals = np.linspace(0, 8, 100)
# Compute macro-averaged precision, recall and F-beta once per beta value
scores = [precision_recall_fscore_support(y_test, predictions, beta=b, average='macro')[:3] for b in x_vals]
y_p, y_r, y_f = np.array(scores).T
sns.lineplot(x=x_vals, y=y_p, lw=4, label='Precision');
sns.lineplot(x=x_vals, y=y_r, lw=4, label='Recall');
sns.lineplot(x=x_vals, y=y_f, lw=4, label='F-beta');

## Evaluating against baselines

These figures don't really mean much on their own. We need something against which to compare the model.

Common simple baselines are predicting the most common class (here, that is `True`) or just randomly guessing.

Construct two lists, one for each of these baselines: `random_preds` and `most_common`.
(Make sure each has the same number of elements as your model's predictions!)

You might find the [`random` module](https://docs.python.org/3/library/random.html) in the Python standard library useful.

In [None]:
import random

# Your code here...


Let's use a heatmap to visualise the predictions of all three.

`sklearn.metrics.confusion_matrix` will generate the right kind of data for this, so let's look at that first.

Use `confusion_matrix` to see the results from the model and two baselines.

In [None]:
from sklearn.metrics import confusion_matrix

# Your code here...


## Visualising confusion matrices

`seaborn.heatmap` is a great function for visualising a confusion matrix.

The code below sets up a 1x3 figure, then plots three heatmaps (one per model/baseline) in them.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,4))

sns.heatmap(confusion_matrix(y_test, predictions, labels=[1,0]), ax=axes[0], annot=True, cmap='Greens', square=True, cbar=False, 
            xticklabels=['Ready', 'Not ready'], yticklabels=['Ready', 'Not ready']);
sns.heatmap(confusion_matrix(y_test, most_common, labels=[1,0]), ax=axes[1], annot=True, cmap='Greens', square=True, cbar=False, 
            xticklabels=['Ready', 'Not ready'], yticklabels=['Ready', 'Not ready']);
sns.heatmap(confusion_matrix(y_test, random_preds, labels=[1,0]), ax=axes[2], annot=True, cmap='Greens', square=True, cbar=False, 
            xticklabels=['Ready', 'Not ready'], yticklabels=['Ready', 'Not ready']);

These look pretty awful. Check [the `seaborn.heatmap` documentation](https://seaborn.pydata.org/generated/seaborn.heatmap.html) for useful configuration options.

Useful ones:

* `cmap` : lets you pick the colour scheme. See the [matplotlib colormap reference](https://matplotlib.org/3.1.1/gallery/color/colormap_reference.html) for names of colour schemes.
* `square` : If `True`, it will make all cells nice and square.
* `cbar` : Use `True`/`False` to show/hide the colour bar to the right of the heatmap
* `annot` : If `True`, it will show numbers on the cells. If you want normalised numbers (e.g. proportions), you can do this in `sklearn.metrics.confusion_matrix` first, using its `normalize` argument.
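
To show what normalising a confusion matrix means, the sketch below divides each row of a made-up 2x2 matrix by its row total. This is the kind of per-true-class normalisation that `confusion_matrix` offers through `normalize='true'`; the counts here are invented for illustration.

```python
# Made-up 2x2 confusion matrix: rows are true classes, columns are predicted classes
matrix = [
    [30, 5],    # truly ready: 30 predicted ready, 5 predicted not ready
    [10, 15],   # truly not ready: 10 predicted ready, 15 predicted not ready
]

# Divide each cell by its row total so that every row sums to 1
row_normalised = [[cell / sum(row) for cell in row] for row in matrix]

for row in row_normalised:
    print([f"{value:.2f}" for value in row])
```

Normalised this way, each row shows what proportion of a true class was predicted as each label, which is easier to compare across classes of different sizes.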

You will probably want to label your cells with the right names. In that case, pass the correct names with `xticklabels` and `yticklabels`.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,4))

# Copy the heatmap code from above and edit it using the arguments described above,
# to make them look nicer and easier to read. 
# Use ax=axes[0], ax=axes[1], ax=axes[2] when calling `heatmap`
# to place that heatmap in a specific column of the figure.

# Your code here...


## Looking at prediction probabilities

Many models provide information about how they made classification decisions. `sklearn` classifiers generally have a `predict_proba` method. Rather than returning a list of predicted class labels, as with `predict`, it returns a 2D array with one row per prediction, where each row contains the probabilities generated for each class (in the order given by `model.classes_`).

From the NB model we stored in `model` earlier, get the probabilities for the `x_test` set and look at the first few.

In [None]:
# Your code here...


Calculate the log loss for these predictions against the true labels in `y_test`.
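
As a reminder of what this metric measures, here is a minimal hand-written version of binary log loss on made-up probabilities. `binary_log_loss` is a hypothetical helper for this example, and the clipping value `eps` is an assumption; `sklearn.metrics.log_loss` may handle extreme probabilities differently depending on the version.

```python
from math import log

def binary_log_loss(truth, prob_true, eps=1e-15):
    """Mean negative log-likelihood of the true labels (hypothetical helper)."""
    total = 0.0
    for y, p in zip(truth, prob_true):
        p = min(max(p, eps), 1 - eps)            # clip to avoid log(0)
        total += -log(p) if y else -log(1 - p)   # penalise the probability given to the wrong class
    return total / len(truth)

truth = [True, True, False]          # made-up labels
cautious = [0.7, 0.6, 0.4]           # hedged probabilities for the True class
overconfident = [1.0, 0.0, 0.0]      # hard 0/1 guesses, the second one wrong

print(binary_log_loss(truth, cautious))
print(binary_log_loss(truth, overconfident))
```

A single confident mistake dominates the second score, which is why hard 0/1 "probabilities" tend to fare badly under log loss.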

In [None]:
from sklearn.metrics import log_loss

# Your code here...


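Let's compare this to our "most common" baseline. It has no real probabilities, so we treat each of its predictions as fully confident. The columns follow the sorted class labels (`False`, then `True`), so the probabilities will always be $[0, 1]$ if it predicts `True`, otherwise $[1, 0]$.

In [None]:
# Use the baseline's predictions here, not the model's
most_common_probs = [[0, 1] if p else [1, 0] for p in most_common]

log_loss(y_test, most_common_probs)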

Unsurprisingly, the NB model has much lower loss. Why do you suppose that is the case?

In [None]:
# Your thoughts here...


## Going further

How would you characterise the NB model compared to the baselines? And does using only weight to classify fish as ready/not ready seem reasonable to you? Or do you have to carry on doing it manually?

If you were to implement this model, how would you evaluate it extrinsically? What additional data would you need?

In [None]:
# Your thoughts here...


## Extensions

* Look at the Species column for the fish data. How would you go about predicting that? What problems do you foresee having?
* Would model performance improve if you used additional features besides weight?
    * If so, how much do you gain?
    * Would it be worth the extra work of measuring fish with a ruler?
* How do different classification algorithms perform?

## Task 2: How much is this fish worth?

As well as knowing whether a fish is ready for sale, you know exactly how much each one is worth. You've recorded this in the `Value` column of your data:

In [None]:
data.head()

Could you predict `Value` from `Weight`? If so, you might never have to look at another fish ever again.

In [None]:
from sklearn.linear_model import LinearRegression

x = data[['Weight']]
y = data['Value']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(x_train, y_train)

predictions = model.predict(x_test)

print(predictions)

How do these predictions compare to the expected values?

In [None]:
# Your code here...


Recall that for $n$ model predictions $\hat{y}_i \in \hat{Y}$ and ground truth values $y_i \in Y$, the formula for Mean Squared Error is: $$\frac{1}{n} \sum_{i=0}^{n - 1} (y_i - \hat{y}_i)^2$$
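
As a quick worked example with made-up numbers (not taken from the fish data), take ground truth $Y = (3, 5, 2)$ and predictions $\hat{Y} = (2, 5, 4)$, so $n = 3$:

$$\text{MSE} = \frac{(3 - 2)^2 + (5 - 5)^2 + (2 - 4)^2}{3} = \frac{1 + 0 + 4}{3} = \frac{5}{3} \approx 1.67$$

Note how the single error of 2 contributes four times as much as the error of 1, because the errors are squared.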

Implement a function to calculate this, given the true values and the predictions.

In [None]:
def mse(ground_truth, predictions):
    # Your code here...
    return mean_square_error

Calculate MSE for `y_test` and `predictions` using your function. Compare it to the output of `sklearn.metrics.mean_squared_error` to see if they are the same.

In [None]:
# Your code here...


And now implement RMSE.

In [None]:
from math import sqrt

def rmse(ground_truth, predictions):
    # Your code here...
    return rmse_output

print(rmse(y_test, predictions))

## Interpreting regression evaluation

The RMSE gives an idea of how close the predictions are using the original units. Is being within roughly £2 of the correct price good or bad? It depends on the task. If you are predicting house prices, being off by £2 is probably good. If you are losing £2 on a lot of fish sales, it could be the end of your fish farm.

In general, regression evaluations require you to use your domain expertise to interpret them. But because they are error scores, lower is always better, which makes it easy to compare models (trained on the same data).

Generate two baselines:

* The mean of all fish values in the training data
* Some random prices with the same mean and standard deviation as the training data values

Remember to make sure you have the same number of predictions in your baseline as you do from your model.

In [None]:
# Your code here...


Evaluate these two baselines and your model using RMSE. How do they compare?

In [None]:
model_rmse = rmse(y_test, predictions)

mean_fish_rmse = rmse(y_test, mean_fish)

random_fish_rmse = rmse(y_test, random_fish)

print(f"Regression: {model_rmse:.3f}\t Mean: {mean_fish_rmse:.3f}\t Random: {random_fish_rmse:.3f}\n\n")

# Your thoughts here...


## Visualising linear relations for evaluation

It may be useful to visualise the `Value` data, to see the distribution we are trying to capture, using `seaborn.histplot`. This will show a histogram and a density estimation.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(16,4), sharey=False)

sns.histplot(y_test, ax=axes[0], kde=True);
sns.histplot(predictions, ax=axes[1], kde=True);
sns.histplot(random_fish, ax=axes[2], kde=True);
sns.histplot(mean_fish, ax=axes[3], kde=False);

axes[0].set_title('True values')
axes[1].set_title('Linear regression model predictions')
axes[2].set_title('Random model predictions')
axes[3].set_title('Mean model predictions')

Edit the above code to instead show a `seaborn.regplot` to compare the predictions to the true values. This will fit a regression line with confidence intervals: the narrower the CI at a point on the regression line, the more constrained predictions are for y, given that x.

(You can't fit a regression to a horizontal line in `seaborn`, so for `mean_fish` set `fit_reg=False`)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,4), sharey=True)

# Your code here...


How would you interpret this information?

In [None]:
# Your thoughts here...


We can evaluate how good the relation between prediction and truth is, using the `r2_score` function from `sklearn.metrics`.

Use this to evaluate the three models and get a better picture of how each model fits the data.
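
To make the score easier to read, here is a hand-written sketch of the usual definition, $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, applied to made-up values. `r_squared` is a hypothetical helper for this example and assumes the true values are not all identical.

```python
def r_squared(truth, preds):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper; assumes truth is not constant)."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((y - p) ** 2 for y, p in zip(truth, preds))   # error left after predicting
    ss_tot = sum((y - mean_truth) ** 2 for y in truth)         # spread of the true values
    return 1 - ss_res / ss_tot

truth = [2.0, 4.0, 6.0, 8.0]             # made-up values
close = [2.5, 3.5, 6.5, 7.5]             # near the truth
constant = [5.0] * 4                     # always predicts the mean of `truth`
reversed_preds = [8.0, 6.0, 4.0, 2.0]    # worse than predicting the mean

for name, preds in [("close", close), ("constant", constant), ("reversed", reversed_preds)]:
    print(name, r_squared(truth, preds))
```

A score near 1 means the predictions explain most of the variation, 0 means no better than always predicting the mean of the true values, and negative scores mean worse than that.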

In [None]:
from sklearn.metrics import r2_score

# Your code here...


How would you interpret these values?

In [None]:
# Your thoughts here...


## Extensions

* Would using additional features, besides weight, improve the predictive performance of the model?
* And again, what about different regression algorithms?