## Performance metrics

When we train and evaluate supervised models, we want them to predict as well as possible. To judge this, we need metrics, i.e. numeric values that describe how good the model's performance is, so that we can say which model performs best. The algorithms are designed to find the parameters that lead to the best possible metric value on the training set, and the test set is then used to check how well the model performs on data it has not seen.

For this purpose, there are a number of metrics, and it is important to understand their differences in order to select the one that fits your purpose. It may also be a good idea to use more than one metric.

### Regression

Regression model performance is evaluated through the difference between the real value and the predicted value: the smaller the difference, the better the model. This is the same idea as computing the best fit for a linear model, for example.

The most popular regression model performance metrics are:
- Mean Squared Error (MSE): The average of the squared difference between the real value and the predicted value.
- Root Mean Squared Error (RMSE): The square root of the MSE. It is in the same unit as the original variable and therefore easier to interpret.
- Mean Absolute Error (MAE): The average of the absolute error (i.e. all errors computed as positive values, regardless of whether the prediction is too small or too big).
- R^2: The variance explained by the model divided by the total variance.

MSE and RMSE penalize large errors more severely than small ones, i.e. one error of 10 is worse than two errors of 5. MAE penalizes all errors in proportion to their magnitude. R^2 is not an error metric in the strict sense, as it measures the share of variance explained rather than the size of the errors, but it is often used as one.
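
The following is a minimal sketch of how these four metrics could be computed with NumPy. The function name `regression_metrics` and the example values are made up for illustration, and R^2 is computed in the common form 1 - (residual sum of squares / total sum of squares), which is an assumption about the exact definition intended here.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Made-up values: both prediction sets have a total absolute error of 10.
y_true = [10.0, 20.0, 30.0, 40.0]
one_big_error = [10.0, 20.0, 30.0, 50.0]     # a single error of 10
two_small_errors = [10.0, 20.0, 35.0, 45.0]  # two errors of 5

big = regression_metrics(y_true, one_big_error)
small = regression_metrics(y_true, two_small_errors)
print(big)
print(small)
```

With these values, the MSE is 25 for the single large error and 12.5 for the two smaller ones, while the MAE is 2.5 in both cases. This is the difference in penalization described above.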

You can read more [here](https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914). 

### Classification
To understand the performance metrics, let's first consider a binary classification case (we want to predict a yes/no answer, such as "is this a potential flying squirrel habitat?"). In this case, there are four possible outcomes: we classify a real positive value as positive; we classify a real positive value as negative; we classify a real negative value as negative; or we classify a real negative value as positive.

- The values that are predicted as positive and are actually positive are *true positives*.
- The values that are predicted as positive but are actually negative are *false positives*.
- The values that are predicted as negative and are actually negative are *true negatives*.
- The values that are predicted as negative but are actually positive are *false negatives*. 

In this terminology, positive/negative refers to the classifier result (*what we think*), and true/false to whether that was correct or not.

By calculating ratios between these four classes, we get the different performance metrics. 
- Precision: What proportion of the positives given by our model are actually positives? 
- Recall (also called sensitivity or true positive rate): What proportion of the real positives are correctly classified as positive?
- Specificity (or true negative rate): What proportion of the real negatives are correctly classified as negative?
- Accuracy: What proportion of all values are classified correctly?
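
As a rough sketch, the four outcome counts and the ratios listed above could be computed like this in plain Python. The function names and the label vectors are invented for illustration, labels are assumed to be coded as 1 (positive) and 0 (negative), and the sketch assumes that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Ratios between the four outcome counts (assumes nonzero denominators)."""
    return {
        "precision": tp / (tp + fp),      # share of predicted positives that are real
        "recall": tp / (tp + fn),         # share of real positives that are found
        "specificity": tn / (tn + fp),    # share of real negatives that are found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Made-up example: 4 real positives and 6 real negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

Here one real positive is missed and one real negative is flagged, which gives a precision and recall of 0.75 and an accuracy of 0.8.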


<img src="img/TP-TN-FP-FN-1.png"/>

There are a number of other metrics, such as the **F1 score** (or **F measure**), which seeks a balance between precision and recall. It is computed as (2 * Precision * Recall) / (Precision + Recall). The F1 score is popular for imbalanced data sets.

If the model gives a score or probability of class membership, different cut-off points can be defined (e.g. do we call this positive if there's a 50 % probability for it to be positive, or only after 70 %?). In that case, ROC curves can be used to evaluate the performance and to look for the best cut-off point.

If you wish, you can read more about performance metrics [here](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226) and [here](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/).
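
To make the F1 formula and the idea of cut-off points more concrete, here is a small sketch in plain Python. The helper names `f1_score` and `roc_points`, the scores, and the cut-offs are all hypothetical. Each cut-off yields one (false positive rate, true positive rate) pair, and a ROC curve is what you get by plotting these pairs for many cut-offs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def roc_points(y_true, scores, cutoffs):
    """Return (cutoff, false positive rate, true positive rate) per cutoff."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for cutoff in cutoffs:
        # Call an observation positive when its score reaches the cutoff.
        y_pred = [1 if s >= cutoff else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        points.append((cutoff, fp / n_neg, tp / n_pos))
    return points

# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.65, 0.3, 0.2, 0.1]

points = roc_points(y_true, scores, cutoffs=[0.5, 0.7])
for cutoff, fpr, tpr in points:
    print(f"cutoff={cutoff}: FPR={fpr}, TPR={tpr}")

print(f1_score(precision=0.5, recall=1.0))
```

In this toy data, raising the cut-off from 0.5 to 0.7 removes the only false positive but also lowers the true positive rate from 0.75 to 0.5.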

The correct performance metric depends on the purpose of the model. Often it is more important to catch all positives than to maximize overall accuracy, even if we get a higher number of false positives as a side effect. For example, we want to be sure that we have identified all meadows that may host an endangered species, so we can protect them; we don't mind terribly if we protect some other meadows as well. In other cases, it may be essential to avoid false positives: when we are classifying mushrooms as edible (positive) or non-edible (poisonous, negative), we want to be very sure not to classify a non-edible mushroom as edible, even if that means that some edible ones are erroneously classified as non-edible. In these cases, recall and specificity may be much better metrics than precision or accuracy.

The distribution of positives and negatives also needs to be taken into account when selecting the performance measure. For example, if the positive and negative cases are very imbalanced, say only 5 % of the studied habitats are suitable for the flying squirrel, we would reach 95 % accuracy simply by predicting that all habitats are unsuitable. We would be right 95 % of the time only because of the data distribution! In this case, recall might be a much better metric.
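
The imbalance problem can be checked with a few lines of Python. This is only a sketch with invented data: 100 habitats of which 5 are suitable, and a "model" that always answers unsuitable.

```python
# 100 made-up habitats: 5 suitable (1) and 95 unsuitable (0).
y_true = [1] * 5 + [0] * 95

# A trivial "model" that predicts unsuitable for every habitat.
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

The accuracy is 0.95 although the model never finds a single suitable habitat, which shows up as a recall of 0.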

It's important to notice that, depending on the data and the classifier model, there is a limit to the accuracy that can be reached. Some observations will be misclassified. However, we can build a model that optimizes the performance metric that is important to us, e.g. in the edible mushroom example, a high true negative rate (non-edible mushrooms are classified as non-edible with high reliability). This usually means that we will also get more false negatives, i.e. edible mushrooms classified as non-edible. The desired balance between the different types of errors depends on the purpose of the model.

Multi-class classifier performance metrics are usually variants of these binary metrics.


# Further reading


## Neural networks and deep learning

Artificial neural networks (ANN or NN) are a fast-developing, flexible family of machine learning methods. They consist of units called artificial neurons, and links (edges) between these units. The neurons are typically arranged in layers: an input layer, an output layer, and one or more hidden layers. The neurons are linked to each other, receive input, and produce output that is passed on to other neurons.

<img src="img/Colored_neural_network.svg">
Figure by Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461

A neuron's *activation function* determines its output based on the inputs it receives. There are a number of different activation functions with different mathematical properties, but the important thing to realize is that nonlinear activation functions (such as the logistic (sigmoid) function) allow the ANN to learn nonlinear responses.
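
As an illustration, a single artificial neuron with a sigmoid activation could be sketched as follows. The weighted-sum-plus-bias formulation is the usual textbook one and is assumed here; the function names, inputs, weights, and bias are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return activation(np.dot(weights, inputs) + bias)

# Made-up inputs and parameters for one neuron with three incoming links.
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, -0.2])
bias = 0.1

out = neuron_output(inputs, weights, bias)
print(out)

# With an identity activation the neuron is just a linear function of its inputs.
linear_out = neuron_output(inputs, weights, bias, activation=lambda z: z)
print(linear_out)
```

With the identity activation, stacking such neurons would still only produce linear functions of the inputs. The nonlinear activation is what lets a network represent nonlinear responses.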

There are also many different ANN architectures, i.e. ways that the neurons are connected to each other. The figure above describes the simplest, "basic" ANN. Recurrent neural networks (RNN) are suited for the analysis of text, audio, and time series, as they process the data one step at a time and can therefore capture sequential information in the data. Convolutional neural networks (CNN), on the other hand, are highly useful for image analysis, since they capture spatial features (such as the positions of eyes, mouth, and nose in a portrait) from an image.

Deep learning, or deep neural networks (DNN), basically refers to neural networks with multiple layers. There is no clear definition for deep learning, but DNNs can be, for example, large RNNs and CNNs. DNNs are powerful with large data sets, and excel with data that are not in the form of classical data tables, but e.g. images, text documents, or audio.

ANNs are often complex, and it is very difficult to diagnose how they come to the conclusions they do (compare this with, for example, the tree-based classification, in which it was easy to see how the conclusions are drawn). This means that if the data is biased, the model result may be biased, and it can be hard to notice that this has happened. For example, if the cat images often have a brown background and the dog images a green background, the algorithm may learn to classify the backgrounds and not the animal species.


## How to handle uncertainty: Bayesian machine learning


Classical ML methods tend to give their predictions without indicating how certain or uncertain the answers may be: the habitat is classified as suitable or non-suitable, and there's no way to tell whether two habitats are suitable with equal certainty, or if one is more likely to be suitable than the other.
