![pics/overview](pics/overview.png)


# Regression metrics 

![rm_1](pics/rm_1.png)
![rm_1](pics/rm_2.png)
![rm_1](pics/rm_3.png)

MSE is particularly sensitive to outliers. 

![rm_1](pics/rm_4.png)

In fact, MSE is a little easier to work with, so most people use MSE instead of RMSE. There is, however, a small difference between the two for gradient-based models. Take a look at the gradient of RMSE with respect to the i-th prediction: it is equal to the gradient of MSE multiplied by some value, and that value doesn't depend on the index i. This means that travelling along the MSE gradient is equivalent to travelling along the RMSE gradient, but with a different learning rate, and that learning rate depends on the MSE score itself, so it is dynamic. So even though RMSE and MSE are very similar in terms of model scoring, they are not immediately interchangeable for gradient-based methods. We will probably need to adjust some parameters, such as the learning rate.
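A small numeric check of this relationship may help. The sketch below is only an illustration in plain Python: the function names and the toy data are made up, and the gradients are approximated with finite differences rather than taken from any library.

```python
import math

def mse(y, y_hat):
    return sum((p - t) ** 2 for t, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))

def numeric_grad(metric, y, y_hat, eps=1e-6):
    # Central finite differences with respect to each prediction.
    grad = []
    for i in range(len(y_hat)):
        up = y_hat[:i] + [y_hat[i] + eps] + y_hat[i + 1:]
        down = y_hat[:i] + [y_hat[i] - eps] + y_hat[i + 1:]
        grad.append((metric(y, up) - metric(y, down)) / (2 * eps))
    return grad

y = [3.0, 5.0, 2.0]        # toy targets
y_hat = [2.5, 6.0, 2.4]    # toy predictions

g_mse = numeric_grad(mse, y, y_hat)
g_rmse = numeric_grad(rmse, y, y_hat)

# The same factor 1 / (2 * sqrt(MSE)) applies to every index i.
scale = 1 / (2 * math.sqrt(mse(y, y_hat)))

for gm, gr in zip(g_mse, g_rmse):
    print(f"RMSE grad {gr:.6f}  vs  scaled MSE grad {scale * gm:.6f}")
```

The factor changes as MSE changes during training, which is why the effective learning rate is dynamic.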

Actually, it is hard to tell whether our model is good or not by looking at the absolute values of MSE or RMSE. It really depends on the properties of the dataset and its target vector, in particular on how much variation there is in the target vector.

**We would probably want to measure how much our model is better than the constant baseline.**

![rm_1](pics/rm_5.png)

When the MSE of our predictions is zero, R_squared is 1, and when our MSE is equal to the MSE of the constant model, R_squared is zero, because the values in the numerator and denominator are the same. All reasonable models will score between 0 and 1. The most important thing for us is that to optimize R_squared, we can optimize MSE. It is absolutely equivalent, since R_squared is basically the MSE score divided by a constant and subtracted from another constant.
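As a rough illustration, R_squared can be written directly in terms of MSE. The sketch assumes the constant baseline is the mean of the targets (the best constant for MSE); the function names are illustrative.

```python
def mse(y, y_hat):
    return sum((p - t) ** 2 for t, p in zip(y, y_hat)) / len(y)

def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    baseline_mse = mse(y, [mean_y] * len(y))  # MSE of the constant model
    return 1 - mse(y, y_hat) / baseline_mse

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)            # MSE is zero
constant = r_squared(y, [2.5] * 4)   # predicting the mean everywhere

print(perfect, constant)
```

Because `baseline_mse` depends only on the targets, minimizing MSE and maximizing R_squared pick the same model.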

![rm_1](pics/rm_6.png)

What is important about this metric is that it does not penalize huge errors as badly as MSE does. Thus, it is less sensitive to outliers than mean squared error. It also has slightly different applications than MSE. MAE is widely used in finance, where a $\$$10 error is usually exactly two times worse than a $\$$5 error. The MSE metric, on the other hand, considers a $\$$10 error to be four times worse than a $\$$5 error. MAE is easier to justify, and if you used RMSE, it would become really hard to explain to your boss how you evaluated your model. What constant is optimal for MAE? It is quite easy to show that it is the median of the target values.
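The claim about the median can be checked with a brute-force search over constants. This is only a sketch on made-up data with one outlier; the grid of candidates is an arbitrary choice.

```python
import statistics

def mae_of_constant(y, c):
    return sum(abs(t - c) for t in y) / len(y)

y = [1.0, 2.0, 3.0, 4.0, 100.0]                 # one outlier
candidates = [i / 10 for i in range(0, 1001)]   # grid 0.0, 0.1, ..., 100.0

best_constant = min(candidates, key=lambda c: mae_of_constant(y, c))

print("best constant for MAE:", best_constant)
print("median:", statistics.median(y))
print("mean:  ", statistics.mean(y))
```

The outlier drags the mean far away from the bulk of the data, while the MAE-optimal constant stays at the median.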

![rm_1](pics/rm_7.png)

Another important thing about MAE is its gradient with respect to the predictions. The gradient is a step function: it takes the value -1 when Y_hat is smaller than the target and +1 when it is larger. The gradient is not defined when the prediction is perfect, that is, when Y_hat is equal to Y. So formally, MAE is not differentiable, but in practice, how often do your predictions perfectly match the target?

![rm_1](pics/rm_8.png)
![rm_1](pics/rm_10.png)

![rm_1](pics/rm_11.png)
![rm_1](pics/rm_12.png)
![rm_1](pics/rm_13.png)
![rm_1](pics/rm_14.png)
![rm_1](pics/rm_15.png)
So, this metric is usually used in the same situations as MSPE and MAPE, as it also cares about relative errors more than about absolute ones.
 
 ![rm_1](pics/rm_16.png)
But note the asymmetry of the error curves. From the perspective of RMSLE, it is always better to predict more than the target than to predict the same amount less.
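The asymmetry is easy to see numerically. The sketch below uses the per-object squared logarithmic error (the version without the root), with the logarithm taken of the value plus one; the target and the offset of 50 are arbitrary.

```python
import math

def squared_log_error(y, y_hat):
    return (math.log1p(y_hat) - math.log1p(y)) ** 2

target = 100.0
over = squared_log_error(target, target + 50)    # predict 50 more
under = squared_log_error(target, target - 50)   # predict 50 less

print(f"over-prediction:  {over:.4f}")
print(f"under-prediction: {under:.4f}")
```

The same absolute miss costs noticeably more when it is an under-prediction.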

Just as root mean squared error doesn't differ much from mean squared error, RMSLE can be calculated without the root operation, but the rooted version is more widely used. It is important to know that the plot we see here on the slide is built for the version without the root. For the rooted version, an analogous plot would be misleading.

![rm_1](pics/rm_17.png)
![rm_1](pics/rm_18.png)

![rm_1](pics/rm_19.png)
 
MSE is quite biased towards the huge values in our dataset, while MAE is much less biased. MSPE and MAPE are biased towards smaller targets because they assign higher weight to objects with small targets. RMSLE is frequently considered a better metric than MAPE, since it is less biased towards small targets, yet works with relative errors. I strongly encourage you to think about the baseline for any metric that you face for the first time.

# Classification metrics 

![cm_1.png](pics/cm_1.png)
The problem is that the baseline accuracy can be very high for a dataset, even 99%, and that makes it hard to interpret the results. Although the accuracy score is very clean and intuitive, it turns out to be quite hard to optimize.

Accuracy also doesn't care how confident the classifier is in its predictions, or what the soft predictions are.

Thus, people sometimes prefer to use different metrics that are, first, easier to optimize and, second, work with soft predictions rather than hard ones.

![cm_1.png](pics/cm_2.png)

And finally, it should be mentioned that, to avoid taking the logarithm of zero, in practice predictions are clipped to lie not in the range from 0 to 1, but between some small positive number and 1 minus that small positive number.
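A minimal sketch of binary logloss with clipping might look as follows. The value `eps=1e-15` is an assumption chosen for illustration, not something prescribed by the slides.

```python
import math

def logloss(y_true, y_prob, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip into [eps, 1 - eps]
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

small_mistakes = logloss([1, 0, 1, 0], [0.7, 0.3, 0.7, 0.3])
one_big_mistake = logloss([1, 0, 1, 0], [1.0, 0.0, 1.0, 1.0])

print(small_mistakes, one_big_mistake)
```

Without clipping, the second call would try to evaluate the logarithm of zero. With clipping, the loss stays finite but is still dominated by the single confident wrong answer.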

![cm_1.png](pics/cm_3.png)
As we can see, it prefers to make many small mistakes rather than one big one.

![cm_1.png](pics/cm_4.png)

Recall that to compute the accuracy score for a binary task, we usually take soft predictions from our model and apply a threshold.

The AUC metric, in effect, tries all possible thresholds and aggregates the resulting scores.

![cm_1.png](pics/cm_5.png)

Actually, there are several ways AUC, the area under the curve, can be explained.
1. The first one explains under which curve we should compute the area.
2. The second explains AUC as the probability that a pair of objects is correctly ordered by our model.

**First Explanation** 

Actually, it is very simple to build the receiver operating characteristic (ROC) curve: we start from the bottom left corner and go up every time we see a red point, and right when we see a green one. Let's see. We stand on the leftmost point first. It is red, or positive, so we increase the number of true positives and move up. Next, we jump to the green point. It is a false positive, so we go right. Then two times up for the two red points, and finally two times right for the last two green points. We finish in the top right corner. It always works like that: we start from the bottom left and end up in the top right corner when we jump to the rightmost point.

![cm_1.png](pics/cm_6.png)

And now we are ready to calculate the area under this curve. The area is seven, and we need to normalize it by the total area of the square, so AUC is 7/9.

AUC doesn't need a threshold to be specified, and it doesn't depend on the absolute values of the predictions, only on their ordering.

**Second Explanation** 

Consider all pairs of objects such that one object is from the red class and the other is from the green class. AUC is the probability that the score for the green one will be higher than the score for the red one. In other words, AUC is the fraction of correctly ordered pairs. In our example we have two incorrectly ordered pairs and nine pairs in total, so seven of the nine pairs are ordered correctly and AUC is again 7/9.
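The pair-counting view translates into a few lines of code. The sketch below avoids the colours and simply uses label 1 for the class that should receive the higher score. The six scores are made up so that three objects of each class appear in the same order as in the example, and ties are counted as half a correct pair, which is a common convention but an assumption here.

```python
from fractions import Fraction

def pairwise_auc(labels, scores):
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    correct = sum(1 for p in pos for n in neg if p > n)
    ties = sum(1 for p in pos for n in neg if p == n)
    # Each tie counts as half of a correctly ordered pair.
    return Fraction(2 * correct + ties, 2 * len(pos) * len(neg))

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

auc = pairwise_auc(labels, scores)
print(auc)
```

Note that only comparisons between scores are used, so any monotonic rescaling of the predictions leaves the result unchanged.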

![cm_1.png](pics/cm_7.png)

![cm_8.png](pics/cm_8.png)

![cm_9.png](pics/cm_9.png)

Recall that if we always predict the label of the most frequent class, we can already get a pretty high accuracy score, and that can be misleading.

In Cohen's Kappa we take another value as the baseline. We take the hard predictions for the dataset and shuffle them, that is, randomly permute them. Then we calculate the accuracy of these shuffled predictions, and that will be our baseline. To be precise, we permute and calculate the accuracy many times and take the average of those computed accuracies as the baseline.

![cm_9.png](pics/cm_10.png)

We need, first, to multiply the empirical frequencies of our predictions and of the ground truth labels for each class, and then sum them up. For example, if we assign 20 cat labels and 80 dog labels at random to a dataset of 10 cats and 90 dogs, then the baseline accuracy will be 0.2 × 0.1 + 0.8 × 0.9 = 0.74.

We can also recall that error is equal to 1 minus accuracy, so we can rewrite the formula as 1 minus the model's error divided by the baseline error.
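To make the arithmetic concrete, here is a rough sketch that computes the baseline and the Kappa score as 1 minus the ratio of the model's error to the baseline error. The dataset of 10 cats and 90 dogs and the particular predictions are hypothetical, chosen so that the baseline comes out as 0.74.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected accuracy of randomly permuted predictions.
    baseline = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    kappa = 1 - (1 - accuracy) / (1 - baseline)
    return kappa, baseline

y_true = ["cat"] * 10 + ["dog"] * 90
y_pred = ["cat"] * 20 + ["dog"] * 80   # 10 dogs are mislabelled as cats

kappa, baseline = cohen_kappa(y_true, y_pred)
print(f"baseline accuracy: {baseline:.2f}, kappa: {kappa:.3f}")
```

Here the plain accuracy is 0.9, which sounds good, but relative to the 0.74 baseline the model removes only a bit more than 60% of the baseline error.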

![cm_9.png](pics/cm_11.png)


To explain weighted Kappa, we first need to take a step aside and introduce weighted error. Say we now have cats, dogs and tigers to classify. We are more or less okay if we predict dog instead of cat, but it is undesirable to predict cat or dog if it is really a tiger. So we form a weight matrix where each cell contains the weight for the mistake we might make.
 
 ![cm_9.png](pics/cm_12.png)


 ![cm_9.png](pics/cm_13.png)


Now, to calculate the weighted error we need another matrix, the confusion matrix for the classifier's predictions.

This matrix shows how our classifier distributes the predictions over the objects. For example, the first column indicates that four cats out of ten were recognized correctly, two were classified as dogs and four as tigers. So to get a weighted error score, we need to multiply these two matrices element-wise and sum the results.
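The element-wise product and sum can be sketched as below. Only the first column of the confusion matrix (4, 2, 4) comes from the text; the weights and the other two columns are hypothetical values picked for illustration, with rows standing for the predicted class and columns for the true class.

```python
# Rows: predicted class, columns: true class, both in the order (cat, dog, tiger).
weights = [
    [0, 1, 10],   # predicted cat: mild penalty if really a dog, heavy if a tiger
    [1, 0, 10],   # predicted dog
    [1, 1, 0],    # predicted tiger
]
confusion = [
    [4, 2, 3],
    [2, 3, 8],
    [4, 5, 9],
]

weighted_error = sum(
    w * c
    for w_row, c_row in zip(weights, confusion)
    for w, c in zip(w_row, c_row)
)
print(weighted_error)
```

Correct predictions sit on the diagonal and carry zero weight, so they do not contribute to the score.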

![cm_14.png](pics/cm_14.png)

In many cases, the weight matrices are defined in a very simple way, for example for classification problems with ordered labels.

Say you need to assign each object a value from 1 to 3. It can be, for instance, a rating of how severe a disease is. It is not regression, since the output values are not allowed to fall somewhere between the ratings, and the ground truth values also look more like labels than like numeric values to predict.

So such problems are usually treated as classification problems, but a weight matrix is introduced to account for the order of the labels.


# General approaches for metrics  

![m_l.png](pics/m_l.png)
![m_l.png](pics/m_l_1.png)
![m_l.png](pics/m_l_2.png)

Thankfully, there is a method that always works. It is called **early stopping**, and it is very simple. You set a model to optimize any loss function it can optimize, and you monitor the desired metric on a validation set. You stop the training when the model starts to overfit according to the desired metric, not according to the metric the model is truly optimizing. That is important. Of course, some metrics cannot even be easily evaluated. For example, if the metric is based on human assessors' opinions, you cannot evaluate it on every iteration. For such metrics we cannot use early stopping, but we will never find such metrics in a competition.
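The stopping rule itself can be sketched independently of any library. In the example below, `val_metric` is a made-up list of validation scores for the desired metric (lower is better), one per training iteration, and `patience` is an assumed parameter: the number of iterations without improvement that we tolerate before stopping.

```python
def best_iteration(val_scores, patience=3):
    """Return the iteration with the best validation score seen before stopping."""
    best_idx, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break   # no improvement for `patience` iterations: stop training
    return best_idx

# Desired metric on the validation set after each training iteration (hypothetical).
val_metric = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.50]

stop_at = best_iteration(val_metric, patience=3)
print(stop_at)
```

The model keeps optimizing its own loss throughout; only the decision when to stop is driven by the desired metric.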

![m_l.png](pics/m_l_3.png)
![m_l.png](pics/m_l_4.png)



# Regression metrics optimization

![r_m_1.png](pics/r_m_1.png)
![r_m_1.png](pics/r_m_2.png)
![r_m_1.png](pics/r_m_3.png)
![r_m_1.png](pics/r_m_4.png)
![r_m_1.png](pics/r_m_5.png)
![r_m_1.png](pics/r_m_6.png)

MSPE and MAPE. It is much harder to find a model that can optimize them out of the box. Of course, we can always either implement a custom loss for XGBoost or a neural net, which is really easy to do there, or optimize a different metric and do early stopping.

![r_m_1.png](pics/r_m_7.png)
![r_m_1.png](pics/r_m_8.png)

This approach is based on the fact that MSPE is a weighted version of MSE and MAPE is a weighted version of MAE. On the right side, we see the expressions for the sample weights for MSPE and MAPE. The sum in the denominator just ensures that the weights sum up to 1, but it is not required.

Intuitively, the sample weights indicate how important the object is for us while training the model.

The smaller the target, the more important the object.

So, how do we use this knowledge?

![r_m_1.png](pics/r_m_9.png)

And the model will actually optimize the desired MSPE loss. Although most important libraries, such as XGBoost, LightGBM and most neural net packages, support sample weighting, not every library implements it.
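A small sketch of the weights, assuming the usual definitions (weights proportional to 1/y² for MSPE and to 1/|y| for MAPE, normalized to sum to 1). It also checks on toy data that MSPE and the weighted MSE differ only by a factor that does not depend on the predictions.

```python
def mspe_weights(y):
    inv = [1 / t ** 2 for t in y]
    total = sum(inv)
    return [v / total for v in inv]

def mape_weights(y):
    inv = [1 / abs(t) for t in y]
    total = sum(inv)
    return [v / total for v in inv]

def mspe(y, y_hat):
    return 100 / len(y) * sum(((t - p) / t) ** 2 for t, p in zip(y, y_hat))

def weighted_mse(y, y_hat, w):
    return sum(wi * (t - p) ** 2 for wi, t, p in zip(w, y, y_hat))

y = [10.0, 100.0, 1000.0]
w = mspe_weights(y)

preds_a = [12.0, 90.0, 1100.0]
preds_b = [9.0, 120.0, 950.0]
ratio_a = mspe(y, preds_a) / weighted_mse(y, preds_a, w)
ratio_b = mspe(y, preds_b) / weighted_mse(y, preds_b, w)

print(w)
print(ratio_a, ratio_b)
```

In a library that supports sample weighting, a list like `w` would typically be passed as the sample weight argument when fitting with MSE loss.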

But there is another method which works whenever a library can optimize MSE or MAE. Nothing else is needed.

All we need to do is create a new training set by sampling from the original set that we have, and fit a model with, for example, the MSE criterion if we want to optimize MSPE.

It is important to set the probability of each object being sampled to the weights we've calculated.

The size of the new dataset is up to you. You can sample, for example, twice as many objects as there were in the original train set.

And note that we do not need to do anything with the test set. It stays as is.

I would also advise you to resample the train set several times, each time fitting a model, and then average the models' predictions. This way we will get a score that is much better and more stable.
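The resampling step could be sketched like this with the standard library. The feature rows, the seed handling and the helper name `resample` are illustrative, and the weights are the normalized 1/y² values that correspond to MSPE.

```python
import random

def resample(X, y, weights, size, seed=0):
    rng = random.Random(seed)
    # Sample indices with replacement, with probability proportional to the weights.
    idx = rng.choices(range(len(y)), weights=weights, k=size)
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[0.1], [0.2], [0.3]]          # toy feature rows
y = [10.0, 100.0, 1000.0]          # toy targets

inv = [1 / t ** 2 for t in y]
w = [v / sum(inv) for v in inv]    # MSPE-style sampling probabilities

# Twice as many objects as in the original train set.
X_new, y_new = resample(X, y, w, size=2 * len(y))
print(y_new)
```

Calling `resample` with several different seeds, fitting a model on each sample and averaging the predictions gives the more stable variant described above.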

There is another way we can optimize MSPE. This approach was widely used during the Rossmann competition on Kaggle. It can be proved that if the errors are small, we can optimize the predictions in logarithmic scale. It is similar to what we will do on the next slide, actually.

![r_m_1.png](pics/r_m_10.png)

And finally, let's get to the last regression metric we have to discuss: root mean square logarithmic error.

It turns out to be quite easy to optimize because of its connection with MSE loss.

All we need to do is first apply a transform to our target variable, in this case the logarithm of the target plus one.

Let's denote the transformed target with the variable z.

Then we need to fit a model with MSE loss to the transformed target. To get a prediction for a test object, we first obtain the prediction, z hat, in the logarithmic scale just by calling model.predict or something like that.

Next, we do an inverse transform from the logarithmic scale back to the original one by exponentiating z hat and subtracting one, and this is how we obtain the predictions y hat for the test set.
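The three steps can be sketched as follows. To keep the example self-contained, the "model" is just the best constant for MSE (the mean of z), which is a stand-in for any real regressor fitted with MSE loss. The data are made up.

```python
import math

def rmsle(y, y_hat):
    errs = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y, y_hat)]
    return math.sqrt(sum(errs) / len(errs))

y_train = [0.0, 9.0, 99.0, 999.0]

# 1. Transform the target: z = log(y + 1).
z_train = [math.log1p(t) for t in y_train]

# 2. Fit a model with MSE loss on z. Stand-in model: the mean of z.
z_hat = sum(z_train) / len(z_train)

# 3. Inverse transform: y_hat = exp(z_hat) - 1.
y_hat = math.expm1(z_hat)

n = len(y_train)
score_log_model = rmsle(y_train, [y_hat] * n)
score_plain_mean = rmsle(y_train, [sum(y_train) / n] * n)
print(y_hat, score_log_model, score_plain_mean)
```

The constant obtained through the logarithmic scale scores better on RMSLE than the plain mean of the targets, which is the constant MSE would choose in the original scale.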

# Classification metrics optimization

![c_m_1.png](pics/c_m_1.png)
Logloss for classification is like MSE for regression: it is implemented everywhere.
 
![c_m_1.png](pics/c_m_2.png)

Random forest classifier predictions turn out to be quite bad in terms of logloss.

But there is a way to make them better: we can calibrate the predictions to better fit logloss. We've mentioned several times that logloss requires the model to output posterior probabilities, but what does that mean?

![c_m_1.png](pics/c_m_3.png)

It actually means that if we take all the points that have a score of, for example, 0.8, then there will be exactly four times more positive objects than negatives. That is, 80% of the points will be from class 1, and 20% from class 0. If the classifier doesn't directly optimize logloss, its predictions should be calibrated.
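One rough way to inspect calibration is to group predictions into bins and compare the mean prediction in each bin with the observed fraction of positives. The helper below and its data are hypothetical. The toy predictions are perfectly calibrated by construction.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)   # index of the bin containing p
        bins[b].append((t, p))
    table = []
    for items in bins:
        if items:
            mean_pred = sum(p for _, p in items) / len(items)
            pos_rate = sum(t for t, _ in items) / len(items)
            table.append((round(mean_pred, 3), round(pos_rate, 3), len(items)))
    return table

# Ten objects scored 0.8 (8 positives) and ten scored 0.2 (2 positives).
y_prob = [0.8] * 10 + [0.2] * 10
y_true = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

table = calibration_table(y_true, y_prob)
for mean_pred, pos_rate, count in table:
    print(mean_pred, pos_rate, count)
```

For a well calibrated classifier the first two columns are close to each other in every bin.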

![c_m_1.png](pics/c_m_4.png)

Take a look at this plot. The blue line shows the predictions for the validation set, sorted by value, and the red line shows the corresponding target values smoothed with a rolling window. We clearly see that our predictions are rather conservative: they are much greater than the true target mean on the left side, and much lower than they should be on the right side.

But if we plot the sorted predictions of a calibrated classifier, the curve will be very similar to the target rolling mean. And in fact, the calibrated predictions will have a lower logloss.



![c_m_5.png](pics/c_m_5.png)

Now, there are several ways to calibrate predictions. First, we can use so-called Platt scaling: basically, we just need to fit a logistic regression to our predictions.

I will not go into the details of how to do that, but it is very similar to how we stack models, and we will discuss stacking in detail in a different video. Second, we can fit isotonic regression to our predictions; again, it is done very similarly to stacking, just with another model. Finally, we can use stacking. The idea is that we can fit any classifier. It doesn't need to optimize logloss, it just needs to be good, for example, in terms of AUC.

Then we can fit another model on top that will take the predictions of our model and calibrate them properly. That model on top will use logloss as its optimization loss, so logloss will be optimized indirectly and the final predictions will be calibrated.
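To give a feel for the first option, here is a minimal sketch of Platt-style scaling in plain Python: a one-feature logistic regression fitted by gradient descent on the log-odds of the original predictions. Using the log-odds as the feature, the learning rate, the number of iterations and the toy "conservative" predictions are all assumptions made for illustration. In practice the calibrator should be fitted on held-out predictions, as in stacking.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def logloss(y, p):
    return -sum(t * math.log(q) + (1 - t) * math.log(1 - q) for t, q in zip(y, p)) / len(y)

def fit_platt(scores, y, lr=1.0, n_iter=2000):
    # Calibrated probability: sigmoid(a * logit(score) + b).
    x = [logit(s) for s in scores]
    a, b = 1.0, 0.0   # start from the identity mapping
    for _ in range(n_iter):
        p = [sigmoid(a * xi + b) for xi in x]
        grad_a = sum((pi - t) * xi for pi, t, xi in zip(p, y, x)) / len(y)
        grad_b = sum(pi - t for pi, t in zip(p, y)) / len(y)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

# Conservative predictions: 0.6 where 90% are positive, 0.4 where 10% are positive.
scores = [0.6] * 10 + [0.4] * 10
y = [1] * 9 + [0] + [0] * 9 + [1]

a, b = fit_platt(scores, y)
calibrated = [sigmoid(a * logit(s) + b) for s in scores]
print(logloss(y, scores), logloss(y, calibrated))
```

The ordering of the predictions is unchanged (the mapping is monotonic), so AUC stays the same while logloss drops.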

![c_m_5.png](pics/c_m_6.png)


Logloss was the only metric that is easy to optimize directly. With accuracy, there is no easy recipe for optimizing it directly. In general, the recipe is the following: if it is a binary classification task, fit any metric and tune the binarization threshold. For multi-class tasks, fit any metric and tune parameters, comparing the models by their accuracy score, not by the metric that the models were really optimizing.

So this is a kind of early stopping in the cross-validation, where you look at the accuracy score.

Just to get an intuition why accuracy is hard to optimize, let's look at this plot.

![c_m_5.png](pics/c_m_7.png)

So on the vertical axis we show the loss, and the horizontal axis shows the signed distance to the decision boundary, for example to a hyperplane for a linear model. The distance is considered positive if the class is predicted correctly, and negative if the object is located on the wrong side of the decision boundary.

**The problem is that this loss has zero gradient almost everywhere with respect to the predictions.** Most learning algorithms require a nonzero gradient to fit; otherwise it is not clear how we need to change the predictions so that the loss decreases.

And so people came up with **proxy losses** that are upper bounds for this zero-one loss. If you perfectly fit the proxy loss, the accuracy will be perfect too, but unlike the zero-one loss, they are differentiable.

For example, you see here the logistic loss (the red curve), used in logistic regression, and the hinge loss, used in SVM.

![c_m_8.png](pics/c_m_8.png)

We can tune the threshold we apply, and we can do it with a simple grid search implemented with a for loop. This means that we can basically fit any sufficiently powerful model. It will not matter much which loss exactly, say hinge or logloss, the model optimizes. All we want from our model's predictions is the existence of a good threshold that separates the classes.
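Such a grid search might look like the sketch below. The grid of 99 thresholds and the toy validation predictions are arbitrary, and the predictions are deliberately skewed towards low values so that 0.5 is a poor threshold.

```python
def accuracy(y_true, y_prob, threshold):
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def tune_threshold(y_true, y_prob):
    best_t, best_acc = 0.5, accuracy(y_true, y_prob, 0.5)
    for i in range(1, 100):
        t = i / 100
        acc = accuracy(y_true, y_prob, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical validation labels and soft predictions.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.60]

default_acc = accuracy(y_true, y_prob, 0.5)
best_t, best_acc = tune_threshold(y_true, y_prob)
print(default_acc, best_t, best_acc)
```

The threshold should be tuned on validation data, not on the test set, just like any other hyperparameter.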

**Also, if our classifier is ideally calibrated, then it is really returning posterior probabilities, and for such a classifier a threshold of 0.5 would be optimal. Such classifiers are rare in practice, though, and threshold tuning often helps.**


![c_m_9.png](pics/c_m_9.png)

Although the loss function of AUC has zero gradients almost everywhere, exactly like the accuracy loss, there exists an algorithm to optimize AUC with gradient-based methods, and some models implement this algorithm, so we can use it by setting the right parameters. I will give you an idea of this method without much detail, as there is more than one way to implement it.
 
 ![c_m_9.png](pics/c_m_10.png)

Recall that a classification task is usually solved at the level of objects. We want to assign 0 to red objects and 1 to green ones, but we do it independently for each object, so our loss is pointwise: we compute it for each object individually, and sum or average the losses over all the objects to get the total loss.

Now, recall that AUC is the probability that a pair of objects is ordered in the right way. Ideally, we want the predictions Y_hat for the green objects to be larger than for the red ones. So, instead of working with single objects, we should work with pairs of objects, and instead of a pointwise loss, we should use a pairwise loss.

 ![c_m_9.png](pics/c_m_11.png)

A pairwise loss takes the predictions and labels for a pair of objects and computes their loss. Ideally, the loss would be zero when the ordering is correct, and greater than zero when the ordering is incorrect. But in practice, different loss functions can be used.

![c_m_12.png](pics/c_m_12.png)

For example, we can use logloss. We may think of the target for this pairwise loss as always being one: green minus red should be one. That is why there is only one term in the logloss objective instead of two. The prob function in the formula is needed to make sure that the difference between the predictions still lies in the [0, 1] range, and I use it here just for the sake of simplicity.
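A possible sketch of such a pairwise logloss is given below. The slide leaves the prob function unspecified, so the sigmoid of the score difference is used here purely as an assumption. Label 1 marks the objects that should receive the higher score.

```python
import math

def pairwise_logloss(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            # Assumed "prob": sigmoid of the difference between the two scores.
            prob = 1 / (1 + math.exp(-(p - n)))
            total -= math.log(prob)   # the target for every pair is 1
    return total / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
well_ordered = pairwise_logloss(labels, [2.0, 1.5, -1.0, -2.0])
badly_ordered = pairwise_logloss(labels, [-2.0, -1.0, 1.5, 2.0])
print(well_ordered, badly_ordered)
```

Unlike the pair-counting form of AUC, this loss is smooth in the scores, which is what makes gradient-based optimization possible.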

![c_m_12.png](pics/c_m_13.png)

I should say that in practice, most people still use logloss as the optimization loss without any further post-processing. I have personally observed XGBoost trained with logloss giving an AUC score comparable to the one trained with pairwise loss.
 
 ![c_m_12.png](pics/c_m_14.png)
 ![c_m_12.png](pics/c_m_15.png)
We need to fit a model with MSE loss to our data and then find appropriate thresholds.
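Applying the thresholds is the simple part and can be sketched as follows. The regression outputs and the two thresholds are hypothetical. In practice the thresholds would be tuned on validation data against the target metric.

```python
import bisect

def to_labels(predictions, thresholds, first_label=1):
    # `thresholds` must be sorted; values below thresholds[0] get `first_label`.
    return [first_label + bisect.bisect_right(thresholds, p) for p in predictions]

regression_output = [0.7, 1.6, 2.2, 2.9, 3.4]   # predictions of a model fitted with MSE
labels = to_labels(regression_output, thresholds=[1.5, 2.5])
print(labels)
```

With k ordered labels there are k - 1 thresholds to search over, which is usually cheap enough for a grid or coordinate-wise search.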