<a href="https://colab.research.google.com/github/center4ml/Workshops/blob/2023_2_solutions/Day_1/2_loss_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import math

## Loss function

In supervised training of neural networks (and, to a lesser extent, in evaluation), one of the key concepts is the loss function. The loss function measures the distance between the known target values and the network's predictions. A good loss function is differentiable and has non-zero gradients almost everywhere.

As a counterexample, **Accuracy**, a measure often used to evaluate classifiers, has a gradient equal to zero almost everywhere: a small change in the weights usually does not flip any prediction, so the score stays the same. Training updates the weights using gradients computed by backpropagation, and zero gradients mean no update. Thus, Accuracy isn't suitable as a loss function.

**[Some extra reading material on that topic](https://stats.stackexchange.com/questions/222585/what-are-the-impacts-of-choosing-different-loss-functions-in-classification-to-a).**

Examples of loss functions:

- Mean Square Error - MSE - used for regression,
- Binary Cross Entropy - used for classification with two classes; it acts as a smooth surrogate for Accuracy that has non-zero gradients,
- Cross Entropy - a measure used for general classification.
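
The contrast between Accuracy and a smooth loss can be checked directly. The sketch below uses made-up toy values (the logits, targets, and the `accuracy` helper are illustrative assumptions, not part of the workshop material) and shows that a small shift in the outputs leaves Accuracy untouched, while Binary Cross Entropy produces a non-zero gradient for every output.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy data: raw classifier outputs and their binary targets.
logits = torch.tensor([-2.0, 3.0, 0.5, -1.0], requires_grad=True)
targets = torch.tensor([0.0, 1.0, 0.0, 1.0])

def accuracy(logits, targets):
    # Threshold the outputs at 0 and count how many match the targets.
    predicted = (logits > 0).float()
    return (predicted == targets).float().mean()

# A small nudge to every output does not change accuracy at all ...
shift = 0.01
acc_before = accuracy(logits, targets)
acc_after = accuracy(logits + shift, targets)
print(acc_before.item(), acc_after.item())

# ... and the thresholding step detaches the result from the autograd graph.
print(acc_before.requires_grad)

# Binary cross entropy on the same data yields a usable gradient.
bce = F.binary_cross_entropy_with_logits(logits, targets)
bce.backward()
print(logits.grad)
```

Because the comparison `logits > 0` returns a boolean tensor, there is nothing for backpropagation to differentiate, which is the practical face of the "zero gradient almost everywhere" argument.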

Let us provide an exact definition of MSE.

If a target vector is

$T=(T_i), i=1, \ldots, N$

and a prediction vector of a regressor is

$P=(P_i), i=1, \ldots, N$

Then $MSE(P, T) = \frac{\sum_{i=1}^N (T_i-P_i)^2}{N}$
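
As a quick sanity check of the formula, take a made-up pair of vectors (chosen only for illustration, and different from the task below): $T=(1, 5)$ and $P=(2, 3)$, so $N=2$ and

$MSE(P, T) = \frac{(1-2)^2 + (5-3)^2}{2} = \frac{1 + 4}{2} = 2.5$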


### Your task #1

Calculate, using Python, MSE loss of the prediction $P=(1.1, 4.12, 8.9, 14.85)$ versus the target values $T=(1,4,9,16)$

In [None]:
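target = [1, 4, 9, 16]
prediction = [1.1, 4.12, 8.9, 14.85]
# Accumulate the squared errors; the name `sum` is avoided because it would shadow the built-in.
squared_error_sum = 0.0
for t, p in zip(target, prediction):
    squared_error_sum += (t - p)**2
mse = squared_error_sum / len(target)
print(mse)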


0.3392250000000002


## Predefined loss functions in PyTorch

PyTorch provides predefined implementations of common loss functions. Of course, you are free to define your own loss functions, too, but the predefined ones have some advantages:
- their implementation is numerically stable. As an example, Cross Entropy takes the logarithm of an exponential. Mathematically the two cancel out, but in floating-point arithmetic the exponential of even a moderately large value overflows before the logarithm can be applied,
- they usually have more efficient implementations,
- they have built-in reduction methods.

The predefined loss functions in PyTorch are all part of `torch.nn.functional` and are [documented here](https://pytorch.org/docs/stable/nn.functional.html#loss-functions).

Let us have a look at a predefined version of MSE loss that we've just calculated

In [None]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]))

tensor(0.3392)

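To see why the naive calculation of a log-exp pair is numerically unstable, have a look at this example. Mathematically `log(exp(i))` is just `i`, but `math.exp(i)` exceeds the largest representable double once `i` reaches 710, so the loop stops there with an `OverflowError`: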

In [None]:
for i in range(700, 800):
  print(i, math.log(math.exp(i)))
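
A common remedy, sketched below in plain Python, is to subtract the maximum before exponentiating, so that the largest exponent is always `exp(0) = 1`. The function names are hypothetical and the code only illustrates the idea; it is not claimed to be what PyTorch does internally.

```python
import math

def naive_logsumexp(values):
    # Direct translation of log(sum(exp(v))); overflows for large v.
    return math.log(sum(math.exp(v) for v in values))

def stable_logsumexp(values):
    # Shift by the maximum so that no exponent is larger than exp(0) = 1.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

# For small inputs both versions agree.
print(naive_logsumexp(small), stable_logsumexp(small))

# For large inputs only the shifted version survives.
try:
    naive_logsumexp(large)
    naive_failed = False
except OverflowError:
    naive_failed = True
print("naive version overflowed:", naive_failed)
print(stable_logsumexp(large))
```

The shift does not change the mathematical result, because `log(sum(exp(v))) = m + log(sum(exp(v - m)))` for any constant `m`.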

### Reductions

Reductions are ways to consolidate the result into one value. Depending on the exact use case, a user may want to calculate the mean loss (as in the definition of MSE), the sum, or even the individual loss values (before any consolidation). This can be controlled with the additional parameter `reduction`:

- `reduction="mean"`  (this is the default)
- `reduction="sum"`
- `reduction="none"` to get a tensor of individual results

In [None]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]), reduction="sum")

tensor(1.3569)

In [None]:
torch.nn.functional.mse_loss(torch.tensor([1.1, 4.12, 8.9, 14.85]), torch.tensor([1, 4, 9, 16]), reduction="none")

tensor([0.0100, 0.0144, 0.0100, 1.3225])

Using `reduction="none"`, we can see that the fourth element had by far the largest contribution to the MSE. If you want to sum the values rather than average them, use `reduction="sum"`.
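
The three reductions are related in a simple way, which the following sketch checks on the same numbers: summing the `"none"` result gives the `"sum"` result, and averaging it gives the `"mean"` result. The variable names and the `F` alias are illustrative choices, and the targets are written as floats here so that both tensors share one dtype.

```python
import torch
import torch.nn.functional as F

prediction = torch.tensor([1.1, 4.12, 8.9, 14.85])
target = torch.tensor([1.0, 4.0, 9.0, 16.0])

per_element = F.mse_loss(prediction, target, reduction="none")
total = F.mse_loss(prediction, target, reduction="sum")
mean = F.mse_loss(prediction, target, reduction="mean")

# "sum" and "mean" are just the sum and the mean of the individual values.
print(torch.allclose(per_element.sum(), total))
print(torch.allclose(per_element.mean(), mean))

# Index of the element that contributes most to the loss.
worst = per_element.argmax().item()
print(worst)
```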

## Classification loss

### Two classes - Binary Cross Entropy

Now let's examine loss functions used for classification. If the targets are the classes 0 or 1 and the predictions are **class probabilities**, i.e. floats between 0 and 1, then the task is binary classification and the appropriate loss is Binary Cross Entropy. Here is a usage example in PyTorch:

In [None]:
torch.nn.functional.binary_cross_entropy(torch.tensor([0.0, 0.77, 0.11, 0.99]), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([0.0000, 0.2614, 0.1165, 0.0101])
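
For a single prediction $p$ and target $t$, Binary Cross Entropy is $-(t \log p + (1-t) \log(1-p))$. The plain-Python sketch below (the helper `bce` is a hypothetical name) evaluates this formula for the last three data points of the example above. The first data point, with $p=0$, is left out because `math.log(0)` raises an error; PyTorch's documentation states that `binary_cross_entropy` clamps the logarithm at -100 to handle such cases.

```python
import math

def bce(p, t):
    # Binary cross entropy for one probability p and one target t (0 or 1).
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

# (probability, target) pairs taken from the example above.
pairs = [(0.77, 1), (0.11, 0), (0.99, 1)]
losses = [bce(p, t) for p, t in pairs]
for (p, t), loss in zip(pairs, losses):
    print(p, t, round(loss, 4))
```

Only one of the two terms is active for each data point: $-\log p$ when the target is 1 and $-\log(1-p)$ when it is 0.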

If the targets are the classes 0 or 1 but the predictions are arbitrary floats, it is still binary classification, but before using Binary Cross Entropy loss the values must first be mapped into the interval $(0,1)$ with a sigmoid:

In [None]:
torch.nn.functional.binary_cross_entropy(torch.sigmoid(torch.tensor([-2.0, 3.13, 0.0, -120.0])), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([1.2693e-01, 4.2789e-02, 6.9315e-01, 1.0000e+02])

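Or, you can let PyTorch apply the sigmoid internally by using `torch.nn.functional.binary_cross_entropy_with_logits()`, which is recommended for numerical stability. Compare the last element of the result below with the one above: for the logit -120 the sigmoid underflows to 0 in single precision and `binary_cross_entropy` clamps the logarithm at -100, so it reports a loss of 100, whereas the logits version returns the correct value of 120.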

In [None]:
torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-2.0, 3.13, 0.0, -120.0]), torch.tensor([0,1,0,1]).type(torch.float), reduction="none")

tensor([1.2693e-01, 4.2789e-02, 6.9315e-01, 1.2000e+02])
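
One well-known way to compute this loss from a logit $x$ without ever forming the sigmoid is $\max(x, 0) - x t + \log(1 + e^{-|x|})$. The sketch below implements it in plain Python; the function name is hypothetical, and this is an illustration of the idea rather than a statement about PyTorch's exact internals.

```python
import math

def bce_with_logits(x, t):
    # Stable binary cross entropy from a raw logit x and a target t (0 or 1).
    # exp() is only ever called with a non-positive argument, so it cannot overflow.
    return max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))

# (logit, target) pairs taken from the example above.
pairs = [(-2.0, 0), (3.13, 1), (0.0, 0), (-120.0, 1)]
losses = [bce_with_logits(x, t) for x, t in pairs]
print(losses)
```

The last value comes out as 120 rather than 100 because no intermediate probability is rounded to zero.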

### Arbitrary number of classes - Cross Entropy

Cross Entropy is used for multiclass classification but, in principle, nothing stops us from using it for two classes, too. Please observe that instead of a single logit per example, you must provide the loss function with raw predictions (logits) for every class. The prediction tensor therefore has one more dimension: its shape is (number of examples, number of classes).

**This is very important: softmax is executed internally while calculating loss, so there should be no softmax application as part of neural network forward pass**.

Observe also that the target classes are provided as a tensor of type `torch.long`.

In [None]:
torch.nn.functional.cross_entropy(torch.tensor([[-2.0, -2.0], [3.14, 3.14], [0.0, 0.0], [120.0, 0.0]]), torch.tensor([0,1,0,1]).type(torch.long), reduction="none")

tensor([  0.6931,   0.6931,   0.6931, 120.0000])
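
To make the internal softmax visible, the sketch below reproduces the result with `log_softmax` and then shows what goes wrong when softmax is applied before the loss. The variable names are illustrative; the data are the same as in the cell above.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-2.0, -2.0], [3.14, 3.14], [0.0, 0.0], [120.0, 0.0]])
targets = torch.tensor([0, 1, 0, 1])  # integer class indices (torch.long by default)

builtin = F.cross_entropy(logits, targets, reduction="none")

# Cross entropy is the negative log-softmax score of the correct class.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets]
print(torch.allclose(manual, builtin))

# Mistake to avoid: applying softmax first means it is applied twice overall.
double_softmax = F.cross_entropy(F.softmax(logits, dim=1), targets, reduction="none")
print(builtin[3].item(), double_softmax[3].item())
```

With the extra softmax, the badly wrong fourth prediction is penalised far less than it should be, so the training signal is weakened exactly where it matters most.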

### Summary of classification loss options

![summary](https://imgur.com/COfTYRh.png)

### Your task #2

You are working on a two-class classification task. Your classifier has a few layers, but it does NOT have a sigmoid as the final nonlinearity. Rather, it may output arbitrary values between minus and plus infinity.

For the input data examples with known classes $(0,1,0,1,1,1,0)$ the classifier outputs $(-12.8, 3.0, 0.3, 2.9, 17.3, 14.2, -11.9)$.

- What is the mean Binary Cross Entropy Loss?
- What is the Binary Cross Entropy Loss for the individual data points?
- Which data point has the maximal loss, and which has the minimal loss?

**Remember!**
- target tensor should be provided as `torch.float` type.

In [None]:
torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-12.8, 3.0, 0.3, 2.9, 17.3, 14.2, -11.9]), torch.tensor([0, 1, 0, 1, 1, 1, 0]).type(torch.float), reduction="mean")

tensor(0.1366)

In [None]:
torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-12.8, 3.0, 0.3, 2.9, 17.3, 14.2, -11.9]), torch.tensor([0, 1, 0, 1, 1, 1, 0]).type(torch.float), reduction="none")

tensor([2.7418e-06, 4.8587e-02, 8.5436e-01, 5.3563e-02, 0.0000e+00, 7.1526e-07,
        6.7949e-06])

In [None]:
torch.argmax(torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-12.8, 3.0, 0.3, 2.9, 17.3, 14.2, -11.9]), torch.tensor([0, 1, 0, 1, 1, 1, 0]).type(torch.float), reduction="none"))

tensor(2)

In [None]:
torch.argmin(torch.nn.functional.binary_cross_entropy_with_logits(torch.tensor([-12.8, 3.0, 0.3, 2.9, 17.3, 14.2, -11.9]), torch.tensor([0, 1, 0, 1, 1, 1, 0]).type(torch.float), reduction="none"))

tensor(4)


### Your task #3

You are working on a three-class classification task. Your classifier has a few layers and outputs arbitrary values between minus and plus infinity.

For the four input data examples with known classes $(0, 2, 1, 0)$ the classifier outputs
- $(2.8, -2.0, 0.1)$ for the first data point,
- $(2.1, 0.3, 1.8)$ for the second data point,
- $(0.2, 0.2, 0.3)$ for the third data point,
- $(0.1, -0.2, 0.0)$ for the last data point.

- What is the mean Cross Entropy Loss?
- What is the Cross Entropy Loss for the individual data points?
- Which data point has the maximal loss, and which has the minimal loss?

**Remember!**
- the prediction tensor needs one more dimension than the target: one row per data point, one column per class.
- target classes should be provided as `torch.long` type.


In [None]:
torch.nn.functional.cross_entropy(torch.tensor([[2.8, -2.0, 0.1], [2.1, 0.3, 1.8], [0.2, 0.2, 0.3], [0.1, -0.2, 0.0]]), torch.tensor([0, 2, 1, 0]).type(torch.long), reduction="mean")

tensor(0.7809)

In [None]:
torch.nn.functional.cross_entropy(torch.tensor([[2.8, -2.0, 0.1], [2.1, 0.3, 1.8], [0.2, 0.2, 0.3], [0.1, -0.2, 0.0]]), torch.tensor([0, 2, 1, 0]).type(torch.long), reduction="none")

tensor([0.0727, 0.9451, 1.1331, 0.9729])

In [None]:
torch.argmin(torch.nn.functional.cross_entropy(torch.tensor([[2.8, -2.0, 0.1], [2.1, 0.3, 1.8], [0.2, 0.2, 0.3], [0.1, -0.2, 0.0]]), torch.tensor([0, 2, 1, 0]).type(torch.long), reduction="none"))

tensor(0)

In [None]:
torch.argmax(torch.nn.functional.cross_entropy(torch.tensor([[2.8, -2.0, 0.1], [2.1, 0.3, 1.8], [0.2, 0.2, 0.3], [0.1, -0.2, 0.0]]), torch.tensor([0, 2, 1, 0]).type(torch.long), reduction="none"))

tensor(2)