# Commonly Used Loss Functions

The model paramters of Neural Networks are trained using a process known as the back-propagation of errors from a loss function. So model training requires that you choose a loss function when designing your model. 

In this chapter, we'll walk through a few commonly used loss function available in PyTorch and what's involved in the computation. 

At a basic level, Loss Functions can be categorized into two categories: Regression Loss and Classification Loss. 

Below we show examples of both.


In [None]:
import torch
import torch.nn as nn

print(f'Torch version: {torch.__version__}')

## Regression Analysis with a Regression Loss Function
Regression analysis predicts a continuous output variable based on the values of one or more input variables. Briefly, the goal of a regression model is to build a mathematical approximation that defines the output variable as a function of the input variables. An example of a continuous variable is the predicted price of a home based on inputs such as living space, number of rooms, etc. Note, the output is a continuous value, not a discrete value or a categorical value. The input variable can be continuous or discrete.

The loss fucntion associated with regression analysis is known as the regression loss. Mean Square Error (MSE) is the most commonly used regression loss function (and we'll illustrate that here with an example.

MSE is calculated as the mean of squared difference between predictions and actual observations (also know as the target). Expressed as a computation, it is:

`sum( square( predicted_vals - targets ) ) / (number of training samples)` and you'll see it expressed as PyTorch code below.

Intuitively, the squaring of the error term means that larger deviations from the target values amplify the error way more than than smaller deviations, meaning that the model is penalized in a square relationship for deviating from the target values.

A disadvantage of MSE is that a few (or even one) outlier data points can make the loss unnecessarily large and that maynot be desirable since the squared outlier data point is penalizing the entire mean calculation. 


In [None]:
predicted_vals= torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)
regression_mse_loss = nn.MSELoss()(predicted_vals, targets)  # calculates mean
print('predicted_vals:\n\t', predicted_vals)
print('targets:\n\t', targets)
print('regression loss:\n\t', regression_mse_loss)
hand_calc_regress_mse_loss = torch.sum(torch.square(predicted_vals - targets)) / (targets.shape[0] * targets.shape[1])
print('hand calculated regression loss:\n\t:', hand_calc_regress_mse_loss)
assert torch.equal(regression_mse_loss, hand_calc_regress_mse_loss)

## Binary Classification with a Cross Entropy Loss Function

Measuring the Binary Cross Entropy Loss between the target and the predicted output is done with: 

`loss = -1 * (target * log(predicted)  +  (1 - target) * log(1 - predicted))` where log means log-base-e

We show below with a simple example that the Cross Entropy Loss defined above penalizes misclassifications more, a property that can help in better learning.

Assume that the target value is 1. We compute the MSE Loss and the Cross Entropy Loss for a correctly predicted value (0.95) and a wrongly predicted value (0.1), i.e., a misclassification.

First, we calculate the MSE loss.

| Target &nbsp &nbsp     | Predicted &nbsp &nbsp| MSE Loss | &nbsp &nbsp Comment | 
| :---:        |    :----:   |          :---: | :--- |
| 1      | 0.95       | (1.0 - 0.95)**2 = 0.0025   | &nbsp &nbsp For a correct prediction, loss is small |
| 1   | 0.1        | (1.0 - 0.1)**2 = 0.81      | &nbsp &nbsp For a wrong prediction, loss is big

For the Cross Entropy Loss, because the target value is 1, the computation reduces to `-1 * target * log(predicted)` or simply `-1 * log(predicted)`.

| Target &nbsp &nbsp | Predicted &nbsp &nbsp| Cross Entropy Loss| &nbsp &nbsp Comment | 
| :---:        |    :----:   |          :---: | :--- |
| 1      | 0.95       | -1 * log(0.95) =  0.051  | &nbsp &nbsp Prediction is correct; loss is small |
| 1   | 0.1        | -1 * log(0.1) = 2.302      | &nbsp &nbsp Prediction is wrong; loss is LARGE when compared to the MSE loss

If the predicted output is close to the desired output, then the loss is small (for both loss functions). The difference is noticeable however when the output is misclassified (0.81 for MSE Loss and 2.302 for Cross Entropy Loss). 

The Cross Entropy Loss function also has the benefit of learning at a faster pace. To learn (the process of continously updating model parameters), we back-propagate the loss, and that is done by taking the partial derivative of the loss with respect to the weights. In doing so, we can show that the rate at which the model parameters (weights and biases) learn is proportional to `output - target`, i.e., proportional to the error in the output. The larger the error, the faster the model will learn which is a very nice property.

In the example below, we create two clusters of normalized data points, 1.5 units apart. Each cluster belongs a class; so we have two classes, class 0 and class 1. The loss computation uses `nn.BCELoss`, the PyTorch Binary Cross Entropy Loss function.

It is typical to use the `nn.Sigmoid` activation as the final output of your neural net followed by the nn.BCELoss to calculate the Binary Cross Entropy loss.

We also calculate is by hand and ensure we get the same results.

![](assets/sigmoid_p_bce.png)





In [None]:
model_output = torch.cat([torch.randn(3), torch.randn(3) + 1.5])  # two clusters 1.5 units apart
target = torch.cat([torch.zeros(3), torch.ones(3)])  # ... from two classes; class 0, class 1
predicted = torch.sigmoid(model_output)

unreduced_loss = nn.BCELoss(reduction='none')(predicted, target)
mean_loss = nn.BCELoss(reduction='mean')(predicted, target)
print('input:\n\t', input)
print('target:\n\t', target)
print('predicted:\n\t', predicted)
print('unreduced loss:\n\t', unreduced_loss)
print('mean loss:\n\t', mean_loss)

hand_calc_loss = -1 * (target * torch.log(predicted) + (1 - target) * torch.log(1 - predicted))
print('hand calculated loss\n\t', hand_calc_loss)
assert torch.equal(hand_calc_loss, unreduced_loss)  # check they are the same

In the exampe above, the Cross Entropy Loss calculation was done in two steps.

1. A Sigmoid lon-linearity on the output

2. Application of the Binary Cross WEntropy Loss

For numerical stability reasons, the two steps above can be collapsed into one step with the `nn.BCEWithLogitsLoss` loss function.

`nn.BCEWithLogitsLoss` loss combines a Sigmoid layer and the BCELoss in one single class

In the example below, we reuse the input tensor from the code above and show the equivalence.


In [None]:
# inputs and target same as above
loss_unreduced = nn.BCEWithLogitsLoss(reduction='none')(input, target)
loss_mean = nn.BCEWithLogitsLoss(reduction='mean')(input, target)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)  # will be the same as the two step calculation

## Multi-Class Classification with a Cross Entropy Loss Function

We use this loss function when we train a model to output a probability over multiple classes. Lets say, we have a total of C classes. Each sample can be classified into one of those C classes.  

The `target` that this loss expects should be a class index in the range \[0, C-1\] where C is number of classes. 

![](assets/softmax_p_NLLoss.png)

References

https://forums.fast.ai/t/nllloss-implementation/20028/

https://gombru.github.io/2018/05/23/cross_entropy_loss/

https://www.youtube.com/watch?v=7q7E91pHoW4&ab_channel=PythonEngineer




In [19]:

model_output = torch.randn(7, 5)  # batch size is 7 and we have 5 classes
target = torch.randint(5, (7,))  # 7 targets to match the batch size, each target prepresents classes, 0 to 4
predicted = nn.LogSoftmax()(model_output)  # pass output through softmax layer
unreduced_loss = nn.NLLLoss(reduction='none')(predicted, target)
mean_loss = nn.NLLLoss(reduction='mean')(predicted, target)
print('input:\n\t', input)
print('target:\n\t', target)
print('predicted:\n\t', predicted)
print('unreduced loss:\n\t', unreduced_loss)
print('mean loss:\n\t', mean_loss)

def coded_nll_loss(logs, targets):
    out = torch.zeros_like(targets, dtype=torch.float)
    for i in range(len(targets)):
        out[i] = logs[i][targets[i]]
    return -out  # negative out
hand_coded = coded_nll_loss(predicted, target)
print('hand calc loss', hand_coded) 

input:
	 tensor([[-0.8950,  0.9014, -0.0569, -0.6153,  1.9885],
        [-0.4626,  0.2669,  0.6811,  0.3441, -0.8225],
        [ 0.7108, -0.0324,  0.3900, -0.8149,  1.5479],
        [-0.0916,  1.3633, -0.0421, -1.4999,  0.7701],
        [ 1.1824,  0.4017, -2.5063, -1.3161, -1.7623],
        [-0.6089,  0.7983,  0.1294,  1.2488, -0.7882],
        [ 0.8558,  0.4784,  0.0946, -0.1078,  0.4479]])
target:
	 tensor([2, 4, 4, 4, 1, 3, 1])
predicted:
	 tensor([[-1.7278, -3.0123, -1.1349, -1.2646, -1.7760],
        [-2.2558, -1.1229, -2.0422, -1.4121, -1.6271],
        [-1.7369, -2.2626, -0.6376, -1.8019, -3.6376],
        [-2.5275, -1.0308, -2.9608, -0.8408, -2.5226],
        [-1.9876, -2.8648, -1.5183, -1.8859, -0.8319],
        [-2.9272, -1.9747, -1.2376, -1.1586, -1.5913],
        [-3.0826, -0.9130, -1.3581, -1.3898, -3.0671]])
unreduced loss:
	 tensor([1.1349, 1.6271, 3.6376, 2.5226, 2.8648, 1.1586, 0.9130])
mean loss:
	 tensor(1.9798)
hand calc loss tensor([1.1349, 1.6271, 3.6376, 2.5226, 

`nn.CrossEntropyLoss` combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class.

![](assets/cross_entropy_loss.png)

The results are the same as above.



In [20]:
# input and target, same as above
loss = nn.CrossEntropyLoss()(model_output, target)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)

unreduced loss:
 tensor([1.1349, 1.6271, 3.6376, 2.5226, 2.8648, 1.1586, 0.9130])
mean loss:
 tensor(1.9798)


## Multi-Label Categorical Classification

“Multi-label” classification means that each image can belong to any number of the specified classes, including no class (i.e., the foreground doesn't have any of the classes we've trained on). So multi-label classification can be interpreted as a series of binary classifications per class. Is the image in class A – yes or no? Is the same image in class B – yes or no? And so on.

![](assets/desert+mountains+cactus-with-labels.png)

For computing the loss for multi-label classification, it's convenient to use the `torch.nn.BCEWithLogitsLoss` class which combines a Sigmoid activation layer and the BCELoss (Binary Cross Entropy Loss) in one single class. By combining the operations into one layer, we take advantage of numerical stability inherent in these combined operations (and this is well documented). 

With the aid of examples, we'll show the equialence of the two.

![](assets/multi-label-sigmoid_p_bce.png)


In [23]:
model_output = torch.randn(7, 5)  # batch-size is 7 and we have 5 labels
target = torch.randint(2, (7, 5), dtype=torch.float)  # 7 target vectors, each vector indicates which labels are present/absent with 1/0
predicted = torch.sigmoid(model_output)
unreduced_loss = nn.BCELoss(reduction='none')(predicted, target)
mean_loss = nn.BCELoss(reduction='mean')(predicted, target)
print('input:\n', input)
print('target:\n', target)
print('predicted:\n', predicted)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)

input:
 tensor([[ 1.5968, -0.4286,  0.0860,  1.7630, -1.1820],
        [ 0.4368, -1.8240,  0.0101, -0.5243, -0.3475],
        [ 0.5771,  0.2613,  0.2285,  1.2745,  0.7631],
        [ 0.8846, -0.3833, -0.2283,  0.1239, -0.3933],
        [ 1.2028,  0.1742,  1.9431,  2.4397, -1.1949],
        [ 1.9256,  1.3565, -2.4439, -3.3839,  0.5788],
        [ 2.4243, -1.8506,  1.5658,  0.5640, -0.8770]])
target:
 tensor([[0., 0., 1., 0., 1.],
        [1., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0.],
        [1., 0., 1., 1., 1.],
        [0., 0., 1., 0., 0.],
        [0., 1., 1., 0., 0.],
        [0., 1., 1., 1., 0.]])
predicted:
 tensor([[0.6665, 0.2511, 0.6910, 0.7486, 0.3983],
        [0.1654, 0.1196, 0.4309, 0.3940, 0.7574],
        [0.7069, 0.6947, 0.3228, 0.1840, 0.3091],
        [0.5916, 0.5808, 0.7274, 0.6152, 0.2509],
        [0.8088, 0.7374, 0.5291, 0.3958, 0.6805],
        [0.8210, 0.7961, 0.6641, 0.8809, 0.4150],
        [0.5307, 0.7872, 0.8385, 0.6361, 0.8897]])
unreduced loss:
 tensor

THe equivalent of the above is the `nn.BCEWithLogitsLoss` class that combines the sigmoid' and the `BCELoss` class.

![](assents/bce_with_logits+loss.png)

In [24]:
# inputs and targets ramins the same as above
unreduced_loss = nn.BCEWithLogitsLoss(reduction='none')(input, target)
mean_loss = nn.BCELoss(reduction='mean')(predicted, target)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)

unreduced loss:
 tensor([[1.7812, 0.5016, 0.6511, 1.9213, 1.4494],
        [0.4984, 0.1496, 0.6881, 0.4650, 0.5344],
        [1.0228, 0.5710, 0.8139, 0.2465, 1.1458],
        [0.3456, 0.5197, 0.8138, 0.6331, 0.9090],
        [1.4654, 0.7840, 0.1339, 2.5233, 0.2645],
        [2.0617, 0.2292, 2.5272, 0.0334, 1.0239],
        [2.5091, 1.9966, 0.1897, 0.4504, 0.3478]])
mean loss:
 tensor(0.8712)


In [None]:
assert torch.equal(torch.tensor([1, 2]), torch.tensor([1, 2]))