# Commonly Used Loss Functions

The parameters of a neural network are trained by back-propagating the error computed by a loss function, so you must choose a loss function when designing your model.

In this chapter, we'll walk through a few commonly used loss functions available in PyTorch and what's involved in computing them.

At a basic level, loss functions fall into two categories: regression losses and classification losses.

Below we show examples of both.


In [1]:
import torch
import torch.nn as nn

print(f'Torch version: {torch.__version__}')

Torch version: 1.7.1


## Regression Analysis with a Regression Loss Function
Regression analysis predicts a continuous output variable based on the values of one or more input variables. Briefly, the goal of a regression model is to build a mathematical approximation that defines the output variable as a function of the input variables. An example of a continuous output is the predicted price of a home based on inputs such as living space, number of rooms, etc. Note that the output is a continuous value, not a discrete or categorical one. The input variables can be continuous or discrete.

The loss function associated with regression analysis is known as the regression loss. Mean Square Error (MSE) is the most commonly used regression loss function, and we'll illustrate it here with an example.

MSE is calculated as the mean of the squared differences between predictions and actual observations (also known as the targets). Expressed as a computation, it is:

`sum( square( predicted_vals - targets ) ) / (number of target values)` and you'll see it expressed as PyTorch code below.

Intuitively, squaring the error term means that larger deviations from the target values contribute far more to the loss than smaller ones: the penalty grows with the square of the deviation.

A disadvantage of MSE is that a few outliers (or even one) can make the loss very large, which may not be desirable, since the squared error of the outlier dominates the mean.
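
To make the outlier effect concrete, here is a minimal sketch in plain Python. The helper `mse` and the numbers in `targets`, `close_preds`, and `outlier_preds` are made up for illustration and are not part of the PyTorch example in this chapter.

```python
def mse(predicted, targets):
    """Mean of squared differences, following the formula above."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)

# hypothetical data, chosen only to illustrate the effect of one outlier
targets = [1.0, 2.0, 3.0, 4.0]
close_preds = [1.1, 1.9, 3.2, 3.9]      # every prediction is within 0.2 of its target
outlier_preds = [1.1, 1.9, 3.2, 14.0]   # same, except the last prediction is off by 10

print(mse(close_preds, targets))    # small: all errors are small
print(mse(outlier_preds, targets))  # dominated by the single outlier (10**2 / 4 = 25)
```

Three of the four predictions are identical in both lists, yet the single large error accounts for almost all of the second loss.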


In [2]:
predicted_vals = torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)
regression_mse_loss = nn.MSELoss()(predicted_vals, targets)  # default reduction is 'mean'
print('predicted_vals:\n\t', predicted_vals)
print('targets:\n\t', targets)
print('regression loss:\n\t', regression_mse_loss)
# hand calculation: sum of squared differences divided by the number of elements
hand_calc_regress_mse_loss = torch.sum(torch.square(predicted_vals - targets)) / targets.numel()
print('hand calculated regression loss:\n\t:', hand_calc_regress_mse_loss)
# compare with a tolerance, since the two computations may differ by floating-point rounding
assert torch.allclose(regression_mse_loss, hand_calc_regress_mse_loss)

predicted_vals:
	 tensor([[-2.9639,  1.4672,  0.8487, -0.7195, -1.9816],
        [-0.8811, -1.5988,  1.4383,  0.2276, -0.3059],
        [-1.0753,  0.1708, -0.0526,  1.1169, -1.5161]], requires_grad=True)
targets:
	 tensor([[ 0.3525, -0.2891, -0.2615,  2.5561, -0.5336],
        [-0.7510, -0.7784,  0.4914,  0.9608, -0.7434],
        [-1.0682, -1.0122,  1.8483, -1.2479,  0.2343]])
regression loss:
	 tensor(2.9418, grad_fn=<MseLossBackward>)
hand calculated regression loss:
	: tensor(2.9418, grad_fn=<DivBackward0>)


## Binary Classification with a Cross Entropy Loss Function

Measuring the Binary Cross Entropy Loss between the target and the predicted output is done with: 

`loss = -1 * (target * log(predicted)  +  (1 - target) * log(1 - predicted))` where `log` is the natural logarithm.

We show below with a simple example that the Cross Entropy Loss defined above penalizes misclassifications more heavily than the MSE Loss does, a property that can help the model learn.

Assume that the target value is 1. We compute the MSE Loss and the Cross Entropy Loss for a correctly predicted value (0.95) and a wrongly predicted value (0.1), i.e., a misclassification.

First, we calculate the MSE loss.

| Target | Predicted | MSE Loss | Comment |
| :---: | :---: | :---: | :--- |
| 1 | 0.95 | `(1.0 - 0.95)**2` = 0.0025 | For a correct prediction, loss is small |
| 1 | 0.1 | `(1.0 - 0.1)**2` = 0.81 | For a wrong prediction, loss is big |

For the Cross Entropy Loss, because the target value is 1, the computation reduces to `-1 * target * log(predicted)` or simply `-1 * log(predicted)`.

| Target | Predicted | Cross Entropy Loss | Comment |
| :---: | :---: | :---: | :--- |
| 1 | 0.95 | `-1 * log(0.95)` = 0.051 | Prediction is correct; loss is small |
| 1 | 0.1 | `-1 * log(0.1)` = 2.302 | Prediction is wrong; loss is LARGE when compared to the MSE loss |

If the predicted output is close to the desired output, then the loss is small for both loss functions. The difference becomes noticeable, however, when the output is misclassified (0.81 for MSE Loss and 2.302 for Cross Entropy Loss).
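
The numbers in the two tables can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic; the helper names `squared_error` and `binary_cross_entropy` are illustrative and not PyTorch functions.

```python
import math

def squared_error(target, predicted):
    """Squared error for a single prediction."""
    return (target - predicted) ** 2

def binary_cross_entropy(target, predicted):
    """Binary cross entropy for a single prediction, using the natural log."""
    return -(target * math.log(predicted) + (1 - target) * math.log(1 - predicted))

target = 1.0
for predicted in (0.95, 0.1):
    print(predicted,
          round(squared_error(target, predicted), 4),
          round(binary_cross_entropy(target, predicted), 4))
```

For the misclassified value 0.1, the cross entropy comes out close to 2.30, roughly three times the squared error of 0.81.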

The Cross Entropy Loss function also has the benefit of learning at a faster pace. To learn (the process of continuously updating model parameters), we back-propagate the loss, which is done by taking the partial derivative of the loss with respect to the weights. When the output comes from a Sigmoid, we can show that the rate at which the model parameters (weights and biases) learn is proportional to `output - target`, i.e., proportional to the error in the output. The larger the error, the faster the model will learn, which is a very nice property.
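
The `output - target` property can be checked numerically. The sketch below assumes a single Sigmoid output fed by a raw score `z` and compares the derivative of the loss with respect to `z` (estimated by finite differences) against `sigmoid(z) - target`. All function names here are illustrative; the gradient with respect to a weight would be this quantity multiplied by that weight's input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, target):
    """Binary cross entropy of sigmoid(z) against target."""
    p = sigmoid(z)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def numerical_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dz."""
    return (bce_from_logit(z + eps, target) - bce_from_logit(z - eps, target)) / (2 * eps)

target = 1.0
for z in (-2.0, 0.0, 2.0):
    analytic = sigmoid(z) - target  # the claimed gradient: output - target
    print(z, analytic, numerical_grad(z, target))
```

The two columns agree closely, and the magnitude is largest for `z = -2.0`, where the output is furthest from the target.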

In the example below, we create two clusters of normally distributed data points, 1.5 units apart. Each cluster belongs to a class, so we have two classes, class 0 and class 1. The loss computation uses `nn.BCELoss`, the PyTorch Binary Cross Entropy Loss function.

It is typical to use the `nn.Sigmoid` activation as the final output of your neural net, followed by `nn.BCELoss` to calculate the Binary Cross Entropy loss.

We also calculate it by hand and ensure we get the same results.



In [3]:
input = torch.cat([torch.randn(3), torch.randn(3) + 1.5])  # two clusters 1.5 units apart
target = torch.cat([torch.zeros(3), torch.ones(3)])  # ... from two classes; class 0, class 1
predicted = torch.sigmoid(input)

unreduced_loss = nn.BCELoss(reduction='none')(predicted, target)
mean_loss = nn.BCELoss(reduction='mean')(predicted, target)
print('input:\n\t', input)
print('target:\n\t', target)
print('predicted:\n\t', predicted)
print('unreduced loss:\n\t', unreduced_loss)
print('mean loss:\n\t', mean_loss)

hand_calc_loss = -1 * (target * torch.log(predicted) + (1 - target) * torch.log(1 - predicted))
print('hand calculated loss\n\t', hand_calc_loss)
assert torch.allclose(hand_calc_loss, unreduced_loss)  # check they are the same, up to floating-point rounding

input:
	 tensor([ 0.4471,  0.2686, -0.4730,  1.1380,  1.3156,  0.0472])
target:
	 tensor([0., 0., 0., 1., 1., 1.])
predicted:
	 tensor([0.6099, 0.5667, 0.3839, 0.7573, 0.7884, 0.5118])
unreduced loss:
	 tensor([0.9415, 0.8364, 0.4844, 0.2780, 0.2377, 0.6698])
mean loss:
	 tensor(0.5746)
hand calculated loss
	 tensor([0.9415, 0.8364, 0.4844, 0.2780, 0.2377, 0.6698])


In the example above, the Cross Entropy Loss calculation was done in two steps.

1. A Sigmoid non-linearity on the output

2. Application of the Binary Cross Entropy Loss

For numerical stability reasons, the two steps above can be collapsed into one step with the `nn.BCEWithLogitsLoss` loss function.

`nn.BCEWithLogitsLoss` combines a Sigmoid layer and the `BCELoss` in a single class.

In the example below, we reuse the input tensor from the code above and show the equivalence.


In [4]:
# input and target are the same as above; BCEWithLogitsLoss takes the raw input, not the sigmoid output
loss_unreduced = nn.BCEWithLogitsLoss(reduction='none')(input, target)
loss_mean = nn.BCEWithLogitsLoss(reduction='mean')(input, target)
print('unreduced loss:\n', loss_unreduced)
print('mean loss:\n', loss_mean)  # will be the same as the two step calculation

unreduced loss:
 tensor([0.9415, 0.8364, 0.4844, 0.2780, 0.2377, 0.6698])
mean loss:
 tensor(0.5746)
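
To see why a one-step formulation can be more stable, here is a plain-Python sketch. It uses a commonly cited rearrangement, `max(x, 0) - x * t + log(1 + exp(-|x|))`, which is an assumption for illustration and not a claim about how PyTorch implements `nn.BCEWithLogitsLoss` internally. The function names are hypothetical, and PyTorch may handle the extreme case differently from plain Python.

```python
import math

def bce_two_step(logit, target):
    """Sigmoid first, then the BCE formula (the two-step route)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """One-step form that never takes the log of a value rounded to zero."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# first element of the input above: both routes agree (about 0.9415)
print(bce_two_step(0.4471, 0.0), bce_with_logits(0.4471, 0.0))

# a confidently wrong score: sigmoid(40) rounds to exactly 1.0, so log(1 - p) fails
try:
    bce_two_step(40.0, 0.0)
except ValueError as err:
    print('two-step route failed:', err)
print(bce_with_logits(40.0, 0.0))  # 40.0
```

For moderate scores the two routes match; for extreme scores the two-step route loses the information once the Sigmoid saturates, while the one-step form still returns a finite loss.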


## Multi-Class Classification with a Cross Entropy Loss Function

We use this loss function when we train a model to output a probability distribution over multiple classes (multi-class classification).

To be concrete, let's say we're classifying images of digits; with the digits 0 through 9, there are `C = 10` classes.

The `target` that this loss expects should be a class index in the range `[0, C-1]`, where `C` is the number of classes.





In [5]:

input = torch.randn(7, 5)  # 7 samples, raw scores for 5 classes
target = torch.randint(5, (7,))  # one class index in [0, 4] per sample
predicted = nn.LogSoftmax(dim=1)(input)  # log-probabilities over the class dimension
unreduced_loss = nn.NLLLoss(reduction='none')(predicted, target)
mean_loss = nn.NLLLoss(reduction='mean')(predicted, target)
print('input:\n\t', input)
print('target:\n\t', target)
print('predicted:\n\t', predicted)
print('unreduced loss:\n\t', unreduced_loss)
print('mean loss:\n\t', mean_loss)

def coded_nll_loss(logs, targets):
    # pick the log-probability of the target class for each sample, then negate it
    out = torch.zeros_like(targets, dtype=torch.float)
    for i in range(len(targets)):
        out[i] = logs[i][targets[i]]
    # out = logs[torch.arange(len(targets)), targets]  # vectorized alternative to the loop
    return -out  # negative out
hand_coded = coded_nll_loss(predicted, target)
print('hand calc loss', hand_coded)

input:
	 tensor([[-0.2160, -1.7204,  0.1276, -1.1409, -1.4754],
        [-0.0730, -0.7012, -1.6942, -0.5264, -0.1453],
        [ 0.7482, -2.5375,  0.1499, -1.5509,  0.5278],
        [-0.1652,  0.1398, -0.2193, -0.0302,  0.1148],
        [-0.8853, -0.1703, -0.8741,  0.2440, -0.3303],
        [-0.4432,  0.4571, -2.0676,  0.4733, -0.6300],
        [-0.4423,  0.6099,  1.8465, -0.1155,  0.3033]])
target:
	 tensor([0, 2, 1, 1, 3, 2, 3])
predicted:
	 tensor([[-1.1978, -2.7021, -0.8541, -2.1226, -2.4571],
        [-1.1930, -1.8212, -2.8142, -1.6464, -1.2653],
        [-0.9122, -4.1978, -1.5105, -3.2112, -1.1325],
        [-1.7530, -1.4479, -1.8071, -1.6179, -1.4729],
        [-2.1851, -1.4700, -2.1738, -1.0558, -1.6300],
        [-1.9442, -1.0438, -3.5685, -1.0276, -2.1309],
        [-2.8462, -1.7939, -0.5573, -2.5193, -2.1006]])
unreduced loss:
	 tensor([1.1978, 2.8142, 4.1978, 1.4479, 1.0558, 3.5685, 2.5193])
mean loss:
	 tensor(2.4002)
hand calc loss tensor([1.1978, 2.8142, 4.1978, 1.4479, 1.0558, 3.5685, 2.5193])

In [6]:
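# input and target, same as above
# nn.CrossEntropyLoss applies LogSoftmax and NLLLoss in one step, so it takes the raw input
unreduced_loss = nn.CrossEntropyLoss(reduction='none')(input, target)
mean_loss = nn.CrossEntropyLoss(reduction='mean')(input, target)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)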

unreduced loss:
 tensor([1.1978, 2.8142, 4.1978, 1.4479, 1.0558, 3.5685, 2.5193])
mean loss:
 tensor(2.4002)
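
As a cross-check on the numbers above, the loss for a single sample can be sketched in plain Python as the negative log-softmax at the target index. The helpers `log_softmax` and `cross_entropy` are illustrative names, and `row` holds the first row of the printed input (rounded to four decimals), whose target class is 0.

```python
import math

def log_softmax(logits):
    """Log-softmax using the log-sum-exp trick (subtract the max before exponentiating)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target class."""
    return -log_softmax(logits)[target_index]

row = [-0.2160, -1.7204, 0.1276, -1.1409, -1.4754]  # first row of the input above
print(cross_entropy(row, 0))  # close to the first unreduced loss, 1.1978
```

The result matches the first entry of the unreduced loss up to the rounding of the printed inputs.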


## Multi-Label Categorical Classification


In [7]:
input = torch.randn(7, 5)
target = torch.randint(2, (7, 5), dtype=torch.float)
predicted = torch.sigmoid(input)
unreduced_loss = nn.BCELoss(reduction='none')(predicted, target)
mean_loss = nn.BCELoss(reduction='mean')(predicted, target)
print('input:\n', input)
print('target:\n', target)
print('predicted:\n', predicted)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)

input:
 tensor([[-0.8950,  0.9014, -0.0569, -0.6153,  1.9885],
        [-0.4626,  0.2669,  0.6811,  0.3441, -0.8225],
        [ 0.7108, -0.0324,  0.3900, -0.8149,  1.5479],
        [-0.0916,  1.3633, -0.0421, -1.4999,  0.7701],
        [ 1.1824,  0.4017, -2.5063, -1.3161, -1.7623],
        [-0.6089,  0.7983,  0.1294,  1.2488, -0.7882],
        [ 0.8558,  0.4784,  0.0946, -0.1078,  0.4479]])
target:
 tensor([[0., 1., 1., 1., 1.],
        [1., 1., 0., 1., 0.],
        [1., 1., 1., 1., 1.],
        [0., 0., 1., 1., 0.],
        [1., 0., 0., 1., 1.],
        [1., 1., 0., 1., 1.],
        [1., 1., 1., 1., 1.]])
predicted:
 tensor([[0.2901, 0.7112, 0.4858, 0.3509, 0.8796],
        [0.3864, 0.5663, 0.6640, 0.5852, 0.3052],
        [0.6706, 0.4919, 0.5963, 0.3069, 0.8246],
        [0.4771, 0.7963, 0.4895, 0.1824, 0.6836],
        [0.7654, 0.5991, 0.0754, 0.2115, 0.1465],
        [0.3523, 0.6896, 0.5323, 0.7771, 0.3126],
        [0.7018, 0.6174, 0.5236, 0.4731, 0.6101]])
unreduced loss:
 tensor

In [8]:
# input and target remain the same as above
unreduced_loss = nn.BCEWithLogitsLoss(reduction='none')(input, target)
mean_loss = nn.BCEWithLogitsLoss(reduction='mean')(input, target)
print('unreduced loss:\n', unreduced_loss)
print('mean loss:\n', mean_loss)

unreduced loss:
 tensor([[0.3426, 0.3408, 0.7220, 1.0474, 0.1283],
        [0.9510, 0.5686, 1.0906, 0.5358, 0.3642],
        [0.3996, 0.7095, 0.5170, 1.1814, 0.1929],
        [0.6484, 1.5911, 0.7144, 1.7013, 1.1506],
        [0.2674, 0.9140, 0.0784, 1.5537, 1.9207],
        [1.0433, 0.3716, 0.7599, 0.2522, 1.1630],
        [0.3541, 0.4823, 0.6470, 0.7485, 0.4941]])
mean loss:
 tensor(0.7414)
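
In the multi-label setting, each of the 5 labels is treated as its own independent binary problem, so the loss is simply the binary cross entropy applied element by element. The plain-Python sketch below recomputes the first row of the unreduced loss from the first row of the printed input and target. The helper `bce_with_logits` is an illustrative name, and the rearranged formula it uses is an assumption for this sketch, not a statement about PyTorch internals.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross entropy on a raw score: max(x, 0) - x*t + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# first row of the input and target printed above (rounded to four decimals)
logits = [-0.8950, 0.9014, -0.0569, -0.6153, 1.9885]
labels = [0.0, 1.0, 1.0, 1.0, 1.0]

per_label = [bce_with_logits(x, t) for x, t in zip(logits, labels)]
print([round(v, 4) for v in per_label])  # one loss per label
print(sum(per_label) / len(per_label))   # mean over this row only
```

Each value should be close to the corresponding entry in the first row of the unreduced loss; note that the mean printed here covers one row, whereas `reduction='mean'` averages over all 35 elements.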


In [9]:
assert torch.equal(torch.tensor([1, 2]), torch.tensor([1, 2]))