## The only imports we need

In [31]:
import numpy as np

import torch
import torch.nn as nn
import torch.autograd as autograd
from torch.autograd import Variable

# What is a loss?

Also known as a criterion or objective function (or, with the sign flipped, a utility function).

It tells us how wrong the output of a model is compared to the ground truth (the real answers), i.e. how badly or how well the algorithm is doing. Loss functions come up again and again in economics, mathematics, optimisation, statistics, ML/DL/AI etc.

e.g. an image of a cat goes in, and the model says with 80% confidence that it is a cat and with 20% that it is a dog. The model is then X loss wrong!
The value of X depends on the specific loss function, but the exact number is not important here.
Zero loss occurs when the model is perfect, e.g. the model says 100% cat when the image *actually* is a cat.
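
To make the cat example concrete, here is a minimal sketch that assumes the loss is the negative log of the probability given to the correct class (one common choice, not the only one). The helper name `negative_log_likelihood` is made up for illustration.

```python
import math

def negative_log_likelihood(prob_of_true_class):
    """Loss for one example: log(1 / p), where p is the probability of the correct class."""
    return math.log(1.0 / prob_of_true_class)

loss_80 = negative_log_likelihood(0.8)   # model says 80% cat, image is a cat
loss_100 = negative_log_likelihood(1.0)  # model says 100% cat, image is a cat

print('80% confident and right:', round(loss_80, 4))
print('100% confident and right:', loss_100)
```

The 80% case gives a small positive number and the perfect case gives exactly zero, which is the behaviour described above.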

We won't talk much, if at all, about specific models like neural networks or random forests here, because loss functions usually operate on the outputs of a model regardless of what the model is. We will feed inputs directly to the loss functions to build the right intuitions.

todo show some loss curves and graphs and landscapes pictures here

todo: stress the importance of this: Oriol says: "Architectures, Losses and inputs/outputs". These are the three main things in Deep Learning. Messenger comment here. 

e.g. due to the importance of the losses, I will focus a few videos on this. Anyone who studies and watches these videos will be very fluent in these.

I'm also doing this for my own understanding

Every loss function therefore needs inputs. These are usually the ground truth *targets* and the model outputs in Supervised Learning

## Declare Inputs and targets

Every loss function needs inputs and targets

In [32]:
input_regression = torch.Tensor([1, 2, 3, 4, 5])
target_regression = torch.Tensor([1, 2, 3, 4, 6])

input_classification = torch.Tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3]]).transpose(1, 0)
target_classification = torch.LongTensor([1, 2, 3, 4, 5]) # torch.LongTensor(3).random_(5)

print('input_regression:', input_regression.numpy())
print('input_classification:', input_classification.numpy())
print('target_regression:', target_regression.numpy())
print('target_classification:', target_classification.numpy())
# todo print these in numpy or clearer
# todo show math in markdown
# work through each methodically and clearly

input_regression: [1. 2. 3. 4. 5.]
input_classification: [[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]]
target_regression: [1. 2. 3. 4. 6.]
target_classification: [1 2 3 4 5]


In [54]:
# Official docs
# size_average – By default, the losses are averaged over observations for each minibatch. 
#               However, if the field size_average is set to False, the losses are instead 
#               summed for each minibatch. Only applies when reduce is True. Default: True
# reduce – By default, the losses are averaged over observations for each minibatch, 
#         or summed, depending on size_average. When reduce is False, 
#         returns a loss per input/target element instead 
#         and ignores size_average. Default: True


# Two parameters are shared by the loss functions in this PyTorch version, with defaults
# reduce=True and size_average=True
# size_average only matters when reduce=True: True averages the per-element losses, False sums them
# reduce default is True, but if False, we get one loss per element, e.g. [0. 0. 0. 0. 1.]
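
The interaction of the two flags can be sketched in plain Python. `reduce_losses` is a hypothetical helper written for this note, not a PyTorch function; it only mimics the behaviour described in the docs quoted above. Later PyTorch releases replace both flags with a single `reduction` argument (`'mean'`, `'sum'` or `'none'`).

```python
def reduce_losses(per_element, reduce=True, size_average=True):
    """Hypothetical helper mimicking the reduce / size_average flags."""
    if not reduce:
        # one loss per element; size_average is ignored
        return list(per_element)
    total = sum(per_element)
    return total / len(per_element) if size_average else total

inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [1.0, 2.0, 3.0, 4.0, 6.0]
abs_errors = [abs(t - x) for x, t in zip(inputs, targets)]

print('default (average):', reduce_losses(abs_errors))
print('size_average=False (sum):', reduce_losses(abs_errors, size_average=False))
print('reduce=False (per element):', reduce_losses(abs_errors, reduce=False))
```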

## L1Loss AKA absolute loss AKA Laplace

$$ L = \frac{1}{n} \sum_{i=1}^n \left| y_i - h(x_i) \right|$$

$$\frac{|1 - 1| + |2 - 2| + |3 - 3| + |4 - 4| + |6 - 5|}{5} =\frac{1}{5} = 0.2 $$


In [34]:
# L1Loss AKA absolute loss
loss_default = nn.L1Loss()
loss_sum_reduce = nn.L1Loss(size_average=False) # sums instead of averaging over the number of elements
loss_no_reduce = nn.L1Loss(reduce=False, size_average=True) # one loss per element; size_average is ignored

inp = Variable(input_regression, requires_grad=True) # todo requires necessary?
target = Variable(target_regression)
print('Input: ', inp.data.numpy())
print('Target: ', target.data.numpy(), '\n')
print('L1Loss default (avg reduce) {}'.format(loss_default(inp, target).data.numpy())) # example above
print('L1Loss no average (sum reduce): {}'.format(loss_sum_reduce(inp, target).data.numpy())) # numerator in fraction
print('L1Loss: no reduce: {}'.format(loss_no_reduce(inp, target).data.numpy())) # loss for each example

# Show gradients. todo explain and show output gradients. 
output = loss_default(inp, target)
output.backward()
print('\nShow Gradients for loss')
print('L1Loss: {}'.format(output.data.numpy()))
print(inp.grad.data.numpy())

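print(inp.data - inp.grad.data) # one plain gradient step with step size 1
# The gradient is 0.2 even where the prediction was already correct: |x| has no unique slope at 0,
# and judging by the printed gradients this PyTorch version uses +1/n there.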

# We want the user to get a real intuition for how the final layer is wrong.
# Todo quantile regression loss and Squared loss (without importance weight aware updates)
# ridge and lasso regression. mention regularisation
# todo add more english from official docstrings

Input:  [1. 2. 3. 4. 5.]
Target:  [1. 2. 3. 4. 6.] 

L1Loss default (avg reduce) [0.2]
L1Loss no average (sum reduce): [1.]
L1Loss: no reduce: [0. 0. 0. 0. 1.]

Show Gradients for loss
L1Loss: [0.2]
[ 0.2  0.2  0.2  0.2 -0.2]

 0.8000
 1.8000
 2.8000
 3.8000
 5.2000
[torch.FloatTensor of size 5]
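
The printed gradients can be reproduced by hand. This is a sketch, assuming the slope of |x| at exactly zero error is taken to be +1; that is a convention chosen here to match the output above, and other versions or libraries may use 0 instead. `l1_mean_gradient` is a made-up helper name.

```python
def l1_mean_gradient(inputs, targets):
    """Gradient of mean(|x_i - y_i|) w.r.t. each x_i, using +1 as the slope at zero error."""
    n = len(inputs)
    return [(1.0 if x - y >= 0 else -1.0) / n for x, y in zip(inputs, targets)]

inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [1.0, 2.0, 3.0, 4.0, 6.0]

grads = l1_mean_gradient(inputs, targets)
stepped = [x - g for x, g in zip(inputs, grads)]  # same as inp.data - inp.grad.data

print('gradients:', grads)
print('after one step:', stepped)
```

Only the last element moves towards its target; the first four move away from values that were already right, which is a side effect of the convention at zero rather than something the loss "wants".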



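## MSE Loss AKA mean squared error AKA squared L2 loss

$$ L = \frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2 $$

$$\frac{(1 - 1)^2 + (2 - 2)^2 + (3 - 3)^2 + (4 - 4)^2 + (6 - 5)^2}{5} =\frac{1}{5} = 0.2 $$

The squared L2 loss (squared Euclidean distance) is the top of the fraction, i.e. the sum before dividing by $n$.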

In [35]:
# MSE Loss AKA mean squared error
loss = nn.MSELoss() # default parameters (reduce and size_average) are the same as for L1Loss and the others
inp = Variable(input_regression, requires_grad=True)
target = Variable(target_regression)
output = loss(inp, target)
output.backward()
print('MSELoss: {}'.format(output.data.numpy()))

print(inp.data - inp.grad.data)

# todo add pros and cons for L1 vs L2 Losses

MSELoss: [0.2]

 1.0000
 2.0000
 3.0000
 4.0000
 5.4000
[torch.FloatTensor of size 5]
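
The MSE value and the step printed above can be checked by hand. A small sketch in plain Python, using the fact that the derivative of the mean of (x_i - y_i)^2 with respect to x_i is 2 (x_i - y_i) / n:

```python
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [1.0, 2.0, 3.0, 4.0, 6.0]
n = len(inputs)

# mean of the squared errors
mse = sum((x - y) ** 2 for x, y in zip(inputs, targets)) / n

# gradient of the mean squared error w.r.t. each input
grads = [2.0 * (x - y) / n for x, y in zip(inputs, targets)]

# same as inp.data - inp.grad.data
stepped = [x - g for x, g in zip(inputs, grads)]

print('MSE:', mse)
print('gradients:', grads)
print('after one step:', stepped)
```

Unlike the L1 case, inputs that were already correct get a zero gradient and stay put, because the squared error is smooth at zero.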



In [36]:
# SmoothL1Loss AKA Huber loss # Creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise.
loss = nn.SmoothL1Loss()
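
A per-element sketch of the piecewise rule in the comment above, assuming the threshold of 1 that the comment mentions: squared (times 0.5) for small errors, absolute (minus 0.5) for large ones, so the two pieces meet at an error of exactly 1. `smooth_l1` is an illustrative helper, not the library call.

```python
def smooth_l1(x, y):
    """Per-element smooth L1: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    d = abs(x - y)
    return 0.5 * d * d if d < 1 else d - 0.5

small_error = smooth_l1(5.0, 5.5)   # inside the quadratic region
edge_error = smooth_l1(5.0, 6.0)    # exactly at the switch-over point
large_error = smooth_l1(5.0, 8.0)   # inside the linear region

print(small_error, edge_error, large_error)
```

The linear tail is what makes this loss less sensitive to outliers than MSE, while the quadratic part keeps the gradient small near zero error.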

# Classification losses

## CrossEntropyLoss

https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
A Short Introduction to Entropy, Cross-Entropy and KL-Divergence  
https://www.youtube.com/watch?v=ErfnhcEV1O8

In [37]:
# CrossEntropyLoss
loss = nn.CrossEntropyLoss()
#loss = nn.BCELoss()
#inp = Variable(input_classification, requires_grad=True)
#target = Variable(target_classification)

num_rows = 2
num_classes = 5
inp = Variable(torch.randn(num_rows, num_classes).clamp(0.0001, 100), requires_grad=True)
target = Variable(torch.LongTensor(num_rows).random_(num_classes))
print(inp, target)
output = loss(inp, target)
output.backward()
#print(target.data.numpy().flatten()[0])
#print(-np.log(inp[target.data.numpy().flatten()[0]]))
print('CrossEntropyLoss: {}'.format(output))

Variable containing:
 2.9194e-01  1.0000e-04  1.0000e-04  7.7704e-01  1.2983e+00
 1.0940e+00  9.3695e-02  1.4648e+00  1.0000e-04  1.0000e-04
[torch.FloatTensor of size 2x5]
 Variable containing:
 1
 2
[torch.LongTensor of size 2]

CrossEntropyLoss: Variable containing:
 1.5474
[torch.FloatTensor of size 1]
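
What `CrossEntropyLoss` computes can be sketched by hand: for each row, take the log of the sum of exponentials of the scores and subtract the score of the target class, then average over rows. The numbers below are copied from the printed output above, so they are rounded and the result should land close to, not exactly on, the printed 1.5474.

```python
import math

def cross_entropy(logits, target):
    """-log(softmax(logits)[target]) for one row of raw scores."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]

# rounded values copied from the printed input and target
rows = [[0.29194, 0.0001, 0.0001, 0.77704, 1.2983],
        [1.0940, 0.093695, 1.4648, 0.0001, 0.0001]]
targets = [1, 2]

per_row = [cross_entropy(r, t) for r, t in zip(rows, targets)]
mean_loss = sum(per_row) / len(per_row)

print('per-row losses:', per_row)
print('mean loss:', mean_loss)
```

The first row is punished much more than the second, because its target class has one of the lowest scores in the row.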



In [38]:
# NLLLoss
m = nn.LogSoftmax(dim=1) # normalise over the class dimension (dim 1 of an N x C input)
loss = nn.NLLLoss()
# inp is of size N x C = 3 x 5
inp = Variable(torch.randn(3, 5), requires_grad=True)
# each element in target has to have 0 <= value < C
target = Variable(torch.LongTensor([1, 0, 4]))
output = loss(m(inp), target)
output.backward()
print('NLLLoss: {}'.format(output))

NLLLoss: Variable containing:
 2.1415
[torch.FloatTensor of size 1]
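
The split between `LogSoftmax` and `NLLLoss` can be sketched for a single row: the first turns raw scores into log-probabilities, the second simply picks out the (negated) log-probability of the target class. The scores below are made up for illustration.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one row of raw scores."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

logits = [0.5, 1.5, -1.0]  # made-up scores for 3 classes
target = 1

log_probs = log_softmax(logits)
loss = nll(log_probs, target)

print('log-probabilities:', log_probs)
print('NLL for target class', target, ':', loss)
```

Chaining the two is the same computation as the cross-entropy cell earlier, which is why `CrossEntropyLoss` takes raw scores while `NLLLoss` expects log-probabilities.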





In [39]:
# PoissonNLLLoss # Negative log likelihood loss with Poisson distribution of target.
loss = nn.PoissonNLLLoss()
log_inp = Variable(torch.randn(5, 2), requires_grad=True)
target = Variable(torch.randn(5, 2))
output = loss(log_inp, target)
output.backward()
print('PoissonNLLLoss: {}'.format(output))

PoissonNLLLoss: Variable containing:
 1.3639
[torch.FloatTensor of size 1]
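
A per-element sketch of the Poisson negative log-likelihood, assuming (as the variable name `log_inp` suggests) that the model output is the log of the Poisson rate. The term that depends only on the target is dropped, since it does not affect the gradient. `poisson_nll` is an illustrative helper name.

```python
import math

def poisson_nll(log_rate, count):
    """exp(log_rate) - count * log_rate, the Poisson NLL up to a target-only constant."""
    return math.exp(log_rate) - count * log_rate

count = 3.0
loss_bad_guess = poisson_nll(0.0, count)               # predicted rate of 1
loss_good_guess = poisson_nll(math.log(count), count)  # predicted rate equal to the count

print('rate 1 vs count 3:', loss_bad_guess)
print('rate 3 vs count 3:', loss_good_guess)
```

The loss is lowest when the predicted rate equals the observed count. Note that it can go below zero because of the dropped constant.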



In [40]:
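# NLLLoss2d # negative log likelihood loss, but for image inputs. It computes NLL loss per-pixel.
# note: NLL expects log-probabilities; m(inp) below is a raw conv output, so the printed value can be negative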
m = nn.Conv2d(16, 32, (3, 3)).float()
loss = nn.NLLLoss2d()
# input is of size N x C x height x width
inp = Variable(torch.randn(3, 16, 10, 10))
# each element in target has to have 0 <= value < C
target = Variable(torch.LongTensor(3, 8, 8).random_(0, 4))
output = loss(m(inp), target)
output.backward()
print('NLLLoss2d: {}'.format(output))

NLLLoss2d: Variable containing:
1.00000e-02 *
 -2.9927
[torch.FloatTensor of size 1]



In [41]:
# KLDivLoss # The Kullback-Leibler divergence Loss
loss = nn.KLDivLoss()
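
A sketch of the quantity behind `KLDivLoss` for one pair of distributions. It assumes the convention that the input is given as log-probabilities and the target as plain probabilities; averaging or summing over a batch is left out. `kl_div` is an illustrative helper.

```python
import math

def kl_div(log_q, p):
    """Sum over classes of p * (log p - log q); classes with p == 0 contribute nothing."""
    return sum(pi * (math.log(pi) - lqi) for lqi, pi in zip(log_q, p) if pi > 0)

p = [0.5, 0.5]  # target distribution (made-up numbers)
q = [0.9, 0.1]  # model distribution (made-up numbers)

kl_same = kl_div([math.log(x) for x in p], p)  # model matches the target
kl_diff = kl_div([math.log(x) for x in q], p)  # model disagrees with the target

print('KL(p || p):', kl_same)
print('KL(p || q):', kl_diff)
```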

In [42]:
# BCELoss # Binary Cross Entropy
m = nn.Sigmoid()
loss = nn.BCELoss(size_average=False) # default is True
inp = Variable(torch.randn(3), requires_grad=True)
target = Variable(torch.FloatTensor(3).random_(2))
output = loss(m(inp), target)
output.backward()
print('target:', target)
print('m(inp (output)): {} BCELoss: {}'.format(m(inp), output))

target: Variable containing:
 1
 1
 1
[torch.FloatTensor of size 3]

m(inp (output)): Variable containing:
 0.6500
 0.3623
 0.7615
[torch.FloatTensor of size 3]
 BCELoss: Variable containing:
 1.7185
[torch.FloatTensor of size 1]
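
The printed BCELoss can be checked by hand from the printed probabilities. Since `size_average=False`, the per-element losses are summed. The probabilities below are the rounded values from the output above, so expect a result close to the printed 1.7185 rather than identical.

```python
import math

def bce(prob, target):
    """Binary cross entropy for one predicted probability and one 0/1 target."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

probs = [0.6500, 0.3623, 0.7615]  # rounded sigmoid outputs copied from above
targets = [1, 1, 1]

per_element = [bce(p, t) for p, t in zip(probs, targets)]
total = sum(per_element)

print('per-element losses:', per_element)
print('summed loss:', total)
```

The middle prediction (0.3623 for a target of 1) contributes more than half of the total.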



In [43]:
# BCEWithLogitsLoss # This loss combines a Sigmoid layer and the BCELoss in one single class
loss = nn.BCEWithLogitsLoss()
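
Why combine the sigmoid and the BCE in one class? A sketch of the usual argument: working on the raw score directly allows a rearranged formula that avoids taking the log of a number that has rounded to 0. The formula below is the standard numerically stable form; the helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Stable binary cross entropy on a raw score: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate scores it agrees with sigmoid followed by plain BCE
naive = -math.log(sigmoid(0.5))   # target is 1
stable = bce_with_logits(0.5, 1.0)

# For extreme scores the naive route would need log(1 - sigmoid(100)), i.e. log(0)
extreme = bce_with_logits(100.0, 0.0)

print(naive, stable, extreme)
```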

In [44]:
# MarginRankingLoss # Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch Tensor y with values (1 or -1).
loss = nn.MarginRankingLoss()
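
A per-pair sketch of the ranking loss: `y = 1` means x1 should be ranked higher than x2, `y = -1` means the opposite, and the loss is zero once the pair is ordered correctly by at least `margin`. The default margin of 0 is assumed here; `margin_ranking` is an illustrative helper.

```python
def margin_ranking(x1, x2, y, margin=0.0):
    """max(0, -y * (x1 - x2) + margin) for one pair of scores."""
    return max(0.0, -y * (x1 - x2) + margin)

right_order = margin_ranking(2.0, 1.0, 1)    # x1 should be higher, and it is
wrong_order = margin_ranking(1.0, 2.0, 1)    # x1 should be higher, but it is lower
flipped_label = margin_ranking(1.0, 2.0, -1) # x2 should be higher, and it is

print(right_order, wrong_order, flipped_label)
```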

In [45]:
# HingeEmbeddingLoss
'''                 { x_i,                  if y_i ==  1
loss(x, y) = 1/n {
                    { max(0, margin - x_i), if y_i == -1'''

loss = nn.HingeEmbeddingLoss()
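
The formula in the string above, written out as a plain Python sketch. Here x is usually a distance between two embeddings: similar pairs (y = 1) are penalised by the distance itself, dissimilar pairs (y = -1) only while they are closer than `margin`. A margin of 1.0 is assumed, and the numbers are made up.

```python
def hinge_embedding(xs, ys, margin=1.0):
    """Mean of x_i where y_i == 1 and max(0, margin - x_i) where y_i == -1."""
    per_element = [x if y == 1 else max(0.0, margin - x) for x, y in zip(xs, ys)]
    return sum(per_element) / len(per_element)

distances = [0.2, 0.3, 1.5]  # made-up distances
labels = [1, -1, -1]         # similar, dissimilar, dissimilar

loss_value = hinge_embedding(distances, labels)
print('HingeEmbeddingLoss sketch:', loss_value)
```

The third pair is dissimilar and already further apart than the margin, so it contributes nothing.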

In [46]:
# MultiLabelMarginLoss # multi-class multi-classification hinge loss (margin-based loss)
# loss(x, y) = sum_ij(max(0, 1 - (x[y[j]] - x[i]))) / x.size(0)

loss = nn.MultiLabelMarginLoss()

In [47]:
# SoftMarginLoss # two-class classification logistic loss
loss = nn.SoftMarginLoss()

In [48]:
# MultiLabelSoftMarginLoss # multi-label one-versus-all loss based on max-entropy
loss = nn.MultiLabelSoftMarginLoss()

In [49]:
# CosineEmbeddingLoss
loss = nn.CosineEmbeddingLoss()

In [50]:
# MultiMarginLoss
loss = nn.MultiMarginLoss()

In [53]:
# TripletMarginLoss

"""
Creates a criterion that measures the triplet loss given input tensors x1, x2, x3 and a margin with a value greater than 0. This is used for measuring a relative similarity between samples. A triplet is composed of a, p and n: an anchor, a positive example and a negative example respectively. The shapes of all input tensors should be (N,D).
"""
# defaults: margin=1.0, p=2
triplet_loss = nn.TripletMarginLoss()
input1 = Variable(torch.randn(100, 128), requires_grad=True)
input2 = Variable(torch.randn(100, 128), requires_grad=True)
input3 = Variable(torch.randn(100, 128), requires_grad=True)
output = triplet_loss(input1, input2, input3)
print(output.backward()) # backward() returns None; the gradients end up in input1.grad, input2.grad and input3.grad
print(output)
# todo push to pytorch official docs that the example they have is wrong. They don't wrap the tensors in variables. But that is solved in 0.4?

None
Variable containing:
 1.2470
[torch.FloatTensor of size 1]
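
For a single triplet, the loss can be sketched as follows, assuming the defaults noted in the cell (margin=1.0, p=2, i.e. Euclidean distance) and ignoring any small epsilon the library may add inside the distance. The vectors are made up, and the helper names are for illustration only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def triplet_margin(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0) for one triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

good_triplet = triplet_margin(anchor, positive=near, negative=far)
bad_triplet = triplet_margin(anchor, positive=far, negative=near)

print(good_triplet, bad_triplet)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, and grows linearly otherwise.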



# Loss functions in every single relevant framework

You should now recognise most of the loss functions in the frameworks below.

**PyTorch Losses**: http://pytorch.org/docs/master/nn.html#loss-functions

**Torch Losses**: https://github.com/torch/nn/blob/master/doc/criterion.md

**Keras Losses**: https://keras.io/losses/

**TensorFlow Losses**: https://www.tensorflow.org/api_docs/python/tf/losses

**Gluon/MXNet Losses**: https://mxnet.incubator.apache.org/api/python/gluon/loss.html

**Chainer Losses**: http://docs.chainer.org/en/stable/reference/functions.html#loss-functions

**CNTK Losses**: https://docs.microsoft.com/en-us/cognitive-toolkit/Loss-Functions-and-Metrics

**DeepLearning4j Losses**: https://deeplearning4j.org/features#lossobjective-functions

**Lasagne Losses**: http://lasagne.readthedocs.io/en/latest/modules/objectives.html

**PaddlePaddle Losses**: http://paddlepaddle.org/docs/develop/api/en/v2/config/layer.html?highlight=loss#cost-layers

**Caffe2 Losses**: Couldn't find a good and simple list for Caffe or Caffe2

### Other Resources:

https://en.wikipedia.org/wiki/Loss_function  
https://en.wikipedia.org/wiki/Loss_functions_for_classification  
http://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html  
https://davidrosenberg.github.io/ml2015/docs/3a.loss-functions.pdf  