Introduction
======

This notebook introduces some common loss functions, then walks through forward and backward passes in PyTorch.

Loss Functions
======

$y_i =$ target for the ith category (1 if the category is present, 0 otherwise)

$p_i =$ predicted certainty that category i is present


\begin{equation*}
L_1 = |\vec{y} - \vec{p}|
\end{equation*}

\begin{equation*}
L_2 = |\vec{y} - \vec{p}|^2
\end{equation*}

\begin{equation*}
L_{log} = - \sum_{i=1}^N y_i \log(p_i)
\end{equation*}

In [1]:
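# Now, apply these to a multi-label classification problem.
# Suppose we have a flower classifier with four categories: violet,
# poppy, rose, empty.
import math

import torch

# Select the first GPU only when one is present; the cells below also run on the CPU.
if torch.cuda.is_available():
    torch.cuda.set_device(0)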


In [2]:
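# Target: the second category (poppy) is the correct one.
Yvec = torch.FloatTensor([0, 1, 0, 0])
# Prediction with a large error on the correct category.
Pvec = torch.FloatTensor([0.001, 0.5, 0.1, 0.1])
# Prediction with a small error on the correct category.
PvecSmallError = torch.FloatTensor([0.001, 0.9, 0.1, 0.1])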

In [3]:

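# First prediction: large error on the correct category.
# Note: .norm() defaults to the Euclidean (2-) norm, i.e. |y - p| as defined above.
print('First with a large error')
L1Loss = (Yvec - Pvec).norm()
print('L1 Loss = ', L1Loss)
# L2 loss is the square of that norm.
L2Loss = math.pow((Yvec - Pvec).norm(), 2)
print('L2 Loss = ', L2Loss)
# Only the correct category (y_i = 1) contributes to the log loss.
LogLoss = (-Yvec * Pvec.log()).sum()
print('Log Loss = ', LogLoss)

# Second prediction: small error on the correct category.
print('Now with a small error')
L1Loss = (Yvec - PvecSmallError).norm()
print('L1 Loss = ', L1Loss)
L2Loss = math.pow((Yvec - PvecSmallError).norm(), 2)
print('L2 Loss = ', L2Loss)
LogLoss = (-Yvec * PvecSmallError.log()).sum()
print('Log Loss = ', LogLoss)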

First with a large error
L1 Loss =  0.5196162050938572
L2 Loss =  0.27000100059614146
Log Loss =  0.6931471824645996
Now with a small error
L1 Loss =  0.17320798296993586
L2 Loss =  0.030001005364513594
Log Loss =  0.10536054521799088


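This illustrates a couple of important behaviors:

* L2 loss places a very heavy weight on outliers (large errors) as compared to L1 loss: between the two predictions L1 falls by a factor of 3 (0.52 to 0.17), while L2 falls by a factor of 9 (0.27 to 0.03).
* Log loss penalizes highly incorrect values even more: it grows without bound as the certainty assigned to the correct category approaches zero.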
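
To make the comparison concrete, the sketch below recomputes the losses in plain Python from the sums they stand for. It also includes the elementwise sum of absolute differences, which is what "L1 loss" usually refers to; the cell above uses `.norm()`, which defaults to the Euclidean norm. The function names are illustrative and not part of the notebook.

```python
import math


def euclidean_distance(y, p):
    """|y - p|: the quantity the notebook prints as 'L1 Loss' (default of .norm())."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)))


def abs_sum_loss(y, p):
    """Sum of absolute differences: the conventional elementwise L1 loss."""
    return sum(abs(yi - pi) for yi, pi in zip(y, p))


def squared_loss(y, p):
    """|y - p|^2: the sum of squared differences."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p))


def log_loss(y, p):
    """-sum(y_i * log(p_i)); terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi != 0)


y = [0, 1, 0, 0]
large_error = [0.001, 0.5, 0.1, 0.1]
small_error = [0.001, 0.9, 0.1, 0.1]

for name, p in (("large error", large_error), ("small error", small_error)):
    print(name)
    print("  |y - p|          =", euclidean_distance(y, p))
    print("  sum |y_i - p_i|  =", abs_sum_loss(y, p))
    print("  |y - p|^2        =", squared_loss(y, p))
    print("  log loss         =", log_loss(y, p))
```

The plain-Python values should agree with the tensor results above up to single-precision rounding, since `FloatTensor` stores 32-bit floats.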

Automatic Gradient Calculation - How Does It Work?
======
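PyTorch records each operation applied to a variable that requires a gradient, and every such variable retains its most recent value and gradient between steps.

Let's follow a calculation forward and backward to see how this works.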



In [4]:
from torch.autograd import Variable

In [5]:
#
# During learning, the prediction changes from step to step, which changes the loss.
# That change is simulated here with two predictions, Pvec and PvecLowerLoss.
# The delta will be Loss - LowerLoss.
#
Yvec = torch.FloatTensor([0, 1 , 0, 0])
Pvec = torch.FloatTensor([0.001, .5 , .1, .1]) 
PvecLowerLoss = torch.FloatTensor([0.001, .55 , .1, .1]) 

Yvar = Variable(Yvec,requires_grad=True)
Pvar = Variable(Pvec,requires_grad=True)
PvarLowerLoss = Variable(PvecLowerLoss,requires_grad=True)



In [6]:
#Simulate a training step
L2LossVar = Variable.pow((Yvar-Pvar).norm(),2)
L2LossVarLowerLoss = Variable.pow((Yvar-PvarLowerLoss).norm(),2)

In [7]:
# This is the change in loss from the simulated step
LossDelta=(L2LossVar.data - L2LossVarLowerLoss.data)
print(LossDelta)


1.00000e-02 *
  4.7500
[torch.FloatTensor of size 1]



Back Propagation
======
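Calling `backward()` on the output with the loss change as the `gradient` argument propagates that value back through the graph. Each variable created with `requires_grad=True` receives, in its `.grad` field, the loss change multiplied by the derivative of the output with respect to that variable.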

In [8]:
L2LossVarLowerLoss.backward(gradient=LossDelta, retain_graph=True)

In [9]:
firstGradientResult = PvarLowerLoss.grad.clone()
print(firstGradientResult)

Variable containing:
1.00000e-02 *
  0.0095
 -4.2750
  0.9500
  0.9500
[torch.FloatTensor of size 4]
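
These numbers can be checked by hand. Writing $\delta$ for `LossDelta` (a symbol introduced here only for convenience), the derivative of the squared norm with respect to each prediction is:

\begin{equation*}
\frac{\partial L_2}{\partial p_i} = -2\,(y_i - p_i)
\end{equation*}

The two forward passes give the loss change:

\begin{equation*}
\delta = 0.270001 - 0.222501 = 0.0475
\end{equation*}

Evaluating the derivative at $\vec{p} = (0.001,\ 0.55,\ 0.1,\ 0.1)$ and scaling by $\delta$:

\begin{equation*}
\delta\,\frac{\partial L_2}{\partial \vec{p}} = 0.0475 \times (0.002,\ -0.9,\ 0.2,\ 0.2) = (0.000095,\ -0.04275,\ 0.0095,\ 0.0095)
\end{equation*}

This agrees with the printed gradient once the $10^{-2}$ scale factor in the output is taken into account.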



Where is the intermediate data stored, and how is it used?
======
In the case of built-in functions like subtract and pow, it looks
like the backward pass is in C code and inaccessible to Python
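
Even when the backward code itself is compiled, the graph that links those operations can still be inspected from Python. The sketch below assumes a PyTorch release (0.4 or later) in which tensors carry `requires_grad` directly and `Variable` is no longer needed; `walk` is a hypothetical helper written for this illustration, and the node names it prints vary between PyTorch versions.

```python
import torch

# Modern-API stand-ins for the notebook's Yvar and PvarLowerLoss.
y = torch.tensor([0.0, 1.0, 0.0, 0.0])
p = torch.tensor([0.001, 0.55, 0.1, 0.1], requires_grad=True)

loss = (y - p).norm().pow(2)

# Tensors created directly by the user are leaves and have no grad_fn;
# tensors produced by an operation point at the node that will compute their gradient.
print(p.is_leaf, p.grad_fn)
print(loss.grad_fn)


def walk(node, depth=0):
    """Print the type name of each node reachable from `node`, one level per indent."""
    if node is None:  # inputs that do not require a gradient appear as None
        return
    print("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        walk(parent, depth + 1)


walk(loss.grad_fn)
```

Each printed node corresponds to one recorded operation, ending at the node that accumulates the gradient into the leaf `p`.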

To get a better feel for this, define a custom autograd `Function` that squares its input,
and use that in place of Variable.pow()

In [10]:
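from torch.autograd import Function


class square(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.pow(2)
        # Stash the squared value so backward() can retrieve it.
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        # The saved value is retrieved but not used below.
        result, = ctx.saved_variables
        # Pass the incoming gradient through unchanged.
        return grad_output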

In [11]:

L2LossCustom = square.apply((Yvar-PvarLowerLoss).norm())

In [12]:
#Check that the forward pass has created the same result
print(L2LossVarLowerLoss)
print(L2LossCustom)

Variable containing:
 0.2225
[torch.FloatTensor of size 1]

Variable containing:
 0.2225
[torch.FloatTensor of size 1]



In [13]:
print(LossDelta)


1.00000e-02 *
  4.7500
[torch.FloatTensor of size 1]



In [14]:
PvarLowerLoss.grad.data.zero_()
L2LossCustom.backward(gradient=LossDelta, retain_graph=True)

In [15]:
PvarLowerLoss.grad

Variable containing:
1.00000e-02 *
  0.0101
 -4.5315
  1.0070
  1.0070
[torch.FloatTensor of size 4]

In [16]:
#
# The above result should match the first gradient calculation since all that has happened
# is replacing a PyTorch internal function with one defined in Python
#

firstGradientResult


Variable containing:
1.00000e-02 *
  0.0095
 -4.2750
  0.9500
  0.9500
[torch.FloatTensor of size 4]

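The result is close to the first gradient but not identical: every component is larger by the same factor, 4.5315 / 4.2750 ≈ 1.06. This is what one would expect from `square.backward` returning `grad_output` unchanged: the derivative of $i^2$ is $2i$, and here $2i = 2\sqrt{0.2225} \approx 0.943 \approx 1/1.06$, which is the factor the custom backward pass leaves out.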
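
For comparison, here is a minimal sketch of a squaring `Function` whose `backward` multiplies `grad_output` by the local derivative $2i$. It assumes PyTorch 0.4 or later (plain tensors with `requires_grad`, and `ctx.saved_tensors`); the class name `Square` and the variable names are illustrative and not taken from the notebook. With that factor in place, the custom gradient should agree with the one produced by the built-in `pow`.

```python
import torch
from torch.autograd import Function


class Square(Function):
    """Squares its input and applies the chain rule in backward()."""

    @staticmethod
    def forward(ctx, i):
        # Save the input: the derivative of i**2 depends on i, not on the result.
        ctx.save_for_backward(i)
        return i.pow(2)

    @staticmethod
    def backward(ctx, grad_output):
        (i,) = ctx.saved_tensors
        # Incoming gradient times the local derivative d(i**2)/di = 2*i.
        return grad_output * 2 * i


y = torch.tensor([0.0, 1.0, 0.0, 0.0])
p_builtin = torch.tensor([0.001, 0.55, 0.1, 0.1], requires_grad=True)
p_custom = torch.tensor([0.001, 0.55, 0.1, 0.1], requires_grad=True)
loss_delta = torch.tensor(0.0475)  # the loss change computed earlier

# Same calculation twice: once with the built-in pow, once with the custom Function.
(y - p_builtin).norm().pow(2).backward(gradient=loss_delta)
Square.apply((y - p_custom).norm()).backward(gradient=loss_delta)

print(p_builtin.grad)
print(p_custom.grad)
```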
What happens if the gradients are not zeroed?


In [17]:
#Calculation with the gradient not zeroed
L2LossCustom.backward(gradient=LossDelta, retain_graph=True)
PvarLowerLoss.grad

Variable containing:
1.00000e-02 *
  0.0201
 -9.0630
  2.0140
  2.0140
[torch.FloatTensor of size 4]

In [18]:
L2LossCustom.backward(gradient=LossDelta, retain_graph=True)
PvarLowerLoss.grad

Variable containing:
 0.0003
-0.1359
 0.0302
 0.0302
[torch.FloatTensor of size 4]

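Each call to `backward()` simply adds the new gradient into the `.grad` field: after two and three calls the values above are two and three times the single-call result. The gradient is accumulated rather than overwritten.
The purpose is to facilitate summing, and then averaging, the gradient over an entire minibatch; it is also why `.grad` has to be zeroed before an independent calculation.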
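
The accumulation behavior can be reproduced in a few lines. This is a minimal sketch assuming PyTorch 0.4 or later, where tensors carry `requires_grad` directly; `loss_fn` is a hypothetical helper, and the default gradient of 1 is used instead of `LossDelta` to keep the example short.

```python
import torch

y = torch.tensor([0.0, 1.0, 0.0, 0.0])
p = torch.tensor([0.001, 0.55, 0.1, 0.1], requires_grad=True)


def loss_fn(prediction):
    """Squared-norm loss between the target y and a prediction."""
    return (y - prediction).pow(2).sum()


# First backward pass: .grad is created and holds a single gradient.
loss_fn(p).backward()
single = p.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
loss_fn(p).backward()
doubled = p.grad.clone()

# Zero the field in place, then repeat: .grad holds a single gradient again.
p.grad.zero_()
loss_fn(p).backward()
after_zeroing = p.grad.clone()

print(single)
print(doubled)
print(after_zeroing)
```

To average over a minibatch with this mechanism, one option is to divide each sample's loss by the batch size before calling `backward()`, so that the accumulated sum is already the mean.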
