# Lab Assignment 1 - Part 2
## With this assignment you will get to know more about gradient descent optimization and writing your own functions with forward and backward (i.e., gradient) passes.
## You need to complete all the tasks in this notebook in the lab and show your work to the TA. Edit only those portions of the cells where you are asked to do so!

In [None]:
import torch
from torch.autograd import Variable
from torch.autograd import Function
import torch.nn.functional as F
import numpy as np

## Huber loss function
https://en.wikipedia.org/wiki/Huber_loss

In [None]:
# A loss function measures distance between a predicted and a target tensor
# An implementation of Huber loss function is given below
# We will make use of this loss function in gradient descent optimization
def Huber_Loss(input,delta):
  m = (torch.abs(input)<=delta).detach().float()
  output = torch.sum(0.5*m*input**2 + delta*(1.0-m)*(torch.abs(input)-0.5*delta))
  return output

# Test Huber loss with a couple of different examples

In [None]:
a = torch.tensor([[0.3, 2.0, -3.1],[0.5, 9.2, 0.1]])
print(a.numpy())
ha = Huber_Loss(a,1.0)
print(ha.numpy())

b = torch.tensor([0.3, 2.0])
print(b.numpy())
hb = Huber_Loss(b,1.0)
print(hb.numpy())

[[ 0.3  2.  -3.1]
 [ 0.5  9.2  0.1]]
12.974999
[0.3 2. ]
1.545
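The two printed totals can be checked by hand. The following is a minimal sketch in plain Python (the helper name `huber_elementwise` is hypothetical, not part of the notebook) that applies the quadratic branch when $|input_i| \le \delta$ and the linear branch otherwise, then sums the per-element terms:

```python
# Element-wise Huber loss written out in plain Python for a hand check.
# `huber_elementwise` is a hypothetical helper, not part of the notebook.
def huber_elementwise(x, delta=1.0):
    """Huber loss of a single residual x."""
    if abs(x) <= delta:
        return 0.5 * x * x                    # quadratic branch
    return delta * (abs(x) - 0.5 * delta)     # linear branch

a = [0.3, 2.0, -3.1, 0.5, 9.2, 0.1]           # entries of tensor `a`, flattened
terms_a = [huber_elementwise(v) for v in a]
total_a = sum(terms_a)

b = [0.3, 2.0]                                # entries of tensor `b`
total_b = sum(huber_elementwise(v) for v in b)

print([round(t, 3) for t in terms_a])
print(round(total_a, 3), round(total_b, 3))
```

The per-element terms for `a` are 0.045, 1.5, 2.6, 0.125, 8.7 and 0.005, which add up to the 12.975 printed above (12.974999 in single precision); for `b` the terms 0.045 and 1.5 give 1.545.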


# Gradient descent code
## Study the following generic gradient descent optimization code.
## The Huber loss `f` measures the distance between a probability vector `z` and a target 1-hot vector `target`.
## When `f.backward()` is called, PyTorch first computes $\nabla_z f$ (the gradient of `f` with respect to `z`); then, by the chain rule, it computes $\nabla_{var} f = J^{z}_{var} \nabla_z f$, where $J^{z}_{var}$ is the Jacobian of `z` with respect to `var`.
## Next, the `optimizer.step()` call adjusts the variable `var` in the direction opposite to $\nabla_{var} f$.

In [None]:
def gradient_descent(var,optimizer,softmax,loss,target,nIter,nPrint):
  for i in range(nIter):
    z = softmax(var)
    f = loss(z-target,1.0)
    optimizer.zero_grad()
    f.backward()
    optimizer.step()
    if i%nPrint==0:
      with np.printoptions(precision=3, suppress=True):
        print("Iteration:",i,"Variable:", z.detach().numpy(),"Loss: %0.6f" % f.item())
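The chain rule that `f.backward()` applies can be reproduced by hand on a small example. The sketch below is illustrative only and makes two assumptions: it uses a 5-element vector instead of 10, and it uses only the quadratic branch of the Huber loss, which is the branch that is active here because every entry of `z - target` lies in $[-1, 1]$ when $\delta = 1$. With the storage convention `J[j, i]` $= \partial z_j / \partial var_i$, the product is written as `J.T @ grad_z`; for softmax the Jacobian is symmetric, so the transpose makes no numerical difference.

```python
import torch

torch.manual_seed(0)
var = torch.randn(5, requires_grad=True)
target = torch.zeros(5)
target[2] = 1.0

# Forward pass: z = softmax(var), f = quadratic branch of the Huber loss
z = torch.softmax(var, dim=0)
f = 0.5 * torch.sum((z - target) ** 2)
f.backward()                       # autograd fills var.grad

# Manual chain rule
grad_z = (z - target).detach()     # gradient of f with respect to z
J = torch.autograd.functional.jacobian(
    lambda v: torch.softmax(v, dim=0), var.detach()
)                                  # J[j, i] = d z_j / d var_i
manual_grad = J.T @ grad_z         # gradient of f with respect to var

print(torch.allclose(var.grad, manual_grad, atol=1e-6))
```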


# Gradient descent with Huber Loss
## The following cell shows how the `gradient_descent` function can be used.
## The cell first creates a target 1-hot vector `y`, in which only the 3rd entry is on.
## It also creates a variable `x` with random initialization and an optimizer.
## The learning rate and momentum have been set to 0.1 and 0.9, respectively.
## Then it calls the `gradient_descent` function.

In [None]:
y = torch.zeros(10)
y[2] = 1.0
print("Target 1-hot vector:",y.numpy())
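x = torch.randn(y.shape, requires_grad=True) # plain tensor; the deprecated Variable wrapper is not needed

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

# pass dim explicitly; calling F.softmax without dim triggers a warning about the implicit dimension choice
gradient_descent(x,optimizer,lambda v: F.softmax(v,dim=0),Huber_Loss,y,1000,100)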


Target 1-hot vector: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
Iteration: 0 Variable: [0.085 0.19  0.367 0.04  0.138 0.032 0.035 0.018 0.039 0.055] Loss: 0.235679
Iteration: 100 Variable: [0.008 0.01  0.946 0.005 0.01  0.004 0.004 0.002 0.005 0.006] Loss: 0.001622
Iteration: 200 Variable: [0.006 0.008 0.959 0.004 0.007 0.003 0.003 0.002 0.004 0.005] Loss: 0.000949
Iteration: 300 Variable: [0.005 0.007 0.966 0.003 0.006 0.003 0.003 0.002 0.003 0.004] Loss: 0.000671
Iteration: 400 Variable: [0.004 0.006 0.97  0.003 0.005 0.002 0.002 0.001 0.003 0.003] Loss: 0.000519
Iteration: 500 Variable: [0.004 0.005 0.973 0.002 0.005 0.002 0.002 0.001 0.002 0.003] Loss: 0.000423
Iteration: 600 Variable: [0.004 0.005 0.975 0.002 0.004 0.002 0.002 0.001 0.002 0.003] Loss: 0.000357
Iteration: 700 Variable: [0.003 0.004 0.977 0.002 0.004 0.002 0.002 0.001 0.002 0.003] Loss: 0.000309


Iteration: 800 Variable: [0.003 0.004 0.978 0.002 0.004 0.002 0.002 0.001 0.002 0.002] Loss: 0.000272
Iteration: 900 Variable: [0.003 0.004 0.979 0.002 0.004 0.002 0.002 0.001 0.002 0.002] Loss: 0.000243
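To make the role of the learning rate and momentum concrete, here is a rough sketch of the update that `torch.optim.SGD` performs with `momentum=0.9` and its default settings (no dampening, no Nesterov, no weight decay). The toy objective $0.5\,\|p\|^2$ and the names `p_ref`, `p_manual` and `velocity` are assumptions made for illustration; they do not appear in the notebook.

```python
import torch

torch.manual_seed(0)
lr, momentum = 0.1, 0.9
p_ref = torch.randn(4, requires_grad=True)   # parameter updated by torch.optim.SGD
p_manual = p_ref.detach().clone()            # same starting point, updated by hand
velocity = torch.zeros_like(p_manual)        # momentum buffer

opt = torch.optim.SGD([p_ref], lr=lr, momentum=momentum)

for _ in range(5):
    # Toy objective f(p) = 0.5 * ||p||^2, whose gradient is p itself
    loss = 0.5 * torch.sum(p_ref ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

    grad = p_manual.clone()                  # analytic gradient of the toy objective
    velocity = momentum * velocity + grad    # accumulate past gradients
    p_manual = p_manual - lr * velocity      # step against the accumulated velocity

print(torch.allclose(p_ref.detach(), p_manual, atol=1e-6))
```

Because the velocity accumulates past gradients, each step moves against a running combination of gradients rather than against the current gradient alone.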


# <font color='red'>30% Weight:</font> In this markdown cell, using [math mode](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html), write the gradient of the Huber loss function $output = \sum_i \left[ 0.5\, m_i\, (input)^{2}_{i} + \delta (1-m_i)(|input_i|-0.5\, \delta) \right]$ with respect to $input.$ Treat $m_i$ as independent of $input_i,$ because $m_i$ replaces the *if* control statement.
## Your solution: $\frac{\partial (output)}{\partial (input)_i} = m_i\, input_i + \delta (1-m_i)\, \mathrm{sign}(input_i),$ where $\mathrm{sign}(input_i) = \frac{input_i}{|input_i|}$ for $input_i \neq 0.$ There is no sum over $i$ on the right-hand side, because only the $i^{th}$ term of the loss depends on $input_i.$
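One way to sanity-check a hand-derived gradient is to compare it with what autograd returns for the same loss. The sketch below assumes the masked formulation of `Huber_Loss` given earlier; the function names `huber_loss` and `huber_grad` are hypothetical stand-ins introduced for this check.

```python
import torch

def huber_loss(x, delta):
    # Same masked formulation as Huber_Loss; the mask is detached from the graph
    m = (torch.abs(x) <= delta).detach().float()
    return torch.sum(0.5 * m * x**2 + delta * (1.0 - m) * (torch.abs(x) - 0.5 * delta))

def huber_grad(x, delta):
    # Analytic gradient: x on the quadratic branch, delta*sign(x) on the linear branch
    m = (torch.abs(x) <= delta).float()
    return m * x + delta * (1.0 - m) * torch.sign(x)

x = torch.tensor([0.3, 2.0, -3.1, 0.5, 9.2, 0.1], requires_grad=True)
(auto_grad,) = torch.autograd.grad(huber_loss(x, 1.0), x)
manual = huber_grad(x.detach(), 1.0)

print(auto_grad)
print(manual)
```

Both prints should show the same values: the input itself where $|input_i| \le \delta$ and $\pm\delta$ elsewhere.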

# <font color='red'>30% Weight:</font> Define your own (correct!) rule of differentiation for the Huber loss function
## Edit the indicated line in the cell below. Use the following formula. Do not use a for/while/any other loop in your solution.
## For this function, the chain rule (Jacobian-vector product) takes the following form: $\frac{\partial (cost)}{\partial (input)_i} = \frac{\partial (output)}{\partial (input)_i} \frac{\partial (cost)}{\partial (output)}.$
## In the `backward` method below, $\frac{\partial (cost)}{\partial (output)}$ is denoted by `output_grad`, and $\frac{\partial (cost)}{\partial (input)_i}$ is the $i^{th}$ component of `input_grad`.

In [None]:
# Inherit from torch.autograd.Function
class My_Huber_Loss(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input, delta):
        m = (torch.abs(input)<=delta).float()
        ctx.save_for_backward(input,m,torch.tensor(delta)) # m is already a tensor, so it is saved directly
        output = torch.sum(0.5*m*input**2 + delta*(1.0-m)*(torch.abs(input)-0.5*delta))
        return output

    @staticmethod
    def backward(ctx, output_grad):
        # retrieve saved tensors and use them in derivative calculation
        input, m, delta = ctx.saved_tensors

        # Return Jacobian-vector product (chain rule)
        # For Huber loss function the Jacobian happens to be a diagonal matrix
        # Also, note that output_grad is a scalar, because forward function returns a scalar value
        # torch.sign(input) equals input/|input| for nonzero input and avoids the 0/0 (NaN) at input == 0
        input_grad = (m*input + delta*(1.0-m)*torch.sign(input))*output_grad # complete this line, do not use for loop
        # must return two gradients because the forward function takes in two arguments (delta needs no gradient)
        return input_grad, None
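The factor `output_grad` matters as soon as the loss is not the final quantity being differentiated. The following sketch, which assumes a compact re-implementation under the hypothetical name `HuberSketch` (with `delta` stored on `ctx` as a plain float, a design choice made only for this example), scales the loss by 3 and checks that the input gradient is scaled by 3 as well:

```python
import torch
from torch.autograd import Function

class HuberSketch(Function):  # hypothetical name, used only for this illustration
    @staticmethod
    def forward(ctx, input, delta):
        m = (torch.abs(input) <= delta).float()
        ctx.save_for_backward(input, m)
        ctx.delta = delta                       # plain Python float kept on ctx
        return torch.sum(0.5 * m * input**2
                         + delta * (1.0 - m) * (torch.abs(input) - 0.5 * delta))

    @staticmethod
    def backward(ctx, output_grad):
        input, m = ctx.saved_tensors
        local_grad = m * input + ctx.delta * (1.0 - m) * torch.sign(input)
        return local_grad * output_grad, None   # None: no gradient for delta

x = torch.tensor([0.3, 2.0, -3.1], requires_grad=True)
cost = 3.0 * HuberSketch.apply(x, 1.0)          # output_grad is 3.0 inside backward
cost.backward()
print(x.grad)
```

If `output_grad` were dropped from `backward`, `x.grad` would come out as the unscaled local gradient, which is wrong for any cost other than the bare loss. In `gradient_descent` the loss is the final quantity, so `output_grad` is 1 and the omission would go unnoticed.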

# Gradient Descent on Your Own Huber Loss
## You should get results almost identical to those above if your rule of differentiation is correct!

In [None]:
y = torch.zeros(10)
y[2] = 1.0
print("Target:",y.numpy())
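x = torch.randn(y.shape, requires_grad=True) # plain tensor; the deprecated Variable wrapper is not needed

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

# pass dim explicitly; calling F.softmax without dim triggers a warning about the implicit dimension choice
gradient_descent(x,optimizer,lambda v: F.softmax(v,dim=0),My_Huber_Loss.apply,y,1000,100)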


Target: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
Iteration: 0 Variable: [0.082 0.045 0.02  0.113 0.052 0.288 0.033 0.324 0.028 0.014] Loss: 0.587482
Iteration: 100 Variable: [0.009 0.007 0.934 0.009 0.008 0.009 0.006 0.009 0.005 0.003] Loss: 0.002432
Iteration: 200 Variable: [0.006 0.005 0.956 0.006 0.005 0.006 0.004 0.006 0.004 0.002] Loss: 0.001101
Iteration: 300 Variable: [0.005 0.004 0.963 0.005 0.004 0.005 0.003 0.005 0.003 0.002] Loss: 0.000747
Iteration: 400 Variable: [0.004 0.004 0.968 0.004 0.004 0.004 0.003 0.004 0.003 0.001] Loss: 0.000566
Iteration: 500 Variable: [0.004 0.003 0.972 0.004 0.003 0.004 0.003 0.004 0.002 0.001] Loss: 0.000455
Iteration: 600 Variable: [0.004 0.003 0.974 0.004 0.003 0.004 0.002 0.004 0.002 0.001] Loss: 0.000380
Iteration: 700 Variable: [0.003 0.003 0.976 0.003 0.003 0.003 0.002 0.003 0.002 0.001] Loss: 0.000326


Iteration: 800 Variable: [0.003 0.003 0.977 0.003 0.003 0.003 0.002 0.003 0.002 0.001] Loss: 0.000286
Iteration: 900 Variable: [0.003 0.002 0.979 0.003 0.003 0.003 0.002 0.003 0.002 0.001] Loss: 0.000254


# <font color='red'>40% Weight:</font> Your own softmax with forward and backward functions
## Edit the indicated line in the cell below. Use the following formula. Do not use a for/while/any other loop in your solution.
## The Jacobian-vector product (chain rule) takes the following form using the summation sign: $\frac{\partial (cost)}{\partial (input)_i} = \sum_j \frac{\partial (output)_j}{\partial (input)_i} \frac{\partial (cost)}{\partial (output)_j}$
## Once again, note that in the `backward` method below, the $i^{th}$ component of `input_grad` and the $j^{th}$ component of `output_grad` are denoted by $\frac{\partial (cost)}{\partial (input)_i}$ and $\frac{\partial (cost)}{\partial (output)_j}$, respectively.
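As a worked step, the sum in the formula above can be simplified using the standard softmax Jacobian. In this sketch, $\delta_{ij}$ denotes the Kronecker delta (1 if $i=j$, 0 otherwise), which is unrelated to the Huber threshold $\delta$ used earlier:

```latex
\frac{\partial (output)_j}{\partial (input)_i}
  = (output)_j \left( \delta_{ij} - (output)_i \right)

\frac{\partial (cost)}{\partial (input)_i}
  = \sum_j (output)_j \left( \delta_{ij} - (output)_i \right)
    \frac{\partial (cost)}{\partial (output)_j}
  = (output)_i \left[ \frac{\partial (cost)}{\partial (output)_i}
    - \sum_j (output)_j \frac{\partial (cost)}{\partial (output)_j} \right]
```

The last form needs only element-wise products and a single sum, so it can be evaluated without any explicit loop and without building the full Jacobian matrix.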

In [None]:
# Inherit from Function
class My_softmax(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input):
        output = F.softmax(input,dim=0)
        ctx.save_for_backward(output) # this is the only tensor you will need to save for backward function
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, output_grad):
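        # retrieve saved tensors and use them in derivative calculation
        output = ctx.saved_tensors[0]
        # return Jacobian-vector product: input_grad_i = output_i*(output_grad_i - sum_j output_j*output_grad_j)
        input_grad = output*(output_grad - torch.sum(output*output_grad))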

        return input_grad
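A custom `backward` can be checked numerically with `torch.autograd.gradcheck`, which compares the analytic Jacobian-vector products against finite differences. The sketch below is one possible check rather than part of the assignment: it assumes a stand-alone class under the hypothetical name `SoftmaxSketch`, and it uses double-precision inputs, which `gradcheck` expects.

```python
import torch
from torch.autograd import Function, gradcheck

class SoftmaxSketch(Function):  # hypothetical name, used only for this check
    @staticmethod
    def forward(ctx, input):
        output = torch.softmax(input, dim=0)
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, output_grad):
        (output,) = ctx.saved_tensors
        # Jacobian-vector product without forming the Jacobian:
        # output * (v - <output, v>), where v = output_grad
        return output * (output_grad - torch.sum(output * output_grad))

torch.manual_seed(0)
x = torch.randn(6, dtype=torch.double, requires_grad=True)  # double precision for gradcheck
ok = gradcheck(SoftmaxSketch.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

`gradcheck` returns `True` when the analytic and numerical gradients agree within tolerance and raises an error otherwise.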

# Gradient Descent on your own Huber Loss and your own softmax

In [None]:
y = torch.zeros(10)
y[2] = 1.0
print(y)
x = torch.randn(y.shape, requires_grad=True) # plain tensor; the deprecated Variable wrapper is not needed
print(x)

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

gradient_descent(x,optimizer,My_softmax.apply,My_Huber_Loss.apply,y,1000,100)


tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
tensor([ 1.1794, -0.6172,  0.3024, -0.2511,  1.0290,  0.7454, -0.5135,  0.0971,
        -0.0446,  0.0690], requires_grad=True)


  


RuntimeError: ignored