# What in the world is a dynamic computational graph?

Caution: this is not an easy topic, so if it doesn't make sense right away, keep reading about it.

## Forward/Backwards
* Training a NN involves two main steps: the forward pass and the backward pass (backpropagation of gradients).
  * In PyTorch, a custom operation defines both `forward` and `backward` in the same class, a subclass of `torch.autograd.Function`
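
To make the `forward`/`backward` pairing concrete, here is a minimal sketch of a custom operation. The class name `Square` and the input value are made up for illustration; they are not part of this notebook's network.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical op for illustration: y = x * x."""

    @staticmethod
    def forward(ctx, x):
        # Save the input so backward can use it later
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: dL/dx = dL/dy * dy/dx, where dy/dx = 2x
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x


x = torch.tensor([4.], requires_grad=True)
y = Square.apply(x)   # forward pass
y.backward()          # backward pass
print(x.grad)         # dy/dx at x = 4
```

Custom functions are called through `.apply`, and the `backward` method returns one gradient per input of `forward`.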
  


## Let's see an example network (and train it!)

In [1]:
# Do some imports
import torch

# Define the leaf nodes
a = torch.tensor([4.])

# This is just a Python list of tensors
weights = [torch.tensor([i], requires_grad=True) for i in (2., 5., 9., 7.)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

Exercise:  Print the type of a

In [2]:
type(a)

torch.Tensor

## Create the network

Here we'll see the graph created on-the-fly and the forward pass

**Note: static graph frameworks predefine the graph (which then cannot change later) and then run inputs through it**

In [3]:
# IMPORTANT:  When we create b, the graph creation begins!!!

# The next three lines of code (b, c, d creation) are our
# forward pass - when the inputs are processed into output

# BEGIN COMPUTATIONAL GRAPH CREATION (some operations)
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
# END GRAPH CREATION

# This is the loss
L = (10 - d)
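
One way to see that the graph really is recorded as each operation runs is to look at the `grad_fn` attribute of the results. The sketch below rebuilds a small piece of the network above so it can run on its own.

```python
import torch

a = torch.tensor([4.])
w1 = torch.tensor([2.], requires_grad=True)

# Leaf tensors are created by the user, so no operation produced them
print(a.requires_grad, a.grad_fn)
print(w1.is_leaf, w1.grad_fn)

# As soon as an operation involves a tensor that requires gradients,
# the result remembers the operation that created it
b = w1 * a
print(b.is_leaf, b.grad_fn)

# Each node links back to the nodes that feed it
print(b.grad_fn.next_functions)
```

Here `a` and `w1` have no `grad_fn`, while `b` points at the multiplication that produced it. `backward()` later follows exactly these links.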

## Run backprop and check the gradient data

In [4]:
L.backward()

for index, weight in enumerate(weights, start=1):
    gradient = weight.grad.item()
    print("Gradient of L w.r.t. w{}: {}".format(index, gradient))

Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0
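
These numbers can be checked by hand with the chain rule. Using the values above ($a = 4$, $w_1 = 2$, $w_2 = 5$, $w_3 = 9$, $w_4 = 7$) and the definitions $b = w_1 a$, $c = w_2 a$, $d = w_3 b + w_4 c$, $L = 10 - d$:

$$
\begin{aligned}
\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial d}\,\frac{\partial d}{\partial b}\,\frac{\partial b}{\partial w_1} = (-1)\,w_3\,a = -(9)(4) = -36 \\
\frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial d}\,\frac{\partial d}{\partial c}\,\frac{\partial c}{\partial w_2} = (-1)\,w_4\,a = -(7)(4) = -28 \\
\frac{\partial L}{\partial w_3} &= \frac{\partial L}{\partial d}\,\frac{\partial d}{\partial w_3} = (-1)\,b = -(2)(4) = -8 \\
\frac{\partial L}{\partial w_4} &= \frac{\partial L}{\partial d}\,\frac{\partial d}{\partial w_4} = (-1)\,c = -(5)(4) = -20
\end{aligned}
$$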


Exercise:  run the above cell one more time and see what happens

**Remember: PyTorch constructs the computational graph as the forward-pass operations execute, and by default frees it once `backward()` has run. Two things must be done to run over and over:**
  * Clear the gradients
  * Build (and possibly redefine) the network again

Exercise:  re-run the "Create the network" section and then "Run backprop..." section.  Why do the gradients change?  How do you reset the gradients?
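
If you want to check your answers, the sketch below reproduces both effects on a smaller, single-weight version of the network (an assumption made to keep the example short): a second `backward()` on the same graph fails, and gradients from a rebuilt graph are added to the ones already stored.

```python
import torch

a = torch.tensor([4.])
w = torch.tensor([2.], requires_grad=True)

# First forward + backward pass
loss = 10 - w * a
loss.backward()
first = w.grad.clone()
print("after first backward:", first)

# Calling backward again on the same graph fails,
# because the graph's saved values were freed
try:
    loss.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print("second backward on same graph failed:", second_call_failed)

# Rebuilding the graph works, but the new gradient is ADDED to the old one
loss = 10 - w * a
loss.backward()
accumulated = w.grad.clone()
print("after rebuilding without zeroing:", accumulated)

# Zeroing first gives a fresh gradient
w.grad.zero_()
loss = 10 - w * a
loss.backward()
print("after zeroing and rebuilding:", w.grad)
```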

## As you'll see later...but to round this out

Let's update the weights and zero their gradients (we'd do this before running the network again, as would happen in training).

Your update and reset will look like:
```python
# For fun let's say we had a learning rate of 1e-4
learning_rate = 1e-4

with torch.no_grad():
    for w in weights:
        # Take a gradient descent step on each weight
        w -= learning_rate * w.grad

        # Manually zero the gradients after running the backward pass
        w.grad.zero_()
```
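
For comparison, the same update and reset can be handed to one of PyTorch's built-in optimizers. This is only a sketch: `torch.optim.SGD` is not used elsewhere in this walkthrough, and the network is rebuilt here so the block runs on its own.

```python
import torch

a = torch.tensor([4.])
weights = [torch.tensor([i], requires_grad=True) for i in (2., 5., 9., 7.)]
w1, w2, w3, w4 = weights

# Plain stochastic gradient descent with the same learning rate
optimizer = torch.optim.SGD(weights, lr=1e-4)

# Forward pass (builds the graph)
d = w3 * (w1 * a) + w4 * (w2 * a)
L = 10 - d

optimizer.zero_grad()  # clear any old gradients
L.backward()           # compute new gradients
optimizer.step()       # w <- w - lr * w.grad for every weight

print(w1)
```

The manual `torch.no_grad()` version and the optimizer version perform the same arithmetic here; the optimizer just keeps track of the weights for you.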

## Let's put it all together: create, run, backprop, update weights, clear gradients

In [5]:
# Define the leaf nodes
a = torch.tensor([4.])

# This is just a Python list of tensors
weights = [torch.tensor([i], requires_grad=True) for i in (2., 5., 9., 7.)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights


# IMPORTANT:  When we create b, the graph creation begins!!!

# The next three lines of code (b, c, d creation) are our
# forward pass - when the inputs are processed into output

# BEGIN COMPUTATIONAL GRAPH CREATION (some operations)
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
# END GRAPH CREATION

# This is the loss
L = (10 - d)

# Run the backwards propagation of gradients 
# (remember your chain rule for differentiation? Well PyTorch
# takes care of this for you!)
L.backward()

for index, weight in enumerate(weights, start=1):
    gradient = weight.grad.item()
    print("Gradient of L w.r.t. w{}: {}".format(index, gradient))

# For fun let's say we had a learning rate of 1e-4
learning_rate = 1e-4

with torch.no_grad():
    for w in weights:
        # Take a gradient descent step on each weight
        w -= learning_rate * w.grad

        # Manually zero the gradients after running the backward pass
        w.grad.zero_()

Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0


**Now we've done one training step (with our single input, that's also one epoch)!**
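
Training simply repeats those steps. The sketch below wraps them in a loop; the number of steps (3) is an arbitrary choice for illustration. Note that the graph is rebuilt on every iteration, which is exactly what "dynamic" means here.

```python
import torch

a = torch.tensor([4.])
weights = [torch.tensor([i], requires_grad=True) for i in (2., 5., 9., 7.)]
w1, w2, w3, w4 = weights
learning_rate = 1e-4

losses = []
for step in range(3):  # 3 steps, chosen arbitrarily
    # Forward pass: a fresh graph is built every iteration
    b = w1 * a
    c = w2 * a
    d = w3 * b + w4 * c
    L = 10 - d

    # Backward pass
    L.backward()

    # Update the weights and clear the gradients
    with torch.no_grad():
        for w in weights:
            w -= learning_rate * w.grad
            w.grad.zero_()

    losses.append(L.item())
    print("step {}: L = {}".format(step, L.item()))
```

Because `L = 10 - d` has no minimum, it keeps decreasing from step to step; a real loss function would be bounded below.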

## Advantages

* Easier to debug than a static graph (we can modify our graph and easily check variables and gradients)
* Since the network is created when it is run, it can be modified **on-the-fly** (very good for NLP, where input lengths and output lengths may differ, as in machine translation)
* Reminiscent (as you'll see more later) of regular Python and object-oriented programming, so it is closer to what devs know
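
As a rough illustration of the on-the-fly point, the sketch below runs a toy recurrent update over sequences of different lengths with an ordinary Python loop. The functions `run` and `count_nodes`, the single weight `w`, and the `tanh` update are all made up for this example.

```python
import torch

w = torch.tensor([0.5], requires_grad=True)


def run(sequence):
    """Toy recurrent update: one graph segment per sequence element."""
    h = torch.zeros(1)
    for x in sequence:
        h = torch.tanh(w * h + x)
    return h


def count_nodes(fn):
    """Count graph nodes reachable from fn (shared nodes are counted once per path)."""
    if fn is None:
        return 0
    return 1 + sum(count_nodes(child) for child, _ in fn.next_functions)


sizes = []
for seq in ([1.0, 2.0], [1.0, 2.0, 3.0, 4.0]):
    out = run(torch.tensor(seq))
    sizes.append(count_nodes(out.grad_fn))
    out.backward()
    print("length {}: graph nodes visited = {}, grad = {}".format(
        len(seq), sizes[-1], w.grad))
    w.grad.zero_()
```

The longer sequence produces a larger graph, and no graph definition had to change between the two runs.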

## References
1.  [Getting Started with PyTorch Part 1: Understanding how Automatic Differentiation works](https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec) by Ayoosh Kathuria
2.  [PyTorch: Autograd example](https://github.com/jcjohnson/pytorch-examples#pytorch-autograd) by Justin Johnson