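- TensorFlow 1.x computational graphs are **static** (TensorFlow 2.x executes eagerly by default)  
    - The computational graph is defined once, and the same graph is then executed over and over again
    - Static graphs are useful because they can be optimized up front, before any data flows through them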
---    
- PyTorch uses **dynamic** computational graphs
    - Each forward pass defines a new computational graph
    - Dynamic graphs are useful when the model needs to perform a different computation for each data point (e.g. an RNN unrolled over sequences of different lengths)
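
To make the "new graph on every forward pass" point concrete, here is a minimal sketch in which ordinary Python control flow decides how many operations end up in the graph. The weight `w`, the function `forward`, and the rule for choosing `steps` are hypothetical and exist only for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy weight; requires_grad=True asks autograd to track it.
w = torch.randn(4, 4, requires_grad=True)

def forward(x):
    # The number of layer applications depends on the input's values,
    # so the graph recorded during this call can differ between inputs.
    steps = 1 if x.sum().item() > 0 else 3
    h = x
    for _ in range(steps):
        h = torch.tanh(h @ w)
    return h.sum(), steps

out_a, steps_a = forward(torch.ones(2, 4))    # positive sum -> 1 step
out_b, steps_b = forward(-torch.ones(2, 4))   # negative sum -> 3 steps

# Backpropagation runs through whichever graph was just built.
out_b.backward()
print(steps_a, steps_b, w.grad.shape)
```
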
---
The **`nn`** package provides higher-level abstractions over raw computational graphs, much as `Keras` does for TensorFlow
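
As a rough illustration of what "higher-level" means here, the sketch below compares an `nn.Linear` layer with the same affine map written out with raw tensor operations. The sizes (5 inputs, 3 outputs, batch of 2) are arbitrary assumptions chosen to keep the example small.

```python
import torch

torch.manual_seed(0)

# nn.Linear creates and registers its own weight (3 x 5) and bias (3).
layer = torch.nn.Linear(5, 3)
x = torch.randn(2, 5)

# The same affine map written with raw tensor operations.
manual = x @ layer.weight.t() + layer.bias

# Both paths should agree up to floating-point rounding.
print(torch.allclose(layer(x), manual, atol=1e-6))
```

The module bundles the parameters with the computation, so there are no loose weight tensors to create and track by hand.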

In [1]:
# Imports
import torch

In [2]:
# N: batch size, D_in: input dimension, H: hidden dimension, D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

In [3]:
# Create random Tensors to hold the input and the target output.
# Since PyTorch 0.4, Tensors track gradients themselves, so the Variable wrapper is no longer needed;
# requires_grad defaults to False, which is what we want for data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [4]:
# Use nn package to define model as a sequence of layers.
# nn.Sequential is a Module which contains other Modules and applies them in sequence to produce its output
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.ReLU(),
                            torch.nn.Linear(H, D_out))
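
A quick way to see what `nn.Sequential` has assembled is to list its learnable parameters. The sketch below rebuilds the same model so that it runs on its own; `named_parameters()` is a standard `Module` method.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.ReLU(),
                            torch.nn.Linear(H, D_out))

# Sequential names its children by position; ReLU (index 1) has no parameters,
# so only the two Linear layers (indices 0 and 2) show up.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# Total number of learnable scalars in the model.
total = sum(p.numel() for p in model.parameters())
print(total)
```

These are the tensors that `model.parameters()` yields, and therefore the ones a gradient descent step has to update.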

In [5]:
# MSE loss (reduction='sum' -> sums the squared errors instead of dividing by n;
# this replaces the deprecated size_average=False)
loss_fn = torch.nn.MSELoss(reduction='sum')
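
To check what the summed variant computes, the sketch below compares `MSELoss` against a hand-written sum of squared errors, using `reduction='sum'` (the current spelling of `size_average=False`). The small 4 x 3 tensors are arbitrary stand-ins for predictions and targets.

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(4, 3)
y = torch.randn(4, 3)

summed = torch.nn.MSELoss(reduction='sum')(y_pred, y)
averaged = torch.nn.MSELoss(reduction='mean')(y_pred, y)

# 'sum' adds up every squared error; 'mean' divides that total by the element count.
manual_sum = ((y_pred - y) ** 2).sum()
print(torch.allclose(summed, manual_sum))
print(torch.allclose(averaged, manual_sum / y.numel()))
```

Because the summed loss grows with batch size and output dimension, its gradients are larger than those of the averaged loss, which is one reason a small learning rate is paired with it.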

In [6]:
# Computation
learning_rate = 1e-4

for i in range(500):
    # Forward pass: compute predicted y by passing x into the model
    y_pred = model(x)

    # Compute the loss and report it every 50 iterations
    loss = loss_fn(y_pred, y)
    if i % 50 == 0:
        print(i, loss.item())

    # Backward pass: compute the gradient of the loss with respect to all the
    # learnable parameters of the model. Internally, the parameters of each
    # Module are stored as Tensors with requires_grad=True
    loss.backward()

    # Update the weights using gradient descent. The update is wrapped in
    # torch.no_grad() so that autograd does not record it
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

    # Clear the gradients, since backward() accumulates into .grad
    model.zero_grad()

0 655.656799316
50 38.687664032
100 2.36846733093
150 0.239535540342
200 0.0310179758817
250 0.00467509170994
300 0.000779781315941
350 0.000139846699312
400 2.65552826022e-05
450 5.26942449142e-06
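
The manual parameter update in the loop can also be delegated to an optimizer object. The sketch below is one possible variant using `torch.optim.SGD`, not part of the original notebook; without momentum it performs the same `param -= lr * grad` step. The seed and the `losses` list are additions for illustration.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.ReLU(),
                            torch.nn.Linear(H, D_out))
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

losses = []
for i in range(500):
    loss = loss_fn(model(x), y)
    losses.append(loss.item())

    optimizer.zero_grad()   # clear gradients left over from the previous step
    loss.backward()         # populate .grad on every parameter
    optimizer.step()        # plain SGD: param <- param - lr * param.grad

print(losses[0], losses[-1])
```

Swapping in a different update rule then only requires changing the optimizer's constructor, while the loop body stays the same.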
