# UF Research Computing  

![UF Research Computing Logo](images/ufrc_logo.png)


This tutorial is adapted from: [Understanding PyTorch with an example: a step-by-step tutorial](https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e) by Daniel Godoy.

# Understanding PyTorch: Part 2

## PyTorch

First, we need to cover a **few basic concepts** that may throw you off-balance if you don’t grasp them well enough before going full-force on modeling.

In Deep Learning, we see **tensors** everywhere. Well, Google’s framework is called TensorFlow for a reason! *What is a tensor, anyway?*

### Tensor
In Numpy, you may have an **array** that has **three dimensions**, right? That is, technically speaking, a **tensor**.

A **scalar** (a single number) has **zero** dimensions, a **vector** has **one** dimension, a **matrix** has **two** dimensions and a **tensor** has **three or more** dimensions. That’s it!

But, to keep things simple, it is commonplace to call vectors and matrices tensors as well — so, from now on, **everything is either a scalar or a tensor.**
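As a quick illustration of the dimension counts above, the following sketch builds one tensor of each kind and prints its number of dimensions. The variable names are made up for this example.

```python
import torch

# One example per row of the "scalar / vector / matrix / tensor" ladder.
scalar = torch.tensor(3.14)              # zero dimensions
vector = torch.tensor([1.0, 2.0, 3.0])   # one dimension
matrix = torch.ones(2, 3)                # two dimensions
cube = torch.zeros(2, 3, 4)              # three dimensions

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

Note that PyTorch reports all four as `torch.Tensor`, which is in line with calling everything a tensor.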

### Loading Data, Devices and CUDA

*"How do we go from Numpy’s arrays to PyTorch’s tensors"*, you ask? That’s what [`from_numpy()`](https://pytorch.org/docs/stable/torch.html#torch.from_numpy) is good for. It returns a **CPU tensor**, though.

*"But I want to use my fancy GPU…"*, you say. No worries, that’s what [`to()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.to) is good for. It sends your tensor to whatever **device** you specify, including your **GPU** (referred to as `cuda` or `cuda:0`).

*"What if I want my code to fall back to CPU if no GPU is available?"*, you may be wondering… PyTorch has your back once more: you can use [`cuda.is_available()`](https://pytorch.org/docs/stable/cuda.html?highlight=is_available#torch.cuda.is_available) to find out if you have a GPU at your disposal and set your device accordingly.

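You can also easily **cast** a tensor to a lower precision (32-bit float) using [`float()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.float). This matters here because Numpy creates 64-bit floats by default and `from_numpy()` keeps that type.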
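The four calls just described can be chained on a small array. This is a minimal sketch with illustrative names (`x_np`, `x_t`, `x_f`, `x_dev`), and it should run with or without a GPU.

```python
import numpy as np
import torch

x_np = np.random.rand(5, 1)       # Numpy creates float64 values by default
x_t = torch.from_numpy(x_np)      # CPU tensor, same dtype as the array
x_f = x_t.float()                 # cast to 32-bit float (a new tensor)

# Fall back to the CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x_dev = x_f.to(device)

print(x_t.dtype, x_f.dtype, x_dev.device)

# from_numpy() shares memory with the source array:
# writing to the array is visible through the tensor.
x_np[0, 0] = 42.0
print(x_t[0, 0].item())
```

The shared memory is worth keeping in mind: `x_t` follows changes to `x_np`, while the `float()` copy does not.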

We need to regenerate the data we used in the previous notebook. The code for that is in the script [`generate_data.py`](generate_data.py). 

In [20]:
%run generate_data.py

In [21]:
import random
import numpy as np

# Data Generation
np.random.seed(42) # Comment out for random results.
x = np.random.rand(100, 1)

# y = 1 + 2x + Gaussian noise
y = 1 + 2 * x + .1 * np.random.randn(100, 1)


# Shuffles the indices to split train and validation datasets
idx = np.arange(100)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:80]

# Uses the remaining indices for validation
val_idx = idx[80:]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

In [32]:
import torch
import torch.optim as optim
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays, but we need to transform them into PyTorch's Tensors
# and then we send them to the chosen device
x_train_tensor = torch.from_numpy(x_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)

# Here we can see the difference - notice that .type() is more useful
# since it also tells us WHERE the tensor is (device)
print(f"Using device: {device}")
print(f"Type of x_train: {type(x_train)}")
print(f"Type of x_train_tensor: {type(x_train_tensor)}")
print(f"Type of x_train_tensor with .type: {x_train_tensor.type()}")

Using device: cuda
Type of x_train: <class 'numpy.ndarray'>
Type of x_train_tensor: <class 'torch.Tensor'>
Type of x_train_tensor with .type: torch.cuda.FloatTensor


If you compare the types of both variables, you’ll get what you’d expect: `numpy.ndarray` for the first one and `torch.Tensor` for the second one.

But where does your nice tensor "live"? On your CPU or your GPU? Python’s built-in `type()` can’t say… but PyTorch’s own `type()` method reveals its location: `torch.cuda.FloatTensor`, a GPU tensor in this case.
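Besides `type()`, a tensor also exposes its location and precision as separate attributes. The sketch below uses a throwaway tensor `t` and is an addition to the original tutorial, not part of it.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
t = torch.rand(3, 1, device=device)

print(t.dtype)    # element type, torch.float32 by default
print(t.device)   # where the tensor lives: cpu or cuda:N
print(t.is_cuda)  # True only for GPU tensors
```

Checking `t.device` directly can be handier than parsing the string returned by `type()`.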

We can also go the other way around, turning tensors back into Numpy arrays, using [`numpy()`](https://pytorch.org/docs/stable/tensors.html?highlight=numpy#torch.Tensor.numpy). It should be as easy as `x_train_tensor.numpy()` but…

In [23]:
z = x_train_tensor.numpy()

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Unfortunately, Numpy cannot handle GPU tensors… you need to make them CPU tensors first using `cpu()`.

In [24]:
z = x_train_tensor.cpu().numpy()
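A related pitfall shows up with tensors that track gradients (the `requires_grad=True` flag covered in the next section): `numpy()` refuses those as well, and `detach()` is needed first. A minimal sketch, with `w` as a hypothetical parameter tensor:

```python
import torch

w = torch.randn(3, requires_grad=True)

try:
    w.numpy()                      # fails: the tensor requires grad
except RuntimeError as err:
    print(f"RuntimeError: {err}")

# detach() drops the gradient tracking, cpu() is a no-op for CPU tensors
w_np = w.detach().cpu().numpy()
print(type(w_np), w_np.shape)
```

The `detach().cpu().numpy()` chain works for both data tensors and parameters, on either device.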

## Creating Parameters

What distinguishes a *tensor* used for *data* — like the ones we’ve just created — from a *tensor* used as a (*trainable*) *parameter/weight*?

The latter tensors require the **computation of their gradients**, so we can **update** their values (the parameters' values, that is). That’s what the `requires_grad=True` argument is good for. It tells PyTorch we want it to compute gradients for us.

You may be tempted to create a simple tensor for a parameter and, later on, send it to your chosen device, as we did with our data, right? Not so fast…

In [25]:
# FIRST
# Initializes parameters "a" and "b" randomly, ALMOST as we did in Numpy
# since we want to apply gradient descent on these parameters, we need
# to set REQUIRES_GRAD = TRUE
a = torch.randn(1, requires_grad=True, dtype=torch.float)
b = torch.randn(1, requires_grad=True, dtype=torch.float)
print(a)
print(b)

tensor([0.3367], requires_grad=True)
tensor([0.1288], requires_grad=True)


The first chunk of code creates two nice tensors for our parameters, gradients and all. But they are CPU tensors.

In [26]:
# SECOND
# But what if we want to run it on a GPU? We could just send them to device, right?
a = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)

print(a)
print(b)
# Sorry, but NO! The to(device) "shadows" the gradient..

tensor([0.2345], device='cuda:0', grad_fn=<CopyBackwards>)
tensor([0.2303], device='cuda:0', grad_fn=<CopyBackwards>)


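In the second chunk of code, we tried the **naive** approach of sending them to our GPU. We succeeded in sending them to another device, but we "**lost**" the **gradients** somehow: the printed tensors show `grad_fn=<CopyBackwards>` instead of `requires_grad=True`, because `to(device)` returns a copy that is the result of an operation rather than a parameter we created ourselves.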
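One way to see what happened is the `is_leaf` attribute: `backward()` stores gradients only on leaf tensors, that is, tensors we created directly. The sketch below uses a dtype cast with `to()` as a stand-in for `to(device)` so that it also runs on a machine without a GPU; that substitution is an assumption of this example.

```python
import torch

a = torch.randn(1, requires_grad=True)   # created by us: a leaf tensor
a_copy = a.to(torch.float64)             # result of an operation: not a leaf

print(a.is_leaf, a.grad_fn)              # leaf tensors have no grad_fn
print(a_copy.is_leaf, a_copy.grad_fn)    # the copy records how it was made

# Gradients flow through the copy and land on the original leaf
loss = (a_copy ** 2).sum()
loss.backward()
print(a.grad)
```

In the second chunk the name `a` was rebound to the copy, so the leaf that would receive the gradient was no longer reachable.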

In [27]:
# THIRD
# We can either create regular tensors and send them to the device (as we did with our data)
a = torch.randn(1, dtype=torch.float).to(device)
b = torch.randn(1, dtype=torch.float).to(device)
# and THEN set them as requiring gradients...
a.requires_grad_()
b.requires_grad_()

print(a)
print(b)

tensor([-1.1229], device='cuda:0', requires_grad=True)
tensor([-0.1863], device='cuda:0', requires_grad=True)


In the third chunk, we **first** send our tensors to the **device** and **then** use the [`requires_grad_()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.requires_grad_) method to set their `requires_grad` to `True` in place.

In PyTorch, every method that **ends** with an **underscore (_)** makes changes **in-place**, meaning, they will **modify** the underlying variable.
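The difference between a regular method and its underscore twin can be sketched with `add()` and `add_()`; the tensor names are illustrative.

```python
import torch

t = torch.ones(3)
ptr_before = t.data_ptr()     # address of the underlying memory

u = t.add(1)                  # out-of-place: returns a new tensor
print(t, u)                   # t is unchanged, u holds the result

t.add_(1)                     # in-place: modifies t itself
print(t)
print(t.data_ptr() == ptr_before)   # same memory as before
```
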

Although the last approach worked fine, it is much better to **assign** tensors to a **device** at the moment of their **creation**.


In [28]:
# We can specify the device at the moment of creation - RECOMMENDED!
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

print(a)
print(b)

tensor([0.1940], device='cuda:0', requires_grad=True)
tensor([0.1391], device='cuda:0', requires_grad=True)


Much easier, right?

Now that we know how to create tensors that require gradients, let’s see how PyTorch handles them — that’s the role of the…

## Autograd

Autograd is PyTorch’s *automatic differentiation package*. Thanks to it, we **don’t need to worry** about *partial derivatives*, *chain rule* or anything like it.

So, how do we tell PyTorch to do its thing and **compute all gradients**? That’s what [`backward()`](https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward) is good for.

Do you remember the **starting point** for **computing the gradients**? It was the **loss**, as we computed its partial derivatives w.r.t. our parameters. Hence, we need to invoke the `backward()` method from the corresponding Python variable, like, `loss.backward()`.

What about the **actual values** of the **gradients**? We can inspect them by looking at the [`grad`](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.grad) **attribute** of a tensor.
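To build some trust in `backward()`, the following sketch compares the `grad` attributes with the manual formulas from the Numpy implementation, `-2 * error.mean()` and `-2 * (x * error).mean()`. It uses a small synthetic dataset generated with PyTorch on the CPU, which is an assumption of this example rather than the tutorial's data.

```python
import torch

torch.manual_seed(0)
x = torch.rand(20, 1)
y = 1 + 2 * x + 0.1 * torch.randn(20, 1)

a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

yhat = a + b * x
error = y - yhat
loss = (error ** 2).mean()
loss.backward()               # autograd fills a.grad and b.grad

# Manual gradients of the MSE loss; detach() keeps them out of the graph
a_grad_manual = -2 * error.detach().mean()
b_grad_manual = -2 * (x * error.detach()).mean()

print(a.grad, a_grad_manual)
print(b.grad, b_grad_manual)
```

Both pairs should agree up to floating-point rounding.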

If you check the method’s documentation, it clearly states that **gradients are accumulated**. So, every time we use the **gradients** to **update** the parameters, we need to **zero the gradients afterwards**. And that’s what [`zero_()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.zero_) is good for.
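The accumulation is easy to observe on a toy example. In this sketch the loss is `3 * w`, so each `backward()` call contributes a gradient of 3; the names are made up for illustration.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()        # gradient after one backward pass
print(first)

(3 * w).sum().backward()      # no zeroing in between...
second = w.grad.clone()       # ...so the new gradient is ADDED
print(second)

w.grad.zero_()                # reset in place
print(w.grad)
```

Without the call to `zero_()`, every update in a training loop would use the sum of all gradients computed so far.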

What does the **underscore (_)** at the end of the method name mean? Do you remember? If not, scroll back to the previous section and find out.

So, let’s **ditch** the **manual computation of gradients** and use both `backward()` and `zero_()` methods instead.

That’s it? Well, pretty much… but, **there is always a catch**, and this time it has to do with the **update** of the **parameters**…

In [29]:
lr = 1e-1
n_epochs = 1000

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    # No more manual computation of gradients! 
    # a_grad = -2 * error.mean()
    # b_grad = -2 * (x_train_tensor * error).mean()
    
    # We just tell PyTorch to work its way BACKWARDS from the specified loss!
    loss.backward()
    # Let's check the computed gradients...
    print(f"a.grad is: {a.grad}")
    print(f"b.grad is: {b.grad}")
    
    # What about UPDATING the parameters? Not so fast...
    
    # FIRST ATTEMPT
    # TypeError on the second iteration: the reassigned tensors have grad = None
    a = a - lr * a.grad
    b = b - lr * b.grad
    print(f"a is: {a}")


a.grad is: tensor([-3.3881], device='cuda:0')
b.grad is: tensor([-1.9439], device='cuda:0')
a is: tensor([0.5328], device='cuda:0', grad_fn=<SubBackward0>)
a.grad is: None
b.grad is: None


TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'

In the first attempt, we reuse the update structure from our Numpy code. The first iteration runs, but the second one fails with a `TypeError`.

The printed tensors hint at what is going on: once again we "**lost**" the **gradient** while reassigning the update results to our parameters. The reassigned `a` is a new tensor carrying a `grad_fn`, its `grad` attribute is `None`, and multiplying `lr` by `None` raises the error.
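The effect of the reassignment can be isolated in a few lines. This sketch runs a single update step on a toy loss and stores the result under a different name, `a_new`, so both tensors can be inspected side by side; the name and the loss are assumptions of this example.

```python
import torch

lr = 0.1
a = torch.randn(1, requires_grad=True)
(a ** 2).sum().backward()

print(a.is_leaf, a.grad)              # the parameter: a leaf with a gradient

# Same arithmetic as the first attempt, but kept under a new name
a_new = a - lr * a.grad
print(a_new.is_leaf, a_new.grad_fn)   # a computed tensor, not a parameter
```

Writing `a = a - lr * a.grad` rebinds `a` to that computed tensor, which never receives a `grad` of its own.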

In [30]:
# SECOND ATTEMPT
# RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
a -= lr * a.grad
b -= lr * b.grad        
    

TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'

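We then change it slightly, using a familiar **in-place Python assignment** in our second attempt. And, once again, PyTorch complains about it and raises an error. Run on its own right after the first attempt, the cell fails with the same `TypeError` because `a.grad` is still `None`; inside a fresh training loop, where `a` is still a parameter with a computed gradient, it raises the `RuntimeError` quoted in the comment instead.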
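The `RuntimeError` from the second attempt can be reproduced on a freshly created parameter. This is a minimal sketch with a toy loss; the exact wording of the message may differ between PyTorch versions, so the example only relies on the exception type.

```python
import torch

lr = 0.1
a = torch.randn(1, requires_grad=True)
(a ** 2).sum().backward()     # a.grad is now populated

raised = False
try:
    a -= lr * a.grad          # in-place update of a tensor that requires grad
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

print(raised)
```
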

> *Why?!* It turns out to be a case of "**too much of a good thing**". The culprit is PyTorch's ability to build a **dynamic computation graph** from *every* **Python operation** that involves any **gradient-computing tensor** or **its dependencies**.

> We'll go deeper into the inner workings of the dynamic computation graph in the next section.

So, how do we tell PyTorch to "**back off**" and let us **update our parameters** without messing up with its *fancy dynamic computation graph*? That's what [`torch.no_grad()`](https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad) is good for. It allows us to **perform regular Python operations on tensors, independent of PyTorch’s computation graph.**
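Here is a single update step wrapped in `torch.no_grad()`, reduced to a toy loss `a ** 2` so the numbers are easy to follow. The variable `tmp` is only there to show that results computed inside the block are not tracked.

```python
import torch

lr = 0.1
a = torch.tensor([1.0], requires_grad=True)
(a ** 2).sum().backward()     # d(a^2)/da = 2a = 2.0

with torch.no_grad():
    a -= lr * a.grad          # 1.0 - 0.1 * 2.0 = 0.8, no error this time
    tmp = a * 2               # computed inside the block: not tracked

print(a)
print(a.is_leaf, a.requires_grad, tmp.requires_grad)

a.grad.zero_()                # ready for the next iteration
```

After the block, `a` is still the same trainable parameter, only with an updated value.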

Finally, we managed to successfully run our model and get the **resulting parameters**. Sure enough, they **match** the ones we got in our Numpy-only implementation.

In [31]:
lr = 1e-1
n_epochs = 1000

torch.manual_seed(42)
a = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

for epoch in range(n_epochs):
    yhat = a + b * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    # No more manual computation of gradients! 
    # a_grad = -2 * error.mean()
    # b_grad = -2 * (x_train_tensor * error).mean()
    
    # We just tell PyTorch to work its way BACKWARDS from the specified loss!
    loss.backward()
    # Let's check the computed gradients...
    if epoch % 100 == 0:    # print every 100 epochs
        print(f"Epoch: {epoch}")
        print(f"   a.grad is: {a.grad}")
        print(f"   b.grad is: {b.grad}")
    
    # THIRD ATTEMPT
    # We need to use NO_GRAD to keep the update out of the gradient computation
    # Why is that? It boils down to the DYNAMIC GRAPH that PyTorch uses...
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad
    
    # PyTorch is "clingy" to its computed gradients, we need to tell it to let it go...
    a.grad.zero_()
    b.grad.zero_()
    
print(f"\na is: {a}")
print(f"b is: {b}")

Epoch: 0
   a.grad is: tensor([-3.3881], device='cuda:0')
   b.grad is: tensor([-1.9439], device='cuda:0')
Epoch: 100
   a.grad is: tensor([0.0188], device='cuda:0')
   b.grad is: tensor([-0.0367], device='cuda:0')
Epoch: 200
   a.grad is: tensor([0.0041], device='cuda:0')
   b.grad is: tensor([-0.0080], device='cuda:0')
Epoch: 300
   a.grad is: tensor([0.0009], device='cuda:0')
   b.grad is: tensor([-0.0018], device='cuda:0')
Epoch: 400
   a.grad is: tensor([0.0002], device='cuda:0')
   b.grad is: tensor([-0.0004], device='cuda:0')
Epoch: 500
   a.grad is: tensor([4.2574e-05], device='cuda:0')
   b.grad is: tensor([-8.3295e-05], device='cuda:0')
Epoch: 600
   a.grad is: tensor([9.3249e-06], device='cuda:0')
   b.grad is: tensor([-1.8163e-05], device='cuda:0')
Epoch: 700
   a.grad is: tensor([1.9097e-06], device='cuda:0')
   b.grad is: tensor([-4.1103e-06], device='cuda:0')
Epoch: 800
   a.grad is: tensor([5.1083e-07], device='cuda:0')
   b.grad is: tensor([-8.8313e-07], device='cuda:0')
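As an independent sanity check on the trained parameters, the sketch below repeats the training loop on the CPU and compares the result with a closed-form least-squares fit from `np.linalg.lstsq`. The closed-form comparison is an addition for illustration, not part of the original tutorial, and the data generation mirrors the cell near the top of this notebook.

```python
import numpy as np
import torch

# Same synthetic data as above: y = 1 + 2x + Gaussian noise
np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)
idx = np.arange(100)
np.random.shuffle(idx)
train_idx = idx[:80]
x_train, y_train = x[train_idx], y[train_idx]

# Closed-form least squares for y = a + b * x (design columns: 1, x)
design = np.hstack([np.ones_like(x_train), x_train])
(a_ls, b_ls), *_ = np.linalg.lstsq(design, y_train.ravel(), rcond=None)

# Gradient descent with autograd, as in the third attempt (CPU only here)
x_t = torch.from_numpy(x_train).float()
y_t = torch.from_numpy(y_train).float()
torch.manual_seed(42)
a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
lr = 1e-1

for epoch in range(1000):
    loss = ((y_t - (a + b * x_t)) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad
    a.grad.zero_()
    b.grad.zero_()

print(f"gradient descent: a={a.item():.4f}, b={b.item():.4f}")
print(f"least squares:    a={a_ls:.4f}, b={b_ls:.4f}")
```

The two lines should agree to several decimal places, since the gradients have shrunk to nearly zero by the end of training.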

**[Continue to part 3 of the tutorial: 03_DynamicComputationGraph.ipynb](03_DynamicComputationGraph.ipynb)**