# 2b. Tensor features

Freely adapted from [Deep Learning with PyTorch, Eli Stevens, Luca Antiga, and Thomas Viehmann](https://www.manning.com/books/deep-learning-with-pytorch)

In [None]:
import torch
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme(style="ticks")

In [None]:
t = torch.tensor([1,2,3,4,5,6])

### NumPy interoperability

In [None]:
t_np = t.numpy()
t = torch.from_numpy(t_np)
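
One detail worth knowing: for CPU tensors, `t.numpy()` and `torch.from_numpy()` do not copy the data but share the underlying memory. The following minimal sketch illustrates this; the variable name `t_back` is only chosen for illustration.

```python
import torch

t = torch.tensor([1, 2, 3, 4, 5, 6])

t_np = t.numpy()                  # NumPy view on the same memory (CPU tensors only)
t_np[0] = 100                     # modify the NumPy array ...
print(t)                          # ... and the tensor sees the change

t_back = torch.from_numpy(t_np)   # again, no copy is made
t_back[1] = 200
print(t_np)                       # the NumPy array sees this change as well
```

If an independent copy is needed, calling `t.clone()` before the conversion is one way to get it.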

### Serialization

* Use `torch.save(t, "path_to_file.t")` to write a tensor to disk and `torch.load("path_to_file.t")` to read it back.
* Alternatively, tensors can be stored in the `hdf5` file format (library: `h5py`).
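
A save/load round trip could look like the sketch below. It writes to a temporary directory so that no files are left behind; the file name `tensor.t` is an arbitrary choice.

```python
import os
import tempfile

import torch

t = torch.tensor([1, 2, 3, 4, 5, 6])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tensor.t")  # hypothetical file name
    torch.save(t, path)                       # serialize the tensor to disk
    t_loaded = torch.load(path)               # read it back

print(torch.equal(t, t_loaded))  # compares shape and values
```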

### GPU

* PyTorch makes it very easy to use one or several GPUs via `torch.device`.

In [None]:
torch.device("cpu")     # CPU (the default device)
torch.device("cuda")    # GPU (the current CUDA device, index 0 by default)
torch.device("cuda:0")  # with several GPUs, select one by index: GPU #0
torch.device("cuda:1")  # GPU #1

* Move a tensor to a device using `.to()`, `.cpu()`, or `.cuda()`:

In [None]:
print(t.to(torch.device("cpu")))
print(t.cpu())

if torch.cuda.is_available():
    print(t.cuda())
    print(t.cuda(0))
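
A common pattern is to pick the device once and then write the rest of the code independently of it. The sketch below shows this pattern; it falls back to the CPU when no GPU is available, and the names `device`, `t_on_device` and `u` are only illustrative.

```python
import torch

# choose the GPU if there is one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.tensor([1, 2, 3, 4, 5, 6])
t_on_device = t.to(device)
print(t_on_device.device)

# tensors can also be created on the target device right away
u = torch.ones(6, dtype=t.dtype, device=device)
print(t_on_device + u)  # both operands live on the same device
```

Operands of an operation generally have to live on the same device, which is why `u` is created on `device` as well.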

## Tensor API

PyTorch ships with a ton of tensor operations... whatever you would like to do, it's probably already implemented in a performant manner.

**PyTorch convention:** a mathematical operation often has an in-place equivalent, named with a trailing underscore `_`. E.g. `t.cos()` and `t.cos_()`

Some examples:

In [None]:
t = torch.tensor(range(10), dtype=torch.float) 

In [None]:
t.cos()

In [None]:
t.log()

In [None]:
t.log_(); t # operates in-place/ mutates tensor
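
To make the difference between the two variants explicit, the following sketch compares `t.cos()` with `t.cos_()`. The helper variables `before`, `out` and `result` are only there for the comparison.

```python
import torch

t = torch.tensor(range(10), dtype=torch.float)
before = t.clone()                 # keep a copy of the original values

out = t.cos()                      # returns a new tensor
print(torch.equal(t, before))      # t itself is unchanged at this point

result = t.cos_()                  # overwrites the values of t
print(torch.equal(t, out))         # t now holds the cosine values
print(result.data_ptr() == t.data_ptr())  # the result uses the same memory as t
```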

## Exercise 6:

Calculate the mean squared error between predictions and target values: 


<center>$\rm mse = \frac{1}{N}\sum_i^N (p_i - t_i)^2$</center>

In [None]:
def mse(p, t):
    return (p - t).pow(2).mean()

In [None]:
from helper import test_mse
test_mse(mse)

### Normalization

The [Group Normalization paper](https://arxiv.org/pdf/1803.08494.pdf) shows a nice figure on how different normalization schemes slice a tensor.

![Figure](../img/group_norm.png)

Choose one scheme and normalize the tensor below accordingly!

In [None]:
N_BATCH_SIZE, C_NUMBER_OF_CHANNELS, H_HEIGHT, W_WIDTH = 32, 3, 64, 64

t = torch.rand(N_BATCH_SIZE, C_NUMBER_OF_CHANNELS, H_HEIGHT, W_WIDTH)
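
One possible solution, in the spirit of Batch Norm, is sketched below: statistics are computed per channel, i.e. over the batch, height and width dimensions. This is only an illustration of the slicing. The value of `eps` is an assumption, and the actual Batch Norm layer differs in details (for example, it has learnable scale and shift parameters).

```python
import torch

N, C, H, W = 32, 3, 64, 64
t = torch.rand(N, C, H, W)

eps = 1e-5  # small constant to avoid division by zero (assumed value)

# Batch Norm slicing: one mean/std per channel, computed over N, H and W
mean = t.mean(dim=(0, 2, 3), keepdim=True)   # shape (1, C, 1, 1)
std = t.std(dim=(0, 2, 3), keepdim=True)     # shape (1, C, 1, 1)

t_normalized = (t - mean) / (std + eps)      # broadcasts over N, H and W

print(t_normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(t_normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```

For the other schemes only the `dim` argument changes, e.g. `dim=(1, 2, 3)` for Layer Norm and `dim=(2, 3)` for Instance Norm.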

## Autograd

Fundamental to optimization is the ability to perform differentiation. PyTorch does this with its **autograd** framework, which we will dive into now.

<div align="center">
    <img src="../img/autograd.svg" width="1200px" alt="Tracking derivatives through the compute graph">
</div>

### Key Concepts

1. Compute graph and chain rule
2. `t.requires_grad_()` and `t.grad`
3. `t.backward()`
4. `param.detach()` and `torch.no_grad()`
5. Zeroing the gradient
    

### Gradient descent by hand...

We want to find the minimum of a quadratic function and show how PyTorch can help us to do so.

In [None]:
# Define a quadratic function and plot it

def second_order_polynomial(x, a, b, c):
    return a*x**2 + b*x + c

def show_sop(x, y):
    fig, ax = plt.subplots(1, figsize=(7,7))
    ax.set_ylabel("$y$", fontsize=20)
    ax.set_xlabel("$x$", fontsize=20)
    ax.plot(x, y, linewidth=4 )
    ax.set_title("$ax^2 + bx + c$", fontsize=24)

a, b, c = 0.5, 1.3, 2.8

x = np.linspace(-10, 10, 100)

show_sop(x, second_order_polynomial(x, a, b, c))

We already know that $\frac{d}{dx} f(x) = \frac{d}{dx} \left(ax^2 + bx + c\right) = 2ax + b$.

Does PyTorch also know that? Let's see.

#### 1. Independent variable

We first need to let PyTorch know that $x$ is our independent variable, i.e. the variable with respect to which we want to differentiate. We do so by specifying that $x$ requires the computation of gradients, using `requires_grad`.

In [None]:
x = torch.tensor([2.5], requires_grad=True)

# or 
x = torch.tensor([2.5])
x.requires_grad_()

#### 2. Perform computations with `x`

Next, we want to compute something with this variable, namely our quadratic function $f(x)$.

In [None]:
y = second_order_polynomial(x, a, b, c)

Due to the `requires_grad` attribute, PyTorch dynamically tracks every computation that involves `x`.
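
This tracking can be inspected directly on the tensors. The sketch below repeats the computation with the polynomial written out inline and looks at the attributes `is_leaf`, `requires_grad` and `grad_fn`.

```python
import torch

a, b, c = 0.5, 1.3, 2.8
x = torch.tensor([2.5], requires_grad=True)
y = a * x**2 + b * x + c

print(x.is_leaf)         # x was created by us: it is a leaf of the graph
print(y.requires_grad)   # y depends on x, so it requires gradients too
print(y.grad_fn)         # the graph node that produced y
```

The `grad_fn` of `y` is the entry point that `backward()` later uses to walk back through the graph towards `x`.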

#### 3. Compute the gradients

Now we wish to compute the gradients. This is simply done by calling `backward()` on $y$. The gradients can then be found in the `x.grad` attribute.

In [None]:
print(x.grad)

y.backward()

print(x.grad)

#### 4. Check agreement

Let's also check with the expected result:

In [None]:
assert torch.allclose(x.grad, (2*a*x + b).detach())  # compare floats with a tolerance

#### 5. Repeat

##### **Parameter update**

* Notice that we haven't yet found a value of `x` where $f(x)$ is minimal.
* But the gradient descent algorithm at least tells us in which direction we should continue our search.
* Since the gradient is positive, we know that `f(x)` keeps growing in the positive `x` direction. Hence, we should choose a smaller value for `x`.
* However, if we now operate on `x` in order to reduce its value, PyTorch would try to track this operation as well (an in-place update of a leaf tensor that requires gradients even raises an error). To avoid this, we can ask PyTorch to operate on `x` without tracking the operation.


In [None]:
with torch.no_grad():
    x -= 1. # just guessed some value
print(x.requires_grad) # still requires grad

**Side note:** Sometimes you need a tensor that no longer tracks gradients at all. In this case, use `x.detach()`:

```python
some_other_thing = x.detach()
assert not some_other_thing.requires_grad
```

##### **Zeroing the gradient**

Notice that `x` still has a gradient:

In [None]:
print(x.grad)

Every time we call `backward` on some `y(x)`, the gradients are accumulated in `x.grad`. This is helpful if, for example, we want to compute gradients across multiple GPUs...

But for now that is not what we want to do. Instead, we want to compute the gradient for a new value of `x`. So we should reset `x.grad`:

In [None]:
with torch.no_grad():
    x.grad.zero_()
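
The accumulation and the reset can be observed in isolation with the small sketch below. It uses the function $0.5x^2$, whose derivative is simply $x$; the variables `first` and `second` only store intermediate gradients for inspection.

```python
import torch

x = torch.tensor([2.5], requires_grad=True)

(0.5 * x**2).backward()      # derivative is x, i.e. 2.5
first = x.grad.clone()

(0.5 * x**2).backward()      # a second call adds another 2.5
second = x.grad.clone()

x.grad.zero_()               # reset before the next computation
print(first, second, x.grad)
```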

##### **Next**
Now, let's go back to step 2 and see if we are closer to the minimum.
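
Repeating steps 2 to 5 in a loop could look like the following sketch. Instead of guessing the step, it moves `x` against the gradient, scaled by a learning rate. The values of `lr` and `n_steps` are assumptions chosen for this example.

```python
import torch

a, b, c = 0.5, 1.3, 2.8
x = torch.tensor([2.5], requires_grad=True)

lr = 0.1        # step size (assumed value)
n_steps = 200   # number of iterations (assumed value)

for step in range(n_steps):
    y = a * x**2 + b * x + c   # step 2: forward computation
    y.backward()               # step 3: compute dy/dx
    with torch.no_grad():
        x -= lr * x.grad       # parameter update, not tracked
    x.grad.zero_()             # reset the accumulated gradient

print(x.item())        # numerical result
print(-b / (2 * a))    # analytic minimum, where 2ax + b = 0
```

The analytic minimum follows from setting the derivative $2ax + b$ to zero, which gives $x = -b/(2a)$.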

#### Summary

1. Use `requires_grad` to let PyTorch know which variable you want to differentiate with respect to.
2. Now every operation on `x` is tracked in order to dynamically build the compute graph involving `x`.
3. Use `y.backward()` to compute the gradient of `y` using the chain rule. This works because the Tensor framework implements a `forward` and `backward` operation for each computational operation. This includes overloaded operators such as `a.__mul__(b)`.
4. Keep operations on `x` out of the compute graph if they are not required for the computation of gradients. Use `x.detach()` or `torch.no_grad()`.
5. Each call to `y.backward()` will accumulate gradients in the leaves of the graph. Make sure to zero the gradients after a parameter update.

These are the essential steps to computing gradients with PyTorch. We will later discover PyTorch's higher-level API, which makes these steps more user-friendly.

## Exercise 7: Least squares fit for a linear function

Find the set of parameters `m, b` for a linear model $f(x) = mx + b$ that best fits the data.

To do so, you will have to:
1. Decide which variables require gradients (the parameters you want to optimize).
2. Calculate the mean squared error.
3. Calculate the gradient of the mse with respect to these parameters.
4. Perform a parameter update.
5. Iterate until some stopping condition.
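
For step 3 it can be reassuring to check autograd against the gradients derived by hand, $\frac{\partial\,\mathrm{mse}}{\partial m} = \frac{2}{N}\sum_i (m x_i + b - d_i)\,x_i$ and $\frac{\partial\,\mathrm{mse}}{\partial b} = \frac{2}{N}\sum_i (m x_i + b - d_i)$, where $d_i$ denotes the data. The sketch below does this with a locally defined, hypothetical `line` function and noise-free data, so it does not rely on the `helper` module.

```python
import torch

def line(x, params):
    # hypothetical stand-in for a linear model: params = [m, b]
    m, b = params
    return m * x + b

x = torch.arange(10, dtype=torch.float)
data = line(x, torch.tensor([3.4, -0.8]))          # noise-free targets

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (line(x, params) - data).pow(2).mean()      # mean squared error
loss.backward()                                    # gradients via autograd

with torch.no_grad():
    residual = line(x, params) - data
    manual_grad = torch.stack([
        (2 * residual * x).mean(),   # d mse / d m
        (2 * residual).mean(),       # d mse / d b
    ])

print(params.grad)
print(manual_grad)
```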

To help you with these tasks, some functions and the training loop are already set up for you.

In [None]:
from helper import linear_model, noise, mse, show_fit

In [None]:
# Prepare the data
initial_parmas = torch.tensor([1., 0.])
target_params = torch.tensor([3.4, -0.8])

x = torch.tensor(range(10))
data = linear_model(x, target_params) + noise(x)

assert mse(linear_model(x, initial_parmas), data) > 100

show_fit(x, linear_model(x, initial_parmas), data)

In [None]:
# Run training loop
lr = 0.01
n_epochs = 10
initial_parmas = torch.tensor([1., 0.], requires_grad=True)

for epoch in range(n_epochs):
    # calculate loss
    loss = mse(linear_model(x, initial_parmas), data)
    print(f"Loss at epoch [{epoch}]: [{loss.item()}]")
    
    # calculate gradients / propagate error
    loss.backward()
    
    # update weights
    with torch.no_grad():
        initial_parmas -= lr * initial_parmas.grad  # `.data` is not needed inside `no_grad`
        initial_parmas.grad.zero_()  # reset, otherwise gradients accumulate across epochs
    
with torch.no_grad():
    show_fit(x, linear_model(x, initial_parmas), data)