# PyTorch Basics

PyTorch is a library for creating neural networks in Python.

This notebook draws heavily from the following sources:
* PyTorch's official [Tensors notebook](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html)
* Phillip Lippe's [Intro to PyTorch notebook](https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial2/Introduction_to_PyTorch.ipynb)

## Overview of the library
* [`torch`](https://pytorch.org/docs/stable/torch.html) The top-level PyTorch package that provides an entry point to all PyTorch modules including the Tensor object.  
* [`torch.nn`](https://pytorch.org/docs/stable/nn.html)  A subpackage that contains modules and classes for building neural networks.
* [`torch.utils.data`](https://pytorch.org/docs/stable/data.html) A subpackage that provides tools for working with data. 
* [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) A subpackage that provides support for training on multiple GPUs and multiple nodes.
* [`torch.autograd`](https://pytorch.org/docs/stable/autograd.html) A package that provides automatic differentiation for all operations on Tensors.
* [`torchvision`](https://pytorch.org/vision/stable/index.html) A package that provides access to popular datasets, model architectures, and image transformations for computer vision.

## Tensors
Tensors are a specialized data structure, very similar to arrays and matrices.
In PyTorch, we use tensors to store:
1. model inputs
2. model outputs
3. model parameters.

Tensors can have _many_ dimensions (at least 10,000 in version 2.0 -- I checked).

Tensors are similar to [NumPy’s](https://numpy.org/) ndarrays, except that tensors can run on GPUs or other hardware accelerators. Tensors are also optimized for automatic differentiation. If you’re familiar with numpy arrays, you’ll be right at home with the Tensor API.

To start working with tensors, we import the top-level PyTorch package:

In [1]:
import torch

### Initializing a Tensor

Tensors can be initialized in various ways. Take a look at the following examples:

**A range of values**

In [2]:
torch.arange(10)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

**Directly from data**

Tensors can be created directly from data. The data type is automatically inferred.

In [3]:
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)
x_data

tensor([[1, 2],
        [3, 4]])

**From/To a NumPy array**

Tensors can be created from NumPy arrays with `torch.from_numpy`, and converted back to NumPy arrays with `.numpy()`.



In [4]:
import numpy as np
np_arr = np.array([[1, 2], [3, 4]])
tensor = torch.from_numpy(np_arr)
np_arr_2 = tensor.numpy()

print("Numpy array:\n", np_arr)
print("PyTorch tensor:\n", tensor)
print("Numpy array 2:\n", np_arr_2)

Numpy array:
 [[1 2]
 [3 4]]
PyTorch tensor:
 tensor([[1, 2],
        [3, 4]])
Numpy array 2:
 [[1 2]
 [3 4]]
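One detail worth knowing: for CPU tensors, `torch.from_numpy` and `.numpy()` do not copy the data, so the array and the tensor share the same memory, whereas `torch.tensor` makes a copy. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import torch

shared_arr = np.zeros(3)
shared_tensor = torch.from_numpy(shared_arr)  # no copy: both objects use the same buffer

shared_arr[0] = 7.0      # modify the NumPy array in place...
print(shared_tensor)     # ...and the tensor sees the change

copied_tensor = torch.tensor(shared_arr)  # torch.tensor copies the data
shared_arr[1] = 5.0
print(copied_tensor)     # unaffected by the second modification
```

If you need an independent tensor, copy explicitly with `torch.tensor(...)` or `.clone()`.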


**From another tensor:**

The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.



In [5]:
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

Ones Tensor: 
 tensor([[1, 1],
        [1, 1]]) 

Random Tensor: 
 tensor([[0.5320, 0.7685],
        [0.7713, 0.9187]]) 



**With random or constant values:**

``shape`` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.



In [6]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

Random Tensor: 
 tensor([[0.3585, 0.2870, 0.5259],
        [0.1348, 0.2001, 0.1976]]) 

Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]])


### Attributes of a Tensor

Tensor attributes describe their shape, datatype, and the device on which they are stored.



In [7]:
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


### Operations on Tensors

Over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing,
indexing, slicing), sampling and more are
comprehensively described [here](https://pytorch.org/docs/stable/torch.html).

**Indexing**

We often need to select part of a tensor. Indexing works just like in NumPy, so let's try it:

In [8]:
x = torch.arange(12).view(3, 4)
x

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [9]:
x[:, 1]   # Second column

tensor([1, 5, 9])

In [10]:
x[0]      # First row

tensor([0, 1, 2, 3])

In [11]:
x[:2, -1] # First two rows, last column

tensor([3, 7])

In [12]:
x[1:3] # Middle two rows

tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
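As in NumPy, you can also index with a boolean mask or with a tensor of integer indices. A short sketch, using a separate tensor `x_demo` (an illustrative name) so the notebook's `x` is left untouched:

```python
import torch

x_demo = torch.arange(12).view(3, 4)

mask = x_demo > 6              # boolean tensor with the same shape as x_demo
print(x_demo[mask])            # 1-D tensor holding the selected elements

rows = torch.tensor([0, 2])    # integer index tensor: first and last rows
print(x_demo[rows])
```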

**Adding/removing indices**

We often need to add empty indices, i.e. dimensions of size 1.

In [13]:
x[None].shape # add new index at front

torch.Size([1, 3, 4])

In [14]:
x[:, None].shape # in the 2nd position

torch.Size([3, 1, 4])

In [15]:
x[..., None].shape # in the last position

torch.Size([3, 4, 1])

`unsqueeze` accomplishes the same thing:

In [16]:
print(x.unsqueeze(0).shape)
print(x.unsqueeze(1).shape)
print(x.unsqueeze(-1).shape)

torch.Size([1, 3, 4])
torch.Size([3, 1, 4])
torch.Size([3, 4, 1])


We can remove empty indices as well.

In [17]:
x = torch.randn(1,3,4)
x.shape

torch.Size([1, 3, 4])

In [18]:
x[0].shape

torch.Size([3, 4])

In [19]:
x.squeeze().shape

torch.Size([3, 4])
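Called without arguments, `squeeze` removes every dimension of size 1. Passing a dimension restricts it to that one position, which is safer when a "real" dimension might happen to have size 1. A small sketch (`t` is an illustrative tensor):

```python
import torch

t = torch.randn(1, 3, 1)

print(t.squeeze().shape)    # all size-1 dimensions removed
print(t.squeeze(0).shape)   # only dimension 0 removed
print(t.squeeze(1).shape)   # dimension 1 has size 3, so nothing changes
```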

**Changing the shape**

There are many ways to change the shape of a tensor.

In [20]:
x = torch.randn(2,3)
x, x.shape

(tensor([[-1.5750, -0.5173, -1.2813],
         [-1.5837,  0.9553, -0.2438]]),
 torch.Size([2, 3]))

In [21]:
x.T, x.T.shape  # transpose

(tensor([[-1.5750, -1.5837],
         [-0.5173,  0.9553],
         [-1.2813, -0.2438]]),
 torch.Size([3, 2]))

In [22]:
x.reshape(3,2)

tensor([[-1.5750, -0.5173],
        [-1.2813, -1.5837],
        [ 0.9553, -0.2438]])

---

**Question**: How would we create a tensor with shape `torch.Size([6,1])`?

---

Shapes must be compatible: the new shape must contain the same total number of elements as the original tensor.

In [23]:
try:
    x.reshape(2,6)
except RuntimeError as e:
    print(e)

shape '[2, 6]' is invalid for input of size 6
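You can let PyTorch work out one of the sizes by passing `-1` for that dimension; it is inferred from the total number of elements. A brief sketch (`t` is an illustrative tensor):

```python
import torch

t = torch.arange(24)

print(t.reshape(4, -1).shape)      # -1 is inferred as 6
print(t.reshape(-1, 2, 3).shape)   # -1 is inferred as 4

try:
    t.reshape(5, -1)               # 24 is not divisible by 5
except RuntimeError as e:
    print(e)
```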


`permute` allows us to rearrange indices more flexibly:

In [24]:
# create a tensor to play with
y = torch.arange(24).reshape(2,3,4)
y, y.shape

(tensor([[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11]],
 
         [[12, 13, 14, 15],
          [16, 17, 18, 19],
          [20, 21, 22, 23]]]),
 torch.Size([2, 3, 4]))

In [25]:
y.permute(1, 0, 2).shape

torch.Size([3, 2, 4])

In [26]:
y.permute(0,2,1).shape

torch.Size([2, 4, 3])
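Note that `permute` does not move any data; it returns a view of the same storage with a different memory layout. One consequence is that `view` can fail on a permuted tensor, while `reshape` falls back to copying. A sketch under those assumptions (`y_demo`, `p`, and `flat` are illustrative names):

```python
import torch

y_demo = torch.arange(24).reshape(2, 3, 4)
p = y_demo.permute(0, 2, 1)     # same storage, different strides

print(p.is_contiguous())        # the permuted view is not contiguous in memory

try:
    p.view(-1)                  # view requires a compatible memory layout
except RuntimeError as e:
    print("view failed:", e)

flat = p.reshape(-1)            # reshape copies the data when a view is impossible
print(flat[:6])
```

Calling `p.contiguous()` first is another way to obtain a tensor that `view` accepts.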

**Joining tensors** 

You can use ``torch.cat`` to concatenate a sequence of tensors along a given dimension.
See also [torch.stack](https://pytorch.org/docs/stable/generated/torch.stack.html),
another tensor joining option that is subtly different from ``torch.cat``.

In [27]:
tensor = torch.arange(12).reshape(3,4)
tensor

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [28]:
torch.cat([tensor, tensor, tensor], dim=0)

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [29]:
torch.cat([tensor, tensor, tensor], dim=1)

tensor([[ 0,  1,  2,  3,  0,  1,  2,  3,  0,  1,  2,  3],
        [ 4,  5,  6,  7,  4,  5,  6,  7,  4,  5,  6,  7],
        [ 8,  9, 10, 11,  8,  9, 10, 11,  8,  9, 10, 11]])

Sometimes, you want to create a new dimension when combining:

In [30]:
torch.stack([tensor, tensor, tensor])

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]]])
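The difference between the two is easiest to see in the shapes: `torch.cat` grows an existing dimension, while `torch.stack` inserts a new one. Stacking is equivalent to adding a size-1 dimension to each tensor and then concatenating along it. A short sketch (`t` is an illustrative tensor):

```python
import torch

t = torch.arange(12).reshape(3, 4)

stacked = torch.stack([t, t])           # new leading dimension
catted = torch.cat([t[None], t[None]])  # add the dimension first, then concatenate

print(stacked.shape, catted.shape)
print(torch.equal(stacked, catted))

print(torch.stack([t, t], dim=1).shape)  # the new dimension can go elsewhere
print(torch.stack([t, t], dim=2).shape)
```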

**Arithmetic operations**



In [31]:
# This computes the matrix multiplication between two tensors. y1 and y2 will have the same value
# ``tensor.T`` returns the transpose of a tensor
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y1, y2, y1-y2

(tensor([[ 14,  38,  62],
         [ 38, 126, 214],
         [ 62, 214, 366]]),
 tensor([[ 14,  38,  62],
         [ 38, 126, 214],
         [ 62, 214, 366]]),
 tensor([[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]))

In [32]:
# This computes the element-wise product. z1 and z2 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z1, z2, z1-z2

(tensor([[  0,   1,   4,   9],
         [ 16,  25,  36,  49],
         [ 64,  81, 100, 121]]),
 tensor([[  0,   1,   4,   9],
         [ 16,  25,  36,  49],
         [ 64,  81, 100, 121]]),
 tensor([[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]))

**Aggregations**

In [33]:
x = torch.arange(12, dtype=torch.float32).reshape(3,4)
x

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

In [34]:
# sum, mean, std, etc.
x.sum(), x.mean(), x.std()

(tensor(66.), tensor(5.5000), tensor(3.6056))

In [35]:
# sum along first axis
print(x.sum(axis=0))

# or second
x.sum(axis=1)

tensor([12., 15., 18., 21.])


tensor([ 6., 22., 38.])
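Reductions drop the dimension they aggregate over. Passing `keepdim=True` keeps it as a dimension of size 1, which is convenient when the result is combined with the original tensor. A minimal sketch (PyTorch's native argument name is `dim`; `axis` is accepted as an alias; `x_agg`, `row_sums`, and `normalized` are illustrative names):

```python
import torch

x_agg = torch.arange(12, dtype=torch.float32).reshape(3, 4)

row_sums = x_agg.sum(dim=1, keepdim=True)   # shape (3, 1) instead of (3,)
print(row_sums.shape)

normalized = x_agg / row_sums               # each row is divided by its own sum
print(normalized.sum(dim=1))                # every row now sums to 1
```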

## GPU support

The Palmetto cluster has [many GPU compute nodes](https://docs.rcd.clemson.edu/palmetto/compute/hardware#infiniband-phases). A crucial feature of PyTorch is its support for GPUs (Graphics Processing Units). A GPU can perform many thousands of small operations in parallel, making it well suited to the large matrix operations in neural networks. _You do not need to know anything about GPU programming to use PyTorch on the GPU!_

When comparing GPUs to CPUs, we can list the following main differences (credit: [Kevin Krewell, 2009](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/)) 

<left style="width: 100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial2/comparison_CPU_GPU.png?raw=1" width="700px"></left>

CPUs and GPUs each have different advantages and disadvantages, which is why many computers contain both components and use them for different tasks. If you are not familiar with GPUs, you can read more details in this [NVIDIA blog post](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/) or [here](https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html).



GPUs can accelerate the training of your network by up to a factor of $100$, which is essential for large neural networks. PyTorch implements a lot of functionality for supporting GPUs (mostly those of NVIDIA, due to the libraries [CUDA](https://developer.nvidia.com/cuda-zone) and [cuDNN](https://developer.nvidia.com/cudnn)). First, let's check whether you have a GPU available:

In [36]:
gpu_avail = torch.cuda.is_available()
print(f"Is the GPU available? {gpu_avail}")

Is the GPU available? True


You can get information about your GPU usage by opening a terminal and running the `nvidia-smi` command.

In [37]:
!nvidia-smi

Thu Mar 23 13:32:30 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |    130MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

By default, all tensors you create are stored on the CPU. We can push a tensor to the GPU by using the function `.to(...)`, or `.cuda()`. However, it is often a good practice to define a `device` object in your code which points to the GPU if you have one, and otherwise to the CPU. Then, you can write your code with respect to this device object, and it allows you to run the same code on both a CPU-only system, and one with a GPU. Let's try it below. We can specify the device as follows: 

In [38]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)

Device cuda


Let's create a large tensor and push it to the device.

In [39]:
x = torch.randn(1000, 1000, 1000).to(device) 
x.dtype, x.device

(torch.float32, device(type='cuda', index=0))

---

**Question**: Can you estimate how much VRAM this tensor should occupy?

---

In [40]:
!nvidia-smi

Thu Mar 23 13:32:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0    38W / 250W |   4721MiB / 16384MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
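The memory footprint of a tensor is the number of elements times the bytes per element. A small sketch of that arithmetic (the names `small`, `bytes_small`, and `bytes_large` are illustrative, and the tensor is kept tiny so the cell runs without a GPU):

```python
import torch

small = torch.randn(10, 10, 10)   # small float32 stand-in
bytes_small = small.nelement() * small.element_size()
print(bytes_small)                # 1000 elements * 4 bytes each

# Same arithmetic for a 1000 x 1000 x 1000 float32 tensor, without allocating it
bytes_large = 1000**3 * small.element_size()
print(f"{bytes_large / 2**20:.0f} MiB")
```

The increase reported by `nvidia-smi` is somewhat larger than this estimate. A plausible explanation is that the first GPU allocation also sets up the CUDA context, but the truncated output above does not confirm this.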

In [41]:
del x

# the gpu memory may not be freed right away
# we can explicitly free
torch.cuda.empty_cache()

In [42]:
!nvidia-smi

Thu Mar 23 13:32:39 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0    41W / 250W |    905MiB / 16384MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

**Tensors must be on the same device**

In [43]:
a = torch.randn(3,3, device = torch.device('cpu')) # the default
b = torch.randn(3,3, device = torch.device('cuda')) # gpu

try:
    print(a+b)
except RuntimeError as e:
    print(e)

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
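The fix is to move one of the tensors so that both live on the same device before combining them. A minimal sketch that falls back to the CPU when no GPU is available (`a_cpu`, `b_dev`, and `total` are illustrative names):

```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

a_cpu = torch.randn(3, 3)                  # created on the CPU
b_dev = torch.randn(3, 3, device=device)   # created on the GPU when one is available

total = a_cpu.to(b_dev.device) + b_dev     # move a_cpu to wherever b_dev lives, then add
print(total.device)
```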


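**Out-of-memory errors will happen to you a lot**

If a tensor does not fit into the GPU's memory, PyTorch raises an out-of-memory error. The solution: find ways to use less memory.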

In [44]:
try:
    # tensor is too big to fit on our GPU!
    x = torch.randn(int(3e10), device=torch.device('cuda'))
except Exception as e:
    print(e)

CUDA out of memory. Tried to allocate 111.76 GiB (GPU 0; 15.78 GiB total capacity; 512 bytes already allocated; 14.90 GiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
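The 111.76 GiB in the error message follows directly from the tensor size: 3e10 elements at 4 bytes each. The sketch below reproduces that number and shows, as one example of using less memory, what the same tensor would need in half precision (`n_elements` and `sizes_gib` are illustrative names; nothing is allocated on the GPU):

```python
import torch

n_elements = int(3e10)  # number of elements requested in the cell above

sizes_gib = {}
for dtype in (torch.float32, torch.float16):
    bytes_per_element = torch.empty(0, dtype=dtype).element_size()
    sizes_gib[dtype] = n_elements * bytes_per_element / 2**30
    print(f"{dtype}: {sizes_gib[dtype]:.2f} GiB")
```

Even at half precision this tensor would not fit on a 16 GiB card, so in practice you would also need to reduce the number of elements, for example by processing the data in smaller batches.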


**How much does the GPU actually help?**

Let's test by multiplying a large matrix with itself. 

In [45]:
import time

x = torch.randn(10000, 10000)

## CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")

## GPU version
x = x.to(device)
_ = torch.matmul(x, x)  # First operation to 'burn in' GPU
# CUDA is asynchronous, so we need to use different timing functions
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
_ = torch.matmul(x, x)
end.record()
torch.cuda.synchronize()  # Waits for everything to finish running on the GPU
print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds

CPU time: 1.46176s
GPU time: 0.15011s
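An alternative to CUDA events is to use an ordinary wall-clock timer and call `torch.cuda.synchronize()` before reading the clock, so that queued GPU work is included in the measurement. A rough sketch, where `timed_matmul` is a hypothetical helper written for this example and not part of PyTorch:

```python
import time
import torch

def timed_matmul(x, repeats=3):
    """Hypothetical helper: average wall-clock seconds for x @ x on x's device."""
    if x.is_cuda:
        torch.cuda.synchronize()   # make sure earlier GPU work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        _ = x @ x
    if x.is_cuda:
        torch.cuda.synchronize()   # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

x_small = torch.randn(500, 500)
print(f"CPU: {timed_matmul(x_small):.5f}s")
if torch.cuda.is_available():
    x_gpu = x_small.cuda()
    _ = timed_matmul(x_gpu, repeats=1)   # warm-up run
    print(f"GPU: {timed_matmul(x_gpu):.5f}s")
```

Averaging over several repeats smooths out run-to-run noise; the matrix here is kept small so the cell finishes quickly on a CPU-only machine.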


## Dynamic Computation Graph and Backpropagation

One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get _gradients/derivatives_ of functions that we define. We will mainly use PyTorch for implementing neural networks, and they are just fancy functions. If we use weight matrices in our function that we want to learn, then those are called the _parameters_ or simply the _weights_. The ability to compute gradients is essential for optimizing (a.k.a. _training_) our networks. 

Tensors have a `requires_grad` attribute, which is `False` by default:

In [46]:
x = torch.ones((3,))
print(x.requires_grad)

False


We can change this for an existing tensor using the function `requires_grad_()` (the trailing underscore indicates an in-place operation). Alternatively, when creating a tensor, you can pass the argument `requires_grad=True` to most of the initializers we have seen above.

In [47]:
x.requires_grad_(True)
print(x.requires_grad)

True
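Tensors computed from a tensor that requires gradients are tracked as well. When you do not need gradients (for example during evaluation), tracking can be switched off with `torch.no_grad()` or by calling `.detach()`. A short sketch (`w`, `out`, and `out_no_grad` are illustrative names):

```python
import torch

w = torch.ones(3, requires_grad=True)   # set at creation time

out = w * 2
print(out.requires_grad)                # results of tracked operations are tracked too

with torch.no_grad():                   # temporarily disable graph building
    out_no_grad = w * 2
print(out_no_grad.requires_grad)

print(w.detach().requires_grad)         # detach() returns an untracked tensor
```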


In order to get familiar with the concept of a computation graph, we will create one for the following function:

$$y = \frac{1}{|x|}\sum_i \left[(x_i + 2)^2 + 3\right]$$

You could imagine that $x$ are our parameters, and we want to optimize (either maximize or minimize) the output $y$. For this, we want to obtain the gradients $\partial y / \partial \mathbf{x}$. For our example, we'll use $\mathbf{x}=[0,1,2]$ as our input.

In [48]:
x = torch.arange(3, dtype=torch.float32, requires_grad=True) # Only floating-point (and complex) tensors can require gradients
print("X", x)

X tensor([0., 1., 2.], requires_grad=True)


Now let's build the computation graph step by step. You can combine multiple operations in a single line, but we will separate them here to get a better understanding of how each operation is added to the computation graph.

In [49]:
a = x + 2
b = a ** 2
c = b + 3
y = c.mean()
print("Y", y)

Y tensor(12.6667, grad_fn=<MeanBackward0>)


Using the statements above, we have created a computation graph that looks similar to the figure below:

<center style="width: 100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial2/pytorch_computation_graph.svg?raw=1" width="200px"></center>

We calculate $a$ based on the inputs $x$ and the constant $2$, $b$ is $a$ squared, and so on. The visualization is an abstraction of the dependencies between inputs and outputs of the operations we have applied.
Each node of the computation graph has automatically defined a function for calculating the gradients with respect to its inputs, `grad_fn`. You can see this when we printed the output tensor $y$. This is why the computation graph is usually visualized in the reverse direction (arrows point from the result to the inputs). We can perform backpropagation on the computation graph by calling the function `backward()` on the last output, which effectively calculates the gradients for each tensor that has the property `requires_grad=True`:

In [50]:
y.backward()

`x.grad` will now contain the gradient $\partial y/ \partial \mathbf{x}$, and this gradient indicates how a change in $\mathbf{x}$ will affect output $y$ given the current input $\mathbf{x}=[0,1,2]$:

In [51]:
print(x.grad)

tensor([1.3333, 2.0000, 2.6667])


We can also verify these gradients by hand. We will calculate the gradients using the chain rule, in the same way as PyTorch did it:

$$
\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i}
$$

Note that we have simplified this equation to index notation, using the fact that all operations besides the mean act element-wise and do not combine the elements of the tensor. The partial derivatives are:

$$
\frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm}
\frac{\partial b_i}{\partial a_i} = 2\cdot a_i,\hspace{1cm}
\frac{\partial c_i}{\partial b_i} = 1,\hspace{1cm}
\frac{\partial y}{\partial c_i} = \frac{1}{3}
$$

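Multiplying these factors and substituting $a_i = x_i + 2$ gives $\partial y/\partial x_i = \frac{2}{3}(x_i + 2)$. Hence, with the input being $\mathbf{x}=[0,1,2]$, our gradients are $\partial y/\partial \mathbf{x}=[4/3,2,8/3]$. The previous code cell should have printed the same result.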
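The hand calculation can also be checked numerically. The sketch below rebuilds the same function, evaluates the chain-rule result directly, and compares it with what autograd computed (`x_check`, `y_check`, and `manual_grad` are illustrative names):

```python
import torch

x_check = torch.arange(3, dtype=torch.float32, requires_grad=True)
y_check = ((x_check + 2) ** 2 + 3).mean()
y_check.backward()

# Chain rule by hand: dy/dx_i = (1/3) * 1 * 2 * (x_i + 2) * 1
manual_grad = 2 * (x_check.detach() + 2) / x_check.numel()

print(x_check.grad)
print(manual_grad)
print(torch.allclose(x_check.grad, manual_grad))
```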