# The ultimate guide to PyTorch

In [77]:
import torch
import torch.nn.functional as F
import numpy as np

## `torch.Tensor` class and constructors

This is the most important class in PyTorch. A `torch.Tensor` is a "multi-dimensional matrix" (in physics language, it is just a multi-index tensor). There are multiple ways to create an object of type `torch.Tensor`. The most common one is `torch.tensor`:

In [4]:
# To create a tensor from a Python list
print(torch.tensor([1,2,3]))

# You can also use nested lists to build multidimensional tensors
print(torch.tensor([[1,2,3],[4,5,6]]))

# It also accepts np.array
print(torch.tensor(np.array([1,2,3])))

tensor([1, 2, 3])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([1, 2, 3])
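
One detail worth knowing when passing NumPy arrays: `torch.tensor` always copies its input, whereas `torch.from_numpy` returns a tensor that shares memory with the array. A minimal sketch of the difference (the variable names `arr`, `copied` and `shared` are just illustrative):

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

copied = torch.tensor(arr)      # independent copy of the data
shared = torch.from_numpy(arr)  # shares memory with `arr`

arr[0] = 10.0  # modify the NumPy array in place

print(copied)        # unchanged by the edit to `arr`
print(shared)        # reflects the edit to `arr`
print(copied.dtype)  # the NumPy dtype (float64) is preserved
```

If the array will be modified later and the tensor should not follow along, `torch.tensor` is the safer choice.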


There are also special functions to create `torch.Tensor`s. The most useful ones are:

In [51]:
# Create a 5 by 5 tensor filled with zeroes
print(torch.zeros((5,5)))

# Create a 3 by 5 tensor filled with ones
print(torch.ones((3,5)))

# Create a random 2 by 2 by 6 tensor with entries drawn from a standard normal distribution
print(torch.randn((2,2,6)))

# Create a random 2 by 2 by 6 tensor populated by integers
# Syntax is int low (inclusive), int high (exclusive), tuple of sizes
print(torch.randint(15, 20, (2,2,6)))

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])
tensor([[[ 0.1815,  1.1265,  1.7042,  1.1012, -0.5984, -0.1766],
         [-0.2570, -0.1536, -0.9602, -1.9536, -0.1074, -0.7823]],

        [[-0.6211, -1.6089,  0.9538,  0.7135, -0.8947,  0.2069],
         [ 0.8291, -0.6183, -0.6868,  1.3104, -1.3237,  1.1811]]])
tensor([[[19, 16, 16, 15, 16, 18],
         [17, 19, 16, 18, 17, 16]],

        [[16, 18, 18, 17, 15, 15],
         [17, 16, 15, 18, 18, 17]]])
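
A small sketch of two related conveniences, assuming reproducible random numbers are wanted: `torch.manual_seed` fixes the random number generator, and the `*_like` constructors copy the shape and dtype of an existing tensor. It also checks that the upper bound of `torch.randint` is exclusive. The names `a`, `b` and `zeros` are illustrative.

```python
import torch

torch.manual_seed(0)  # make the random draws below reproducible
a = torch.randint(15, 20, (2, 2, 6))

torch.manual_seed(0)  # same seed, so the same values are drawn again
b = torch.randint(15, 20, (2, 2, 6))

print(torch.equal(a, b))            # True
print(a.min() >= 15, a.max() < 20)  # values lie in [15, 20)

# Same shape and dtype as `a`, filled with zeros
zeros = torch.zeros_like(a)
print(zeros.shape, zeros.dtype)
```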


By default, tensors are created with `requires_grad` set to `False`. If you want to do any calculations involving gradients, this needs to be set to `True`. Note that only floating-point (and complex) tensors can require gradients.

In [14]:
# Tensor without grad
print(torch.ones((2,2)).requires_grad)

# Tensor with grad
print(torch.ones((2,2), requires_grad=True).requires_grad)

# You can also set the grad to be true later by using requires_grad_()
torchTensor = torch.randn((2,2))
print(f"{torchTensor.requires_grad = }")
torchTensor.requires_grad_()
print(f"{torchTensor.requires_grad = }")

False
True
torchTensor.requires_grad = False
torchTensor.requires_grad = True
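
One caveat, sketched below: integer tensors cannot require gradients, so asking for one raises a `RuntimeError`. Converting to a floating-point dtype first avoids the problem. The variable names are illustrative.

```python
import torch

int_tensor = torch.ones((2, 2), dtype=torch.long)

# Integer tensors cannot track gradients
try:
    int_tensor.requires_grad_()
    int_failed = False
except RuntimeError as error:
    int_failed = True
    print(f"Error occurred: {error}")

# Converting to float first works
float_tensor = int_tensor.float().requires_grad_()
print(f"{float_tensor.requires_grad = }")
```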


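There are two ways to access the data stored in a `torch.Tensor`: `.item()` and `.data`. `item()` works *only* when the tensor holds exactly one element, and returns it as a plain Python number. `data` is used in general to access the underlying values (and ignore other attributes like `grad`, which we will see later). Note that `.data` bypasses autograd's tracking; `.detach()` is the generally recommended way to get the same values outside the computation graph.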

In [24]:
torchTensor = torch.randn((2,2))
print(torchTensor.data)
try:
    print(torchTensor.item())
except RuntimeError as error:
    print(f"Error occurred: {error}")

torchScalar = torch.tensor(2)
try:
    print(f"{torchScalar.item() = }")
except RuntimeError as error:
    print(f"Error occurred: {error}")

tensor([[ 0.7446, -0.4623],
        [-1.4433,  1.0537]])
Error occurred: a Tensor with 4 elements cannot be converted to Scalar
torchScalar.item() = 2
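
A short sketch of the related accessors, under the assumption that the goal is to read values without involving autograd: `item()` also works on a tensor of any shape as long as it holds a single element, `detach()` returns a tensor that shares memory but does not require gradients, and `tolist()` converts to nested Python lists. The variable names are illustrative.

```python
import torch

# One element, even though the shape is (1, 1)
single = torch.tensor([[2.5]])
print(single.item())

w = torch.randn((2, 2), requires_grad=True)

# Same values and same memory, but cut off from the autograd graph
values = w.detach()
print(values.requires_grad)                # False
print(values.data_ptr() == w.data_ptr())   # True

# Plain nested Python lists
as_list = values.tolist()
print(as_list)
```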


You can also index into a tensor exactly the way you would index into a multidimensional NumPy array. This is powerful because you can pick out specific parts of the tensor:

In [69]:
tensorToIndex = torch.randn((7,48,3))
print(tensorToIndex.shape)

# Accessing an element in the first dimension
print(tensorToIndex[2].shape)

# Accessing an element in the second dimension
print(tensorToIndex[:,22,:].shape)
print(tensorToIndex[:,22].shape)

# Accessing an element in the third dimension
print(tensorToIndex[:,:,2].shape)

# We can build a new tensor by indexing the old tensor at specific values using a tensor
indices = torch.tensor([[1,5,2],[2,0,0],[0,1,2]])
print(indices.shape)
print(tensorToIndex[:,indices,:].shape)

torch.Size([7, 48, 3])
torch.Size([48, 3])
torch.Size([7, 3])
torch.Size([7, 3])
torch.Size([7, 48])
torch.Size([3, 3])
torch.Size([7, 3, 3, 3])


What the above code does is the following. It tells PyTorch to keep the first and third dimensions as they are, and to insert new dimensions between them. The second dimension of `tensorToIndex`, which had size 48, gets replaced by two dimensions of shape [3,3], since the index tensor is of shape [3,3].

How does it populate the new dimensions? That information comes from the `indices` tensor itself. For the specific example used above, `indices[0, 0] = 1` tells PyTorch that at position [-, 0, 0, -] it should put [-, 1, -] from `tensorToIndex`. Since `indices[1, 0] = 2`, at [-, 1, 0, -] it puts [-, 2, -] from the old tensor, and so on.

If we do not use any `:` and only pass `indices` directly, then it does the above operation on the leftmost dimension.

In [71]:
# We can build a new tensor by indexing the old tensor at specific values using a tensor
indices = torch.tensor([[1,5,2],[2,0,0],[0,1,2]])
print(tensorToIndex.shape)
print(indices.shape)
print(tensorToIndex[indices].shape)

torch.Size([7, 48, 3])
torch.Size([3, 3])
torch.Size([3, 3, 48, 3])
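
The rule described above can be checked entry by entry. The sketch below rebuilds both indexed tensors and confirms that position `[i, j]` of the new dimensions holds the slice selected by `indices[i, j]`. The names `t`, `middle` and `left` are illustrative.

```python
import torch

t = torch.randn((7, 48, 3))
indices = torch.tensor([[1, 5, 2], [2, 0, 0], [0, 1, 2]])

middle = t[:, indices, :]  # index tensor applied to the second dimension
left = t[indices]          # index tensor applied to the leftmost dimension

for i in range(3):
    for j in range(3):
        k = indices[i, j].item()
        # Slot [i, j] of the new dimensions holds slice k of the old one
        assert torch.equal(middle[:, i, j, :], t[:, k, :])
        assert torch.equal(left[i, j], t[k])

print(middle.shape, left.shape)
```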


This can be particularly useful in situations like the following. Let's say I have `N = 5` classes and a dataset `X` that contains samples of these classes (for example, we could have `N = 26`, with `X` being the letters that appear in my dataset). Normally, one would one-hot encode these classes into N-dimensional vectors and apply a linear transformation to them. However, one-hot encoding something and then multiplying by a matrix is equivalent to picking a specific row of that matrix (or a column, depending on how you look at it). Thus we can just index into the matrix directly using `X`. See below for an example.

In [87]:
# "dataset" with 40 samples
X = torch.randint(5, (40,))

# One-hot encoding
Xenc = F.one_hot(X, num_classes=5).float()
print(Xenc)
print(Xenc.shape)

matrix = torch.randn((5, 3))

# Matrix multiply
print((Xenc @ matrix).shape)

# Index into the matrix
print(matrix[X].shape)

# Checking they are the same element-wise
(Xenc @ matrix) == matrix[X]

tensor([[0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True,
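
The element-wise comparison prints one boolean per entry; a more compact check collapses it into a single value. The sketch below uses `torch.allclose`, which tolerates floating-point rounding, and also shows that `torch.nn.Embedding` performs the same row lookup on its `weight` matrix. The names `via_matmul`, `via_index` and `embedding` are illustrative.

```python
import torch
import torch.nn.functional as F

X = torch.randint(5, (40,))
matrix = torch.randn((5, 3))

via_matmul = F.one_hot(X, num_classes=5).float() @ matrix
via_index = matrix[X]

# A single True/False instead of a 40 by 3 boolean tensor
print(torch.allclose(via_matmul, via_index))

# nn.Embedding is a lookup table: calling it indexes into its weight matrix
embedding = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
print(torch.equal(embedding(X), embedding.weight[X]))
```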

## Operations on `torch.Tensor`

The `torch` module and the `torch.Tensor` class come with a huge set of operations on tensors. We will split the most important ones into unary and binary operators. Further, the unary ones will be split into tensor operators and tensor functions.

### Unary tensor operators 

`torch.Tensor` has a bunch of standard operations such as transpose and Hermitian conjugation in the case of two-dimensional tensors

In [31]:
# Trial tensor
M = torch.randn((3,3))
print(f"{M = }")

# Transpose
print(f"{M.T = }")

# Hermitian conjugate (which would give the same result as T since M is real)
print(f"{M.H = }")

M = tensor([[ 0.6264, -0.6803,  1.3870],
        [-0.3869,  2.3262,  0.2047],
        [-0.4378, -0.1837, -0.2527]])
M.T = tensor([[ 0.6264, -0.3869, -0.4378],
        [-0.6803,  2.3262, -0.1837],
        [ 1.3870,  0.2047, -0.2527]])
M.H = tensor([[ 0.6264, -0.3869, -0.4378],
        [-0.6803,  2.3262, -0.1837],
        [ 1.3870,  0.2047, -0.2527]])
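
For a real matrix `T` and `H` coincide, so the difference only shows up with complex entries. A minimal sketch with a hand-written complex matrix (the values and the names `C` and `expected` are arbitrary choices for illustration):

```python
import torch

C = torch.tensor([[1 + 2j, 3 - 1j],
                  [0 + 1j, 2 + 0j]])

# Transpose, then conjugate every entry
expected = torch.tensor([[1 - 2j, 0 - 1j],
                         [3 + 1j, 2 + 0j]])

# resolve_conj() materializes the lazily conjugated view before comparing
hermitian = C.H.resolve_conj()
print(torch.equal(hermitian, expected))  # True
print(torch.equal(hermitian, C.T))       # False: conjugation matters here
```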


The situation with higher-dimensional tensors is trickier. First, there is `mT`, which transposes the trailing two dimensions:

In [34]:
# Trial tensor
M = torch.randn((3,3,3))
print(f"{M = }")

# Transpose
print(f"{M.mT = }")

# Hermitian conjugate
print(f"{M.mH = }")

M = tensor([[[-1.4673, -0.4204, -1.1857],
         [-0.3308, -0.0802,  0.0118],
         [-0.8805, -0.0644, -0.2816]],

        [[-0.2144, -0.1880, -1.1302],
         [ 0.3032,  0.0827,  0.1062],
         [-0.5461,  0.3554,  0.6406]],

        [[ 0.0679, -0.3958,  0.1948],
         [ 0.3312,  0.9028,  1.2418],
         [ 0.5943,  0.6860, -0.3712]]])
M.mT = tensor([[[-1.4673, -0.3308, -0.8805],
         [-0.4204, -0.0802, -0.0644],
         [-1.1857,  0.0118, -0.2816]],

        [[-0.2144,  0.3032, -0.5461],
         [-0.1880,  0.0827,  0.3554],
         [-1.1302,  0.1062,  0.6406]],

        [[ 0.0679,  0.3312,  0.5943],
         [-0.3958,  0.9028,  0.6860],
         [ 0.1948,  1.2418, -0.3712]]])
M.mH = tensor([[[-1.4673, -0.3308, -0.8805],
         [-0.4204, -0.0802, -0.0644],
         [-1.1857,  0.0118, -0.2816]],

        [[-0.2144,  0.3032, -0.5461],
         [-0.1880,  0.0827,  0.3554],
         [-1.1302,  0.1062,  0.6406]],

        [[ 0.0679,  0.3312,  0.5943],
         [-0.395

In the case of `mH`, the function used under the hood is `adjoint()`, which transposes the last two dimensions and then conjugates. There is also a more general method, `.transpose()`, which takes two dimension indices and swaps those two dimensions:

In [35]:
# Trial tensor
M = torch.randn((3,3,3))
print(f"{M = }")

# Transpose
print(f"{M.mT = }")

# transpose() function
print(f"{M.transpose(-1, -2) = }")

M = tensor([[[-0.0152, -1.2074, -0.3261],
         [-0.3511, -0.1746,  0.5307],
         [-0.2593, -0.6867, -0.7115]],

        [[ 1.2810, -1.6757,  0.5918],
         [-0.2835,  0.0422,  1.0530],
         [-1.2587, -0.9597, -1.8864]],

        [[ 0.4300, -0.2072,  0.6119],
         [-0.0768,  1.5598,  1.6492],
         [-0.7949,  1.2559, -0.1922]]])
M.mT = tensor([[[-0.0152, -0.3511, -0.2593],
         [-1.2074, -0.1746, -0.6867],
         [-0.3261,  0.5307, -0.7115]],

        [[ 1.2810, -0.2835, -1.2587],
         [-1.6757,  0.0422, -0.9597],
         [ 0.5918,  1.0530, -1.8864]],

        [[ 0.4300, -0.0768, -0.7949],
         [-0.2072,  1.5598,  1.2559],
         [ 0.6119,  1.6492, -0.1922]]])
M.transpose(-1, -2) = tensor([[[-0.0152, -0.3511, -0.2593],
         [-1.2074, -0.1746, -0.6867],
         [-0.3261,  0.5307, -0.7115]],

        [[ 1.2810, -0.2835, -1.2587],
         [-1.6757,  0.0422, -0.9597],
         [ 0.5918,  1.0530, -1.8864]],

        [[ 0.4300, -0.0768, -0.7949],
 

For more complicated permutations of the indices, one can use `permute()`. I am not writing that down since I don't envision it being useful at this moment. It will be good to revisit if necessary.
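
For reference, should the need arise, here is a minimal sketch of how `permute()` relates to `transpose()`, plus a check of the `mH` description on a complex tensor. The shapes and the names `M`, `P` and `C` are illustrative.

```python
import torch

M = torch.randn((2, 3, 4))

# transpose swaps exactly two dimensions
print(M.transpose(0, 2).shape)  # torch.Size([4, 3, 2])

# permute takes the full new ordering of the dimensions
P = M.permute(2, 0, 1)
print(P.shape)                  # torch.Size([4, 2, 3])

# A transpose is a special case of a permutation
same = torch.equal(M.transpose(0, 2), M.permute(2, 1, 0))
print(same)                     # True

# Both return views: no data is copied, only the strides change
print(P.data_ptr() == M.data_ptr())  # True
print(P.is_contiguous())             # False

# mH on a complex batch: transpose the last two dimensions, then conjugate
C = torch.randn((2, 3, 4), dtype=torch.cfloat)
via_mH = C.mH.resolve_conj()
via_manual = C.transpose(-2, -1).conj().resolve_conj()
print(torch.equal(via_mH, via_manual))  # True
```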

### Unary functions

`torch.Tensor` comes with a *_huge_* set of functions that can be applied to tensors. The most common ones are `exp()`, `log()` and `tanh()`. All of these come with an in-place version, accessed by appending `_` to the function name (for example `exp_()`), which overwrites the tensor instead of returning a new one.
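
A minimal sketch of the difference between the two versions. It also shows one pitfall: an in-place operation on a leaf tensor that requires gradients raises a `RuntimeError`. The variable names are illustrative.

```python
import torch

x = torch.tensor([0.0, 1.0, 2.0])

# Out-of-place: returns a new tensor, x is untouched
y = x.exp()
print(x)

# In-place: overwrites x with the result
x.exp_()
print(torch.equal(x, y))  # True

# In-place ops are not allowed on leaf tensors that require grad
w = torch.ones(3, requires_grad=True)
try:
    w.tanh_()
    inplace_failed = False
except RuntimeError as error:
    inplace_failed = True
    print(f"Error occurred: {error}")
```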

### Binary operators

Currently the only important operator I can think of is `torch.matmul`, which, as the name suggests, is matrix multiplication. It behaves differently depending on the number of dimensions of the objects being multiplied:

In [44]:
# Scalar product
x = torch.tensor([1,3,2])
y = torch.tensor([4,3,8])
print(f"{torch.matmul(x,y) = }")

# Linear transform
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([4,3,8])
print(f"{torch.matmul(x,y) = }")

# Vector - matrix multiplication
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([4,3,8])
print(f"{torch.matmul(y,x) = }") # Here the way it is multiplied is [(4*3)+(3*4)+(8*7), (4*1)+(3*2)+(8*5), (4*7)+(3*6)+(8*8)]

# matrix - matrix multiplication
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([[4,3,8],[4,3,8],[4,3,8]])
print(f"{torch.matmul(x,y) = }")

torch.matmul(x,y) = tensor(29)
torch.matmul(x,y) = tensor([ 71,  70, 107])
torch.matmul(y,x) = tensor([ 80,  50, 110])
torch.matmul(x,y) = tensor([[ 44,  33,  88],
        [ 48,  36,  96],
        [ 80,  60, 160]])


It is important to note here that in the cases where the shapes of `x` and `y` are different, there is quite a bit of under-the-hood "magic" happening in terms of dimension broadcasting. I will document this a bit in the section below, but also in much more detail as a separate section because that is highly important.
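
As a first taste of that behaviour, the sketch below only looks at output shapes. With more than two dimensions, `torch.matmul` treats the last two dimensions as the matrices and broadcasts all leading (batch) dimensions. The tensor names and sizes are arbitrary.

```python
import torch

batch = torch.randn((10, 3, 4))
mat = torch.randn((4, 5))
vec = torch.randn((4,))

# The same (4, 5) matrix is applied to each of the 10 (3, 4) matrices
print((batch @ mat).shape)  # torch.Size([10, 3, 5])

# A 1-D right operand is treated as a column vector, then squeezed again
print((batch @ vec).shape)  # torch.Size([10, 3])

# Leading dimensions (2, 1) and (5,) broadcast to (2, 5)
a = torch.randn((2, 1, 3, 4))
b = torch.randn((5, 4, 6))
print((a @ b).shape)        # torch.Size([2, 5, 3, 6])

# The inner dimensions must still agree
try:
    torch.randn((3, 4)) @ torch.randn((3, 4))
    mismatch_failed = False
except RuntimeError as error:
    mismatch_failed = True
    print(f"Error occurred: {error}")
```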

One can also use `@` instead of `torch.matmul()`

In [45]:
# Scalar product
x = torch.tensor([1,3,2])
y = torch.tensor([4,3,8])
print(f"{x @ y = }")

# Linear transform
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([4,3,8])
print(f"{x @ y = }")

# Vector - matrix multiplication
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([4,3,8])
print(f"{y @ x = }") # Here the way it is multiplied is [(4*3)+(3*4)+(8*7), (4*1)+(3*2)+(8*5), (4*7)+(3*6)+(8*8)]

# matrix - matrix multiplication
x = torch.tensor([[3,1,7],[4,2,6],[7,5,8]])
y = torch.tensor([[4,3,8],[4,3,8],[4,3,8]])
print(f"{x @ y = }")

x @ y = tensor(29)
x @ y = tensor([ 71,  70, 107])
y @ x = tensor([ 80,  50, 110])
x @ y = tensor([[ 44,  33,  88],
        [ 48,  36,  96],
        [ 80,  60, 160]])


### Operations that mutate tensor dimension

Here I particularly want to document the behaviour of `sum` and the usage of `keepdim`, which is important since it can lead to surprising behaviour.
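
A minimal sketch of the kind of surprise meant here, using row normalization as an assumed use case. Without `keepdim=True` the summed dimension disappears, and for a square matrix the division then silently broadcasts along the wrong axis instead of raising an error. The names `right` and `wrong` are illustrative.

```python
import torch

x = torch.tensor([[1.0, 1.0],
                  [2.0, 2.0]])

print(x.sum())                             # tensor(6.)
print(x.sum(dim=1).shape)                  # torch.Size([2])
print(x.sum(dim=1, keepdim=True).shape)    # torch.Size([2, 1])

# Intended: divide every row by its own sum
right = x / x.sum(dim=1, keepdim=True)
print(right.sum(dim=1))  # tensor([1., 1.])

# Without keepdim the row sums are lined up with the columns instead
wrong = x / x.sum(dim=1)
print(wrong.sum(dim=1))  # tensor([0.7500, 1.5000])
```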

Also add `torch.cat`, `torch.unbind`, `Tensor.view`, and understand PyTorch's internal memory storage from Edward Z. Yang's blog post - http://blog.ezyang.com/2019/05/pytorch-internals/
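
Until that is written up properly, here is a rough sketch of what these three do to shapes. It also shows that `view` shares memory with the original tensor and only works when the memory layout allows it, whereas `reshape` falls back to copying. The variable names are illustrative.

```python
import torch

a = torch.zeros((2, 3))
b = torch.ones((2, 3))

# cat joins tensors along an existing dimension
rows = torch.cat([a, b], dim=0)
cols = torch.cat([a, b], dim=1)
print(rows.shape, cols.shape)  # torch.Size([4, 3]) torch.Size([2, 6])

# unbind removes a dimension and returns a tuple of slices
pieces = torch.unbind(rows, dim=0)
print(len(pieces), pieces[0].shape)  # 4 torch.Size([3])

# view reinterprets the same memory with a new shape
v = a.view(3, 2)
v[0, 0] = 7.0
print(a[0, 0])  # tensor(7.): the original tensor sees the change

# view fails on a non-contiguous tensor; reshape copies if it has to
try:
    a.T.view(6)
    view_failed = False
except RuntimeError as error:
    view_failed = True
    print(f"Error occurred: {error}")
flat = a.T.reshape(6)
print(flat.shape)
```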

## Tensor dimension broadcasting
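
A minimal sketch of the standard broadcasting rule, as a starting point for this section: shapes are aligned from the right, and two dimensions are compatible when they are equal or when one of them is 1 (a missing dimension counts as 1). The tensor names are illustrative.

```python
import torch

col = torch.arange(3).reshape(3, 1)  # shape (3, 1)
row = torch.arange(4)                # shape (4,), treated as (1, 4)

# (3, 1) and (1, 4) broadcast to (3, 4); entry [i, j] is i + j
total = col + row
print(total.shape)
print(total)

# The resulting shape can be computed without building any tensors
print(torch.broadcast_shapes((3, 1), (4,)))

# Sizes 3 and 4 are neither equal nor 1, so this fails
try:
    torch.arange(3) + torch.arange(4)
    broadcast_failed = False
except RuntimeError as error:
    broadcast_failed = True
    print(f"Error occurred: {error}")
```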

## Calculating gradients with backprop

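In most cases it is as simple as setting the gradient to `None` (which PyTorch treats as "no gradient yet"; gradients otherwise accumulate across calls) and then using the `backward` method. Note that `backward` called without arguments only works on functions that output a scalar; for a non-scalar output, a `gradient` tensor of the same shape must be passed in.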

In [92]:
X = torch.randn((20,20), requires_grad=True)
fn = torch.trace(X.T @ X)
X.grad = None
print(X.grad) # dumb check
fn.backward()
print(X.grad)
print(X.grad.shape)

None
tensor([[ 6.8919e-01,  5.9545e-01,  1.0578e+00,  6.5663e-01, -3.1901e-01,
          2.5812e+00, -4.5603e-01, -1.8559e+00,  6.1421e-01, -1.4582e+00,
         -1.1110e+00,  1.7701e-01,  4.1079e+00, -9.9751e-01, -8.9327e-01,
          2.8381e-01, -2.9639e+00,  1.7887e+00,  2.7781e-01,  9.4487e-01],
        [-4.0027e+00, -3.6812e+00, -2.9251e+00, -4.4314e+00, -5.9880e+00,
          1.1277e+00, -5.1030e-01, -4.5243e-01, -1.6009e+00,  2.2205e+00,
         -2.3395e+00, -1.3554e+00,  3.9091e+00,  3.7297e-01, -5.8702e-01,
         -3.5751e-01,  1.5001e+00, -2.5243e-01,  1.8313e+00,  2.0883e+00],
        [ 1.0277e-02,  2.7778e+00, -1.3754e+00, -2.6729e+00,  1.2703e+00,
          4.0222e+00,  2.9275e+00, -1.1829e+00, -2.6049e-01, -6.0611e-01,
          1.2003e+00, -2.9057e+00, -6.6012e-01,  1.9421e+00,  2.2600e-01,
         -3.1704e+00,  3.6250e+00, -1.9206e+00, -3.1726e+00, -8.6204e-03],
        [ 2.3642e-01, -3.1587e+00, -1.0306e+00,  2.0279e+00, -1.0325e+00,
          1.1159e+00,  9.5901e

I am currently not aware of a situation where this would not work. Will return to this section if such an example pops up in the future.
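
A few checks that may be useful alongside the example above, written as a sketch on a smaller matrix. Since `trace(X.T @ X)` is the sum of the squared entries of `X`, its gradient should be `2 * X`. The sketch also shows that gradients accumulate when `grad` is not reset, and what happens when `backward` is called on a non-scalar output. The names `first`, `accumulated` and `Y` are illustrative.

```python
import torch

X = torch.randn((4, 4), requires_grad=True)

# d/dX trace(X^T X) = 2X
torch.trace(X.T @ X).backward()
first = X.grad.clone()
print(torch.allclose(first, 2 * X.detach()))  # True

# A second backward pass without a reset adds to the stored gradient
torch.trace(X.T @ X).backward()
accumulated = X.grad.clone()
print(torch.allclose(accumulated, 4 * X.detach()))  # True

# Non-scalar output: backward() needs an explicit gradient argument
X.grad = None
Y = 2 * X
try:
    Y.backward()
    nonscalar_failed = False
except RuntimeError as error:
    nonscalar_failed = True
    print(f"Error occurred: {error}")

Y = 2 * X  # rebuild the graph before trying again
Y.backward(gradient=torch.ones_like(Y))
print(X.grad)  # every entry is 2
```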