### Small Intro to PyTorch

PyTorch is very similar to NumPy: it is also built around multidimensional array objects (tensors) with similar features (e.g., datatypes, slicing, broadcasting, batch operations). The most important differences between NumPy and PyTorch are that PyTorch can use graphics processing units (GPUs) to accelerate tensor operations and that it has a native, optimized autograd engine for automatically computing derivatives (similar to `autograd.numpy`).
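
To make the autograd point concrete, here is a minimal sketch. The function `y = sum(x**2)` and the input values are chosen purely for illustration; the only API used is `requires_grad`, `backward()` and the `.grad` attribute.

```python
import torch

# A small vector whose operations are recorded by autograd.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar function y = sum(x_i ** 2), written with NumPy-style elementwise ops.
y = (x ** 2).sum()

# Backpropagate: autograd stores dy/dx = 2 * x in x.grad.
y.backward()

print('y:', y.item())
print('dy/dx:', x.grad)
```

The gradient printed here should equal `2 * x`, which is what differentiating the sum of squares by hand gives.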


If you are using Google Colab, you can enable GPUs under "Edit" / "Notebook settings" / "Hardware accelerator".


As in NumPy, PyTorch tensors have a `dtype` attribute specifying their datatype. All PyTorch tensors also have a `device` attribute that specifies where the tensor is stored: either the CPU or a CUDA device (for NVIDIA GPUs). Operations on a tensor stored on a CUDA device automatically run on the GPU.

Just as with datatypes, we can use the [`.to()`](https://pytorch.org/docs/1.1.0/tensors.html#torch.Tensor.to) method to change the device of a tensor. We can also use the convenience methods `.cuda()` and `.cpu()` to move tensors between CPU and GPU.

In [1]:
import torch

if torch.cuda.is_available():  # call the function; the bare attribute is always truthy
  print('PyTorch can use GPUs!')
else:
  print('PyTorch cannot use GPUs.')

PyTorch can use GPUs!


In [2]:
# Construct a tensor on the CPU
x0 = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
print('x0 device:', x0.device)

# Move it to the GPU using .to()
x1 = x0.to('cuda')
print('x1 device:', x1.device)

# Move it to the GPU using .cuda()
x2 = x0.cuda()
print('x2 device:', x2.device)

# Move it back to the CPU using .to()
x3 = x1.to('cpu')
print('x3 device:', x3.device)

# Move it back to the CPU using .cpu()
x4 = x2.cpu()
print('x4 device:', x4.device)

# We can construct tensors directly on the GPU as well
y = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float64, device='cuda')
print('y device / dtype:', y.device, y.dtype)

# Calling x.to(y) where y is a tensor will return a copy of x with the same
# device and dtype as y
x5 = x0.to(y)
print('x5 device / dtype:', x5.device, x5.dtype)

x0 device: cpu
x1 device: cuda:0
x2 device: cuda:0
x3 device: cpu
x4 device: cpu
y device / dtype: cuda:0 torch.float64
x5 device / dtype: cuda:0 torch.float64
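
The cell above assumes a GPU is present. A common way to keep the same code running on machines without one is to pick the device once and pass it to `.to()`. The following is a sketch of that pattern; the variable names (`device`, `x_dev`, `x_64`) are illustrative only.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

# .to() accepts a device, a dtype, or both at once.
x_dev = x.to(device)
x_64 = x.to(device, dtype=torch.float64)

print('x_dev device:', x_dev.device)
print('x_64 device / dtype:', x_64.device, x_64.dtype)

# When nothing needs to change, .to() returns the same tensor, not a copy.
print('no-op .to() returns self:', x.to('cpu') is x)
```

With this pattern the rest of the code never needs to mention `'cuda'` explicitly.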


Performing large tensor operations on a GPU can be **a lot faster** than running the equivalent operation on CPU.

Here we compare the speed of adding two tensors of shape (10000, 10000) on CPU and GPU:

(Note that GPU code may run asynchronously with CPU code, so when timing the speed of operations on the GPU it is important to use `torch.cuda.synchronize` to synchronize the CPU and GPU.)

In [4]:
import time

a_cpu = torch.randn(10000, 10000, dtype=torch.float32)
b_cpu = torch.randn(10000, 10000, dtype=torch.float32)

a_gpu = a_cpu.cuda()
b_gpu = b_cpu.cuda()
torch.cuda.synchronize()

t0 = time.time()
c_cpu = a_cpu + b_cpu
t1 = time.time()
c_gpu = a_gpu + b_gpu
torch.cuda.synchronize()
t2 = time.time()

# Check that they computed the same thing
diff = (c_gpu.cpu() - c_cpu).abs().max().item()
print('Max difference between c_gpu and c_cpu:', diff)

cpu_time = 1000.0 * (t1 - t0)
gpu_time = 1000.0 * (t2 - t1)
print('CPU time: %.2f ms' % cpu_time)
print('GPU time: %.2f ms' % gpu_time)
print('GPU speedup: %.2f x' % (cpu_time / gpu_time))

Max difference between c_gpu and c_cpu: 0.0
CPU time: 144.53 ms
GPU time: 5.98 ms
GPU speedup: 24.15 x


You should see that running the same computation on the GPU was roughly 10-30 times faster than on the CPU (about 24x in the run above); the exact figure depends on your hardware. Because of these large speedups, we will use GPUs to accelerate much of our machine learning code.
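
A single timed run can be noisy, and the first GPU call often includes one-off setup cost. Here is a sketch of a small timing helper that adds a warm-up call and averages over several runs. `time_op` is a hypothetical helper written for this example, not part of PyTorch, and it falls back to the CPU when no GPU is available.

```python
import time
import torch

def time_op(fn, device, repeats=10):
    """Return the mean wall-clock time of fn() in milliseconds."""
    use_cuda = torch.device(device).type == 'cuda'
    fn()  # warm-up call, not timed
    if use_cuda:
        torch.cuda.synchronize()  # wait for pending GPU work before starting
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the timed GPU work to finish
    return 1000.0 * (time.perf_counter() - start) / repeats

device = 'cuda' if torch.cuda.is_available() else 'cpu'
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)

elapsed_ms = time_op(lambda: a + b, device)
print('%s add: %.3f ms' % (device, elapsed_ms))
```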

A list of functions for vector and matrix products can be found [here](https://pytorch.org/docs/stable/torch.html#blas-and-lapack-operations). Some examples are:

- [`torch.dot`](https://pytorch.org/docs/stable/torch.html#torch.dot): Computes inner product of vectors
- [`torch.mm`](https://pytorch.org/docs/stable/torch.html#torch.mm): Computes matrix-matrix products
- [`torch.mv`](https://pytorch.org/docs/stable/torch.html#torch.mv): Computes matrix-vector products
- [`torch.addmm`](https://pytorch.org/docs/stable/torch.html#torch.addmm) / [`torch.addmv`](https://pytorch.org/docs/stable/torch.html#torch.addmv): Computes matrix-matrix and matrix-vector multiplications plus a bias
- [`torch.bmm`](https://pytorch.org/docs/stable/torch.html#torch.bmm) / [`torch.baddbmm`](https://pytorch.org/docs/stable/torch.html#torch.baddbmm): Batched versions of `torch.mm` and `torch.addmm`, respectively
- [`torch.matmul`](https://pytorch.org/docs/stable/torch.html#torch.matmul): General matrix product that performs different operations depending on the rank of the inputs; this is similar to `np.matmul` in NumPy.
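
The functions listed above can be tried on tiny inputs whose results are easy to check by hand. The tensors below are made up for this sketch.

```python
import torch

v = torch.tensor([1.0, 2.0])
w = torch.tensor([3.0, 4.0])
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
I = torch.eye(2)  # 2x2 identity matrix

print('dot:   ', torch.dot(v, w))       # inner product: 1*3 + 2*4
print('mv:    ', torch.mv(A, v))        # matrix-vector product A v
print('mm:    ', torch.mm(A, I))        # matrix-matrix product A I, equal to A
print('addmm: ', torch.addmm(I, A, A))  # I + A A
print('matmul:', torch.matmul(A, v))    # 2D x 1D input: behaves like torch.mv

# With a batch dimension on one side, matmul broadcasts the other operand.
batch = torch.ones(3, 2, 2)
print('batched matmul shape:', torch.matmul(batch, A).shape)
```

Note that `torch.dot`, `torch.mv` and `torch.mm` only accept inputs of the stated rank, while `torch.matmul` dispatches on the shapes it receives.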



One of the most important operations in deep learning is the batched matrix multiplication `torch.bmm`. Let's see how it works.

In [7]:
import torch
print("Using torch", torch.__version__)

if torch.cuda.is_available():  # call the function; the bare attribute is always truthy
  print('PyTorch can use GPUs!')
else:
  print('PyTorch cannot use GPUs.')

B, N, M, P = 3, 2, 5, 4
x = torch.rand(B, N, M)  # Random tensor of shape (B, N, M)
y = torch.rand(B, M, P)  # Random tensor of shape (B, M, P)

# We can use a for loop to (inefficiently) compute a batch of matrix multiply
# operations
z1 = torch.empty(B, N, P)  # Empty tensor of shape (B, N, P)
for i in range(B):
  z1[i] = x[i].mm(y[i])
print('Here is the result of batched matrix multiply with a loop:')
print(z1)

z2 = torch.bmm(x, y)
print('\nHere is the result of batched matrix multiply with bmm:')
print(z2)

diff = (z1 - z2).abs().max().item()
print('\nDifference:', diff)
print('Difference within threshold:', diff < 1e-6)

Using torch 2.0.0.post200
PyTorch can use GPUs!
Here is the result of batched matrix multiply with a loop:
tensor([[[0.6381, 1.2742, 1.5322, 1.5301],
         [0.9729, 1.6648, 1.9223, 2.0967]],

        [[1.4766, 1.9549, 1.4101, 1.8668],
         [1.0844, 1.6417, 1.1127, 1.6078]],

        [[1.1042, 2.0551, 1.9900, 1.7746],
         [1.3077, 1.5846, 1.9509, 1.9104]]])

Here is the result of batched matrix multiply with bmm:
tensor([[[0.6381, 1.2742, 1.5322, 1.5301],
         [0.9729, 1.6648, 1.9223, 2.0967]],

        [[1.4766, 1.9549, 1.4101, 1.8668],
         [1.0844, 1.6417, 1.1127, 1.6078]],

        [[1.1042, 2.0551, 1.9900, 1.7746],
         [1.3077, 1.5846, 1.9509, 1.9104]]])

Difference: 2.384185791015625e-07
Difference within threshold: True
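
For 3D inputs like these, the same batched product can also be written with `torch.matmul`, the `@` operator, or `torch.einsum`. The sketch below checks that they agree with `torch.bmm`; the seed and shapes are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)  # fixed seed so reruns give the same random inputs
B, N, M, P = 3, 2, 5, 4
x = torch.rand(B, N, M)
y = torch.rand(B, M, P)

z_bmm = torch.bmm(x, y)                         # explicit batched matrix multiply
z_matmul = torch.matmul(x, y)                   # matmul batches over leading dims
z_op = x @ y                                    # operator form of matmul
z_einsum = torch.einsum('bnm,bmp->bnp', x, y)   # index notation for the same product

print('Output shape:', tuple(z_bmm.shape))
print('matmul matches bmm:', torch.allclose(z_bmm, z_matmul))
print('@ matches bmm:     ', torch.allclose(z_bmm, z_op))
print('einsum matches bmm:', torch.allclose(z_bmm, z_einsum))
```

`torch.bmm` requires both inputs to be 3D with matching batch sizes, whereas `torch.matmul` also handles broadcasting over batch dimensions.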


In [9]:
import time

a_cpu = torch.randn(10000, 1000, 10, dtype=torch.float32)
b_cpu = torch.randn(10000, 10, 100, dtype=torch.float32)

a_gpu = a_cpu.cuda()
b_gpu = b_cpu.cuda()
torch.cuda.synchronize()

# Compare torch.bmm on the CPU and on the GPU
t0 = time.time()
c_cpu = torch.bmm(a_cpu, b_cpu)
t1 = time.time()
c_gpu = torch.bmm(a_gpu, b_gpu)
torch.cuda.synchronize()
t2 = time.time()

# Check that they computed the same thing
# diff = (c_gpu - c_cpu).abs().max().item()  # raises an error: the tensors are on different devices
diff = (c_gpu.cpu() - c_cpu).abs().max().item()
print('Max difference between c_gpu and c_cpu:', diff)

cpu_time = 1000.0 * (t1 - t0)
gpu_time = 1000.0 * (t2 - t1)
print('CPU time: %.2f ms' % cpu_time)
print('GPU time: %.2f ms' % gpu_time)
print('GPU speedup: %.2f x' % (cpu_time / gpu_time))

Max difference between c_gpu and c_cpu: 3.814697265625e-06
CPU time: 802.86 ms
GPU time: 78.13 ms
GPU speedup: 10.28 x
