High-performance and parallel computing for AI - Practical 1: HPC Architecture
==============================================================================


IMPORTANT!
==========

* For these practicals we will be using a different `conda` environment. When opening a notebook or a terminal, make sure you are using the **CuPy Kernel**!
* Do these practicals at your own pace. Solutions will be provided on the same day or on the day after the practical. Do not worry if you do not finish everything!

Question 1
----------

This question is to help you understand the architecture of a computing server.

In Jupyter, open a new window and open a terminal. In the terminal, type

```bash
lscpu
```

This outputs a plethora of information about goliat's CPU. Can you answer the following questions?

1- Who is the CPU manufacturer? In which year was this CPU released (you will have to google the CPU model)?

2- How many cores does the CPU have? Be careful, google what hyperthreading is (goliat has it enabled).

3- What is the largest square matrix of doubles (64 bits) that you can fit in the L1d (data) cache?

4- Assume an architecture similar to the one we saw in the lectures, in which the L3 cache is shared. What is the largest square matrix of doubles (64 bits) that you can fit on a socket without spilling over to RAM?
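
The cache-capacity arithmetic behind items 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the function name `largest_square_matrix` and the 1 MiB cache size are hypothetical, so substitute the sizes that `lscpu` reports.

```python
from math import isqrt

def largest_square_matrix(cache_bytes, bytes_per_element=8):
    """Largest n such that an n x n matrix fits in cache_bytes.

    An n x n matrix of doubles occupies n * n * 8 bytes, so we need the
    integer square root of the number of elements that fit.
    """
    n_elements = cache_bytes // bytes_per_element
    return isqrt(n_elements)

# Hypothetical cache size (NOT read from goliat): 1 MiB.
example_cache_bytes = 1 * 1024**2
n_max = largest_square_matrix(example_cache_bytes)
print(f"A {n_max} x {n_max} matrix of doubles fits in 1 MiB")
```

In practice the cache also holds other data, so the matrix size that really stays in cache will be somewhat smaller than this upper bound.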

Question 2
----------

This question teaches you how to determine whether a computation is compute-bound or memory-bound.

Assume you are using a CPU with a peak computational performance of 500 GFLOP/s and a memory bandwidth of 50 GB/s (roughly what a laptop CPU might achieve). Let $n=1024$ and take $A,B\in\mathbb{R}^{n\times n}$, $v\in\mathbb{R}^n$, all stored in single precision (32 bits per number). Are the following computations memory-bound or compute-bound?

1- The matrix-vector product $Av$.

2- The matrix-matrix product $AB$.

Draw a roofline plot. Where would these computations be on the plot? What would happen if the matrices and vectors were booleans (i.e., only 1 bit per entry)?


Hints: First, compute the FLOPs of each operation and the memory occupied by each variable. Second, compute the arithmetic intensity (FLOPs divided by the bytes of memory required). Third, draw the roofline plot. Lastly, mark the arithmetic intensity of each operation on the plot as a vertical line: the problem does not give you the GFLOP/s these operations actually achieve, so you would need to measure that in practice.
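
The procedure in the hints can be sketched in Python. To avoid giving away the answers, the sketch uses a different operation: the element-wise sum $u+w$ of two vectors of length $n$. The names `u`, `w` and the helper `attainable_gflops` are illustrative assumptions, not part of the exercise.

```python
PEAK_GFLOPS = 500.0    # peak compute performance, GFLOP/s (from the question)
BANDWIDTH_GBS = 50.0   # memory bandwidth, GB/s (from the question)

def attainable_gflops(intensity):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * intensity)

# Ridge point: the arithmetic intensity at which the two roofs meet.
ridge = PEAK_GFLOPS / BANDWIDTH_GBS  # FLOP/byte

n = 1024
bytes_per_number = 4  # single precision

# Illustrative operation: u + w for vectors of length n.
flops = n                                # one addition per entry
bytes_moved = 3 * n * bytes_per_number   # read u, read w, write the result
intensity = flops / bytes_moved          # FLOP/byte

bound = "memory-bound" if intensity < ridge else "compute-bound"
print(f"ridge point: {ridge} FLOP/byte")
print(f"u + w: intensity = {intensity:.4f} FLOP/byte, {bound}")
print(f"roofline cap: {attainable_gflops(intensity):.2f} GFLOP/s")
```

On the roofline plot, this operation would sit on a vertical line well to the left of the ridge point, under the sloped (bandwidth) part of the roof.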

Question 3
----------

For this question we will be using numpy and [cupy](https://cupy.dev/).

In this question we investigate the cost of data movement. To move data between CPUs in Python you would need to know MPI, which will be taught later on in the course. Here, we instead move data between the CPU and the GPU using cupy.

* Read and understand the code below. Run it (possibly twice to make sure there is no caching). Do you expect matrix multiplications to be faster on the CPU or on the GPU? Do you expect it to be faster to move data to or from the GPU (or the same)?
* Create a random square matrix $A$ and a vector $v$ of size $n=1024$ on the CPU (using numpy) and on the GPU (using cupy; it does not matter if the values are not the same). Then write three functions:
  1) A function that computes $v\cdot v$ on the CPU using the CPU variables.
  2) A function that computes $v\cdot v$ on the GPU using the GPU variables and then returns the answer to the CPU.
  3) A function that copies $v$ onto the GPU, computes $v\cdot v$ on the GPU, and then returns the answer to the CPU.

  Time each function over 10 runs. Remember to run the timing twice and to keep only the second measurement. To make sure we do not overuse the GPU memory, please add the following lines before your code:
```python
mempool = cp.get_default_memory_pool()
mempool.set_limit(size=1.5*1024**3)  # 1.5 GiB limit on GPU memory usage.
```

* Repeat the above, but with $Av$.
* Repeat the above, but with $A^2$.
* Looking at the timings, what do you observe? When is it worth moving data to the GPU to perform computations there?

```python
import numpy as np
import cupy as cp
from time import time

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=1.5*1024**3)  # 1.5 GiB

N = 1024

x  = np.random.randn(N, N)
xg = cp.random.randn(N, N)

# Matrix multiplication on CPU
tic = time()
for i in range(10):
    x@x

t = (time()-tic)/10
print("Time matmat CPU:", t)

# Matrix multiplication on GPU
tic = time()
for i in range(10):
    xg@xg

t = (time()-tic)/10
print("Time matmat GPU:", t)

# CPU to GPU data movement
tic = time()
for i in range(10):
    yg = cp.asarray(x) # moves a numpy array stored on the CPU onto the GPU

t = (time()-tic)/10
print("Time CPU to GPU:", t)

# GPU to CPU data movement.
tic = time()
for i in range(10):
    y = cp.asnumpy(xg) # moves a cupy array stored on the GPU onto the CPU

t = (time()-tic)/10
print("Time GPU to CPU:", t)
```

```
Time matmat CPU: 0.00620572566986084
Time matmat GPU: 0.007312226295471192
Time CPU to GPU: 0.0027472972869873047
Time GPU to CPU: 0.0009063482284545898
```
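
One caveat when timing GPU code: CuPy launches GPU work asynchronously, so the wall clock may be read before the GPU has finished. Below is a minimal sketch of a timing helper that does one warm-up call and optionally synchronizes before reading the clock. The helper name `time_fn` and its `sync` parameter are illustrative assumptions; the CuPy part only runs if CuPy can be imported.

```python
from time import perf_counter

import numpy as np

def time_fn(fn, n_runs=10, sync=None):
    """Average wall-clock time of fn() over n_runs, after one warm-up call.

    sync, if given, is called before starting and before stopping the clock,
    so that pending asynchronous (GPU) work is included in the measurement.
    """
    fn()  # warm-up run, not timed
    if sync is not None:
        sync()
    tic = perf_counter()
    for _ in range(n_runs):
        fn()
    if sync is not None:
        sync()
    return (perf_counter() - tic) / n_runs

N = 1024
x = np.random.randn(N, N)
print("Time matmat CPU:", time_fn(lambda: x @ x))

try:
    import cupy as cp
except ImportError:
    cp = None  # no CuPy available: skip the GPU timing

if cp is not None:
    xg = cp.random.randn(N, N)
    # Wait for the default stream to finish before reading the clock.
    gpu_sync = cp.cuda.Stream.null.synchronize
    print("Time matmat GPU:", time_fn(lambda: xg @ xg, sync=gpu_sync))
```

The warm-up call plays the same role as running the cell twice: one-off start-up costs are kept out of the measurement.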
