High-performance and parallel computing for AI - Practical 1: HPC Architecture
==============================================================================


IMPORTANT
=========

These practicals use a different conda environment. When opening a notebook or a terminal, make sure you are using the **CuPy Kernel**!

Question 1
----------

This question is to help you understand the architecture of a computing server.

In Jupyter, open a new window and open a terminal. On the terminal, type

```bash
lscpu
```

This outputs a plethora of information about goliat's CPU. Can you answer the following questions?

1- Who is the CPU manufacturer? In which year was this CPU released (you will need to look up the CPU model online)?

2- How many cores does the CPU have? Be careful: look up what hyperthreading is (goliat has it enabled).

3- What is the largest square matrix of doubles (64 bits) that you can fit in the L1i cache?

4- Assume an architecture similar to the one we saw in the lectures, in which the L3 cache is shared. What is the largest square matrix of doubles (64 bits) that you can fit on a socket without spilling over to RAM?

Solution to question 1
----------------------

1- AMD. Product launched in 2021.

2- 128 physical cores. Hyperthreading is enabled, so the operating system reports 256 logical CPUs.

3- There are 128 L1i caches totalling 4 MB, and each double takes 64 bits (8 bytes), so $\sqrt{4\text{ MB}/128/(8\text{ bytes})} \approx 62$.

4- There are 584 MB of cache in total. Dividing by 4, since there are 2 sockets and 2 core dies per socket, gives 146 MB. Then $\sqrt{146\text{ MB}/(8\text{ bytes})} \approx 4272$ (taking 1 MB $= 10^6$ bytes), so a matrix of size up to $4272 \times 4272$ fits without spilling over to RAM. In practice some of the cache will always be occupied by other data.
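
The cache arithmetic in answers 3 and 4 can be checked with a few lines of Python. This is only a sketch: the helper name `largest_square_matrix` is made up for illustration, and the sizes are taken as decimal megabytes ($10^6$ bytes). If `lscpu` reports MiB ($2^{20}$ bytes), the results come out a few percent larger.

```python
import math

def largest_square_matrix(cache_bytes, bytes_per_element=8):
    """Largest n such that an n x n matrix of doubles fits in cache_bytes."""
    return math.isqrt(int(cache_bytes) // bytes_per_element)

MB = 10**6    # decimal megabyte (assumption, see the text above)
MiB = 2**20   # binary mebibyte, for comparison

# Answer 3: one L1i cache, i.e. 4 MB in total shared out over 128 cores.
n_l1 = largest_square_matrix(4 * MB / 128)

# Answer 4: a quarter of the 584 MB of total cache.
n_shared = largest_square_matrix(584 * MB / 4)

print("L1i, decimal MB:", n_l1)
print("L1i, binary MiB:", largest_square_matrix(4 * MiB / 128))
print("shared caches, decimal MB:", n_shared)
```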

Question 2
----------

This question teaches you how to determine whether a computation is compute-bound or memory-bound.

Assume you are using a CPU with a peak computational performance of 500 GFLOPs/s and a memory bandwidth of 50 GB/s (this could roughly be a laptop's CPU). Let $n=1024$ and take $A,B\in\mathbb{R}^{n\times n}$, $v\in\mathbb{R}^n$. Assume they are stored in single precision (32 bits/number). Are the following computations memory- or compute-bound?

1- The matrix-vector product $Av$.

2- The matrix-matrix product $AB$.

Draw a roofline plot. Where would these computations be on the plot? What would happen if the entries of the matrices and vectors were booleans (i.e., only 1 bit each)?


Hints: First, compute the FLOPs of each operation and the memory occupied by each variable. Second, compute the arithmetic intensity (FLOPs divided by the memory required). Third, draw the roofline plot. Lastly, mark the arithmetic intensity of each operation on the plot with a vertical line: the problem does not give you the achieved GFLOPs/s of these operations, which you would need to measure in practice.

Solution to Question 2
----------------------

**Cost.**
* Matvec cost $= 2n^2 \approx 2$ MFLOPs.
* Matmat cost $= 2n^3 \approx 2$ GFLOPs. 

**Memory occupied.**
* Matvec $= n^2 + n$ single-precision numbers $\approx 4.2$ MB.
* Matmat $= 2n^2$ single-precision numbers $\approx 8.4$ MB.

**Arithmetic intensity.**
* Matvec: $2$ MFLOPs $/$ $4.2$ MB $\approx 0.5$ FLOPs/Byte.
* Matmat: $2$ GFLOPs $/$ $8.4$ MB $\approx 240$ FLOPs/Byte.
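
The same numbers can be reproduced without intermediate rounding. The sketch below follows the solution in counting only the input operands as memory traffic. The function names are illustrative, not from the lectures. Without rounding, the matmat intensity is exactly $n/4 = 256$ FLOPs/Byte, which is consistent with the rounded estimate above.

```python
def matvec_intensity(n, bytes_per_element=4.0):
    """FLOPs per byte for A @ v, counting A and v as the data read."""
    flops = 2 * n**2
    bytes_read = (n**2 + n) * bytes_per_element
    return flops / bytes_read

def matmat_intensity(n, bytes_per_element=4.0):
    """FLOPs per byte for A @ B, counting A and B as the data read."""
    flops = 2 * n**3
    bytes_read = 2 * n**2 * bytes_per_element
    return flops / bytes_read

n = 1024
print(f"matvec: {matvec_intensity(n):.3f} FLOPs/Byte")
print(f"matmat: {matmat_intensity(n):.1f} FLOPs/Byte")
```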

In the roofline plot, the threshold between memory-bound and compute-bound occurs at the corner of the roofline (the black line in the lectures). This is when

$$ \text{bandwidth} \times \text{arithmetic intensity} = \text{peak performance} $$

which happens at the arithmetic intensity given by

$$ \text{threshold intensity} = \frac{\text{peak performance}}{\text{bandwidth}} = \frac{500 \text{GFLOPs/s}}{50 \text{GB/s}} = 10 \text{FLOPs/Byte}. $$
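
As a rough sketch of the roofline itself, the attainable performance at a given arithmetic intensity is the smaller of the two roofs. The names below (`ridge_point`, `attainable_gflops`, `plot_roofline`) are illustrative. The plotting function assumes matplotlib is installed and imports it lazily, so the numerical part runs without it.

```python
PEAK_GFLOPS = 500.0    # peak performance from the question, in GFLOPs/s
BANDWIDTH_GBS = 50.0   # memory bandwidth from the question, in GB/s

def ridge_point(peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Arithmetic intensity (FLOPs/Byte) at the corner of the roofline."""
    return peak / bandwidth

def attainable_gflops(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Roofline model: the lower of the bandwidth roof and the compute roof."""
    return min(peak, bandwidth * intensity)

def plot_roofline(marked=None):
    """Draw the roofline on log-log axes, with optional vertical markers.

    `marked` maps a label to an arithmetic intensity. matplotlib is imported
    here so that the rest of the block runs without it.
    """
    import matplotlib.pyplot as plt

    # Include the ridge point so that the corner is drawn exactly.
    xs = sorted({2.0**k for k in range(-4, 11)} | {ridge_point()})
    plt.loglog(xs, [attainable_gflops(x) for x in xs], "k-")
    for i, (label, intensity) in enumerate((marked or {}).items()):
        plt.axvline(intensity, linestyle="--", color=f"C{i}", label=label)
    plt.xlabel("Arithmetic intensity (FLOPs/Byte)")
    plt.ylabel("Attainable performance (GFLOPs/s)")
    if marked:
        plt.legend()
    plt.show()

print("ridge point:", ridge_point(), "FLOPs/Byte")
for label, intensity in [("matvec", 0.5), ("matmat", 240.0)]:
    print(label, "attains at most", attainable_gflops(intensity), "GFLOPs/s")
```

In a notebook, calling `plot_roofline({"matvec": 0.5, "matmat": 240.0})` would draw the roofline together with the two vertical lines suggested in the hints.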

**Result**
* Since $0.5 < 10$, matvecs are memory-bound.
* Since $240 > 10$, matmats are compute-bound.
* If the numbers were booleans, each would use $1/32$ of the bits, so the arithmetic intensity would increase by a factor of $32$. Matrix-vector products would then have an arithmetic intensity of $16$ FLOPs/Byte, which means they would be compute-bound as well!
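
To see how the element size changes the verdict, the short sketch below repeats the matvec calculation for a few storage widths. It keeps the simplification used in the solution: the operation count is unchanged and only the bytes per element vary. The helper names are illustrative.

```python
THRESHOLD = 500.0 / 50.0  # peak performance / bandwidth, in FLOPs/Byte

def matvec_intensity(n, bits_per_element):
    """FLOPs per byte for A @ v when each entry takes bits_per_element bits."""
    bytes_read = (n**2 + n) * bits_per_element / 8
    return 2 * n**2 / bytes_read

def classify(intensity, threshold=THRESHOLD):
    """Compare an arithmetic intensity with the corner of the roofline."""
    return "compute-bound" if intensity > threshold else "memory-bound"

for name, bits in [("float64", 64), ("float32", 32), ("1-bit", 1)]:
    ai = matvec_intensity(1024, bits)
    print(f"{name:8s} {ai:6.2f} FLOPs/Byte -> {classify(ai)}")
```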


Question 3
----------

For this question we will be using numpy and [cupy](https://cupy.dev/).

In this question we investigate the cost of data movement. To move data between CPUs in Python you would need to know MPI, which will be taught later on in the course. Here, we move data between CPUs and GPUs instead using cupy.

* Read and understand the code below. Run it (possibly twice to make sure there is no caching). Do you expect matrix multiplications to be faster on the CPU or on the GPU? Do you expect it to be faster to move data to or from the GPU (or the same)?
* Create a random square matrix $A$ and a vector $v$ of size $n=1024$ on the CPU (using numpy) and on the GPU (using cupy; it does not matter if the values are not the same). Then write three functions:
    1) A function that computes $v\cdot v$ on the CPU using the CPU variables.
    2) A function that computes $v\cdot v$ on the GPU using the GPU variables and then returns the answer to the CPU.
    3) A function that copies $v$ onto the GPU, computes $v\cdot v$ on the GPU, and then returns the answer to the CPU.

  Time all functions using 10 runs each. Remember to run the timing twice and to keep only the second measurement. Please add the lines
```python
mempool = cp.get_default_memory_pool()
mempool.set_limit(size=1.5*1024**3)  # 1.5 GiB limit on GPU memory usage.
```
before your code to make sure we do not overuse the GPU memory.

* Repeat the above, but with $Av$.
* Repeat the above, but with $A^2$.
* Looking at the timings, what do you observe? When is it worth moving data to the GPU to perform a computation there?

In [1]:
import numpy as np
import cupy as cp
from time import time

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=1.5*1024**3)  # 1.5 GiB

N = 1024

x  = np.random.randn(N, N)
xg = cp.random.randn(N, N)

# Matrix multiplication on CPU
tic = time()
for i in range(10):
    x@x

t = (time()-tic)/10
print("Time matmat CPU:", t)

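# Matrix multiplication on GPU
tic = time()
for i in range(10):
    xg@xg
# CuPy launches GPU kernels asynchronously: wait for them to finish before stopping the clock
cp.cuda.Stream.null.synchronize()

t = (time()-tic)/10
print("Time matmat GPU:", t)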

# CPU to GPU data movement
tic = time()
for i in range(10):
    yg = cp.asarray(x) # moves a numpy array stored on the CPU onto the GPU
    
t = (time()-tic)/10
print("Time CPU to GPU:", t)

# GPU to CPU data movement.
tic = time()
for i in range(10):
    y = cp.asnumpy(xg) # moves a cupy array stored on the GPU onto the CPU
    
t = (time()-tic)/10
print("Time GPU to CPU:", t)

Time matmat CPU: 0.005548596382141113
Time matmat GPU: 0.0057473421096801754
Time CPU to GPU: 0.0025550127029418945
Time GPU to CPU: 0.0008652210235595703


Solutions to question 3
-----------------------

The GPU is supposed to be faster at matrix multiplication, but I haven't told you why yet. Data movement should in principle take the same time in both directions, but in the timings above GPU to CPU appears to be faster.
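
One caveat when timing GPU code: CuPy launches kernels asynchronously, so a loop that only launches work can finish on the CPU before the GPU is done. The first call to an operation also typically includes one-off start-up costs, which is why the cells are run twice. A minimal timing helper that handles both might look as follows. The helper `average_time` and its parameters are hypothetical, and the demonstration uses a pure-Python workload so that it runs without a GPU.

```python
from time import perf_counter

def average_time(fn, *args, runs=10, warmup=1, sync=None):
    """Average wall-clock seconds per call of fn(*args).

    sync: optional zero-argument callable that blocks until all pending
    work has finished (needed when fn only launches asynchronous work).
    """
    for _ in range(warmup):      # untimed calls absorb one-off start-up costs
        fn(*args)
    if sync is not None:
        sync()
    tic = perf_counter()
    for _ in range(runs):
        fn(*args)
    if sync is not None:         # wait for queued work before stopping the clock
        sync()
    return (perf_counter() - tic) / runs

# CPU-only demonstration with a pure-Python dot product.
def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

u = [1.0] * 1024
t = average_time(dot, u, u)
print("Average time per dot product:", t)

# With CuPy arrays (requires a GPU, not run here) one could write, for example:
#   average_time(lambda: xg @ xg, sync=cp.cuda.Stream.null.synchronize)
```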

The functions are below.

In [2]:
import numpy as np
import cupy as cp
from time import time

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=1.5*1024**3)  # 1.5 GiB

N = 1024

A  = np.random.randn(N, N)
Ag = cp.random.randn(N, N)
v  = np.random.randn(N)
vg = cp.random.randn(N)

# Note: these functions still work if you call them with A rather than v (A@A is then the matrix product A^2)
def vdot1(v):
    return v@v

def vdot2(vg):
    outg = vg@vg
    out = cp.asnumpy(outg)
    return out

def vdot3(v):
    vgpu = cp.asarray(v)
    outg = vgpu@vgpu
    out = cp.asnumpy(outg)
    return out

def mat1(A,v):
    return A@v

def mat2(Ag,vg):
    outg = Ag@vg
    out = cp.asnumpy(outg)
    return out

def mat3(A,v):
    Agpu = cp.asarray(A)
    vgpu = cp.asarray(v)
    outg = Agpu@vgpu
    out = cp.asnumpy(outg)
    return out


######################## v \cdot v ################################

# vdotv on CPU
tic = time()
for i in range(10):
    out = vdot1(v)
t = (time()-tic)/10
print("Time vdotv CPU:", t)

# vdotv on GPU + move to CPU
tic = time()
for i in range(10):
    out = vdot2(vg)
t = (time()-tic)/10
print("Time vdotv GPU then move to CPU:", t)

# vdotv by moving to and from GPU
tic = time()
for i in range(10):
    out = vdot3(v)
t = (time()-tic)/10
print("Time vdotv by moving to and from GPU:", t)

######################## Av ################################

# Av on CPU
tic = time()
for i in range(10):
    out = mat1(A,v)
t = (time()-tic)/10
print("Time matvec CPU:", t)

# Av on GPU + move to CPU
tic = time()
for i in range(10):
    out = mat2(Ag,vg)
t = (time()-tic)/10
print("Time matvec GPU then move to CPU:", t)

# Av by moving to and from GPU
tic = time()
for i in range(10):
    out = mat3(A,v)
t = (time()-tic)/10
print("Time matvec by moving to and from GPU:", t)

######################## A^2 ################################

# A^2 on CPU
tic = time()
for i in range(10):
    out = vdot1(A)
t = (time()-tic)/10
print("Time matmat CPU:", t)

# A^2 on GPU + move to CPU
tic = time()
for i in range(10):
    out = vdot2(Ag)
t = (time()-tic)/10
print("Time matmat GPU then move to CPU:", t)

# A^2 by moving to and from GPU
tic = time()
for i in range(10):
    out = vdot3(A)
t = (time()-tic)/10
print("Time matmat by moving to and from GPU:", t)


Time vdotv CPU: 9.894371032714844e-06
Time vdotv GPU then move to CPU: 0.021835923194885254
Time vdotv by moving to and from GPU: 0.0003664731979370117
Time matvec CPU: 0.0008300542831420898
Time matvec GPU then move to CPU: 0.000778818130493164
Time matvec by moving to and from GPU: 0.001305699348449707
Time matmat CPU: 0.0051013708114624025
Time matmat GPU then move to CPU: 0.0037709712982177735
Time matmat by moving to and from GPU: 0.00427541732788086


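* $v \cdot v$. The pure CPU solution is best: there are only $O(n)$ flops, too little work to offset the overheads of using the GPU.
* $Av$. It pays off to use the GPU when $A$ and $v$ are already there (0.78 ms versus 0.83 ms on the CPU), although only marginally. Why does it pay off? The cost is $O(n^2)$ flops and only the $O(n)$ result has to be moved. If $A$ has to be copied over first, the data movement is also $O(n^2)$ and the GPU version (1.31 ms) is slower than the CPU.
* $A^2$. It pays off to use the GPU, even when $A$ has to be moved (4.28 ms versus 5.10 ms on the CPU). It is better still if $A$ is already there (3.77 ms), so that data movement is minimized. Why does it pay off? The cost is $O(n^3)$ flops for only $O(n^2)$ data movement.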
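
The scaling argument can be made concrete by counting, for the "move to and from the GPU" variant of each operation, how many flops are performed per byte transferred. This is a back-of-the-envelope sketch, assuming double precision (8 bytes per number, the default of `np.random.randn`) and ignoring latency and launch overheads. The function name is illustrative.

```python
def flops_per_byte_moved(flops, elements_moved, bytes_per_element=8):
    """Flops performed for every byte transferred between CPU and GPU."""
    return flops / (elements_moved * bytes_per_element)

n = 1024
cases = {
    # name: (flops, numbers moved to the GPU plus numbers moved back)
    "v.v": (2 * n,    n + 1),          # v over, one scalar back
    "Av":  (2 * n**2, n**2 + n + n),   # A and v over, a vector back
    "A^2": (2 * n**3, n**2 + n**2),    # A over, an n x n matrix back
}

ratios = {name: flops_per_byte_moved(f, m) for name, (f, m) in cases.items()}
for name, ratio in ratios.items():
    print(f"{name:4s} {ratio:8.3f} flops per byte moved")
```

Only $A^2$ performs many flops per byte moved, which matches the observation that it is the one case where copying the data to the GPU is clearly worthwhile.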