Sources

Blocking
https://gist.github.com/metallurgix/8ee5262ed818730b5cc0

George Hotz
(yt)

Medium Article
https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa




The measurements below were taken on a MacBook Pro M3 (Khaled's setup).

| Description | Matrix size | FLOP/s |
|-------------|-------------|--------|
| Python without numpy | 256 | 0.0092 GFLOPS |
| C++ without transposing the second matrix | 1024 | 1.6891 GFLOPS |
| C++ with transposing the second matrix | 1024 | 2.1991 GFLOPS |
| Python with numpy (single thread) | 1024 | 25.5003 GFLOPS |
| C++ with transposing and blocking (-O3, blockSize=4) | 2048 | 28.9517 GFLOPS |
| Python with numpy | 2048 | 95.0181 GFLOPS |
| C++ with multithreading, transposing and blocking (-O3, blockSize=4) | 8192 | 171.3636 GFLOPS |
| C++ with multithreading, transposing and blocking (-O3, blockSize=4, -ffast-math) | 8192 | 209.5 GFLOPS |
| C++ with **prefetching** and **Arm NEON vectors**, multithreading, transposing and blocking (-O3, blockSize=8, -ffast-math) | 16384 | 712.849 **TFLOPS** |
| C++ with **prefetching** and **Arm NEON vectors**, multithreading, transposing and blocking (-O3, blockSize=16, -ffast-math) | 32768 | 3.772 **PFLOPS** |
| C++ with **prefetching** and **Arm NEON vectors**, multithreading, transposing and blocking (-O3, blockSize=32, -ffast-math) | 32768 | 5.685 **PFLOPS** (5685 TFLOPS) |

(block sizes over 32 yield fewer FLOP/s)
(-O3 is faster than -O2 for me)
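The FLOP/s figures depend on how the operation count is defined. A minimal sketch, assuming the usual convention that an N x N by N x N multiply costs 2·N³ floating-point operations (one multiply and one add per inner-loop step) and that only the multiply itself is timed. The names `gflops`, `matmul_naive` and `measure_gflops` are illustrative, not taken from the benchmark code:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Convert a measured wall-clock time into GFLOP/s for an N x N matmul,
// assuming the conventional count of 2 * N^3 floating-point operations.
double gflops(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / seconds / 1e9;
}

// Hypothetical reference kernel: C = A * B for row-major N x N matrices.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

// Time a single multiply and report the throughput in GFLOP/s.
double measure_gflops(std::size_t n) {
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    const auto start = std::chrono::steady_clock::now();
    matmul_naive(A, B, C, n);
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    // Read a result so the computation cannot be discarded.
    assert(C[0] == 2.0f * static_cast<float>(n));
    return gflops(n, seconds);
}
```

Calling `measure_gflops(1024)` would give a number comparable to the "without transposing" row, provided the same compiler flags are used.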

### How did we go from ~200 GFLOPS to ~5.5 PFLOPS?


# prefetching
The major improvement came from prefetching the block data into the L1 cache.

```cpp
// Hint the CPU to load the cache line at addr into L1 (AArch64 only).
void prefetch_block(float* addr) {
    asm volatile("prfm pldl1keep, [%0]" : : "r" (addr));
}
```

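`prfm` stands for "PreFetch Memory".

`pldl1keep` breaks down as:

- `pld` = PreLoad Data
- `l1` = into the L1 cache
- `keep` = try to keep it in the cache

`[%0]` refers to operand 0, the first operand in the asm statement's operand lists. There are no output operands here, so it is the input `"r" (addr)`; the square brackets make that register the base address of the prefetch.

`"r" (addr)` tells the compiler to put the value of `addr` into a general-purpose register.

`volatile` tells the compiler not to optimize the statement away. It does not by itself guarantee ordering relative to the surrounding non-volatile code.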
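For comparison, GCC and Clang also expose prefetching through the `__builtin_prefetch(addr, rw, locality)` builtin, which avoids hand-written assembly and compiles on non-Arm targets too. A minimal sketch, assuming a GCC-compatible compiler; the wrapper names are illustrative, and whether the builtin lowers to exactly `prfm pldl1keep` on a given target is best confirmed in the generated assembly:

```cpp
#include <cassert>
#include <cstddef>

// Portable prefetch hint: rw = 0 (prefetch for reading), locality = 3
// (high temporal locality, i.e. try to keep the data in cache).
inline void prefetch_read(const void* addr) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);
#else
    (void)addr;  // no-op on compilers without the builtin
#endif
}

// Sum an array while hinting at the data `ahead` elements further on.
float sum_with_prefetch(const float* data, std::size_t n, std::size_t ahead) {
    assert(data != nullptr || n == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            prefetch_read(data + i + ahead);  // stay inside the array
        total += data[i];
    }
    return total;
}
```

Issuing a hint on every element is only for illustration; one hint per cache line is normally enough.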

We use `prefetch_block` to prefetch the next block into the cache:

```cpp
...
    for (bi = 0; bi < N; bi += blockSize) {
        for (bj = 0; bj < N; bj += blockSize) {
            // Hint the start of the next block of A and of B_trans.
            prefetch_block(&A[bi + blockSize][0]);
            prefetch_block(&B_trans[bj + blockSize][0]);
...
        }
    }
```

While the current block at `A[bi]` is being processed, the code prefetches data from `A[bi + blockSize]`. The same is done for `B_trans`.

This should reduce cache misses, because the data for the next block is already in the cache by the time it is needed.
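To make the loop structure concrete, here is a self-contained sketch of a blocked multiply on a transposed second matrix with next-block prefetch hints. It is written under several assumptions that the snippet above does not confirm: flat row-major `std::vector` storage, blocking over rows of `A` and `B_trans` only (the `k` dimension is not blocked), `n` divisible by the block size, and the portable `__builtin_prefetch` in place of the inline `prfm`. The function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read-prefetch hint; compiles to nothing on compilers without the builtin.
inline void prefetch_block(const float* addr) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);
#else
    (void)addr;
#endif
}

// C = A * B, where Bt holds B transposed (Bt[j][k] == B[k][j]).
// All matrices are row-major n x n; n must be a multiple of block.
void matmul_blocked_prefetch(const std::vector<float>& A,
                             const std::vector<float>& Bt,
                             std::vector<float>& C,
                             std::size_t n, std::size_t block) {
    assert(block > 0 && n % block == 0);
    for (std::size_t bi = 0; bi < n; bi += block) {
        for (std::size_t bj = 0; bj < n; bj += block) {
            // Hint the first row of the next blocks, guarding the edges.
            if (bi + block < n) prefetch_block(&A[(bi + block) * n]);
            if (bj + block < n) prefetch_block(&Bt[(bj + block) * n]);
            for (std::size_t i = bi; i < bi + block; ++i) {
                for (std::size_t j = bj; j < bj + block; ++j) {
                    float sum = 0.0f;
                    // Both operands are read row-wise thanks to the transpose.
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i * n + k] * Bt[j * n + k];
                    C[i * n + j] = sum;
                }
            }
        }
    }
}
```

The edge guards skip the hint on the last block instead of forming an address past the end of the matrix.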

# neon vectors
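As an illustration only (this is not the kernel behind the numbers above), the inner dot product of a row of `A` with a row of `B_trans` could be vectorized with NEON intrinsics roughly as follows. A 128-bit NEON register holds four `float` lanes, so the loop handles four elements per step. The function name `dot` and the scalar fallback for non-AArch64 targets are assumptions added for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// Dot product of two float arrays of length n.
float dot(const float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    std::size_t k = 0;
    float sum = 0.0f;
#if HAVE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);       // four partial sums
    for (; k + 4 <= n; k += 4) {
        float32x4_t va = vld1q_f32(a + k);     // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + k);     // load 4 floats from b
        acc = vfmaq_f32(acc, va, vb);          // acc += va * vb (fused)
    }
    sum = vaddvq_f32(acc);                     // add the 4 lanes together
#endif
    for (; k < n; ++k)                         // scalar tail, or full fallback
        sum += a[k] * b[k];
    return sum;
}
```

Because the lanes accumulate in a different order than a scalar loop, results can differ in the last bits for general inputs.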

Docs: 


### How to run the fastest version (macOS)
The flags that improve the FLOP/s:

`-O3` (gives better results than `-O2`)

`-march=native`

`-ffast-math`


For the parallel solution with blocks:
`clang++ prefetch.cpp -O3 -march=native -ffast-math -Xpreprocessor -fopenmp -I/opt/homebrew/include -L/opt/homebrew/lib -lomp -o compiled/prefetch && ./compiled/prefetch`
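The `-fopenmp` and `-lomp` flags suggest the multithreaded versions use OpenMP. A minimal sketch of how the outer loop might be parallelized, assuming a transposed second matrix in flat row-major storage; the function name and loop layout are hypothetical and not taken from `prefetch.cpp`. Without `-fopenmp` the pragma is ignored and the code runs on a single thread:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B with Bt holding B transposed; all matrices are row-major n x n.
void matmul_parallel(const std::vector<float>& A, const std::vector<float>& Bt,
                     std::vector<float>& C, int n) {
    assert(n >= 0);
    const std::size_t N = static_cast<std::size_t>(n);
    assert(A.size() == N * N && Bt.size() == N * N && C.size() == N * N);
    // Each thread gets a disjoint set of output rows, so no locking is needed.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        const float* a_row = &A[static_cast<std::size_t>(i) * N];
        float* c_row = &C[static_cast<std::size_t>(i) * N];
        for (int j = 0; j < n; ++j) {
            const float* b_row = &Bt[static_cast<std::size_t>(j) * N];
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a_row[k] * b_row[k];
            c_row[j] = sum;
        }
    }
}
```

Splitting by output rows keeps every write private to one thread, which is why no reduction or critical section is required.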