## Exercise: Write Allocates

**Ideally, this example should be run on Noctua 1.**

In [14]:
function striad!(a, b, c, d)
    for i in eachindex(a, b, c, d)
        a[i] = b[i] + c[i] * d[i]
    end
    return nothing
end

N = 1_000_000
a = rand(N)
b = rand(N)
c = rand(N)
d = rand(N)

striad!(a, b, c, d)

1) Looking at the Schönauer Triad kernel (i.e. the `striad!` function above),
how many LOADs (data reads) and STOREs (data writes) do you expect to happen per iteration? Put differently, how many bytes do
you think will need to be transferred to/from memory?

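**Answer:** Naively, one would expect 3 LOADs (`b[i]`, `c[i]`, `d[i]`) and 1 STORE (`a[i]`) per loop iteration. With `Float64` elements (8 bytes each), that amounts to 4 x 8 bytes = 32 bytes per iteration.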

2) Use LIKWID.jl to empirically measure how much data has been read from / written to memory.
  - Hint: Depending on availability, you want to measure the "DATA" or "MEM" performance group.

In [3]:
using LIKWID

In [15]:
@perfmon "DATA" striad!(a, b, c, d);
# @perfmon "MEM" striad!(a, b, c, d);


Group: DATA
┌────────────────────────────┬───────────┐
│                      Event │  Thread 1 │
├────────────────────────────┼───────────┤
│           ACTUAL_CPU_CLOCK │ 3.93956e6 │
│              MAX_CPU_CLOCK │ 2.74326e6 │
│       RETIRED_INSTRUCTIONS │ 1.02521e6 │
│        CPU_CLOCKS_UNHALTED │ 3.82725e6 │
│          LS_DISPATCH_LOADS │  802391.0 │
│         LS_DISPATCH_STORES │  267408.0 │
│ LS_DISPATCH_LOAD_OP_STORES │      42.0 │
└────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 0.00109885 │
│ Runtime unhalted [s] │ 0.00160802 │
│          Clock [MHz] │    3518.32 │
│                  CPI │    3.73314 │
│  Load to store ratio │    3.00031 │
└──────────────────────┴────────────┘
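
As a sanity check, the reported load-to-store ratio can be recomputed from the raw event counts in the table. The formula used below is an assumption inferred from the numbers themselves (it reproduces the reported value); it is not taken from the LIKWID documentation, and the variable names are purely illustrative.

```julia
# Event counts copied from the DATA group table above.
loads          = 802391.0   # LS_DISPATCH_LOADS
stores         = 267408.0   # LS_DISPATCH_STORES
load_op_stores = 42.0       # LS_DISPATCH_LOAD_OP_STORES

# Assumed formula (inferred from the numbers, not from the LIKWID docs):
# load-op-store operations are counted on both sides of the ratio.
ratio = (loads + load_op_stores) / (stores + load_op_stores)

println("Load to store ratio: ", round(ratio, digits = 5))
```

Under this assumption the result agrees with the `Load to store ratio` row of the metric table.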


3) What ratio of reads to writes do you find? How many LOADs and STOREs actually happen per iteration?

**Answer:** In the sample output above, the load-to-store ratio is ~3, which matches the naive
expectation. On a system where write-allocates show up in the measurement, the ratio is ~4 instead.
We would then conclude that there are 4 LOADs per 1 STORE, i.e. one more LOAD than expected,
because `a` is also read before it is written to ("write-allocate").

The reason you might see a higher load/store ratio is so-called "write-allocates": on some systems, a piece of memory has to be loaded into cache before it can be written to (e.g. by reading from it first). Hence you get one extra LOAD.
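
The accounting behind this can be sketched in a few lines. `striad_traffic` is a hypothetical helper written for this illustration (it is not part of LIKWID.jl), and it assumes `Float64` elements and exactly one extra LOAD of `a[i]` per iteration when write-allocates occur.

```julia
# Per-iteration traffic for a[i] = b[i] + c[i] * d[i] with Float64 elements,
# with and without a write-allocate on `a`.
const BYTES_PER_ELEMENT = sizeof(Float64)  # 8 bytes

# Hypothetical helper: returns loads, stores, load/store ratio,
# and bytes moved per loop iteration.
function striad_traffic(; write_allocate::Bool)
    loads  = 3 + (write_allocate ? 1 : 0)  # b, c, d (+ a with write-allocate)
    stores = 1                             # a
    bytes  = (loads + stores) * BYTES_PER_ELEMENT
    return (; loads, stores, ratio = loads / stores, bytes)
end

naive = striad_traffic(write_allocate = false)
wa    = striad_traffic(write_allocate = true)

println("naive:          ", naive)
println("write-allocate: ", wa)
```

With these assumptions, the naive model gives a ratio of 3 and 32 bytes per iteration, while the write-allocate model gives a ratio of 4 and 40 bytes per iteration.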

4) In the exercise "cache_sizes" we used SDAXPY rather than STRIAD.
  * How would the bandwidth values for striad (qualitatively) compare to our
    sdaxpy results assuming we didn't account for write-allocates?
  * Focusing on data volume rather than data transfer,
    how much data is held for one iteration of sdaxpy and striad, respectively?
    Does a multiple of this data volume fit nicely into L1 cache (in either case)?

**Answer:** The bandwidth values for STRIAD would be lower than for SDAXPY (and would thus underestimate the
maximal bandwidth for each memory level) because we would assume that 32 bytes are
transferred per iteration whereas in reality it might be 40 bytes (not a power of 2).
As for data volume, striad needs 4 x 8 bytes = 32 bytes per iteration whereas
sdaxpy needs 3 x 8 bytes = 24 bytes (not a power of 2). Since the L1 cache size
is usually a power of 2, (parts of) the 4 vectors for striad fit nicely,
but generally not the vectors for sdaxpy.
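
The arithmetic in this answer can be checked with a short sketch. The 32 KiB L1 data cache size is an assumption chosen for illustration (check the actual value for your CPU), and the variable names are likewise illustrative.

```julia
# Assumption: a 32 KiB L1 data cache; the real size depends on the CPU.
L1_BYTES = 32 * 1024

# Data held per loop iteration (Float64 elements, 8 bytes each).
bytes_per_iter = (striad = 4 * sizeof(Float64),  # a, b, c, d -> 32 bytes
                  sdaxpy = 3 * sizeof(Float64))  # 3 vectors  -> 24 bytes

for (kernel, nbytes) in pairs(bytes_per_iter)
    iters, leftover = divrem(L1_BYTES, nbytes)
    println(kernel, ": ", iters, " iterations fit, ", leftover, " bytes left over")
end

# Bandwidth error for striad if write-allocates are ignored:
# we would count 32 bytes per iteration although 40 bytes are moved.
assumed_bytes, actual_bytes = 32, 40
println("reported / true bandwidth = ", assumed_bytes / actual_bytes)
```

For the assumed 32 KiB cache, exactly 1024 striad iterations fit with nothing left over, whereas sdaxpy fits 1365 iterations and leaves 8 bytes unused. Ignoring write-allocates would make the reported striad bandwidth 0.8 times the true value.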