## Exercise: Use LIKWID to Count FLOPs

First, let's check that LIKWID is working. The following cells should run without errors and print the performance groups that LIKWID supports on this machine.

In [3]:
using LIKWID

In [4]:
PerfMon.supported_groups()

Dict{String, LIKWID.GroupInfoCompact} with 30 entries:
  "L2CACHE"        => L2CACHE => L2 cache miss rate/ratio
  "NUMA"           => NUMA => Local and remote memory accesses
  "QPI"            => QPI => QPI traffic between sockets
  "MEM"            => MEM => Main memory bandwidth in MBytes/s
  "CYCLE_ACTIVITY" => CYCLE_ACTIVITY => Cycle Activities
  "BRANCH"         => BRANCH => Branch prediction miss rate/ratio
  "FLOPS_SP"       => FLOPS_SP => Single Precision MFLOP/s
  "RECOVERY"       => RECOVERY => Recovery duration
  "DIVIDE"         => DIVIDE => Divide unit information
  "L2"             => L2 => L2 cache bandwidth in MBytes/s
  "FALSE_SHARE"    => FALSE_SHARE => False sharing
  "L3"             => L3 => L3 cache bandwidth in MBytes/s
  "L3CACHE"        => L3CACHE => L3 cache miss rate/ratio
  "UOPS_EXEC"      => UOPS_EXEC => UOPs execution
  "CYCLE_STALLS"   => CYCLE_STALLS => Cycle Activities (Stalls)
  "ICACHE"         => ICACHE => Instruction cache miss rate/ratio
  "CACH

Great, you're set up!

**You can find the instructions for this exercise/tutorial here:**   
https://juliaperf.github.io/LIKWID.jl/dev/tutorials/counting_flops/

In [3]:
# ...Your code goes here...

In [5]:
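# DAXPY kernel: z = a * x + y, computed element-wise and in place.
daxpy!(z, a, x, y) = z .= a .* x .+ y

const N = 10_000
const a = 3.141
const x = rand(N)
const y = rand(N)
const z = zeros(N)

# Run once up front so that compilation is not part of the measurement.
daxpy!(z, a, x, y);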

In [6]:
metrics, events = @perfmon "FLOPS_DP" daxpy!(z, a, x, y);


Group: FLOPS_DP
┌──────────────────────────────────────┬──────────┐
│                                Event │ Thread 1 │
├──────────────────────────────────────┼──────────┤
│                    INSTR_RETIRED_ANY │  23610.0 │
│                CPU_CLK_UNHALTED_CORE │  85577.0 │
│                 CPU_CLK_UNHALTED_REF │ 185328.0 │
│ FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE │      0.0 │
│ FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE │      0.0 │
│            SIMD_FP_256_PACKED_DOUBLE │   5229.0 │
└──────────────────────────────────────┴──────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │ 0.000201794 │
│ Runtime unhalted [s] │  3.29132e-5 │
│          Clock [MHz] │     1200.61 │
│                  CPI │     3.62461 │
│         DP [MFLOP/s] │      103.65 │
│     AVX DP [MFLOP/s] │      103.65 │
│     Packed [MUOPS/s] │     25.9126 │
│     Scalar [MUOPS/s] │    

In [7]:
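function count_FLOPs(N)
    a = 3.141
    x = rand(N)
    y = rand(N)
    z = zeros(N)
    # Measure the kernel with the FLOPS_DP group, without printing the tables.
    metrics, _ = perfmon(() -> daxpy!(z, a, x, y), "FLOPS_DP"; print=false)
    # Take the first thread's entry and convert MFLOP/s to FLOP/s.
    flops_per_second = first(metrics["FLOPS_DP"])["DP [MFLOP/s]"] * 1e6
    runtime = first(metrics["FLOPS_DP"])["Runtime (RDTSC) [s]"]
    # FLOP rate times runtime gives the number of FLOPs performed.
    return round(Int, flops_per_second * runtime)
end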

count_FLOPs (generic function with 1 method)
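
`count_FLOPs` recovers an absolute count from a rate and a duration. As a rough sketch of the arithmetic, with $t_{\text{RDTSC}}$ as shorthand (not a LIKWID name) for the `Runtime (RDTSC) [s]` metric:

$$
N_{\text{FLOPs}} = \text{DP [MFLOP/s]} \times 10^{6} \times t_{\text{RDTSC}}
$$

Plugging in the values from the `@perfmon` tables printed earlier gives

$$
103.65 \times 10^{6}\,\tfrac{\text{FLOP}}{\text{s}} \times 0.000201794\,\text{s} \approx 20916 .
$$

This matches $4 \times 5229 = 20916$ computed from the `SIMD_FP_256_PACKED_DOUBLE` event of the same run, assuming each 256-bit packed operation processes four double-precision values.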

In [8]:
NFLOPs_expected(N) = 2 * N

NFLOPs_expected (generic function with 1 method)
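
The expectation of `2 * N` comes from the structure of DAXPY: each element needs one multiplication and one addition. A minimal sketch that tallies the operations by hand in an explicit loop may make this concrete. The helper `daxpy_loop_count!` and the variables `n`, `xs`, `ys`, `zs` are hypothetical names used only for illustration; they are not part of LIKWID.jl or the tutorial.

```julia
# Hypothetical helper (not part of LIKWID.jl): an explicit-loop DAXPY that
# tallies floating-point operations by hand.
function daxpy_loop_count!(z, a, x, y)
    nflops = 0
    for i in eachindex(z, x, y)
        z[i] = a * x[i] + y[i]  # one multiplication and one addition
        nflops += 2
    end
    return nflops
end

n = 1_000
xs, ys, zs = rand(n), rand(n), zeros(n)
nflops = daxpy_loop_count!(zs, 3.141, xs, ys)
```

This counts operations in the source code only. The hardware counters read by LIKWID report what the CPU actually executed, which is why the measurement is a useful cross-check.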

In [9]:
count_FLOPs(N)

20000

In [10]:
count_FLOPs(2 * N) == NFLOPs_expected(2 * N)

true