# Exercise: Performance Optimization 2

Optimize the following function.

In [1]:
function work!(A, B, v, N)
    val = zero(eltype(v))
    for i in 1:N
        for j in 1:N
            val = mod(v[i],256)
            A[i,j] = B[i,j] * (sin(val) * sin(val) - cos(val) * cos(val))
        end
    end
    return A
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [2]:
using Random
Random.seed!(42)

const N = 4000
const A = zeros(N,N)
const B = rand(N,N)
const v = rand(Int, N);

const A_result = work!(A,B,v,N);

You can compare against `A_result` to test your implementation(s):

In [3]:
using Test

@test work!(A,B,v,N) ≈ A_result

[32m[1mTest Passed[22m[39m

You can benchmark as follows:

In [4]:
using BenchmarkTools

@btime work!($A, $B, $v, $N); # or use @benchmark for more information

  462.066 ms (0 allocations: 0 bytes)


## Your Optimizations

Your optimized variants go here!

**Hints** (hopefully):
* What is suboptimal about the code? What is it that you'd want to change (but can't directly)?
* Sometimes writing the code in a different way doesn't give direct speedups but enables further optimization.
* A >30x speedup should be possible on Noctua 2 😉

### Analytic optimization

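The trigonometric identity `sin(x)^2 - cos(x)^2 = -cos(2x)` replaces the sine and cosine products with a single cosine call: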

In [5]:
x = rand()
@test sin(x) * sin(x) - cos(x) * cos(x) ≈ -cos(2*x)

[32m[1mTest Passed[22m[39m

In [6]:
function work2!(A, B, v, N)
    val = zero(eltype(v))
    for i in 1:N
        for j in 1:N
            val = mod(v[i],256)
            A[i,j] = B[i,j] * (-cos(2*val))
        end
    end
    return A
end

@btime work2!($A, $B, $v, $N)
@test work2!(A, B, v, N) ≈ A_result

  146.684 ms (0 allocations: 0 bytes)


[32m[1mTest Passed[22m[39m

### Moving `val` computation

In [7]:
function work3!(A, B, v, N)
    val = zero(eltype(B))
    for i in 1:N
        val = -cos(2*mod(v[i],256))
        for j in 1:N
            A[i,j] = B[i,j] * val
        end
    end
    return A
end

@btime work3!($A, $B, $v, $N)
@test work3!(A, B, v, N) ≈ A_result

  55.109 ms (0 allocations: 0 bytes)


[32m[1mTest Passed[22m[39m

### Separating `val` computation

In [8]:
function work4!(A, B, v, N)
    val = [-cos(2*mod(x,256)) for x in v]
    
    for i in 1:N
        for j in 1:N
            A[i,j] = B[i,j] * val[i]
        end
    end
    return A
end

@btime work4!($A, $B, $v, $N)
@test work4!(A, B, v, N) ≈ A_result

  53.968 ms (2 allocations: 31.30 KiB)


[32m[1mTest Passed[22m[39m

### Switch loop order

In [9]:
function work5!(A, B, v, N)
    val = [-cos(2*mod(x,256)) for x in v]
    
    for j in 1:N
        for i in 1:N
            A[i,j] = B[i,j] * val[i]
        end
    end
    return A
end

@btime work5!($A, $B, $v, $N)
@test work5!(A, B, v, N) ≈ A_result

  12.011 ms (2 allocations: 31.30 KiB)


[32m[1mTest Passed[22m[39m

### `@inbounds`

In [10]:
function work6!(A, B, v, N)
    val = [-cos(2*mod(x,256)) for x in v]
    
    for j in 1:N
        for i in 1:N
            @inbounds A[i,j] = B[i,j] * val[i]
        end
    end
    return A
end

@btime work6!($A, $B, $v, $N)
@test work6!(A, B, v, N) ≈ A_result

  9.308 ms (2 allocations: 31.30 KiB)


[32m[1mTest Passed[22m[39m

### Preallocated buffer

In [11]:
function work7!(A, B, v, N; valbuffer)
    @assert length(v) == length(valbuffer)
    
    for i in eachindex(v)
        @inbounds valbuffer[i] = -cos(2*mod(v[i],256))
    end
    
    for j in 1:N
        for i in 1:N
            @inbounds A[i,j] = B[i,j] * valbuffer[i]
        end
    end
    return A
end

@btime work7!($A, $B, $v, $N; valbuffer) setup = (valbuffer = zeros(length(v)))
@test work7!(A, B, v, N; valbuffer=zeros(length(v))) ≈ A_result

  9.313 ms (0 allocations: 0 bytes)


[32m[1mTest Passed[22m[39m

## Bonus Question: Performance limit?

Look at your final optimized version of `work!`.

* What conceptually limits the performance: compute capability or memory transfer?
* Assuming that a single CPU core in Noctua 2 can achieve a **maximal memory bandwidth of ~45 GB/s**, can you give a performance bound estimate, i.e. the minimal runtime that we could possibly hope to achieve?
  * Hint: how many flops are performed per iteration and how many bytes are transferred?
* How far off is your implementation from achieving the limit (in percent)?

In [12]:
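membw = 45 # GB/s
flops = 1 # flops per iteration (one multiplication)
traffic = 3*8 # bytes per iteration (three Float64 values)
I = flops / traffic # arithmetic intensity in flops / byte

perf_bound = I*membw # GFLOP/s
# N^2 iterations with one flop each; the factor 1e3 converts seconds to ms
runtime_estimate = N^2 * 1e3 / (perf_bound * 1e9) # in ms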

println("Performance bound: ", round(perf_bound, digits=2), " GFLOP/s")
println("Runtime estimate: ", round(runtime_estimate, digits=2), " ms")

Performance bound: 1.88 GFLOP/s
Runtime estimate: 8.53 ms


In [13]:
t_work7 = @belapsed work7!($A, $B, $v, $N; valbuffer) setup = (valbuffer = zeros(length(v)))
ratio = runtime_estimate / (t_work7 * 1e3)
println("My best version achieves ", round(ratio * 100, digits=2), "% of the \"theoretical\" limit.")

My best version achieves 91.79% of the "theoretical" limit.
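
A complementary view of this comparison is the effective memory bandwidth that a measured runtime corresponds to. The sketch below assumes the same traffic model as the estimate above (24 bytes per iteration); `effective_bandwidth` is a hypothetical helper name, and the sample runtime is the `work7!` timing reported earlier, so the numbers will differ on other machines.

```julia
# Hypothetical helper: effective bandwidth in GB/s for a kernel that moves
# `bytes_per_iter` bytes in each of `n_iter` iterations within `t_seconds`.
effective_bandwidth(n_iter, bytes_per_iter, t_seconds) =
    n_iter * bytes_per_iter / t_seconds / 1e9

n = 4000        # matrix dimension used in this exercise
t = 9.313e-3    # seconds; the work7! timing shown above (machine dependent)
bw = effective_bandwidth(n^2, 3 * 8, t)

println("Effective bandwidth: ", round(bw, digits=2), " GB/s")
println("Fraction of 45 GB/s: ", round(100 * bw / 45, digits=1), " %")
```

An effective bandwidth close to the assumed maximum indicates that, under this traffic model, the kernel is limited by memory transfer rather than by arithmetic.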
