# Exercise: Performance Optimization 1

Optimize the following code, that is, try to reduce the runtime as well as the number of allocations as much as you can.

In [1]:
function work!(A, b, c)
    D = zeros(N,N)
    for i in 1:N
        D = b[i]*c*A
        b[i] = sum(D)
    end
    return b
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [2]:
using Random
Random.seed!(42)

N = 1000
A = rand(N,N)
b = rand(N)
c = 1.23

const b_result = work!(A, b, c);

You can compare against `b_result` to test your implementation(s):

In [3]:
using Test 

@test work!(A, b, c) ≈ b_result

Test Passed

You can benchmark as follows:

In [17]:
using BenchmarkTools

@benchmark work!($A, $b, $c)

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  868.974 ms … 873.410 ms  ┊ GC (min … max): 7.22% … 7.29%
 Time  (median):     870.834 ms               ┊ GC (median):    7.28%
 Time  (mean ± σ):   871.231 ms ±   1.752 ms  ┊ GC (mean ± σ):  7.28% ± 0.05%
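
Because the benchmarked function overwrites `b`, each timed call starts from the values left behind by the previous one. If a fresh input per sample is preferred, the `setup` keyword of BenchmarkTools combined with `evals=1` is one way to get it. The sketch below uses a stand-in kernel `scale!` and an input `v0`, both hypothetical names, so that it runs on its own:

```julia
using BenchmarkTools

# Stand-in for a mutating kernel such as `work!` (hypothetical name).
function scale!(v, s)
    v .*= s
    return v
end

v0 = rand(1000)

# `setup` runs before every sample, and `evals=1` limits each sample to a
# single call, so every timed call starts from a fresh copy of `v0`.
trial = @benchmark scale!(v, 1.23) setup=(v = copy($v0)) evals=1
```

The copy happens in the setup phase and is therefore not part of the measured time.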

## Your Optimizations

Your optimized variants go here!

**Hints** (hopefully):
* Is the function self-contained?
* Is it efficient with respect to allocations?
* A speedup of roughly 10,000× and 0 allocations are possible on Noctua 2 😉

### Avoiding globals

In [5]:
# @code_warntype work!(A, b, c)

In [18]:
function work1!(A, N, b, c) # N is now a function argument
    D = zeros(N,N)
    for i in 1:N
        D = b[i] * c * A
        b[i] = sum(D)
    end
    return b
end

@test work1!(A, N, b, c) ≈ b_result

Test Passed

In [19]:
@benchmark work1!($A, $N, $b, $c)

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  830.030 ms … 840.502 ms  ┊ GC (min … max): 6.52% … 6.77%
 Time  (median):     831.774 ms               ┊ GC (median):    6.53%
 Time  (mean ± σ):   833.902 ms ±   4.085 ms  ┊ GC (mean ± σ):  6.62% ± 0.19%

In [7]:
# @code_warntype work1!(A, N, b, c)
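
The reason the extra argument helps: inside `work!`, `N` is a non-constant global whose value and type could change at any time, so the compiler cannot specialize the loop on it. Passing `N` in is one fix. Another possibility, sketched here with a hypothetical variant `work1b!`, is to take the loop range from `b` itself, so the function depends only on its arguments:

```julia
# Hypothetical variant of `work1!`: the loop range comes from `b`,
# so no global and no extra size argument is needed.
function work1b!(A, b, c)
    for i in eachindex(b)
        D = b[i] * c * A    # still allocates a temporary matrix per iteration
        b[i] = sum(D)
    end
    return b
end

# Small usage example with made-up data.
A_demo = ones(4, 4)
b_demo = [1.0, 2.0]
work1b!(A_demo, b_demo, 0.5)   # each b[i] becomes b[i] * 0.5 * 16
```

Declaring the global as `const` would be a third option, although the exercise asks for the given data to stay as it is.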

### Avoiding (some) temporary allocations

In [20]:
function work2!(A, N, b, c)
    D = zeros(N,N)
    for i in 1:N
        @. D = b[i] * c * A
        b[i] = sum(D)
    end
    return b
end

@test work2!(A, N, b, c) ≈ b_result

Test Passed

In [21]:
@benchmark work2!($A, $N, $b, $c)

BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  279.755 ms … 290.253 ms  ┊ GC (min … max): 0.00% … 0.07%
 Time  (median):     280.543 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   281.689 ms ±   2.792 ms  ┊ GC (mean ± σ):  0.01% ± 0.03%
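
The difference between `work1!` and `work2!` is a single line: `D = b[i] * c * A` builds a new matrix and rebinds the name `D` to it, whereas `@. D = b[i] * c * A` turns the operations and the assignment into one fused broadcast that writes into the existing `D`. A rough way to see this is to measure allocations with `@allocated`. The helper names `rebind` and `inplace!` below are assumptions made for this sketch:

```julia
# Allocating version: the right-hand side creates a fresh matrix.
rebind(A, s) = s * A

# In-place version: `@.` expands the line to `D .= s .* A`, reusing D's memory.
function inplace!(D, A, s)
    @. D = s * A
    return D
end

A_demo = rand(200, 200)
D_demo = zeros(200, 200)

# Call both once first so that compilation is not counted.
rebind(A_demo, 2.0)
inplace!(D_demo, A_demo, 2.0)

bytes_rebind  = @allocated rebind(A_demo, 2.0)
bytes_inplace = @allocated inplace!(D_demo, A_demo, 2.0)
println("rebind: ", bytes_rebind, " bytes, inplace!: ", bytes_inplace, " bytes")
```

The first number should be at least the size of one 200×200 `Float64` matrix (320,000 bytes), while the second should be zero or close to it, depending on the Julia version and on how the macro is invoked.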

Alternatively, with an explicit loop instead of broadcasting:

In [22]:
function work3!(A, N, b, c)
    D = zeros(N,N)
    for i in 1:N
        for j in eachindex(D, A) # indices shared by D and A, so @inbounds is safe
            @inbounds D[j] = b[i] * c * A[j]
        end
        b[i] = sum(D)
    end
    return b
end

@test work3!(A, N, b, c) ≈ b_result

Test Passed

In [23]:
@benchmark work3!($A, $N, $b, $c)

BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  279.156 ms … 290.298 ms  ┊ GC (min … max): 0.00% … 0.06%
 Time  (median):     280.244 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   281.485 ms ±   3.033 ms  ┊ GC (mean ± σ):  0.01% ± 0.03%

### Preallocating `D`

In [24]:
function work4!(A, N, b, c, D)
    for i in 1:N
        @. D = b[i] * c * A
        b[i] = sum(D)
    end
    return b
end

D = zeros(N,N)

@test work4!(A, N, b, c, D) ≈ b_result

Test Passed

In [25]:
@benchmark work4!($A, $N, $b, $c, $D)

BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  279.221 ms … 283.481 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     279.525 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   279.932 ms ±   1.068 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
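
With `D` preallocated the remaining allocation is gone, yet the runtime is essentially the same as for `work2!`, because every iteration still writes a full `N`×`N` matrix and then reads it back to sum it. A possible intermediate step, sketched here under the hypothetical name `work4b!`, is to drop the buffer and let `sum` apply the scaling element by element:

```julia
# Hypothetical buffer-free variant: `sum(f, A)` applies `f` to each element
# while accumulating, so no N×N temporary is written.
function work4b!(A, b, c)
    for i in eachindex(b)
        s = b[i] * c                  # scalar factor for this iteration
        b[i] = sum(x -> s * x, A)
    end
    return b
end

# Small usage example with made-up data.
A_demo = fill(2.0, 3, 3)
b_demo = [1.0, 10.0]
work4b!(A_demo, b_demo, 0.5)          # 9 elements, each scaled by b[i] * 0.5
```

This variant still traverses `A` once per element of `b`, which is the part the next section removes.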

### Realizing that one can factor out `b` and `c`

In [26]:
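# Step 1: starting point, identical to work4!
# function work5!(A, N, b, c, D)
#     for i in 1:N
#         @. D = b[i] * c * A
#         b[i] = sum(D)
#     end
#     return b
# end

# Step 2: D is only ever summed, so sum the expression directly
# function work5!(A, N, b, c)
#     for i in 1:N
#         b[i] = sum(b[i] * c * A)
#     end
#     return b
# end

# Step 3: b[i] and c are scalars, so pull them out of the sum
# function work5!(A, N, b, c)
#     for i in 1:N
#         b[i] = b[i] * c * sum(A)
#     end
#     return b
# end

# Step 4: sum(A) is the same in every iteration, so compute it once
# and scale b in place (N is unused but kept for a uniform signature)
function work5!(A, N, b, c)
    b .*= c * sum(A)
    return b
end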

@test work5!(A, N, b, c) ≈ b_result

Test Passed

In [27]:
@benchmark work5!($A, $N, $b, $c)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  98.350 μs … 306.011 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     99.471 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   99.789 μs ±   3.401 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%