# Exercise: Performance Optimization

Optimize the following function.

In [1]:
function work!(A)
    val = zero(eltype(v))
    for i in 1:N
        val = mod(v[i],256)
        A[i,1:N] = B[i,1:N] * (sin(val) * sin(val) - cos(val) * cos(val))
    end
    A = A/2
    return A
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [2]:
# do not modify this cell!

using Random
Random.seed!(42)

N = 250
B = rand(N,N)
v = rand(Int, N);

const result = work!(zeros(N,N));

# do not modify this cell!

You can compare against `result` to test your implementation(s):

In [3]:
using Test

@test work!(zeros(N,N)) ≈ result

[32m[1mTest Passed[22m[39m

You can benchmark as follows:

In [4]:
using BenchmarkTools

@btime work!(A) setup=(A=zeros(N,N)); # or use @benchmark for more information

  549.768 μs (3253 allocations: 1.54 MiB)


## Your Optimizations

Your optimized variants go here!

<details>

<summary><b>If you want some hints, click here!</b></summary>

* Try to look for type instabilities (with `@code_warntype`).
* Try to avoid unnecessary allocations (due to slicing and allocating array computations).
* Try to optimize the memory access (keyword: column-major order).
* Bonus: Simplify the algebra by using a trigonometric identity.

</details>

### Fixing the type instability (accessing the globals `N`, `B`, and `v`)

In [5]:
@code_warntype work!(zeros(N,N))

MethodInstance for work!(::Matrix{Float64})
  from work!(A) @ Main In[1]:1
Arguments
  #self#::Core.Const(work!)
  A@_2::Matrix{Float64}
Locals
  @_3::Any
  val::Any
  i::Any
  A@_6::Matrix{Float64}
Body::Matrix{Float64}
1 ─       (A@_6 = A@_2)
│   %2  = Main.eltype(Main.v)::Any
│         (val = Main.zero(%2))
│   %4  = (1:Main.N)::Any
│         (@_3 = Base.iterate(%4))
│   %6  = (@_3 === nothing)::Bool
│   %7  = Base.not_int(%6)::Bool
└──       goto #4 if not %7
2 ┄ %9  = @_3::Any
│         (i = Core.getfield(%9, 1))
│   %11 = Core.getfield(%9, 2)::Any
│   %12 = Base.getindex(Main.v, i)::Any
│

Get `N` from the size of `A` (or `B`), or pass it as an additional function argument. `B` and `v` are non-constant globals as well, so pass them as arguments too.

In [6]:
function work1!(A, B, v)
    N = size(A,1) # or additional function argument
    val = zero(eltype(v))
    for i in 1:N
        val = mod(v[i],256)
        A[i,1:N] = B[i,1:N] * (sin(val) * sin(val) - cos(val) * cos(val))
    end
    A = A/2
    return A
end

@btime work1!(A, $B, $v) setup=(A=zeros(N,N))
@test work1!(zeros(N,N), B, v) ≈ result

  378.110 μs (502 allocations: 1.48 MiB)


[32m[1mTest Passed[22m[39m

### Analytic optimization (style points 😉)

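By the double-angle formula `cos(2x) = cos(x)^2 - sin(x)^2`, the factor `sin(val) * sin(val) - cos(val) * cos(val)` equals `-cos(2*val)`, so four trigonometric evaluations per iteration reduce to one: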

In [7]:
x = rand()
@test sin(x) * sin(x) - cos(x) * cos(x) ≈ -cos(2*x)

[32m[1mTest Passed[22m[39m

In [8]:
function work2!(A, B, v)
    N = size(A,1)
    val = zero(eltype(v))
    for i in 1:N
        val = -cos(2*mod(v[i],256))
        A[i,1:N] = B[i,1:N] * val
    end
    A = A/2
    return A
end

@btime work2!(A, $B, $v) setup=(A=zeros(N,N))
@test work2!(zeros(N,N), B, v) ≈ result

  378.445 μs (502 allocations: 1.48 MiB)


[32m[1mTest Passed[22m[39m

### Avoid allocations due to slicing and allocating array arithmetic

In [9]:
function work3_vectorized!(A, B, v)
    N = size(A,1)
    val = zero(eltype(v))
    for i in 1:N
        val = -cos(2*mod(v[i],256))*0.5 # moved the rescaling by 0.5 here
        @views A[i,1:N] .= B[i,1:N] .* val # using @views and broadcasting here
    end
    return A
end

@btime work3_vectorized!(A, $B, $v) setup=(A=zeros(N,N))
@test work3_vectorized!(zeros(N,N), B, v) ≈ result

  99.691 μs (0 allocations: 0 bytes)


[32m[1mTest Passed[22m[39m

The same idea with an explicit loop:

In [10]:
function work3_loop!(A, B, v)
    N = size(A,1)
    val = zero(eltype(v))
    for i in 1:N
        val = -cos(2*mod(v[i],256))*0.5
        for j in 1:N
            @inbounds A[i,j] = B[i,j] * val
        end
    end
    return A
end

@btime work3_loop!(A, $B, $v) setup=(A=zeros(N,N))
@test work3_loop!(zeros(N,N), B, v) ≈ result

  89.115 μs (0 allocations: 0 bytes)


[32m[1mTest Passed[22m[39m

### Separating `val` computation

In [11]:
function work4_vectorized!(A, B, v)
    N = size(A,1)
    val = @. -cos(2*mod(v,256))*0.5
    
    for i in 1:N
        @views A[i,1:N] .= B[i,1:N] .* val[i]
    end
    return A
end

@btime work4_vectorized!(A, $B, $v) setup=(A=zeros(N,N))
@test work4_vectorized!(zeros(N,N), B, v) ≈ result

  93.329 μs (1 allocation: 2.06 KiB)


[32m[1mTest Passed[22m[39m

The same idea with an explicit loop:

In [12]:
function work4_loop!(A, B, v)
    N = size(A,1)
    val = @. -cos(2*mod(v,256))*0.5
    
    for i in 1:N
        for j in 1:N
            @inbounds A[i,j] = B[i,j] * val[i]
        end
    end
    return A
end

@btime work4_loop!(A, $B, $v) setup=(A=zeros(N,N))
@test work4_loop!(zeros(N,N), B, v) ≈ result

  84.571 μs (1 allocation: 2.06 KiB)


[32m[1mTest Passed[22m[39m

### Switch loop order (!!!)

In [13]:
function work5!(A, B, v)
    N = size(A,1)
    val = @. -cos(2*mod(v,256))*0.5
    
    for j in 1:N
        for i in 1:N
            @inbounds A[i,j] = B[i,j] * val[i]
        end
    end
    return A
end

@btime work5!(A, $B, $v) setup=(A=zeros(N,N))
@test work5!(zeros(N,N), B, v) ≈ result

  32.790 μs (1 allocation: 2.06 KiB)


[32m[1mTest Passed[22m[39m

The same computation with broadcasting:

In [14]:
function work5_broadcasting!(A, B, v)
    val = @. -cos(2*mod(v,256))*0.5
    @. A = B * val
    return A
end

@btime work5_broadcasting!(A, $B, $v) setup=(A=zeros(N,N))
@test work5_broadcasting!(zeros(N, N), B, v) ≈ result

  32.722 μs (1 allocation: 2.06 KiB)


[32m[1mTest Passed[22m[39m

## Bonus Question: Performance limit?

Look at your final optimized version of `work!`.

* In the limit of larger `A` and `B`, what conceptually limits the performance: the compute capability or the memory transfer (i.e. reading `B` and writing `A`)?
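
One way to approach this question is a back-of-the-envelope estimate of the arithmetic intensity. The sketch below assumes that `B` is read once and `A` is written once, ignores the cost of computing `val`, and plugs in the `work5!` timing reported above, which is machine-specific. At `N = 250` both matrices are only about 0.5 MB each and may well stay in cache, so the resulting data rate is not a main-memory bandwidth measurement.

```julia
N = 250        # matrix size from the exercise
t = 32.7e-6    # seconds, roughly the work5! timing reported above (machine-specific)

bytes_moved = 2 * N^2 * sizeof(Float64)   # assumption: read B once, write A once
flops = N^2                               # one multiplication per element

intensity = flops / bytes_moved      # 0.0625 FLOP per byte
data_rate = bytes_moved / t / 1e9    # ≈ 30.6 GB/s
gflops = flops / t / 1e9             # ≈ 1.9 GFLOP/s
```

One floating-point operation per 16 bytes of traffic is a very low intensity, which suggests that for large matrices the kernel is limited by memory transfer rather than by compute capability.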