# Exercise: SIMD Data Dependency

Consider the following loop involving four vectors `a`, `b`, `c`, and `d`:

In [1]:
const LOOP_ITERATIONS = 8192
const N = LOOP_ITERATIONS + 2

"naive loop"
function loop_naive!(a, b, c, d)
    @inbounds for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
end

a = rand(Float32, N)
b = rand(Float32, N)
c = rand(Float32, N)
d = rand(Float32, N)

loop_naive!(a,b,c,d)

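This loop is hard to auto-vectorize because it has a loop-carried **data dependency** on the vector `b`: iteration `i` writes `b[i+2]`, and iteration `i+2` reads that same element. The compiler therefore cannot simply process more than two consecutive iterations at once without changing the result.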
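To make the dependency on `b` concrete, here is a minimal sketch that emulates a vectorized loop by loading a whole chunk of `b` before storing anything. The names `sequential_ref!`, `chunked_emulation!`, and `make_inputs`, as well as the small input size, are assumptions made for illustration and are not part of the exercise.

```julia
# Sequential reference: the loop from above, restricted to n iterations.
function sequential_ref!(a, b, c, d, n)
    for i in 1:n
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
    return a
end

# Hypothetical emulation of a W-wide vectorized loop: within each chunk,
# all loads of `b` happen before any store to `b`.
function chunked_emulation!(a, b, c, d, n, W)
    for start in 1:W:n
        idx = start:min(start + W - 1, n)
        b_loaded = b[idx]                  # "vector load" (copies the chunk)
        a[idx] .= a[idx] .+ b_loaded
        b[idx .+ 2] .= c[idx] .+ d[idx]    # "vector store" two elements ahead
    end
    return a
end

# Small deterministic inputs (n + 2 elements each, as in the exercise).
n = 8
make_inputs() = (zeros(Float32, n + 2), ones(Float32, n + 2),
                 fill(10f0, n + 2), fill(20f0, n + 2))

reference = sequential_ref!(make_inputs()..., n)
width2    = chunked_emulation!(make_inputs()..., n, 2)
width4    = chunked_emulation!(make_inputs()..., n, 4)

println("width 2 matches the sequential loop: ", width2 == reference)
println("width 4 matches the sequential loop: ", width4 == reference)
```

With a chunk width of 2, every chunk reads only elements of `b` that earlier chunks have already written, so the result matches the sequential loop. With a width of 4, the third and fourth lanes of a chunk read stale values of `b`, and the result differs.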

**Task 1**: Check the native code produced for `loop_naive!(a,b,c,d)` and convince yourself that the Julia compiler hasn't vectorized this code. (There shouldn't be any usage of `ymm` or `zmm` registers etc.)
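Scanning a long assembly listing by eye is error-prone. As a rough aid, the following sketch captures the output of `code_native` in a string and searches it for wide vector register names. The helper `uses_wide_registers` and the stand-in function `add_one!` are hypothetical names introduced for this example.

```julia
using InteractiveUtils  # provides code_native (loaded automatically in the REPL and in notebooks)

# Hypothetical helper: returns true if the native code generated for f(types...)
# mentions 256-bit (ymm) or 512-bit (zmm) vector registers.
function uses_wide_registers(f, types)
    asm = sprint(io -> code_native(io, f, types; debuginfo=:none, syntax=:intel))
    return occursin("ymm", asm) || occursin("zmm", asm)
end

# Illustrative stand-in for a function to inspect.
function add_one!(x)
    @inbounds for i in eachindex(x)
        x[i] += 1f0
    end
    return x
end

println(uses_wide_registers(add_one!, Tuple{Vector{Float32}}))
```

In the notebook, the same helper could be called as `uses_wide_registers(loop_naive!, typeof.((a, b, c, d)))`. The result is system-dependent, and the check is only a heuristic: it follows the hint above and does not detect 128-bit vectorization, which uses the same `xmm` registers as scalar floating-point code.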

In [2]:
@code_native debuginfo=:none syntax=:intel loop_naive!(a,b,c,d)

	.text
	.file	"loop_naive!"
	.globl	"japi1_loop_naive!_763"         # -- Begin function japi1_loop_naive!_763
	.p2align	4, 0x90
	.type	"japi1_loop_naive!_763",@function
"japi1_loop_naive!_763":                # @"japi1_loop_naive!_763"
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	qword ptr [rbp - 56], rsi
	mov	rcx, qword ptr [rsi + 8]
	mov	rax, qword ptr [rsi]
	mov	rdx,


**Task 2**: Implement the same loop in `loop_naive_simd!` and try to force SIMD-vectorization with the corresponding performance macro. (Keep the `@inbounds` as well.)

In [9]:
"naive loop + try force SIMD"
function loop_naive_simd!(a, b, c, d)
    #
    # TODO
    #
    @inbounds @simd for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
end

loop_naive_simd!

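**Task 3**: Check the native code of `loop_naive_simd!`. Has the code improved? The lesson here is that just putting `@simd` in front of a loop and hoping for the best isn't a particularly good strategy 😉 By default, `@simd` does not assert that a loop is free of loop-carried memory dependencies, so the compiler still has to respect the dependency on `b`.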

In [10]:
@code_native debuginfo=:none syntax=:intel loop_naive_simd!(a,b,c,d)

	.text
	.file	"loop_naive_simd!"
	.globl	"japi1_loop_naive_simd!_1038"   # -- Begin function japi1_loop_naive_simd!_1038
	.p2align	4, 0x90
	.type	"japi1_loop_naive_simd!_1038",@function
"japi1_loop_naive_simd!_1038":          # @"japi1_loop_naive_simd!_1038"
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	qword ptr [rbp - 56], rsi
	mov	rcx, qword ptr [rsi + 8]
	mov	rax, qword ptr [rsi]
	mov

**Task 4**: Benchmark and compare the variants. What do you observe?


In [5]:
using BenchmarkTools

walltime = @belapsed loop_naive!($a, $b, $c, $d) samples = 5 evals = 3
println("the naive loop: ", round(walltime * 1e6; digits=2), " μs")
walltime = @belapsed loop_naive_simd!($a, $b, $c, $d) samples = 5 evals = 3
println("the naive loop + `@simd`: ", round(walltime * 1e6; digits=2), " μs")

the naive loop: 2.62 μs
the naive loop + `@simd`: 2.63 μs



**Task 5**: Take a closer look at the loop. Can you "resolve" the data-dependency issue by splitting up the loop into two separate loops? Implement this improved version in the functions below. Use `@simd` for the loops in the second function. (Again, keep `@inbounds` for all loops in both functions.)

In [6]:
"optimized loop"
function loop_opt!(a, b, c, d)
    #
    # TODO
    #
    @inbounds for i in 1:LOOP_ITERATIONS
        b[i+2] = c[i] + d[i]
    end
    @inbounds for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
    end
end

"optimized loop + `@simd`"
function loop_opt_simd!(a, b, c, d)
    #
    # TODO
    #
    @inbounds @simd for i in 1:LOOP_ITERATIONS
        b[i+2] = c[i] + d[i]
    end
    @inbounds @simd for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
    end
end

loop_opt_simd!
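Splitting the loop is only legitimate if it leaves the results unchanged. In the original loop, `a[i]` reads the original `b[i]` for `i = 1, 2` and the already updated `b[i]` for `i ≥ 3`. Running the `b` loop first reproduces exactly that, because it never touches `b[1]` or `b[2]`. Running the `a` loop first would not. The following sketch checks the equivalence on random data. The function names `fused_check!` and `split_check!` and the problem size are assumptions chosen for illustration.

```julia
# Fused loop with the same body as `loop_naive!`, for n iterations.
function fused_check!(a, b, c, d, n)
    @inbounds for i in 1:n
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
end

# Split version with the same structure as `loop_opt!`.
function split_check!(a, b, c, d, n)
    @inbounds for i in 1:n
        b[i+2] = c[i] + d[i]
    end
    @inbounds for i in 1:n
        a[i] = a[i] + b[i]
    end
end

n = 1000
a0, b0, c0, d0 = (rand(Float32, n + 2) for _ in 1:4)

# Run both variants on independent copies of the same inputs.
a1, b1 = copy(a0), copy(b0)
fused_check!(a1, b1, c0, d0, n)

a2, b2 = copy(a0), copy(b0)
split_check!(a2, b2, c0, d0, n)

println("a identical: ", a1 == a2)
println("b identical: ", b1 == b2)
```

Both variants perform the same floating-point additions on the same operands for every element, so the comparison can use exact equality rather than a tolerance.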

**Task 6**: Benchmark those new variants as well.
  * How do they compare to each other?
  * Did the SIMD performance macro help? (Hint: It shouldn't.)
  * How does the performance compare to the unoptimized variants above?

In [7]:
walltime = @belapsed loop_opt!($a, $b, $c, $d) samples = 5 evals = 3
println("the optimized loop: ", round(walltime * 1e6; digits=2), " μs")
walltime = @belapsed loop_opt_simd!($a, $b, $c, $d) samples = 5 evals = 3
println("the optimized loop + `@simd`: ", round(walltime * 1e6; digits=2), " μs")

the optimized loop: 1.67 μs
the optimized loop + `@simd`: 1.66 μs



**Task 7**: Check the native code of, e.g., `loop_opt_simd!`. Did it vectorize properly? (Look for `ymm` or `zmm` registers as well as a block of `vaddps` instructions. Note, though, that the details are system-dependent.)

In [8]:
@code_native debuginfo=:none syntax=:intel loop_opt_simd!(a, b, c, d)

	.text
	.file	"loop_opt_simd!"
	.globl	"japi1_loop_opt_simd!_1036"     # -- Begin function japi1_loop_opt_simd!_1036
	.p2align	4, 0x90
	.type	"japi1_loop_opt_simd!_1036",@function
"japi1_loop_opt_simd!_1036":            # @"japi1_loop_opt_simd!_1036"
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	mov	qword ptr [rbp - 8], rsi
	mov	rax, qword ptr [rsi + 8]
	mov	rcx, qword ptr [rsi + 16]
	mov	r8, qword ptr [rsi]
	mov	rsi, qword ptr [