In [1]:
import Pkg; Pkg.activate(@__DIR__); Pkg.instantiate()

[32m[1m Activating[22m[39m environment at `~/projects/sciware-julia/Project.toml`


In this notebook, we'll look at SIMD and threading support in Julia.

In [2]:
A = rand(100_000)
A32 = rand(Float32, length(A)*2)

function simplesum(A)
    result = zero(eltype(A))
    for i in eachindex(A)
        @inbounds result += A[i]
    end
    return result
end

function simdsum(A)
    result = zero(eltype(A))
    @simd for i in eachindex(A)
        @inbounds result += A[i]
    end
    return result
end

simdsum (generic function with 1 method)

In [3]:
using BenchmarkTools

@btime sum($A)
@btime simplesum($A)
@btime simdsum($A)

@btime sum($A32)
@btime simplesum($A32)
@btime simdsum($A32)

  24.361 μs (0 allocations: 0 bytes)
  103.761 μs (0 allocations: 0 bytes)
  23.408 μs (0 allocations: 0 bytes)
  25.305 μs (0 allocations: 0 bytes)
  207.445 μs (0 allocations: 0 bytes)
  23.396 μs (0 allocations: 0 bytes)


100005.75f0

If `simdsum` is "faster", why not use it all the time?

In [4]:
simplesum(A), simdsum(A), sum(A)

(49993.1422985405, 49993.14229853963, 49993.142298539664)

In [5]:
simplesum(A32), simdsum(A32), sum(A32)

(100005.9f0, 100005.75f0, 100005.734f0)
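The results differ in their last digits because floating-point addition is not associative, and `@simd` gives the compiler permission to reorder the additions. A minimal sketch of the effect follows. `lanesum` is a hypothetical helper (not part of the notebook) that imitates a 4-wide vectorized reduction with four scalar accumulators; the real lane count depends on the CPU and the element type.

```julia
# Floating-point addition is not associative: grouping changes the rounding.
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false

# Hypothetical helper: emulate a 4-lane reduction with 4 scalar accumulators.
function lanesum(A::AbstractVector{T}) where {T}
    acc = (zero(T), zero(T), zero(T), zero(T))
    i = firstindex(A)
    last4 = i + 4 * div(length(A), 4) - 1   # last index covered by full groups of 4
    while i <= last4
        @inbounds acc = (acc[1] + A[i],     acc[2] + A[i + 1],
                         acc[3] + A[i + 2], acc[4] + A[i + 3])
        i += 4
    end
    # Combine the lanes, then add any leftover elements one by one.
    result = (acc[1] + acc[2]) + (acc[3] + acc[4])
    while i <= lastindex(A)
        @inbounds result += A[i]
        i += 1
    end
    return result
end
```

Comparing `lanesum(A32)` with `simplesum(A32)` should show the same kind of last-digit disagreement as above: the numbers are the same, only the order of the additions has changed.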

How can we see whether `@simd` is making good use of the CPU's vector instructions? One way would be a profiler, but we can also inspect the generated LLVM code directly from Julia:

In [6]:
@code_llvm simdsum(A32)


;  @ In[2]:13 within `simdsum'
define float @julia_simdsum_18007(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
;  @ In[2]:14 within `simdsum'
; ┌ @ simdloop.jl:69 within `macro expansion'
; │┌ @ abstractarray.jl:212 within `eachindex'
; ││┌ @ abstractarray.jl:95 within `axes1'
; │││┌ @ abstractarray.jl:75 within `axes'
; ││││┌ @ array.jl:155 within `size'
       %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
       %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_value_t addrspace(10)* addrspace(11)*
       %3 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %2, i64 3
       %4 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %3 to i64 addrspace(11)*
       %5 = load i64, i64 addrspace(11)* %4, align 8
; ││││└
; ││││┌ @ tuple.jl:157 within `map'
; │││││┌ @ range.jl:320 within `OneTo' @ range.jl:311
; ││││││┌ @ promotion.jl:409 within `max'
         %6 = icmp sgt i64 %5, 0
      

In [7]:
@code_llvm simplesum(A32)


;  @ In[2]:5 within `simplesum'
define float @julia_simplesum_18006(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
;  @ In[2]:6 within `simplesum'
; ┌ @ abstractarray.jl:212 within `eachindex'
; │┌ @ abstractarray.jl:95 within `axes1'
; ││┌ @ abstractarray.jl:75 within `axes'
; │││┌ @ array.jl:155 within `size'
      %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
      %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_value_t addrspace(10)* addrspace(11)*
      %3 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %2, i64 3
      %4 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %3 to i64 addrspace(11)*
      %5 = load i64, i64 addrspace(11)* %4, align 8
; │││└
; │││┌ @ tuple.jl:157 within `map'
; ││││┌ @ range.jl:320 within `OneTo' @ range.jl:311
; │││││┌ @ promotion.jl:409 within `max'
        %6 = icmp sgt i64 %5, 0
        %7 = select i1 %6, i64 %5, i64 0
; └└└└└└
  br i1 %6, 
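These listings are long, so it can help to search the IR for vector types instead of reading it line by line. The sketch below assumes that vectorized floating-point code shows up as types like `<8 x float>` or `<4 x double>` in the optimized IR; `has_vector_ops` and `simd_loop` are hypothetical names introduced only for this example.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL/notebook

# Same loop as `simdsum`, repeated so this cell runs on its own.
function simd_loop(A)
    result = zero(eltype(A))
    @simd for i in eachindex(A)
        @inbounds result += A[i]
    end
    return result
end

# Hypothetical helper: does the optimized LLVM IR for `f` called with
# argument types `types` mention a floating-point vector type?
function has_vector_ops(f, types)
    ir = sprint(io -> code_llvm(io, f, types; debuginfo = :none))
    return occursin(r"<\d+ x (float|double)>", ir)
end

has_vector_ops(simd_loop, (Vector{Float32},))
```

Whether the check returns `true` depends on the CPU and the Julia version, so treat it as a quick heuristic rather than proof that the loop was vectorized well.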

We can also speed things up with threads. We can control the number of threads available to BLAS from Julia, or use threads ourselves!

In [8]:
using LinearAlgebra

A = rand(10_000, 2_000)
B = rand(2_000, 5_000)

BLAS.set_num_threads(1)
@btime $A*$B
BLAS.set_num_threads(4)
@btime $A*$B
# can set the number of threads even higher, of course. You can also use the environment variable 

  6.601 s (2 allocations: 381.47 MiB)
  6.683 s (2 allocations: 381.47 MiB)


10000×5000 Array{Float64,2}:
 497.448  498.026  501.672  500.861  …  498.099  507.642  504.022  499.956
 496.237  502.958  495.704  499.077     501.392  508.053  508.013  507.704
 499.26   504.947  500.042  503.678     500.23   509.411  516.929  499.546
 495.878  488.105  501.375  498.34      497.019  507.706  508.838  504.539
 482.359  491.169  486.223  500.083     483.32   507.865  503.426  497.928
 485.812  491.639  488.498  497.64   …  488.161  503.603  493.172  488.898
 498.015  501.214  493.245  502.747     492.809  511.176  507.424  495.032
 488.426  487.061  491.559  506.458     495.321  507.32   501.842  490.934
 496.852  500.111  495.713  506.139     499.837  512.365  506.775  509.127
 490.68   495.533  494.734  501.85      486.864  503.62   504.554  495.647
 488.769  492.557  498.574  502.918  …  500.26   512.339  499.901  509.2
 484.818  488.811  484.216  491.078     480.318  502.574  495.859  482.713
 490.716  490.706  488.029  489.303     490.139  496.699  504.417  495.39
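Because `BLAS.set_num_threads` changes a global setting, it can be convenient to restore the previous value after an experiment. A minimal sketch, assuming Julia 1.6 or later (where `BLAS.get_num_threads` is available); `with_blas_threads` is a hypothetical helper, not a LinearAlgebra function.

```julia
using LinearAlgebra

# Hypothetical helper: run `f()` with `n` BLAS threads, then restore the old setting.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()   # assumption: Julia 1.6+
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)  # restored even if `f` throws
    end
end

X = rand(200, 200)
Y = rand(200, 200)
C1 = with_blas_threads(() -> X * Y, 1)   # multiply using a single BLAS thread
```

The matrices here are deliberately small so the cell runs quickly; any timing comparison should use sizes closer to the ones in the benchmark above.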

But that's kind of boring... let's have some fun with threads ourselves.

In [9]:
using .Threads
nthreads()

4

In [10]:
# a regular loop doesn't use threads by default
A = zeros(Int, nthreads())
for i in 1:nthreads()
    A[i] = threadid()
end
A

4-element Array{Int64,1}:
 1
 1
 1
 1

In [11]:
# we need to use the @threads macro
A = zeros(Int, nthreads())
@threads for i in 1:nthreads()
    A[i] = threadid()
end
A

4-element Array{Int64,1}:
 1
 2
 3
 4

In [12]:
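function threaded_sum1(A)
    r = zero(eltype(A))
    # NOTE: every thread reads and updates the shared `r` without
    # synchronization, so this loop has a data race.
    @threads for i in eachindex(A)
        @inbounds r += A[i]
    end
    return r
end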

A = rand(100_000)
@btime sum($A)
@btime threaded_sum1($A)
sum(A), threaded_sum1(A)

  24.330 μs (0 allocations: 0 bytes)
  16.575 ms (200024 allocations: 3.05 MiB)


(50160.45459802721, 50160.45459802758)

In [13]:
function threaded_sum2(A)
    r = Atomic{eltype(A)}(zero(eltype(A)))
    @threads for i in eachindex(A)
        @inbounds atomic_add!(r, A[i])
    end
    return r[]
end
@btime sum($A)
@btime threaded_sum2($A)
sum(A), threaded_sum2(A)

  24.358 μs (0 allocations: 0 bytes)
  14.378 ms (24 allocations: 2.92 KiB)


(50160.45459802721, 50160.45459802711)

In [14]:
function threaded_sum3(A)
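    # one accumulator slot per thread; this assumes an iteration never
    # changes threads while it runs (true for static scheduling)
    R = zeros(eltype(A), nthreads())
    @threads for i in eachindex(A)
        @inbounds R[threadid()] += A[i]
    end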
    r = zero(eltype(A))
    # sum the partial results from each thread
    for i in eachindex(R)
        @inbounds r += R[i]
    end
    return r
end

threaded_sum3(A)
@time threaded_sum3(A)

  0.061617 seconds (25 allocations: 3.031 KiB)


50160.45459802706
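All three versions above share state between threads: a plain variable, an atomic, or an array indexed by `threadid()`. Another common pattern splits the array into contiguous chunks and gives each task its own chunk, so nothing is shared until the partial sums are combined. A minimal sketch, assuming Julia 1.3 or later for `Threads.@spawn`; `chunked_sum` and its `ntasks` keyword are hypothetical names, not part of the notebook.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: sum `A` in up to `ntasks` contiguous chunks, one task per chunk.
function chunked_sum(A::AbstractVector; ntasks::Integer = nthreads())
    isempty(A) && return zero(eltype(A))
    chunk_len = cld(length(A), ntasks)
    # Each task sums its own view of the array; no state is shared between tasks.
    tasks = map(Iterators.partition(eachindex(A), chunk_len)) do idxs
        @spawn sum(@view A[idxs])
    end
    # Combine the per-chunk partial sums serially.
    return sum(fetch, tasks)
end

chunked_sum(rand(100_000))
```

Like the SIMD version, this changes the order of the additions, so the result can differ from `sum(A)` in the last few digits.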