In [1]:
using Random
using BenchmarkTools
using Statistics

# Multithreading

By multithreading we mean splitting the work of a function across several threads that run concurrently.

In [2]:
function my_sum(v)
    acc = zero(eltype(v))
    for i = 1:length(v), j = 1:i
        acc += v[j] * v[j]
    end
    acc
end

my_sum (generic function with 1 method)

In [3]:
Random.seed!(0); v = rand(1000);

In [4]:
@btime my_sum($v);

  445.945 μs (0 allocations: 0 bytes)


In [5]:
my_sum(v)

167658.7030407145

## Multithreaded sum

In [6]:
function my_multithreaded_sum(v; T = Threads.nthreads())
    acc = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        for i = t:T:length(v), j = 1:i
            s += v[j] * v[j]
        end
        acc[t] = s
    end
    return sum(acc)
end

my_multithreaded_sum (generic function with 1 method)

In [7]:
Threads.nthreads()

8

In [8]:
Random.seed!(0); v = rand(1000);

In [9]:
@btime my_multithreaded_sum($v; T=1);
@btime my_multithreaded_sum($v; T=2);
@btime my_multithreaded_sum($v; T=4);
@btime my_multithreaded_sum($v; T=8);

  458.323 μs (3 allocations: 160 bytes)
  229.589 μs (2 allocations: 144 bytes)
  115.216 μs (2 allocations: 160 bytes)
  58.365 μs (2 allocations: 192 bytes)
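
To read these timings as scaling figures, the speedup over the single-thread run and the parallel efficiency can be computed directly from the numbers printed above. This is only a small sketch; the variable names are illustrative and not part of the original notebook.

```julia
# Timings in μs copied from the @btime output above, for T = 1, 2, 4, 8.
threads = [1, 2, 4, 8]
times_us = [458.323, 229.589, 115.216, 58.365]

# Speedup relative to the T = 1 run, and efficiency = speedup / number of threads.
speedup = times_us[1] ./ times_us
efficiency = speedup ./ threads

for (T, s, e) in zip(threads, speedup, efficiency)
    println("T = $T: speedup ≈ $(round(s, digits = 2)), efficiency ≈ $(round(e, digits = 2))")
end
```

With these measurements the speedup at T = 8 is a little under 8, so this loop scales almost linearly with the number of threads.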


In [10]:
time_my_multithreaded_sum = @benchmark my_multithreaded_sum($v; T=8);

In [11]:
mean(time_my_multithreaded_sum.times)

64869.8966

## SIMD version

We can get comparable speed on a single thread by using SIMD instructions: the version below runs in about 40 μs, against about 58 μs for the 8-thread sum above.

In [13]:
function my_simd_sum(v)
    s = zero(eltype(v))
    for i = 1:length(v)
        @simd for j = 1:i
            @inbounds s += v[j] * v[j]
        end
    end
    return s
end

my_simd_sum (generic function with 1 method)

In [14]:
my_sum(v), my_simd_sum(v)

(167658.7030407145, 167658.7030406405)
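
The two results agree only up to the trailing digits. A likely reason is that `@simd` allows the compiler to reorder the additions in the reduction, and floating-point addition is not associative. A minimal illustration, independent of the functions above:

```julia
# Floating-point addition is not associative, so a different summation
# order can change the trailing digits of a result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

println(a == b)   # are the two groupings bit-identical?
println(a ≈ b)    # do they agree within floating-point tolerance?
```

For this reason, comparing such sums with `≈` (`isapprox`) is usually a safer check than `==`.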

In [15]:
@btime my_simd_sum($v);

  40.077 μs (0 allocations: 0 bytes)


## SIMD and multithreading

In [16]:
function my_multithreaded_sum_simd(v; T = Threads.nthreads())
    partial_result_per_thread = zeros(eltype(v), T)
    len = div(length(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        domain_per_thread = ((t-1)*len +1):t*len
        for i in domain_per_thread
            @simd for j in 1:i
                @inbounds s += v[j] * v[j]
            end
        end
        partial_result_per_thread[t] = s
    end
    return sum(partial_result_per_thread)
end

my_multithreaded_sum_simd (generic function with 1 method)

In [27]:
@btime my_multithreaded_sum_simd($v);

  9.181 μs (2 allocations: 192 bytes)


In [28]:
my_sum(v), my_simd_sum(v), my_multithreaded_sum_simd(v)

(167658.7030407145, 167658.7030406405, 167658.70304063277)

In [29]:
Random.seed!(0); v2 = rand(10000);
@time  my_simd_sum(v2);
@time  my_multithreaded_sum_simd(v2);

  0.004716 seconds (5 allocations: 176 bytes)
  0.002259 seconds (8 allocations: 384 bytes)


In [30]:
Random.seed!(0); v3 = rand(100000);
@time  my_simd_sum(v3);
@time  my_multithreaded_sum_simd(v3);

  0.581055 seconds (5 allocations: 176 bytes)

That looks alright to me, but consider interleaving the i indices so that each thread does about the same amount of work. Say you have 8 threads and a 1000-element vector. With your approach, thread 1 performs 125*126/2 = 7875 additions, while thread 8 performs (1000*1001 - 875*876)/2 = 117250 additions. Thread 1 (and 2, 3, …) will therefore finish long before thread 8 and then sit idle. By contrast, when the i indices are interleaved, all threads perform approximately the same number of additions (this is very problem-specific, though). On my system, this doubles the performance (also with SIMD).
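
The imbalance described above can be checked by counting the inner-loop additions each thread performs under the two partitioning schemes. This is a rough sketch: the helper `additions` and the variable names are illustrative, and it only counts work without running any threads.

```julia
# Number of inner-loop additions a thread performs when it handles the
# outer indices in `indices`: iteration i contributes i additions.
additions(indices) = sum(indices)

n, T = 1000, 8
len = div(n, T)

# Contiguous blocks, as in my_multithreaded_sum_simd.
block = [additions(((t - 1) * len + 1):(t * len)) for t in 1:T]

# Interleaved indices t, t + T, t + 2T, ...
interleaved = [additions(t:T:n) for t in 1:T]

println("block:       ", block)
println("interleaved: ", interleaved)
println("max/min block:       ", maximum(block) / minimum(block))
println("max/min interleaved: ", maximum(interleaved) / minimum(interleaved))
```

Both schemes perform the same total number of additions; only the distribution across threads differs, and the slowest thread determines the wall-clock time.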

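By the way, the implementation above also returns a wrong result if the vector length is not a multiple of the number of threads, because div(length(v), T) rounds down and the trailing elements (here, all of them) are never visited, e.g.:

julia> my_multithreaded_sum_simd([1 2 3])
0

To fix that, you could do something like this: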

n = length(v)
domain_per_thread = 1+((t-1)*n÷T):t*n÷T
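
Put into a complete function, the suggested boundaries could look like the sketch below. The name `my_multithreaded_sum_simd_fixed` is hypothetical and not part of the original notebook; the expression relies on `*` and `÷` having the same precedence and being evaluated left to right, so `t * n ÷ T` means `(t * n) ÷ T`.

```julia
# Hypothetical variant of my_multithreaded_sum_simd with the corrected
# block boundaries; the name is illustrative only.
function my_multithreaded_sum_simd_fixed(v; T = Threads.nthreads())
    n = length(v)
    partial = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        # Boundaries are computed from n directly, so the blocks cover 1:n
        # even when n is not a multiple of T (some blocks may be empty).
        domain_per_thread = (1 + (t - 1) * n ÷ T):(t * n ÷ T)
        for i in domain_per_thread
            @simd for j in 1:i
                @inbounds s += v[j] * v[j]
            end
        end
        partial[t] = s
    end
    return sum(partial)
end

println(my_multithreaded_sum_simd_fixed([1, 2, 3]))
println(my_multithreaded_sum_simd_fixed([1, 2, 3, 10]))
```

This keeps the contiguous block partition, so it fixes the correctness problem but not the load imbalance between threads.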

In [31]:
# If the number of elements is not divisible by the number of threads, this function returns a wrong result
my_multithreaded_sum_simd([1,2,3])

0

In [32]:
function my_multithreaded_sum_simd2(v; T = Threads.nthreads())
    acc = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        for i = t:T:length(v) # this is the "interleaving"
            @simd for j = 1:i
                @inbounds s += v[j] * v[j]
            end
        end
        acc[t] = s
    end
    return sum(acc)
end

my_multithreaded_sum_simd2 (generic function with 1 method)

In [33]:
@btime my_multithreaded_sum_simd($v);
@btime my_multithreaded_sum_simd2($v);

  9.131 μs (2 allocations: 192 bytes)
  7.260 μs (3 allocations: 208 bytes)


In [34]:
@btime my_multithreaded_sum_simd($v2);
@btime my_multithreaded_sum_simd2($v2);

  857.202 μs (2 allocations: 192 bytes)
  645.687 μs (2 allocations: 192 bytes)


In [39]:
@btime my_multithreaded_sum_simd($v3);
@btime my_multithreaded_sum_simd2($v3);

  158.177 ms (2 allocations: 192 bytes)
  95.655 ms (2 allocations: 192 bytes)


In [40]:
my_multithreaded_sum_simd2([1,2,3,10])

134

In [41]:
my_sum(v), my_simd_sum(v), my_multithreaded_sum_simd(v), my_multithreaded_sum_simd2(v)

(167658.7030407145, 167658.7030406405, 167658.70304063277, 167658.703040633)