
Large slowdown for global memory segmented histogram on GPU #2024

Closed
nhey opened this issue Sep 29, 2023 · 3 comments

nhey commented Sep 29, 2023

I'm observing three orders-of-magnitude slowdowns for argmin-like segmented histograms on GPU when computed in global memory:

-- ==
-- input @ data/bins18K_updates1M_hists10.in
-- input @ data/bins20K_updates1M_hists10.in
-- input @ data/bins50K_updates1M_hists10.in

def argmin ((i, x): (i64, f32)) (j, y) =
  if x == y then
    if i < j && i != -1 then (i, x) else (j, y)
  else
    if x < y then (i, x) else (j, y)

def argmin_seghist [n][d] (bins: i64) (is: [n]i64) (xss: [d][n]f32) =
  let inds = iota n
  in map (\xss_col ->
           hist argmin (-1, f32.highest) bins is (zip inds xss_col)
         ) xss

def main [n][d] (bins: i64) (is: [n]i64) (xss: [d][n]f32) =
  argmin_seghist bins is xss
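
For readers unfamiliar with Futhark's `hist`, here is a sequential Python sketch of what the program above computes (a reference for the semantics only, not the compiler's GPU lowering):

```python
# Sequential reference for the argmin segmented histogram above.
# One histogram per row of xss; update k lands in bin is_[k].

def argmin(a, b):
    # Combine two (index, value) pairs: keep the smaller value,
    # breaking ties by the smaller valid (non -1) index.
    (i, x), (j, y) = a, b
    if x == y:
        return (i, x) if i < j and i != -1 else (j, y)
    return (i, x) if x < y else (j, y)

def argmin_seghist(bins, is_, xss):
    neutral = (-1, float("inf"))
    out = []
    for xs in xss:
        hist = [neutral] * bins
        for k, x in enumerate(xs):
            hist[is_[k]] = argmin(hist[is_[k]], (k, x))
        out.append(hist)
    return out
```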

Inspecting runs with different numbers of bins, using -D -P on the executable, shows that the global-memory version of the seghist kernel is the culprit. To generate data corresponding to 18, 20, and 50 thousand bins with uniformly distributed updates:

mkdir -p data
futhark dataset -b --i64-bounds=18000:18000 -g i64 --i64-bounds=0:17999 -g [1000000]i64 -g [10][1000000]f32 > data/bins18K_updates1M_hists10.in
futhark dataset -b --i64-bounds=20000:20000 -g i64 --i64-bounds=0:19999 -g [1000000]i64 -g [10][1000000]f32 > data/bins20K_updates1M_hists10.in
futhark dataset -b --i64-bounds=50000:50000 -g i64 --i64-bounds=0:49999 -g [1000000]i64 -g [10][1000000]f32 > data/bins50K_updates1M_hists10.in

The first two should illustrate the difference between shared memory and global memory on an A100. I get similar slowdowns on my desktop GPU.
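
As a rough back-of-envelope (my arithmetic, not the compiler's actual heuristic): each bin holds an (i64, f32) pair, so a single full histogram copy is around 211 KiB at 18K bins, which already exceeds the shared memory of one A100 SM (~164 KiB usable), so the compiler is presumably chunking histograms across passes, and the exact 18K/20K crossover comes from its internal tuning rather than a raw capacity limit:

```python
# Back-of-envelope footprint of one full histogram copy. This ignores
# padding, subhistogram replication, and multi-pass chunking, all of
# which the compiler's real heuristic accounts for.

PAIR_BYTES = 8 + 4  # i64 index + f32 value, assuming no padding

def histogram_bytes(bins):
    return bins * PAIR_BYTES

for bins in (18_000, 20_000, 50_000):
    kib = histogram_bytes(bins) / 1024
    print(f"{bins:>6} bins: {kib:7.1f} KiB per full histogram copy")
```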


nhey commented Sep 29, 2023

To be precise, the slowdown is between 18K bins (shared memory) and 20K bins (global memory).


athas commented Sep 29, 2023

While some slowdown is expected for this operator, as it requires a spinlock, three orders of magnitude is too much. I suspect the segmented case is incorrectly tuned in code generation.
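
For context, a hedged sketch of why locking is needed at all (a Python stand-in using per-bin locks, not the GPU spinlock code the compiler actually emits): the combined (i64, f32) pair is too wide to update with a single hardware atomic, so concurrent updates to a bin must be serialized around the whole read-compare-write:

```python
# Why this operator needs a lock: the (index, value) pair cannot be
# updated atomically as one unit, so the read-compare-write of a bin
# is a critical section. Per-bin threading.Lock here stands in for
# the per-bin GPU spinlock.

import threading

def make_hist(bins):
    neutral = (-1, float("inf"))
    return [neutral] * bins, [threading.Lock() for _ in range(bins)]

def update(hist, locks, bin_, pair):
    i, x = pair
    with locks[bin_]:  # guards the whole pair, not just one word
        j, y = hist[bin_]
        if x < y or (x == y and i != -1 and i < j):
            hist[bin_] = (i, x)
```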


athas commented Oct 6, 2023

I think the problem is that we are computing the wrong index for the lock to use, meaning lock contention will be extremely high.
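
To illustrate the effect (the indexing formulas below are invented for exposition, not the actual generated code): if the lock slot is derived from something like the segment index rather than the bin, every update within a segment fights over one lock, so a million updates serialize almost completely:

```python
# Hypothetical illustration of lock contention from a wrong lock index.
# Both formulas are made up; they only show how the choice of index
# changes how many distinct locks the updates spread over.

from collections import Counter

NUM_LOCKS = 1024

def good_lock(seg, bin_):
    # Derive the lock from the bin: updates spread over all locks.
    return bin_ % NUM_LOCKS

def bad_lock(seg, bin_):
    # Derive the lock from the segment only: all updates in a
    # segment contend on a single lock.
    return seg % NUM_LOCKS

updates = [(seg, b) for seg in range(10) for b in range(20_000)]

good = Counter(good_lock(s, b) for s, b in updates)
bad = Counter(bad_lock(s, b) for s, b in updates)

print("distinct locks used (good):", len(good))
print("distinct locks used (bad): ", len(bad))
```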

athas added a commit that referenced this issue Oct 6, 2023
athas closed this as completed in a329cbb Oct 6, 2023
nhey pushed a commit to nhey/futhark that referenced this issue Oct 24, 2023
(cherry picked from commit a329cbb)
nhey pushed a commit that referenced this issue Oct 25, 2023
(cherry picked from commit a329cbb)
CKuke pushed a commit to CKuke/futhark-seq that referenced this issue Nov 8, 2023