
Large slowdown for global memory segmented histogram on GPU #2024

Closed
nhey opened this issue Sep 29, 2023 · 3 comments

nhey commented Sep 29, 2023

I'm observing three orders-of-magnitude slowdowns for argmin-like segmented histograms on GPU when computed in global memory:

-- ==
-- input @ data/bins18K_updates1M_hists10.in
-- input @ data/bins20K_updates1M_hists10.in
-- input @ data/bins50K_updates1M_hists10.in

def argmin ((i, x): (i64, f32)) (j, y) =
  if x == y then
    if i < j && i != -1 then (i, x) else (j, y)
  else
    if x < y then (i, x) else (j, y)

def argmin_seghist [n][d] (bins: i64) (is: [n]i64) (xss: [d][n]f32) =
  let inds = iota n
  in map (\xss_col ->
           hist argmin (-1, f32.highest) bins is (zip inds xss_col)
         ) xss

def main [n][d] (bins: i64) (is: [n]i64) (xss: [d][n]f32) =
  argmin_seghist bins is xss
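
For readers unfamiliar with Futhark's `hist`, here is a sequential Python sketch of what the program above computes (a reference for the semantics only, not the compiler's GPU lowering):

```python
# Sequential reference for the argmin segmented histogram above.
# One histogram per row of xss; update k lands in bin is_[k].

def argmin(a, b):
    # Combine two (index, value) pairs: keep the smaller value,
    # breaking ties by the smaller valid (non -1) index.
    (i, x), (j, y) = a, b
    if x == y:
        return (i, x) if i < j and i != -1 else (j, y)
    return (i, x) if x < y else (j, y)

def argmin_seghist(bins, is_, xss):
    neutral = (-1, float("inf"))
    out = []
    for xs in xss:
        hist = [neutral] * bins
        for k, x in enumerate(xs):
            hist[is_[k]] = argmin(hist[is_[k]], (k, x))
        out.append(hist)
    return out
```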

Inspecting runs with different numbers of bins, using -D -P on the executable, shows that the global-memory version of the seghist kernel is the culprit. To generate data corresponding to 18, 20, and 50 thousand bins with uniformly distributed updates:

mkdir -p data
futhark dataset -b --i64-bounds=18000:18000 -g i64 --i64-bounds=0:17999 -g [1000000]i64 -g [10][1000000]f32 > data/bins18K_updates1M_hists10.in
futhark dataset -b --i64-bounds=20000:20000 -g i64 --i64-bounds=0:19999 -g [1000000]i64 -g [10][1000000]f32 > data/bins20K_updates1M_hists10.in
futhark dataset -b --i64-bounds=50000:50000 -g i64 --i64-bounds=0:49999 -g [1000000]i64 -g [10][1000000]f32 > data/bins50K_updates1M_hists10.in

The first two should illustrate the difference between shared memory and global memory on an A100. I get similar slowdowns on my desktop GPU.
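
As a rough back-of-envelope (my arithmetic, not the compiler's actual heuristic): each bin holds an (i64, f32) pair, so a single full histogram copy is around 211 KiB at 18K bins, which already exceeds the shared memory of one A100 SM (~164 KiB usable), so the compiler is presumably chunking histograms across passes, and the exact 18K/20K crossover comes from its internal tuning rather than a raw capacity limit:

```python
# Back-of-envelope footprint of one full histogram copy. This ignores
# padding, subhistogram replication, and multi-pass chunking, all of
# which the compiler's real heuristic accounts for.

PAIR_BYTES = 8 + 4  # i64 index + f32 value, assuming no padding

def histogram_bytes(bins):
    return bins * PAIR_BYTES

for bins in (18_000, 20_000, 50_000):
    kib = histogram_bytes(bins) / 1024
    print(f"{bins:>6} bins: {kib:7.1f} KiB per full histogram copy")
```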


nhey commented Sep 29, 2023

To be precise, the slowdown is between 18K bins (shared memory) and 20K bins (global memory).


athas commented Sep 29, 2023

While some slowdown is expected for this operator, as it requires a spinlock, three orders of magnitude is too much. I suspect the segmented case is incorrectly tuned in code generation.
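
For context, a hedged sketch of why locking is needed at all (a Python stand-in using per-bin locks, not the GPU spinlock code the compiler actually emits): the combined (i64, f32) pair is too wide to update with a single hardware atomic, so concurrent updates to a bin must be serialized around the whole read-compare-write:

```python
# Why this operator needs a lock: the (index, value) pair cannot be
# updated atomically as one unit, so the read-compare-write of a bin
# is a critical section. Per-bin threading.Lock here stands in for
# the per-bin GPU spinlock.

import threading

def make_hist(bins):
    neutral = (-1, float("inf"))
    return [neutral] * bins, [threading.Lock() for _ in range(bins)]

def update(hist, locks, bin_, pair):
    i, x = pair
    with locks[bin_]:  # guards the whole pair, not just one word
        j, y = hist[bin_]
        if x < y or (x == y and i != -1 and i < j):
            hist[bin_] = (i, x)
```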


athas commented Oct 6, 2023

I think the problem is that we are computing the wrong index for the lock to use, meaning lock contention will be extremely high.
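
To illustrate the effect (the indexing formulas below are invented for exposition, not the actual generated code): if the lock slot is derived from something like the segment index rather than the bin, every update within a segment fights over one lock, so a million updates serialize almost completely:

```python
# Hypothetical illustration of lock contention from a wrong lock index.
# Both formulas are made up; they only show how the choice of index
# changes how many distinct locks the updates spread over.

from collections import Counter

NUM_LOCKS = 1024

def good_lock(seg, bin_):
    # Derive the lock from the bin: updates spread over all locks.
    return bin_ % NUM_LOCKS

def bad_lock(seg, bin_):
    # Derive the lock from the segment only: all updates in a
    # segment contend on a single lock.
    return seg % NUM_LOCKS

updates = [(seg, b) for seg in range(10) for b in range(20_000)]

good = Counter(good_lock(s, b) for s, b in updates)
bad = Counter(bad_lock(s, b) for s, b in updates)

print("distinct locks used (good):", len(good))
print("distinct locks used (bad): ", len(bad))
```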

athas added a commit that referenced this issue Oct 6, 2023
athas closed this as completed in a329cbb Oct 6, 2023
nhey pushed a commit to nhey/futhark that referenced this issue Oct 24, 2023
(cherry picked from commit a329cbb)
nhey pushed a commit that referenced this issue Oct 25, 2023
(cherry picked from commit a329cbb)
CKuke pushed a commit to CKuke/futhark-seq that referenced this issue Nov 8, 2023