
[CUDA] Parallel Cuda Mergesort #7099

Merged: 11 commits merged into apache:main on Dec 21, 2020

Conversation

mbrookhart (Contributor) commented Dec 12, 2020:

@Laurawly @zhiics @icemelon9 @csullivan @tkonolige

There have been many complaints recently about the stability and performance of the TIR-based CUDA sort kernel. I've spent a couple of days this week writing a CUDA version of parallel mergesort. It's a stable sort, so it fixes the flakiness we've seen with argsort and argwhere; it also changes the threading to support dynamic shapes, and it improves performance significantly over the previous kernel.
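A minimal NumPy sketch (not from this PR) of the stability property in question: with duplicate keys, a stable argsort keeps tied elements in their original order, so the result is deterministic, while an unstable sort may return tied indices in any order from run to run.

```python
import numpy as np

keys = np.array([1, 0, 1, 0, 1], dtype="float32")

# Stable argsort: tied keys keep their original relative order, so the
# output is fully deterministic.
print(np.argsort(keys, kind="stable"))     # [1 3 0 2 4]

# The default kind makes no such guarantee for ties; tied indices may come
# back in any order, which is the kind of nondeterminism that makes
# downstream ops such as argwhere look flaky.
print(np.argsort(keys, kind="quicksort"))
```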

This PR only addresses the core sort_ir function; extending this to the other sort variants in this file is future work.
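For context, here is a rough sketch in plain Python (not the PR's TIR implementation) of a bottom-up mergesort: each pass merges adjacent runs of width `width`, and because the merges within a pass are independent, a GPU kernel can assign each merge to its own group of threads.

```python
def bottom_up_mergesort(values):
    """Bottom-up (iterative) mergesort. Every merge within a pass is
    independent, so on a GPU each (lo, mid, hi) range can be handled by
    its own block/threads; about log2(n) passes are needed overall."""
    n = len(values)
    src, dst = list(values), [None] * n
    width = 1
    while width < n:
        # All merges in this pass could run in parallel.
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                # "<=" keeps equal keys in their original order -> stable sort.
                if src[i] <= src[j]:
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
                k += 1
            # Copy whichever side still has elements left.
            dst[k:hi] = src[i:mid] if i < mid else src[j:hi]
        src, dst = dst, src
        width *= 2
    return src


print(bottom_up_mergesort([3, 1, 2, 1, 5, 4]))  # [1, 1, 2, 3, 4, 5]
```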

I tested performance on a variety of shapes using this script and obtained these numbers on my 1070 Ti. It's not as fast as Thrust, as expected, but it's much closer for all shapes tested here, and it even manages to beat Thrust on a few. Times are in milliseconds; a rough sketch of how such a timing can be taken appears after the table.

Thanks!

| Shape | main | thrust | this PR |
| --- | --- | --- | --- |
| (2000, 2, 2) | 7.77 | 0.58 | 1.67 |
| (2, 2000, 2) | 4.8 | 0.7 | 1.59 |
| (2, 2, 2000) | 3.24 | 0.63 | 1.54 |
| (4000, 2, 2) | 25.53 | 0.65 | 4.05 |
| (2, 4000, 2) | 13.78 | 0.62 | 3.3 |
| (2, 2, 4000) | 9.85 | 0.63 | 4.04 |
| (2, 12000, 2) | 369.99 | 0.68 | 13.87 |
| (2, 2, 12000) | 86.55 | 0.66 | 11.11 |
| (12000, 2, 2) | 486.65 | 0.66 | 13.69 |
| (2000, 8, 8) | 259.21 | 10.4 | 4.22 |
| (8, 2000, 8) | 111.14 | 8.45 | 3.43 |
| (8, 8, 2000) | 50.37 | 9.05 | 3.05 |
| (4000, 8, 8) | 671.53 | 8.24 | 9.58 |
| (8, 4000, 8) | 368.59 | 8.47 | 10.12 |
| (8, 8, 4000) | 171.18 | 8.74 | 6.27 |
| (12000, 8, 8) | 3571.97 | 15.22 | 42.99 |
| (8, 12000, 8) | 3517.72 | 15.07 | 45.84 |
| (8, 8, 12000) | 1417.97 | 15.03 | 27.57 |
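The benchmarking script linked above is not reproduced here; the snippet below is only a rough sketch of how one such timing could be taken, assuming the TOPI-level `topi.cuda.sort` / `topi.cuda.schedule_sort` entry points and `time_evaluator`. The shape and sort axis are illustrative assumptions, not the PR's exact setup.

```python
import numpy as np
import tvm
from tvm import te, topi

shape = (8, 2000, 8)   # one of the shapes from the table above
axis = 1               # assumed sort axis, for illustration only

data = te.placeholder(shape, name="data", dtype="float32")
with tvm.target.Target("cuda"):
    out = topi.cuda.sort(data, axis=axis)
    s = topi.cuda.schedule_sort(out)
func = tvm.build(s, [data, out], target="cuda")

dev = tvm.gpu(0)
a = tvm.nd.array(np.random.rand(*shape).astype("float32"), dev)
b = tvm.nd.array(np.zeros(shape, dtype="float32"), dev)

timer = func.time_evaluator(func.entry_name, dev, number=10, repeat=3)
print("mean time: %.3f ms" % (timer(a, b).mean * 1e3))
```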

mbrookhart (Contributor, Author) commented:

I'm hitting some very odd segfaults, but only in the debug runtime with nvptx. While I figure out what's going on, I'll keep this as WIP.

mbrookhart changed the title from "[CUDA] Parallel Cuda Mergesort" to "[WIP][CUDA] Parallel Cuda Mergesort" on Dec 13, 2020
tkonolige (Contributor) left a comment:


Looks great! I think the main implementation might benefit from a couple of comments describing what it is doing.

(Review thread on python/tvm/topi/cuda/sort.py, resolved.)
csullivan (Contributor) left a comment:


Nice approach. I didn't spot an obvious reason for the segfault. I also agree with @tkonolige on docs: some high-level information on the problem flattening and on the index mapping / thread assignment for each slice (and as slices are merged) will make this easier for others to maintain.
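To make the documentation request concrete, here is a small sketch (plain NumPy, not the PR's TIR) of the flattening being described: a sort along one axis of an N-D tensor is treated as `num_slices` independent 1-D sorts, and each slice can then be mapped to its own group of blocks/threads.

```python
import numpy as np

def slice_layout(shape, axis):
    """Flatten all non-sort axes: the kernel sees num_slices independent
    rows of length slice_len, one per combination of the other indices."""
    slice_len = shape[axis]
    num_slices = int(np.prod(shape)) // slice_len
    return num_slices, slice_len

# e.g. an (8, 2000, 8) tensor sorted along axis=1 becomes 64 slices of
# length 2000; a kernel can assign one slice per block (or group of blocks)
# and threads within it to the positions being merged.
print(slice_layout((8, 2000, 8), axis=1))  # (64, 2000)
```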

(Two review threads on python/tvm/topi/cuda/sort.py, outdated and resolved.)
mbrookhart mentioned this pull request on Dec 14, 2020
mbrookhart force-pushed the cuda_mergesort branch 2 times, most recently from 6b8d79a to 8c6b03b on December 15, 2020
mbrookhart changed the title from "[WIP][CUDA] Parallel Cuda Mergesort" to "[CUDA] Parallel Cuda Mergesort" on Dec 15, 2020
mbrookhart (Contributor, Author) commented:

Many thanks to @masahi for helping me find an issue with heterogeneous lowering and some overflow issues in how I was handling the threads. I think it should be ready for review now, thanks everyone!

```diff
@@ -277,7 +277,7 @@ def _build_for_device(input_mod, target, target_host):
             lambda f: "calling_conv" not in f.attrs
             or f.attrs["calling_conv"].value != CallingConv.DEVICE_KERNEL_LAUNCH
         ),
-        tvm.tir.transform.Apply(lambda f: f.with_attr("target", target)),
+        tvm.tir.transform.Apply(lambda f: f.with_attr("target", target_host)),
```
masahi (Member) commented on this diff, Dec 15, 2020:


For the record, the segfault with nvptx was happening because the generated host code was calling intrinsics registered for nvptx, like __nv_log2 or __nv_ceil. The reason it worked on CUDA was just a coincidence: there are no CUDA intrinsics registered for fp64 log2 and ceil, so TVM fell back to the default lowering, which happens to be the right one (llvm).

This change fixes that issue.
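As a hedged illustration of the failure mode (a reconstruction, not code from this PR): the sort IR computes host-side quantities, such as the number of merge passes, with TIR math ops, and if those host-side expressions are lowered with the device target they pick up NVPTX intrinsics that the host cannot call.

```python
import tvm
from tvm import tir

# Hypothetical host-side expression: number of merge passes = ceil(log2(n)).
n = tir.Var("n", "int64")
levels = tir.Cast("int64", tir.ceil(tir.log2(tir.Cast("float64", n))))
print(levels)

# If this expression is lowered with the nvptx target, log2/ceil become
# __nv_log2/__nv_ceil, which exist only on the device; tagging host-side
# functions with target_host (the one-line diff above) keeps the default
# llvm lowering on the host.
```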

masahi (Member) commented Dec 18, 2020:

@mbrookhart I think we can revive some tests that are currently disabled due to flaky sort. See

```python
# TODO(zhiics) Enable argwhere gpu test after sort is fixed. Otherwise, we have
# to use thrust to guarantee the correct results which has been tested locally.
# @tvm.testing.uses_gpu
def test_any_argwhere():
```

```python
        # TODO(zhiics) Enable argwhere gpu test after sort is fixed.
        if ctx.device_type != 1:
            continue
        check_device(target, ctx)
```

masahi (Member) commented Dec 18, 2020:

We should also remove

```python
        # TODO(zhiics) Enable argwhere gpu test after sort is fixed.
        if ctx.device_type != 1:
            continue
```

zhiics (Member) left a comment:


Thanks for the work. Please fix the unit test.

mbrookhart (Contributor, Author) commented:

Oh no! A copy-paste error! Will fix

Laurawly (Contributor) left a comment:


LGTM

tqchen merged commit 38273ee into apache:main on Dec 21, 2020
masahi pushed a commit to masahi/tvm that referenced this pull request Dec 24, 2020
@mbrookhart mbrookhart deleted the cuda_mergesort branch January 4, 2021 17:10
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021