
Parallelize cumsum in get_valid_counts #7123

Merged
merged 6 commits into apache:main on Dec 31, 2020

Conversation

mbrookhart
Contributor

@mbrookhart commented Dec 17, 2020

As a follow-up to #6839, this parallelizes the cumsum in get_valid_counts using an upsweep/downsweep tree-based prefix sum algorithm, similar to what I did in #7099.
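
For illustration, here is a minimal NumPy sketch of the upsweep/downsweep (Blelloch) exclusive scan pattern the kernel follows; this is just the algorithm on a power-of-two-length array, not the TVM IR in this PR:

import numpy as np

def exclusive_scan(data):
    # Blelloch-style exclusive prefix sum; assumes len(data) is a power of two.
    out = np.array(data, dtype=np.int64)
    n = len(out)
    # Upsweep: build partial sums up the tree.
    stride = 1
    while stride < n:
        out[2 * stride - 1::2 * stride] += out[stride - 1::2 * stride]
        stride *= 2
    # Downsweep: clear the root, then push prefixes back down the tree.
    out[-1] = 0
    stride = n // 2
    while stride >= 1:
        right = out[2 * stride - 1::2 * stride].copy()
        out[2 * stride - 1::2 * stride] += out[stride - 1::2 * stride]
        out[stride - 1::2 * stride] = right
        stride //= 2
    return out

print(exclusive_scan([1, 0, 1, 1, 0, 1, 0, 0]))  # [0 1 1 2 3 3 4 4]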

On my 1070 Ti, testing deploy_ssd_gluoncv.py, I previously reported that get_valid_counts took 3674.62 microseconds; this reduces that to 258.497 microseconds (an earlier revision of this PR measured 495.8).

@masahi has expressed interest in implementing a more general prefix scan for other ops; as future work, I expect we'll refactor this and look at possible cache optimizations.

Thanks

cc @Laurawly @zhiics @kevinthesun

@masahi
Member

masahi commented Dec 18, 2020

@mbrookhart Can you revive the disabled topi get_valid_count test? It seems this test needs some updating.

@tvm.testing.uses_gpu
@pytest.mark.skip(
"Skip this test as it is intermittent."
"See https://github.com/apache/tvm/pull/4901#issuecomment-595040094"
)
def test_get_valid_counts():

@mbrookhart
Contributor Author

ping @Laurawly, any chance you could take a look?

@masahi
Member

masahi commented Dec 31, 2020

@Laurawly The plan is that after we merge this first, we will generalize the cumsum IR in this PR into a reusable exclusive scan primitive. After that, we can update our CUDA argwhere implementation to use exclusive scan + compaction, and introduce a numpy-style cumsum operator.
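
As a rough NumPy illustration of the exclusive scan + compaction pattern (argwhere_1d below is a hypothetical helper name, not a TVM API):

import numpy as np

def argwhere_1d(x):
    # Sketch: argwhere via an exclusive scan over a predicate mask, then compaction.
    mask = (x != 0).astype(np.int64)            # 1 where the predicate holds
    positions = np.cumsum(mask) - mask          # exclusive scan gives output slots
    out = np.empty(int(mask.sum()), dtype=np.int64)
    for i in range(len(x)):                     # compaction: scatter surviving indices
        if mask[i]:
            out[positions[i]] = i
    return out

print(argwhere_1d(np.array([0, 3, 0, 5, 7, 0])))  # [1 3 4]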

@Laurawly
Contributor

@Laurawly The plan is that after we merge this first, we will generalize the cumsum IR in this PR into a reusable exclusive scan primitive. After that, we can update our CUDA argwhere implementation to use exclusive scan + compaction, and introduce a numpy-style cumsum operator.

Sure, I can merge this first.

@Laurawly merged commit c02c9c5 into apache:main on Dec 31, 2020
@trevor-m
Contributor

trevor-m commented Jan 5, 2021

Hi @mbrookhart, thanks for this performance improvement!

I found that this PR is causing a "CUDA: an illegal memory access was encountered" error during inference for a TensorFlow SSD object detection model. I can't reproduce it in a standalone unit test, so I think there may be some race condition or code relying on uninitialized memory. I'll let you know if I find out anything more.

@mbrookhart
Contributor Author

Thanks, Trevor. If you can share the model script you're using, I can also work to debug today.

@mbrookhart deleted the get_valid_counts_prefix_sum branch January 5, 2021 17:12
@masahi
Member

masahi commented Jan 5, 2021

I can reproduce the issue by running the ssd test in tensorflow/test_forward.py with the cuda target (I looked at this test yesterday for my PR, so it's fresh in my memory):

terminate called after throwing an instance of 'dmlc::Error'
  what():  [05:42:13] /home/masa/projects/dev/tvm/src/runtime/cuda/cuda_device_api.cc:126: 
---------------------------------------------------------------
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
---------------------------------------------------------------
  Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading == false: CUDA: an illegal memory access was encountered
Stack trace:
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x14aa8e8) [0x7f4fcb8ca8e8]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0xe4) [0x7f4fcb8cabe4]
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*)+0x5b) [0x7f4fcb8593fb]
  [bt] (3) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::NDArray::CopyTo(DLContext const&) const+0x325) [0x7f4fcb5e4915]
  [bt] (4) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::vm::CopyTo(tvm::runtime::ObjectRef, DLContext const&)+0x311) [0x7f4fcb884b11]
  [bt] (5) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::vm::VirtualMachine::RunLoop()+0x2aee) [0x7f4fcb880dde]
  [bt] (6) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::runtime::vm::VirtualMachine::Invoke(tvm::runtime::vm::VMFunction const&, std::vector<tvm::runtime::ObjectRef, std::allocator<tvm::runtime::ObjectRef> > const&)+0x27) [0x7f4fcb881c17]
  [bt] (7) /home/masa/projects/dev/tvm/build/libtvm.so(+0x14621f0) [0x7f4fcb8821f0]
  [bt] (8) /home/masa/projects/dev/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f4fcb835613]

@trevor-m Are you sure this is caused by the get_valid_counts change? I've also changed NMS in #7172; I hope that change is fine.

@masahi
Member

masahi commented Jan 5, 2021

Hmm, strange: after running the ssd test on GPU a few times, I cannot reproduce the error anymore. Could this error be random?

One annoying thing about this model is that compilation is extremely slow. It also requires increasing the stack size limit, otherwise it segfaults.

@mbrookhart
Contributor Author

Yeah, ouch:
447.33s (0:07:27)

I don't need to increase the stack limit, and I haven't gotten this test to fail yet.

@trevor-m
Contributor

trevor-m commented Jan 5, 2021

Hmm, strange: after running the ssd test on GPU a few times, I cannot reproduce the error anymore. Could this error be random?

One annoying thing about this model is that compilation is extremely slow. It also requires increasing the stack size limit, otherwise it segfaults.

Yeah, the error is a bit random. However, I was able to reproduce it 100% of the time with TRT offload enabled. I can share a script shortly.

@trevor-m Are you sure this is caused by the get_valid_counts change? I've also changed NMS in #7172; I hope that change is fine.

Yeah, I did a git bisect to determine this PR was the source of the issue, and #7172 was fine.

@anijain2305
Contributor

Hmm, strange: after running the ssd test on GPU a few times, I cannot reproduce the error anymore. Could this error be random?

One annoying thing about this model is that compilation is extremely slow. It also requires increasing the stack size limit, otherwise it segfaults.

Maybe it depends on the input data. Trevor and I ran it across a bunch of models, and it fails for a few of them (not all). I believe it could be because of the input data (the number of boxes etc. changes with the input image).

@mbrookhart
Contributor Author

@trevor-m I'm in mountain time, so I'll need to leave in about half an hour. If you can post the script that consistently fails tonight, I'll jump in first thing tomorrow morning and start hunting for which line causes the issue.

@masahi
Member

masahi commented Jan 5, 2021

@anijain2305 @trevor-m We should definitely use a fixed, real image for CI testing, like the pytorch MaskRCNN test does. Please send a PR.

img = "test_street_small.jpg"
img_url = (
"https://raw.githubusercontent.com/dmlc/web-data/"
"master/gluoncv/detection/street_small.jpg"
)
download(img_url, img)

@trevor-m
Contributor

trevor-m commented Jan 7, 2021

I ran the model that was failing (ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03) under cuda-gdb and was able to get some information from the crash:

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x55556b2ef890

Thread 1 "python3" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 2, grid 894452, block (0,0,0), thread (864,0,0), device 0, sm 0, warp 24, lane 0]
0x000055556b2ef8b0 in fused_vision_non_max_suppression_kernel1<<<(1,1,1),(1024,1,1)>>> ()

@masahi Any thoughts?

@masahi
Member

masahi commented Jan 7, 2021

Does this mean the NMS kernel, and not get_valid_counts, has an issue? I recognize the thread launch config (1,1,1),(1024,1,1); this is due to my NMS change. But that kernel should be fused_vision_non_max_suppression_kernel2, not fused_vision_non_max_suppression_kernel1 as shown above, so this is weird.

@mbrookhart
Contributor Author

mbrookhart commented Jan 7, 2021

Looking at the code, assuming you have thrust enabled, this should be kernel0:

score_tensor = te.extern(
    [score_shape],
    [data],
    lambda ins, outs: _fetch_score_ir(
        ins[0],
        outs[0],
        score_axis,
    ),
    dtype=[data.dtype],
    in_buffers=[data_buf],
    out_buffers=[score_buf],
    name="fetch_score",
    tag="fetch_score",
)

The thrust argsort won't get a kernel number:
sort_tensor = argsort_thrust(
    score_tensor, valid_count=None, axis=1, is_ascend=False, dtype=valid_count_dtype
)

And this should be kernel1:
with ib.new_scope():
    nthread_tx = max_threads
    nthread_bx = ceil_div(num_anchors, max_threads)
    nthread_by = batch_size
    tx = te.thread_axis("threadIdx.x")
    bx = te.thread_axis("blockIdx.x")
    by = te.thread_axis("blockIdx.y")
    ib.scope_attr(by, "thread_extent", nthread_by)
    ib.scope_attr(tx, "thread_extent", nthread_tx)
    ib.scope_attr(bx, "thread_extent", nthread_bx)
    i = by
    base_idx = i * num_anchors * box_data_length
    with ib.if_scope(tvm.tir.all(iou_threshold > 0, valid_count[i] > 0)):
        # Reorder output
        nkeep = if_then_else(
            tvm.tir.all(top_k > 0, top_k < valid_count[i]), top_k, valid_count[i]
        )
        j = bx * max_threads + tx
        with ib.if_scope(j < num_anchors):
            box_indices[i * num_anchors + j] = -1
        with ib.if_scope(j < nkeep):
            # Fill in out with sorted boxes
            with ib.for_range(0, box_data_length) as k:
                out[(base_idx + j * box_data_length + k)] = data[
                    (base_idx + sorted_index[i * num_anchors + j] * box_data_length + k)
                ]
        with ib.else_scope():
            # Indices > nkeep are discarded
            with ib.if_scope(j < num_anchors):
                with ib.for_range(0, box_data_length) as k:
                    out[(base_idx + j * box_data_length + k)] = -1.0
    with ib.else_scope():
        with ib.if_scope(j < valid_count[i]):
            with ib.for_range(0, box_data_length) as k:
                offset = base_idx + j * box_data_length + k
                out[offset] = data[offset]
            box_indices[i * num_anchors + j] = j

That could have threads (1,1,1),(1024,1,1) if we have batch_size=1 and num_anchors <= 1024. I'm not seeing anything in there that jumps out as having an issue, though. Every use of j is guarded by an if scope with j < num_anchors, j < nkeep, or j < valid_count, and nkeep <= valid_count. The only way it could fail is if valid_count > num_anchors...

So possibly it's failing because my changes to get_valid_count are returning the wrong valid_count.

@trevor-m Any chance we can dump the inputs/attrs for get_valid_count so I can make a unit test to check that hypothesis? I haven't been able to get it to fail with random inputs, but possibly there's an edge case in my exclusive_scan algorithm for this input data.
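
For reference, a small NumPy model of the invariant being checked (ref_valid_counts is a hypothetical helper, not topi code): with id_index=-1, valid_count just counts boxes whose score exceeds the threshold, so it can never exceed num_anchors.

import numpy as np

def ref_valid_counts(boxes, score_threshold=0.0, score_index=0):
    # boxes: (batch, num_anchors, box_data_length); the score sits at score_index.
    scores = boxes[:, :, score_index]
    return (scores > score_threshold).sum(axis=1).astype("int32")

boxes = np.random.uniform(-1, 1, (1, 100, 5)).astype("float32")
assert (ref_valid_counts(boxes) <= boxes.shape[1]).all()  # valid_count <= num_anchors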

@trevor-m
Contributor

trevor-m commented Jan 7, 2021

Thanks for looking into it and finding that info, @mbrookhart!

Here is the relevant relay graph:

boxes = relay.var("boxes", shape=(1, relay.Any(), 5), dtype="float32")

max_output_size = relay.shape_of(boxes)
max_output_size = relay.strided_slice(max_output_size, begin=[1], end=[2], strides=[1])
max_output_size = relay.squeeze(max_output_size)
max_output_size = relay.minimum(relay.const(100, dtype="int32"), max_output_size)

ct, data, indices = relay.vision.get_valid_counts(
    boxes, score_threshold=0.0, id_index=-1, score_index=0
)

nms_ret = relay.vision.non_max_suppression(
    data=boxes,
    valid_count=ct,
    indices=indices,
    max_output_size=max_output_size,
    iou_threshold=0.6,
    force_suppress=True,
    top_k=-1,
    coord_start=1,
    score_index=0,
    id_index=-1,
    return_indices=True,
    invalid_to_bottom=False,
)

The input shape is [1, 0, 5] during the model execution when the crash occurs. I haven't been able to reproduce it with this standalone test yet. Maybe there is an edge case for a size-0 max_output_size or num_anchors?
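
For reference, one way the graph above could be driven with the crashing [1, 0, 5] shape (a hypothetical harness sketch; the executor plumbing is an assumption and may need adjusting for a given TVM version):

# Hypothetical harness (sketch): reuses `boxes` and `nms_ret` from the snippet above
# and feeds the empty [1, 0, 5] input observed at the crash.
import numpy as np
import tvm
from tvm import relay

func = relay.Function([boxes], nms_ret.astuple())
mod = tvm.IRModule.from_expr(func)

empty_boxes = np.zeros((1, 0, 5), dtype="float32")  # zero anchors
run = relay.create_executor("vm", mod=mod, target="cuda").evaluate()
print(run(empty_boxes))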

@mbrookhart
Contributor Author

Ooh, interesting: doing NMS on no boxes. I'll take a look with that idea.

@mbrookhart
Contributor Author

mbrookhart commented Jan 7, 2021

I don't think this is valid if num_anchors is zero; it could lead to undefined behavior. Could you wrap it in an ib.if_scope(num_anchors > 0) and see if that fixes the problem?

with ib.new_scope():
    bx = te.thread_axis("blockIdx.x")
    ib.scope_attr(bx, "thread_extent", batch_size)
    with ib.if_scope(bx < batch_size):
        valid_count[bx] = valid_indices[(bx + 1) * num_anchors - 1]
        valid_indices[(bx + 1) * num_anchors - 1] = 0

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jan 7, 2021
@anijain2305
Contributor

anijain2305 commented Jan 7, 2021

I don't think this is valid if num_anchors is zero; it could lead to undefined behavior. Could you wrap it in an ib.if_scope(num_anchors > 0) and see if that fixes the problem?

with ib.new_scope():
    bx = te.thread_axis("blockIdx.x")
    ib.scope_attr(bx, "thread_extent", batch_size)
    with ib.if_scope(bx < batch_size):
        valid_count[bx] = valid_indices[(bx + 1) * num_anchors - 1]
        valid_indices[(bx + 1) * num_anchors - 1] = 0

@@ -210,8 +211,9 @@ def get_valid_indices_ir(valid_boxes, valid_count, valid_indices):
         bx = te.thread_axis("blockIdx.x")
         ib.scope_attr(bx, "thread_extent", batch_size)
         with ib.if_scope(bx < batch_size):
-            valid_count[bx] = valid_indices[(bx + 1) * num_anchors - 1]
-            valid_indices[(bx + 1) * num_anchors - 1] = 0
+            with ib.if_scope(num_anchors > 0):
+                valid_count[bx] = valid_indices[(bx + 1) * num_anchors - 1]
+                valid_indices[(bx + 1) * num_anchors - 1] = 0

     with ib.for_range(0, lim, dtype="int64") as l2_width:
         width = 2 << (lim - l2_width - 1)

I tried this yesterday. Unfortunately, this is not the source. The test still failed.

@mbrookhart
Contributor Author

Alas. I would still very much appreciate the script to reproduce this so I can hunt it down.

@mbrookhart
Contributor Author

lim = tvm.tir.generic.cast(
    tvm.tir.ceil(tvm.tir.log2(tvm.tir.generic.cast(num_anchors, "float64"))), "int64"
)

log2(0) is undefined; we should probably just wrap the entire thing in an if num_anchors > 0.
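
One possible alternative to an if scope (a sketch of the idea only, not the fix that eventually landed) is to clamp the argument so log2 never sees zero; tvm.tir.if_then_else is assumed to be usable here:

# Sketch: clamp num_anchors to at least 1 before taking log2, so lim is 0 when
# there are no anchors and the scan loop runs zero iterations.
safe_anchors = tvm.tir.if_then_else(num_anchors > 0, num_anchors, 1)
lim = tvm.tir.generic.cast(
    tvm.tir.ceil(tvm.tir.log2(tvm.tir.generic.cast(safe_anchors, "float64"))), "int64"
)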

@anijain2305
Contributor

Alas. I would still very much appreciate the script to reproduce this so I can hunt it down.

Yes, Trevor is working on it. It needs the TRT workflow, and that's why the delay.

lim = tvm.tir.generic.cast(
    tvm.tir.ceil(tvm.tir.log2(tvm.tir.generic.cast(num_anchors, "float64"))), "int64"
)

log2(0) is undefined; we should probably just wrap the entire thing in an if num_anchors > 0.

Let me try this as well.

@trevor-m
Contributor

trevor-m commented Jan 7, 2021

Thanks @mbrookhart, I tried that but it didn't fix the error.

Here is a script to reproduce the error: https://gist.github.com/trevor-m/f44d3d0e7edcaee12722e518e5959b82

I also noticed this line in the kernel where cuda-gdb found a crash: https://github.com/apache/tvm/blob/main/python/tvm/topi/cuda/nms.py#L545
Shouldn't nthread_bx be 0 if num_anchors is 0? Does this mean num_anchors is wrong?

@mbrookhart
Contributor Author

Thanks, I'll try to reproduce. You're building with thrust and TRT, right?

You can't compile a cuda kernel with zero threads, so we always make sure it's at least 1:

if attr_key == "thread_extent":
value = op.max(1, value)
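
Concretely, here's the arithmetic (ceil_div reimplemented below just for illustration):

def ceil_div(a, b):
    # Integer ceiling division, as used to compute the block count.
    return (a + b - 1) // b

max_threads = 1024
num_anchors = 0
nthread_bx = ceil_div(num_anchors, max_threads)  # 0 blocks requested
print(nthread_bx, max(1, nthread_bx))            # 0 1
# The thread_extent clamp launches one block anyway, so body guards such as
# `j < num_anchors` are what keep the kernel from touching out-of-range memory.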

@trevor-m
Contributor

trevor-m commented Jan 7, 2021

Thanks, I'll try to reproduce. You're building with thrust and TRT, right?

You can't compile a cuda kernel with zero threads, so we always make sure it's at least 1:

if attr_key == "thread_extent":
    value = op.max(1, value)

Yes, that's right, thrust + TRT. Thank you for your help with debugging this.

@mbrookhart
Contributor Author

For posterity, @trevor-m and I did some offline debugging yesterday, and #7229 seems to fix the issue.

tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Jan 11, 2021
* Parallelize cumsum in get_valid_counts

* make the scan loop exclusive

* switch to directly using exclusive scan

* perform inner loop of final writes on anchor threads

* fix flaky test

fix lint

* remove final cuda kernel

Co-authored-by: masa <masa@pop-os.localdomain>
@mbrookhart restored the get_valid_counts_prefix_sum branch January 19, 2021 16:44
TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Jan 20, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jan 21, 2021
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021