
[BugFix][Relax] Fix scatter_elements and scatter_nd CUDA compilation #19497

Merged
tlopex merged 1 commit into apache:main from as4230:fix/relax-scatter-cuda-inline on May 4, 2026

[BugFix][Relax] Fix scatter_elements and scatter_nd CUDA compilation#19497
tlopex merged 1 commit intoapache:mainfrom
as4230:fix/relax-scatter-cuda-inline

Conversation

@as4230 (Contributor) commented May 3, 2026

`topi.scatter_elements` and `topi.scatter_nd` emit bare `T.parallel` loops in their `te.extern` IRBuilder bodies, which trips `VerifyMemory` on CUDA targets:

    RuntimeError: Memory verification failed
    ...
    Did you forget to bind?

CPU (LLVM) is unaffected.

This fix makes the IRBuilder bodies in both `topi/scatter_elements.py` and `topi/scatter.py` target-aware: when `Target.current()` is a GPU target, they emit thread bindings instead of `T.parallel`.

Fixes #19451.
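The thread-binding replacement amounts to splitting the flat iteration space across `blockIdx.x`/`threadIdx.x` extents. A minimal, TVM-free sketch of that split (`split_for_threads` and the `max_threads` cap are illustrative, not names from the actual patch):

```python
def split_for_threads(total: int, max_threads: int = 1024) -> tuple[int, int]:
    """Map a flat loop of `total` iterations onto (blockIdx.x extent,
    threadIdx.x extent), the way a GPU lowering replaces a bare
    `T.parallel` loop with bound thread axes."""
    nthread = min(total, max_threads)          # threads per block
    nblock = (total + nthread - 1) // nthread  # ceil(total / nthread) blocks
    return nblock, nthread
```

Since `nblock * nthread` may overshoot `total`, the emitted GPU body also guards each thread with a bounds check of the form `if bx * nthread + tx < total`.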

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces GPU-specific implementations of the `scatter_nd` and `scatter_elements` operators in TVM TOPI, using explicit thread and block bindings on GPU targets while keeping the existing CPU paths. It also adds helper functions for reduction and index calculation. A review comment suggests refactoring `scatter_elements.py` to reduce logic duplication between the GPU and CPU paths by extracting common operations into helper functions.

Comment thread: `python/tvm/topi/scatter_elements.py`
@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from 75435f3 to b6b11f9 on May 3, 2026 07:14
@as4230 (Contributor, Author) commented May 3, 2026

@tlopex I considered going with a dispatch pass like DispatchSortScan but ended up fixing the lowering inline. The CPU and GPU paths here use the same algorithm so adding a separate topi/gpu/ implementation plus dispatch felt like mostly boilerplate and duplication.

@tlopex (Member) commented May 3, 2026

Thanks for working on this. I agree that fixing this in the scatter lowering is the right layer, and this is much closer to what I had in mind than a generic T.parallel cleanup pass.

That said, I’d prefer the fix to restore a proper target-specific TOPI dispatch rather than make the generic topi.scatter_elements / topi.scatter_nd implementation target-aware internally.

The root issue in #19451 is that Relax legalization lost the old CUDA scatter lowering path: the op is lowered through te.extern, and the generic IRBuilder emits bare T.parallel, which is not valid GPU TIR. Since this is a target-specific legality issue, I think the more fundamental fix is to add a GPU TOPI implementation, e.g. under topi/gpu/, and dispatch to it from relax/transform/legalize_ops/manipulate.py when the current target is GPU. That is similar in spirit to the old Relay CUDA scatter path that was removed during the Relay cleanup.

Even if the CPU and GPU algorithms are mostly the same, the generated TIR is not target-neutral: GPU needs explicit blockIdx/threadIdx bindings, and reductions may need GPU-specific handling. Keeping that in a GPU-specific implementation avoids making the generic TOPI path depend on Target.current(), and it keeps VerifyMemory useful as a check that GPU lowering produced valid GPU TIR in the first place.

So my preference would be:

  1. keep the generic TOPI scatter lowering target-neutral for CPU/default use;
  2. add topi/gpu scatter lowering that emits thread-bound TIR;
  3. dispatch to the GPU implementation from Relax legalization;
  4. add CUDA compile/build regression tests for both scatter_elements and scatter_nd.
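The dispatch in steps 2–3 above can be sketched, TVM-free, as a lookup on the current target kind (in TVM the kind string would come from `Target.current().kind.name`; the function names below are placeholders for the real TOPI entry points, not the actual patch):

```python
# Placeholder lowerings standing in for the generic and GPU TOPI scatter
# implementations; in TVM these would live under topi/ and topi/gpu/.
def scatter_elements_generic(*args):
    return "generic"

def scatter_elements_gpu(*args):
    return "gpu"

# Target kinds whose TIR requires explicit blockIdx/threadIdx bindings.
GPU_TARGET_KINDS = {"cuda", "rocm", "metal", "vulkan", "opencl"}

def choose_scatter_impl(target_kind: str):
    """Legalization-time dispatch: GPU targets get the thread-bound
    lowering, everything else keeps the target-neutral generic one."""
    if target_kind in GPU_TARGET_KINDS:
        return scatter_elements_gpu
    return scatter_elements_generic
```

Keeping the selection at the legalization layer, as the list above proposes, is what lets the generic implementation stay free of any `Target.current()` checks.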

I don’t think we need the broad pass from #19363, but I would like this PR to fix the issue through explicit GPU scatter lowering rather than inline target checks in the generic implementation.

@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from b6b11f9 to 5542f66 on May 3, 2026 13:29
@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from 5542f66 to 2628c11 on May 3, 2026 13:31
@as4230 (Contributor, Author) commented May 3, 2026

@tlopex Thanks for the feedback and for spelling out your preferences. I reworked the PR: the generic `topi.scatter_elements` and `topi.scatter_nd` are untouched, GPU lowering lives in `topi/gpu/scatter_elements.py` and `topi/gpu/scatter_nd.py` with explicit `blockIdx`/`threadIdx` bindings, and dispatch happens in `legalize_ops/manipulate.py`. I also added CUDA compile/build regression tests for both ops next to the existing scatter tests in `test_transform_legalize_ops_manipulate.py`.

@tlopex (Member) left a comment

LGTM

@tlopex merged commit fde09d2 into apache:main on May 4, 2026
9 checks passed

Successfully merging this pull request may close these issues:

[Bug] scatter_elements and scatter_nd fail to compile for CUDA target
