
[BugFix][Relax] Fix scatter_elements and scatter_nd CUDA compilation #19497

Merged
tlopex merged 1 commit into apache:main from as4230:fix/relax-scatter-cuda-inline on May 4, 2026

[BugFix][Relax] Fix scatter_elements and scatter_nd CUDA compilation#19497
tlopex merged 1 commit intoapache:mainfrom
as4230:fix/relax-scatter-cuda-inline

Conversation

@as4230 (Contributor) commented May 3, 2026

`topi.scatter_elements` and `topi.scatter_nd` emit bare `T.parallel` loops in their `te.extern` IRBuilder bodies, which trips `VerifyMemory` on CUDA targets:

    RuntimeError: Memory verification failed
    ...
    Did you forget to bind?

CPU (LLVM) is unaffected.

This fix makes the IRBuilder bodies in both `topi/scatter_elements.py` and `topi/scatter.py` target-aware: when `Target.current()` is a GPU target, they emit thread bindings instead of `T.parallel`.

Fixes #19451.
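The thread-binding replacement amounts to splitting the flat iteration space across `blockIdx.x`/`threadIdx.x` extents. A minimal, TVM-free sketch of that split (`split_for_threads` and the `max_threads` cap are illustrative, not names from the actual patch):

```python
def split_for_threads(total: int, max_threads: int = 1024) -> tuple[int, int]:
    """Map a flat loop of `total` iterations onto (blockIdx.x extent,
    threadIdx.x extent), the way a GPU lowering replaces a bare
    `T.parallel` loop with bound thread axes."""
    nthread = min(total, max_threads)          # threads per block
    nblock = (total + nthread - 1) // nthread  # ceil(total / nthread) blocks
    return nblock, nthread
```

Since `nblock * nthread` may overshoot `total`, the emitted GPU body also guards each thread with a bounds check of the form `if bx * nthread + tx < total`.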

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces GPU-specific implementations of the `scatter_nd` and `scatter_elements` operators in TVM TOPI, using explicit thread and block bindings on GPU targets while keeping the existing CPU paths. It also adds helper functions for reduction and index calculation. A review comment suggests refactoring `scatter_elements.py` to reduce logic duplication between the GPU and CPU paths by extracting common operations into helper functions.

Comment thread: `python/tvm/topi/scatter_elements.py`
@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from 75435f3 to b6b11f9 on May 3, 2026 07:14
@as4230 (Contributor, Author) commented May 3, 2026

@tlopex I considered going with a dispatch pass like DispatchSortScan but ended up fixing the lowering inline. The CPU and GPU paths here use the same algorithm so adding a separate topi/gpu/ implementation plus dispatch felt like mostly boilerplate and duplication.

@tlopex (Member) commented May 3, 2026

Thanks for working on this. I agree that fixing this in the scatter lowering is the right layer, and this is much closer to what I had in mind than a generic T.parallel cleanup pass.

That said, I’d prefer the fix to restore a proper target-specific TOPI dispatch rather than make the generic topi.scatter_elements / topi.scatter_nd implementation target-aware internally.

The root issue in #19451 is that Relax legalization lost the old CUDA scatter lowering path: the op is lowered through te.extern, and the generic IRBuilder emits bare T.parallel, which is not valid GPU TIR. Since this is a target-specific legality issue, I think the more fundamental fix is to add a GPU TOPI implementation, e.g. under topi/gpu/, and dispatch to it from relax/transform/legalize_ops/manipulate.py when the current target is GPU. That is similar in spirit to the old Relay CUDA scatter path that was removed during the Relay cleanup.

Even if the CPU and GPU algorithms are mostly the same, the generated TIR is not target-neutral: GPU needs explicit blockIdx/threadIdx bindings, and reductions may need GPU-specific handling. Keeping that in a GPU-specific implementation avoids making the generic TOPI path depend on Target.current(), and it keeps VerifyMemory useful as a check that GPU lowering produced valid GPU TIR in the first place.

So my preference would be:

  1. keep the generic TOPI scatter lowering target-neutral for CPU/default use;
  2. add topi/gpu scatter lowering that emits thread-bound TIR;
  3. dispatch to the GPU implementation from Relax legalization;
  4. add CUDA compile/build regression tests for both scatter_elements and scatter_nd.
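The dispatch in steps 2–3 above can be sketched, TVM-free, as a lookup on the current target kind (in TVM the kind string would come from `Target.current().kind.name`; the function names below are placeholders for the real TOPI entry points, not the actual patch):

```python
# Placeholder lowerings standing in for the generic and GPU TOPI scatter
# implementations; in TVM these would live under topi/ and topi/gpu/.
def scatter_elements_generic(*args):
    return "generic"

def scatter_elements_gpu(*args):
    return "gpu"

# Target kinds whose TIR requires explicit blockIdx/threadIdx bindings.
GPU_TARGET_KINDS = {"cuda", "rocm", "metal", "vulkan", "opencl"}

def choose_scatter_impl(target_kind: str):
    """Legalization-time dispatch: GPU targets get the thread-bound
    lowering, everything else keeps the target-neutral generic one."""
    if target_kind in GPU_TARGET_KINDS:
        return scatter_elements_gpu
    return scatter_elements_generic
```

Keeping the selection at the legalization layer, as the list above proposes, is what lets the generic implementation stay free of any `Target.current()` checks.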

I don’t think we need the broad pass from #19363, but I would like this PR to fix the issue through explicit GPU scatter lowering rather than inline target checks in the generic implementation.

@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from b6b11f9 to 5542f66 on May 3, 2026 13:29
@as4230 force-pushed the fix/relax-scatter-cuda-inline branch from 5542f66 to 2628c11 on May 3, 2026 13:31
@as4230 (Contributor, Author) commented May 3, 2026

@tlopex Thanks for the feedback and for spelling out your preferences. I reworked the PR: the generic `topi.scatter_elements` and `topi.scatter_nd` are untouched, GPU lowering lives in `topi/gpu/scatter_elements.py` and `topi/gpu/scatter_nd.py` with explicit `blockIdx`/`threadIdx` bindings, and dispatch happens in `legalize_ops/manipulate.py`. I also added CUDA compile/build regression tests for both ops next to the existing scatter tests in `test_transform_legalize_ops_manipulate.py`.

@tlopex (Member) left a comment

LGTM

@tlopex merged commit fde09d2 into apache:main on May 4, 2026
9 checks passed

Successfully merging this pull request may close these issues:

[Bug] scatter_elements and scatter_nd fail to compile for CUDA target
