Skip to content

[TIRx] Post-bringup op-dispatch / codegen / TVMScript follow-ups#19657

Merged
tqchen merged 1 commit into
apache:mainfrom
spectrometerHBH:apache-pr-tirx-followups
Jun 2, 2026
Merged

[TIRx] Post-bringup op-dispatch / codegen / TVMScript follow-ups#19657
tqchen merged 1 commit into
apache:mainfrom
spectrometerHBH:apache-pr-tirx-followups

Conversation

@spectrometerHBH

@spectrometerHBH spectrometerHBH commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up work on top of the TIRx infrastructure bring-up (#19581). It extends the TIRx operator-dispatch, codegen, and TVMScript surfaces with the next batch of low-level programming features for Blackwell-class GPUs, while keeping s_tir script support intact.

Main Changes

  • op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into register / gmem-smem / ldgsts paths; tcgen05.ld/st .16x{64,128,256}b dispatch with a factory and M=128 layout; element-wise broadcast at the layout level with a copy vec-alignment fix.
  • gemm: CUDA synchronous mma.sync tensor-core dispatch; accept a Layout F C operand for M=64 MMAs.
  • op: add the permute_layout primitive (replaces permute_dims).
  • tvmscript: add the Tx.jit decorator, Tx.constexpr compile-time params, and Tx.wg_reg_tile.
  • lower-tirx: introduce the Tx.device_entry() marker (replacing ScopeKind::kKernel); canonical thread filters that drop the Tx.filter wrapper.
  • codegen: add a typed-pointer byte-offset intrinsic; remove the entry_cluster_sync codegen attribute.

Validation

  • pre-commit run (changed files) — clean
  • ninja -C build -j$(nproc) — builds
  • pytest tests/python/tirx/ -n 16
    • 1997 passed, 39 skipped, 3 xpassed
  • python -m pytest tests/python/all-platform-minimal-test
    • 37 passed, 105 skipped
  • TVM_TEST_TARGETS=llvm pytest tests/python/tirx-analysis tests/python/tirx-base tests/python/tirx-transform -n 16
    • 630 passed, 25 skipped, 8 xfailed, 1 xpassed

Local CI Notes

Several full CI-equivalent jobs are not locally reproducible because this machine is missing parts of the Apache TVM CI environment (e.g., specific llvm-config versions, Vulkan, ROCm, ARM/QEMU cross-toolchain, and web/wasm components). The Blackwell/Trainium kernel tests are maintained downstream and are intentionally not part of this PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the TVM TIRx execution scopes and dispatchers, removing the legacy kernel and world scopes in favor of a flat Tx.device_entry() marker, introducing a canonical thread-filter grammar, and rewriting the copy and elementwise dispatchers to use a unified layout-alignment and partitioning algorithm. It also adds support for Blackwell tcgen05 datapath layouts, replaces permute_dims with permute_layout for swizzled warp transposes, and optimizes the integer set evaluator to prevent exponential re-expansion on deep variable dependency chains. A critical correctness bug was identified in the _carve_tail helper, where the failure to fuse the consumed suffix into a single element results in incorrect outer loop iterations and misaligned vector memory accesses.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/tvm/tirx/operator/tile_primitive/cuda/copy/_common.py
@spectrometerHBH

Copy link
Copy Markdown
Contributor Author

@tvm-bot rerun

@spectrometerHBH spectrometerHBH force-pushed the apache-pr-tirx-followups branch from 459fa0e to 2b2adc8 Compare June 2, 2026 20:00
Follow-up work on top of the TIRx infrastructure (apache#19581):

- op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into
  reg + gmem_smem + ldgsts; tcgen05.ld/st .16x{64,128,256}b dispatch +
  factory + M=128 layout; element-wise broadcast at the layout level +
  copy vec-alignment fix
- gemm: CUDA synchronous mma.sync tensor-core dispatch; accept Layout F C
  operand for M=64 MMAs
- op: add permute_layout primitive (removes permute_dims)
- tvmscript: Tx.jit decorator, Tx.constexpr params, Tx.wg_reg_tile
- lower-tirx: Tx.device_entry() marker replacing ScopeKind::kKernel;
  canonical thread filters (drop Tx.filter wrapper)
- codegen: typed pointer byte-offset intrinsic; remove the
  entry_cluster_sync codegen attribute
@spectrometerHBH spectrometerHBH force-pushed the apache-pr-tirx-followups branch from 2b2adc8 to 32176c5 Compare June 2, 2026 20:25
@tqchen tqchen merged commit 57c638f into apache:main Jun 2, 2026
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants