[TIRx] Post-bringup op-dispatch / codegen / TVMScript follow-ups #19657
Code Review
This pull request refactors the TVM TIRx execution scopes and dispatchers, removing the legacy kernel and world scopes in favor of a flat Tx.device_entry() marker, introducing a canonical thread-filter grammar, and rewriting the copy and elementwise dispatchers to use a unified layout-alignment and partitioning algorithm. It also adds support for Blackwell tcgen05 datapath layouts, replaces permute_dims with permute_layout for swizzled warp transposes, and optimizes the integer set evaluator to prevent exponential re-expansion on deep variable dependency chains. A critical correctness bug was identified in the _carve_tail helper, where the failure to fuse the consumed suffix into a single element results in incorrect outer loop iterations and misaligned vector memory accesses.
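The exponential re-expansion mentioned for the integer set evaluator can be illustrated with a small, self-contained sketch. It is not TVM's evaluator: the `make_chain`, `eval_naive`, and `eval_memo` names and the tuple-based definition format are hypothetical, chosen only to show why a deep chain of variables that each reference their predecessor twice blows up without a cache, and how memoizing per-variable results keeps the work linear.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # closed integer interval (lo, hi)


def make_chain(depth: int) -> Dict[str, tuple]:
    """Build a hypothetical dependency chain v0 -> v1 -> ... -> v<depth>.

    v0 is a plain range; every later variable is the sum of two references
    to the previous variable, so naive expansion doubles the work per level.
    """
    defs: Dict[str, tuple] = {"v0": ("range", 0, 3)}
    for i in range(1, depth + 1):
        defs[f"v{i}"] = ("add", f"v{i - 1}", f"v{i - 1}")
    return defs


def eval_naive(name: str, defs: Dict[str, tuple], visits: List[int]) -> Interval:
    """Re-expand every reference from scratch (exponential in chain depth)."""
    visits[0] += 1
    node = defs[name]
    if node[0] == "range":
        return (node[1], node[2])
    lo_a, hi_a = eval_naive(node[1], defs, visits)
    lo_b, hi_b = eval_naive(node[2], defs, visits)
    return (lo_a + lo_b, hi_a + hi_b)


def eval_memo(
    name: str,
    defs: Dict[str, tuple],
    visits: List[int],
    cache: Optional[Dict[str, Interval]] = None,
) -> Interval:
    """Evaluate each variable once and reuse the cached interval afterwards."""
    if cache is None:
        cache = {}
    if name in cache:
        return cache[name]
    visits[0] += 1  # counts cache misses only
    node = defs[name]
    if node[0] == "range":
        result = (node[1], node[2])
    else:
        lo_a, hi_a = eval_memo(node[1], defs, visits, cache)
        lo_b, hi_b = eval_memo(node[2], defs, visits, cache)
        result = (lo_a + lo_b, hi_a + hi_b)
    cache[name] = result
    return result


if __name__ == "__main__":
    defs = make_chain(10)
    naive_visits, memo_visits = [0], [0]
    print(eval_naive("v10", defs, naive_visits), naive_visits[0])
    print(eval_memo("v10", defs, memo_visits), memo_visits[0])
```

On a chain of depth 10, both evaluators return the same interval, but the naive one visits 2047 nodes while the memoized one visits 11. Whether the PR uses a cache of this form or a different strategy is not stated in the review summary.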
@tvm-bot rerun
459fa0e to 2b2adc8
Follow-up work on top of the TIRx infrastructure (apache#19581):
- op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into reg + gmem_smem + ldgsts; tcgen05.ld/st .16x{64,128,256}b dispatch + factory + M=128 layout; element-wise broadcast at the layout level + copy vec-alignment fix
- gemm: CUDA synchronous mma.sync tensor-core dispatch; accept Layout F C operand for M=64 MMAs
- op: add permute_layout primitive (removes permute_dims)
- tvmscript: Tx.jit decorator, Tx.constexpr params, Tx.wg_reg_tile
- lower-tirx: Tx.device_entry() marker replacing ScopeKind::kKernel; canonical thread filters (drop Tx.filter wrapper)
- codegen: typed pointer byte-offset intrinsic; remove the entry_cluster_sync codegen attribute
2b2adc8 to 32176c5
Summary
Follow-up work on top of the TIRx infrastructure bring-up (#19581). It extends the TIRx operator-dispatch, codegen, and TVMScript surfaces with the next batch of low-level programming features for Blackwell-class GPUs, while keeping s_tirscript support intact.
Main Changes
- op-dispatch: warp ldmatrix/stmatrix copy dispatch; split CUDA copy into register / gmem-smem / ldgsts paths; tcgen05.ld/st .16x{64,128,256}b dispatch with a factory and M=128 layout; element-wise broadcast at the layout level with a copy vec-alignment fix.
- gemm: CUDA synchronous mma.sync tensor-core dispatch; accept a Layout F C operand for M=64 MMAs.
- op: permute_layout primitive (replaces permute_dims).
- tvmscript: Tx.jit decorator, Tx.constexpr compile-time params, and Tx.wg_reg_tile.
- lower-tirx: Tx.device_entry() marker (replacing ScopeKind::kKernel); canonical thread filters that drop the Tx.filter wrapper.
- codegen: typed pointer byte-offset intrinsic; remove the entry_cluster_sync codegen attribute.
Validation
- pre-commit run (changed files): clean
- ninja -C build -j$(nproc): builds
- pytest tests/python/tirx/ -n 16: 1997 passed, 39 skipped, 3 xpassed
- python -m pytest tests/python/all-platform-minimal-test: 37 passed, 105 skipped
- TVM_TEST_TARGETS=llvm pytest tests/python/tirx-analysis tests/python/tirx-base tests/python/tirx-transform -n 16: 630 passed, 25 skipped, 8 xfailed, 1 xpassed
Local CI Notes
Several full CI-equivalent jobs are not locally reproducible because this machine is missing parts of the Apache TVM CI environment (e.g., specific llvm-config versions, Vulkan, ROCm, ARM/QEMU cross-toolchain, and web/wasm components). The Blackwell/Trainium kernel tests are maintained downstream and are intentionally not part of this PR.