Skip to content

[TLE] Add warp-specialized pipeline support#592

Merged
zhzhcookie merged 9 commits into
triton_v3.6.xfrom
feature/tle-pipe
May 19, 2026
Merged

[TLE] Add warp-specialized pipeline support#592
zhzhcookie merged 9 commits into
triton_v3.6.xfrom
feature/tle-pipe

Conversation

@sunnycase
Copy link
Copy Markdown
Collaborator

@sunnycase sunnycase commented May 18, 2026

Summary

This PR adds the TLE pipe / warp-specialized pipeline stack for the triton_v3.6.x branch. It introduces typed tle.pipe endpoints, tle.gpu.warp_specialize integration, NVWS/mbarrier lowering, multi-reader pipe support, WGMMA scheduling/lifetime analysis, TMA store commit/wait support, and the sparse MLA TLE-pipe / TLE-FlashMLA prefill tutorial and benchmark path.

The branch has been merged with the latest origin/triton_v3.6.x before opening the PR. The diff does not include third_party/triton_shared.

What Changed

1. TLE pipe API and warp-specialized frontend

  • Added typed tle.pipe support with writer/reader endpoints, staged acquire/commit/wait/release, optional reader names, field subsets, close/check-close handling, and one-shot pipe use cases.
  • Added/updated tle.gpu.warp_specialize usage so warp functions receive restricted pipe reader/writer endpoints and can express producer/consumer roles through frontend semantics.
  • Extended SparseMLA tutorial coverage with TLE-pipe and TLE-FlashMLA prefill implementations.

2. Pipe lowering and synchronization

  • Lowered CTA-scoped TLE pipes to NVWS token/mbarrier synchronization.
  • Added multi-reader release tracking and released-field metadata so WGMMA operand lifetime waits are tied to actual storage reuse.
  • Added participant-arrive lowering for proven writer lanes, avoiding unnecessary full-partition barriers while preserving release/ready ordering.
  • Added static subview-aware membar handling for local pointer and memdesc view accesses.

3. WGMMA/TMA lowering and scheduling

  • Added a TLE WGMMA schedule/resource analysis to place warp_group_dot_commit and warp_group_dot_wait at real accumulator-consumption and operand-lifetime boundaries.
  • Preserved accumulator-chain C reuse when safe, and materialized accumulators before non-WGMMA boundaries inserted later by fence insertion.
  • Reused eligible shared-memory A/B operands directly as WGMMA descriptors, including local-load and transposed/local-alloc staging paths.
  • Kept GMMA descriptor arithmetic as normal LLVM SSA so ptxas can recognize and pack valid gdesc operands.
  • Added TMA store commit-group lowering and wait sinking support for TLE store pipelines.

4. Sparse MLA / FlashMLA-style examples and docs

  • Updated python/tutorials/tle/deepseek_v32/02-sparse-mla.py with TLE-pipe and TLE-FlashMLA prefill providers.
  • Updated tle.md and tle_cn.md with the pipe API contract, warp-specialization semantics, examples, and benchmark data.

5. Regression coverage

  • Added TLE MLIR tests under third_party/tle/test/GPU/ for pipe ops, pipe-to-NVWS/mbarrier lowering, multi-reader/released-field behavior, participant arrive, subview-aware membar, WGMMA descriptor lowering, WGMMA accumulator scheduling, TMA store commit/wait handling, and local-pointer optimizations.
  • Added/updated Python TLE unit and sparse MLA codegen coverage.

Performance

Environment recorded in tle.md:

  • GPU: NVIDIA H800
  • Data type: BF16
  • Benchmark command: python/tutorials/tle/deepseek_v32/02-sparse-mla.py --mode bench --warmup 200 --rep 500
  • FlashMLA-compatible sparse prefill cases: B=1, S=4096, H=128, HKV=1, DQK=576, DV=512, topk=2048

Latency in milliseconds:

SKV Triton TLE TLE-Pipe-Pipelined TLE-FlashMLA-Prefill TileLang-Seesaw FlashMLA
8192 9.896 6.927 4.832 4.273 5.160 3.850
32768 11.210 7.624 5.321 4.834 5.577 4.117
65536 11.655 8.378 5.731 5.305 5.786 4.348
98304 11.835 8.658 5.972 5.599 5.873 4.447
131072 11.923 8.863 6.122 5.887 5.916 4.534

Summary:

  • TLE-Pipe is 1.95x-2.11x faster than Triton on these prefill rows.
  • TLE-FlashMLA-Prefill is 2.03x-2.32x faster than Triton.
  • TLE-FlashMLA-Prefill is close to TileLang-Seesaw and remains behind FlashMLA in the largest cases, as documented in tle.md.

Validation

  • Rebuilt after merging origin/triton_v3.6.x:
    • conda run -n flagtree ninja -C build/cmake.linux-x86_64-cpython-3.10
  • Ran relevant TLE GPU lit tests:
    • conda run -n flagtree lit -v build/cmake.linux-x86_64-cpython-3.10/third_party/tle/test/GPU/test_tle_{arrive_barrier_release_fence,lower_pipe_to_nvws,lower_pipe_to_nvws_errors,lower_pipe_to_nvws_warpspec,pipe_to_mbarrier,optimize_dot_operands_wgmma_view,wgmma_accumulator_fence_lowering,wgmma_pipeline_accumulator_chain,wgmma_pipeline_serialized_warning,wgmma_shared_operand_fence,allocate_shared_memory_alias,arrive_barrier_errors,async_copy_cache_policy,membar_static_subviews,pipe_participant_commit,wgmma_descriptor_lowering}.mlir
    • Result: 16 passed
  • Sparse MLA smoke/correctness after merge:
    • PYTHONPATH=/root/repos/flagtree/python TRITON_CACHE_DIR=/tmp/tle_merge36_check conda run -n flagtree python python/tutorials/tle/deepseek_v32/02-sparse-mla.py --mode test --B 1 --S 80 --SKV 1024 --H 64 --HKV 1 --DQK 320 --DV 256 --topk 512 --skip-tle-pipe-check --skip-tle-flashmla-prefill-check
    • Result: Triton and TLE sparse MLA BF16 match reference
  • git diff --check passed.

@github-actions github-actions Bot added the hcu label May 18, 2026
Copy link
Copy Markdown
Collaborator

@zhzhcookie zhzhcookie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhzhcookie zhzhcookie changed the title [TLE] Add pipe warp-specialized pipeline support [TLE] Add warp-specialized pipeline support May 18, 2026
@zhzhcookie zhzhcookie merged commit 8f591fc into triton_v3.6.x May 19, 2026
10 checks passed
@zhzhcookie zhzhcookie deleted the feature/tle-pipe branch May 19, 2026 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants