[TLE] Add warp-specialized pipeline support by sunnycase · Pull Request #592 · flagos-ai/FlagTree

sunnycase · 2026-05-18T09:10:36Z

Summary

This PR adds the TLE pipe / warp-specialized pipeline stack for the triton_v3.6.x branch. It introduces typed tle.pipe endpoints, tle.gpu.warp_specialize integration, NVWS/mbarrier lowering, multi-reader pipe support, WGMMA scheduling/lifetime analysis, TMA store commit/wait support, and the sparse MLA TLE-pipe / TLE-FlashMLA prefill tutorial and benchmark path.

The branch has been merged with the latest origin/triton_v3.6.x before opening the PR. The diff does not include third_party/triton_shared.

What Changed

1. TLE pipe API and warp-specialized frontend

Added typed tle.pipe support with writer/reader endpoints, staged acquire/commit/wait/release, optional reader names, field subsets, close/check-close handling, and one-shot pipe use cases.
Added/updated tle.gpu.warp_specialize usage so warp functions receive restricted pipe reader/writer endpoints and can express producer/consumer roles through frontend semantics.
Extended SparseMLA tutorial coverage with TLE-pipe and TLE-FlashMLA prefill implementations.

2. Pipe lowering and synchronization

Lowered CTA-scoped TLE pipes to NVWS token/mbarrier synchronization.
Added multi-reader release tracking and released-field metadata so WGMMA operand lifetime waits are tied to actual storage reuse.
Added participant-arrive lowering for proven writer lanes, avoiding unnecessary full-partition barriers while preserving release/ready ordering.
Added static subview-aware membar handling for local pointer and memdesc view accesses.

3. WGMMA/TMA lowering and scheduling

Added a TLE WGMMA schedule/resource analysis to place warp_group_dot_commit and warp_group_dot_wait at real accumulator-consumption and operand-lifetime boundaries.
Preserved accumulator-chain C reuse when safe, and materialized accumulators before non-WGMMA boundaries inserted later by fence insertion.
Reused eligible shared-memory A/B operands directly as WGMMA descriptors, including local-load and transposed/local-alloc staging paths.
Kept GMMA descriptor arithmetic as normal LLVM SSA so ptxas can recognize and pack valid gdesc operands.
Added TMA store commit-group lowering and wait sinking support for TLE store pipelines.

4. Sparse MLA / FlashMLA-style examples and docs

Updated python/tutorials/tle/deepseek_v32/02-sparse-mla.py with TLE-pipe and TLE-FlashMLA prefill providers.
Updated tle.md and tle_cn.md with the pipe API contract, warp-specialization semantics, examples, and benchmark data.

5. Regression coverage

Added TLE MLIR tests under third_party/tle/test/GPU/ for pipe ops, pipe-to-NVWS/mbarrier lowering, multi-reader/released-field behavior, participant arrive, subview-aware membar, WGMMA descriptor lowering, WGMMA accumulator scheduling, TMA store commit/wait handling, and local-pointer optimizations.
Added/updated Python TLE unit and sparse MLA codegen coverage.

Performance

Environment recorded in tle.md:

GPU: NVIDIA H800
Data type: BF16
Benchmark command: python/tutorials/tle/deepseek_v32/02-sparse-mla.py --mode bench --warmup 200 --rep 500
FlashMLA-compatible sparse prefill cases: B=1, S=4096, H=128, HKV=1, DQK=576, DV=512, topk=2048

Latency in milliseconds:

SKV	Triton	TLE	TLE-Pipe-Pipelined	TLE-FlashMLA-Prefill	TileLang-Seesaw	FlashMLA
8192	9.896	6.927	4.832	4.273	5.160	3.850
32768	11.210	7.624	5.321	4.834	5.577	4.117
65536	11.655	8.378	5.731	5.305	5.786	4.348
98304	11.835	8.658	5.972	5.599	5.873	4.447
131072	11.923	8.863	6.122	5.887	5.916	4.534

Summary:

TLE-Pipe is 1.95x-2.11x faster than Triton on these prefill rows.
TLE-FlashMLA-Prefill is 2.03x-2.32x faster than Triton.
TLE-FlashMLA-Prefill is close to TileLang-Seesaw and remains behind FlashMLA in the largest cases, as documented in tle.md.

Validation

Rebuilt after merging origin/triton_v3.6.x:
- conda run -n flagtree ninja -C build/cmake.linux-x86_64-cpython-3.10
Ran relevant TLE GPU lit tests:
- conda run -n flagtree lit -v build/cmake.linux-x86_64-cpython-3.10/third_party/tle/test/GPU/test_tle_{arrive_barrier_release_fence,lower_pipe_to_nvws,lower_pipe_to_nvws_errors,lower_pipe_to_nvws_warpspec,pipe_to_mbarrier,optimize_dot_operands_wgmma_view,wgmma_accumulator_fence_lowering,wgmma_pipeline_accumulator_chain,wgmma_pipeline_serialized_warning,wgmma_shared_operand_fence,allocate_shared_memory_alias,arrive_barrier_errors,async_copy_cache_policy,membar_static_subviews,pipe_participant_commit,wgmma_descriptor_lowering}.mlir
- Result: 16 passed
Sparse MLA smoke/correctness after merge:
- PYTHONPATH=/root/repos/flagtree/python TRITON_CACHE_DIR=/tmp/tle_merge36_check conda run -n flagtree python python/tutorials/tle/deepseek_v32/02-sparse-mla.py --mode test --B 1 --S 80 --SKV 1024 --H 64 --HKV 1 --DQK 320 --DV 256 --topk 512 --skip-tle-pipe-check --skip-tle-flashmla-prefill-check
- Result: Triton and TLE sparse MLA BF16 match reference
git diff --check passed.

…pipe

zhzhcookie

LGTM

sunnycase added 4 commits May 9, 2026 06:21

Add TLE pipe support

2c3c296

Add TLE pipe warp-specialized pipeline support

37bdfef

Refine TLE pipe WGMMA lowering

a49f11b

Merge remote-tracking branch 'origin/triton_v3.6.x' into feature/tle-…

8c8e12b

…pipe

sunnycase requested a review from zhzhcookie as a code owner May 18, 2026 09:10

sunnycase added nvidia tle labels May 18, 2026

github-actions Bot added the triton_v3.6.x label May 18, 2026

Apply code-format changes

fe78a99

github-actions Bot added the hcu label May 18, 2026

Fix TLE CI build dependencies

66c4056

sunnycase force-pushed the feature/tle-pipe branch from 5ac138b to 66c4056 Compare May 18, 2026 09:41

sunnycase added 3 commits May 18, 2026 10:33

Fix TLE WGMMA scheduling without else

b9cab53

Fix TLE remote if data duplication

d170741

Fix TLE shared load vector alignment

f4296c4

zhzhcookie approved these changes May 18, 2026

View reviewed changes

zhzhcookie changed the title ~~[TLE] Add pipe warp-specialized pipeline support~~ [TLE] Add warp-specialized pipeline support May 18, 2026

zhzhcookie merged commit 8f591fc into triton_v3.6.x May 19, 2026
10 checks passed

zhzhcookie deleted the feature/tle-pipe branch May 19, 2026 05:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TLE] Add warp-specialized pipeline support#592

[TLE] Add warp-specialized pipeline support#592
zhzhcookie merged 9 commits into
triton_v3.6.xfrom
feature/tle-pipe

sunnycase commented May 18, 2026 •

edited

Loading

Uh oh!

zhzhcookie left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sunnycase commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

1. TLE pipe API and warp-specialized frontend

2. Pipe lowering and synchronization

3. WGMMA/TMA lowering and scheduling

4. Sparse MLA / FlashMLA-style examples and docs

5. Regression coverage

Performance

Validation

Uh oh!

zhzhcookie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sunnycase commented May 18, 2026 •

edited

Loading