Release v0.2.0 — SGLang, LongLive, and multi-backend / multi-hardware support · WeianMao/triattention

Highlights

On AIME25 long reasoning, TriAttention achieves 2.5× throughput and a 10.7× KV memory reduction while matching Full Attention accuracy (40.8 vs. 40.8). This is the first tagged release of the project; it adds SGLang as a first-class inference backend alongside the existing vLLM path.

What's new in 0.2.0

SGLang backend (triattention.sglang) — scheduler / worker hooks, per-head KV compaction, TP-aware stats sharding, and a launcher entry point. Launch via python -m triattention.sglang --model-path <path>; no upstream SGLang
changes required. See the SGLang Integration guide.

Supported inference backends

Backend	Status	Notes
vLLM	Stable	Primary production path; OpenAI-compatible server, works with OpenClaw. See the OpenClaw guide.
SGLang	New in 0.2.0	See the SGLang guide.
MLX (Apple Silicon)	Experimental	M1 / M2 / M3 / M4 via `mlx-lm`. See the MLX guide.
llama.cpp (ggml, ROCm)	Community	AMD GPU port by @domvox — triattention-ggml.

Applications

Long video generation — autoregressive video with KV cache compression via LongLive.

Hardware ecosystem

NVIDIA DGX Spark (GB10 / sm-121) — community enablement by @dscain; vLLM path merged, non-vLLM path in progress.
Apple Silicon (M1 / M2 / M3 / M4) — via MLX, contributed by @DeadByDawn101 (RavenX AI).
AMD GPUs — via the community llama.cpp + ggml ROCm port.

Quick start (SGLang)

pip install -e .
pip install "sglang[all]"

export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/calibration/qwen3_8b.pt
export TRIATTN_RUNTIME_KV_BUDGET=2048

python -m triattention.sglang \
    --model-path <model_path> \
    --dtype bfloat16 \
    --context-length 32768 \
    --trust-remote-code
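
For scripted launches, the quick-start commands can be mirrored from Python. The following is a minimal sketch, not part of the triattention package: the helper name `build_sglang_launch` is hypothetical, and it only assembles the two environment variables and the launcher flags shown above, without starting anything.

```python
import os
import sys


def build_sglang_launch(model_path, stats_path, kv_budget=2048,
                        context_length=32768, dtype="bfloat16"):
    """Return (env, argv) mirroring the quick-start shell commands.

    Hypothetical helper for illustration; it does not launch the server.
    """
    if kv_budget <= 0:
        raise ValueError("kv_budget must be a positive integer")

    # Copy the current environment and add the two TriAttention settings
    # that the quick start exports.
    env = dict(os.environ)
    env["TRIATTN_RUNTIME_SPARSE_STATS_PATH"] = str(stats_path)
    env["TRIATTN_RUNTIME_KV_BUDGET"] = str(kv_budget)

    # Same flags as the quick start, using the current interpreter.
    argv = [
        sys.executable, "-m", "triattention.sglang",
        "--model-path", str(model_path),
        "--dtype", dtype,
        "--context-length", str(context_length),
        "--trust-remote-code",
    ]
    return env, argv
```

To start the server from the result, pass both values to the standard library, for example `subprocess.run(argv, env=env, check=True)`.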

Compatibility

Python: 3.10+
Existing vLLM users: no behavior change. This release slightly relaxes the runtime config so that non-Triton scoring paths are permitted; the SGLang integration uses such a path by default.
Radix cache / prefix caching: not supported with SGLang + KV compression (launcher auto-sets --disable-radix-cache).
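
The radix cache note can be illustrated with a small argument-normalizing step. This is a sketch under assumptions, not the launcher's actual code: the function name `ensure_radix_cache_disabled` is hypothetical, and the only detail taken from this release is the `--disable-radix-cache` flag itself.

```python
def ensure_radix_cache_disabled(argv):
    """Return a copy of argv that contains --disable-radix-cache once.

    Hypothetical illustration of what "auto-sets" could look like;
    the real launcher may do this differently.
    """
    flag = "--disable-radix-cache"
    args = list(argv)  # do not mutate the caller's list
    if flag not in args:
        args.append(flag)
    return args
```

Because the launcher sets the flag itself, passing it by hand should not be necessary; a wrapper like this is only relevant if you assemble SGLang arguments yourself.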
