Skip to content

v0.2.0 — SGLang, LongLive, and multi-backend / multi-hardware support

Latest

Choose a tag to compare

@WeianMao WeianMao released this 22 Apr 04:50
· 1 commit to main since this release

Highlights

TriAttention achieves 2.5× throughput and 10.7× KV memory reduction on AIME25 long reasoning while matching Full Attention accuracy (40.8 vs 40.8). This is the first tagged release of the project, and it adds SGLang as a
first-class inference backend alongside the existing vLLM path.

What's new in 0.2.0

  • SGLang backend (triattention.sglang) — scheduler / worker hooks, per-head KV compaction, TP-aware stats sharding, and a launcher entry point. Launch via python -m triattention.sglang --model-path <path>; no upstream SGLang
    changes required. See the SGLang Integration guide.

Supported inference backends

Backend Status Notes
vLLM Stable Primary production path; OpenAI-compatible server, works with OpenClaw. See the OpenClaw guide.
SGLang New in 0.2.0 See the SGLang guide.
MLX (Apple Silicon) Experimental M1 / M2 / M3 / M4 via mlx-lm. See the MLX guide.
llama.cpp (ggml, ROCm) Community AMD GPU port by @domvoxtriattention-ggml.

Applications

  • Long video generation — autoregressive video with KV cache compression via LongLive.

Hardware ecosystem

  • NVIDIA DGX Spark (GB10 / sm-121) — community enablement by @dscain; vLLM path merged, non-vLLM path in progress.
  • Apple Silicon (M1 / M2 / M3 / M4) — via MLX, contributed by @DeadByDawn101 (RavenX AI).
  • AMD GPUs — via the community llama.cpp + ggml ROCm port.

Quick start (SGLang)

pip install -e .                                                                                                                                                                                                                         
pip install sglang[all]                                                                                                                                                                                                                  
                                                                                       
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/calibration/qwen3_8b.pt
export TRIATTN_RUNTIME_KV_BUDGET=2048                                                                                                                                                                                                    
                                         
python -m triattention.sglang \                                                                                                                                                                                                          
    --model-path <model_path> \                                                        
    --dtype bfloat16 \                                                                                                                                                                                                                   
    --context-length 32768 \                                                                                                                                                                                                             
    --trust-remote-code                                                                                                                                                                                                                  

Compatibility

  • Python: 3.10+
  • Existing vLLM users: no behavior change. This release includes a minor runtime config relaxation so that non-Triton scoring paths are permitted, which the SGLang integration uses by default.
  • Radix cache / prefix caching: not supported with SGLang + KV compression (launcher auto-sets --disable-radix-cache).

Documentation