Highlights
TriAttention achieves 2.5× throughput and 10.7× KV memory reduction on AIME25 long reasoning while matching Full Attention accuracy (40.8 vs 40.8). This is the first tagged release of the project, and it adds SGLang as a
first-class inference backend alongside the existing vLLM path.
What's new in 0.2.0
- SGLang backend (triattention.sglang): scheduler / worker hooks, per-head KV compaction, TP-aware stats sharding, and a launcher entry point. Launch via python -m triattention.sglang --model-path <path>; no upstream SGLang changes required. See the SGLang Integration guide.
Supported inference backends
| Backend | Status | Notes |
|---|---|---|
| vLLM | Stable | Primary production path; OpenAI-compatible server, works with OpenClaw. See the OpenClaw guide. |
| SGLang | New in 0.2.0 | See the SGLang guide. |
| MLX (Apple Silicon) | Experimental | M1 / M2 / M3 / M4 via mlx-lm. See the MLX guide. |
| llama.cpp (ggml, ROCm) | Community | AMD GPU port by @domvox — triattention-ggml. |
Applications
- Long video generation — autoregressive video with KV cache compression via LongLive.
Hardware ecosystem
- NVIDIA DGX Spark (GB10 / sm-121) — community enablement by @dscain; vLLM path merged, non-vLLM path in progress.
- Apple Silicon (M1 / M2 / M3 / M4) — via MLX, contributed by @DeadByDawn101 (RavenX AI).
- AMD GPUs — via the community llama.cpp + ggml ROCm port.
Quick start (SGLang)
pip install -e .
pip install "sglang[all]"
export TRIATTN_RUNTIME_SPARSE_STATS_PATH=triattention/calibration/qwen3_8b.pt
export TRIATTN_RUNTIME_KV_BUDGET=2048
python -m triattention.sglang \
  --model-path <model_path> \
  --dtype bfloat16 \
  --context-length 32768 \
  --trust-remote-code
Compatibility
- Python: 3.10+
- Existing vLLM users: no behavior change. This release slightly relaxes the runtime config to permit non-Triton scoring paths, which the SGLang integration uses by default.
- Radix cache / prefix caching: not supported with SGLang + KV compression (the launcher auto-sets --disable-radix-cache).
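For a rough sense of what the quick-start settings mean in memory terms, the sketch below estimates per-sequence KV cache size at the full context length (32768) and at the configured budget (2048). The model shape (36 layers, 8 KV heads, head dimension 128) is an assumption chosen for illustration and is not taken from this release; substitute the values from your model's config. It also assumes the budget is a per-sequence token count applied uniformly across layers and heads, which these notes do not state.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope KV cache estimate. All model-shape numbers are assumptions.
layers=36          # assumed number of transformer layers
kv_heads=8         # assumed number of KV heads (grouped-query attention)
head_dim=128       # assumed per-head dimension
bytes_per_elem=2   # bfloat16, matching --dtype bfloat16 in the quick start

context_length=32768   # --context-length from the quick start
kv_budget=2048         # TRIATTN_RUNTIME_KV_BUDGET from the quick start

# Keys and values each hold layers * kv_heads * head_dim elements per token.
bytes_per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))

# Convert total bytes to MiB for the uncompressed and budgeted caches.
full_mib=$(( bytes_per_token * context_length / 1024 / 1024 ))
budget_mib=$(( bytes_per_token * kv_budget / 1024 / 1024 ))

echo "KV bytes per token: ${bytes_per_token}"
echo "Full cache at ${context_length} tokens: ${full_mib} MiB"
echo "Cache at budget ${kv_budget}: ${budget_mib} MiB"
echo "Ratio: $(( context_length / kv_budget ))x"
```

Under these assumptions the script reports 4608 MiB for the full cache and 288 MiB at the budget. This is a static estimate for one sequence that fills the whole context window; it does not reproduce the measured 10.7× figure from the AIME25 benchmark.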