Rust inference engine for Attention Matching (Zweiger et al., 2026) KV-cache compaction on Qwen3 models. Loads HuggingFace safetensors, runs the full AM compaction pipeline, and generates from compacted prefixes — all in a single Rust binary.
| Backend | Platform | Speed (Qwen3-0.6B) | Feature flag |
|---|---|---|---|
| MLX | Apple Silicon (M1-M4) | 72 tok/s (INT4) / 52 tok/s (bf16) | --features mlx |
| CUDA | NVIDIA GPU | ~250 tok/s (RTX 3080, bf16) | default |
# Build (first build compiles MLX-C from source, ~4 min)
cargo build --release --features mlx --no-default-features
# Generate text
cargo run --release --features mlx --no-default-features --example generate -- \
--model /path/to/Qwen3-0.6B --prompt "The capital of France is" --max-tokens 64
# Generate with INT4 quantization (1.4x faster)
cargo run --release --features mlx --no-default-features --example generate -- \
--model /path/to/Qwen3-0.6B --prompt "hello" --int4
# Run AM compaction demo
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
--model /path/to/Qwen3-0.6B

# Needs CUDA 13.x with nvcc on PATH
cargo build --release
cargo run --release --example generate -- --model /path/to/Qwen3-0.6B

Implements Attention Matching (Zweiger et al., 2026) for KV-cache compaction, with task-guided modifications from Latent Briefing (Ramp Labs, 2026) for multi-agent systems built on the Recursive Language Model (Zhang et al., 2025) framework.
The core AM equation finds compacted keys C_k, biases β, and values C_v such that:
softmax(Q · C_k^T + β) · C_v ≈ softmax(Q · K^T) · V
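Both sides of this objective can be evaluated directly on toy numbers. A minimal sketch in plain Rust, for a single query; the dimensions, the kept-key subset, and the β = 0 / C_v = kept-values choice are illustrative only, not what the fitting stages produce:

```rust
// Evaluate softmax(q . K^T + beta) . V against a compacted variant that
// keeps only a subset of keys, to make the AM objective concrete.
// All numbers here are toy values.

fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let e: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let z: f64 = e.iter().sum();
    e.into_iter().map(|v| v / z).collect()
}

/// softmax(q . K^T + beta) . V for a single query vector q.
fn attend(q: &[f64], keys: &[Vec<f64>], beta: &[f64], values: &[Vec<f64>]) -> Vec<f64> {
    let logits: Vec<f64> = keys
        .iter()
        .zip(beta)
        .map(|(k, b)| q.iter().zip(k).map(|(qi, ki)| qi * ki).sum::<f64>() + b)
        .collect();
    let w = softmax(&logits);
    let mut out = vec![0.0; values[0].len()];
    for (wi, v) in w.iter().zip(values) {
        for (o, vj) in out.iter_mut().zip(v) {
            *o += wi * vj;
        }
    }
    out
}

fn main() {
    let q = vec![0.6, 0.2];
    let keys = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.8, 0.4]];
    let values = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.5, 0.5]];

    // Right-hand side: the full cache (beta = 0 everywhere).
    let full = attend(&q, &keys, &vec![0.0; 3], &values);

    // Left-hand side: keep keys {0, 2}, beta = 0, C_v = the kept values.
    let kept = [0usize, 2];
    let c_k: Vec<Vec<f64>> = kept.iter().map(|&i| keys[i].clone()).collect();
    let c_v: Vec<Vec<f64>> = kept.iter().map(|&i| values[i].clone()).collect();
    let compacted = attend(&q, &c_k, &vec![0.0; 2], &c_v);

    let err: f64 = full
        .iter()
        .zip(&compacted)
        .map(|(a, b)| (a - b).powi(2))
        .sum::<f64>()
        .sqrt();
    println!("approximation error before fitting beta/C_v: {err:.4}");
}
```

With β = 0 and raw kept values the gap is just the dropped keys' attention mass; fitting β and C_v in Stages 2-3 is what drives this error down.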
Pipeline:
- Forward pass — process trajectory tokens, capture KV cache
- Stage 1: Scoring — task-guided attention scoring with shared global token selection via MAD thresholding (Latent Briefing modification: uses task queries instead of context queries, shared mask across all heads)
- Stage 2: Beta fitting — NNLS via projected gradient descent fits bias corrections so softmax over kept keys approximates original distribution
- Stage 3: C_v fitting — ridge regression (X^T X + λI)^{-1} X^T Y reconstructs value vectors (regularized variant of AM's OLS formulation)
- Generation — decode from compacted prefix with β injected as additive attention bias
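Stages 2 and 3 above are both small classical solves. A minimal sketch of each solver shape in plain Rust, on toy data; the sizes, step size, iteration count, and λ are illustrative choices (the pipeline runs these fits per attention head, not on scalars like these):

```rust
// Stage 2 shape: NNLS, min ||A x - b||^2 subject to x >= 0, via projected
// gradient descent: take a gradient step, then clamp negatives to zero.
fn nnls_pgd(a: &[Vec<f64>], b: &[f64], step: f64, iters: usize) -> Vec<f64> {
    let n = a[0].len();
    let mut x = vec![0.0; n];
    for _ in 0..iters {
        // Residual r = A x - b.
        let r: Vec<f64> = a
            .iter()
            .zip(b)
            .map(|(row, bi)| row.iter().zip(&x).map(|(aij, xj)| aij * xj).sum::<f64>() - bi)
            .collect();
        for j in 0..n {
            // Gradient component 2 (A^T r)_j, then the projected update.
            let g: f64 = a.iter().zip(&r).map(|(row, ri)| 2.0 * row[j] * ri).sum();
            x[j] = (x[j] - step * g).max(0.0);
        }
    }
    x
}

// Stage 3 shape: ridge regression (X^T X + lambda I)^{-1} X^T y, written out
// for two features so the 2x2 inverse has a closed form.
fn ridge_2d(x: &[[f64; 2]], y: &[f64], lambda: f64) -> [f64; 2] {
    let (mut a11, mut a12, mut a22) = (lambda, 0.0, lambda);
    let (mut b1, mut b2) = (0.0, 0.0);
    for (row, &yi) in x.iter().zip(y) {
        a11 += row[0] * row[0];
        a12 += row[0] * row[1];
        a22 += row[1] * row[1];
        b1 += row[0] * yi;
        b2 += row[1] * yi;
    }
    let det = a11 * a22 - a12 * a12;
    [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
}

fn main() {
    let a = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let b = vec![1.0, 2.0, 3.0]; // consistent system, solved by x = [1, 2]
    let x = nnls_pgd(&a, &b, 0.1, 2000);
    println!("nnls  x = [{:.3}, {:.3}]", x[0], x[1]);

    let xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
    let w = ridge_2d(&xs, &b, 1e-6); // tiny lambda ~ OLS on the same data
    println!("ridge w = [{:.3}, {:.3}]", w[0], w[1]);
}
```

Projected gradient descent is attractive here because each step is just a matmul plus a clamp, which maps cleanly onto the GPU backends; a direct active-set NNLS solver would be harder to batch.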
# End-to-end demo: compacts 48 tokens to 17 (65% reduction), generates answer
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
--model /path/to/Qwen3-0.6B \
--trajectory "The Eiffel Tower is located in Paris, France. It was built in 1889." \
--task "What year was the Eiffel Tower built?"
# → Answer (compacted): The answer is 1889.

Qwen3-0.6B on Apple M4:
| Config | Decode speed | Model memory |
|---|---|---|
| bf16 | 52 tok/s | 865 MB |
| INT4 | 72 tok/s | ~265 MB |
AM compaction: 0.09s for 48-token trajectory, 35% retention.
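The INT4 row above reflects 4-bit weight storage with per-group scales. As a rough illustration of why that shrinks weight memory several-fold, here is a minimal symmetric group-quantization sketch in plain Rust; the group size and symmetric scheme are assumptions for the example, not necessarily what the MLX backend's `Weight` enum uses:

```rust
// Quantize f32 weights to 4 bits with one scale per group, packing two
// 4-bit codes per byte, then dequantize back. Symmetric scheme: codes in
// [-8, 7] stored offset by +8.
fn quantize_int4(w: &[f32], group: usize) -> (Vec<u8>, Vec<f32>) {
    let mut packed = Vec::new();
    let mut scales = Vec::new();
    for chunk in w.chunks(group) {
        let max = chunk.iter().fold(0f32, |m, &v| m.max(v.abs()));
        let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
        scales.push(scale);
        for pair in chunk.chunks(2) {
            let q = |v: f32| ((v / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
            let lo = q(pair[0]);
            let hi = if pair.len() > 1 { q(pair[1]) } else { 0 };
            packed.push(lo | (hi << 4));
        }
    }
    (packed, scales)
}

fn dequantize_int4(packed: &[u8], scales: &[f32], group: usize) -> Vec<f32> {
    let mut out = Vec::new();
    for (gi, bytes) in packed.chunks(group / 2).enumerate() {
        let s = scales[gi];
        for &b in bytes {
            out.push(((b & 0x0f) as i8 - 8) as f32 * s);
            out.push(((b >> 4) as i8 - 8) as f32 * s);
        }
    }
    out
}

fn main() {
    let w: Vec<f32> = vec![0.7, -0.7, 0.1, 0.0, 0.35, -0.35, 0.2, -0.1];
    let (packed, scales) = quantize_int4(&w, 4);
    let back = dequantize_int4(&packed, &scales, 4);
    // 8 f32 weights (32 bytes) become 4 packed bytes + 2 f32 scales.
    println!("packed bytes: {}, scales: {}", packed.len(), scales.len());
    for (orig, deq) in w.iter().zip(&back) {
        assert!((orig - deq).abs() <= scales[0].max(scales[1]) / 2.0 + 1e-6);
    }
}
```

Per weight this is 4 bits plus the amortized scale, versus 16 bits for bf16, which is consistent with the ~3x model-memory drop in the table.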
src/
├── backend/
│ ├── cuda/ # NVIDIA GPU backend (cuBLAS, cuSOLVER)
│ └── mlx_backend/ # Apple Silicon backend (mlx-rs)
├── model/ # Qwen3 transformer (attention, MLP, norms)
│ └── weight.rs # Weight enum (bf16 / INT4 quantized)
├── am/ # AM compaction pipeline (scoring, beta, c_v)
├── kv_cache.rs # KV cache management
├── weights.rs # Safetensors loader
├── session.rs # High-level API
└── config.rs # Model + inference config
- Apple Silicon: macOS, Xcode Command Line Tools, CMake (for MLX-C build)
- NVIDIA: CUDA 13.x, nvcc, Rust 1.75+
MIT