
# am-compact

Rust inference engine for Attention Matching (Zweiger et al., 2026) KV-cache compaction on Qwen3 models. Loads HuggingFace safetensors, runs the full AM compaction pipeline, and generates from compacted prefixes — all in a single Rust binary.

## Backends

| Backend | Platform | Speed (Qwen3-0.6B) | Feature flag |
| --- | --- | --- | --- |
| MLX | Apple Silicon (M1–M4) | 72 tok/s (INT4) / 52 tok/s (bf16) | `--features mlx` |
| CUDA | NVIDIA GPU | ~250 tok/s (RTX 3080, bf16) | default |

## Quick Start (Apple Silicon)

```sh
# Build (first build compiles MLX-C from source, ~4 min)
cargo build --release --features mlx --no-default-features

# Generate text
cargo run --release --features mlx --no-default-features --example generate -- \
  --model /path/to/Qwen3-0.6B --prompt "The capital of France is" --max-tokens 64

# Generate with INT4 quantization (1.4x faster)
cargo run --release --features mlx --no-default-features --example generate -- \
  --model /path/to/Qwen3-0.6B --prompt "hello" --int4

# Run AM compaction demo
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
  --model /path/to/Qwen3-0.6B
```

## Quick Start (NVIDIA GPU)

```sh
# Needs CUDA 13.x with nvcc on PATH
cargo build --release
cargo run --release --example generate -- --model /path/to/Qwen3-0.6B
```

## AM Compaction

Implements Attention Matching (Zweiger et al., 2026) for KV cache compaction, with task-guided modifications from Latent Briefing (Ramp Labs, 2026) for multi-agent systems built on the Recursive Language Model (Zhang et al., 2025) framework.

The core AM equation finds compacted keys C_k, biases β, and values C_v such that:

```
softmax(Q · C_k^T + β) · C_v ≈ softmax(Q · K^T) · V
```
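
The equality is exact when nothing is dropped; compaction only has to keep it approximately true after pruning keys. A minimal single-query sketch in plain Rust (illustrative only, not this crate's API; `attend` and the toy numbers are made up) shows why pruning a near-ignored key is cheap, and why the remaining error is what the β and C_v fits must absorb:

```rust
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / z).collect()
}

/// softmax(q · K^T + β) · V for a single query row, with one additive
/// bias per key — the shape of both sides of the AM equation.
fn attend(q: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>], beta: &[f64]) -> Vec<f64> {
    let logits: Vec<f64> = keys
        .iter()
        .zip(beta)
        .map(|(k, b)| q.iter().zip(k).map(|(x, y)| x * y).sum::<f64>() + b)
        .collect();
    let w = softmax(&logits);
    let mut out = vec![0.0; values[0].len()];
    for (wi, v) in w.iter().zip(values) {
        for (o, vi) in out.iter_mut().zip(v) {
            *o += wi * vi;
        }
    }
    out
}

fn main() {
    let q = vec![1.0, 0.0];
    // Third key has logit -10, so it receives almost no attention mass.
    let keys = vec![vec![2.0, 0.0], vec![1.0, 0.0], vec![-10.0, 0.0]];
    let values = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];

    // Full attention over all three keys.
    let full = attend(&q, &keys, &values, &[0.0; 3]);
    // Crude "compaction": drop the near-ignored key, β = 0, C_v = kept V.
    let kept = attend(&q, &keys[..2], &values[..2], &[0.0; 2]);

    let err: f64 = full.iter().zip(&kept).map(|(a, b)| (a - b).abs()).sum();
    assert!(err < 1e-3); // dropping a low-attention key barely moves the output
}
```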

Pipeline:

  1. Forward pass — process trajectory tokens, capture KV cache
  2. Stage 1: Scoring — task-guided attention scoring with shared global token selection via MAD thresholding (Latent Briefing modification: uses task queries instead of context queries, shared mask across all heads)
  3. Stage 2: Beta fitting — NNLS via projected gradient descent fits bias corrections so softmax over kept keys approximates original distribution
  4. Stage 3: C_v fitting — ridge regression (X^T X + λI)^{-1} X^T Y reconstructs value vectors (regularized variant of AM's OLS formulation)
  5. Generation — decode from compacted prefix with β injected as additive attention bias
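
Stage 1's MAD thresholding can be sketched generically (a median-absolute-deviation keep rule; the scoring, the constant `k`, and `mad_keep_mask` are illustrative assumptions, not taken from the repo):

```rust
/// Median of a list (consumes its argument; sorts in place).
fn median(mut xs: Vec<f64>) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 { xs[n / 2] } else { 0.5 * (xs[n / 2 - 1] + xs[n / 2]) }
}

/// One shared keep-mask over tokens: keep a token if its score is an
/// upward outlier, i.e. exceeds median + k * MAD of all scores.
fn mad_keep_mask(scores: &[f64], k: f64) -> Vec<bool> {
    let med = median(scores.to_vec());
    let mad = median(scores.iter().map(|s| (s - med).abs()).collect());
    scores.iter().map(|&s| s > med + k * mad).collect()
}

fn main() {
    // Per-token scores, e.g. attention mass received from task queries.
    let scores = [0.01, 0.02, 0.90, 0.03, 0.75, 0.02];
    let mask = mad_keep_mask(&scores, 3.0);
    println!("{mask:?}"); // → [false, false, true, false, true, false]
}
```

Because the mask is computed once over pooled scores, it is shared across heads, matching the Latent Briefing modification described above.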

```sh
# End-to-end demo: compacts 48 tokens to 17 (65% reduction), generates answer
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
  --model /path/to/Qwen3-0.6B \
  --trajectory "The Eiffel Tower is located in Paris, France. It was built in 1889." \
  --task "What year was the Eiffel Tower built?"
# → Answer (compacted): The answer is 1889.
```
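
Stage 2's solver family, NNLS via projected gradient descent, reduces to "take a gradient step, clamp negatives to zero". A standalone sketch of that generic solver (illustrative, not this crate's implementation; the fixed step size and iteration count are assumptions):

```rust
/// Minimize ||A x - b||^2 subject to x >= 0 by projected gradient
/// descent: gradient step on the least-squares objective, then
/// projection onto the nonnegative orthant (clamp at zero).
fn nnls_pgd(a: &[Vec<f64>], b: &[f64], steps: usize, lr: f64) -> Vec<f64> {
    let n = a[0].len();
    let mut x = vec![0.0; n];
    for _ in 0..steps {
        // Residual r = A x - b.
        let r: Vec<f64> = a
            .iter()
            .zip(b)
            .map(|(row, &bi)| {
                row.iter().zip(&x).map(|(aij, xj)| aij * xj).sum::<f64>() - bi
            })
            .collect();
        // Gradient g = A^T r; step, then project onto x >= 0.
        for j in 0..n {
            let g: f64 = a.iter().zip(&r).map(|(row, ri)| row[j] * ri).sum();
            x[j] = (x[j] - lr * g).max(0.0);
        }
    }
    x
}

fn main() {
    // b is exactly reachable with nonnegative x = [1, 2].
    let a = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let b = vec![1.0, 2.0, 3.0];
    let x = nnls_pgd(&a, &b, 2000, 0.1);
    assert!((x[0] - 1.0).abs() < 1e-3 && (x[1] - 2.0).abs() < 1e-3);
}
```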
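
Stage 3's ridge closed form (X^T X + λI)^{-1} X^T Y can likewise be written out for a two-feature toy case, where the 2×2 Gram matrix inverts by hand (again a sketch of the formula, not the crate's on-device solver):

```rust
/// Ridge regression weights (X^T X + lambda*I)^{-1} X^T y for a
/// design matrix with two features, using the explicit 2x2 inverse.
fn ridge_2d(x: &[[f64; 2]], y: &[f64], lambda: f64) -> [f64; 2] {
    // Accumulate G = X^T X + lambda*I and c = X^T y.
    let (mut g00, mut g01, mut g11) = (lambda, 0.0, lambda);
    let (mut c0, mut c1) = (0.0, 0.0);
    for (row, &yi) in x.iter().zip(y) {
        g00 += row[0] * row[0];
        g01 += row[0] * row[1];
        g11 += row[1] * row[1];
        c0 += row[0] * yi;
        c1 += row[1] * yi;
    }
    // Invert the symmetric 2x2 Gram matrix and apply it to c.
    let det = g00 * g11 - g01 * g01;
    [(g11 * c0 - g01 * c1) / det, (g00 * c1 - g01 * c0) / det]
}

fn main() {
    // y = 2*x0 + 3*x1 exactly; as lambda -> 0 the fit recovers [2, 3].
    let x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
    let y = [2.0, 3.0, 5.0];
    let w = ridge_2d(&x, &y, 1e-9);
    assert!((w[0] - 2.0).abs() < 1e-6 && (w[1] - 3.0).abs() < 1e-6);
}
```

The λI term is what distinguishes this from AM's plain OLS formulation: it keeps the Gram matrix well-conditioned when the kept keys are few or nearly collinear.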

## Benchmarks

Qwen3-0.6B on Apple M4:

| Config | Decode speed | Model memory |
| --- | --- | --- |
| bf16 | 52 tok/s | 865 MB |
| INT4 | 72 tok/s | ~265 MB |

AM compaction: 0.09s for 48-token trajectory, 35% retention.

## Project Structure

```
src/
├── backend/
│   ├── cuda/          # NVIDIA GPU backend (cuBLAS, cuSOLVER)
│   └── mlx_backend/   # Apple Silicon backend (mlx-rs)
├── model/             # Qwen3 transformer (attention, MLP, norms)
│   └── weight.rs      # Weight enum (bf16 / INT4 quantized)
├── am/                # AM compaction pipeline (scoring, beta, c_v)
├── kv_cache.rs        # KV cache management
├── weights.rs         # Safetensors loader
├── session.rs         # High-level API
└── config.rs          # Model + inference config
```

## Requirements

  • Apple Silicon: macOS, Xcode Command Line Tools, CMake (for MLX-C build)
  • NVIDIA: CUDA 13.x, nvcc, Rust 1.75+

## License

MIT
