
# am-compact

Rust inference engine for Attention Matching (Zweiger et al., 2026) KV-cache compaction on Qwen3 models. Loads HuggingFace safetensors, runs the full AM compaction pipeline, and generates from compacted prefixes — all in a single Rust binary.

## Backends

| Backend | Platform | Speed (Qwen3-0.6B) | Feature flag |
| --- | --- | --- | --- |
| MLX | Apple Silicon (M1–M4) | 72 tok/s (INT4) / 52 tok/s (bf16) | `--features mlx` |
| CUDA | NVIDIA GPU | ~250 tok/s (RTX 3080, bf16) | default |

## Quick Start (Apple Silicon)

```sh
# Build (first build compiles MLX-C from source, ~4 min)
cargo build --release --features mlx --no-default-features

# Generate text
cargo run --release --features mlx --no-default-features --example generate -- \
  --model /path/to/Qwen3-0.6B --prompt "The capital of France is" --max-tokens 64

# Generate with INT4 quantization (1.4x faster)
cargo run --release --features mlx --no-default-features --example generate -- \
  --model /path/to/Qwen3-0.6B --prompt "hello" --int4

# Run AM compaction demo
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
  --model /path/to/Qwen3-0.6B
```

## Quick Start (NVIDIA GPU)

```sh
# Needs CUDA 13.x with nvcc on PATH
cargo build --release
cargo run --release --example generate -- --model /path/to/Qwen3-0.6B
```

## AM Compaction

Implements Attention Matching (Zweiger et al., 2026) for KV cache compaction, with task-guided modifications from Latent Briefing (Ramp Labs, 2026) for multi-agent systems built on the Recursive Language Model (Zhang et al., 2025) framework.

The core AM equation finds compacted keys C_k, biases β, and values C_v such that:

```
softmax(Q · C_k^T + β) · C_v ≈ softmax(Q · K^T) · V
```
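
The equality is exact when nothing is dropped; compaction only has to keep it approximately true after pruning keys. A minimal single-query sketch in plain Rust (illustrative only, not this crate's API; `attend` and the toy numbers are made up) shows why pruning a near-ignored key is cheap, and why the remaining error is what the β and C_v fits must absorb:

```rust
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - m).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / z).collect()
}

/// softmax(q · K^T + β) · V for a single query row, with one additive
/// bias per key — the shape of both sides of the AM equation.
fn attend(q: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>], beta: &[f64]) -> Vec<f64> {
    let logits: Vec<f64> = keys
        .iter()
        .zip(beta)
        .map(|(k, b)| q.iter().zip(k).map(|(x, y)| x * y).sum::<f64>() + b)
        .collect();
    let w = softmax(&logits);
    let mut out = vec![0.0; values[0].len()];
    for (wi, v) in w.iter().zip(values) {
        for (o, vi) in out.iter_mut().zip(v) {
            *o += wi * vi;
        }
    }
    out
}

fn main() {
    let q = vec![1.0, 0.0];
    // Third key has logit -10, so it receives almost no attention mass.
    let keys = vec![vec![2.0, 0.0], vec![1.0, 0.0], vec![-10.0, 0.0]];
    let values = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];

    // Full attention over all three keys.
    let full = attend(&q, &keys, &values, &[0.0; 3]);
    // Crude "compaction": drop the near-ignored key, β = 0, C_v = kept V.
    let kept = attend(&q, &keys[..2], &values[..2], &[0.0; 2]);

    let err: f64 = full.iter().zip(&kept).map(|(a, b)| (a - b).abs()).sum();
    assert!(err < 1e-3); // dropping a low-attention key barely moves the output
}
```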

Pipeline:

  1. Forward pass — process trajectory tokens, capture KV cache
  2. Stage 1: Scoring — task-guided attention scoring with shared global token selection via MAD thresholding (Latent Briefing modification: uses task queries instead of context queries, shared mask across all heads)
  3. Stage 2: Beta fitting — NNLS via projected gradient descent fits bias corrections so softmax over kept keys approximates original distribution
  4. Stage 3: C_v fitting — ridge regression (X^T X + λI)^{-1} X^T Y reconstructs value vectors (regularized variant of AM's OLS formulation)
  5. Generation — decode from compacted prefix with β injected as additive attention bias
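
Stage 1's MAD thresholding can be sketched generically (a median-absolute-deviation keep rule; the scoring, the constant `k`, and `mad_keep_mask` are illustrative assumptions, not taken from the repo):

```rust
/// Median of a list (consumes its argument; sorts in place).
fn median(mut xs: Vec<f64>) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 { xs[n / 2] } else { 0.5 * (xs[n / 2 - 1] + xs[n / 2]) }
}

/// One shared keep-mask over tokens: keep a token if its score is an
/// upward outlier, i.e. exceeds median + k * MAD of all scores.
fn mad_keep_mask(scores: &[f64], k: f64) -> Vec<bool> {
    let med = median(scores.to_vec());
    let mad = median(scores.iter().map(|s| (s - med).abs()).collect());
    scores.iter().map(|&s| s > med + k * mad).collect()
}

fn main() {
    // Per-token scores, e.g. attention mass received from task queries.
    let scores = [0.01, 0.02, 0.90, 0.03, 0.75, 0.02];
    let mask = mad_keep_mask(&scores, 3.0);
    println!("{mask:?}"); // → [false, false, true, false, true, false]
}
```

Because the mask is computed once over pooled scores, it is shared across heads, matching the Latent Briefing modification described above.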

```sh
# End-to-end demo: compacts 48 tokens to 17 (65% reduction), generates answer
cargo run --release --features mlx --no-default-features --example smoke_compact -- \
  --model /path/to/Qwen3-0.6B \
  --trajectory "The Eiffel Tower is located in Paris, France. It was built in 1889." \
  --task "What year was the Eiffel Tower built?"
# → Answer (compacted): The answer is 1889.
```
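
Stage 2's solver family, NNLS via projected gradient descent, reduces to "take a gradient step, clamp negatives to zero". A standalone sketch of that generic solver (illustrative, not this crate's implementation; the fixed step size and iteration count are assumptions):

```rust
/// Minimize ||A x - b||^2 subject to x >= 0 by projected gradient
/// descent: gradient step on the least-squares objective, then
/// projection onto the nonnegative orthant (clamp at zero).
fn nnls_pgd(a: &[Vec<f64>], b: &[f64], steps: usize, lr: f64) -> Vec<f64> {
    let n = a[0].len();
    let mut x = vec![0.0; n];
    for _ in 0..steps {
        // Residual r = A x - b.
        let r: Vec<f64> = a
            .iter()
            .zip(b)
            .map(|(row, &bi)| {
                row.iter().zip(&x).map(|(aij, xj)| aij * xj).sum::<f64>() - bi
            })
            .collect();
        // Gradient g = A^T r; step, then project onto x >= 0.
        for j in 0..n {
            let g: f64 = a.iter().zip(&r).map(|(row, ri)| row[j] * ri).sum();
            x[j] = (x[j] - lr * g).max(0.0);
        }
    }
    x
}

fn main() {
    // b is exactly reachable with nonnegative x = [1, 2].
    let a = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let b = vec![1.0, 2.0, 3.0];
    let x = nnls_pgd(&a, &b, 2000, 0.1);
    assert!((x[0] - 1.0).abs() < 1e-3 && (x[1] - 2.0).abs() < 1e-3);
}
```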
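
Stage 3's ridge closed form (X^T X + λI)^{-1} X^T Y can likewise be written out for a two-feature toy case, where the 2×2 Gram matrix inverts by hand (again a sketch of the formula, not the crate's on-device solver):

```rust
/// Ridge regression weights (X^T X + lambda*I)^{-1} X^T y for a
/// design matrix with two features, using the explicit 2x2 inverse.
fn ridge_2d(x: &[[f64; 2]], y: &[f64], lambda: f64) -> [f64; 2] {
    // Accumulate G = X^T X + lambda*I and c = X^T y.
    let (mut g00, mut g01, mut g11) = (lambda, 0.0, lambda);
    let (mut c0, mut c1) = (0.0, 0.0);
    for (row, &yi) in x.iter().zip(y) {
        g00 += row[0] * row[0];
        g01 += row[0] * row[1];
        g11 += row[1] * row[1];
        c0 += row[0] * yi;
        c1 += row[1] * yi;
    }
    // Invert the symmetric 2x2 Gram matrix and apply it to c.
    let det = g00 * g11 - g01 * g01;
    [(g11 * c0 - g01 * c1) / det, (g00 * c1 - g01 * c0) / det]
}

fn main() {
    // y = 2*x0 + 3*x1 exactly; as lambda -> 0 the fit recovers [2, 3].
    let x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
    let y = [2.0, 3.0, 5.0];
    let w = ridge_2d(&x, &y, 1e-9);
    assert!((w[0] - 2.0).abs() < 1e-6 && (w[1] - 3.0).abs() < 1e-6);
}
```

The λI term is what distinguishes this from AM's plain OLS formulation: it keeps the Gram matrix well-conditioned when the kept keys are few or nearly collinear.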

## Benchmarks

Qwen3-0.6B on Apple M4:

| Config | Decode speed | Model memory |
| --- | --- | --- |
| bf16 | 52 tok/s | 865 MB |
| INT4 | 72 tok/s | ~265 MB |

AM compaction: 0.09s for 48-token trajectory, 35% retention.

## Project Structure

```
src/
├── backend/
│   ├── cuda/          # NVIDIA GPU backend (cuBLAS, cuSOLVER)
│   └── mlx_backend/   # Apple Silicon backend (mlx-rs)
├── model/             # Qwen3 transformer (attention, MLP, norms)
│   └── weight.rs      # Weight enum (bf16 / INT4 quantized)
├── am/                # AM compaction pipeline (scoring, beta, c_v)
├── kv_cache.rs        # KV cache management
├── weights.rs         # Safetensors loader
├── session.rs         # High-level API
└── config.rs          # Model + inference config
```

## Requirements

  • Apple Silicon: macOS, Xcode Command Line Tools, CMake (for MLX-C build)
  • NVIDIA: CUDA 13.x, nvcc, Rust 1.75+

## License

MIT
