Website | Rust SDK | Python SDK | TurboQuant | GitHub
A Rust-native LLM inference engine with near-optimal KV cache compression.
Built by Aeonmind. Powers Runcrate.
Arc is a high-performance LLM inference engine targeting 120 tokens/second on Qwen3-32B BF16 on a single B200 — 97% of theoretical HBM bandwidth utilization. Built on mistral.rs with CUDA graph capture, fused kernels, and TurboQuant KV cache compression (ICLR 2026, lossless 4.27x context scaling).
Peak Inference Plan — the full technical roadmap to 120 tok/s.
| Feature | What it does | Status |
|---|---|---|
| TurboQuant KV cache | 4-bit K / 3-bit V sub-byte packing with WHT rotation + Lloyd-Max codebooks. 4.27x context on H100. | ✅ |
| Fused attention kernels | CUDA kernels that read packed KV directly — codebook lookup in the attention inner loop | ✅ |
| 3 compression presets | Default (3.5-bit, lossless), Balanced (3.0-bit), Aggressive (2.5-bit) | ✅ |
| GPU-autonomous decode | Entire decode loop (forward + sampling + EOS check) runs on GPU without CPU involvement. CUDA 12.4 conditional graph nodes. | In progress |
| Fused GPU sampling | Argmax, top-p, penalties — all on GPU inside the captured graph. No CPU round-trip per token. | Planned |
| Zero-copy token streaming | GPU writes tokens to pinned host ring buffer. CPU streams to clients without synchronizing the decode loop. | Planned |
| Elastic tensor parallelism | Per-request GPU allocation, TP=1 to TP=8 dynamically | Planned |
| Disaggregated serving | Prefill-decode separation with KV-aware routing | Planned |
Everything from mistral.rs is included: PagedAttention, FlashAttention V2/V3, speculative decoding, continuous batching, 100+ model architectures, GGUF/GPTQ/AWQ/ISQ, LoRA, MCP, multi-GPU tensor parallelism, and more.
Linux/macOS (one-liner):
```shell
curl -fsSL https://raw.githubusercontent.com/aeonmindai/arc/master/install.sh | sh
```

Auto-detects CUDA, Metal, and FlashAttention. Downloads prebuilt binaries when available; builds from source otherwise.
From source:
```shell
cargo install --path arc-cli                                # CPU
cargo install --path arc-cli --features metal               # Apple Silicon
cargo install --path arc-cli --features "cuda flash-attn"   # NVIDIA GPU
```

```shell
# Interactive chat — TurboQuant enabled by default
arc run -m Qwen/Qwen3-4B

# Start a server with web UI
arc serve --ui -m google/gemma-3-4b-it

# Benchmark
arc bench -m meta-llama/Llama-3.1-8B-Instruct
```

```shell
# Default: 3.5-bit (K4/V3) — lossless, 2.2x compression
arc serve -m <model>

# Balanced: 3.0-bit (K3/V3) — 2.56x compression
arc serve -m <model> --pa-cache-type turboquant-3

# Aggressive: 2.5-bit (K3/V2) — 4.1x compression, ~1.2% quality loss
arc serve -m <model> --pa-cache-type turboquant-aggressive

# Disable TurboQuant (upstream behavior)
arc serve -m <model> --pa-cache-type auto
```

TurboQuant (arXiv:2504.19874, ICLR 2026) compresses KV cache vectors to 2-4 bits using:
- Walsh-Hadamard rotation — O(d log d) random orthogonal transform that makes every coordinate follow a known Beta distribution, regardless of input data
- Lloyd-Max codebooks — pre-computed optimal scalar quantizers for the Beta distribution (no calibration, no training data needed)
- Sub-byte packing — 4-bit nibble and 3-bit 10-in-32 formats, with fused GPU kernels
The key insight: after rotation, optimal quantization is data-oblivious. The codebooks are compile-time constants.
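The rotate-then-quantize idea can be sketched in a few lines of NumPy (an illustration, not Arc's kernels): apply a fast Walsh-Hadamard transform with a random sign diagonal as a stand-in for the full D·H·D construction, then snap every coordinate to one fixed table. The 4-entry codebook below is a placeholder for illustration; Arc's Lloyd-Max constants differ.

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of 2."""
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x /= np.sqrt(d)  # orthonormal scaling, so the rotation preserves norms
    return x

rng = np.random.default_rng(0)
d = 128
v = rng.normal(size=d)
v /= np.linalg.norm(v)                   # L2 normalize (write path, step 1)

signs = rng.choice([-1.0, 1.0], size=d)  # random sign diagonal (stand-in for D·H·D)
r = fwht(v * signs)                      # rotated: coordinate distribution is now known

# Quantize every coordinate against one FIXED table: no calibration data needed.
# Illustrative 2-bit levels, not Arc's Lloyd-Max constants.
codebook = np.array([-0.13, -0.04, 0.04, 0.13])
codes = np.abs(r[:, None] - codebook[None, :]).argmin(axis=1)
r_hat = codebook[codes]

print("max |coordinate| after rotation:", np.abs(r).max())
print("L2 reconstruction error:", np.linalg.norm(r - r_hat))
```

Because the post-rotation distribution is the same for any input, the same codebook serves every vector, which is what lets Arc bake the tables in as compile-time constants.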
Qwen3-32B on a single H100 80GB:
| Metric | BF16 (baseline) | TurboQuant (default) | Improvement |
|---|---|---|---|
| KV cache per token | 4,096 bytes/layer | 960 bytes/layer | 4.27x smaller |
| Available context | 39,680 tokens | 169,216 tokens | 4.27x longer |
| Concurrent users (4K ctx) | ~10 | ~42 | 4.27x more |
| Model quality | Baseline | Identical | Lossless |
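The per-layer numbers can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes a GQA layout of 8 KV heads with head dimension 128 for Qwen3-32B (an assumption about the model config); the gap between the raw 3.5-bit payload and the reported 960 bytes is consistent with per-vector scale/metadata overhead:

```python
kv_heads, head_dim = 8, 128                  # assumed Qwen3-32B GQA layout
bf16_bytes = 2 * kv_heads * head_dim * 2     # K + V vectors, 2 bytes per element
print("BF16 bytes/layer/token:", bf16_bytes)

# Default preset: 4-bit keys + 3-bit values, payload only (before scale overhead)
packed_bytes = kv_heads * head_dim * (4 + 3) / 8
print("packed bytes/layer/token (payload only):", packed_bytes)

print("ratio vs the reported 960 bytes:", bf16_bytes / 960)
```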
Presets:
| Preset | Bits | Context multiplier | Quality |
|---|---|---|---|
| Default | 3.5 (K4/V3) | 4.27x | Lossless |
| Balanced | 3.0 (K3/V3) | ~5x | ~0.1% loss |
| Aggressive | 2.5 (K3/V2) | ~7x | ~1.2% loss |
At 3.5 bits, LongBench scores are identical to FP16 (50.06 vs. 50.06 on Llama-3.1-8B-Instruct), with zero quality loss on needle-in-a-haystack at all context lengths.
Write path (per token):

```text
K/V vector → L2 normalize → WHT rotate (D·H·D) → Lloyd-Max quantize → pack
```

Read path (attention):

```text
Rotate Q once → for each cached K: unpack → codebook lookup → dot product
```
No full dequantization. Codebook in shared memory.
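The packing and lookup steps can be sketched in NumPy (a simplified CPU-side illustration of the 4-bit nibble format; Arc's fused CUDA kernels do the equivalent lookup from shared memory inside the attention inner loop rather than writing a dequantized cache back to HBM):

```python
import numpy as np

def pack_nibbles(codes):
    """Pack 4-bit codes (values 0..15) two per byte, low nibble first."""
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def lookup_dot(q_rotated, packed_k, codebook):
    """Read-path sketch: dot(q, K) by gathering codebook entries per code."""
    codes = unpack_nibbles(packed_k)
    return float(np.dot(q_rotated, codebook[codes]))

codebook = np.linspace(-0.15, 0.15, 16)  # illustrative 16-entry table
rng = np.random.default_rng(1)
codes = rng.integers(0, 16, size=128).astype(np.uint8)
packed = pack_nibbles(codes)             # 128 codes -> 64 bytes
q = rng.normal(size=128)

assert np.array_equal(unpack_nibbles(packed), codes)  # round-trip is exact
print(lookup_dot(q, packed, codebook))
```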
The Python SDK wraps the upstream mistralrs package with TurboQuant defaults.
```shell
pip install mistralrs        # CPU
pip install mistralrs-cuda   # NVIDIA GPU
pip install mistralrs-metal  # Apple Silicon
```

```python
from mistralrs import Runner, Which, ChatCompletionRequest, PagedCacheType

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
    pa_cache_type=PagedCacheType.TurboQuant,  # default
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)
```

TurboQuant presets available: `PagedCacheType.TurboQuant` (default), `PagedCacheType.TurboQuant3`, `PagedCacheType.TurboQuantAggressive`.
```shell
cargo add arc-engine
```

```rust
use arc_engine::core::{TextModelBuilder, PagedAttentionConfig, MemoryGpuConfig, PagedCacheType};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    arc_engine::print_banner();

    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_paged_attn(|| {
            // TurboQuant is the default — this is automatic
            PagedAttentionConfig::new(
                None,
                MemoryGpuConfig::default(),
                PagedCacheType::TurboQuant,
            )
        })?
        .build()
        .await?;

    let response = model.chat("What is Rust's ownership model?").await?;
    println!("{response}");
    Ok(())
}
```

The upstream mistralrs Rust crate also works — TurboQuant is the default there too.
Arc serves an OpenAI-compatible API:
```shell
arc serve -p 8080 -m meta-llama/Llama-3.1-8B-Instruct
```

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Arc supports every model that mistral.rs supports — 100+ architectures across text, vision, speech, image generation, and embeddings.
Text — Granite 4.0, SmolLM 3, DeepSeek V3, GPT-OSS, DeepSeek V2, Qwen 3 Next, Qwen 3 MoE, Phi 3.5 MoE, Qwen 3, GLM 4, GLM-4.7-Flash, GLM-4.7 MoE, Gemma 2, Qwen 2, Starcoder 2, Phi 3, Mixtral, Phi 2, Gemma, Llama, Mistral

Vision — Qwen 3.5, Qwen 3.5 MoE, Qwen 3-VL, Qwen 3-VL MoE, Gemma 3n, Llama 4, Gemma 3, Mistral 3, Phi 4 multimodal, Qwen 2.5-VL, MiniCPM-O, Llama 3.2 Vision, Qwen 2-VL, Idefics 3, Idefics 2, LLaVA Next, LLaVA, Phi 3V

Speech — Voxtral (ASR/speech-to-text), Dia

Image Generation — FLUX

Embeddings — Embedding Gemma, Qwen 3 Embedding
Arc uses a thin-wrapper architecture over mistral.rs for upstream compatibility:
```text
arc-cli/      Arc binary (BSL-1.1)
arc-engine/   Wrapper crate (BSL-1.1)
arc-turbo/    TurboQuant: codebooks, WHT, cache, kernels (BSL-1.1)
mistralrs-*/  Upstream mistral.rs (MIT) — untouched, merge-compatible
```

`git merge upstream/master` works cleanly. New models and fixes from upstream are available immediately.
```shell
# CPU only
cargo build --release -p arc-cli

# NVIDIA GPU (CUDA + FlashAttention)
cargo build --release -p arc-cli --features "cuda flash-attn"

# Apple Silicon (Metal)
cargo build --release -p arc-cli --features metal

# Install globally
cargo install --path arc-cli --features <your-features>
```

- CLI Reference
- HTTP API
- Quantization — ISQ, GGUF, GPTQ, AWQ, TurboQuant
- PagedAttention
- Device Mapping
- MCP Integration
- Full documentation
- `arc-*` crates: Business Source License 1.1 — free for non-commercial and sub-$1M-revenue use. Commercial inference-as-a-service requires a license from Aeonmind.
- `mistralrs-*` crates: MIT — upstream open source.
See NOTICE for details. For commercial licensing: support@runcrate.ai
Arc is built on mistral.rs by Eric Buehler and contributors, and the Candle ML framework by Hugging Face. TurboQuant is based on research by Zandieh et al. at Google Research (arXiv:2504.19874).
Arc by Aeonmind, LLC
The AI Cloud. Deploy, Scale, Infer.