Arc

Inference at the speed of physics.

Website | Rust SDK | Python SDK | TurboQuant | GitHub

A Rust-native LLM inference engine with near-optimal KV cache compression.
Built by Aeonmind. Powers Runcrate.


Arc is a high-performance LLM inference engine targeting 120 tokens/second on Qwen3-32B BF16 on a single B200 — 97% of theoretical HBM bandwidth utilization. Built on mistral.rs with CUDA graph capture, fused kernels, and TurboQuant KV cache compression (ICLR 2026, lossless 4.27x context scaling).

Peak Inference Plan — the full technical roadmap to 120 tok/s.

What Arc Adds

| Feature | What it does | Status |
| --- | --- | --- |
| TurboQuant KV cache | 4-bit K / 3-bit V sub-byte packing with WHT rotation + Lloyd-Max codebooks. 4.27x context on H100. | |
| Fused attention kernels | CUDA kernels that read packed KV directly: codebook lookup in the attention inner loop. | |
| 3 compression presets | Default (3.5-bit, lossless), Balanced (3.0-bit), Aggressive (2.5-bit). | |
| GPU-autonomous decode | Entire decode loop (forward + sampling + EOS check) runs on GPU without CPU involvement. CUDA 12.4 conditional graph nodes. | In progress |
| Fused GPU sampling | Argmax, top-p, penalties, all on GPU inside the captured graph. No CPU round-trip per token. | Planned |
| Zero-copy token streaming | GPU writes tokens to a pinned host ring buffer. CPU streams to clients without synchronizing the decode loop. | Planned |
| Elastic tensor parallelism | Per-request GPU allocation, TP=1 to TP=8 dynamically. | Planned |
| Disaggregated serving | Prefill-decode separation with KV-aware routing. | Planned |

Everything from mistral.rs is included: PagedAttention, FlashAttention V2/V3, speculative decoding, continuous batching, 100+ model architectures, GGUF/GPTQ/AWQ/ISQ, LoRA, MCP, multi-GPU tensor parallelism, and more.

Quick Start

Install

Linux/macOS (one-liner):

curl -fsSL https://raw.githubusercontent.com/aeonmindai/arc/master/install.sh | sh

Auto-detects CUDA, Metal, and FlashAttention. Downloads prebuilt binaries when available, builds from source otherwise.

From source:

cargo install --path arc-cli                              # CPU
cargo install --path arc-cli --features metal             # Apple Silicon
cargo install --path arc-cli --features "cuda flash-attn" # NVIDIA GPU

Run

# Interactive chat — TurboQuant enabled by default
arc run -m Qwen/Qwen3-4B

# Start a server with web UI
arc serve --ui -m google/gemma-3-4b-it

# Benchmark
arc bench -m meta-llama/Llama-3.1-8B-Instruct

TurboQuant Presets

# Default: 3.5-bit (K4/V3) — lossless, 2.2x compression
arc serve -m <model>

# Balanced: 3.0-bit (K3/V3) — 2.56x compression
arc serve -m <model> --pa-cache-type turboquant-3

# Aggressive: 2.5-bit (K3/V2) — 4.1x compression, ~1.2% quality loss
arc serve -m <model> --pa-cache-type turboquant-aggressive

# Disable TurboQuant (upstream behavior)
arc serve -m <model> --pa-cache-type auto

TurboQuant

TurboQuant (arXiv:2504.19874, ICLR 2026) compresses KV cache vectors to 2-4 bits using:

  1. Walsh-Hadamard rotation — O(d log d) random orthogonal transform that makes every coordinate follow a known Beta distribution, regardless of input data
  2. Lloyd-Max codebooks — pre-computed optimal scalar quantizers for the Beta distribution (no calibration, no training data needed)
  3. Sub-byte packing — 4-bit nibble and 3-bit 10-in-32 formats, with fused GPU kernels

The key insight: after rotation, optimal quantization is data-oblivious. The codebooks are compile-time constants.
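The rotate-then-quantize pipeline can be sketched in a few lines of pure Python. This is an illustration, not the engine's kernels: the sign vector stands in for the random diagonal D, and the 16 uniform levels are a placeholder for the real Lloyd-Max codebook, which is tuned to the post-rotation coordinate distribution.

```python
import math
import random

def fwht(vec):
    """In-place fast Walsh-Hadamard transform, O(d log d).
    Length must be a power of two; scaled so the transform is orthonormal."""
    h, n = 1, len(vec)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = vec[j], vec[j + h]
                vec[j], vec[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    for i in range(n):
        vec[i] *= scale

def rotate(vec, signs):
    """D*H: random sign flips followed by the Hadamard transform (orthogonal)."""
    out = [v * s for v, s in zip(vec, signs)]
    fwht(out)
    return out

# Illustrative 4-bit codebook: 16 uniform levels on [-1, 1].
# (A stand-in -- the real engine uses precomputed Lloyd-Max levels
# for the known post-rotation distribution, not uniform ones.)
LEVELS = [-1.0 + (2 * i + 1) / 16 for i in range(16)]

def quantize(x):
    return min(range(16), key=lambda i: abs(LEVELS[i] - x))

random.seed(0)
d = 64
v = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(x * x for x in v))
v = [x / norm for x in v]                       # L2 normalize
signs = [random.choice((-1, 1)) for _ in range(d)]

r = rotate(v, signs)                            # coordinates now well-spread
codes = [quantize(x) for x in r]                # 4 bits each
recon = [LEVELS[c] for c in codes]
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, recon)))
print(f"relative L2 error: {err:.3f}")
```

Because the rotation is orthogonal, dot products (and hence attention scores) are preserved up to quantization error, which is why Q only needs to be rotated once on the read path.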

Performance (Tested)

Qwen3-32B on a single H100 80GB:

| Metric | BF16 (baseline) | TurboQuant (default) | Improvement |
| --- | --- | --- | --- |
| KV cache per token | 4,096 bytes/layer | 960 bytes/layer | 4.27x smaller |
| Available context | 39,680 tokens | 169,216 tokens | 4.27x longer |
| Concurrent users (4K ctx) | ~10 | ~42 | 4.27x more |
| Model quality | Baseline | Identical | Lossless |

Presets:

| Preset | Bits | Context multiplier | Quality |
| --- | --- | --- | --- |
| Default | 3.5 (K4/V3) | 4.27x | Lossless |
| Balanced | 3.0 (K3/V3) | ~5x | ~0.1% loss |
| Aggressive | 2.5 (K3/V2) | ~7x | ~1.2% loss |

At 3.5 bits, LongBench scores are identical to FP16 (50.06 vs 50.06) on Llama-3.1-8B-Instruct, with zero quality loss on needle-in-a-haystack at all context lengths.
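The 4.27x figure follows directly from the per-layer byte counts in the table above. A quick check (the gap between the raw 3.5-bit payload and the 960-byte figure is presumably per-vector scales/metadata; that breakdown is an inference from the numbers, not documented):

```python
baseline = 4096          # bytes per layer per token, BF16 K+V (from the table)
turboquant = 960         # bytes per layer per token, K4/V3 packed (from the table)

ratio = baseline / turboquant
print(f"{ratio:.2f}x")   # ~4.27x, matching the context and concurrency columns

# Raw 3.5-bit payload alone: 4096 * 3.5 / 16 = 896 bytes,
# leaving 64 bytes/layer for scales and metadata.
payload = baseline * 3.5 / 16
print(f"payload only: {payload:.0f} bytes")
```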

How It Works

Write path (per token):
  K/V vector → L2 normalize → WHT rotate (D·H·D) → Lloyd-Max quantize → pack

Read path (attention):
  Rotate Q once → for each cached K: unpack → codebook lookup → dot product
  No full dequantization. Codebook in shared memory.
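The "pack" step's 3-bit 10-in-32 layout can be sketched like this: ten 3-bit codes fit in one 32-bit word with 2 bits of padding. The bit order within the word is an illustrative choice here; the engine's actual layout may differ.

```python
def pack_3bit(codes):
    """Pack 3-bit codes into 32-bit words, 10 codes per word (30 bits used)."""
    words = []
    for base in range(0, len(codes), 10):
        word = 0
        for slot, code in enumerate(codes[base:base + 10]):
            assert 0 <= code < 8
            word |= code << (3 * slot)   # slot i occupies bits [3i, 3i+3)
        words.append(word)
    return words

def unpack_3bit(words, n):
    """Recover the first n codes from a list of packed 32-bit words."""
    codes = []
    for word in words:
        for slot in range(10):
            codes.append((word >> (3 * slot)) & 0b111)
    return codes[:n]

codes = [5, 0, 7, 3, 1, 6, 2, 4, 7, 7, 1, 2]   # twelve 3-bit values
words = pack_3bit(codes)
assert unpack_3bit(words, len(codes)) == codes
print(f"{len(codes)} codes -> {len(words)} x u32")
```

On the GPU the unpack is a shift-and-mask in registers, which is why the attention kernel can index the codebook per element without materializing a dequantized cache.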

Python SDK

The Python SDK wraps the upstream mistralrs package with TurboQuant defaults.

pip install mistralrs              # CPU
pip install mistralrs-cuda         # NVIDIA GPU
pip install mistralrs-metal        # Apple Silicon

from mistralrs import Runner, Which, ChatCompletionRequest, PagedCacheType

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
    pa_cache_type=PagedCacheType.TurboQuant,  # default
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)

TurboQuant presets available: PagedCacheType.TurboQuant (default), PagedCacheType.TurboQuant3, PagedCacheType.TurboQuantAggressive.

Python examples | Cookbook

Rust SDK

cargo add arc-engine

use arc_engine::core::{TextModelBuilder, PagedAttentionConfig, MemoryGpuConfig, PagedCacheType};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    arc_engine::print_banner();

    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_paged_attn(|| {
            // TurboQuant is the default — this is automatic
            PagedAttentionConfig::new(
                None,
                MemoryGpuConfig::default(),
                PagedCacheType::TurboQuant,
            )
        })?
        .build()
        .await?;

    let response = model.chat("What is Rust's ownership model?").await?;
    println!("{response}");
    Ok(())
}

The upstream mistralrs Rust crate also works — TurboQuant is the default there too.

Rust API docs | Examples

HTTP API

Arc serves an OpenAI-compatible API:

arc serve -p 8080 -m meta-llama/Llama-3.1-8B-Instruct

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
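The same request from Python using only the standard library. The endpoint and payload mirror the curl call above; the base URL assumes `arc serve -p 8080` is running locally.

```python
import json
import urllib.request

API = "http://localhost:8080/v1"   # default address for `arc serve -p 8080`

def build_request(prompt, base=API):
    """Build the OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt):
    """POST one chat completion and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Hello!")  # requires a running `arc serve` instance
```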

Full API documentation

Supported Models

Arc supports every model that mistral.rs supports — 100+ architectures across text, vision, speech, image generation, and embeddings.

Text — Granite 4.0, SmolLM 3, DeepSeek V3, GPT-OSS, Qwen 3, GLM 4, Gemma 2, Phi 3, Llama, Mistral, Mixtral, Starcoder 2, and more

Granite 4.0, SmolLM 3, DeepSeek V3, GPT-OSS, DeepSeek V2, Qwen 3 Next, Qwen 3 MoE, Phi 3.5 MoE, Qwen 3, GLM 4, GLM-4.7-Flash, GLM-4.7 MoE, Gemma 2, Qwen 2, Starcoder 2, Phi 3, Mixtral, Phi 2, Gemma, Llama, Mistral

Vision — Qwen 3.5, Gemma 3n, Llama 4, Mistral 3, Phi 4, MiniCPM-O, LLaVA, and more

Qwen 3.5, Qwen 3.5 MoE, Qwen 3-VL, Qwen 3-VL MoE, Gemma 3n, Llama 4, Gemma 3, Mistral 3, Phi 4 multimodal, Qwen 2.5-VL, MiniCPM-O, Llama 3.2 Vision, Qwen 2-VL, Idefics 3, Idefics 2, LLaVA Next, LLaVA, Phi 3V

Speech — Voxtral, Dia

Voxtral (ASR/speech-to-text), Dia

Image Generation — FLUX

FLUX

Embeddings — Embedding Gemma, Qwen 3 Embedding

Embedding Gemma, Qwen 3 Embedding

Architecture

Arc uses a thin-wrapper architecture over mistral.rs for upstream compatibility:

arc-cli/          Arc binary (BSL-1.1)
arc-engine/       Wrapper crate (BSL-1.1)
arc-turbo/        TurboQuant: codebooks, WHT, cache, kernels (BSL-1.1)
mistralrs-*/      Upstream mistral.rs (MIT) — untouched, merge-compatible

git merge upstream/master works cleanly. New models and fixes from upstream are available immediately.

Building from Source

# CPU only
cargo build --release -p arc-cli

# NVIDIA GPU (CUDA + FlashAttention)
cargo build --release -p arc-cli --features "cuda flash-attn"

# Apple Silicon (Metal)
cargo build --release -p arc-cli --features metal

# Install globally
cargo install --path arc-cli --features <your-features>

Documentation

License

  • arc-* crates: Business Source License 1.1 — free for non-commercial and sub-$1M revenue use. Commercial inference-as-a-service requires a license from Aeonmind.
  • mistralrs-* crates: MIT — upstream open source.

See NOTICE for details. For commercial licensing: support@runcrate.ai

Credits

Arc is built on mistral.rs by Eric Buehler and contributors, and the Candle ML framework by Hugging Face. TurboQuant is based on research by Zandieh et al. at Google Research (arXiv:2504.19874).


Arc by Aeonmind, LLC
The AI Cloud. Deploy, Scale, Infer.
