Arc

Inference at the speed of physics.

Website | Rust SDK | Python SDK | TurboQuant | GitHub

A Rust-native LLM inference engine with near-optimal KV cache compression.
Built by Aeonmind. Powers Runcrate.


Arc is a high-performance LLM inference engine targeting 120 tokens/second on Qwen3-32B BF16 on a single B200 — 97% of theoretical HBM bandwidth utilization. Built on mistral.rs with CUDA graph capture, fused kernels, and TurboQuant KV cache compression (ICLR 2026, lossless 4.27x context scaling).

Peak Inference Plan — the full technical roadmap to 120 tok/s.

What Arc Adds

| Feature | What it does | Status |
| --- | --- | --- |
| TurboQuant KV cache | 4-bit K / 3-bit V sub-byte packing with WHT rotation + Lloyd-Max codebooks. 4.27x context on H100. | |
| Fused attention kernels | CUDA kernels that read packed KV directly: codebook lookup in the attention inner loop. | |
| 3 compression presets | Default (3.5-bit, lossless), Balanced (3.0-bit), Aggressive (2.5-bit). | |
| GPU-autonomous decode | Entire decode loop (forward + sampling + EOS check) runs on GPU without CPU involvement. CUDA 12.4 conditional graph nodes. | In progress |
| Fused GPU sampling | Argmax, top-p, penalties, all on GPU inside the captured graph. No CPU round-trip per token. | Planned |
| Zero-copy token streaming | GPU writes tokens to a pinned host ring buffer. CPU streams to clients without synchronizing the decode loop. | Planned |
| Elastic tensor parallelism | Per-request GPU allocation, TP=1 to TP=8 dynamically. | Planned |
| Disaggregated serving | Prefill-decode separation with KV-aware routing. | Planned |

Everything from mistral.rs is included: PagedAttention, FlashAttention V2/V3, speculative decoding, continuous batching, 100+ model architectures, GGUF/GPTQ/AWQ/ISQ, LoRA, MCP, multi-GPU tensor parallelism, and more.

Quick Start

Install

Linux/macOS (one-liner):

curl -fsSL https://raw.githubusercontent.com/aeonmindai/arc/master/install.sh | sh

Auto-detects CUDA, Metal, and FlashAttention. Downloads prebuilt binaries when available, builds from source otherwise.

From source:

cargo install --path arc-cli                              # CPU
cargo install --path arc-cli --features metal             # Apple Silicon
cargo install --path arc-cli --features "cuda flash-attn" # NVIDIA GPU

Run

# Interactive chat — TurboQuant enabled by default
arc run -m Qwen/Qwen3-4B

# Start a server with web UI
arc serve --ui -m google/gemma-3-4b-it

# Benchmark
arc bench -m meta-llama/Llama-3.1-8B-Instruct

TurboQuant Presets

# Default: 3.5-bit (K4/V3) — lossless, 2.2x compression
arc serve -m <model>

# Balanced: 3.0-bit (K3/V3) — 2.56x compression
arc serve -m <model> --pa-cache-type turboquant-3

# Aggressive: 2.5-bit (K3/V2) — 4.1x compression, ~1.2% quality loss
arc serve -m <model> --pa-cache-type turboquant-aggressive

# Disable TurboQuant (upstream behavior)
arc serve -m <model> --pa-cache-type auto

TurboQuant

TurboQuant (arXiv:2504.19874, ICLR 2026) compresses KV cache vectors to 2-4 bits using:

  1. Walsh-Hadamard rotation — O(d log d) random orthogonal transform that makes every coordinate follow a known Beta distribution, regardless of input data
  2. Lloyd-Max codebooks — pre-computed optimal scalar quantizers for the Beta distribution (no calibration, no training data needed)
  3. Sub-byte packing — 4-bit nibble and 3-bit 10-in-32 formats, with fused GPU kernels

The key insight: after rotation, optimal quantization is data-oblivious. The codebooks are compile-time constants.
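The rotate-then-quantize pipeline can be sketched in a few lines of pure Python. This is an illustration, not the engine's kernels: the sign vector stands in for the random diagonal D, and the 16 uniform levels are a placeholder for the real Lloyd-Max codebook, which is tuned to the post-rotation coordinate distribution.

```python
import math
import random

def fwht(vec):
    """In-place fast Walsh-Hadamard transform, O(d log d).
    Length must be a power of two; scaled so the transform is orthonormal."""
    h, n = 1, len(vec)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = vec[j], vec[j + h]
                vec[j], vec[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    for i in range(n):
        vec[i] *= scale

def rotate(vec, signs):
    """D*H: random sign flips followed by the Hadamard transform (orthogonal)."""
    out = [v * s for v, s in zip(vec, signs)]
    fwht(out)
    return out

# Illustrative 4-bit codebook: 16 uniform levels on [-1, 1].
# (A stand-in -- the real engine uses precomputed Lloyd-Max levels
# for the known post-rotation distribution, not uniform ones.)
LEVELS = [-1.0 + (2 * i + 1) / 16 for i in range(16)]

def quantize(x):
    return min(range(16), key=lambda i: abs(LEVELS[i] - x))

random.seed(0)
d = 64
v = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(sum(x * x for x in v))
v = [x / norm for x in v]                       # L2 normalize
signs = [random.choice((-1, 1)) for _ in range(d)]

r = rotate(v, signs)                            # coordinates now well-spread
codes = [quantize(x) for x in r]                # 4 bits each
recon = [LEVELS[c] for c in codes]
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, recon)))
print(f"relative L2 error: {err:.3f}")
```

Because the rotation is orthogonal, dot products (and hence attention scores) are preserved up to quantization error, which is why Q only needs to be rotated once on the read path.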

Performance (Tested)

Qwen3-32B on a single H100 80GB:

| Metric | BF16 (baseline) | TurboQuant (default) | Improvement |
| --- | --- | --- | --- |
| KV cache per token | 4,096 bytes/layer | 960 bytes/layer | 4.27x smaller |
| Available context | 39,680 tokens | 169,216 tokens | 4.27x longer |
| Concurrent users (4K ctx) | ~10 | ~42 | 4.27x more |
| Model quality | Baseline | Identical | Lossless |

Presets:

| Preset | Bits | Context multiplier | Quality |
| --- | --- | --- | --- |
| Default | 3.5 (K4/V3) | 4.27x | Lossless |
| Balanced | 3.0 (K3/V3) | ~5x | ~0.1% loss |
| Aggressive | 2.5 (K3/V2) | ~7x | ~1.2% loss |

At 3.5 bits, LongBench scores are identical to FP16 (50.06 vs 50.06) on Llama-3.1-8B-Instruct, with zero quality loss on needle-in-a-haystack at all context lengths.
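The 4.27x figure follows directly from the per-layer byte counts in the table above. A quick check (the gap between the raw 3.5-bit payload and the 960-byte figure is presumably per-vector scales/metadata; that breakdown is an inference from the numbers, not documented):

```python
baseline = 4096          # bytes per layer per token, BF16 K+V (from the table)
turboquant = 960         # bytes per layer per token, K4/V3 packed (from the table)

ratio = baseline / turboquant
print(f"{ratio:.2f}x")   # ~4.27x, matching the context and concurrency columns

# Raw 3.5-bit payload alone: 4096 * 3.5 / 16 = 896 bytes,
# leaving 64 bytes/layer for scales and metadata.
payload = baseline * 3.5 / 16
print(f"payload only: {payload:.0f} bytes")
```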

How It Works

Write path (per token):
  K/V vector → L2 normalize → WHT rotate (D·H·D) → Lloyd-Max quantize → pack

Read path (attention):
  Rotate Q once → for each cached K: unpack → codebook lookup → dot product
  No full dequantization. Codebook in shared memory.
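The "pack" step's 3-bit 10-in-32 layout can be sketched like this: ten 3-bit codes fit in one 32-bit word with 2 bits of padding. The bit order within the word is an illustrative choice here; the engine's actual layout may differ.

```python
def pack_3bit(codes):
    """Pack 3-bit codes into 32-bit words, 10 codes per word (30 bits used)."""
    words = []
    for base in range(0, len(codes), 10):
        word = 0
        for slot, code in enumerate(codes[base:base + 10]):
            assert 0 <= code < 8
            word |= code << (3 * slot)   # slot i occupies bits [3i, 3i+3)
        words.append(word)
    return words

def unpack_3bit(words, n):
    """Recover the first n codes from a list of packed 32-bit words."""
    codes = []
    for word in words:
        for slot in range(10):
            codes.append((word >> (3 * slot)) & 0b111)
    return codes[:n]

codes = [5, 0, 7, 3, 1, 6, 2, 4, 7, 7, 1, 2]   # twelve 3-bit values
words = pack_3bit(codes)
assert unpack_3bit(words, len(codes)) == codes
print(f"{len(codes)} codes -> {len(words)} x u32")
```

On the GPU the unpack is a shift-and-mask in registers, which is why the attention kernel can index the codebook per element without materializing a dequantized cache.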

Python SDK

The Python SDK wraps the upstream mistralrs package with TurboQuant defaults.

pip install mistralrs              # CPU
pip install mistralrs-cuda         # NVIDIA GPU
pip install mistralrs-metal        # Apple Silicon

from mistralrs import Runner, Which, ChatCompletionRequest, PagedCacheType

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
    pa_cache_type=PagedCacheType.TurboQuant,  # default
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)

TurboQuant presets available: PagedCacheType.TurboQuant (default), PagedCacheType.TurboQuant3, PagedCacheType.TurboQuantAggressive.

Python examples | Cookbook

Rust SDK

cargo add arc-engine

use arc_engine::core::{TextModelBuilder, PagedAttentionConfig, MemoryGpuConfig, PagedCacheType};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    arc_engine::print_banner();

    let model = TextModelBuilder::new("Qwen/Qwen3-4B")
        .with_paged_attn(|| {
            // TurboQuant is the default — this is automatic
            PagedAttentionConfig::new(
                None,
                MemoryGpuConfig::default(),
                PagedCacheType::TurboQuant,
            )
        })?
        .build()
        .await?;

    let response = model.chat("What is Rust's ownership model?").await?;
    println!("{response}");
    Ok(())
}

The upstream mistralrs Rust crate also works — TurboQuant is the default there too.

Rust API docs | Examples

HTTP API

Arc serves an OpenAI-compatible API:

arc serve -p 8080 -m meta-llama/Llama-3.1-8B-Instruct

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
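The same request from Python using only the standard library. The endpoint and payload mirror the curl call above; the base URL assumes `arc serve -p 8080` is running locally.

```python
import json
import urllib.request

API = "http://localhost:8080/v1"   # default address for `arc serve -p 8080`

def build_request(prompt, base=API):
    """Build the OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt):
    """POST one chat completion and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Hello!")  # requires a running `arc serve` instance
```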

Full API documentation

Supported Models

Arc supports every model that mistral.rs supports — 100+ architectures across text, vision, speech, image generation, and embeddings.

Text — Granite 4.0, SmolLM 3, DeepSeek V3, GPT-OSS, Qwen 3, GLM 4, Gemma 2, Phi 3, Llama, Mistral, Mixtral, Starcoder 2, and more

Granite 4.0, SmolLM 3, DeepSeek V3, GPT-OSS, DeepSeek V2, Qwen 3 Next, Qwen 3 MoE, Phi 3.5 MoE, Qwen 3, GLM 4, GLM-4.7-Flash, GLM-4.7 MoE, Gemma 2, Qwen 2, Starcoder 2, Phi 3, Mixtral, Phi 2, Gemma, Llama, Mistral

Vision — Qwen 3.5, Gemma 3n, Llama 4, Mistral 3, Phi 4, MiniCPM-O, LLaVA, and more

Qwen 3.5, Qwen 3.5 MoE, Qwen 3-VL, Qwen 3-VL MoE, Gemma 3n, Llama 4, Gemma 3, Mistral 3, Phi 4 multimodal, Qwen 2.5-VL, MiniCPM-O, Llama 3.2 Vision, Qwen 2-VL, Idefics 3, Idefics 2, LLaVA Next, LLaVA, Phi 3V

Speech — Voxtral, Dia

Voxtral (ASR/speech-to-text), Dia

Image Generation — FLUX

FLUX

Embeddings — Embedding Gemma, Qwen 3 Embedding

Embedding Gemma, Qwen 3 Embedding

Architecture

Arc uses a thin-wrapper architecture over mistral.rs for upstream compatibility:

arc-cli/          Arc binary (BSL-1.1)
arc-engine/       Wrapper crate (BSL-1.1)
arc-turbo/        TurboQuant: codebooks, WHT, cache, kernels (BSL-1.1)
mistralrs-*/      Upstream mistral.rs (MIT) — untouched, merge-compatible

git merge upstream/master works cleanly. New models and fixes from upstream are available immediately.

Building from Source

# CPU only
cargo build --release -p arc-cli

# NVIDIA GPU (CUDA + FlashAttention)
cargo build --release -p arc-cli --features "cuda flash-attn"

# Apple Silicon (Metal)
cargo build --release -p arc-cli --features metal

# Install globally
cargo install --path arc-cli --features <your-features>

Documentation

License

  • arc-* crates: Business Source License 1.1 — free for non-commercial and sub-$1M revenue use. Commercial inference-as-a-service requires a license from Aeonmind.
  • mistralrs-* crates: MIT — upstream open source.

See NOTICE for details. For commercial licensing: support@runcrate.ai

Credits

Arc is built on mistral.rs by Eric Buehler and contributors, and the Candle ML framework by Hugging Face. TurboQuant is based on research by Zandieh et al. at Google Research (arXiv:2504.19874).


Arc by Aeonmind, LLC
The AI Cloud. Deploy, Scale, Infer.
