A high-performance LLM inference gateway in Rust. Route, load balance, rate-limit, and retry requests across multiple cloud LLM providers through a single API endpoint.
Instead of calling LLM providers directly, Zap sits between your app and the providers:
```
Your app → Zap → Groq     (primary)
               → Cerebras (fallback)
```
Zap makes sense when you've outgrown "one app, one provider." Use it when you're mixing models (GPT-4o for hard problems, Groq for cheap fast calls, local Ollama for sensitive data), when you want automatic failover so users don't notice an OpenAI outage, when you're cutting costs by routing simple queries to free or cheap providers, or when multiple services all need LLM access and you're tired of managing keys and retry logic in each one. If you're only calling one provider, building a prototype, or need provider-specific features that don't fit a generic chat-completions interface, just call the provider directly. Zap is for when LLM calls become plumbing that many things depend on, not a one-off integration.
- Groq — console.groq.com (free tier, ~30 req/min)
- Cerebras — cloud.cerebras.ai (free tier)
```toml
[server]
host = "0.0.0.0"
port = 8000

[queue]
max_size = 1000
timeout_secs = 300

[[backends]]
url = "https://api.groq.com/openai"
weight = 2  # Gets 2x traffic (fastest)
health_path = "/v1/models"
api_key = "gsk_your_groq_key"
default_model = "llama-3.1-8b-instant"

[[backends]]
url = "https://api.cerebras.ai"
weight = 1
health_path = "/v1/models"
api_key = "csk-your_cerebras_key"
default_model = "llama3.1-8b"

[rate_limit]
requests_per_minute = 60
```

```sh
cargo build --release
./target/release/zap
```

```sh
# Short form — no model needed
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

# OpenAI-compatible form
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```

- Multi-provider load balancing — round-robin or least-loaded across Groq, Cerebras, or any OpenAI-compatible API
- Automatic failover — health checks every 10s, unhealthy backends skipped after 3 failures
- Default model injection — clients send messages without specifying a model; Zap injects the backend's `default_model`
- API key injection — per-backend API keys attached as Bearer tokens automatically
- Request queue with backpressure — bounded queue, returns 503 when full
- Per-key rate limiting — sliding window with `Retry-After` headers
- Retry with backoff — exponential backoff + jitter on 5xx/connection errors (max 2 retries)
- SSE streaming — `"stream": true` for real-time token streaming
- Prometheus metrics — request counts, latency, queue depth, errors at `/metrics`
- Configurable paths — custom `health_path` and `chat_path` per backend
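The failover rule above ("skipped after 3 failures") can be sketched as a per-backend failure counter. Zap itself is written in Rust; this is a minimal Python illustration of the idea, and the `BackendHealth` name and `threshold` parameter are assumptions, not part of Zap's API.

```python
class BackendHealth:
    """Track consecutive health-check failures for one backend.

    A backend is considered unhealthy after `threshold` consecutive
    failures; a single successful check resets the counter.
    """

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> None:
        # Success resets the streak; failure extends it.
        self.failures = 0 if ok else self.failures + 1

    @property
    def healthy(self) -> bool:
        return self.failures < self.threshold
```

Resetting on any success means a flapping backend is only skipped while it is consistently failing, which matches the "3 consecutive failures" behavior described above.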
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Base URL of the provider |
| `weight` | integer | required | Routing weight (higher = more traffic) |
| `health_path` | string | `"/health"` | Health check endpoint path |
| `api_key` | string | none | API key injected as `Authorization: Bearer <key>` |
| `default_model` | string | none | Model name injected when client omits it |
| `chat_path` | string | `"/v1/chat/completions"` | Chat completions endpoint path |
| Field | Type | Description |
|---|---|---|
| `max_size` | integer | Max queued requests before 503 |
| `timeout_secs` | integer | Request timeout in seconds |
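The bounded queue's backpressure behavior (reject with 503 when full rather than buffer indefinitely) can be sketched in a few lines. Zap implements this in Rust; the `BoundedQueue` class and its status-code return values below are illustrative assumptions.

```python
import queue


class BoundedQueue:
    """Bounded request queue: accept up to max_size, reject the rest.

    Returns an HTTP-style status: 202 if the request was queued,
    503 if the queue is full (backpressure).
    """

    def __init__(self, max_size: int):
        self.q = queue.Queue(maxsize=max_size)

    def enqueue(self, request) -> int:
        try:
            self.q.put_nowait(request)  # non-blocking: never wait for space
            return 202
        except queue.Full:
            return 503
```

Rejecting immediately instead of blocking keeps latency predictable under overload: clients get a fast 503 they can retry, rather than a request that sits in a growing queue until `timeout_secs` expires.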
| Field | Type | Description |
|---|---|---|
| `requests_per_minute` | integer | Max requests per minute per API key |
Security: Add `config.toml` to `.gitignore` — it contains your API keys.
| Endpoint | Method | Description |
|---|---|---|
| `/chat` | POST | Short-form chat completions |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/health` | GET | Returns `{"status":"ok"}` |
| `/metrics` | GET | Prometheus metrics |
| Field | Type | Required | Description |
|---|---|---|---|
| `messages` | array | yes | Conversation messages with `role` and `content` |
| `model` | string | no | Model override (uses backend's `default_model` if omitted) |
| `stream` | boolean | no | Stream via SSE (default: `false`) |
| `temperature` | float | no | Sampling temperature (0.0 – 2.0) |
| `max_tokens` | integer | no | Max tokens to generate |
```sh
curl --no-buffer http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a poem about Rust"}], "stream": true}'
```

```sh
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write fizzbuzz in Python"}
    ]
  }'
```

```sh
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is 2 + 2?"},
      {"role": "assistant", "content": "4."},
      {"role": "user", "content": "Multiply that by 10"}
    ]
  }'
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",
)

response = client.chat.completions.create(
    model="",  # uses backend default
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```sh
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```

Requests without an `Authorization` header share the "anonymous" rate-limit bucket.
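On the client side, a 429 from the rate limiter carries a `Retry-After` header, and a 503 means the queue is full. A well-behaved client can honor both; here is a minimal sketch, where `call_with_retry` and its `send` callable (returning a `(status, headers, body)` tuple) are illustrative, not part of any SDK.

```python
import time


def call_with_retry(send, max_attempts: int = 3):
    """Call send() and retry on 429/503 responses.

    429: wait for the server-specified Retry-After before retrying.
    503: queue full, back off exponentially.
    Anything else is returned to the caller as-is.
    """
    status, headers, body = send()
    for attempt in range(max_attempts - 1):
        if status == 429:
            time.sleep(float(headers.get("Retry-After", 1)))
        elif status == 503:
            time.sleep(2 ** attempt)
        else:
            break
        status, headers, body = send()
    return status, body
```

Honoring `Retry-After` instead of retrying on a fixed schedule keeps clients from hammering the gateway the moment their window is exhausted.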
```
Client
  │
  ▼
[Rate Limiter] ── 429 ──> Client
  │
  ▼
[Bounded Queue] ── 503 ──> Client
  │
  ▼
[Dispatcher] (spawns tokio task per request)
  │
  ▼
[Load Balancer] (round-robin, weighted)
  │         │
  ▼         ▼
 Groq   Cerebras   (+ API key + model injection)
  │         │
  └────┬────┘
       ▼
[Retry w/ exponential backoff + jitter]
  │
  ▼
Response ──> Client
```
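The weighted round-robin step in the pipeline above can be sketched by expanding each backend into `weight` slots and cycling through them, so a weight-2 backend receives twice the traffic of a weight-1 backend. This is a Python illustration of the routing idea only; Zap's Rust implementation need not work this way internally.

```python
import itertools


def weighted_cycle(backends):
    """Yield backend URLs in weighted round-robin order.

    Each backend dict needs "url" and "weight" keys, mirroring the
    [[backends]] entries in config.toml.
    """
    slots = [b["url"] for b in backends for _ in range(b["weight"])]
    return itertools.cycle(slots)
```

With the example config (Groq weight 2, Cerebras weight 1), six consecutive picks route four requests to Groq and two to Cerebras.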
Any OpenAI-compatible API works. Add a `[[backends]]` block to `config.toml`:

```toml
[[backends]]
url = "https://api.together.xyz"
weight = 1
health_path = "/v1/models"
api_key = "your_key"
default_model = "meta-llama/Llama-3.1-8B-Instruct"
```

For providers with non-standard paths, use `chat_path`:

```toml
chat_path = "/v1beta/openai/chat/completions"
```

```sh
cargo build --release        # Build
cargo test                   # Run tests
cargo clippy -- -D warnings  # Lint
cargo fmt                    # Format
```

MIT