Zap

A high-performance LLM inference gateway in Rust. Route, load balance, rate-limit, and retry requests across multiple cloud LLM providers through a single API endpoint.

Why Zap?

Instead of calling LLM providers directly, Zap sits between your app and the providers:

Your app  →  Zap  →  Groq (primary)
                  →  Cerebras (fallback)

Zap makes sense when you've outgrown "one app, one provider." Use it when:

  • you're mixing models (GPT-4o for hard stuff, Groq for cheap fast calls, local Ollama for sensitive data)
  • you want automatic failover so users don't notice an OpenAI outage
  • you're saving money by routing simple queries to free/cheap providers
  • multiple services all need LLM access and you're tired of managing keys and retry logic in each one

If you're only calling one provider, building a prototype, or need provider-specific features that don't fit a generic chat completions interface, just call the provider directly. Zap is for when LLM calls become plumbing that lots of things depend on, not a one-off integration.

Quickstart

1. Get free API keys from your providers (e.g. Groq and Cerebras)

2. Configure config.toml

[server]
host = "0.0.0.0"
port = 8000

[queue]
max_size = 1000
timeout_secs = 300

[[backends]]
url = "https://api.groq.com/openai"
weight = 2                              # Gets 2x traffic (fastest)
health_path = "/v1/models"
api_key = "gsk_your_groq_key"
default_model = "llama-3.1-8b-instant"

[[backends]]
url = "https://api.cerebras.ai"
weight = 1
health_path = "/v1/models"
api_key = "csk-your_cerebras_key"
default_model = "llama3.1-8b"

[rate_limit]
requests_per_minute = 60

3. Build and run

cargo build --release
./target/release/zap

4. Send a request

# Short form — no model needed
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

# OpenAI-compatible form
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Features

  • Multi-provider load balancing — round-robin or least-loaded across Groq, Cerebras, or any OpenAI-compatible API
  • Automatic failover — health checks every 10s, unhealthy backends skipped after 3 failures
  • Default model injection — clients send messages without specifying a model; Zap injects the backend's default_model
  • API key injection — per-backend API keys attached as Bearer tokens automatically
  • Request queue with backpressure — bounded queue, returns 503 when full
  • Per-key rate limiting — sliding window with Retry-After headers
  • Retry with backoff — exponential backoff + jitter on 5xx/connection errors (max 2 retries)
  • SSE streaming — "stream": true for real-time token streaming
  • Prometheus metrics — request counts, latency, queue depth, errors at /metrics
  • Configurable paths — custom health_path and chat_path per backend
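
The default-model and API-key injection above amounts to filling in fields the client omitted before forwarding the request. A minimal Python sketch of the idea (not Zap's actual Rust code; `prepare_request` is a hypothetical helper):

```python
def prepare_request(body: dict, backend: dict) -> tuple[dict, dict]:
    """Fill in backend defaults the client omitted (illustrative only)."""
    body = dict(body)  # copy so we don't mutate the caller's payload
    if not body.get("model") and backend.get("default_model"):
        body["model"] = backend["default_model"]
    headers = {"Content-Type": "application/json"}
    if backend.get("api_key"):
        headers["Authorization"] = f"Bearer {backend['api_key']}"
    return body, headers

body, headers = prepare_request(
    {"messages": [{"role": "user", "content": "Hello!"}]},
    {"api_key": "gsk_xxx", "default_model": "llama-3.1-8b-instant"},
)
```

This is why the curl examples in the Quickstart work without a `model` field or an `Authorization` header.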

Configuration

[[backends]]

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Base URL of the provider |
| `weight` | integer | required | Routing weight (higher = more traffic) |
| `health_path` | string | `"/health"` | Health check endpoint path |
| `api_key` | string | none | API key injected as `Authorization: Bearer <key>` |
| `default_model` | string | none | Model name injected when client omits it |
| `chat_path` | string | `"/v1/chat/completions"` | Chat completions endpoint path |
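
One common way to implement the weighted routing described by `weight` (Zap's exact algorithm isn't shown here; this is a sketch of weighted round-robin):

```python
import itertools

def weighted_cycle(backends):
    """Expand each backend by its weight, then round-robin over the list.
    A weight-2 backend appears twice per cycle, so it receives 2x traffic."""
    expanded = [b for b in backends for _ in range(b["weight"])]
    return itertools.cycle(expanded)

backends = [
    {"url": "https://api.groq.com/openai", "weight": 2},
    {"url": "https://api.cerebras.ai", "weight": 1},
]
picker = weighted_cycle(backends)
order = [next(picker)["url"] for _ in range(6)]
# Over six picks, Groq is chosen twice as often as Cerebras.
```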

[queue]

| Field | Type | Description |
|---|---|---|
| `max_size` | integer | Max queued requests before 503 |
| `timeout_secs` | integer | Request timeout in seconds |
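
The backpressure behavior is easy to picture: a bounded queue that rejects rather than blocks once `max_size` is reached. A Python sketch of the semantics (not Zap's Rust implementation; the small bound is just for the demo):

```python
import queue

q = queue.Queue(maxsize=2)  # stand-in for [queue] max_size

def enqueue(request):
    """Return an HTTP-style status: 503 when the queue is full."""
    try:
        q.put_nowait(request)  # never blocks; raises queue.Full instead
        return 202
    except queue.Full:
        return 503

statuses = [enqueue(i) for i in range(3)]
# First two requests fit; the third is shed with 503.
```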

[rate_limit]

| Field | Type | Description |
|---|---|---|
| `requests_per_minute` | integer | Max requests per minute per API key |
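
The "sliding window" limiting mentioned in Features can be sketched as: keep each key's request timestamps, drop those older than 60 seconds, and reject once the remainder reaches the limit. A minimal Python illustration (not Zap's code; `now` is injectable only to make the demo deterministic):

```python
from collections import defaultdict, deque
import time

WINDOW = 60.0  # seconds

class SlidingWindowLimiter:
    """Per-key sliding window: allow a request only if fewer than
    `limit` requests were seen in the last WINDOW seconds."""
    def __init__(self, limit):
        self.limit = limit
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        while window and now - window[0] >= WINDOW:
            window.popleft()  # expire hits outside the window
        if len(window) >= self.limit:
            return False  # caller responds 429 with a Retry-After header
        window.append(now)
        return True

limiter = SlidingWindowLimiter(limit=2)
results = [limiter.allow("my-api-key", now=t) for t in (0.0, 1.0, 2.0, 61.0)]
# Third request is rejected; by t=61 the old hits have expired.
```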

Security: Add config.toml to .gitignore — it contains your API keys.

API

| Endpoint | Method | Description |
|---|---|---|
| `/chat` | POST | Short-form chat completions |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/health` | GET | Returns `{"status":"ok"}` |
| `/metrics` | GET | Prometheus metrics |

Request body

| Field | Type | Required | Description |
|---|---|---|---|
| `messages` | array | yes | Conversation messages with `role` and `content` |
| `model` | string | no | Model override (uses backend's `default_model` if omitted) |
| `stream` | boolean | no | Stream via SSE (default: `false`) |
| `temperature` | float | no | Sampling temperature (0.0 – 2.0) |
| `max_tokens` | integer | no | Max tokens to generate |

Examples

Streaming

curl --no-buffer http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a poem about Rust"}], "stream": true}'

With system prompt

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write fizzbuzz in Python"}
    ]
  }'

Multi-turn conversation

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is 2 + 2?"},
      {"role": "assistant", "content": "4."},
      {"role": "user", "content": "Multiply that by 10"}
    ]
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",
)

response = client.chat.completions.create(
    model="",  # uses backend default
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

With rate limiting

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Requests without Authorization share the "anonymous" rate limit bucket.

Architecture

Client
  │
  ▼
[Rate Limiter]  ── 429 ──>  Client
  │
  ▼
[Bounded Queue]  ── 503 ──>  Client
  │
  ▼
[Dispatcher]  (spawns tokio task per request)
  │
  ▼
[Load Balancer]  (round-robin, weighted)
  │         │
  ▼         ▼
Groq    Cerebras     (+API key + model injection)
  │         │
  └────┬────┘
       ▼
[Retry w/ exponential backoff + jitter]
       │
       ▼
  Response ──> Client
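
The retry stage above uses exponential backoff with jitter (max 2 retries). One common scheme is "full jitter": draw each delay uniformly from zero up to an exponentially growing cap. A sketch (the base and cap values are assumptions, not Zap's actual constants):

```python
import random

def backoff_delays(max_retries=2, base=0.5, cap=8.0, seed=None):
    """Full-jitter backoff: delay for attempt n is drawn uniformly
    from [0, min(cap, base * 2**n)], so retries spread out over time."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

delays = backoff_delays(seed=42)
# Two delays (one per retry), each bounded by the exponential ceiling.
```

Jitter matters here because many queued requests may fail against the same backend at once; randomized delays keep the retries from retrying in lockstep.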

Adding a new provider

Any OpenAI-compatible API works. Add a [[backends]] block to config.toml:

[[backends]]
url = "https://api.together.xyz"
weight = 1
health_path = "/v1/models"
api_key = "your_key"
default_model = "meta-llama/Llama-3.1-8B-Instruct"

For providers with non-standard paths, use chat_path:

chat_path = "/v1beta/openai/chat/completions"

Development

cargo build --release       # Build
cargo test                  # Run tests
cargo clippy -- -D warnings # Lint
cargo fmt                   # Format

License

MIT
