Run multiple llama.cpp instances in parallel — GPU-accelerated, CPU-only, or any heterogeneous mix — from one browser dashboard.
LlamaFleet is a lightweight Node.js control plane and operator dashboard for multi-instance llama.cpp deployments. It partitions a multi-GPU machine — assigning specific GPUs to specific models — and manages the full lifecycle of each instance (launch, reload, drain, restart, remove) from a single browser UI without touching a terminal.
Each instance runs as an independent llama-server process with its own context window, queue limit, TTL, and GPU subset. LlamaFleet tracks state, catches crashes, and auto-restarts instances with configurable backoff.
Every instance is reachable through a single OpenAI-compatible API at http://host:8081/v1/instances/<id>/proxy/v1/... — same bearer token, same /v1/chat/completions and /v1/completions endpoints. A top-level /v1/chat/completions endpoint routes by model name with automatic round-robin across instances sharing the same model, so you can set base_url = http://host:8081/v1 once and let LlamaFleet handle load distribution.
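For example, with auth enabled, a pooled request through the top-level router looks roughly like this (the model name `MyModel` is a placeholder for whatever your instances serve):

```bash
# Routed by model name; LlamaFleet round-robins across running instances of "MyModel"
curl -s http://host:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "MyModel", "messages": [{"role": "user", "content": "Hello"}]}'
```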
Key capabilities:

- Per-instance GPU pinning via `CUDA_VISIBLE_DEVICES` and equivalents for AMD/Intel/Metal
- Headless process management — start, stop, drain, kill, remove from the browser or API
- OpenAI-compatible reverse proxy per instance — all `llama-server` processes bind to `127.0.0.1`; one port for everything
- Named model routing with round-robin pool support — `POST /v1/chat/completions` with `"model": "MyModel"` round-robins across all running instances of that model; append `-1`, `-2`, etc. to pin to a specific instance (e.g. `"model": "MyModel-1"`). `GET /v1/models` returns both the pool entry and each pinned alias so any OpenAI client can discover them automatically (see the example after this list).
- Heterogeneous compute pools — combine GPU-accelerated (NVIDIA/AMD/Intel), CPU-only, and mixed-offload `llama-server` instances in the same round-robin pool under a single model name. Run a fast CUDA instance alongside a CPU fallback, or pool instances across different GPU vendors, and LlamaFleet distributes load across all of them automatically.
- Global bearer token auth for both dashboard and all proxy traffic
- Config profiles — save a model + GPU + context + TTL combination and relaunch in one click
- Auto-restart with configurable backoff on unclean exits
- Periodic health monitoring — instances are polled every 30 s and auto-restarted if unhealthy
- Prometheus scrape endpoint at `GET /metrics` (per-instance + per-GPU telemetry)
- Compact VRAM bars in the GPU column with utilisation %, temperature, and power
- Log viewer with auto-tail and clone-setup action per instance
- Model Routing dashboard section — visual overview of which instances form a round-robin pool vs. solo routes, with one-click copy of each pinned model name
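A quick way to check what the router exposes and to pin a request to one instance (model names shown are placeholders):

```bash
# Discover the pool entry plus each pinned alias (MyModel, MyModel-1, MyModel-2, ...)
curl -s -H "Authorization: Bearer $API_AUTH_TOKEN" http://host:8081/v1/models

# Bypass round-robin and target a specific instance via its pinned alias
curl -s http://host:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "MyModel-1", "messages": [{"role": "user", "content": "Hello"}]}'
```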
LlamaFleet uses GGUF models via llama-server directly — no LM Studio or Ollama required. Works on NVIDIA (including pre-Ampere V100/10xx/20xx), AMD, and CPU.
Security note: LlamaFleet is a single-tenant control plane designed for trusted local networks and homelabs. Do not expose port `8081` to the public internet without a reverse proxy and firewall rules. See SECURITY.md for deployment guidance.
All three tools are built on top of llama.cpp, so they share the same hardware support and quantisation formats. The differences are in how much control you get over each running process and how you compose them.
| | LlamaFleet | Ollama | LM Studio |
|---|---|---|---|
| Pass-through `llama-server` flags | ✅ Any flag, edited per instance | ⚠ Subset via Modelfile params | ⚠ Subset via GUI/JSON |
| Per-instance GPU pinning | ✅ Explicit `CUDA_VISIBLE_DEVICES` per process | ⚠ Global env var | ⚠ Per-model GPU select (recent versions) |
| Multiple models loaded at once | ✅ Unlimited, independent processes | ✅ Via `OLLAMA_MAX_LOADED_MODELS` | ✅ JIT-loaded |
| Round-robin pooling under one model name | ✅ Built-in across instances | ❌ | ❌ |
| Heterogeneous pools (mix GPU/CPU/runtimes) | ✅ Mix any runtimes under one model name | ❌ | ❌ |
| Any local GGUF | ✅ Scan paths + HF Hub browser | ✅ `FROM ./model.gguf` in Modelfile | ✅ Local files + HF browser |
| Browser dashboard | ✅ | ❌ (3rd-party only) | ❌ Desktop GUI only |
| OpenAI-compatible REST API | ✅ | ✅ | ✅ |
| Headless server / SSH box / systemd | ✅ Designed for it | ✅ | ❌ Desktop app |
| Multi-user auth / RBAC | ❌ Single shared bearer token | ❌ No auth | ❌ No auth |
The short version: if you need to carve up a multi-GPU box into several llama-server processes with explicit per-process hardware control, and pool them behind a single OpenAI endpoint, LlamaFleet is built for that. If you want a one-command model registry on your laptop, Ollama is simpler. If you want a polished desktop GUI for trying models locally, LM Studio is hard to beat.
```bash
curl -fsSL https://github.com/boringresearchjames/llamafleet/releases/latest/download/install.sh | sudo bash
```

Auto-detects your GPU (NVIDIA/AMD/Vulkan/CPU), installs a matching llama-server binary, and sets up the systemd service. After install:

```bash
# Edit your tokens
sudo nano /etc/llamafleet/llamafleet.env
sudo systemctl restart llamafleet
```

Open http://localhost:8081.
- Node.js 18+
- A `llama-server` binary — download a pre-built binary from the llama.cpp releases page or build from source:

```bash
cmake -B build -DGGML_CUDA=on && cmake --build build --target llama-server -j$(nproc)
```
| Platform | Binary to download |
|---|---|
| Linux (NVIDIA, CUDA 12) | llama-*-bin-linux-x64-cuda-cu12*.zip |
| Linux (AMD, ROCm) | llama-*-bin-linux-x64-rocm*.zip |
| Linux (CPU) | llama-*-bin-linux-x64-avx2*.zip |
| Windows (NVIDIA, CUDA 12) | llama-*-bin-win-cuda-cu12-x64.zip |
| Windows (CPU / AVX2) | llama-*-bin-win-avx2-x64.zip |
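If you build from source for a non-CUDA backend, only the backend flag changes; a sketch (flag names vary between llama.cpp versions, so check the llama.cpp build docs for your checkout):

```bash
# CPU-only build (no backend flag needed)
cmake -B build && cmake --build build --target llama-server -j$(nproc)

# Vulkan build (cross-vendor GPU)
cmake -B build -DGGML_VULKAN=ON && cmake --build build --target llama-server -j$(nproc)
```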
```bash
git clone https://github.com/boringresearchjames/llamafleet.git
cd llamafleet
npm run install:deps
```

Minimum required variables:
| Variable | Purpose |
|---|---|
| `LLAMA_SERVER_BIN` | Full path to your llama-server binary |
| `API_AUTH_TOKEN` | Bearer token for the dashboard and API (omit to disable auth) |
| `BRIDGE_AUTH_TOKEN` | Internal API<->bridge token (omit to disable) |
```bash
# Linux
export LLAMA_SERVER_BIN=/usr/local/bin/llama-server
export API_AUTH_TOKEN=change-me
export BRIDGE_AUTH_TOKEN=change-me
```

```powershell
# Windows (PowerShell)
$env:LLAMA_SERVER_BIN = "C:\Tools\llama\llama-server.exe"
$env:API_AUTH_TOKEN = "change-me"
$env:BRIDGE_AUTH_TOKEN = "change-me"
```

`MODELS_DIR` — LlamaFleet auto-scans `~/.lmstudio/models`, `~/.ollama/models`, `~/.cache/huggingface/hub`, and `~/unsloth_studio`. Override with:

```bash
export MODELS_DIR=/mnt/nas/models
```

Then start LlamaFleet:

```bash
npm start
```

Open http://localhost:8081.
For watch-mode restarts during development:
```bash
npm run dev
```

The full REST API reference is in docs/api.md.
It is also served live at http://localhost:8081/help with syntax-highlighted endpoint listings, request/response schemas, and Prometheus metric names.
All endpoints require `Authorization: Bearer <token>` when auth is enabled. Endpoints marked `[admin]` require the server `API_AUTH_TOKEN` specifically.
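For example, an authenticated call to the Prometheus endpoint mentioned earlier:

```bash
# Same bearer token as the dashboard; returns per-instance and per-GPU telemetry
curl -s -H "Authorization: Bearer $API_AUTH_TOKEN" http://localhost:8081/metrics
```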
LlamaFleet is two core Node.js services plus an optional bridge router:
- API + dashboard (`apps/api`, port `8081`) — serves the browser dashboard and REST API. Owns all state persistence (`state.json`) and config profiles. Authenticates inbound requests via `API_AUTH_TOKEN`.
- Host bridge (`apps/host-bridge`, port `8090`) — runs natively on the host and spawns `llama-server` child processes, one per instance. Enforces `CUDA_VISIBLE_DEVICES` and six other device-visibility env vars for GPU pinning. Polls instance readiness and captures GPU telemetry via `nvidia-smi`.
- Bridge router (`apps/bridge-router`, optional) — sits between the API and multiple host bridges for multi-host deployments. Configure via `BRIDGE_POOLS_JSON`.
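A minimal sketch of wiring the two core services together on one host, using only variables from the configuration tables below (values are examples; both services must see the same internal token):

```bash
# Host bridge listen port (keep it firewalled; see the security notes)
export BRIDGE_PORT=8090
# The API process must point at that bridge and share the internal token
export BRIDGE_URL=http://127.0.0.1:8090
export BRIDGE_AUTH_TOKEN=$(openssl rand -hex 32)
```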
```bash
sudo bash scripts/install-systemd.sh
```

- Service unit: `deploy/systemd/llamafleet.service`
- Env template: `deploy/systemd/env/llamafleet.env.example` -> `/etc/llamafleet/llamafleet.env`
- Full runbook: `deploy/systemd/README.md`
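Typical post-install checks with standard systemd tooling (the unit name matches the installer above):

```bash
sudo systemctl enable --now llamafleet   # start now and on every boot
systemctl status llamafleet              # confirm the service is active
journalctl -u llamafleet -f              # follow the service logs
```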
| Variable | Default | Description |
|---|---|---|
| `API_AUTH_TOKEN` | (unset) | Bearer token for dashboard + API. Unset = auth disabled. |
| `BRIDGE_AUTH_TOKEN` | (unset) | Internal API<->bridge token. Unset = bridge auth disabled. |
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8081` | API + dashboard listen port |
| `BRIDGE_URL` | `http://127.0.0.1:8090` | URL of the host bridge |
| `STATE_FILE` | `./data/state.json` | Persistent state path |
| `SHARED_CONFIG_FILE` | `./data/shared-config.yaml` | Shared config (profiles, security) |
| `MODELS_DIR` | `~/.lmstudio/models` | Primary directory scanned for `.gguf` files. Also auto-scans `~/.ollama/models`, `~/.cache/huggingface/hub`, `~/unsloth_studio` |
| `LLAMAFLEET_PUBLIC_HOST` | (unset) | This machine's IP, used in proxy URLs shown in the dashboard |
| `CORS_ORIGIN` | `*` | Value of `Access-Control-Allow-Origin` |
| Variable | Default | Description |
|---|---|---|
| `BRIDGE_PORT` | `8090` | Bridge listen port |
| `LLAMA_SERVER_BIN` | `llama-server` | Path to the llama-server binary |
| `DATA_ROOT` | `./data` | Root directory for logs and instance metadata |
| `LOG_LINES_DEFAULT` | `200` | Default line count for log tail requests |
| `READINESS_POLL_MS` | `2000` | How often to poll instance `/health` during startup |
| `READINESS_HTTP_TIMEOUT_MS` | `5000` | Per-request timeout during readiness polling |
| `GPU_BLEED_MAX_DELTA_MIB` | `256` | Max allowed post-stop VRAM increase before flagging bleed |
| `SMOKE_CHECK_ENABLED` | `false` | Run a test inference after startup to verify the instance responds |
| `STRICT_SMOKE_CHECK` | `false` | Treat a failed smoke check as a fatal startup error |
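Pulling the tables together, a minimal `/etc/llamafleet/llamafleet.env` might look like this (every value is a placeholder; set only what differs from the defaults):

```bash
# /etc/llamafleet/llamafleet.env (example values only)
API_AUTH_TOKEN=replace-with-32-plus-hex-chars
BRIDGE_AUTH_TOKEN=replace-with-32-plus-hex-chars
LLAMA_SERVER_BIN=/usr/local/bin/llama-server
MODELS_DIR=/mnt/nas/models
```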
This started as a practical fix for running a fleet of V100s with LM Studio: VRAM was bleeding between GPUs after model swaps, processes crashed under sustained load, and there was no way to pin a model to specific cards or isolate instances. No existing tool — GUI or headless — managed independent llama-server processes per GPU subset from a single control plane. LlamaFleet is that tool.
The GPU isolation problem is particularly relevant for pre-Ampere hardware. vLLM and SGLang require CUDA 11+ with Ampere-class features; older V100s and GTX 10/20 series cards either hit capability gaps or produce incorrect results. llama.cpp supports this hardware generation well, but running multiple instances with correct CUDA_VISIBLE_DEVICES per process, log management, readiness polling, and crash recovery is operationally tedious. LlamaFleet wraps all of that.
LlamaFleet is designed for homelab and internal network deployments. It is not hardened for direct public internet exposure. Follow these recommendations before deploying:
Always do:
- Set `API_AUTH_TOKEN` and `BRIDGE_AUTH_TOKEN` to long random strings (32+ hex chars). Without these, the API and dashboard are open to anyone on the network.
- Bind port `8081` to your internal network interface only, not `0.0.0.0`, unless you intend it to be reachable network-wide.
- Keep port `8090` (the host bridge) firewalled — it should only be reachable from the API process on `127.0.0.1`. It has no auth by default.
If you expose port 8081 beyond your LAN:
- Put a reverse proxy (nginx, Caddy, Traefik) in front and terminate TLS there. LlamaFleet serves plain HTTP.
- Restrict the path via the proxy if you only want API access (not the dashboard).
- Consider IP allowlisting at the firewall or proxy level.
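For example, a minimal Caddyfile that terminates TLS and forwards to LlamaFleet (the domain is a placeholder; Caddy obtains certificates automatically for public hostnames):

```
llamafleet.example.com {
    reverse_proxy 127.0.0.1:8081
}
```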
Token generation:
```bash
# Linux / macOS
openssl rand -hex 32
```

```powershell
# PowerShell
-join ((1..32) | ForEach-Object { '{0:x2}' -f (Get-Random -Max 256) })
```

What LlamaFleet does not provide:
- TLS — use a reverse proxy
- Per-user or per-instance auth — one global token for everything
- Rate limiting — your reverse proxy or firewall should handle this
- Audit logging for individual API calls — only instance lifecycle events are logged
- Single-host only — LlamaFleet manages instances on one machine. A bridge-router component exists for multi-host setups but multi-host is not the primary target.
- `llama-server` binary required — LlamaFleet does not bundle or build it. Get a binary from the llama.cpp releases page.
- No model download management — models must be present on the host filesystem.
- No per-instance auth — auth is enforced at the proxy layer via the global `API_AUTH_TOKEN`. There is no per-instance key.
- No speculative decoding or prefix caching — pass the relevant `llama-server` flags via `runtimeArgs` if the binary supports them.
MIT
