LlamaFleet

Run multiple llama.cpp instances in parallel — GPU-accelerated, CPU-only, or any heterogeneous mix — from one browser dashboard.

LlamaFleet is a lightweight Node.js control plane and operator dashboard for multi-instance llama.cpp deployments. It partitions a multi-GPU machine — assigning specific GPUs to specific models — and manages the full lifecycle of each instance (launch, reload, drain, restart, remove) from a single browser UI without touching a terminal.

Each instance runs as an independent llama-server process with its own context window, queue limit, TTL, and GPU subset. LlamaFleet tracks state, catches crashes, and auto-restarts instances with configurable backoff.

Every instance is reachable through a single OpenAI-compatible API at http://host:8081/v1/instances/<id>/proxy/v1/... — same bearer token, same /v1/chat/completions and /v1/completions endpoints. A top-level /v1/chat/completions endpoint routes by model name with automatic round-robin across instances sharing the same model, so you can set base_url = http://host:8081/v1 once and let LlamaFleet handle load distribution.

Key capabilities:

Per-instance GPU pinning via CUDA_VISIBLE_DEVICES and equivalents for AMD/Intel/Metal
Headless process management — start, stop, drain, kill, remove from the browser or API
OpenAI-compatible reverse proxy per instance — all llama-server processes bind to 127.0.0.1; one port for everything
Named model routing with round-robin pool support — POST /v1/chat/completions with "model": "MyModel" round-robins across all running instances of that model; append -1, -2, etc. to pin to a specific instance (e.g. "model": "MyModel-1"). GET /v1/models returns both the pool entry and each pinned alias so any OpenAI client can discover them automatically.
Heterogeneous compute pools — combine GPU-accelerated (NVIDIA/AMD/Intel), CPU-only, and mixed-offload llama-server instances in the same round-robin pool under a single model name. Run a fast CUDA instance alongside a CPU fallback, or pool instances across different GPU vendors, and LlamaFleet distributes load across all of them automatically.
Global bearer token auth for both dashboard and all proxy traffic
Config profiles — save a model + GPU + context + TTL combination and relaunch in one click
Auto-restart with configurable backoff on unclean exits
Periodic health monitoring — instances are polled every 30 s and auto-restarted if unhealthy
Prometheus scrape endpoint at GET /metrics (per-instance + per-GPU telemetry)
Compact VRAM bars in the GPU column with utilisation %, temperature, and power
Log viewer with auto-tail and clone-setup action per instance
Model Routing dashboard section — visual overview of which instances form a round-robin pool vs. solo routes, with one-click copy of each pinned model name

LlamaFleet uses GGUF models via llama-server directly — no LM Studio or Ollama required. Works on NVIDIA (including pre-Ampere V100/10xx/20xx), AMD, and CPU.

Security note: LlamaFleet is a single-tenant control plane designed for trusted local networks and homelabs. Do not expose port 8081 to the public internet without a reverse proxy and firewall rules. See SECURITY.md for deployment guidance.

Why LlamaFleet instead of Ollama / LM Studio?

All three tools are built on top of llama.cpp, so they share the same hardware support and quantisation formats. The differences are in how much control you get over each running process and how you compose them.

	LlamaFleet	Ollama	LM Studio
Pass-through `llama-server` flags	✅ Any flag, edited per instance	⚠ Subset via `Modelfile` params	⚠ Subset via GUI/JSON
Per-instance GPU pinning	✅ Explicit `CUDA_VISIBLE_DEVICES` per process	⚠ Global env var	⚠ Per-model GPU select (recent versions)
Multiple models loaded at once	✅ Unlimited, independent processes	✅ Via `OLLAMA_MAX_LOADED_MODELS`	✅ JIT-loaded
Round-robin pooling under one model name	✅ Built-in across instances	❌	❌
Heterogeneous pools (mix GPU/CPU/runtimes)	✅ Mix any runtimes under one model name	❌	❌
Any local GGUF	✅ Scan paths + HF Hub browser	✅ `FROM ./model.gguf` in Modelfile	✅ Local files + HF browser
Browser dashboard	✅	❌ (3rd-party only)	❌ Desktop GUI only
OpenAI-compatible REST API	✅	✅	✅
Headless server / SSH box / systemd	✅ Designed for it	✅	❌ Desktop app
Multi-user auth / RBAC	❌ Single shared bearer token	❌ No auth	❌ No auth

The short version: if you need to carve up a multi-GPU box into several llama-server processes with explicit per-process hardware control, and pool them behind a single OpenAI endpoint, LlamaFleet is built for that. If you want a one-command model registry on your laptop, Ollama is simpler. If you want a polished desktop GUI for trying models locally, LM Studio is hard to beat.

Dashboard

Quick Start (Linux — one line)

curl -fsSL https://github.com/boringresearchjames/llamafleet/releases/latest/download/install.sh | sudo bash

Auto-detects your GPU (NVIDIA/AMD/Vulkan/CPU), installs a matching llama-server binary, and sets up the systemd service. After install:

# Edit your tokens
sudo nano /etc/llamafleet/llamafleet.env

sudo systemctl restart llamafleet

Open http://localhost:8081.

Quick Start (Manual / Development)

1. Prerequisites

Node.js 18+
A llama-server binary — download a pre-built binary from the llama.cpp releases page or build from source:
```
cmake -B build -DGGML_CUDA=on && cmake --build build --target llama-server -j$(nproc)
```

Platform	Binary to download
Linux (NVIDIA, CUDA 12)	`llama--bin-linux-x64-cuda-cu12.zip`
Linux (AMD, ROCm)	`llama--bin-linux-x64-rocm.zip`
Linux (CPU)	`llama--bin-linux-x64-avx2.zip`
Windows (NVIDIA, CUDA 12)	`llama-*-bin-win-cuda-cu12-x64.zip`
Windows (CPU / AVX2)	`llama-*-bin-win-avx2-x64.zip`

2. Install

git clone https://github.com/boringresearchjames/llamafleet.git
cd llamafleet
npm run install:deps

3. Configure

Minimum required variables:

Variable	Purpose
`LLAMA_SERVER_BIN`	Full path to your `llama-server` binary
`API_AUTH_TOKEN`	Bearer token for the dashboard and API (omit to disable auth)
`BRIDGE_AUTH_TOKEN`	Internal API<->bridge token (omit to disable)

# Linux
export LLAMA_SERVER_BIN=/usr/local/bin/llama-server
export API_AUTH_TOKEN=change-me
export BRIDGE_AUTH_TOKEN=change-me

# Windows (PowerShell)
$env:LLAMA_SERVER_BIN = "C:\Tools\llama\llama-server.exe"
$env:API_AUTH_TOKEN   = "change-me"
$env:BRIDGE_AUTH_TOKEN = "change-me"

MODELS_DIR — LlamaFleet auto-scans ~/.lmstudio/models, ~/.ollama/models, ~/.cache/huggingface/hub, and ~/unsloth_studio. Override with:

export MODELS_DIR=/mnt/nas/models

4. Run

npm start

Open http://localhost:8081.

For watch-mode restarts during development:

npm run dev

API Reference

The full REST API reference is in docs/api.md.

It is also served live at http://localhost:8081/help with syntax-highlighted endpoint listings, request/response schemas, and Prometheus metric names.

All endpoints require Authorization: Bearer <token> when auth is enabled. Endpoints marked [admin] require the server API_AUTH_TOKEN specifically.

Architecture

LlamaFleet is two core Node.js services plus an optional bridge router:

API + dashboard (apps/api, port 8081) — serves the browser dashboard and REST API. Owns all state persistence (state.json) and config profiles. Authenticates inbound requests via API_AUTH_TOKEN.
Host bridge (apps/host-bridge, port 8090) — runs natively on the host and spawns llama-server child processes, one per instance. Enforces CUDA_VISIBLE_DEVICES and six other device-visibility env vars for GPU pinning. Polls instance readiness and captures GPU telemetry via nvidia-smi.
Bridge router (apps/bridge-router, optional) — sits between the API and multiple host bridges for multi-host deployments. Configure via BRIDGE_POOLS_JSON.

Systemd deployment (Linux)

sudo bash scripts/install-systemd.sh

Service unit: deploy/systemd/llamafleet.service
Env template: deploy/systemd/env/llamafleet.env.example -> /etc/llamafleet/llamafleet.env
Full runbook: deploy/systemd/README.md

Environment Variables

Shared

Variable	Default	Description
`API_AUTH_TOKEN`	(unset)	Bearer token for dashboard + API. Unset = auth disabled.
`BRIDGE_AUTH_TOKEN`	(unset)	Internal API<->bridge token. Unset = bridge auth disabled.

API (`apps/api`)

Variable	Default	Description
`PORT`	`8081`	API + dashboard listen port
`BRIDGE_URL`	`http://127.0.0.1:8090`	URL of the host bridge
`STATE_FILE`	`./data/state.json`	Persistent state path
`SHARED_CONFIG_FILE`	`./data/shared-config.yaml`	Shared config (profiles, security)
`MODELS_DIR`	`~/.lmstudio/models`	Primary directory scanned for `.gguf` files. Also auto-scans `~/.ollama/models`, `~/.cache/huggingface/hub`, `~/unsloth_studio`
`LLAMAFLEET_PUBLIC_HOST`	(unset)	This machine's IP, used in proxy URLs shown in the dashboard
`CORS_ORIGIN`	`*`	Value of `Access-Control-Allow-Origin`

Bridge (`apps/host-bridge`)

Variable	Default	Description
`BRIDGE_PORT`	`8090`	Bridge listen port
`LLAMA_SERVER_BIN`	`llama-server`	Path to the `llama-server` binary
`DATA_ROOT`	`./data`	Root directory for logs and instance metadata
`LOG_LINES_DEFAULT`	`200`	Default line count for log tail requests
`READINESS_POLL_MS`	`2000`	How often to poll instance `/health` during startup
`READINESS_HTTP_TIMEOUT_MS`	`5000`	Per-request timeout during readiness polling
`GPU_BLEED_MAX_DELTA_MIB`	`256`	Max allowed post-stop VRAM increase before flagging bleed
`SMOKE_CHECK_ENABLED`	`false`	Run a test inference after startup to verify the instance responds
`STRICT_SMOKE_CHECK`	`false`	Treat a failed smoke check as a fatal startup error

Background

This started as a practical fix for running a fleet of V100s with LM Studio: VRAM was bleeding between GPUs after model swaps, processes crashed under sustained load, and there was no way to pin a model to specific cards or isolate instances. No existing tool — GUI or headless — managed independent llama-server processes per GPU subset from a single control plane. LlamaFleet is that tool.

The GPU isolation problem is particularly relevant for pre-Ampere hardware. vLLM and SGLang require CUDA 11+ with Ampere-class features; older V100s and GTX 10/20 series cards either hit capability gaps or produce incorrect results. llama.cpp supports this hardware generation well, but running multiple instances with correct CUDA_VISIBLE_DEVICES per process, log management, readiness polling, and crash recovery is operationally tedious. LlamaFleet wraps all of that.

Network Security

LlamaFleet is designed for homelab and internal network deployments. It is not hardened for direct public internet exposure. Follow these recommendations before deploying:

Always do:

Set API_AUTH_TOKEN and BRIDGE_AUTH_TOKEN to long random strings (32+ hex chars). Without these, the API and dashboard are open to anyone on the network.
Bind port 8081 to your internal network interface only, not 0.0.0.0, unless you intend it to be reachable network-wide.
Keep port 8090 (the host bridge) firewalled — it should only be reachable from the API process on 127.0.0.1. It has no auth by default.

If you expose port 8081 beyond your LAN:

Put a reverse proxy (nginx, Caddy, Traefik) in front and terminate TLS there. LlamaFleet serves plain HTTP.
Restrict the path via the proxy if you only want API access (not the dashboard).
Consider IP allowlisting at the firewall or proxy level.

Token generation:

# Linux / macOS
openssl rand -hex 32

# PowerShell
-join ((1..32) | ForEach-Object { '{0:x2}' -f (Get-Random -Max 256) })

What LlamaFleet does not provide:

TLS — use a reverse proxy
Per-user or per-instance auth — one global token for everything
Rate limiting — your reverse proxy or firewall should handle this
Audit logging for individual API calls — only instance lifecycle events are logged

Known Limitations

Single-host only — LlamaFleet manages instances on one machine. A bridge-router component exists for multi-host setups but multi-host is not the primary target.
llama-server binary required — LlamaFleet does not bundle or build it. Get a binary from the llama.cpp releases page.
No model download management — Models must be present on the host filesystem.
No per-instance auth — Auth is enforced at the proxy layer via the global API_AUTH_TOKEN. There is no per-instance key.
No speculative decoding or prefix caching — Pass the relevant llama-server flags via runtimeArgs if the binary supports them.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 126 Commits
.github/workflows		.github/workflows
apps		apps
deploy/systemd		deploy/systemd
docs		docs
scripts		scripts
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
package-lock.json		package-lock.json
package.json		package.json
vitest.config.js		vitest.config.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LlamaFleet

Why LlamaFleet instead of Ollama / LM Studio?

Dashboard

Quick Start (Linux — one line)

Quick Start (Manual / Development)

1. Prerequisites

2. Install

3. Configure

4. Run

API Reference

Architecture

Systemd deployment (Linux)

Environment Variables

Shared

API (`apps/api`)

Bridge (`apps/host-bridge`)

Background

Network Security

Known Limitations

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LlamaFleet

Why LlamaFleet instead of Ollama / LM Studio?

Dashboard

Quick Start (Linux — one line)

Quick Start (Manual / Development)

1. Prerequisites

2. Install

3. Configure

4. Run

API Reference

Architecture

Systemd deployment (Linux)

Environment Variables

Shared

API (apps/api)

Bridge (apps/host-bridge)

Background

Network Security

Known Limitations

License

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

API (`apps/api`)

Bridge (`apps/host-bridge`)

Packages