Lightweight, single-file FastAPI proxy that sits between llama-server (port 8080) and any AI coding extension or OpenAI-compatible SDK. Transparently forwards every request, handles SSE streaming, aggregates lifetime metrics across restarts, and serves a live dashboard.
This is v0.1 and under continuous development; bugs are expected and PRs are welcome.
Claude Code / Copilot / Cline / Cursor / Continue.dev
│ OpenAI API calls
▼
http://localhost:9090 ← this proxy
│ forwards everything
▼
http://localhost:8080 ← llama-server
It's a good idea to use a Python venv for this.
pip install -r requirements.txt
# OpenAI-compatible mode (default) — transparent passthrough for Cline, Cursor, Continue.dev, etc.
python proxy.py
# Claude mode — exposes Anthropic Messages API on top of llama-server
python proxy.py --mode claude
Startup banner:
==========================================================
llama-cpp-claude-code-proxy
==========================================================
Proxy URL : http://localhost:9090
llama-server : http://localhost:8080
Dashboard : http://localhost:9090/dashboard
Metrics file : ./metrics.json
==========================================================
The proxy starts even when llama-server is offline — it will log when llama-server comes online or goes offline and keep retrying silently.
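Internally this is just a periodic health poll of llama-server. A minimal sketch of the pattern (illustrative only, not the proxy's actual code), assuming llama-server's /health endpoint and the httpx async client:
# Sketch of an availability watcher: poll llama-server and log state changes.
# Assumes llama-server exposes GET /health; names and intervals are illustrative.
import asyncio
import httpx

LLAMA_URL = "http://localhost:8080"

async def watch_llama_server(poll_seconds: float = 10.0) -> None:
    online = False
    async with httpx.AsyncClient(timeout=5.0) as client:
        while True:
            try:
                up = (await client.get(f"{LLAMA_URL}/health")).status_code == 200
            except httpx.HTTPError:
                up = False
            if up != online:
                print("llama-server is online" if up else "llama-server went offline")
            online = up
            await asyncio.sleep(poll_seconds)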
Open http://localhost:9090/dashboard in a browser.
Displays lifetime token totals, current-session stats (TPS, context usage, KV cache), and live charts. Auto-refreshes every 5 seconds.
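The dashboard is backed by the same data that the /metrics.json endpoint serves, so the numbers can also be scripted against; the exact field layout is whatever the proxy persists, this example just prints it:
# Fetch the proxy's aggregated metrics as raw JSON (same data the dashboard shows).
import json
import requests

metrics = requests.get("http://localhost:9090/metrics.json", timeout=5).json()
print(json.dumps(metrics, indent=2))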
All tools use the base URL http://localhost:9090/v1 and accept any API key string (llama-server ignores authentication).
Claude Code uses the Anthropic Messages API (POST /v1/messages), not OpenAI format. Start the proxy in claude mode, which handles the translation internally:
python proxy.py --mode claude
Then point Claude Code at the proxy:
ANTHROPIC_BASE_URL=http://localhost:9090 claude
In PowerShell:
$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude
Claude Code will send Anthropic-format requests to the proxy, which converts them to OpenAI format, calls llama-server, and converts the response back — including full SSE streaming.
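To sanity-check claude mode without Claude Code in the loop, you can send an Anthropic-style request to the proxy directly. A hedged example: the payload follows the public Anthropic Messages API shape, and whether the proxy actually requires the auth and version headers is an assumption.
# Send a minimal Anthropic Messages API request to the proxy in --mode claude.
import requests

resp = requests.post(
    "http://localhost:9090/v1/messages",
    headers={"x-api-key": "localkey", "anthropic-version": "2023-06-01"},
    json={
        "model": "local",  # llama-server ignores the model name
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())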
Add to VS Code settings.json:
"github.copilot.advanced": {
"debug.overrideEngine": "http://localhost:9090/v1"
}
Note: GitHub Copilot routes through GitHub's servers by default. Full local redirection requires a Copilot-compatible model identifier and may need an enterprise policy override.
- Open the Cline settings panel.
- Set Provider → OpenAI Compatible.
- Set Base URL → http://localhost:9090/v1.
- Set API Key → local (any non-empty string).
- Pick whichever model name llama-server reports (from GET /v1/models; see the snippet below).
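To find out what model name llama-server reports, query the models endpoint through the proxy:
# List the model IDs llama-server exposes, going through the proxy.
import requests

models = requests.get(
    "http://localhost:9090/v1/models",
    headers={"Authorization": "Bearer local"},  # any non-empty key works
    timeout=5,
).json()
for model in models.get("data", []):
    print(model["id"])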
- Settings → Models → Add Model.
- Set the base URL to http://localhost:9090/v1.
- Set API Key to local.
- Enable the model and select it in the chat window.
Add to ~/.continue/config.json:
{
"models": [
{
"title": "llama-local",
"provider": "openai",
"model": "local",
"apiBase": "http://localhost:9090/v1",
"apiKey": "local"
}
]
}
Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
JavaScript / TypeScript:
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:9090/v1", apiKey: "local" });
const response = await client.chat.completions.create({
model: "local",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completions (streaming + blocking) |
| POST | /v1/completions | Legacy completions |
| GET | /v1/models | List models |
| POST | /v1/embeddings | Embeddings |
| ANY | /* | Catch-all passthrough to llama-server |
| GET | /dashboard | Metrics dashboard UI |
| GET | /metrics.json | Raw metrics JSON |
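Streaming goes through the same /v1/chat/completions endpoint; with the OpenAI Python SDK from the example above, chunks arrive over SSE as they are generated:
# Streamed chat completion through the proxy; SSE chunks are forwarded as-is.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()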
Metrics are polled from http://localhost:8080/metrics (Prometheus format) every 10 seconds. The following values are tracked:
| Metric | Description |
|---|---|
| llama_tokens_generated_total | Cumulative output tokens |
| llama_prompt_tokens_total | Cumulative prompt tokens |
| llama_tokens_per_second | Current generation speed |
| llama_context_tokens_used | Tokens occupying the context |
| llama_kv_cache_usage_ratio | KV cache fill ratio |
| llama_requests_processing | In-flight requests |
Lifetime totals accumulate across llama-server restarts (counter resets are detected automatically). They are persisted to metrics.json after every poll and on graceful shutdown (SIGINT / SIGTERM).
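Reset handling follows the standard pattern for Prometheus-style counters: if a cumulative value comes back lower than the previous sample, llama-server must have restarted, so the new value is added to the lifetime total in full. A sketch of that logic (not the proxy's actual code):
# Accumulate a lifetime total from a counter that may reset on restart.
def accumulate(lifetime_total: float, last_sample: float, new_sample: float) -> tuple[float, float]:
    if new_sample < last_sample:
        # Counter went backwards: llama-server restarted, so everything
        # it reports now is new work since the restart.
        lifetime_total += new_sample
    else:
        lifetime_total += new_sample - last_sample
    return lifetime_total, new_sample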
- Python 3.10+
- llama-server built from ggerganov/llama.cpp running on port 8080
llama-cpp-claude-code-proxy/
├── proxy.py # single-file proxy server
├── requirements.txt
├── metrics.json # auto-created, persists lifetime totals
├── README.md
└── static/
└── dashboard.html # metrics dashboard