llama-cpp-claude-code-proxy

Lightweight, single-file FastAPI proxy that sits between llama-server (port 8080) and any AI coding extension or OpenAI-compatible SDK. Transparently forwards every request, handles SSE streaming, aggregates lifetime metrics across restarts, and serves a live dashboard.

This is v0.1 and under continuous development; bugs are expected and PRs are welcome.

How it works

Claude Code / Copilot / Cline / Cursor / Continue.dev
            │  OpenAI API calls
            ▼
  http://localhost:9090        ← this proxy
            │  forwards everything
            ▼
  http://localhost:8080        ← llama-server

Install

Using a Python virtual environment (venv) is recommended. Then install the dependencies:

pip install -r requirements.txt

Run

# OpenAI-compatible mode (default) — transparent passthrough for Cline, Cursor, Continue.dev, etc.
python proxy.py

# Claude mode — exposes Anthropic Messages API on top of llama-server
python proxy.py --mode claude

Startup banner:

==========================================================
  llama-cpp-claude-code-proxy
==========================================================
  Proxy URL    : http://localhost:9090
  llama-server : http://localhost:8080
  Dashboard    : http://localhost:9090/dashboard
  Metrics file : ./metrics.json
==========================================================

The proxy starts even when llama-server is offline — it will log when llama-server comes online or goes offline and keep retrying silently.


Dashboard

Open http://localhost:9090/dashboard in a browser.

Displays lifetime token totals, current-session stats (TPS, context usage, KV cache), and live charts. Auto-refreshes every 5 seconds.


Pointing tools at the proxy

Unless noted otherwise, tools point at the base URL http://localhost:9090/v1 and can use any API key string (llama-server ignores authentication).

Claude Code (CLI)

Claude Code uses the Anthropic Messages API (POST /v1/messages), not OpenAI format. Start the proxy in claude mode, which handles the translation internally:

python proxy.py --mode claude

Then point Claude Code at the proxy:

ANTHROPIC_BASE_URL=http://localhost:9090 claude

In PowerShell:

$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude

Claude Code will send Anthropic-format requests to the proxy, which converts them to OpenAI format, calls llama-server, and converts the response back — including full SSE streaming.
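
To sanity-check claude mode without Claude Code, you can send a minimal Anthropic-format request yourself. A sketch using httpx (any HTTP client works; the header values are placeholders, and whether the proxy requires them at all is an assumption):

import httpx

# Minimal Anthropic Messages API request sent straight to the proxy (claude mode).
payload = {
    "model": "local",   # llama-server serves whatever model it was launched with
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Hello!"}],
}
headers = {"x-api-key": "localkey", "anthropic-version": "2023-06-01"}

resp = httpx.post("http://localhost:9090/v1/messages", json=payload,
                  headers=headers, timeout=60)
resp.raise_for_status()
# Anthropic responses carry a list of content blocks; the first one holds the text.
print(resp.json()["content"][0]["text"])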

GitHub Copilot (VS Code extension)

Add to VS Code settings.json:

"github.copilot.advanced": {
  "debug.overrideEngine": "http://localhost:9090/v1"
}

Note: GitHub Copilot routes through GitHub's servers by default. Full local redirection requires a Copilot-compatible model identifier and may need an enterprise policy override.

Cline (VS Code extension)

  1. Open the Cline settings panel.
  2. Set Provider → OpenAI Compatible.
  3. Set Base URL → http://localhost:9090/v1.
  4. Set API Key → local (any non-empty string).
  5. Pick whichever model name llama-server reports (from GET /v1/models; a query example follows this list).
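
To see which model name llama-server reports (step 5), the models endpoint can be queried directly, for example with the OpenAI Python SDK used later in this README:

from openai import OpenAI

# Ask the proxy (and therefore llama-server) which model ids it exposes.
client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
for model in client.models.list():
    print(model.id)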

Cursor

  1. Settings → Models → Add Model.
  2. Set the base URL to http://localhost:9090/v1.
  3. Set API Key to local.
  4. Enable the model and select it in the chat window.

Continue.dev

Add to ~/.continue/config.json:

{
  "models": [
    {
      "title": "llama-local",
      "provider": "openai",
      "model": "local",
      "apiBase": "http://localhost:9090/v1",
      "apiKey": "local"
    }
  ]
}

Any OpenAI Python/JS SDK

Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

JavaScript / TypeScript:

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:9090/v1", apiKey: "local" });
const response = await client.chat.completions.create({
  model: "local",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
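
Streaming works the same way: the SSE chunks from llama-server are passed through to the client. A minimal Python sketch with the same client as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")

# stream=True yields chunks as they arrive over SSE instead of one blocking response.
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()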

API endpoints

Method  Path                   Description
POST    /v1/chat/completions   Chat completions (streaming + blocking)
POST    /v1/completions        Legacy completions
GET     /v1/models             List models
POST    /v1/embeddings         Embeddings
ANY     /*                     Catch-all passthrough to llama-server
GET     /dashboard             Metrics dashboard UI
GET     /metrics.json          Raw metrics JSON
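
A quick way to exercise the proxy's own endpoints is to read back the raw metrics JSON; the exact field names depend on proxy.py, so the sketch below just prints the whole document:

import json
import httpx

# Fetch whatever the proxy has persisted to metrics.json and pretty-print it.
metrics = httpx.get("http://localhost:9090/metrics.json", timeout=10).json()
print(json.dumps(metrics, indent=2))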

Metrics and lifetime tracking

Metrics are polled from http://localhost:8080/metrics (Prometheus format) every 10 seconds. The following values are tracked:

Metric                        Description
llama_tokens_generated_total  Cumulative output tokens
llama_prompt_tokens_total     Cumulative prompt tokens
llama_tokens_per_second       Current generation speed
llama_context_tokens_used     Tokens occupying the context
llama_kv_cache_usage_ratio    KV cache fill ratio
llama_requests_processing     In-flight requests

Lifetime totals accumulate across llama-server restarts (counter resets are detected automatically). They are persisted to metrics.json after every poll and on graceful shutdown (SIGINT / SIGTERM).
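
The reset handling amounts to treating a counter that moves backwards as a restart and adding only the new progress to the lifetime total. A minimal sketch of that idea (illustrative only, not the actual proxy.py code):

class LifetimeCounter:
    """Accumulate a lifetime total from a Prometheus counter that may reset to zero."""

    def __init__(self, lifetime: float = 0.0):
        self.lifetime = lifetime   # persisted to metrics.json between polls
        self.last_seen = 0.0       # last raw counter value read from /metrics

    def update(self, raw_value: float) -> float:
        if raw_value < self.last_seen:
            # Counter went backwards: llama-server restarted, so everything
            # in the new raw value is fresh progress since the reset.
            self.lifetime += raw_value
        else:
            self.lifetime += raw_value - self.last_seen
        self.last_seen = raw_value
        return self.lifetime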


Requirements

Python 3 with the packages in requirements.txt (the proxy is built on FastAPI), plus a llama-server instance on port 8080 to proxy to.

Project structure

llama-cpp-claude-code-proxy/
├── proxy.py             # single-file proxy server
├── requirements.txt
├── metrics.json         # auto-created, persists lifetime totals
├── README.md
└── static/
    └── dashboard.html   # metrics dashboard
