llama-cpp-claude-code-proxy

Lightweight, single-file FastAPI proxy that sits between llama-server (port 8080) and any AI coding extension or OpenAI-compatible SDK. Transparently forwards every request, handles SSE streaming, aggregates lifetime metrics across restarts, and serves a live dashboard.

This is v0.1 and under continuous development; bugs are expected and PRs are welcome.

How it works

Claude Code / Copilot / Cline / Cursor / Continue.dev
            │  OpenAI API calls
            ▼
  http://localhost:9090        ← this proxy
            │  forwards everything
            ▼
  http://localhost:8080        ← llama-server

Install

Using a Python virtual environment (venv) is recommended. Then install the dependencies:

pip install -r requirements.txt

Run

# OpenAI-compatible mode (default) — transparent passthrough for Cline, Cursor, Continue.dev, etc.
python proxy.py

# Claude mode — exposes Anthropic Messages API on top of llama-server
python proxy.py --mode claude

Startup banner:

==========================================================
  llama-cpp-claude-code-proxy
==========================================================
  Proxy URL    : http://localhost:9090
  llama-server : http://localhost:8080
  Dashboard    : http://localhost:9090/dashboard
  Metrics file : ./metrics.json
==========================================================

The proxy starts even when llama-server is offline — it will log when llama-server comes online or goes offline and keep retrying silently.


Dashboard

Open http://localhost:9090/dashboard in a browser.

Displays lifetime token totals, current-session stats (TPS, context usage, KV cache), and live charts. Auto-refreshes every 5 seconds.


Pointing tools at the proxy

Unless noted otherwise, tools point at the base URL http://localhost:9090/v1 and can use any API key string (llama-server ignores authentication).

Claude Code (CLI)

Claude Code uses the Anthropic Messages API (POST /v1/messages), not OpenAI format. Start the proxy in claude mode, which handles the translation internally:

python proxy.py --mode claude

Then point Claude Code at the proxy:

ANTHROPIC_BASE_URL=http://localhost:9090 claude

In PowerShell:

$env:ANTHROPIC_BASE_URL="http://localhost:9090"
$env:ANTHROPIC_API_KEY="localkey"
claude

Claude Code will send Anthropic-format requests to the proxy, which converts them to OpenAI format, calls llama-server, and converts the response back — including full SSE streaming.
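
To sanity-check claude mode without Claude Code, you can send a minimal Anthropic-format request yourself. A sketch using httpx (any HTTP client works; the header values are placeholders, and whether the proxy requires them at all is an assumption):

import httpx

# Minimal Anthropic Messages API request sent straight to the proxy (claude mode).
payload = {
    "model": "local",   # llama-server serves whatever model it was launched with
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Hello!"}],
}
headers = {"x-api-key": "localkey", "anthropic-version": "2023-06-01"}

resp = httpx.post("http://localhost:9090/v1/messages", json=payload,
                  headers=headers, timeout=60)
resp.raise_for_status()
# Anthropic responses carry a list of content blocks; the first one holds the text.
print(resp.json()["content"][0]["text"])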

GitHub Copilot (VS Code extension)

Add to VS Code settings.json:

"github.copilot.advanced": {
  "debug.overrideEngine": "http://localhost:9090/v1"
}

Note: GitHub Copilot routes through GitHub's servers by default. Full local redirection requires a Copilot-compatible model identifier and may need an enterprise policy override.

Cline (VS Code extension)

  1. Open the Cline settings panel.
  2. Set Provider → OpenAI Compatible.
  3. Set Base URL → http://localhost:9090/v1.
  4. Set API Key → local (any non-empty string).
  5. Pick whichever model name llama-server reports (from GET /v1/models; a query example follows this list).
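
To see which model name llama-server reports (step 5), the models endpoint can be queried directly, for example with the OpenAI Python SDK used later in this README:

from openai import OpenAI

# Ask the proxy (and therefore llama-server) which model ids it exposes.
client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
for model in client.models.list():
    print(model.id)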

Cursor

  1. Settings → Models → Add Model.
  2. Set the base URL to http://localhost:9090/v1.
  3. Set API Key to local.
  4. Enable the model and select it in the chat window.

Continue.dev

Add to ~/.continue/config.json:

{
  "models": [
    {
      "title": "llama-local",
      "provider": "openai",
      "model": "local",
      "apiBase": "http://localhost:9090/v1",
      "apiKey": "local"
    }
  ]
}

Any OpenAI Python/JS SDK

Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

JavaScript / TypeScript:

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:9090/v1", apiKey: "local" });
const response = await client.chat.completions.create({
  model: "local",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
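
Streaming works the same way: the SSE chunks from llama-server are passed through to the client. A minimal Python sketch with the same client as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")

# stream=True yields chunks as they arrive over SSE instead of one blocking response.
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()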

API endpoints

Method  Path                   Description
POST    /v1/chat/completions   Chat completions (streaming + blocking)
POST    /v1/completions        Legacy completions
GET     /v1/models             List models
POST    /v1/embeddings         Embeddings
ANY     /*                     Catch-all passthrough to llama-server
GET     /dashboard             Metrics dashboard UI
GET     /metrics.json          Raw metrics JSON
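
A quick way to exercise the proxy's own endpoints is to read back the raw metrics JSON; the exact field names depend on proxy.py, so the sketch below just prints the whole document:

import json
import httpx

# Fetch whatever the proxy has persisted to metrics.json and pretty-print it.
metrics = httpx.get("http://localhost:9090/metrics.json", timeout=10).json()
print(json.dumps(metrics, indent=2))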

Metrics and lifetime tracking

Metrics are polled from http://localhost:8080/metrics (Prometheus format) every 10 seconds. The following values are tracked:

Metric                        Description
llama_tokens_generated_total  Cumulative output tokens
llama_prompt_tokens_total     Cumulative prompt tokens
llama_tokens_per_second       Current generation speed
llama_context_tokens_used     Tokens occupying the context
llama_kv_cache_usage_ratio    KV cache fill ratio
llama_requests_processing     In-flight requests

Lifetime totals accumulate across llama-server restarts (counter resets are detected automatically). They are persisted to metrics.json after every poll and on graceful shutdown (SIGINT / SIGTERM).
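
The reset handling amounts to treating a counter that moves backwards as a restart and adding only the new progress to the lifetime total. A minimal sketch of that idea (illustrative only, not the actual proxy.py code):

class LifetimeCounter:
    """Accumulate a lifetime total from a Prometheus counter that may reset to zero."""

    def __init__(self, lifetime: float = 0.0):
        self.lifetime = lifetime   # persisted to metrics.json between polls
        self.last_seen = 0.0       # last raw counter value read from /metrics

    def update(self, raw_value: float) -> float:
        if raw_value < self.last_seen:
            # Counter went backwards: llama-server restarted, so everything
            # in the new raw value is fresh progress since the reset.
            self.lifetime += raw_value
        else:
            self.lifetime += raw_value - self.last_seen
        self.last_seen = raw_value
        return self.lifetime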


Requirements

Python 3 with the packages in requirements.txt (the proxy is built on FastAPI), plus a llama-server instance on port 8080 to proxy to.

Project structure

llama-cpp-claude-code-proxy/
├── proxy.py             # single-file proxy server
├── requirements.txt
├── metrics.json         # auto-created, persists lifetime totals
├── README.md
└── static/
    └── dashboard.html   # metrics dashboard
