Muxa is a self-hosted proxy that presents Anthropic- and OpenAI-compatible APIs so IDE/CLI tooling (Claude Code, Cursor, Codex, Continue, Copilot, etc.) can run against your choice of provider—cloud or local—without touching client settings.
- One URL for many providers – Point every client at `http://localhost:8081`, then change providers (OpenRouter, Azure, Databricks, Ollama, etc.) centrally via `.env`.
- Auto routing & fallback – Send simple prompts to a local Ollama model, heavy workloads to a cloud model, and fall back automatically when a provider fails.
- Token optimization – Prompt cache, semantic cache, memory injection, and headroom compression operate server-side so all clients enjoy reduced latency/cost.
- Observability + policy controls – Built-in load shedding, circuit breaker, structured logs, and Prometheus endpoints give production visibility; policy guards enforce workspace/host/git/test rules.
- Advanced providers – Some tools only speak OpenAI/Anthropic; Muxa converts that traffic to providers they don’t natively support (OpenRouter, Ollama, MLX).
- Easy rollouts – Update `.env` once; every IDE routed through Muxa immediately uses the new provider/policy set.
```bash
# Install globally
npm install -g @thelogicatelier/muxa

# Create a .env with your provider key (see docs for all options)
muxa   # proxy listens on http://localhost:8081
```

Or run instantly without installing:

```bash
npx @thelogicatelier/muxa
```

To run from source:

```bash
git clone https://github.com/achatt89/muxa.git
cd muxa
npm install
cp .env.example .env   # fill in OPENAI_API_KEY, OPENROUTER_API_KEY, etc.
npm start              # proxy listens on http://localhost:8081
```

The example below uses OpenAI, but you can substitute any supported provider (Anthropic, OpenRouter, Ollama, Databricks, Azure, etc.):
```bash
docker build -t muxa .
docker run --rm -p 8081:8081 \
  -e MUXA_PRIMARY_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-your-key \
  muxa:latest
```

Or install via Homebrew:

```bash
brew tap thelogicatelier/muxa https://github.com/achatt89/muxa.git
brew install muxa
muxa --help
```

| Mode | Description |
|---|---|
| `single` | All requests go to `MUXA_PRIMARY_PROVIDER`. Use this when you only want one provider. |
| `hybrid` | Muxa evaluates request "complexity" (prompt length, tool use, etc.) and routes high-cost calls to `MUXA_FALLBACK_PROVIDER`. If the primary fails, the fallback is also used. |
Example `.env`:

```env
MUXA_ROUTING_STRATEGY=hybrid
MUXA_PRIMARY_PROVIDER=openrouter
MUXA_FALLBACK_PROVIDER=anthropic
OPENROUTER_API_KEY=sk-or-...
ANTHROPIC_API_KEY=sk-ant-...
```

Use `single` mode if you don't need fallback.
Enable caches and memory to cut token usage:
```env
MUXA_PROMPT_CACHE_ENABLED=true
MUXA_PROMPT_CACHE_TTL_MS=120000
MUXA_SEMANTIC_CACHE_ENABLED=true
MUXA_SEMANTIC_CACHE_THRESHOLD=0.9
MUXA_MEMORY_ENABLED=true
MUXA_MEMORY_TOPK=3
MUXA_HEADROOM_ENABLED=true
MUXA_HEADROOM_MODE=optimize
```

- Prompt cache returns repeated prompts instantly.
- Semantic cache reuses answers for similar prompts (requires an embeddings provider; see docs/embeddings.md).
- Memory store injects the top-K memories into each request.
- Headroom exposes `/metrics/compression` and `/headroom/*` to track savings.
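To make the exact-match layer concrete, here is a minimal sketch of a TTL-bounded prompt cache in the spirit of `MUXA_PROMPT_CACHE_ENABLED` and `MUXA_PROMPT_CACHE_TTL_MS` (Muxa's internal implementation may differ):

```python
import time

class PromptCache:
    """Exact-match prompt cache: a fresh entry short-circuits the upstream call."""

    def __init__(self, ttl_ms: int = 120_000):
        self.ttl = ttl_ms / 1000        # store TTL in seconds
        self.store = {}

    def get(self, prompt: str):
        hit = self.store.get(prompt)
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]               # fresh: return cached response
        self.store.pop(prompt, None)    # expired or missing: evict
        return None

    def put(self, prompt: str, response: str):
        self.store[prompt] = (time.monotonic(), response)

cache = PromptCache(ttl_ms=120_000)
cache.put("hello", "Hi there!")
print(cache.get("hello"))  # hit within the TTL window
```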
Variable descriptions:

| Variable | Purpose |
|---|---|
| `MUXA_PROMPT_CACHE_ENABLED` | Enable the exact-match cache; repeated prompts return instantly. |
| `MUXA_PROMPT_CACHE_TTL_MS` | Time-to-live (milliseconds) for prompt cache entries. |
| `MUXA_SEMANTIC_CACHE_ENABLED` | Enable the semantic (embeddings-based) cache. Set to true once embeddings are configured. |
| `MUXA_SEMANTIC_CACHE_THRESHOLD` | Cosine similarity threshold (0-1) for semantic cache hits. |
| `MUXA_MEMORY_ENABLED` | Enable long-term memory extraction/storage. |
| `MUXA_MEMORY_TOPK` | Number of memories injected into each request when relevant. |
| `MUXA_HEADROOM_ENABLED` | Enable the headroom sidecar/compression pipeline. |
| `MUXA_HEADROOM_MODE` | `audit` (record metrics only) or `optimize` (mutate/compress history). |
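The `MUXA_SEMANTIC_CACHE_THRESHOLD` check boils down to comparing cosine similarity of embeddings against a cutoff. A minimal sketch (real embedding vectors come from your configured embeddings provider):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_hit(query_vec, cached_vec, threshold=0.9):
    """Reuse a cached answer only when the embeddings are close enough."""
    return cosine(query_vec, cached_vec) >= threshold
```

Raising the threshold toward 1.0 makes the cache stricter (fewer but safer hits); lowering it trades accuracy for more reuse.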
- Start Muxa (npm or Docker as shown above).
- Point clients at Muxa:
| Client | Configuration |
|---|---|
| Claude Code CLI | ANTHROPIC_BASE_URL=http://localhost:8081 ANTHROPIC_API_KEY=sk-muxa claude "Prompt" (or export those vars once before running). |
| Cursor IDE | Settings → Features → Models → Base URL http://localhost:8081/v1, API key sk-muxa, select the model configured in .env. For @Codebase, enable embeddings (see docs/embeddings.md). |
| OpenAI Codex CLI | codex -c model_provider='"muxa"' -c model='"gpt-5.2-codex"' -c 'model_providers.muxa={name="Muxa Proxy",base_url="http://localhost:8081/v1",wire_api="responses",api_key="sk-muxa"}' (Muxa maps this to your local primary provider model). |
| GitHub Copilot | export GITHUB_COPILOT_PROXY_URL=http://localhost:8081/v1, export GITHUB_COPILOT_PROXY_KEY=dummy, restart the editor (Works for VS Code / JetBrains). |
| Cline / Continue / ClawdBot / other OpenAI-compatible tools | Set their custom OpenAI endpoint to http://localhost:8081/v1 with API key sk-muxa and use your desired model name. |
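Beyond IDEs, any OpenAI-compatible SDK or script can be pointed at Muxa the same way. A stdlib-only sketch of what such a request looks like (the model name is illustrative; Muxa maps it to your configured provider, and the request is only sent when a proxy is actually running):

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "gpt-4o-mini",
                  base_url: str = "http://localhost:8081/v1"):
    """Build an OpenAI-style chat completion request aimed at Muxa."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": "Bearer sk-muxa",
                 "Content-Type": "application/json"},
    )

req = build_request("Explain this stack trace")
# urllib.request.urlopen(req) would send it through the running proxy
```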
When you launch IDEs via Spotlight or the Dock, macOS ignores shell exports. Persist API keys for GUI apps (VS Code, Cursor, Claude CLI, etc.) with launchctl:
```bash
# Persist environment variables for GUI-launched apps.
# No need to provide a real key: a dummy value keeps the OpenAI/Anthropic
# clients happy, and the actual keys are read from the .env file.
launchctl setenv OPENAI_API_KEY sk-muxa
launchctl setenv ANTHROPIC_API_KEY sk-muxa
launchctl setenv MUXA_BASE_URL http://localhost:8081

# Inspect current values
launchctl getenv OPENAI_API_KEY
launchctl getenv ANTHROPIC_API_KEY

# Remove when rotating credentials
launchctl unsetenv OPENAI_API_KEY
launchctl unsetenv ANTHROPIC_API_KEY
```

After setting values, quit and reopen the IDE so it inherits the updated environment.
- `/dashboard` – lightweight HTML dashboard showing health, metrics, routing, compression, and headroom status (auto-refreshing)
- `/health`, `/health/live`, `/health/ready` – readiness probes
- `/routing/stats`, `/debug/session`, `/v1/agents/*` – routing + agent diagnostics
- `/metrics`, `/metrics/prometheus`, `/metrics/compression`, `/metrics/semantic-cache` – Prometheus-ready metrics plus headroom/semantic cache stats
- `/health/headroom`, `/headroom/status`, `/headroom/restart`, `/headroom/logs` – headroom lifecycle
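Prometheus-style endpoints emit the standard text exposition format, so scraping them with plain Python is straightforward. A tiny parser sketch (the metric names below are made up for illustration, not Muxa's actual metric names):

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple counter/gauge lines from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP muxa_cache_hits_total Prompt cache hits
muxa_cache_hits_total 42
muxa_requests_total 120
"""
print(parse_prometheus(sample))
```

In practice you would point a real Prometheus scraper at the endpoint; this is just to show the payload shape.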
Muxa’s proxy lives between every IDE and the upstream provider, so optimization happens once and benefits everything:
- Caching chain – Prompt cache handles exact repeat prompts instantly; semantic cache (with embeddings) reuses near-duplicates; the memory store injects curated snippets for long-running projects.
- Headroom compression – Enable
MUXA_HEADROOM_ENABLED=trueto shrink chat histories and reduce token spend while keeping context intact—inspect savings via/metrics/compression. - Hybrid routing – Set
MUXA_ROUTING_STRATEGY=hybridwith a fallback provider to auto-route tool-heavy or long prompts to a premium model while keeping cheap models for short requests. Model aliases/fallbacks map IDE-only IDs (e.g.,gpt-5.3-codex) to real upstream SKUs without touching editor config. - Instrumentation loop –
/api/tokens/stats,/metrics/semantic-cache,/routing/stats, and the dashboard show exactly when caches hit or fallback routing triggers so you can prove savings to the team.
Best results happen when you warm caches (replay tests or common prompts), keep session_id/user identifiers consistent, and leave the proxy running for a day or two so semantic cache + headroom gather enough data. See docs/cost-optimization.md for the full playbook, configuration examples, timelines, and troubleshooting tips.
For a deep dive into how hybrid routing scores requests and auto-switches between providers, see docs/routing.md.
See docs/embeddings.md for Ollama, llama.cpp, OpenRouter, and OpenAI embeddings configuration. Example (Ollama):
```bash
ollama pull nomic-embed-text
export MUXA_SEMANTIC_CACHE_ENABLED=true
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_EMBEDDINGS_MODEL=nomic-embed-text
npm start
```

To run the test suite:

```bash
npm test   # 90+ suites covering API/provider/integration/perf
node scripts/endpoint-parity-preflight.js
```

See docker-compose.example.yml for a sample proxy + Ollama stack.
Detailed GitBook-style docs live under docs/:
- docs/README.md — table of contents
- docs/cost-optimization.md — cost optimization playbook
- docs/setup.md — installation/config cheat sheet
- docs/providers.md — provider-specific notes
- docs/observability.md — endpoints + dashboard
- docs/embeddings.md — embeddings/@Codebase setup
- docs/routing.md — hybrid routing & cost optimization
Built by The Logic Atelier