A transparent inspection proxy for LLM API traffic and MCP tool call flows.
Prompt injection via indirect tool response is an underaddressed attack class in current agentic AI deployments. When an agent retrieves external content (web pages, documents, database records, API responses), that content enters the model's context window with no structural distinction from trusted system instructions. A sufficiently crafted payload embedded in retrieved content can redirect agent behavior, override system prompts, or trigger unintended tool calls.
This attack surface has been demonstrated in production systems at scale. Anthropic's Git MCP server received three CVEs in January 2026 (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145) for remote code execution achieved through prompt injection in tool responses. GitHub Copilot was assigned a CVSS 9.6 vulnerability in the same class. Microsoft 365 Copilot has been demonstrated susceptible to indirect injection through email and document content.
Existing mitigations target enterprise deployments with per-request pricing and managed onboarding. There is no open-source or self-hosted option for this threat class. Airlock is an attempt to fill that gap.
Airlock operates as a reverse proxy between the agent process and its upstream targets: the LLM API and, in later versions, MCP tool servers directly. All traffic passes through an inspection layer before forwarding.
                    ┌───────────────────────────┐
                    │          AIRLOCK          │
                    │                           │
Your Agent ────────►│  inspect inbound prompt   │────────► Anthropic API
                    │  inspect outbound reply   │◄──────── Anthropic API
                    │  inspect tool responses   │
                    │  enforce detection policy │────────► MCP Tool Servers
                    │  audit log                │◄──────── MCP Tool Servers
                    │                           │
                    └───────────────────────────┘
Integration requires a single environment variable change. No modifications to agent code are necessary.
# Standard configuration
ANTHROPIC_BASE_URL=https://api.anthropic.com
# With Airlock
ANTHROPIC_BASE_URL=http://localhost:4000
The proxy is transparent to the agent and API responses are structurally identical to direct Anthropic responses. Inspection and enforcement occur out-of-band from the agent's perspective.
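To illustrate why no agent code changes are needed, the sketch below resolves the request URL from `ANTHROPIC_BASE_URL`, so the same code path reaches either Anthropic directly or a local Airlock instance. The helper name `messages_endpoint` is hypothetical and Python is used only for illustration. Agents built on an SDK that already honors this variable would not need a helper like this at all.

```python
import os

DEFAULT_BASE_URL = "https://api.anthropic.com"


def messages_endpoint(env=None):
    """Return the Messages API URL for the configured base URL.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    base = env.get("ANTHROPIC_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return f"{base}/v1/messages"


if __name__ == "__main__":
    # Direct configuration, then the Airlock configuration from above.
    print(messages_endpoint({}))
    print(messages_endpoint({"ANTHROPIC_BASE_URL": "http://localhost:4000"}))
```

Only the base URL differs between the two calls; the path and request body are unchanged, which is what lets the proxy sit in front of an existing agent.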
- Indirect prompt injection — adversarial instructions embedded in tool responses, retrieved documents, scraped content, or user-controlled input
- System prompt override attempts — known jailbreak templates, instruction suppression patterns, persona replacement attempts
- Credential and filesystem exfiltration — tool calls targeting sensitive paths (SSH keys, environment files, credential stores)
- Tool misuse — calls to tools outside expected behavioral scope, anomalous call frequency, unexpected sequencing
- Behavioral deviation — statistical divergence from baseline call patterns (configurable per agent)
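As one concrete example from the list above, a credential and filesystem exfiltration check could look roughly like the following. This is a minimal sketch, not Airlock's rule set: the patterns, the function names, and the assumption that tool-call arguments arrive as a flat dictionary are all illustrative.

```python
import fnmatch
import posixpath

# Illustrative patterns only; not Airlock's actual rule library.
SENSITIVE_PATTERNS = [
    "*/.ssh/*",
    "*/.aws/credentials",
    "*/.env",
    "*/.env.*",
    "*.pem",
]


def is_sensitive_path(path: str) -> bool:
    """Return True if a path matches one of the sensitive patterns."""
    # Use forward slashes, anchor at "/", and collapse ".." segments so that
    # "project/../.ssh/id_rsa" cannot slip past the patterns.
    unified = path.replace("\\", "/").lstrip("/")
    normalized = posixpath.normpath("/" + unified)
    return any(fnmatch.fnmatchcase(normalized, p) for p in SENSITIVE_PATTERNS)


def flagged_arguments(arguments: dict) -> list:
    """Return names of string arguments whose values look like sensitive paths."""
    return [
        name
        for name, value in arguments.items()
        if isinstance(value, str) and is_sensitive_path(value)
    ]
```

A call such as `flagged_arguments({"path": "/srv/app/.env", "mode": "r"})` returns `["path"]`. Glob matching of this kind is cheap and predictable, which is why a check like this is a plausible candidate for block mode.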
Detection rules operate in one of two modes, configurable per rule:
Monitor (default) — traffic passes through; flagged events are logged and optionally forwarded to a webhook or Slack channel. Intended as the initial deployment mode to establish a behavioral baseline and validate rule accuracy before enabling enforcement.
Block — the request or response is rejected at the proxy layer before reaching the LLM or the agent. Rules with low false-positive risk (credential file access, exact-match jailbreak strings) default to block. Heuristic and pattern-based rules default to monitor.
Rule modes are adjustable through the dashboard based on observed traffic.
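The two modes can be sketched as a small decision function: every matching rule is recorded, and traffic is rejected only if at least one matching rule is in block mode. The `Rule`, `Decision`, and `decide` names and the rule identifiers are hypothetical, and the project itself targets Node.js; Python is used here only to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MONITOR = "monitor"
    BLOCK = "block"


@dataclass(frozen=True)
class Rule:
    rule_id: str                # hypothetical identifier
    mode: Mode = Mode.MONITOR   # monitor is the default mode


@dataclass
class Decision:
    forward: bool   # whether the proxy forwards the traffic
    flagged: list   # ids of every rule that matched, for the audit log


def decide(matched_rules):
    """Forward unless at least one matching rule is in block mode."""
    blocked = any(rule.mode is Mode.BLOCK for rule in matched_rules)
    return Decision(
        forward=not blocked,
        flagged=[rule.rule_id for rule in matched_rules],
    )
```

With a heuristic rule left in monitor mode and a credential rule set to block, `decide([heuristic])` forwards the traffic and logs the match, while `decide([heuristic, credential])` rejects it and logs both.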
Detection does not rely on an external LLM service for evaluation, which would introduce latency, cost, and a secondary trust boundary. The detection pipeline uses local, layered heuristics:
- Static rule matching — pattern matching against a maintained library of known injection templates, override syntax, and exfiltration signatures
- Embedding similarity — cosine similarity against a local vector store of known attack patterns; no external embedding API calls
- Structural heuristics — detection of unicode obfuscation, zero-width character insertion, and syntactic override patterns
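The static and structural layers can be combined so that obfuscation is both reported and removed before patterns are matched. The following is a minimal sketch under stated assumptions: the two regular expressions, the zero-width character set, and the finding labels are illustrative and far smaller than a maintained library would be.

```python
import re
import unicodedata

# Illustrative override patterns only.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?system\s+prompt", re.IGNORECASE),
]

# Invisible characters commonly inserted to split up keywords.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_obfuscation(text: str) -> str:
    """Drop zero-width characters and fold compatibility forms via NFKC."""
    visible = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFKC", visible)


def inspect_text(text: str) -> list:
    """Return finding labels for one piece of tool output."""
    findings = []
    # Structural heuristic: invisible characters are suspicious on their own.
    if any(ch in ZERO_WIDTH for ch in text):
        findings.append("zero-width-characters")
    # Static rules run on the cleaned text so obfuscation cannot hide a match.
    cleaned = strip_obfuscation(text)
    if any(pattern.search(cleaned) for pattern in OVERRIDE_PATTERNS):
        findings.append("override-pattern")
    return findings
```

Running the static rules on the normalized text means a payload such as `ig\u200bnore previous instructions`, or one written in fullwidth letters, is caught by the same patterns as the plain-ASCII version.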
The full pipeline runs locally. No request or response data leaves the host.
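The embedding similarity layer reduces to a nearest-neighbor lookup by cosine similarity. The sketch below assumes embeddings have already been computed by some local model, which the design above does not specify; the in-memory dictionary store, the toy 3-dimensional vectors in the usage note, and the 0.85 threshold are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def nearest_attack(query, store, threshold=0.85):
    """Return (label, score) for the closest stored pattern, or None.

    `store` maps a pattern label to its embedding vector. A result is
    returned only when the best score reaches `threshold` (an assumed value).
    """
    best = max(
        ((label, cosine_similarity(query, vec)) for label, vec in store.items()),
        key=lambda pair: pair[1],
        default=None,
    )
    if best is not None and best[1] >= threshold:
        return best
    return None
```

For a store of `{"ignore-instructions": [1.0, 0.0, 0.0], "exfiltrate-keys": [0.0, 1.0, 0.0]}`, the query `[0.9, 0.1, 0.0]` matches `ignore-instructions`, while `[0.5, 0.5, 0.7]` falls below the threshold and returns `None`. A linear scan like this is adequate for a small pattern library; a larger one would likely call for an approximate nearest-neighbor index.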
A web dashboard provides:
- Real-time feed of proxied requests and responses
- Flagged event detail with the triggering rule and matched content
- Full audit log with exportable request/response payloads
- Per-rule mode configuration (monitor / block)
- Proxy: transparent HTTP proxy written in Node.js, compatible with the Anthropic API
- Detection engine: Rule engine + local embeddings
- Dashboard: React + lightweight API server
- Deployment: Docker Compose
Early development. The architecture and detection model are designed. Implementation is in progress.
v1 scope:
- LLM proxy
- Static rule engine + known injection pattern library
- Embedding-based similarity detection
- Web dashboard with real-time feed and flagged event detail
- Webhook and Slack alert delivery
- Docker Compose deployment
v2 scope:
- Native MCP protocol proxy (tool server traffic inspection)
- Expanded LLM / Agent compatibility
- Custom rule authoring
- Audit log export (CSV/JSON)
- Per-agent policy profiles
Airlock is designed for self-hosted deployment. No data leaves the host; inspection runs entirely in-process. A managed cloud option is planned for teams that prefer not to operate the proxy directly.
→ [Airlock Cloud] — managed deployment, coming soon.
This project is in early development. If you are building MCP-connected agents and have encountered prompt injection in the wild, or have thoughts on the detection model or enforcement design, contributions and discussion are welcome.
- ⭐ Star this repo to signal interest
- 💬 Open a Discussion to share your use case or threat model
- 👀 Watch for updates as development progresses
AGPL-3.0. Free to use and self-host. See LICENSE for details.
*@bdhobson*