
Safe-output schema validation in safeoutputs MCP server: drop unknown keys before forwarding #34885

@dfrysinger


Feature request

The safeoutputs MCP server currently forwards every key it receives in a safe-output JSONL record to the GitHub API (via the safe_outputs job's downstream handler). When an LLM emits a key that is not in the documented schema for that safe-output type, the result is either a hard failure (e.g. HTTP 422 from workflow_dispatch if the receiver doesn't declare the extra input) or silent data loss.

Request: the safeoutputs MCP server should validate each emitted JSON object against the documented schema for its type and either reject (fail-loud) or strip (fail-graceful) unknown keys before forwarding.

Concrete reproduction (issue #153 in our policy-driven agent POC)

We use gpt-5-mini for the policy-dispatcher prompt. The dispatcher is documented in the prompt to emit dispatch_workflow records with shape:

{"type": "dispatch_workflow", "workflow_name": "<tier>", "inputs": {"issue_number": "<N>"}}

The model, however, generalizes from min-integrity (a key in the gh-aw MCP-guard env-var family that ALSO appears in its context) and adds hallucinated integrity / secrecy keys to the safe-output object:

{"type": "dispatch_workflow", "workflow_name": "tier-substantial", "inputs": {...}, "integrity": "high", "secrecy": "medium"}

The safeoutputs handler forwards these as workflow_dispatch inputs. GitHub's API rejects the request with HTTP 422 because the receiver workflow's on.workflow_dispatch.inputs block doesn't declare integrity or secrecy. The chain wedges and the issue cannot be processed.

Defense-in-depth we built downstream (workaround)

We shipped a deterministic post-emission sanitizer. It injects a Sanitize Safe Outputs step into every *.lock.yml agent job, after agent emission and before the safe_outputs processor. The step reads the agent's output JSONL, parses each line, intersects its keys against a schema file generated from each receiver workflow's declared on.workflow_dispatch.inputs, and rewrites the file in place. Each unknown key is dropped with a structured log line of the form [sanitize-safe-outputs] dropped key=<key> from type=<type>. The step fails closed if the sanitizer or schema is missing.

The patcher and the schema file:

This works, but it's a per-repo patch on top of generated lock files (which then needs a sidecar hash manifest to survive re-compiles). The right place for the sanitization is upstream in the safeoutputs MCP server itself: ONE source of truth, every gh-aw user benefits.
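For reference, the core of the sanitizer's logic can be sketched roughly as follows. This is a simplified illustration, not the code from PR #162: the ALLOWED_KEYS table, the function names, and the choice to key the schema by safe-output type (rather than by receiver workflow) are all assumptions made for the example.

```python
import json
import sys

# Hypothetical schema: allowed top-level keys per safe-output type.
# The real schema file is generated from receiver workflows; this shape is assumed.
ALLOWED_KEYS = {
    "dispatch_workflow": {"type", "workflow_name", "inputs"},
}


def sanitize_line(line, allowed=ALLOWED_KEYS, log=sys.stderr):
    """Return one JSONL record with unknown top-level keys removed."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("safe-output record must be a JSON object")
    rtype = record.get("type")
    if rtype not in allowed:
        # Fail closed: no schema for this type means we refuse to forward it.
        raise ValueError(f"no schema for type={rtype!r}")
    kept = {}
    for key, value in record.items():
        if key in allowed[rtype]:
            kept[key] = value
        else:
            print(f"[sanitize-safe-outputs] dropped key={key} from type={rtype}", file=log)
    return json.dumps(kept)


def sanitize_jsonl(text, allowed=ALLOWED_KEYS, log=sys.stderr):
    """Sanitize every non-blank line of a JSONL document."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return "\n".join(sanitize_line(ln, allowed, log) for ln in lines)
```

Run against the record from the reproduction above, this keeps type, workflow_name, and inputs, and logs one dropped-key line each for integrity and secrecy.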

Why this matters beyond #153

The hallucination pattern is robust across models. We've also seen gpt-5-mini hallucinate extra fields on add_comment (security-axis names like integrity, secrecy, min-integrity) and on add_labels (label-namespace names like scope, severity). Any LLM-emitted safe-output is at risk of this class of bug as long as the MCP server forwards unknown keys.

Proposed shape

gh-aw's safeoutputs MCP server already has the schema for each safe-output type internally (it has to, in order to construct the GitHub API request). Validate emitted JSON against that schema at MCP-server time. Options:

  1. Strict: reject the emission with a clear error back to the agent ("unknown key 'integrity' on type 'dispatch_workflow'"). Agent can self-correct.
  2. Lenient: strip unknown keys without failing the emission, and log each dropped key. Same observable behavior as our downstream sanitizer.
  3. Configurable: a workflow frontmatter option safe-outputs.validate: strict|lenient|off lets users choose.

Option 3 is probably the cleanest: default to strict for new workflows, with lenient as a migration path for existing ones.
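To make the three modes concrete, here is a minimal sketch of what the check could look like. It is illustrative only: the SCHEMAS table, the validate function, and the UnknownKeyError class are hypothetical names, not part of gh-aw, and the real server would use its own internal schema rather than a hard-coded set of keys.

```python
# Assumed schema table: allowed top-level keys per safe-output type.
SCHEMAS = {
    "dispatch_workflow": {"type", "workflow_name", "inputs"},
}


class UnknownKeyError(ValueError):
    """Raised in strict mode, or when the record's type has no schema."""


def validate(record, mode="strict", schemas=SCHEMAS):
    """Return (record_to_forward, dropped_keys) according to the chosen mode."""
    if mode not in ("strict", "lenient", "off"):
        raise ValueError(f"invalid mode {mode!r}")
    if mode == "off":
        # No validation: forward the record unchanged.
        return dict(record), []
    rtype = record.get("type")
    allowed = schemas.get(rtype)
    if allowed is None:
        raise UnknownKeyError(f"unknown safe-output type {rtype!r}")
    unknown = sorted(k for k in record if k not in allowed)
    if unknown and mode == "strict":
        # Strict: report every unknown key so the agent can self-correct.
        raise UnknownKeyError(
            "; ".join(f"unknown key {k!r} on type {rtype!r}" for k in unknown)
        )
    # Lenient (or strict with nothing to reject): forward only known keys.
    cleaned = {k: v for k, v in record.items() if k in allowed}
    return cleaned, unknown
```

In strict mode the error text would be returned to the agent; in lenient mode the caller would log the returned list of dropped keys before forwarding the cleaned record.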

Filed by

@dfrysinger via the policy-driven-agent-poc project. Happy to review a PR and provide additional reproductions if useful.

Cross-reference: dfrysinger/policy-driven-agent-poc#153 (root cause), PR #162 (downstream sanitizer).
