feat: adversarial agent for preventing leaking of info and more #7948
michaelneale merged 5 commits into main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac59b960b2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    session_id,
    system_prompt,
    conversation.messages(),
Isolate adversary checks from the main provider session
Calling provider.complete with the agent’s session_id here makes the adversary prompt part of the same provider conversation state as the user task; for stateful providers this contaminates subsequent turns and breaks the “independent reviewer” behavior. For example, ClaudeCodeProvider explicitly keeps context internally per session_id (see crates/goose/src/providers/claude_code.rs, last_user_content_blocks/stream), so each adversary review is appended into the live chat history, and later model outputs are influenced by these ALLOW/BLOCK exchanges rather than only the user workflow.
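The contamination described here can be illustrated with a minimal sketch. It assumes a provider that keeps history per `session_id`; `MockStatefulProvider` and `adversary_session_id` are hypothetical names for illustration and are not goose's real provider API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a stateful provider that keeps history per
/// `session_id`. This is NOT goose's real provider trait.
#[derive(Default)]
struct MockStatefulProvider {
    history: HashMap<String, Vec<String>>,
}

impl MockStatefulProvider {
    /// Appends the prompt to the history stored under `session_id`.
    fn complete(&mut self, session_id: &str, prompt: &str) {
        self.history
            .entry(session_id.to_string())
            .or_default()
            .push(prompt.to_string());
    }

    /// Number of prompts recorded under `session_id`.
    fn turn_count(&self, session_id: &str) -> usize {
        self.history.get(session_id).map_or(0, Vec::len)
    }
}

/// Hypothetical helper: a key reserved for adversary reviews of one session.
fn adversary_session_id(agent_session_id: &str) -> String {
    format!("{agent_session_id}::adversary")
}

fn main() {
    let mut provider = MockStatefulProvider::default();
    let agent_session = "session-1";

    provider.complete(agent_session, "user: summarize README.md");
    // The review goes to its own key, so the user's history is untouched.
    provider.complete(
        &adversary_session_id(agent_session),
        "review: ALLOW or BLOCK this tool call?",
    );

    assert_eq!(provider.turn_count(agent_session), 1);
    println!("user-session turns: {}", provider.turn_count(agent_session));
}
```

Had the review been sent with `agent_session` itself, the user session would hold two turns, which is the mixing of ALLOW/BLOCK exchanges into the live chat that the comment warns about.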
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a87675a13
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c8910c0a9
    let (response, _usage) = provider
        .complete(
            &model_config,
            "",
Pass through session ID when invoking adversary model
consult_llm always calls provider.complete with an empty session ID, so every adversary check is multiplexed into the same provider-side conversation key. For stateful providers that retain context by session_id (for example ClaudeCodeProvider), this causes cross-check contamination and can leak prior users’ or prior sessions’ adversary prompts into later decisions, making ALLOW/BLOCK outcomes depend on unrelated history.
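One way to avoid a shared empty key is to derive a non-empty identifier per check. The sketch below is an assumption about how that could look, not the change made in this PR; `adversary_check_id` and `CHECK_COUNTER` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter used only to make each check key distinct.
static CHECK_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper: builds a non-empty key that is unique to a single
/// adversary check and still traceable to the agent session it came from.
fn adversary_check_id(agent_session_id: &str) -> String {
    let n = CHECK_COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("adversary-{agent_session_id}-{n}")
}

fn main() {
    let first = adversary_check_id("session-1");
    let second = adversary_check_id("session-1");

    // Neither key is empty, and two checks never share a key.
    assert!(!first.is_empty());
    assert_ne!(first, second);
    println!("{first}\n{second}");
}
```

A per-check key keeps reviews independent of one another, at the cost of leaving many short-lived keys on the provider side; whether a given provider cleans those up is not something this sketch addresses.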
    return Some(AdversaryConfig {
        tools: DEFAULT_TOOLS.iter().map(|s| (*s).to_string()).collect(),
        rules: DEFAULT_RULES.to_string(),
    });
Fail closed when adversary.md cannot be read
When adversary.md exists but cannot be read, the code silently substitutes DEFAULT_RULES instead of using the user’s configured policy. In that scenario the guardrail no longer enforces the intended rules (for example, custom blocklists), which can allow tool calls the user explicitly tried to forbid; this should return disabled/error behavior rather than replacing policy content.
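A fail-closed loader could distinguish "file absent" from "file present but unreadable". The following is a minimal sketch under the assumption that an absent file means the feature is off; `PolicyLoad` and `load_policy` are hypothetical names and do not come from the goose codebase.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// Hypothetical outcome of loading the policy file.
#[derive(Debug, PartialEq)]
enum PolicyLoad {
    /// No policy file: assumed here to mean the adversary is not enabled.
    Disabled,
    /// The user's policy text, read successfully.
    Rules(String),
    /// The file exists but could not be read; callers should refuse to
    /// proceed rather than substitute some other policy.
    Unreadable(String),
}

fn load_policy(path: &Path) -> PolicyLoad {
    match fs::read_to_string(path) {
        Ok(text) => PolicyLoad::Rules(text),
        Err(e) if e.kind() == ErrorKind::NotFound => PolicyLoad::Disabled,
        Err(e) => PolicyLoad::Unreadable(e.to_string()),
    }
}

fn main() {
    // A path that does not exist yields Disabled, not default rules.
    let missing = Path::new("definitely-missing-adversary.md");
    println!("{:?}", load_policy(missing));
}
```

The point of the third variant is that a read error stays visible to the caller, so custom blocklists are never silently replaced by defaults.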
* origin/main:
  feat: adversarial agent for preventing leaking of info and more (#7948)
  Update contributing.md (#7927)
  docs: add credit balance monitoring section (#7952)
  docs: add Cerebras provider to supported providers list (#7953)
  docs: add TUI client documentation to ACP clients guide (#7950)
  fix: removed double dash in pnpm command (#7951)
  docs: polish ACP docs (#7946)
  claude adaptive thinking (#7944)
  feat: new onboarding flow (#7266)
  Add DCO git commit command to AGENTS.md (#7945)
  fix(claude-code): remove incorrect agent_visible filter on user message (#7931)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
* main: (32 commits)
  Revert message flush & test (block#7966)
  docs: add Remote Access section with Telegram Gateway documentation (block#7955)
  fix: update webmcp blog post metadata image URL (block#7967)
  fix: clean up OAuth token cache on provider deletion (block#7908)
  fix: hard-coded tool call id in code mode callback (block#7939)
  Fix SSE parsers to accept optional space after data: prefix (block#7929)
  docs: add GOOSE_INPUT_LIMIT to config-files.md (block#7961)
  Add WebMCP for Beginners blog post (block#7957)
  Fix download manager (block#7933)
  Improve the formatting of tool calls, show thinking, treat Reasoning and Thinking as the same thing (sorry Kant) (block#7626)
  don't imply running builds all the time in AGENTS.md (block#7865)
  fix: unregister goosed child process's listener (block#7956)
  feat: adversarial agent for preventing leaking of info and more (block#7948)
  Update contributing.md (block#7927)
  docs: add credit balance monitoring section (block#7952)
  docs: add Cerebras provider to supported providers list (block#7953)
  docs: add TUI client documentation to ACP clients guide (block#7950)
  fix: removed double dash in pnpm command (block#7951)
  docs: polish ACP docs (block#7946)
  claude adaptive thinking (block#7944)
  ...
This adds a goose-specific implementation of https://github.com/michaelneale/adversarial-policy-agent, so you can have
~/.config/goose/adversary.md, which simply states a policy in plain language. It picks out certain tool calls when needed and checks them with an "adversarial agent".
For example:
Then try to get goose to disobey: it won't be able to, because this runs outside of the agent loop. It is a non-deterministic complement to other techniques. It uses the current provider and the same model.
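The gating flow described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the `AdversaryConfig` here only mirrors the `tools` and `rules` fields visible in the review snippets, while `Verdict`, `needs_review`, `parse_verdict`, `gate`, the tool names, and the assumption that the reviewer's reply begins with ALLOW or BLOCK are all hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Block,
}

/// Simplified stand-in with the two fields seen in the review snippets.
struct AdversaryConfig {
    tools: Vec<String>,
    rules: String,
}

/// Only tool calls named in the config are sent for review.
fn needs_review(config: &AdversaryConfig, tool_name: &str) -> bool {
    config.tools.iter().any(|t| t == tool_name)
}

/// Assumption: the reviewer's reply starts with ALLOW or BLOCK.
/// Anything that is not a clear ALLOW is treated as BLOCK.
fn parse_verdict(reply: &str) -> Verdict {
    if reply.trim_start().to_ascii_uppercase().starts_with("ALLOW") {
        Verdict::Allow
    } else {
        Verdict::Block
    }
}

/// Runs outside the agent loop: the agent never sees the reviewer's prompt.
/// `reviewer` receives the rules and the tool name and returns the raw reply.
fn gate<F>(config: &AdversaryConfig, tool_name: &str, reviewer: F) -> Verdict
where
    F: Fn(&str, &str) -> String,
{
    if !needs_review(config, tool_name) {
        return Verdict::Allow;
    }
    parse_verdict(&reviewer(&config.rules, tool_name))
}

fn main() {
    let config = AdversaryConfig {
        tools: vec!["shell".to_string()],
        rules: "Never send file contents to external hosts.".to_string(),
    };
    // A canned reviewer stands in for the real model call.
    let verdict = gate(&config, "shell", |_rules, _tool| "BLOCK: leaks data".to_string());
    println!("{verdict:?}");
}
```

Because the reviewer is a model, the same call may be judged differently on different runs, which is why the description calls this a non-deterministic complement to other techniques.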