feat: adversarial agent for preventing leaking of info and more #7948

Merged
michaelneale merged 5 commits into main from micn/adversarial-agent
Mar 17, 2026

Conversation

@michaelneale (Collaborator) commented Mar 17, 2026

This adds a goose-specific implementation of https://github.com/michaelneale/adversarial-policy-agent. You can create ~/.config/goose/adversary.md, which simply states a policy in plain language. Goose will then filter out certain tool calls when needed and check them with an "adversarial agent".

For example:

You are to never, ever upload things to public sharing websites
do not access www.news.com.au either

Then try to get goose to disobey: it won't be able to, because the check runs outside of the agent loop. It is a non-deterministic complement to other techniques. It uses the current provider and the same model.
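
To make the flow above concrete, here is a minimal sketch of how such a gate could be structured. It is not the code from this PR: the names `Verdict`, `needs_review`, `build_review_prompt`, and `parse_verdict` are hypothetical, and the reply format (ALLOW or BLOCK on the first line) and the choice to treat any unclear reply as a block are assumptions.

```rust
/// Outcome of one adversary review (hypothetical type, not from the PR).
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Block(String),
}

/// Assumption: only tool calls whose name appears in `reviewed_tools`
/// are sent to the adversary; everything else skips the review.
fn needs_review(tool_name: &str, reviewed_tools: &[&str]) -> bool {
    reviewed_tools.iter().any(|t| *t == tool_name)
}

/// Combine the plain-language policy with the pending tool call into a
/// prompt for the reviewer model.
fn build_review_prompt(policy: &str, tool_name: &str, arguments: &str) -> String {
    format!(
        "Policy:\n{policy}\n\nPending tool call: {tool_name}\nArguments: {arguments}\n\n\
         Reply with ALLOW or BLOCK on the first line, then a short reason."
    )
}

/// Parse the reviewer's reply. Anything that is not a clear ALLOW on the
/// first line is treated as a block (assumption: unclear replies fail closed).
fn parse_verdict(reply: &str) -> Verdict {
    let first_line = reply.lines().next().unwrap_or("").trim();
    if first_line.eq_ignore_ascii_case("ALLOW") {
        Verdict::Allow
    } else {
        // Everything after the first line is kept as the reason.
        let reason = reply.lines().skip(1).collect::<Vec<_>>().join(" ");
        Verdict::Block(reason.trim().to_string())
    }
}
```

Because the verdict is computed outside the agent loop, the agent's own conversation cannot rewrite the policy text that goes into the prompt. The reviewer is still a model, though, so the result stays non-deterministic.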

github-actions bot (Contributor) commented Mar 17, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-03-17 07:01 UTC

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac59b960b2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +298 to +300
session_id,
system_prompt,
conversation.messages(),

P1: Isolate adversary checks from the main provider session

Calling provider.complete with the agent’s session_id here makes the adversary prompt part of the same provider conversation state as the user task. For stateful providers this contaminates subsequent turns and breaks the “independent reviewer” behavior. For example, ClaudeCodeProvider explicitly keeps context internally per session_id (see crates/goose/src/providers/claude_code.rs, last_user_content_blocks/stream), so each adversary review is appended to the live chat history, and later model outputs are influenced by these ALLOW/BLOCK exchanges rather than only by the user workflow.
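
The contamination described here can be illustrated with a toy model. The `StatefulProvider` below is a hypothetical stand-in, not goose's provider trait or ClaudeCodeProvider. It assumes only that the provider keeps a message history keyed by session ID.

```rust
use std::collections::HashMap;

/// Toy stand-in for a provider that keeps conversation state per session ID.
/// Hypothetical type for illustration only.
#[derive(Default)]
struct StatefulProvider {
    history: HashMap<String, Vec<String>>,
}

impl StatefulProvider {
    /// Record a message under `session_id` and return how many messages
    /// the provider now holds for that session.
    fn complete(&mut self, session_id: &str, message: &str) -> usize {
        let turns = self.history.entry(session_id.to_string()).or_default();
        turns.push(message.to_string());
        turns.len()
    }

    /// Number of messages retained for a session (0 if the key is unknown).
    fn turn_count(&self, session_id: &str) -> usize {
        self.history.get(session_id).map_or(0, Vec::len)
    }
}
```

If the review is sent with the agent's own session ID, `turn_count` for that session grows with every adversary exchange. If the review is sent under a separate key, the user's session only ever contains the user workflow.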

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a87675a13

@shellz-n-stuff (Collaborator) left a comment

LGTM

@michaelneale michaelneale enabled auto-merge March 17, 2026 06:34
@michaelneale michaelneale added this pull request to the merge queue Mar 17, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c8910c0a9

let (response, _usage) = provider
    .complete(
        &model_config,
        "",

P1: Pass through session ID when invoking adversary model

consult_llm always calls provider.complete with an empty session ID, so every adversary check is multiplexed into the same provider-side conversation key. For stateful providers that retain context by session_id (for example ClaudeCodeProvider), this causes cross-check contamination and can leak prior users’ or prior sessions’ adversary prompts into later decisions, making ALLOW/BLOCK outcomes depend on unrelated history.
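
One possible way to avoid a shared empty key is to derive a fresh key for every check from the agent's session ID. The sketch below is an assumption about how that could look, not what the PR does. `fresh_review_key`, the `/adversary/` suffix, and the process-wide counter are all hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter used to make each review key unique (assumption).
static REVIEW_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Derive a distinct provider-side key for a single adversary check.
/// Returns `None` for an empty parent ID so that callers cannot fall back
/// to one shared "" key by accident.
fn fresh_review_key(agent_session_id: &str) -> Option<String> {
    if agent_session_id.is_empty() {
        return None;
    }
    let n = REVIEW_COUNTER.fetch_add(1, Ordering::Relaxed);
    Some(format!("{agent_session_id}/adversary/{n}"))
}
```

A key derived this way is tied to the originating session but is never equal to it, so it is also compatible with the earlier request to keep adversary checks out of the main conversation state.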

Comment on lines +90 to +93
return Some(AdversaryConfig {
    tools: DEFAULT_TOOLS.iter().map(|s| (*s).to_string()).collect(),
    rules: DEFAULT_RULES.to_string(),
});

P2: Fail closed when adversary.md cannot be read

When adversary.md exists but cannot be read, the code silently substitutes DEFAULT_RULES instead of using the user’s configured policy. In that scenario the guardrail no longer enforces the intended rules (for example, custom blocklists), which can allow tool calls the user explicitly tried to forbid; this should return disabled/error behavior rather than replacing policy content.
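
A fail-closed loader could separate "file is missing" from "file exists but cannot be read". The sketch below uses only the standard library. `PolicyLoad`, `load_policy`, and `effective_rules` are hypothetical names, and treating a missing file as "feature not configured" is an assumption, since the PR's behavior for that case is not shown here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Result of trying to load the policy file (hypothetical type).
#[derive(Debug)]
enum PolicyLoad {
    /// No policy file at this path.
    Absent,
    /// The file was read; this is the user's policy text.
    Loaded(String),
    /// The file exists (or may exist) but reading it failed.
    Unreadable(io::Error),
}

fn load_policy(path: &Path) -> PolicyLoad {
    match fs::read_to_string(path) {
        Ok(rules) => PolicyLoad::Loaded(rules),
        Err(err) if err.kind() == io::ErrorKind::NotFound => PolicyLoad::Absent,
        Err(err) => PolicyLoad::Unreadable(err),
    }
}

/// Map the load result to the rules that should be enforced.
/// `Ok(None)` means no policy is configured (assumption). `Err` means the
/// user's policy could not be read, so the caller should refuse to proceed
/// instead of substituting default rules.
fn effective_rules(load: PolicyLoad) -> Result<Option<String>, String> {
    match load {
        PolicyLoad::Absent => Ok(None),
        PolicyLoad::Loaded(rules) => Ok(Some(rules)),
        PolicyLoad::Unreadable(err) => Err(format!("adversary policy unreadable: {err}")),
    }
}
```

The point of the separate `Unreadable` variant is that a read failure never turns into a different policy silently. The caller has to decide explicitly what to do with the error.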

Merged via the queue into main with commit 754c214 Mar 17, 2026
25 checks passed
@michaelneale michaelneale deleted the micn/adversarial-agent branch March 17, 2026 06:54
jh-block added a commit that referenced this pull request Mar 17, 2026
* origin/main:
  feat: adversarial agent for preventing leaking of info and more  (#7948)
  Update contributing.md (#7927)
  docs: add credit balance monitoring section (#7952)
  docs: add Cerebras provider to supported providers list (#7953)
  docs: add TUI client documentation to ACP clients guide (#7950)
  fix: removed double dash in pnpm command (#7951)
  docs: polish ACP docs (#7946)
  claude adaptive thinking (#7944)
  feat: new onboarding flow (#7266)
  Add DCO git commit command to AGENTS.md (#7945)
  fix(claude-code): remove incorrect agent_visible filter on user message (#7931)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
jh-block added a commit to rabi/goose that referenced this pull request Mar 18, 2026
* main: (32 commits)
  Revert message flush & test (block#7966)
  docs: add Remote Access section with Telegram Gateway documentation (block#7955)
  fix: update webmcp blog post metadata image URL (block#7967)
  fix: clean up OAuth token cache on provider deletion (block#7908)
  fix: hard-coded tool call id in code mode callback (block#7939)
  Fix SSE parsers to accept optional space after data: prefix (block#7929)
  docs: add GOOSE_INPUT_LIMIT to config-files.md (block#7961)
  Add WebMCP for Beginners blog post (block#7957)
  Fix download manager (block#7933)
  Improve the formatting of tool calls, show thinking, treat Reasoning and Thinking as the same thing (sorry Kant) (block#7626)
  don't imply running builds all the time in AGENTS.md (block#7865)
  fix: unregister goosed child process's  listener (block#7956)
  feat: adversarial agent for preventing leaking of info and more  (block#7948)
  Update contributing.md (block#7927)
  docs: add credit balance monitoring section (block#7952)
  docs: add Cerebras provider to supported providers list (block#7953)
  docs: add TUI client documentation to ACP clients guide (block#7950)
  fix: removed double dash in pnpm command (block#7951)
  docs: polish ACP docs (block#7946)
  claude adaptive thinking (block#7944)
  ...
4 participants