Instructive denial for commands #2912

@kfsone

Describe the feature or problem you'd like to solve

A feature for blacklisting specific commands with a hard constructive pushback to accelerate flow

Proposed solution

The harness can be hardened against dangerous actions or behaviors by agents early and in ways consistent with patterns in model training to avoid long and counter-productive iteration loops.

A key feature would be that it would alter the context.

Sample scenario: complex instructions for developing a Java tool where the user stipulates "must use gradlew not gradle", or a Python tool "using uv instead of direct python or pip". As the context grows, attention drifts and the model does exactly those wrong things.

While this is a risk a yolo user takes, it is also easy for such commands to get approved by accident, especially when they are issued by sub-agents.

Currently

The user aborts/escapes and tells the agent not to do that, leaving the failed attempt in the context. The model reasons about the situation and may move on to do the right thing, but if the context represents a large enough body of work, models are routinely observed to struggle to understand this as a larger process correction, and they repeatedly try the same thing. In the case of a sub-agent, nothing is reported up to the parent agent, so additional sub-agents may repeat the same mistake.

(some slight paraphrasing of actual log)

... 21kt prompting, 18kt planning, 31kt execution ...
[tool name="bash" command="gradle ..."]
... tool output
<!reasoning!>
Hmm. the command failed, it must be because there's no U in the day. Let me investigate.
<!reasoning!>
[tool name="bash" command="run a level 3 diagnostic "]
...
<!user!>
No, I said you must use gradlew instead of gradle.
<!user!>
<!reasoning!>
The user is claiming that their instructions stated gradle must not be used. Their claim is that the instructions they gave were to use gradlew instead.  It appears they are making this statement because an attempt was made to run the new code using gradle instead of gradlew.

I can see that someone used 'gradle' earlier, but it worked fine, and it appears we were investigating an unrelated issue with the code.

I should remain calm and notify the user that there is no significant difference between the two commands and they can be used interchangeably, or that they can run the command again themselves this time using gradlew.

But wait, am I sure the issue with gradle was due to the code?  There was an earlier mention that using gradle would not be able to find libraries, and the errors above could represent a mismatch of libraries.

Also, I see now that the user did clearly state only gradlew should be used, and that using the wrong command would result in exactly the error we are seeing.

I should take over and attempt to rerun the operation but not using gradle this time. Wait, does that mean I cannot use any version of gradle or is it ok to use gradlew? The instruction was to use gradlew instead of gradle, but we had used gradle instead of gradlew.
<!reasoning!>
Ok I will rerun the build with the correct command.
[tool name="bash" command="gradle ..."]    <-- attention shifted between reasoning and command dispatch

This happens because, as the reasoning block was added to the context, the ordering of "gradlew instead of gradle" was being fulfilled by dialog-structuring patterns. The reasoning-block terminator and the transition to a structured, non-language mode re-rolled the attention dice, so after the =" tokens, "gradle" was the token in the right attention spot.

From this point on, the conversation is poisoned; attention is seriously undermined and these error/rebuke exchanges grossly weaken the scores of strong instruction-following patterns.

Proposed

Continuing with the example of a tool invocation that follows a significant context burden (research, coding, review, etc.):

The user is able to configure that "gradle:*" is a hard-rejection with a 'reasoning response':

I must not use `gradle` because: <enter reason>
Alternative: <leave blank or describe alternative approach>

and the user provides the reason as "raw gradle does not have access to the same critical libraries as we need for our application" and the alt as "use gradlew to avoid missing methods and mismatching function parameters".
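As a rough illustration of how the template could be filled in, here is a minimal Python sketch. The function name `render_denial` and its parameters are hypothetical, not part of any existing harness; the wording follows the template above, and the alternative is omitted when the user leaves it blank.

```python
def render_denial(blocked: str, reason: str, alternative: str = "") -> str:
    """Fill the 'reasoning response' template, phrased in the agent's own voice."""
    text = f"I must not use `{blocked}` because {reason}."
    if alternative:
        # The alternative is optional: the template allows it to be left blank.
        text += f" Instead, {alternative}."
    return text


print(render_denial(
    "gradle",
    "raw gradle does not have access to the same critical libraries as we need for our application",
    "use gradlew to avoid missing methods and mismatching function parameters",
))
```

Keeping the text in first person is the point: it should read as the agent's own conclusion, not as a rebuke from the user.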

Handling

When the tool sees this denial, it retcons the context.

... 21kt prompting, 18kt planning, 31kt execution ...
[tool name="bash" command="gradle ..."]

the command matches the rejection rule, and the harness replaces the tool instruction with agent-voice text:

... 21kt prompting, 18kt planning, 31kt execution ...
I must not use `gradle` because raw `gradle` does not have access to the same critical libraries as we need for our application. Instead, use `gradlew` to avoid missing methods and mismatching function parameters.

This revised context is passed to the model for inference, and the model presumably proceeds:

... `gradlew` to avoid missing methods and mismatching function parameters.
[tool name="bash" command="gradlew ..."]

-> if it tries to use gradle again, it is time to surface to the user.
-> if it uses gradlew instead (or uv ..., or | select -last 10 instead of | tail -10), we saved a lot of tokens and pain.
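The handling above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Rule` type, the `intercept` function, the list-of-dicts context shape, and the reading of "gradle:*" as "program name, then a glob over its arguments". A real harness would have its own message format.

```python
import fnmatch
from dataclasses import dataclass


@dataclass
class Rule:
    spec: str         # assumed syntax: "<program>:<glob over arguments>"
    reason: str
    alternative: str


def matches(rule: Rule, command: str) -> bool:
    """True when the command's program name and arguments match the rule spec."""
    name, _, arg_glob = rule.spec.partition(":")
    tokens = command.split()
    if not tokens or tokens[0] != name:
        return False
    return fnmatch.fnmatchcase(" ".join(tokens[1:]), arg_glob or "*")


def intercept(context: list, rules: list, strikes: dict) -> str:
    """Inspect the newest context entry; return "run", "retry" or "ask_user"."""
    call = context[-1]
    if call.get("type") != "tool_call":
        return "run"
    for rule in rules:
        if not matches(rule, call["command"]):
            continue
        strikes[rule.spec] = strikes.get(rule.spec, 0) + 1
        if strikes[rule.spec] > 1:
            return "ask_user"  # second attempt: surface to the user
        name = rule.spec.partition(":")[0]
        # Retcon: swap the rejected tool call for agent-voice text.
        context[-1] = {
            "type": "assistant_text",
            "denial": True,
            "text": f"I must not use `{name}` because {rule.reason}. "
                    f"Instead, {rule.alternative}.",
        }
        return "retry"  # re-run inference on the amended context
    return "run"


rules = [Rule("gradle:*", "raw gradle lacks our libraries", "use gradlew")]
context = [{"type": "tool_call", "command": "gradle build"}]
print(intercept(context, rules, {}), context[-1]["text"])
```

Note that `gradlew build` does not match here because the program name is compared exactly, not by prefix.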

We could even, in principle, edit out the pushback:

Given

{"embargo": [
  {"cmd": "| tail -", "reason": "shell is powershell", "alternative": "| select -last ..."}
]}

and the context:

User: ... shell is powershell ...
...
Now I will build.

[tool name="bash" command="cmake -G 'Visual Studio 17 2022' -S . -B out | tail -10 ..."]

The intercept rejects the command and replaces it with:

User: ... shell is powershell ...
...
Now I will build.

I must not use `| tail -` because the shell is powershell. Instead, use "| select -last ...".

The model uses this amended context to infer:

User: ... shell is powershell ...
...
Now I will build.

I must not use `| tail -` because the shell is powershell. Instead, use "| select -last ...".
[tool name="bash" command="cmake ... | select -last 10"]

Instead of leaving this as is when the tool execution finishes, the harness removes the push-back/rejection, so the post-tool context is:

User: ... shell is powershell ...
...
Now I will build.

[tool name="bash" command="cmake ... | select -last 10"]
cmake: ...
cmake: ...
[/tool]

This avoids the corrective instruction becoming an attention distraction. The model is free to make the same mistake again, but the errors won't be kept in the context.
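A minimal sketch of that clean-up step, assuming (hypothetically) that the harness tags its injected pushback with a `denial` flag and stores tool results under an `output` key. A denial is dropped only once the tool call that follows it has actually produced output.

```python
def prune_resolved_denials(context: list) -> list:
    """Drop injected denials that are directly followed by a completed tool call."""
    pruned = []
    for i, entry in enumerate(context):
        nxt = context[i + 1] if i + 1 < len(context) else None
        resolved = (
            bool(entry.get("denial"))
            and nxt is not None
            and nxt.get("type") == "tool_call"
            and "output" in nxt  # the retried command has finished
        )
        if not resolved:
            pruned.append(entry)
    return pruned


context = [
    {"type": "assistant_text", "text": "Now I will build."},
    {"type": "assistant_text", "denial": True,
     "text": "I must not use `| tail -` because the shell is powershell."},
    {"type": "tool_call", "command": "cmake ... | select -last 10",
     "output": "cmake: ..."},
]
for entry in prune_resolved_denials(context):
    print(entry["type"])
```

A denial that has not yet been followed by a finished tool call stays in place, so the model still sees it on the retry.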

Example prompts or workflows

  • Note: the feature would improve all of these workflows, not enable them; they already work without it.

WSL from Windows

prompt/instructions:
Prefix commands with 'wsl' to execute them in the Linux environment. Do not use '\' as a path separator.

config:

{"embargo": [
 {"pattern": "\\", "label": "windows path separators", "reason": "commands are interpreted in wsl's linux context, not windows, and '\\' is an escape character there", "alternative": "use the / separator, both bash AND powershell understand these with no conflict"},
 {"command": "bash", "reason": "invokes a MinGW bash process instead of the wsl we want", "alternative": "wsl"}
]}
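To show how a harness might read such a config, here is a small Python sketch. It assumes, purely for illustration, that "command" matches the program name (first token) and that "pattern" is a literal substring; the proposal does not pin down either. The reason texts are shortened. Note that the JSON string "\\" decodes to a single backslash.

```python
import json
from typing import Optional

# Raw string so the JSON text keeps its own "\\" escape intact.
CONFIG = r'''
{"embargo": [
 {"pattern": "\\", "label": "windows path separators",
  "reason": "commands are interpreted in wsl's linux context",
  "alternative": "use the / separator"},
 {"command": "bash",
  "reason": "invokes a MinGW bash process instead of the wsl we want",
  "alternative": "wsl"}
]}
'''


def first_violation(command: str, rules: list) -> Optional[dict]:
    """Return the first embargo rule the command violates, or None."""
    tokens = command.split()
    program = tokens[0] if tokens else ""
    for rule in rules:
        if "command" in rule and program == rule["command"]:
            return rule
        if "pattern" in rule and rule["pattern"] in command:
            return rule
    return None


rules = json.loads(CONFIG)["embargo"]
for cmd in (r"wsl ls C:\Users", "bash -c ls", "wsl ls /mnt/c/Users"):
    hit = first_violation(cmd, rules)
    print(cmd, "->", hit["alternative"] if hit else "allowed")
```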

Non-git version control

prompt/instructions:
Despite the .git directory, this is a Perforce workspace not a git repository

{"embargo": [
 {"command": "git", "reason": "yeahnahnah", "alternative": "use p4"}
]}

Prevent hallucinated email address

prompt/instructions:
In case of a failure you may mail the output to b--l@m-crosoft.com but you may not use email for other purposes

{"embargo": [
 {"pattern": "sendmail (?!-t b--l@m-crosoft.com -s 'build fail')", "reason": "you're not the one that'll get fired", "alternative": "may only use sendmail -t ... -s ..."}
]}
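The negative lookahead in that pattern can be checked with Python's `re` module. This sketch assumes "pattern" is applied as a regular-expression search over the whole command line; the function name `is_embargoed` is hypothetical. One deliberate tweak: the `.` in the address is escaped so it only matches a literal dot.

```python
import re

# Matches "sendmail " unless it is followed by exactly the approved arguments.
PATTERN = re.compile(r"sendmail (?!-t b--l@m-crosoft\.com -s 'build fail')")


def is_embargoed(command: str) -> bool:
    """True when the command contains a sendmail call outside the approved form."""
    return PATTERN.search(command) is not None


approved = "sendmail -t b--l@m-crosoft.com -s 'build fail'"
other = "sendmail -t someone@example.com -s 'hello'"
print(is_embargoed(approved), is_embargoed(other))
```

The lookahead only constrains the start of the argument list, so anything appended after 'build fail' would still pass; a stricter rule would also anchor the end of the command.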

Labels

area:configuration (Config files, instruction files, settings, and environment variables)
area:permissions (Tool approval, security boundaries, sandbox mode, and directory restrictions)