@ammar-agent (Collaborator)

The test was flaky because it asserted that the LLM response text contained 'terminal bench'. LLMs sometimes summarize command output instead of quoting it verbatim, so that substring was not always present.

The test now verifies that the bash tool completed by checking for tool-call-end events. This check is deterministic and directly tests what we care about: the command completed without hanging.
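
As a rough illustration of the difference between the two assertions, here is a minimal sketch. The `StreamEvent` shape, the `toolName` field, and the helper names are assumptions made for this example and are not taken from the mux codebase; only the `tool-call-end` event name and the 'terminal bench' string come from the description above.

```typescript
// Hypothetical event shape for illustration; the real stream events may differ.
type StreamEvent =
  | { type: "stream-delta"; text: string }
  | { type: "tool-call-start"; toolName: string }
  | { type: "tool-call-end"; toolName: string };

// Old-style check (brittle): depends on how the model words its reply.
function responseMentions(events: StreamEvent[], needle: string): boolean {
  const text = events
    .flatMap((e) => (e.type === "stream-delta" ? [e.text] : []))
    .join("");
  return text.includes(needle);
}

// New-style check (deterministic): did the bash tool call finish at all?
function bashToolCompleted(events: StreamEvent[]): boolean {
  return events.some(
    (e) => e.type === "tool-call-end" && e.toolName === "bash",
  );
}

// Sample run in which the model summarizes the output instead of quoting it.
const sampleEvents: StreamEvent[] = [
  { type: "tool-call-start", toolName: "bash" },
  { type: "tool-call-end", toolName: "bash" },
  { type: "stream-delta", text: "The command found several matches." },
];

const oldCheckPasses = responseMentions(sampleEvents, "terminal bench");
const newCheckPasses = bashToolCompleted(sampleEvents);
```

With this sample, the text-based check fails even though the command ran to completion, while the event-based check passes regardless of how the model phrases its reply.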


Generated with mux • Model: anthropic:claude-opus-4-5 • Thinking: high

@ammario changed the title from "🤖 fix: make grep|head test assertion deterministic" to "🤖 ci: make grep|head test assertion deterministic" on Dec 21, 2025
@ammario merged commit 4e342d5 into main on Dec 21, 2025
19 checks passed
@ammario deleted the fix-flaky-grep-head-test branch on December 21, 2025 02:34