fix(ci): deflake smoke tests for Google models#7344
Merged
The uppercase transformation prompt ('output this file's contents in
UPPERCASE') was ambiguous enough that Gemini models would frequently
hallucinate uppercase text instead of uppercasing the actual file
content (e.g. 'HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...')
or uppercase the filename instead of the contents.
Replace with a two-file read-back test using random tokens per run.
This still verifies tool use (text_editor must be called) and proves
the model read the file contents (random tokens can't be guessed),
without requiring a transformation that trips up the models.
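The setup half of such a read-back test might look like the sketch below. It is only an illustration, not the PR's actual script: the file names (part-a.txt, part-b.txt), the scratch-directory handling, and the prompt wording are assumptions. Only the smoke-alpha-$RANDOM / smoke-bravo-$RANDOM token pattern comes from the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: file names and prompt wording are hypothetical.
set -e

# Scratch directory for this run; cleaned up when the script exits.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Fresh tokens on every run, so a model cannot guess them without reading the files.
token_a="smoke-alpha-$RANDOM"
token_b="smoke-bravo-$RANDOM"

# One token per file; the model has to read both files to answer.
printf '%s\n' "$token_a" > "$workdir/part-a.txt"
printf '%s\n' "$token_b" > "$workdir/part-b.txt"

# Hypothetical prompt: a plain echo-back, with no transformation step.
prompt="Read part-a.txt and part-b.txt and reply with ONLY the contents of both files."

echo "prompt: $prompt"
echo "expected tokens: $token_a $token_b"
```

Because the tokens change on every run, a cached or hallucinated answer cannot pass by accident.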
Contributor
Pull request overview
This PR deflakes the Live Provider “Smoke Tests” for Google Gemini models by replacing an ambiguous uppercase-transformation prompt with a deterministic two-file read-back prompt using per-run random tokens, ensuring the model must actually read file contents rather than follow a stylistic instruction.
Changes:
- Replace the uppercase transformation smoke test with a two-file read-back test using unique per-run tokens.
- Update validation to check that both random tokens are present in the model output (and that text_editor was used).
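The validation described above can be sketched as a small shell function. This is a guess at the shape of the check, not the PR's code: the function name validate_readback, the argument order, and the exact grep pattern for the tool call are all assumptions (the PR only says the text_editor grep check is the same as before).

```shell
#!/usr/bin/env bash
# Sketch only: validate_readback and its arguments are hypothetical names.

# validate_readback OUTPUT_FILE TOKEN_A TOKEN_B
# Succeeds only if the captured output mentions text_editor and contains both tokens.
validate_readback() {
  local output_file="$1"
  local token_a="$2"
  local token_b="$3"

  # Tool-use check (assumed pattern): the log must show a text_editor call.
  if ! grep -q "text_editor" "$output_file"; then
    echo "FAIL: text_editor was not called" >&2
    return 1
  fi

  # Read-back check: both per-run tokens must appear verbatim (-F = fixed string).
  if ! grep -qF -- "$token_a" "$output_file"; then
    echo "FAIL: missing token $token_a" >&2
    return 1
  fi
  if ! grep -qF -- "$token_b" "$output_file"; then
    echo "FAIL: missing token $token_b" >&2
    return 1
  fi

  return 0
}
```

A caller would run the model, capture its output to a file, and then call something like validate_readback "$output_file" "$token_a" "$token_b". Matching with fixed strings keeps the check case-sensitive, so an uppercased echo of a token would still fail.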
DOsinga
approved these changes
Feb 19, 2026
Collaborator
DOsinga
left a comment
nice one. just wanted to start working on this
jh-block
added a commit
that referenced
this pull request
Feb 19, 2026
* origin/main: fix(ci): deflake smoke tests for Google models (#7344) feat: add Cerebras provider support (#7339) fix: skip whitespace-only text blocks in Anthropic message (#7343) fix(goose-acp): heap allocations (#7322) Remove trailing space from links (#7156) fix: detect low balance and prompt for top up (#7166) feat(apps): add support for MCP apps to sample (#7039) Typescript SDK for ACP extension methods (#7319) chore: upgrade to rmcp 0.16.0 (#7274) docs: add monitoring subagent activity section (#7323) docs: document Desktop UI recipe editing for model/provider and extensions (#7327) docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330) fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321) docs: add Permission Policy documentation for MCP Apps (#7325) update RPI plan prompt (#7326) docs: add CLI syntax highlighting theme customization (#7324) fix(cli): replace shell-based update with native Rust implementation (#7148) docs: rename Code Execution extension to Code Mode extension (#7316)
aharvard
added a commit
that referenced
this pull request
Feb 19, 2026
* origin/main: (29 commits) fix(ci): deflake smoke tests for Google models (#7344) feat: add Cerebras provider support (#7339) fix: skip whitespace-only text blocks in Anthropic message (#7343) fix(goose-acp): heap allocations (#7322) Remove trailing space from links (#7156) fix: detect low balance and prompt for top up (#7166) feat(apps): add support for MCP apps to sample (#7039) Typescript SDK for ACP extension methods (#7319) chore: upgrade to rmcp 0.16.0 (#7274) docs: add monitoring subagent activity section (#7323) docs: document Desktop UI recipe editing for model/provider and extensions (#7327) docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330) fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321) docs: add Permission Policy documentation for MCP Apps (#7325) update RPI plan prompt (#7326) docs: add CLI syntax highlighting theme customization (#7324) fix(cli): replace shell-based update with native Rust implementation (#7148) docs: rename Code Execution extension to Code Mode extension (#7316) docs: remove ALPHA_FEATURES flag from documentation (#7315) docs: escape variable syntax in recipes (#7314) ... # Conflicts: # ui/desktop/src/components/McpApps/McpAppRenderer.tsx # ui/desktop/src/components/McpApps/types.ts
katzdave
added a commit
that referenced
this pull request
Feb 19, 2026
* 'main' of github.com:block/goose: (24 commits) Docs: claude code uses stream-json (#7358) Improve link confirmation modal (#7333) fix(ci): deflake smoke tests for Google models (#7344) feat: add Cerebras provider support (#7339) fix: skip whitespace-only text blocks in Anthropic message (#7343) fix(goose-acp): heap allocations (#7322) Remove trailing space from links (#7156) fix: detect low balance and prompt for top up (#7166) feat(apps): add support for MCP apps to sample (#7039) Typescript SDK for ACP extension methods (#7319) chore: upgrade to rmcp 0.16.0 (#7274) docs: add monitoring subagent activity section (#7323) docs: document Desktop UI recipe editing for model/provider and extensions (#7327) docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330) fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321) docs: add Permission Policy documentation for MCP Apps (#7325) update RPI plan prompt (#7326) docs: add CLI syntax highlighting theme customization (#7324) fix(cli): replace shell-based update with native Rust implementation (#7148) docs: rename Code Execution extension to Code Mode extension (#7316) ...
Collaborator
Author
@DOsinga the perfect one for an agent to solve - just told it to look through history for patterns, make a branch and try some things (and made sure it stayed true to intent of test!)
michaelneale
added a commit
that referenced
this pull request
Feb 19, 2026
* main: (46 commits) chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369) Include 3rd-party license copy for JavaScript/CSS minified files (#7352) docs for reasoning env var (#7367) docs: update skills detail page to reference Goose Summon extension (#7350) fix(apps): restore MCP app sampling support reverted by #6933 (#7366) feat: TUI client of goose-acp (#7362) docs: agent variable (#7365) docs: pass env vars to shell (#7361) docs: update sandbox topic (#7336) feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933) Docs: claude code uses stream-json (#7358) Improve link confirmation modal (#7333) fix(ci): deflake smoke tests for Google models (#7344) feat: add Cerebras provider support (#7339) fix: skip whitespace-only text blocks in Anthropic message (#7343) fix(goose-acp): heap allocations (#7322) Remove trailing space from links (#7156) fix: detect low balance and prompt for top up (#7166) feat(apps): add support for MCP apps to sample (#7039) Typescript SDK for ACP extension methods (#7319) ...
Problem
The Live Provider Tests Smoke Tests have been flaking heavily on Google models (gemini-2.5-pro and gemini-3-flash-preview), causing ~75% of all Smoke Test failures across PRs. These are all flakes: re-triggered runs pass.
The root cause is the uppercase transformation prompt.
Gemini models interpret "output in UPPERCASE" as a style instruction for their response rather than a transformation of specific file content. They read the file correctly but then hallucinate uppercase text instead:
- HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...
- HELLO WORLD. THIS IS A TEST. THE QUICK BROWN FOX...
- INPUT.TXT-ABC123 (uppercased the filename, not the content)
Fix
Replace the uppercase transformation test with a two-file read-back test using random tokens per run (smoke-alpha-$RANDOM, smoke-bravo-$RANDOM).
This still verifies:
- text_editor must be called (same grep check as before)
No transformation needed, just echo back. The prompt asks the model to reply with ONLY the file contents, which is much less ambiguous than asking for a case transformation.