🤖 Fix test flake by simplifying prompt and clarifying unlimited steps #406
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
The
openai-web-search.test.tsintegration test was flaking in CI with timeouts after 120+ seconds:stream-endCI Run: https://github.com/coder/cmux/actions/runs/18766377932/job/53542148133
Root Cause
The test prompt was too complex for a reasoning model:
With
thinkingLevel: 'high'+web_search, this caused the model to enter excessive tool call loops:This is NOT a bug in the unlimited steps configuration - models MUST be able to run for hours or even days with unlimited tool calls for autonomous workflows.
Solution
Clarified unlimited steps intent: Added comment explaining that the 100k step limit is intentionally high to support long-running autonomous workflows
Simplified test prompt: Changed to simple weather query + picnic decision
Reduced thinking level: Changed from
hightomediumto avoid excessive deliberationAdjusted timeouts: Reduced to 120s/90s for simpler task
Testing
Type checking passes. The test still validates the same bug fix with a more stable prompt.
Generated with
cmux