Skip to content

Conversation

@ammar-agent
Copy link
Collaborator

@ammar-agent ammar-agent commented Nov 7, 2025

Problem

The integration test should handle bash command with special characters was flaky in CI. Investigation revealed two issues:

  1. The test was testing AI escaping behavior rather than bash execution functionality - it expected the LLM to properly escape shell special characters ($, backticks, quotes), which is unpredictable
  2. gpt-5-mini was not reliably calling the bash tool - tests would complete but the tool call would be missing

Solution

  1. Removed the flaky special characters test entirely - it wasn't testing our code
  2. Switched all tests from gpt-5-mini to claude-haiku-4-5 - Haiku is faster and more reliable for tool use

Testing

Ran all 6 tests 3x locally - all passed:

  • Run 1: 6/6 passed (25.9s)
  • Run 2: 6/6 passed (28.3s)
  • Run 3: 6/6 passed (26.3s)

Tests now complete quickly and reliably with the Haiku model.

Generated with cmux

The test was attempting to verify AI escaping behavior rather than
bash execution functionality. This made it flaky and dependent on
LLM capabilities rather than our code.

Changed test to verify multi-step bash operations (create file, read file)
which is more deterministic and actually tests bash execution works correctly.
gpt-5-mini was not reliably calling the bash tool, causing tests to fail.
Switching to Anthropic's Haiku model which is fast and more reliable for
tool use.
@ammario ammario changed the title 🤖 fix: simplify flaky bash special characters test 🤖 ci: simplify flaky bash special characters test Nov 7, 2025
@ammario ammario merged commit eb177d4 into main Nov 7, 2025
13 checks passed
@ammario ammario deleted the fix-intg branch November 7, 2025 16:23
ibetitsmike pushed a commit that referenced this pull request Nov 7, 2025
## Problem

The integration test `should handle bash command with special
characters` was flaky in CI. Investigation revealed two issues:

1. **The test was testing AI escaping behavior** rather than bash
execution functionality - it expected the LLM to properly escape shell
special characters (`$`, backticks, quotes), which is unpredictable
2. **gpt-5-mini was not reliably calling the bash tool** - tests would
complete but the tool call would be missing

## Solution

1. Removed the flaky special characters test entirely - it wasn't
testing our code
2. Switched all tests from `gpt-5-mini` to `claude-haiku-4-5` - Haiku is
faster and more reliable for tool use

## Testing

Ran all 6 tests 3x locally - all passed:
- Run 1: 6/6 passed (25.9s)
- Run 2: 6/6 passed (28.3s)  
- Run 3: 6/6 passed (26.3s)

Tests now complete quickly and reliably with the Haiku model.

_Generated with `cmux`_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants