Fix gemini-cli and opencode E2E harness defects #1277
Merged
Both agents accounted for ~75 of the persistent E2E failures, neither
caused by CLI product code — the bugs are in the e2e agent runners.
opencode (43 failures): RunPrompt set cmd.Dir but not PWD. Go's cmd.Dir
chdirs the child without updating the inherited PWD env var, and opencode
(Node) resolves its project root from process.env.PWD — so file
operations landed in the go-test package dir instead of the test repo,
the per-repo entire plugin never loaded, and no checkpoint was created
("checkpoint ref did not advance" / "git add: pathspec did not match").
Extract openCodePromptEnv to force PWD to match cmd.Dir.
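To make the PWD fix concrete, here is a minimal sketch of the env filtering described above. It is not the code from this PR: `promptEnv` is a hypothetical name standing in for `openCodePromptEnv`, and the paths in `main` are made up. As far as I know, since Go 1.19 os/exec sets PWD implicitly only when `Cmd.Env` is nil, so the sketch assumes the runner passes an explicit `Env`.

```go
package main

import (
	"fmt"
	"strings"
)

// promptEnv is a hypothetical stand-in for openCodePromptEnv: it drops every
// inherited PWD entry from base and appends a single PWD that matches dir.
func promptEnv(base []string, dir string) []string {
	env := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if strings.HasPrefix(kv, "PWD=") {
			continue // stale value inherited from the parent process
		}
		env = append(env, kv)
	}
	return append(env, "PWD="+dir)
}

func main() {
	// Illustrative values only; a runner would pass os.Environ() and cmd.Dir.
	base := []string{"PATH=/usr/bin", "PWD=/src/e2e/tests", "HOME=/home/ci"}
	fmt.Println(promptEnv(base, "/tmp/e2e-repo-123"))
}
```

In a runner this would be wired up roughly as `cmd.Dir = dir` followed by `cmd.Env = promptEnv(os.Environ(), dir)`, so the child's working directory and its PWD cannot disagree.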
gemini-cli (32 failures): the model intermittently aborts a turn with
"Invalid stream: empty response or malformed tool call" on stderr while
still exiting 0 with empty stdout. The turn never completes, the
after-agent hook never fires, no checkpoint is created. Because the
process exits 0, the transient-retry path (err != nil && IsTransientError)
was never reached. Surface the abort as an error in RunPrompt and add the
signature to IsTransientError so the scenario restarts instead of failing
on a missing checkpoint.
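The "exit 0 but aborted" handling can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `abortSignatures`, `abortedTurn`, `checkTurn`, and `isTransient` are hypothetical names, and the signature fragments are taken from the stderr message quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// abortSignatures holds stderr fragments that mark an aborted turn.
// The fragments come from the message quoted in this PR; the list shape is assumed.
var abortSignatures = []string{
	"Invalid stream",
	"empty response or malformed tool call",
}

// abortedTurn reports whether text contains any known abort signature.
func abortedTurn(text string) bool {
	for _, sig := range abortSignatures {
		if strings.Contains(text, sig) {
			return true
		}
	}
	return false
}

// checkTurn keeps a real process error as-is, and synthesizes an error when
// the process exited 0 but stderr shows an aborted turn.
func checkTurn(runErr error, stderr string) error {
	if runErr != nil {
		return runErr
	}
	if abortedTurn(stderr) {
		return fmt.Errorf("gemini aborted turn: %s", strings.TrimSpace(stderr))
	}
	return nil
}

// isTransient treats an error carrying an abort signature as retryable.
func isTransient(err error) bool {
	return err != nil && abortedTurn(err.Error())
}

func main() {
	stderr := "Invalid stream: The model returned an empty response or malformed tool call"
	err := checkTurn(nil, stderr)
	fmt.Println(err != nil, isTransient(err))
	fmt.Println(isTransient(errors.New("exit status 1")))
}
```

Because the synthesized error embeds the stderr text, the same signature list can drive both detection and the retry decision.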
Adds unit tests (run in the standard suite, not e2e-gated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d8dbfcccbf60
Pull request overview
Fixes persistent E2E harness failures for the opencode and gemini-cli agent runners by correcting headless execution environment handling and by treating a known gemini-cli “aborted turn” signature (stderr-only, exit 0) as a transient error so scenarios retry instead of failing on missing checkpoints.
Changes:
- opencode: ensure headless `RunPrompt` forces `PWD` in the child environment to match `cmd.Dir` (and strips any stale inherited `PWD`).
- gemini-cli: detect server-side aborted turns via stderr signatures; synthesize an error when the process exits 0 so the existing transient-retry path triggers.
- Add unit tests covering `PWD` override logic and gemini abort/transient detection.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| e2e/agents/opencode.go | Adds openCodePromptEnv() and uses it so PWD matches the repo dir for headless runs. |
| e2e/agents/opencode_test.go | New test verifying PWD override and env filtering behavior. |
| e2e/agents/gemini.go | Adds aborted-turn signature detection; marks it transient and synthesizes an error on “abort but exit 0”. |
| e2e/agents/gemini_test.go | Adds tests for transient detection and aborted-turn signature matching. |
pjbgf
previously approved these changes
May 27, 2026
The branch run confirmed the harness retry fix works (gemini's "Invalid stream" aborts now trigger scenario restarts), but ~26 tests still exhaust all 3 retries: gemini-2.5-flash returns an empty/malformed response on the first model turn even for trivial prompts. That is an upstream model/CLI instability, not a checkpoint/strategy bug.

Make the gemini model tunable via E2E_GEMINI_MODEL (mirrors E2E_CODEX_MODEL / E2E_OPENCODE_MODEL), defaulting to the cheap gemini-2.5-flash locally, and pin the CI gemini-cli job to gemini-2.5-pro to see whether the more capable model reduces the empty-response aborts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 752c67fbcfe7
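A rough sketch of the env-driven model selection described in this commit. Only the variable name `E2E_GEMINI_MODEL` and the default `gemini-2.5-flash` come from the commit message; `resolveModel` and `defaultGeminiModel` are hypothetical names, and the lookup function is injected purely to keep the example easy to test.

```go
package main

import (
	"fmt"
	"os"
)

// defaultGeminiModel is the local default named in the commit message.
const defaultGeminiModel = "gemini-2.5-flash"

// resolveModel returns the E2E_GEMINI_MODEL override when it is set and
// non-empty, and falls back to the default otherwise. getenv is injected so
// callers can pass os.Getenv or a fake lookup.
func resolveModel(getenv func(string) string) string {
	if m := getenv("E2E_GEMINI_MODEL"); m != "" {
		return m
	}
	return defaultGeminiModel
}

func main() {
	// Prints the default unless E2E_GEMINI_MODEL is set in the environment.
	fmt.Println(resolveModel(os.Getenv))
}
```

With this shape, CI can pin a model by exporting `E2E_GEMINI_MODEL` for the gemini-cli job while local runs keep the default.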
gemini-2.5-pro reduced the "Invalid stream" empty-response aborts (26 -> 18 failures) but didn't eliminate them, and added 429 rate-limit noise (retried, not fatal). Try gemini-3.1-flash-lite instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 5e2987223baf
pjbgf
approved these changes
May 27, 2026
https://entire.io/gh/entireio/cli/trails/431
Summary

Triage of the persistent E2E failures on `main` (run 26496760848) found that ~75 of the failures came from two agents, opencode (43) and gemini-cli (32), and neither is caused by CLI product code. Both are defects in the `e2e/` agent runners. This PR fixes both. No `cmd/entire/cli/` changes.

opencode (43 failures): wrong working directory

`RunPrompt` set `cmd.Dir` but not `PWD`. Go's `cmd.Dir` chdirs the child without updating the inherited `PWD` env var, and opencode (Node) resolves its project/worktree root from `process.env.PWD`. So opencode's file writes landed in the `go test` package dir instead of the `/tmp/e2e-repo-*` test repo, the per-repo `.opencode/plugins/entire.ts` never loaded, and no checkpoint was created. This surfaced as `checkpoint ref entire/checkpoints/v1 did not advance within 30s` and `git add <path>: exit status 128` (file never existed in the repo).

Fix: extract `openCodePromptEnv()`, which strips any stale `PWD` and forces it to match `cmd.Dir`. The tmux/interactive path was already correct (`tmux new-session -c dir` sets `PWD`), which is why the 5 interactive tests passed.

gemini-cli (32 failures): aborted turn exits 0

gemini-cli intermittently aborts a turn server-side with `Invalid stream: The model returned an empty response or malformed tool call` on stderr while exiting 0 with empty stdout. The turn never completes → the `after-agent` lifecycle hook never fires → no checkpoint. Because the process exits 0, the harness's transient-retry path (`err != nil && IsTransientError`) was never reached, so a retryable glitch was scored as a hard "checkpoint did not advance" failure.

Fix: `RunPrompt` now synthesizes an error when it detects the abort signature in stderr, and `IsTransientError` recognizes it (`geminiAbortSignatures`), so the scenario restarts (up to 3 attempts) instead of failing on a missing checkpoint.

Tests

- `e2e/agents/opencode_test.go` (new): `TestOpenCodePromptEnv_OverridesPWD`
- `e2e/agents/gemini_test.go`: `TestGeminiIsTransientError_RecognizesAbortedTurn`, `TestGeminiAbortedTurn`

These run in the standard `mise run test` suite (`e2e/agents` is not behind the `e2e` build tag), so they gate merges without needing a paid E2E run.

Verification

- `mise run lint`: clean
- `go test ./e2e/agents/`: pass
- `gofmt` / `dup:staged`: clean

Notes / caveats

- […] `Invalid stream`, which is high for pure flakiness. The retry will salvage genuinely transient cases; if gemini-2.5-flash is systematically choking, retries will exhaust and it'll still fail (correctly), at which point it's a model/version question, not a harness one.
- […] `FACTORY_API_KEY` device-auth + Copilot trust-dialog sentinel).

🤖 Generated with Claude Code
Note
Low Risk
Test-harness-only changes in e2e/agents with unit tests; no production CLI or auth paths touched.
Overview
Fixes two E2E agent runner bugs in `e2e/agents` (no CLI product changes).

opencode: Headless `RunPrompt` now sets child env via `openCodePromptEnv()` so `PWD` matches `cmd.Dir`. Without that, opencode (Node) used a stale `PWD` from `go test`, wrote files outside the temp repo, and checkpoints never advanced.

gemini-cli: When stderr shows a server-side abort (`Invalid stream` / empty or malformed tool call) but the process exits 0, `RunPrompt` now returns an error and `IsTransientError` treats it as retryable so scenarios restart instead of failing on missing checkpoints.

Unit tests cover env override and abort/transient detection; they run in the normal `go test ./e2e/agents/` suite.

Reviewed by Cursor Bugbot for commit 1ee9286.