
Fix gemini-cli and opencode E2E harness defects #1277

Merged
Soph merged 3 commits into main from soph/fix-e2e-gemini-opencode on May 27, 2026

Collaborator

@Soph Soph commented May 27, 2026

https://entire.io/gh/entireio/cli/trails/431

Summary

Triage of the persistent E2E failures on main (run 26496760848) found that ~75 of the failures came from two agents — opencode (43) and gemini-cli (32) — and neither is caused by CLI product code. Both are defects in the e2e/ agent runners. This PR fixes both. No cmd/entire/cli/ changes.

opencode (43 failures) — wrong working directory

RunPrompt set cmd.Dir but not PWD. Go's cmd.Dir chdirs the child without updating the inherited PWD env var, and opencode (Node) resolves its project/worktree root from process.env.PWD. So opencode's file writes landed in the go test package dir instead of the /tmp/e2e-repo-* test repo, the per-repo .opencode/plugins/entire.ts never loaded, and no checkpoint was created. The failures surfaced as "checkpoint ref entire/checkpoints/v1 did not advance within 30s" and "git add <path>: exit status 128" (the file never existed in the repo).

Fix: extract openCodePromptEnv(), which strips any stale PWD and forces it to match cmd.Dir. The tmux/interactive path was already correct (tmux new-session -c dir sets PWD), which is why the 5 interactive tests passed.

gemini-cli (32 failures) — aborted turn exits 0

gemini-cli intermittently aborts a turn server-side, printing "Invalid stream: The model returned an empty response or malformed tool call" on stderr while exiting 0 with empty stdout. The turn never completes → the after-agent lifecycle hook never fires → no checkpoint. Because the process exits 0, the harness's transient-retry path (err != nil && IsTransientError) was never reached, so a retryable glitch was scored as a hard "checkpoint did not advance" failure.

Fix: RunPrompt now synthesizes an error when it detects the abort signature in stderr, and IsTransientError recognizes it (geminiAbortSignatures), so the scenario restarts (up to 3 attempts) instead of failing on a missing checkpoint.

Tests

  • e2e/agents/opencode_test.go (new): TestOpenCodePromptEnv_OverridesPWD
  • e2e/agents/gemini_test.go: TestGeminiIsTransientError_RecognizesAbortedTurn, TestGeminiAbortedTurn

These run in the standard mise run test suite — e2e/agents is not behind the e2e build tag — so they gate merges without needing a paid E2E run.

Verification

  • mise run lint — clean
  • go test ./e2e/agents/ — pass
  • gofmt / dup:staged — clean
  • Real E2E validation to be run remotely on this branch.

Notes / caveats

  • gemini's failures were 100% correlated with Invalid stream, which is high for pure flakiness. The retry will salvage genuinely transient cases; if gemini-2.5-flash is systematically choking, retries will exhaust and it'll still fail (correctly) — at which point it's a model/version question, not a harness one.
  • Out of scope (separate clusters from the same triage): cursor-cli (account out of usage quota) and factory/copilot interactive PTY (FACTORY_API_KEY device-auth + Copilot trust-dialog sentinel).

🤖 Generated with Claude Code


Note

Low Risk
Test-harness-only changes in e2e/agents with unit tests; no production CLI or auth paths touched.

Overview
Fixes two E2E agent runner bugs in e2e/agents (no CLI product changes).

opencode: Headless RunPrompt now sets child env via openCodePromptEnv() so PWD matches cmd.Dir. Without that, opencode (Node) used a stale PWD from go test, wrote files outside the temp repo, and checkpoints never advanced.

gemini-cli: When stderr shows a server-side abort (Invalid stream / empty or malformed tool call) but the process exits 0, RunPrompt now returns an error and IsTransientError treats it as retryable so scenarios restart instead of failing on missing checkpoints.

Unit tests cover env override and abort/transient detection; they run in the normal go test ./e2e/agents/ suite.

Reviewed by Cursor Bugbot for commit 1ee9286.

Both agents accounted for ~75 of the persistent E2E failures, neither
caused by CLI product code — the bugs are in the e2e agent runners.

opencode (43 failures): RunPrompt set cmd.Dir but not PWD. Go's cmd.Dir
chdirs the child without updating the inherited PWD env var, and opencode
(Node) resolves its project root from process.env.PWD — so file
operations landed in the go-test package dir instead of the test repo,
the per-repo entire plugin never loaded, and no checkpoint was created
("checkpoint ref did not advance" / "git add: pathspec did not match").
Extract openCodePromptEnv to force PWD to match cmd.Dir.

gemini-cli (32 failures): the model intermittently aborts a turn with
"Invalid stream: empty response or malformed tool call" on stderr while
still exiting 0 with empty stdout. The turn never completes, the
after-agent hook never fires, no checkpoint is created. Because the
process exits 0, the transient-retry path (err != nil && IsTransientError)
was never reached. Surface the abort as an error in RunPrompt and add the
signature to IsTransientError so the scenario restarts instead of failing
on a missing checkpoint.

Adds unit tests (run in the standard suite, not e2e-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d8dbfcccbf60
Copilot AI review requested due to automatic review settings May 27, 2026 13:40
@Soph Soph requested a review from a team as a code owner May 27, 2026 13:40
Contributor

Copilot AI left a comment


Pull request overview

Fixes persistent E2E harness failures for the opencode and gemini-cli agent runners by correcting headless execution environment handling and by treating a known gemini-cli “aborted turn” signature (stderr-only, exit 0) as a transient error so scenarios retry instead of failing on missing checkpoints.

Changes:

  • opencode: ensure headless RunPrompt forces PWD in the child environment to match cmd.Dir (and strips any stale inherited PWD).
  • gemini-cli: detect server-side aborted turns via stderr signatures; synthesize an error when the process exits 0 so the existing transient-retry path triggers.
  • Add unit tests covering PWD override logic and gemini abort/transient detection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| e2e/agents/opencode.go | Adds openCodePromptEnv() and uses it so PWD matches the repo dir for headless runs. |
| e2e/agents/opencode_test.go | New test verifying PWD override and env filtering behavior. |
| e2e/agents/gemini.go | Adds aborted-turn signature detection; marks it transient and synthesizes an error on “abort but exit 0”. |
| e2e/agents/gemini_test.go | Adds tests for transient detection and aborted-turn signature matching. |

pjbgf previously approved these changes May 27, 2026
The branch run confirmed the harness retry fix works (gemini's "Invalid
stream" aborts now trigger scenario restarts), but ~26 tests still
exhaust all 3 retries: gemini-2.5-flash returns an empty/malformed
response on the first model turn even for trivial prompts. That is an
upstream model/CLI instability, not a checkpoint/strategy bug.

Make the gemini model tunable via E2E_GEMINI_MODEL (mirrors
E2E_CODEX_MODEL / E2E_OPENCODE_MODEL), defaulting to the cheap
gemini-2.5-flash locally, and pin the CI gemini-cli job to
gemini-2.5-pro to see whether the more capable model reduces the
empty-response aborts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 752c67fbcfe7
gemini-2.5-pro reduced the "Invalid stream" empty-response aborts
(26 -> 18 failures) but didn't eliminate them, and added 429 rate-limit
noise (retried, not fatal). Try gemini-3.1-flash-lite instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 5e2987223baf
@Soph Soph merged commit 6360303 into main May 27, 2026
20 of 22 checks passed
@Soph Soph deleted the soph/fix-e2e-gemini-opencode branch May 27, 2026 16:56

3 participants