fix(e2e): droid prompt submission and claude-code subagent flakes #1599

Merged
toothbrush merged 7 commits into main from fix-e2e-tests on Jul 2, 2026

@pfleidi pfleidi commented Jul 1, 2026

Contributor

https://entire.io/gh/entireio/cli/trails/725

Why

E2E runs on main have been failing repeatedly (latest: run 28549683381):

  • factoryai-droid: TestFactoryTaskCheckpointExistsBeforeCommit and TestFactoryCommittedCheckpointExcludesPreExistingUntrackedFiles fail because the harness sends Enter a fixed 200ms after pasting the prompt. Droid v0.162.x ingests long pasted prompts over several seconds, so the Enter gets swallowed and the prompt sits unsubmitted in the input box until the test times out. This has recurred across at least three recent main runs.
  • claude-code (Linux and Windows): TestSingleSessionSubagentCommitInTurn and TestSubagentCommitFlow fail because newer Claude Code releases can run Task subagents in the background. The tool call returns in ~40ms and the foreground turn ends (the stop hook sees no changes), so the subagent's file changes and commit land after turn-end with no active session. With no checkpoint trailer and no condensation, the checkpoint-advance assertions time out.

What changed

  • TmuxSession.Send waits until the echoed input has fully rendered before submitting, then verifies the pane reacted to Enter and retries up to 3× if it was swallowed.
  • The stableAtSend snapshot is now taken before Enter, so it can never include response output from a fast agent (a post-Enter snapshot deadlocks WaitFor's change-detection guard — caught by the Vogon canary during development).
  • The Vogon REPL ignores empty lines (exits only on exit/quit), so a retried Enter cannot terminate the session mid-test.
  • The two subagent test prompts instruct claude-code to run the subagent in the foreground.
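The Send flow from the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code in e2e/agents/tmux.go: the `Pane` interface, `fakePane`, `waitStable`, `send`, and the poll counts are all hypothetical names and values chosen for the example. Only the overall shape (wait for the echo to settle, snapshot before Enter, verify, retry up to three times) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Pane is a hypothetical stand-in for a tmux pane; the real harness
// drives tmux and its API is not shown in this PR.
type Pane interface {
	Paste(text string)
	PressEnter()
	Capture() string
}

// waitStable polls until `need` consecutive captures are identical and
// returns the settled content.
func waitStable(p Pane, interval time.Duration, need, maxPolls int) (string, error) {
	prev, stable := p.Capture(), 0
	for i := 0; i < maxPolls; i++ {
		time.Sleep(interval)
		cur := p.Capture()
		if cur != prev {
			prev, stable = cur, 0 // still rendering: restart the count
			continue
		}
		stable++
		if stable >= need {
			return cur, nil
		}
	}
	return "", errors.New("pane never settled")
}

// send pastes text, waits for the echo to settle, takes the pre-Enter
// snapshot, then presses Enter and retries if the pane does not react.
func send(p Pane, text string, interval time.Duration) (string, error) {
	p.Paste(text)
	snapshot, err := waitStable(p, interval, 2, 50)
	if err != nil {
		return "", err
	}
	for attempt := 1; attempt <= 3; attempt++ {
		p.PressEnter()
		// Verification window (5 polls here is an arbitrary choice).
		for i := 0; i < 5; i++ {
			time.Sleep(interval)
			if p.Capture() != snapshot {
				return snapshot, nil // the pane reacted to Enter
			}
		}
	}
	return snapshot, errors.New("enter swallowed after 3 attempts")
}

// fakePane echoes pasted text a few bytes per capture and can be told
// to swallow the first few Enter presses.
type fakePane struct {
	pending string
	screen  string
	swallow int
	enters  int
}

func (f *fakePane) Paste(text string) { f.pending = text }

func (f *fakePane) PressEnter() {
	f.enters++
	if f.swallow > 0 {
		f.swallow--
		return
	}
	f.screen += "\n[submitted]"
}

func (f *fakePane) Capture() string {
	n := 4 // simulate slow ingest: at most 4 more bytes per capture
	if n > len(f.pending) {
		n = len(f.pending)
	}
	f.screen += f.pending[:n]
	f.pending = f.pending[n:]
	return f.screen
}

func main() {
	p := &fakePane{swallow: 1}
	snap, err := send(p, "hello world!", time.Millisecond)
	fmt.Printf("snapshot=%q enters=%d err=%v\n", snap, p.enters, err)
}
```

In this sketch the snapshot is the fully echoed prompt, and the swallowed first Enter costs one retry rather than a test timeout.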

Decisions made during development

  • The Send hardening lives in the shared TmuxSession, not a droid-specific override: the race is universal (the old code already worked around the same issue for Claude's TUI with a fixed sleep); droid merely widened the timing window past it.
  • The prompt pinning keeps the subagent tests on the synchronous path they were written to cover. Making the CLI track background-subagent work (e.g. wiring Claude Code's SubagentStop hook) is a separate product gap and deliberately not part of this PR — with a foreground subagent, hook logs confirm pre-task/post-task now span the actual subagent work (~22s) and the mid-turn commit gets its trailer.

Reviewer notes

  • Droid cannot run locally (no FACTORY_API_KEY); the droid-side fix is exercised by the full Vogon canary (which drives the same TmuxSession.Send path, 59/59 green) but needs the CI E2E workflow for final confirmation.
  • copilot-cli's Send override still carries the old fixed-delay + unverified-Enter pattern; unifying it requires parameterizing around Copilot's Ctrl+S/autocomplete submission semantics — follow-up candidate, no regression here.

Note

Low Risk
Changes are confined to E2E harness, Vogon canary, and test timeouts/prompts; no production CLI or hook logic is modified.

Overview
Hardens shared TmuxSession.Send for slow TUIs (especially factoryai-droid): wait until pasted text is fully echoed (raw pane polling with consecutive stability), snapshot stableAtSend before Enter so fast agents cannot deadlock WaitFor, then submit Enter with pane-change verification and up to three retries when Enter is swallowed.
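Why the snapshot has to precede Enter can be shown with a toy model. WaitFor's internals are not part of this page, so the sketch assumes the guard simply refuses to match until the pane differs from stableAtSend; `waitFor` here is a hypothetical reduction that takes a list of captures instead of polling tmux, and returning false stands in for a timeout.

```go
package main

import (
	"fmt"
	"strings"
)

// waitFor is a hypothetical reduction of the change-detection guard: it
// reports a match only once a capture both differs from stableAtSend
// and contains the prompt pattern.
func waitFor(captures []string, stableAtSend, pattern string) bool {
	for _, c := range captures {
		if c == stableAtSend {
			continue // pane unchanged since Send: too early to match
		}
		if strings.Contains(c, pattern) {
			return true
		}
	}
	return false // stands in for a timeout
}

func main() {
	preEnter := "> say hi"
	postEnter := "> say hi\nhi!\n> " // a fast agent has already answered

	// By the time WaitFor starts polling, the pane no longer changes.
	captures := []string{postEnter, postEnter, postEnter}

	fmt.Println("pre-Enter snapshot matches: ", waitFor(captures, preEnter, "\n> "))
	fmt.Println("post-Enter snapshot matches:", waitFor(captures, postEnter, "\n> "))
}
```

Under that assumption, a snapshot taken after Enter already equals the final pane content, so the guard never sees a change and the wait can only time out.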

Updates the Vogon interactive REPL to ignore empty lines (only exit/quit end the session) so Enter retries from Send do not terminate mid-test.
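A minimal sketch of that REPL behavior, assuming a line-oriented loop; `repl`, `runScript`, and the prompt string are illustrative names rather than the contents of e2e/vogon/main.go. Empty lines only re-print the prompt, and the session ends on exit, quit, or end of input.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// repl reads lines from in and returns how many prompts it handled.
func repl(in io.Reader, out io.Writer) int {
	handled := 0
	sc := bufio.NewScanner(in)
	fmt.Fprint(out, "> ")
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch line {
		case "":
			// A stray Enter (for example a retried submit): ignore it.
		case "exit", "quit":
			return handled
		default:
			handled++
			fmt.Fprintf(out, "handled: %s\n", line)
		}
		fmt.Fprint(out, "> ") // re-print the prompt, also after empty lines
	}
	return handled
}

// runScript feeds a canned script to repl and returns the handled count
// together with the transcript.
func runScript(script string) (int, string) {
	var out strings.Builder
	n := repl(strings.NewReader(script), &out)
	return n, out.String()
}

func main() {
	// Two blank lines between prompts model retried Enter presses.
	n, transcript := runScript("first\n\n\nsecond\nexit\n")
	fmt.Println(transcript)
	fmt.Println("prompts handled:", n)
}
```

With the earlier behavior described in this PR, the first blank line would have ended the session before the second prompt was read.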

Factory droid hook tests extend file and task-rewind waits (90s / 30s) because prompt-pattern WaitFor can return mid-turn while a Worker is still running.
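The effect of a wait that is shorter than the Worker's runtime can be illustrated with a small polling helper. `waitForFile` is a hypothetical helper written for this example, not the harness's own wait function, and the durations are scaled down from the 90s / 30s values mentioned above so the sketch runs quickly.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForFile polls until path exists or the timeout elapses.
func waitForFile(path string, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "wait-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "out.txt")

	// Simulated Worker: writes the file well after the prompt is visible.
	go func() {
		time.Sleep(150 * time.Millisecond)
		_ = os.WriteFile(target, []byte("done"), 0o644)
	}()

	// A wait shorter than the Worker runtime expires even though the
	// Worker eventually succeeds; a wider wait absorbs it.
	fmt.Println("short wait:", waitForFile(target, 20*time.Millisecond, 5*time.Millisecond))
	fmt.Println("wide wait: ", waitForFile(target, 2*time.Second, 5*time.Millisecond))
}
```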

Claude Code subagent tests add explicit instructions to run the subagent in the foreground and wait for completion, avoiding background Task runs that finish after hooks and break checkpoint assertions.

Reviewed by Cursor Bugbot for commit a3db5ce.

pfleidi added 4 commits July 1, 2026 15:50

Droid ingests long pasted prompts over several seconds; the fixed
200ms delay before Enter meant the keypress arrived while the input
handler was still processing and got swallowed, leaving the prompt
unsubmitted (recurring TestFactory* failures in CI).

Send now waits until the echoed input has fully rendered, snapshots
the pre-submit pane for WaitFor's settle guard, then verifies the
pane reacted to Enter and retries up to three times. The snapshot is
taken before Enter so it can never include response output from a
fast agent, which would deadlock WaitFor's change requirement.

Entire-Checkpoint: 0da2e3e238ea

New Claude Code releases can run Task subagents in the background:
the tool call returns immediately, the foreground turn ends, and the
subagent's file changes and commit land after turn-end with no active
session — so no checkpoint trailer or condensation happens and the
checkpoint-advance assertions time out (CI failures on Linux and
Windows).

Instruct the agent to run the subagent in the foreground and wait for
it, keeping these tests on the synchronous path they are meant to
cover. Tracking background-subagent work in the CLI itself (e.g. via
the SubagentStop hook) is a separate product gap.

Entire-Checkpoint: 91b00f9ec504

TmuxSession.Send may retry Enter when the pane does not visibly react
within its 2s window. Vogon treated an empty stdin line as session
termination, so on a slow runner a retried Enter could end the session
mid-test. Exit only on explicit exit/quit and update the comments that
described the pre-rewrite Send semantics.

Entire-Checkpoint: ffcdd5a7efd3

waitForInputIngested already holds the settled pane content, so return
it instead of re-capturing for the stableAtSend snapshot and the first
Enter verification. Saves two tmux subprocess spawns per Send and makes
the snapshot exactly the content the stability wait verified.

Entire-Checkpoint: 6537ed5b37ba

Copilot AI review requested due to automatic review settings July 1, 2026 23:17

Copilot AI left a comment

Contributor

Pull request overview

Hardens the E2E tmux-based harness to make prompt submission reliable for slower-ingesting TUIs (notably factoryai-droid), and reduces claude-code subagent flakes by updating prompts to request foreground subagent execution. Also adjusts the Vogon REPL test double to tolerate Enter retries without terminating the session.

Changes:

  • Reworked TmuxSession.Send to wait for echoed input to settle, snapshot pre-Enter state, and retry Enter submission.
  • Updated Vogon interactive mode to ignore empty lines (so Enter retries don’t end the session).
  • Updated subagent E2E prompts to explicitly request foreground subagent execution.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| e2e/agents/tmux.go | Adds “wait for ingest”, pre-Enter snapshotting, and Enter verification/retry logic for tmux-driven agents. |
| e2e/vogon/main.go | Makes Vogon interactive mode ignore empty lines to support Enter retries from Send. |
| e2e/tests/subagent_commit_flow_test.go | Updates subagent prompt text to request foreground execution to avoid background-subagent flakes. |
| e2e/tests/single_session_test.go | Updates the same-turn subagent+commit prompt text to request foreground execution. |

Comment thread e2e/agents/tmux.go
Comment thread e2e/agents/tmux.go
Comment thread e2e/vogon/main.go

pfleidi commented Jul 1, 2026

Contributor Author

Bugbot run

Comment thread e2e/agents/tmux.go
pfleidi added 2 commits July 1, 2026 17:03

Require two consecutive stable polls in waitForInputIngested so a
single quiet interval mid-paste can't fake stability, and document why
it compares raw captures instead of stableContent. Re-print the vogon
prompt when ignoring an empty line so manual REPL sessions stay
readable.

Addresses PR #1599 review feedback.

Entire-Checkpoint: 944b219ab908

Droid's prompt pattern matches the always-visible input box, so
WaitFor can return mid-turn; the 10s file wait then expires while the
Worker is still executing (60-120s turns on CI). Widen the file and
rewind-point waits to absorb the Worker runtime.

Entire-Checkpoint: 63e9c5af67f1

pfleidi commented Jul 2, 2026

Contributor Author

Bugbot run

@pfleidi pfleidi marked this pull request as ready for review July 2, 2026 00:09
@pfleidi pfleidi requested a review from a team as a code owner July 2, 2026 00:09

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread e2e/tests/factory_hooks_test.go Outdated

The adjacent comment documents Worker turns of 60-120s on CI but the
file wait was capped at 90s, so a slow Worker could outlive the wait
even when it succeeds. Droid's 2x timeout multiplier gives these tests
a 6-minute budget, so the wider wait fits comfortably.

Entire-Checkpoint: 132cc906436e

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread e2e/agents/tmux.go
@toothbrush toothbrush merged commit b58b5fe into main Jul 2, 2026
12 of 20 checks passed
@toothbrush toothbrush deleted the fix-e2e-tests branch July 2, 2026 00:45

3 participants