
Consolidate E2E test suite into cli repo #474

Merged
khaong merged 51 commits into main from alex/consolidate-e2e-tests
Feb 25, 2026
Conversation

@khaong
Contributor

@khaong khaong commented Feb 24, 2026

Summary

  • Moves the consolidated E2E test suite from entire-cli-e2e-tests into e2e/ as the single E2E suite
  • Removes the old shadow-hook based suite at cmd/entire/cli/e2e_test/
  • Updates mise tasks to point at new e2e/tests/... path with optional filter args

What's in e2e/

  • 3 agents: Claude Code, Gemini CLI, OpenCode — all run via ForEachAgent
  • E2E_AGENT env var for CI matrix targeting (when set, only matching agent registers)
  • //go:build e2e tag on all test files (invisible to normal go build/go test)
  • 37 tests covering: single/multi session, auto-commit, attribution, rewind, stash workflows, split commits, subagent flows, session lifecycle, edge cases, checkpoint metadata validation
  • Artifact capture (git log, tree, checkpoint metadata, entire logs, console transcript) for failure debugging
  • exploratory/ directory for on-demand test scenarios not run by CI
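
A minimal sketch of how E2E_AGENT filtering in front of a ForEachAgent-style helper could work. The `Agent` type, `selectAgents`, `forEachAgent`, and the name "claude-code" are assumptions for illustration, not the suite's actual code; the real helper presumably runs each agent as a `t.Run` subtest in files tagged `//go:build e2e`.

```go
package main

import (
	"fmt"
	"os"
)

// Agent is a hypothetical stand-in for the suite's agent abstraction.
type Agent struct {
	Name string
}

// selectAgents returns the agents to run. An empty filter keeps all of them;
// otherwise only the agent whose name equals the filter is kept.
func selectAgents(all []Agent, filter string) []Agent {
	if filter == "" {
		return all
	}
	var out []Agent
	for _, a := range all {
		if a.Name == filter {
			out = append(out, a)
		}
	}
	return out
}

// forEachAgent calls fn once per selected agent and reports how many ran,
// so the caller can skip when nothing matches.
func forEachAgent(all []Agent, filter string, fn func(Agent)) int {
	selected := selectAgents(all, filter)
	for _, a := range selected {
		fn(a)
	}
	return len(selected)
}

func main() {
	agents := []Agent{{Name: "claude-code"}, {Name: "gemini-cli"}, {Name: "opencode"}}
	n := forEachAgent(agents, os.Getenv("E2E_AGENT"), func(a Agent) {
		fmt.Println("running against", a.Name)
	})
	if n == 0 {
		fmt.Println("no agents match E2E_AGENT; skipping")
	}
}
```

With E2E_AGENT unset all three agents run; with it set to one agent's name, only that agent runs, which is what a CI matrix needs.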

Key design decisions

  • Always uses os.MkdirTemp instead of t.TempDir() — Go's nested subtest dirs confuse some agents' path resolution
  • OpenCode gets opencode.json config written to allow external_directory permission in non-interactive mode
  • filepath.EvalSymlinks resolves the macOS /var → /private/var symlink so agent CLIs see consistent paths
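
A rough sketch of the temp-dir decision above, assuming a helper named `newRepoDir` and the prefix "e2e-repo-" (both hypothetical). It creates the directory with os.MkdirTemp and then resolves symlinks so every tool sees the same path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// newRepoDir creates a fresh temp directory and returns its symlink-free
// path plus a cleanup function. Name and prefix are illustrative only.
func newRepoDir() (string, func(), error) {
	dir, err := os.MkdirTemp("", "e2e-repo-")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { _ = os.RemoveAll(dir) }

	// On macOS the default temp root sits behind the /var symlink;
	// resolving it up front avoids path mismatches in agent CLIs.
	resolved, err := filepath.EvalSymlinks(dir)
	if err != nil {
		cleanup()
		return "", nil, err
	}
	return resolved, cleanup, nil
}

func main() {
	dir, cleanup, err := newRepoDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "setup failed:", err)
		os.Exit(1)
	}
	defer cleanup()
	fmt.Println("repo dir:", dir)
}
```

In a test, the cleanup function would typically be handed to `t.Cleanup` so the directory is removed even when the test fails.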

Test plan

  • mise run test:e2e:claude TestSingleSessionManualCommit — PASS
  • mise run test:e2e:opencode TestSingleSessionManualCommit — PASS
  • go build ./... — clean (e2e tests invisible without build tag)
  • go test -tags=e2e -list '.*' ./e2e/tests/... — all 37 tests listed
  • Full suite run against Claude
  • Full suite run against Gemini
  • Full suite run against OpenCode
  • CI workflow update

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings February 24, 2026 05:54
@cursor

cursor bot commented Feb 24, 2026

PR Summary

Medium Risk
Test/CI-heavy change that deletes and replaces the E2E harness, adds a new agent to the CI matrix, and changes artifact generation/upload; main product code impact is low but CI stability and E2E signal quality could regress if the new runner/bootstraps are flaky.

Overview
Consolidates end-to-end testing into a new top-level e2e/ suite with agent abstractions (Claude Code, Gemini CLI, OpenCode), tmux-backed interactive sessions, artifact capture, and a small testreport tool to render gotest JSON events into a readable report.

Removes the legacy shadow-hook based E2E suite under cmd/entire/cli/e2e_test/, and replaces/augments coverage by adding new e2e/tests/* cases plus an integration test for resume in a relocated repo.

Updates CI workflows to install needed deps (tmux), build and run the CLI, run an agent bootstrap step before tests, expand the matrix to include Gemini, and always upload e2e/artifacts/. Also updates docs/config (e2e/README.md, CLAUDE.md, .gitignore, golangci exclusions) to reflect the new suite and ignore generated artifacts/binaries.

Written by Cursor Bugbot for commit 94cc998.

Contributor

Copilot AI left a comment

Pull request overview

This PR consolidates E2E test infrastructure by moving the comprehensive test suite from a separate repository into e2e/ within the CLI repo, replacing the older shadow-hook based suite at cmd/entire/cli/e2e_test/.

Changes:

  • Adds new consolidated E2E test suite in e2e/ with 37 tests covering single/multi-session workflows, auto-commit, attribution, rewind, stash, split commits, and edge cases
  • Removes old shadow-hook based E2E suite from cmd/entire/cli/e2e_test/
  • Updates mise tasks to point at e2e/tests/... with optional filter arguments for running specific tests

Reviewed changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| mise.toml | Updates E2E test tasks to target new e2e/tests/... path with filter arguments |
| e2e/testutil/repo.go | Core test utilities for repo setup, agent execution, and git operations |
| e2e/testutil/metadata.go | Checkpoint and session metadata structures for validation |
| e2e/testutil/assertions.go | Test assertion helpers for checkpoint validation and verification |
| e2e/testutil/artifacts.go | Artifact capture system for debugging failed tests |
| e2e/tests/*.go | 37 comprehensive E2E tests across multiple workflow scenarios |
| e2e/agents/*.go | Agent abstraction layer supporting Claude Code, Gemini CLI, and OpenCode |
| e2e/entire/entire.go | CLI wrapper functions for entire commands in tests |
| cmd/entire/cli/e2e_test/*.go | Removed old E2E test suite files (11 files deleted) |
| .gitignore | Adds e2e/artifacts/ to ignored files |

@khaong
Contributor Author

khaong commented Feb 24, 2026

bugbot run

@khaong khaong force-pushed the alex/consolidate-e2e-tests branch from 6963fc2 to d9ac795 Compare February 24, 2026 12:54
@khaong khaong marked this pull request as ready for review February 24, 2026 12:55
@khaong khaong requested a review from a team as a code owner February 24, 2026 12:55
@khaong khaong enabled auto-merge February 24, 2026 12:55
@khaong khaong disabled auto-merge February 24, 2026 12:58
@khaong khaong enabled auto-merge February 24, 2026 13:03
@Soph
Collaborator

Soph commented Feb 24, 2026

I tried this locally, and I couldn't get it to work at all :(

Claude thinks:

⏺ That's the bug. StartSession has the comment:
  // Locally, we skip CLAUDE_CONFIG_DIR so the Keychain-based auth works.

  But RunPrompt always sets CLAUDE_CONFIG_DIR, even locally. On macOS, Claude Code uses Keychain for auth, which is tied to the default ~/.claude path. When you override CLAUDE_CONFIG_DIR to a temp dir, Claude Code can't find auth credentials in the Keychain for that path → "Not logged in".

  RunPrompt needs the same CI-only guard that StartSession has. But this is the branch author's code, not ours. Do you want me to fix it, or flag it to the branch author?

I'm not sure if this is a me thing or why it works for you, so stopping trying to fix it :(

Also:

  1. files_touched per-checkpoint validation is gone from most tests
  2. Shadow branch cleanup checks are gone
  3. Subagent task checkpoint fields not validated (to be fair, those were logs only so fine to skip them)

And last but not least: it's just testing the binary in PATH. I feel mise run test:e2e makes it look as if it tests what's checked out, and that's what we should do, right?

@khaong khaong force-pushed the alex/consolidate-e2e-tests branch from ec0233d to cf9b3f7 Compare February 24, 2026 23:03
@khaong
Contributor Author

khaong commented Feb 25, 2026

> I tried this locally, and I couldn't get it to work at all :(
>
> Claude thinks:
>
> ⏺ That's the bug. StartSession has the comment:
>   // Locally, we skip CLAUDE_CONFIG_DIR so the Keychain-based auth works.
>
>   But RunPrompt always sets CLAUDE_CONFIG_DIR, even locally. On macOS, Claude Code uses Keychain for auth, which is tied to the default ~/.claude path. When you override CLAUDE_CONFIG_DIR to a temp dir, Claude Code can't find auth credentials in the Keychain for that path → "Not logged in".
>
>   RunPrompt needs the same CI-only guard that StartSession has. But this is the branch author's code, not ours. Do you want me to fix it, or flag it to the branch author?
>
> I'm not sure if this is a me thing or why it works for you, so stopping trying to fix it :(
>
> Also:
>
>   1. files_touched per-checkpoint validation is gone from most tests
>   2. Shadow branch cleanup checks are gone
>   3. Subagent task checkpoint fields not validated (to be fair, those were logs only so fine to skip them)
>
> And last but not least: it's just testing the binary in PATH. I feel mise run test:e2e makes it look as if it tests what's checked out, and that's what we should do, right?

@Soph - Yup, it was a real problem 🤦‍♂️

It turns out that I've got my claude set up just a little bit different to the 'standard OAuth' flow.

Thanks to @dvydra @toothbrush we've got to the bottom of it and fixed some other 'user scope' transgressions the tests were doing.

I think longer term we want to move these tests to docker in any case, just to be 100% certain about blast radius.

@khaong
Contributor Author

khaong commented Feb 25, 2026

@Soph and as for the other points:

1. files_touched per-checkpoint validation

TestCheckpointMetadataDeepValidation validates files_touched end-to-end, and the assertion helpers (AssertCheckpointFilesTouched, AssertCheckpointFilesTouchedContains) are available for any test that needs them. Detailed per-field checkpoint validation is also covered extensively in integration tests (17 test files in cmd/entire/cli/integration_test/).

2. Shadow branch cleanup checks

TestShadowBranchCleanedAfterAgentCommit covers the e2e path. Detailed shadow branch lifecycle is covered in integration tests.

Edit: I've added post-shadow checks to all the scenarios now

3. Testing the binary in PATH

Intentional design - the suite is meant to run against any installed binary for version comparison and regression detection. TestMain prints the version and stamps entire-version.txt into the artifact dir. Also added the version to the mise task output so it's explicit upfront what binary is being tested.

Switched it to run against the code in the accompanying branch by default, overridable by E2E_ENTIRE_BIN

khaong and others added 14 commits February 25, 2026 15:46
Moves the entire-cli-e2e-tests suite into e2e/ as the single E2E test
suite. Uses real git hooks, tmux interactive sessions, artifact capture,
and ForEachAgent multi-agent execution.

Key features:
- 3 agents: Claude Code, Gemini CLI, OpenCode
- E2E_AGENT env var for CI matrix targeting
- //go:build e2e tag (tests don't compile in normal builds)
- Artifact capture for failure debugging
- Per-agent concurrency gating and timeout scaling
- Deep checkpoint metadata validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The consolidated E2E suite with real hooks, tmux, and artifact capture
now lives at e2e/. The old suite's unique scenarios have been ported,
and its internal-logic coverage is handled by existing unit/integration
tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updates test:e2e, test:e2e:claude, test:e2e:gemini, and test:e2e:opencode
tasks to point at ./e2e/tests/... and adds optional filter args. Fixes
gemini agent name to gemini-cli to match agent registration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: e5405a1ee475
The consolidated E2E suite needs tmux for interactive session tests
and the entire binary on PATH. Also prints entire version before
running tests for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 2f51fe6d2c6f
CI runners don't have global git user.name/user.email configured,
causing `git commit` to fail. Set per-repo identity after git init.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 197748497091
Create ~/.claude/.claude.json with primaryApiKey so Claude Code's
interactive TUI uses API key auth instead of trying OAuth login,
which isn't possible on headless CI runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 8e2188ca94aa
On CI runners (no macOS Keychain), use isolatedConfigDir with
CLAUDE_CONFIG_DIR so Claude Code picks up ANTHROPIC_API_KEY
from the environment instead of trying the OAuth browser flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: ca3321c4f01b
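
For illustration, a sketch of the CI-only guard this commit and the review thread describe: CLAUDE_CONFIG_DIR is only injected on CI, so local runs keep using the default config location and Keychain auth. The helper `claudeEnv`, the use of a `CI` environment variable for detection, and the config path are assumptions, not the suite's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// claudeEnv returns the environment for a Claude Code process. The isolated
// config dir is added only on CI; locally the default config dir is left
// alone so Keychain-based auth keeps working.
func claudeEnv(base []string, onCI bool, configDir string) []string {
	env := append([]string{}, base...) // copy so the caller's slice is untouched
	if onCI {
		env = append(env, "CLAUDE_CONFIG_DIR="+configDir)
	}
	return env
}

func main() {
	// Assumption: CI is detected via a non-empty CI environment variable.
	onCI := os.Getenv("CI") != ""
	env := claudeEnv(os.Environ(), onCI, "/tmp/e2e-claude-config")
	fmt.Printf("on CI: %v, %d env vars\n", onCI, len(env))
}
```

Routing both the one-shot prompt path and the interactive session path through one helper like this keeps the guard from drifting between the two.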
- Harden TestStashModificationsToTrackedFiles prompt for opencode
- Add WaitForSessionMetadata to handle race where checkpoint branch
  advances before session metadata is fully committed
- Upload e2e artifacts on CI for failure debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 3909f9373db2
Relax golangci-lint rules for e2e/ (test infrastructure, not production
code) and fix missing imports that caused compilation errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: ceb86081b9a8
opencode's first-run DB migration and node_modules resolve can race
with test execution (upstream issue #6935, "Cannot find package jose").
Retry `opencode --version` until initialization completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: e4a2bbc441e5
Non-Claude agents (opencode) store transcripts in pretty-printed JSON,
not JSONL. The per-line parse check produced ~60 log lines of noise
per test while providing no signal. Keep the non-empty check only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: cc64c19b8a48
opencode's TUI occasionally fails to render on CI, producing a
completely empty tmux pane. Retry once if the pane is empty after
the 15s startup timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: e845678e905b
opencode --version doesn't trigger DB schema creation or node_modules
resolution. Use a trivial `opencode run` prompt instead so the SQLite
tables and jose/auth modules are fully initialized before parallel
tests start.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: b5d954fb2751
khaong and others added 4 commits February 25, 2026 15:49
OpenCode sometimes exits 0 despite fatal errors (e.g. "Token refresh
failed: 400") that only appear in stderr. This caused tests to proceed
as if the prompt succeeded, leading to confusing failures later.

- Add detectStderrError to catch "Error:" lines in stderr on exit 0
- Add "Token refresh failed" to transient error patterns for retry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 1b54f6374ffa
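
A sketch of what a `detectStderrError`-style check might look like. Only the function name and the "Token refresh failed" pattern come from the commit message; matching on an "Error:" line prefix and the `isTransient` helper are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// detectStderrError returns the first stderr line that starts with "Error:",
// or "" if there is none. It is meant for processes that exit 0 despite
// printing a fatal error.
func detectStderrError(stderr string) string {
	for _, line := range strings.Split(stderr, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Error:") {
			return line
		}
	}
	return ""
}

// transientPatterns lists substrings that mark an error as worth retrying.
var transientPatterns = []string{"Token refresh failed"}

// isTransient reports whether msg contains any transient error pattern.
func isTransient(msg string) bool {
	for _, p := range transientPatterns {
		if strings.Contains(msg, p) {
			return true
		}
	}
	return false
}

func main() {
	stderr := "warming up\nError: Token refresh failed: 400\n"
	if msg := detectStderrError(stderr); msg != "" {
		fmt.Printf("agent reported %q (retry: %v)\n", msg, isTransient(msg))
	}
}
```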
Show which binary is being tested upfront, matching the original repo's
pattern. Also includes the uncommitted opencode stderr detection fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f03cd80c7159
Add pane snapshots to console.log after every WaitFor call and a final
pane.txt artifact at cleanup, so interactive session failures always
have the agent's actual output available for post-mortem analysis.

- Add StartSession/WaitFor helpers on RepoState
- Capture final pane in CaptureArtifacts when a session is registered
- Migrate interactive tests to use the new helpers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 01e834ea60b0
The session close was registered as a separate t.Cleanup which ran
before CaptureArtifacts (LIFO ordering), leaving pane.txt empty.
Move close into CaptureArtifacts so pane is captured while still alive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: ae0d86477875
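
The ordering problem can be reproduced without the test framework. The sketch below uses a hypothetical `cleanupStack` that mimics the last-in, first-out order of `t.Cleanup`; the event strings are placeholders for the real capture and close steps.

```go
package main

import "fmt"

// cleanupStack mimics t.Cleanup: functions run in reverse registration order.
type cleanupStack struct {
	fns []func()
}

func (c *cleanupStack) Cleanup(fn func()) { c.fns = append(c.fns, fn) }

func (c *cleanupStack) runAll() {
	for i := len(c.fns) - 1; i >= 0; i-- {
		c.fns[i]()
	}
}

// separateCleanups registers artifact capture at setup and session close
// later, so close runs first and the pane is already gone at capture time.
func separateCleanups() []string {
	var events []string
	var c cleanupStack
	c.Cleanup(func() { events = append(events, "capture pane") })
	c.Cleanup(func() { events = append(events, "close session") })
	c.runAll()
	return events
}

// combinedCleanup closes the session inside the capture step, so the pane
// is captured while the session is still alive.
func combinedCleanup() []string {
	var events []string
	var c cleanupStack
	c.Cleanup(func() {
		events = append(events, "capture pane")
		events = append(events, "close session")
	})
	c.runAll()
	return events
}

func main() {
	fmt.Println("separate:", separateCleanups())
	fmt.Println("combined:", combinedCleanup())
}
```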
@khaong khaong force-pushed the alex/consolidate-e2e-tests branch from 5a59dc9 to e4736c2 Compare February 25, 2026 04:51
khaong and others added 3 commits February 25, 2026 15:59
The test was written for auto-commit which auto-creates checkpoints on
commit. Manual-commit requires GitCommitWithShadowHooks to trigger the
prepare-commit-msg and post-commit hooks that create the checkpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: b4a194277a38
Auto-commit strategy was removed from main. The E2E test is no longer
applicable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 6e4ca1c6d5f5
The --strategy flag was removed from the CLI. Simplify Enable() to
drop the unused strategy parameter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f50c83b5de79
@khaong khaong disabled auto-merge February 25, 2026 05:03
@khaong
Contributor Author

khaong commented Feb 25, 2026

bugbot run

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 48 out of 49 changed files in this pull request and generated 2 comments.


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


khaong and others added 6 commits February 25, 2026 16:28
Shadow branches left behind after condensation is a recurring failure
mode. Add AssertNoShadowBranches helper and apply it to all 33 E2E
tests that end with a commit. Replaces the inline check in
TestShadowBranchCleanedAfterAgentCommit with the shared helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 1f6c3ecd3039
E2E tests now build the `entire` binary from source on first use via
sync.Once, removing the requirement to have `entire` in PATH. Set
E2E_ENTIRE_BIN to override with a pre-built binary (used in CI to
avoid double-building).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: addaf7c035a8
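
A minimal sketch of the build-once-with-override pattern, assuming a hypothetical `binResolver` type. The build step is injected as a function; in the real suite it would presumably shell out to `go build`, which is stubbed here.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// binResolver resolves the path of the binary under test exactly once.
type binResolver struct {
	once  sync.Once
	path  string
	err   error
	build func() (string, error) // builds from source, returns the binary path
}

// Path returns the override if one is given; otherwise it builds on first
// use. Later calls reuse the first result, even from parallel tests.
func (r *binResolver) Path(override string) (string, error) {
	r.once.Do(func() {
		if override != "" {
			r.path = override
			return
		}
		r.path, r.err = r.build()
	})
	return r.path, r.err
}

func main() {
	r := &binResolver{build: func() (string, error) {
		// Stub: a real implementation would run `go build` here.
		fmt.Println("building entire from source...")
		return "/tmp/e2e-bin/entire", nil
	}}
	path, err := r.Path(os.Getenv("E2E_ENTIRE_BIN"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "resolve failed:", err)
		os.Exit(1)
	}
	fmt.Println("testing binary:", path)
}
```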
- Shell-quote all args in tmux command construction to prevent injection
- Clean up auto-generated variable names (shellCmdSb25/29 → parts)
- Extract pollInterval const from magic 500ms in WaitFor loop
- Add TmuxSession.OnClose for resource cleanup callbacks
- Fix config dir leak in Claude StartSession on CI via OnClose
- Skip ForEachAgent when no agents match the E2E_AGENT filter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: edbce1f861a0
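
As an illustration of the shell-quoting fix, here is one common way to quote arguments for a POSIX shell before handing a command string to tmux. The helper names `shellQuote` and `shellJoin` are assumptions; the commit does not show the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell. An embedded single
// quote is written as '\'' (close quote, escaped quote, reopen quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// shellJoin quotes every argument and joins them into one command string.
func shellJoin(args []string) string {
	parts := make([]string, len(args))
	for i, a := range args {
		parts[i] = shellQuote(a)
	}
	return strings.Join(parts, " ")
}

func main() {
	// The prompt contains shell metacharacters that must stay inert.
	fmt.Println(shellJoin([]string{"claude", "-p", "fix it; echo 'done'"}))
}
```

Quoting every argument unconditionally is slightly noisier than quoting only when needed, but it removes the need to reason about which characters are safe.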
- Return nil instead of closed session in opencode StartSession failure path
- Set ExitCode=-1 for non-ExitError failures in Claude and Gemini RunPrompt,
  matching OpenCode's existing behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: c5ed97794bde
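
A sketch of the exit-code convention described here, using hypothetical helpers `exitCodeOf` and `runAndCode`. A process that ran and exited non-zero yields its own code, while a failure to run at all (for example, binary not found) yields -1.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCodeOf maps the error from (*exec.Cmd).Run to an exit code:
// 0 for success, the process's code for *exec.ExitError, -1 otherwise.
func exitCodeOf(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1 // the process never produced an exit status
}

// runAndCode runs a command and returns its normalized exit code.
func runAndCode(name string, args ...string) int {
	return exitCodeOf(exec.Command(name, args...).Run())
}

func main() {
	fmt.Println("missing binary:", runAndCode("definitely-not-a-real-binary"))
}
```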
Use git log --grep to find commits mentioning the checkpoint ID anywhere
on the branch, instead of requiring the last N sequential commits to all
match. This prevents false failures when a subagent makes multiple
commits in a single turn, interleaving other checkpoint IDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 753bbf4ab0a7
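
Roughly, the `git log --grep` lookup could be wrapped as below. The helper names, the `--fixed-strings` flag, and the `%H` output format are choices made for this sketch, not necessarily what the suite uses; the checkpoint ID is one that appears in this PR's commit trailers.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// grepLogArgs builds the git arguments that list hashes of all commits on
// the current branch whose message mentions checkpointID.
func grepLogArgs(checkpointID string) []string {
	return []string{"log", "--fixed-strings", "--grep=" + checkpointID, "--format=%H"}
}

// commitsMentioning runs git in repoDir and returns the matching hashes.
// Matches may sit anywhere in history, not only in the last N commits.
func commitsMentioning(repoDir, checkpointID string) ([]string, error) {
	cmd := exec.Command("git", grepLogArgs(checkpointID)...)
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

func main() {
	hashes, err := commitsMentioning(".", "e5405a1ee475")
	if err != nil {
		fmt.Println("git log failed:", err)
		return
	}
	fmt.Printf("%d commit(s) mention the checkpoint\n", len(hashes))
}
```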
…al test

TestTrailerRemovalSkipsCondensation commits with core.hooksPath=/dev/null,
bypassing all git hooks. The shadow branch legitimately persists because
post-commit cleanup never runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 0137bc9830b6
@khaong
Contributor Author

khaong commented Feb 25, 2026

So enabling the shadow branch leftover checks has surfaced cases where we're not cleaning up properly. These are showing as failing tests:
TestModifiedFileAlwaysGetsCheckpoint
TestContentOverlapRevertNewFile

we are choosing to leave these in as bug markers

Collaborator

@Soph Soph left a comment

TestContentOverlapRevertNewFile

The AssertNoShadowBranches assertion is incorrect for this test and should be removed.

This test deliberately tests a scenario where no checkpoint should be created (content mismatch on a new file). When no checkpoint is condensed, the shadow branch correctly remains - it preserves the agent's discarded work for potential later access via entire rewind.

TestModifiedFileAlwaysGetsCheckpoint

This one should work - a checkpoint should be created because modified files always count as overlap. If the shadow branch isn't being cleaned up, there may be a bug in the condensation logic or FilesTouched population.

So I think `TestContentOverlapRevertNewFile` is actually fine keeping the shadow branch, only `TestModifiedFileAlwaysGetsCheckpoint` is indicating an issue.

@Soph
Collaborator

Soph commented Feb 25, 2026

I had to run brew install gotestsum

@khaong
Contributor Author

khaong commented Feb 25, 2026

> I had to run brew install gotestsum

mise install should handle this I think?

@khaong
Contributor Author

khaong commented Feb 25, 2026

> TestContentOverlapRevertNewFile
>
> The AssertNoShadowBranches assertion is incorrect for this test and should be removed.
>
> This test deliberately tests a scenario where no checkpoint should be created (content mismatch on a new file). When no checkpoint is condensed, the shadow branch correctly remains - it preserves the agent's discarded work for potential later access via entire rewind.

I think the mismatch here is how we're using the one-shot RunPrompt (which is 'ending' the session), I'll do a follow up where we hold the session open to better reflect the scenario. 👍

@khaong khaong merged commit 83edba9 into main Feb 25, 2026
3 checks passed
@khaong khaong deleted the alex/consolidate-e2e-tests branch February 25, 2026 22:02
@khaong khaong mentioned this pull request Feb 25, 2026
4 tasks

6 participants