Agentic loop stability fixes from repeated e2e testing by habdelra · Pull Request #4394 · cardstack/boxel

habdelra · 2026-04-13T20:07:04Z

Summary

Addresses state transition issues, testing infrastructure bugs, and consistency problems discovered through repeated end-to-end cycling of the software factory flow (factory:go → bootstrap → implement → validate). Each fix was validated by running the full e2e loop (10 runs).

Orchestrator owns issue status transitions

The agent can no longer set issue status to done or in_progress — the loop manages these transitions:

backlog → in_progress: when the loop picks up an issue
in_progress → done: when the agent calls signal_done AND validation passes
Agent can only set blocked (can't proceed) or backlog (unblock)

Previously the agent could mark issues done before validation passed, causing the loop to exit with failing tests.

Card update tools use read-patch-write

update_issue, update_project, and create_knowledge now read the existing card, merge provided attributes, and write back — instead of creating a fresh document that clobbers existing attributes. This was a critical bug: the bootstrap-seed issue was reduced to just status and updatedAt after an update.

Fix darkfactory module URL in buildCardDocument

buildCardDocument() was constructing adoptsFrom.module from the target realm URL instead of the software-factory realm. This caused refreshIssue() searches (which filter by type module) to not find updated cards, making the loop think issues were still in backlog after being marked done.

Fix Playwright waitForFunction timeout

The timeout was passed as the second argument (treated as arg) instead of the third (options). Playwright fell back to its 30s default, causing test runs that took 30-50s to silently time out instead of completing.

Remove run_tests from agent tools

The validation pipeline runs tests automatically after every agent turn. Giving the agent the run_tests tool caused it to waste iterations running tests manually (sometimes before writing test files), creating duplicate TestRun instances, and even attempting to run Node.js scripts via run_command.

Other fixes

ensureJsonExtension: Card tool paths normalized with .json (realm API card+source requires full path with extension)
Project auto-completion: Project status set to completed when all issues done
Status badges: Issue and Project cards show colored status badges in fitted + isolated templates (matching TestRun pattern)
TestRun 0/0 empty state: Shows gray "empty" badge instead of green "passed" when no tests ran
Fix uneven red border: Failing test module group border is now uniform
Issue linked-card: Project in Issue isolated view uses @format='embedded' to prevent layout occlusion
Drop "MVP" suffix: Project naming convention simplified
run_command description: Explicitly states "Boxel host commands ONLY"
Updated phase-2-plan.md: Documents all validated design decisions

Test plan

pnpm lint passes (lint:js, lint:types, lint:format)
pnpm test — 352 unit tests pass (17 new tests covering status transitions, auto-mark, revert, project completion, read-patch-write, status stripping)
10 end-to-end runs (smoke-test-1 through smoke-test-10) validating the full factory:go flow

🤖 Generated with Claude Code

Addresses state transition issues, testing infrastructure bugs, and consistency problems discovered through repeated end-to-end cycling of the software factory flow (factory:go → bootstrap → implement → validate). Key changes: - Orchestrator owns issue status transitions (backlog → in_progress → done) rather than letting the agent set status directly, which caused premature exits when validation was still failing - Card update tools (update_issue, update_project, create_knowledge) use read-patch-write instead of full document replacement, preventing attribute clobbering on partial updates - Fix buildCardDocument to use correct darkfactoryModuleUrl from software-factory realm instead of target realm, which broke issue refresh searches by type - Fix Playwright waitForFunction timeout argument position (was being treated as arg instead of options, causing 30s default instead of configured 5min) - Remove run_tests from agent tool set — validation pipeline handles test execution automatically after each agent turn - Add ensureJsonExtension helper for card tool paths (card+source requires full file path with extension) - Auto-mark project as completed when all issues are done - Set issue to in_progress when loop picks it up - Status badges on Issue and Project cards (fitted + isolated templates) - TestRun 0/0 passed shows gray "empty" badge instead of green - Fix uneven red border on failing test module groups - Issue linked-card in isolated view uses embedded format to prevent occlusion - Drop "MVP" from project naming convention - Update prompts, skills, and phase-2-plan.md with validated design decisions - 352 unit tests passing, all covering new behaviors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c31b802e40

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

packages/software-factory/src/issue-loop.ts

packages/software-factory/src/issue-scheduler.ts

Copilot

Pull request overview

Stabilizes the software-factory agentic issue loop based on repeated end-to-end runs by tightening ownership of status transitions, preventing card-clobbering updates, fixing darkfactory module URL usage, and improving validation/test execution reliability.

Changes:

Move issue status transitions (“in_progress”, “done”) to the orchestrator (loop) and restrict agent status updates to “blocked/backlog” + signal_done intent.
Update card-writing tools (update_issue, update_project, create_knowledge) to read-patch-write and normalize JSON paths with .json.
Fix infra/UI reliability issues (Playwright waitForFunction timeout arg position, darkfactory module URL threading, test result/status badge rendering).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
packages/software-factory/tests/issue-loop.test.ts	Adds/adjusts tests for orchestrator-owned transitions and project completion behavior.
packages/software-factory/tests/factory-tool-executor.spec.ts	Threads `darkfactoryModuleUrl` into tool building in tests.
packages/software-factory/tests/factory-tool-builder.test.ts	Updates tool builder tests for read-patch-write behavior and removes run_tests coverage.
packages/software-factory/src/test-run-execution.ts	Fixes Playwright `waitForFunction` timeout handling and increases timeout.
packages/software-factory/src/realm-operations.ts	Adds `ensureJsonExtension()` helper for realm card paths.
packages/software-factory/src/issue-scheduler.ts	Uses `.json` normalization for issue updates; adds project status update support.
packages/software-factory/src/issue-loop.ts	Implements orchestrator-owned `in_progress`/`done` transitions + project completion hook.
packages/software-factory/src/factory-tool-builder.ts	Adds `darkfactoryModuleUrl`, removes `run_tests` tool, enforces status stripping + read-patch-write + `.json` normalization.
packages/software-factory/src/factory-issue-loop-wiring.ts	Threads `darkfactoryModuleUrl` into tool-builder config.
packages/software-factory/src/darkfactory-schemas.ts	Fixes `buildCardDocument` to use the software-factory darkfactory module URL (not target realm).
packages/software-factory/scripts/smoke-tests/factory-tools-smoke.ts	Updates smoke test config for `darkfactoryModuleUrl`.
packages/software-factory/realm/test-results.gts	Improves status display (adds “empty” state) and refines module group styling.
packages/software-factory/realm/darkfactory.gts	Adds status badges and adjusts Issue embedded Project rendering.
packages/software-factory/prompts/ticket-implement.md	Updates agent workflow ordering + clarifies `signal_done` and status ownership.
packages/software-factory/prompts/system.md	Removes mention of run_tests and documents post-`signal_done` validation pipeline.
packages/software-factory/prompts/bootstrap-implement.md	Updates bootstrap instructions (paths/naming + `signal_done` instead of status updates).
packages/software-factory/docs/phase-2-plan.md	Documents validated design decisions and new loop/tool semantics.
packages/software-factory/.agents/skills/software-factory-operations/SKILL.md	Updates skills to match orchestrator-owned testing/status semantics.
packages/software-factory/.agents/skills/software-factory-bootstrap/SKILL.md	Updates bootstrap skill docs (paths/naming + `signal_done`).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

packages/software-factory/src/factory-tool-builder.ts

packages/software-factory/tests/factory-tool-builder.test.ts

packages/software-factory/src/issue-loop.ts

packages/software-factory/src/factory-tool-builder.ts

The test was passing status: 'in_progress' to update_issue, but the tool now only allows 'blocked' and 'backlog' (orchestrator owns other transitions). Updated test to use 'blocked'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Gate project completion on all realm issues being done (not just processed ones) - Read-patch-write: only create fresh doc on confirmed 404, error on other failures - Extract shared readPatchDocument helper for update_project/update_issue/create_knowledge - Copy attributes before mutating to avoid side effects on caller's args - Fix test comment about removed buildRunTestsTool - Add note that we expect one Project per realm Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove -mvp suffix from example path to match updated naming convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

listIssues() returns [] on search failures, and [].every() is true, which would falsely mark the project completed. Require non-empty list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Apr 13, 2026

View reviewed changes

packages/software-factory/src/issue-loop.ts Outdated Show resolved Hide resolved

packages/software-factory/src/issue-scheduler.ts Show resolved Hide resolved

habdelra requested a review from Copilot April 13, 2026 20:15

Copilot started reviewing on behalf of habdelra April 13, 2026 20:16 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

habdelra and others added 2 commits April 13, 2026 16:22

habdelra requested a review from a team April 13, 2026 20:41

This was referenced Apr 13, 2026

Surface HTTP status codes from readFile() in realm-operations.ts #4396

Open

Add comments support to Issue schema (CS-10724) #4395

Open

habdelra and others added 2 commits April 13, 2026 17:23

Fix stale example path in update_project tool description

76a394f

Remove -mvp suffix from example path to match updated naming convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Guard project completion against empty issue list

04d0c7a

listIssues() returns [] on search failures, and [].every() is true, which would falsely mark the project completed. Require non-empty list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agentic loop stability fixes from repeated e2e testing#4394

Agentic loop stability fixes from repeated e2e testing#4394
habdelra wants to merge 5 commits intomainfrom
agentic-loop-stability-fixes

habdelra commented Apr 13, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

habdelra commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Orchestrator owns issue status transitions

Card update tools use read-patch-write

Fix darkfactory module URL in buildCardDocument

Fix Playwright waitForFunction timeout

Remove run_tests from agent tools

Other fixes

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

habdelra commented Apr 13, 2026 •

edited

Loading