software-factory: Claude agent uses native tools; OpenRouter unchanged for now#4633
software-factory: Claude agent uses native tools; OpenRouter unchanged for now#4633jurgenwerk merged 13 commits intomainfrom
Conversation
…ls required for tracker cards Drop the read_file / write_file MCP shims for the Claude Agent SDK backend and let the model use native Read / Write / Edit / Bash / Glob / Grep against the factory workspace (cwd = workspaceDir). Realm-side ops, validators, structured card updates, and signals stay as in-process MCP tools. OpenRouter is unchanged — it has no native fs and still gets the full FactoryTool[]. The bootstrap and operations skills previously instructed the agent to write Project / Issue / KnowledgeArticle JSON via write_file, which on the Claude path silently bypassed the structured update tools that enforce schema and immutability invariants (e.g. update_issue strips description). Both skills and the system-prompt addendum now require the structured tools for tracker-schema cards and forbid native Write for those paths. Adds a closed catalog so the model stops inventing tool names like mcp__factory__realm-read. Improve --debug output: tool_result lines now carry the originating tool name and a truncated payload snippet, so a run is auditable from the log alone instead of from indirect signals (sync output, mid-turn validator behaviour). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalize the filter constant (NATIVE_FS_REPLACED_TOOLS → CLAUDE_FILTERED_FACTORY_TOOLS) so it can describe tools the Claude backend doesn't need for any reason — native or boxel CLI alternative, or unused in practice — and add run_command to it. run_command is unused on the Claude path: the orchestrator pre-loads card-type schemas into the structured update tools' parameter shapes, so the agent never needs to call get-card-type-schema or any other host command during a normal factory run. If a future workflow does need one, Bash + `boxel run-command` reaches the same realm-server endpoint. OpenRouter still gets run_command — the path keeps the full FactoryTool[] and has no Bash fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Claude path Add fetch_transpiled_module to the Claude filter set. The boxel CLI exposes the same realm endpoint via `boxel read-transpiled <path> --realm <url>`, which the agent can run through Bash. Single string arg, no JSON-quoting hazard, used only when an eval/instantiate validation error reports a transpiled line/column — so the marginal cost of shelling out is negligible. Update the operations skill to direct each backend to its appropriate transpiled-fetch path: Claude shells out via Bash, OpenRouter keeps using the MCP tool. Cross-references in the validator descriptions and the dedicated "Debugging Runtime Evaluation Errors" section now name both paths instead of hardcoding the OpenRouter tool name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add search_realm to the Claude filter set. The boxel CLI exposes the same federated-search endpoint via `boxel search --realm <url> --query '<json>' --json`, which the agent can run through Bash. The risk here is shell quoting: the query is a JSON object that has to ride the command line. The operations skill now spells out the quoting contract — single-quote the entire JSON object so the shell doesn't expand or split it; keep keys and values double-quoted inside — and gives a concrete example. If quoting bites in practice, the fallback is a tiny shell wrapper that takes JSON on stdin. OpenRouter still gets the structured search_realm tool — no shell, no quoting concerns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e path The factory builds two tool catalogs and feeds both into the SDK MCP server: a "core" set defined in factory-tool-builder, and a kebab-case set wrapped from the ToolRegistry's script + realm-api manifests (realm-read, realm-write, search-realm, boxel-sync, …). The kebab-case set shadows the core set with overlapping functionality, and the model picks the kebab-case name at random — calling realm-read against the target realm hit a "not in allowed list" error in the last factory run, and search-realm took a different code path than the structured search_realm tool we already filtered. Mark each FactoryTool with `source: 'core' | 'registered'` and filter `'registered'` out of the Claude MCP catalog. OpenRouter still gets every tool — the registry tools may legitimately serve non-target-realm scratch work for that backend. This also closes the loop on the addendum's "do not call any tool whose plain name is not in this list" rule, which was meaningless while the kebab-case shadow tools were still in the catalog. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…laude path In the previous run the agent invented an absolute path (`/Users/jurgen/code/boxel/test-realm-claude-7/sticky-note.gts`) from the realm slug, called native `Write` against it, and produced real bytes outside the factory workspace. The SDK's `cwd` only governs *relative* path resolution; absolute paths bypass it entirely. Under `bypassPermissions` the model can write anywhere on the user's filesystem. Switch the Claude path's permissionMode from `bypassPermissions` to `dontAsk` (same auto-approve semantics for everything in `allowedTools`, but `canUseTool` still fires) and add a hook that denies `Read` / `Write` / `Edit` / `MultiEdit` / `NotebookEdit` / `NotebookRead` whenever `file_path` resolves outside the workspace. The deny message names the violation so the model can self-correct to a realm-relative path. Bash is intentionally not gated. Read-only inspection (`ls`, `find`, `cat`, `boxel status`) needs the full filesystem, and parsing arbitrary shell for write side effects is its own footgun. The prompt addendum already constrains Bash to "safe shell helpers"; absolute-path writes go through the typed fs tools where the hook catches them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oped canUseTool
On macOS the typical factory workspace is rooted at `/var/folders/...`,
which is a symlink to `/private/var/folders/...`. The SDK's native fs
tools resolve paths to their canonical form, so an absolute file_path
flowing into `canUseTool` may use either form. Without canonicalization
on both sides, `node:path.relative('/var/folders/W',
'/private/var/folders/W/foo.gts')` returns `../../private/var/...` and
the hook denies a path that's actually inside the workspace.
Resolve both the workspace and each candidate input through realpath
before computing the relative path. For paths whose leaf doesn't exist
yet — typical for `Write` creating a new file — walk up to the deepest
existing ancestor, canonicalize that, then re-append the missing
segments. Falls back to the input on unexpected errors so the hook
never throws (which would crash the SDK turn).
Add a portable test that creates a real temp dir + a symlink pointing
at it and exercises the hook through both forms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip out the parts of the factory tool registry / executor that are no longer used and aren't reachable from any working code path. CS-10883 calls for retiring these tools — this commit lands what's safe to remove right now, ahead of the larger compound-tool retirement. Removed from `factory-tool-registry.ts`: - `BOXEL_CLI_TOOLS` (boxel-sync / boxel-push / boxel-pull / boxel-status / boxel-create / boxel-history) — defined but never registered in the production wiring; the orchestrator owns sync, the agent never called these, and the skill explicitly told it not to. - `SCRIPT_TOOLS` (search-realm, run-realm-tests) — kebab-case duplicates of the canonical `search_realm` and `run_tests` factory tools. The Claude path filtered them; OpenRouter never chose them over the core versions. - `realm-read`, `realm-write`, `realm-delete`, `realm-search` — the HTTP-direct realm tools listed for retirement in CS-10883. The factory loop only operates on the target realm via the workspace, so these were always either dead or rejected by the executor's target-realm bypass guard. Only `realm-create` remains in the registry — the entrypoint uses it to create the target realm before the agent runs, which is the "factory-specific orchestration" the ticket explicitly preserves. Trimmed from `factory-tool-executor.ts` (~480 lines): - The script-spawning sub-executor (`executeScript` + `spawnProcess`), along with `SCRIPT_FILE_MAP` and the related `node:child_process` imports. - The boxel-cli sub-executor (`executeBoxelCli`), `BOXEL_CLI_COMMAND_MAP`, `composeBoxelCliInvocation`, and the per-subcommand arg builders. - The `realm-read` / `realm-write` / `realm-delete` / `realm-search` branches of `executeRealmApiInner`, plus their `TARGET_REALM_BYPASS_TOOLS` guard and the boxel-push --delete destructive-op validator. - `validateRealmTarget` (now unreachable — no remaining tool takes a `realm` or `realm-url` arg). Also deleted: the `scripts/boxel-search.ts` and `scripts/run-realm-tests.ts` files those tools used to spawn, plus the `RunRealmTestsOutput` / `RunRealmTestsFailure` types they emitted (defined and re-exported but never actually consumed). Tests updated: `factory-tool-registry.test.ts` and `factory-tool-executor.test.ts` were rewritten around `realm-create`; the integration test keeps its safety-rejection coverage but drops the realm-api request shapes (those moved into the Playwright spec, which now also drops its `realm-search` blocks). `factory-tool-builder.test.ts`'s registered-tool-delegation test now goes through `realm-create`. Net: -2 064 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, test:realm) The previous commit deleted scripts/boxel-search.ts and scripts/run-realm-tests.ts as dead code, but missed that they're still wired up as developer-facing pnpm scripts in package.json (`pnpm boxel:search`, `pnpm test:realm`). The scripts no longer back any agent-facing tool — that wrapper layer is gone — but they remain useful CLI utilities for humans, so restore them and update the registry comment to reflect the new reality. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- claude-code.ts: silence `no-constant-condition` on canonicalize- ancestor's `while (true)` (the loop terminates on realpath success or filesystem-root reached, eslint can't see that) - factory-tool-executor.test.ts: rewrite the missing-required-args assertion to use a single `includes()` instead of an OR chain (qunit/no-assert-logical-expression) - prettier --write fallout in two skill files Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I did a factory run with both claude and openrouter agents and it worked OK! |
There was a problem hiding this comment.
Pull request overview
This PR narrows the software-factory tool surface for the Claude backend by removing most registry-backed tools, relying on Claude Code’s native workspace tools plus the existing boxel CLI, while keeping the remaining factory MCP tools for structured realm operations and control flow.
Changes:
- Retires script/boxel-cli/most realm-api registry tools so
realm-createis the only surviving registered tool. - Reworks the Claude agent to enable native
Read/Write/Edit/Bash/Glob/Grep, adds prompt/skill updates, and introduces workspace-path scoping hooks. - Updates tests, smoke fixtures, and shared prompt/skill text to reflect the reduced registry and new workspace-centric model.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
packages/software-factory/tests/factory-tool-registry.test.ts |
Updates registry expectations/tests for the single remaining registered tool. |
packages/software-factory/tests/factory-tool-executor.test.ts |
Rewrites unit tests around realm-create, safety checks, and logging. |
packages/software-factory/tests/factory-tool-executor.spec.ts |
Removes old live executor coverage for retired registry tools. |
packages/software-factory/tests/factory-tool-executor.integration.test.ts |
Deletes old HTTP-shape integration tests and keeps preflight safety checks. |
packages/software-factory/tests/factory-tool-builder.test.ts |
Adjusts builder tests for the reduced registry surface. |
packages/software-factory/tests/factory-prompt-loader.test.ts |
Updates prompt-loader assertions for workspace-mirror wording. |
packages/software-factory/tests/factory-agent-claude-code.test.ts |
Adds Claude native-tool and workspace-scoping tests. |
packages/software-factory/src/test-run-types.ts |
Removes obsolete run-realm-tests output types. |
packages/software-factory/src/factory-tool-registry.ts |
Shrinks the built-in registry to realm-create only. |
packages/software-factory/src/factory-tool-executor.ts |
Simplifies executor dispatch/safety to the remaining realm-create path. |
packages/software-factory/src/factory-tool-builder.ts |
Wraps only realm-api registered tools and tags them by source. |
packages/software-factory/src/factory-test-realm.ts |
Drops re-exports for deleted run-realm-tests types. |
packages/software-factory/src/factory-issue-loop-wiring.ts |
Wires the smaller registry and forwards workspaceDir to Claude. |
packages/software-factory/src/factory-agent/types.ts |
Narrows tool-manifest categories and adds Claude workspaceDir config. |
packages/software-factory/src/factory-agent/claude-code.ts |
Enables Claude native tools, filters MCP tools, adds prompt addendum, path scoping, and richer debug logging. |
packages/software-factory/scripts/smoke-tests/factory-prompt-smoke.ts |
Updates smoke-test sample manifests for the new category shape. |
packages/software-factory/prompts/system.md |
Rephrases system prompt around workspace-mirror operations. |
packages/software-factory/.agents/skills/software-factory-operations/SKILL.md |
Rewrites operations guidance for native workspace tools and structured card tools. |
packages/software-factory/.agents/skills/software-factory-bootstrap/SKILL.md |
Clarifies bootstrap flow should use structured factory tools instead of raw writes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Four issues raised on the PR review: 1. Bash was auto-approved alongside the typed fs tools but explicitly skipped by `canUseTool`, so the prompt addendum oversold the workspace-scoping guarantee. The hook can't reliably parse arbitrary shell for write side-effects (`>`, `tee`, `cp`, `mv`, …) and read-only inspection genuinely needs the broader fs, so the right move is to be honest in the contract: tighten the prompt addendum and the in-code comment to explicitly say `Bash` is NOT structurally gated and must be treated as read-only inspection. Writes go through `Write` / `Edit` where the workspace guard fires. 2. Without `workspaceDir` we still exposed `Read` / `Write` / `Bash` but omitted both `cwd` and `canUseTool` — relative paths would resolve against the process's working directory and absolute paths would be unrestricted, reintroducing the host-fs access this PR was trying to prevent. Fall back to MCP-only when no workspace is configured: `tools: []`, no native fs in `allowedTools`, no hook, no cwd. New unit test verifies the fallback. 3. The `realm-create` test fixtures used `endpoint: 'user/my-realm'`, but the realm server validates `endpoint` against `^[a-z0-9-]+$` — slashes are rejected. The mock client in the unit tests doesn't enforce this, so the tests stayed green, but they were exercising a value the live server would reject. Replace with a single-slug `'my-realm'` (and `'new'` in factory-tool-builder.test.ts) so the fixtures match the real contract. 4. The integration-test comment claimed live request-shape coverage for `realm-create` had moved to `factory-tool-executor.spec.ts`, but the spec only covers the factory-builder card-write tools — `realm-create` lost its end-to-end HTTP-shape coverage rather than relocating it. Replace the misleading note with an honest one acknowledging the coverage gap (entrypoint integration covers it end-to-end via `factory:go`; a focused unit-level shape test is a follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
There is a prompt for an end to end test in the packages/software-factory/tests/end-to-end-test-prompt.md. can you prompt you agent to run this test and confirm if it's able to. UPDATE |
# Conflicts: # packages/software-factory/tests/factory-tool-executor.spec.ts
|
I tested it - end-to-end prompt worked well too! |
…edTools The `canUseTool` hook from PR #4633 was dead code: empirical testing (scripts/canusetool-repro.ts) shows the SDK auto-approves any tool in `allowedTools` and never invokes the hook. Because Write/Edit/etc were in allowedTools under `permissionMode: 'dontAsk'`, the path-scoping safety net never fired and the model could write to absolute paths outside the factory workspace. Fix: - permissionMode: 'dontAsk' → 'default' (only mode that actually invokes canUseTool for tools outside allowedTools) - Remove NATIVE_FS_TOOLS from allowedTools (they stay in `tools` so they remain available, just no longer auto-approved) - MCP factory tools stay in allowedTools — auto-approve is fine for our own in-process implementations Verification: - scripts/canusetool-fix-verify.ts: hook fires for every Write call; in-workspace Write succeeds, absolute out-of-workspace Write is rejected - New unit test locks in both structural conditions (native fs out of allowedTools + permissionMode='default') so any future regression on either knob fails at unit-test time Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Reduces the custom tool surface the Claude Agent SDK backend sees in the factory loop, leaning on its native tools (Read / Write / Edit / Bash / Glob / Grep) and the existing
boxelCLI instead of bespoke MCP shims. The OpenRouter backend is unchanged — it has no native fs / Bash, so it keeps the wrapper tools it needs.This is a partial step toward CS-10883. Compound-tool retirement (
update_*/create_*/add_comment) and the OpenRouter-sideread_file/write_fileretirement are deferred to follow-ups that depend on the CS-10613 skill rewrite.What changed for Claude
Native tools cover the workspace surface; factory MCP tools handle realm-runtime ops, structured updates, and control signals.
Read/Write/Edit/Glob/Grep/Bash, scoped tocwd = factory workspace. Native error messages (e.g. "file does not exist; cwd is …") help the model self-correct.read_file/write_file→ nativeRead/Write/Editfetch_transpiled_module→ Bash +boxel read-transpiled <path> --realm <url>search_realm→ Bash +boxel search --realm <url> --query '<json>' --jsonrun_command→ unused in practice (card-type schemas are pre-loaded)script/realm-apiregistered tools (kebab-case shadows the model used to invent siblings of)run_lint,run_tests,run_evaluate,run_parse,run_instantiateupdate_project,update_issue,create_knowledge,create_catalog_spec,add_commentsignal_done,request_clarificationPath scoping
Native fs was the biggest risk — under
bypassPermissionsthe model could write absolute paths anywhere on disk (and did, in one early run). Switched topermissionMode: 'dontAsk'and added acanUseToolhook that deniesRead/Write/Edit/MultiEdit/NotebookEdit/NotebookReadoutside the workspace. The hook canonicalizes paths throughrealpathso macOS/var↔/private/varsymlinks don't trip it up.Skills + prompts updated
software-factory-bootstrapandsoftware-factory-operationsskills now distinguish workspace files (use native tools orwrite_file) from tracker-schema cards (always use the structured factory MCP tools — neverWrite).tool_resultlines with the originating tool and a payload snippet, so a--debugrun is auditable from the log alone.What changed for OpenRouter
Almost nothing. It still consumes the full
FactoryTool[]because it has no native fs / Bash. A few skill / prompt edits got rephrased to be less Claude-specific (operation language with a per-backend mapping note), but every tool OpenRouter calls today still works.What's gone (factory tool catalog deletions)
These were dead surface — agent-facing wrappers nothing exercised in practice — and got removed outright as part of CS-10883:
boxel-sync,boxel-push,boxel-pull,boxel-status,boxel-create,boxel-historyboxel-cliwrappers; orchestrator owns sync, agent never legitimately needed them, and they weren't even registered in the production wiringsearch-realm(kebab-case),run-realm-testssearch_realmandrun_testsrealm-read,realm-write,realm-delete,realm-searchPlus the script files those tools used to spawn (
scripts/boxel-search.ts,scripts/run-realm-tests.ts) — wait, those got restored in1a59385a3fbecause they're still wired up as developer-facing pnpm scripts (pnpm boxel:search,pnpm test:realm); only the agent-facing wrapper layer is gone.After this PR,
factory-tool-registry.tscontains a single manifest (realm-create), andfactory-tool-executor.tsis ~480 lines lighter (no script spawning, no boxel-cli wrapping, no HTTP-direct realm dispatch).What's still on the retirement list (future PRs)
CS-10883's full scope plus follow-up work, deferred from this PR:
update_project,update_issue,create_knowledge,create_catalog_spec,add_comment. These need to go away once the CS-10613 skill rewrite teaches the JSON:API +adoptsFromschemas directly. Risky to drop them earlier because they encode invariants (e.g.update_issuestripsdescription,create_catalog_specsets the rightadoptsFrom).read_file/write_file/fetch_transpiled_module/search_realm/run_command— replaced with a per-backend split: a thin OpenRouter-onlyFactoryTool[](read / write / edit / federatedsearch_realms+ validators + signals), Claude-only path stays minimal.search_realms— federated replacement for the per-realmsearch_realm. Backed by CS-10643 (already shipped:boxel searchfederated). Needs to be wired up as a tool.realm-createis currently still in the OpenRouter catalog. The orchestrator already drives it before the agent runs, so the agent never needs to call it.FactoryTool[]we keep around for OpenRouter can be retired entirely. Both backends will then converge on the same minimal MCP catalog (validators + signals) plus their respective native tools, and the prompt/skills can drop the per-backend mapping notes.After all that, Claude's MCP catalog would be 7 tools (5 validators + 2 signals); OpenRouter's interim catalog would be ~9 (read / write / edit /
search_realms+ 5 validators + 2 signals); post-opencode swap, opencode lands at the same 7 as Claude.🤖 Generated with Claude Code