Skip to content

software-factory: Claude agent uses native tools; OpenRouter unchanged for now#4633

Merged
jurgenwerk merged 13 commits intomainfrom
move-claude-agent-to-native-tools
May 5, 2026
Merged

software-factory: Claude agent uses native tools; OpenRouter unchanged for now#4633
jurgenwerk merged 13 commits intomainfrom
move-claude-agent-to-native-tools

Conversation

@jurgenwerk
Copy link
Copy Markdown
Contributor

@jurgenwerk jurgenwerk commented May 4, 2026

Summary

Reduces the custom tool surface the Claude Agent SDK backend sees in the factory loop, leaning on its native tools (Read / Write / Edit / Bash / Glob / Grep) and the existing boxel CLI instead of bespoke MCP shims. The OpenRouter backend is unchanged — it has no native fs / Bash, so it keeps the wrapper tools it needs.

This is a partial step toward CS-10883. Compound-tool retirement (update_* / create_* / add_comment) and the OpenRouter-side read_file / write_file retirement are deferred to follow-ups that depend on the CS-10613 skill rewrite.

What changed for Claude

Native tools cover the workspace surface; factory MCP tools handle realm-runtime ops, structured updates, and control signals.

  • Workspace fsRead / Write / Edit / Glob / Grep / Bash, scoped to cwd = factory workspace. Native error messages (e.g. "file does not exist; cwd is …") help the model self-correct.
  • Filtered out of the Claude MCP catalog (model never sees these on the Claude path):
    • read_file / write_file → native Read / Write / Edit
    • fetch_transpiled_module → Bash + boxel read-transpiled <path> --realm <url>
    • search_realm → Bash + boxel search --realm <url> --query '<json>' --json
    • run_command → unused in practice (card-type schemas are pre-loaded)
    • All script / realm-api registered tools (kebab-case shadows the model used to invent siblings of)
  • Still exposed via factory MCP (no native equivalent):
    • Validators: run_lint, run_tests, run_evaluate, run_parse, run_instantiate
    • Structured tracker-schema updates: update_project, update_issue, create_knowledge, create_catalog_spec, add_comment
    • Control signals: signal_done, request_clarification

Path scoping

Native fs was the biggest risk — under bypassPermissions the model could write absolute paths anywhere on disk (and did, in one early run). Switched to permissionMode: 'dontAsk' and added a canUseTool hook that denies Read / Write / Edit / MultiEdit / NotebookEdit / NotebookRead outside the workspace. The hook canonicalizes paths through realpath so macOS /var/private/var symlinks don't trip it up.

Skills + prompts updated

  • software-factory-bootstrap and software-factory-operations skills now distinguish workspace files (use native tools or write_file) from tracker-schema cards (always use the structured factory MCP tools — never Write).
  • The system prompt addendum lists the closed factory tool catalog so the model doesn't invent siblings.
  • Debug logging now labels tool_result lines with the originating tool and a payload snippet, so a --debug run is auditable from the log alone.

What changed for OpenRouter

Almost nothing. It still consumes the full FactoryTool[] because it has no native fs / Bash. A few skill / prompt edits got rephrased to be less Claude-specific (operation language with a per-backend mapping note), but every tool OpenRouter calls today still works.

What's gone (factory tool catalog deletions)

These were dead surface — agent-facing wrappers nothing exercised in practice — and got removed outright as part of CS-10883:

Removed Why
boxel-sync, boxel-push, boxel-pull, boxel-status, boxel-create, boxel-history boxel-cli wrappers; orchestrator owns sync, agent never legitimately needed them, and they weren't even registered in the production wiring
search-realm (kebab-case), run-realm-tests Script-spawn duplicates of the canonical search_realm and run_tests
realm-read, realm-write, realm-delete, realm-search HTTP-direct realm tools the ticket explicitly retires; the factory loop only operates on the target realm via the workspace

Plus the script files those tools used to spawn (scripts/boxel-search.ts, scripts/run-realm-tests.ts) — wait, those got restored in 1a59385a3f because they're still wired up as developer-facing pnpm scripts (pnpm boxel:search, pnpm test:realm); only the agent-facing wrapper layer is gone.

After this PR, factory-tool-registry.ts contains a single manifest (realm-create), and factory-tool-executor.ts is ~480 lines lighter (no script spawning, no boxel-cli wrapping, no HTTP-direct realm dispatch).

What's still on the retirement list (future PRs)

CS-10883's full scope plus follow-up work, deferred from this PR:

  • Compound toolsupdate_project, update_issue, create_knowledge, create_catalog_spec, add_comment. These need to go away once the CS-10613 skill rewrite teaches the JSON:API + adoptsFrom schemas directly. Risky to drop them earlier because they encode invariants (e.g. update_issue strips description, create_catalog_spec sets the right adoptsFrom).
  • OpenRouter-side read_file / write_file / fetch_transpiled_module / search_realm / run_command — replaced with a per-backend split: a thin OpenRouter-only FactoryTool[] (read / write / edit / federated search_realms + validators + signals), Claude-only path stays minimal.
  • search_realms — federated replacement for the per-realm search_realm. Backed by CS-10643 (already shipped: boxel search federated). Needs to be wired up as a tool.
  • Drop registered tools for OpenRouter toorealm-create is currently still in the OpenRouter catalog. The orchestrator already drives it before the agent runs, so the agent never needs to call it.
  • Replace OpenRouter with opencode — the OpenRouter backend is going away in a separate follow-up. opencode is a native-tool agent SDK (Read / Write / Edit / Bash, like the Claude path), so the per-backend split above is interim: once opencode lands, the workspace-fs FactoryTool[] we keep around for OpenRouter can be retired entirely. Both backends will then converge on the same minimal MCP catalog (validators + signals) plus their respective native tools, and the prompt/skills can drop the per-backend mapping notes.

After all that, Claude's MCP catalog would be 7 tools (5 validators + 2 signals); OpenRouter's interim catalog would be ~9 (read / write / edit / search_realms + 5 validators + 2 signals); post-opencode swap, opencode lands at the same 7 as Claude.

🤖 Generated with Claude Code

jurgenwerk and others added 8 commits May 4, 2026 12:59
…ls required for tracker cards

Drop the read_file / write_file MCP shims for the Claude Agent SDK
backend and let the model use native Read / Write / Edit / Bash /
Glob / Grep against the factory workspace (cwd = workspaceDir).
Realm-side ops, validators, structured card updates, and signals
stay as in-process MCP tools. OpenRouter is unchanged — it has no
native fs and still gets the full FactoryTool[].

The bootstrap and operations skills previously instructed the agent
to write Project / Issue / KnowledgeArticle JSON via write_file,
which on the Claude path silently bypassed the structured update
tools that enforce schema and immutability invariants (e.g.
update_issue strips description). Both skills and the system-prompt
addendum now require the structured tools for tracker-schema cards
and forbid native Write for those paths. Adds a closed catalog so
the model stops inventing tool names like mcp__factory__realm-read.

Improve --debug output: tool_result lines now carry the originating
tool name and a truncated payload snippet, so a run is auditable
from the log alone instead of from indirect signals (sync output,
mid-turn validator behaviour).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalize the filter constant (NATIVE_FS_REPLACED_TOOLS →
CLAUDE_FILTERED_FACTORY_TOOLS) so it can describe tools the Claude
backend doesn't need for any reason — native or boxel CLI alternative,
or unused in practice — and add run_command to it.

run_command is unused on the Claude path: the orchestrator pre-loads
card-type schemas into the structured update tools' parameter shapes,
so the agent never needs to call get-card-type-schema or any other
host command during a normal factory run. If a future workflow does
need one, Bash + `boxel run-command` reaches the same realm-server
endpoint.

OpenRouter still gets run_command — the path keeps the full
FactoryTool[] and has no Bash fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Claude path

Add fetch_transpiled_module to the Claude filter set. The boxel CLI
exposes the same realm endpoint via `boxel read-transpiled <path>
--realm <url>`, which the agent can run through Bash. Single string
arg, no JSON-quoting hazard, used only when an eval/instantiate
validation error reports a transpiled line/column — so the marginal
cost of shelling out is negligible.

Update the operations skill to direct each backend to its
appropriate transpiled-fetch path: Claude shells out via Bash,
OpenRouter keeps using the MCP tool. Cross-references in the
validator descriptions and the dedicated "Debugging Runtime
Evaluation Errors" section now name both paths instead of hardcoding
the OpenRouter tool name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add search_realm to the Claude filter set. The boxel CLI exposes the
same federated-search endpoint via `boxel search --realm <url> --query
'<json>' --json`, which the agent can run through Bash.

The risk here is shell quoting: the query is a JSON object that has
to ride the command line. The operations skill now spells out the
quoting contract — single-quote the entire JSON object so the shell
doesn't expand or split it; keep keys and values double-quoted inside —
and gives a concrete example. If quoting bites in practice, the
fallback is a tiny shell wrapper that takes JSON on stdin.

OpenRouter still gets the structured search_realm tool — no shell,
no quoting concerns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e path

The factory builds two tool catalogs and feeds both into the SDK MCP
server: a "core" set defined in factory-tool-builder, and a
kebab-case set wrapped from the ToolRegistry's script + realm-api
manifests (realm-read, realm-write, search-realm, boxel-sync, …).
The kebab-case set shadows the core set with overlapping
functionality, and the model picks the kebab-case name at random —
calling realm-read against the target realm hit a "not in allowed
list" error in the last factory run, and search-realm took a
different code path than the structured search_realm tool we
already filtered.

Mark each FactoryTool with `source: 'core' | 'registered'` and
filter `'registered'` out of the Claude MCP catalog. OpenRouter
still gets every tool — the registry tools may legitimately serve
non-target-realm scratch work for that backend.

This also closes the loop on the addendum's "do not call any tool
whose plain name is not in this list" rule, which was meaningless
while the kebab-case shadow tools were still in the catalog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…laude path

In the previous run the agent invented an absolute path
(`/Users/jurgen/code/boxel/test-realm-claude-7/sticky-note.gts`)
from the realm slug, called native `Write` against it, and produced
real bytes outside the factory workspace. The SDK's `cwd` only
governs *relative* path resolution; absolute paths bypass it
entirely. Under `bypassPermissions` the model can write anywhere on
the user's filesystem.

Switch the Claude path's permissionMode from `bypassPermissions` to
`dontAsk` (same auto-approve semantics for everything in
`allowedTools`, but `canUseTool` still fires) and add a hook that
denies `Read` / `Write` / `Edit` / `MultiEdit` / `NotebookEdit` /
`NotebookRead` whenever `file_path` resolves outside the workspace.
The deny message names the violation so the model can self-correct
to a realm-relative path.

Bash is intentionally not gated. Read-only inspection (`ls`, `find`,
`cat`, `boxel status`) needs the full filesystem, and parsing
arbitrary shell for write side effects is its own footgun. The
prompt addendum already constrains Bash to "safe shell helpers";
absolute-path writes go through the typed fs tools where the hook
catches them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oped canUseTool

On macOS the typical factory workspace is rooted at `/var/folders/...`,
which is a symlink to `/private/var/folders/...`. The SDK's native fs
tools resolve paths to their canonical form, so an absolute file_path
flowing into `canUseTool` may use either form. Without canonicalization
on both sides, `node:path.relative('/var/folders/W',
'/private/var/folders/W/foo.gts')` returns `../../private/var/...` and
the hook denies a path that's actually inside the workspace.

Resolve both the workspace and each candidate input through realpath
before computing the relative path. For paths whose leaf doesn't exist
yet — typical for `Write` creating a new file — walk up to the deepest
existing ancestor, canonicalize that, then re-append the missing
segments. Falls back to the input on unexpected errors so the hook
never throws (which would crash the SDK turn).

Add a portable test that creates a real temp dir + a symlink pointing
at it and exercises the hook through both forms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip out the parts of the factory tool registry / executor that are
no longer used and aren't reachable from any working code path. CS-10883
calls for retiring these tools — this commit lands what's safe to
remove right now, ahead of the larger compound-tool retirement.

Removed from `factory-tool-registry.ts`:

- `BOXEL_CLI_TOOLS` (boxel-sync / boxel-push / boxel-pull / boxel-status
  / boxel-create / boxel-history) — defined but never registered in the
  production wiring; the orchestrator owns sync, the agent never called
  these, and the skill explicitly told it not to.
- `SCRIPT_TOOLS` (search-realm, run-realm-tests) — kebab-case duplicates
  of the canonical `search_realm` and `run_tests` factory tools. The
  Claude path filtered them; OpenRouter never chose them over the core
  versions.
- `realm-read`, `realm-write`, `realm-delete`, `realm-search` — the
  HTTP-direct realm tools listed for retirement in CS-10883. The
  factory loop only operates on the target realm via the workspace, so
  these were always either dead or rejected by the executor's
  target-realm bypass guard.

Only `realm-create` remains in the registry — the entrypoint uses it
to create the target realm before the agent runs, which is the
"factory-specific orchestration" the ticket explicitly preserves.

Trimmed from `factory-tool-executor.ts` (~480 lines):

- The script-spawning sub-executor (`executeScript` + `spawnProcess`),
  along with `SCRIPT_FILE_MAP` and the related `node:child_process`
  imports.
- The boxel-cli sub-executor (`executeBoxelCli`), `BOXEL_CLI_COMMAND_MAP`,
  `composeBoxelCliInvocation`, and the per-subcommand arg builders.
- The `realm-read` / `realm-write` / `realm-delete` / `realm-search`
  branches of `executeRealmApiInner`, plus their `TARGET_REALM_BYPASS_TOOLS`
  guard and the boxel-push --delete destructive-op validator.
- `validateRealmTarget` (now unreachable — no remaining tool takes a
  `realm` or `realm-url` arg).

Also deleted: the `scripts/boxel-search.ts` and `scripts/run-realm-tests.ts`
files those tools used to spawn, plus the `RunRealmTestsOutput` /
`RunRealmTestsFailure` types they emitted (defined and re-exported but
never actually consumed).

Tests updated: `factory-tool-registry.test.ts` and `factory-tool-executor.test.ts`
were rewritten around `realm-create`; the integration test keeps its
safety-rejection coverage but drops the realm-api request shapes (those
moved into the Playwright spec, which now also drops its `realm-search`
blocks). `factory-tool-builder.test.ts`'s registered-tool-delegation
test now goes through `realm-create`.

Net: -2 064 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk changed the title software-factory: Claude agent uses native tools; OpenRouter unchanged software-factory: Claude agent uses native tools; OpenRouter unchanged for now May 4, 2026
jurgenwerk and others added 2 commits May 4, 2026 14:45
…, test:realm)

The previous commit deleted scripts/boxel-search.ts and
scripts/run-realm-tests.ts as dead code, but missed that they're
still wired up as developer-facing pnpm scripts in package.json
(`pnpm boxel:search`, `pnpm test:realm`). The scripts no longer
back any agent-facing tool — that wrapper layer is gone — but
they remain useful CLI utilities for humans, so restore them and
update the registry comment to reflect the new reality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- claude-code.ts: silence `no-constant-condition` on canonicalize-
  ancestor's `while (true)` (the loop terminates on realpath success
  or filesystem-root reached, eslint can't see that)
- factory-tool-executor.test.ts: rewrite the missing-required-args
  assertion to use a single `includes()` instead of an OR chain
  (qunit/no-assert-logical-expression)
- prettier --write fallout in two skill files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk
Copy link
Copy Markdown
Contributor Author

I did a factory run with both claude and openrouter agents and it worked OK!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR narrows the software-factory tool surface for the Claude backend by removing most registry-backed tools, relying on Claude Code’s native workspace tools plus the existing boxel CLI, while keeping the remaining factory MCP tools for structured realm operations and control flow.

Changes:

  • Retires script/boxel-cli/most realm-api registry tools so realm-create is the only surviving registered tool.
  • Reworks the Claude agent to enable native Read/Write/Edit/Bash/Glob/Grep, adds prompt/skill updates, and introduces workspace-path scoping hooks.
  • Updates tests, smoke fixtures, and shared prompt/skill text to reflect the reduced registry and new workspace-centric model.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/software-factory/tests/factory-tool-registry.test.ts Updates registry expectations/tests for the single remaining registered tool.
packages/software-factory/tests/factory-tool-executor.test.ts Rewrites unit tests around realm-create, safety checks, and logging.
packages/software-factory/tests/factory-tool-executor.spec.ts Removes old live executor coverage for retired registry tools.
packages/software-factory/tests/factory-tool-executor.integration.test.ts Deletes old HTTP-shape integration tests and keeps preflight safety checks.
packages/software-factory/tests/factory-tool-builder.test.ts Adjusts builder tests for the reduced registry surface.
packages/software-factory/tests/factory-prompt-loader.test.ts Updates prompt-loader assertions for workspace-mirror wording.
packages/software-factory/tests/factory-agent-claude-code.test.ts Adds Claude native-tool and workspace-scoping tests.
packages/software-factory/src/test-run-types.ts Removes obsolete run-realm-tests output types.
packages/software-factory/src/factory-tool-registry.ts Shrinks the built-in registry to realm-create only.
packages/software-factory/src/factory-tool-executor.ts Simplifies executor dispatch/safety to the remaining realm-create path.
packages/software-factory/src/factory-tool-builder.ts Wraps only realm-api registered tools and tags them by source.
packages/software-factory/src/factory-test-realm.ts Drops re-exports for deleted run-realm-tests types.
packages/software-factory/src/factory-issue-loop-wiring.ts Wires the smaller registry and forwards workspaceDir to Claude.
packages/software-factory/src/factory-agent/types.ts Narrows tool-manifest categories and adds Claude workspaceDir config.
packages/software-factory/src/factory-agent/claude-code.ts Enables Claude native tools, filters MCP tools, adds prompt addendum, path scoping, and richer debug logging.
packages/software-factory/scripts/smoke-tests/factory-prompt-smoke.ts Updates smoke-test sample manifests for the new category shape.
packages/software-factory/prompts/system.md Rephrases system prompt around workspace-mirror operations.
packages/software-factory/.agents/skills/software-factory-operations/SKILL.md Rewrites operations guidance for native workspace tools and structured card tools.
packages/software-factory/.agents/skills/software-factory-bootstrap/SKILL.md Clarifies bootstrap flow should use structured factory tools instead of raw writes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/software-factory/src/factory-agent/claude-code.ts Outdated
Comment thread packages/software-factory/src/factory-agent/claude-code.ts Outdated
Comment thread packages/software-factory/tests/factory-tool-executor.test.ts Outdated
Comment thread packages/software-factory/tests/factory-tool-executor.integration.test.ts Outdated
Four issues raised on the PR review:

1. Bash was auto-approved alongside the typed fs tools but explicitly
   skipped by `canUseTool`, so the prompt addendum oversold the
   workspace-scoping guarantee. The hook can't reliably parse arbitrary
   shell for write side-effects (`>`, `tee`, `cp`, `mv`, …) and
   read-only inspection genuinely needs the broader fs, so the right
   move is to be honest in the contract: tighten the prompt addendum
   and the in-code comment to explicitly say `Bash` is NOT structurally
   gated and must be treated as read-only inspection. Writes go
   through `Write` / `Edit` where the workspace guard fires.

2. Without `workspaceDir` we still exposed `Read` / `Write` / `Bash`
   but omitted both `cwd` and `canUseTool` — relative paths would
   resolve against the process's working directory and absolute paths
   would be unrestricted, reintroducing the host-fs access this PR was
   trying to prevent. Fall back to MCP-only when no workspace is
   configured: `tools: []`, no native fs in `allowedTools`, no hook,
   no cwd. New unit test verifies the fallback.

3. The `realm-create` test fixtures used `endpoint: 'user/my-realm'`,
   but the realm server validates `endpoint` against `^[a-z0-9-]+$` —
   slashes are rejected. The mock client in the unit tests doesn't
   enforce this, so the tests stayed green, but they were exercising a
   value the live server would reject. Replace with a single-slug
   `'my-realm'` (and `'new'` in factory-tool-builder.test.ts) so the
   fixtures match the real contract.

4. The integration-test comment claimed live request-shape coverage
   for `realm-create` had moved to `factory-tool-executor.spec.ts`,
   but the spec only covers the factory-builder card-write tools —
   `realm-create` lost its end-to-end HTTP-shape coverage rather than
   relocating it. Replace the misleading note with an honest one
   acknowledging the coverage gap (entrypoint integration covers it
   end-to-end via `factory:go`; a focused unit-level shape test is a
   follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk requested a review from habdelra May 4, 2026 14:21
@jurgenwerk jurgenwerk marked this pull request as ready for review May 4, 2026 14:21
@jurgenwerk jurgenwerk requested a review from a team May 4, 2026 14:21
@habdelra
Copy link
Copy Markdown
Contributor

habdelra commented May 4, 2026

There is a prompt for an end to end test in the packages/software-factory/tests/end-to-end-test-prompt.md. can you prompt you agent to run this test and confirm if it's able to.

UPDATE
oh wait nvm, I saw your comment that you did an end to end test--thanks!

# Conflicts:
#	packages/software-factory/tests/factory-tool-executor.spec.ts
@jurgenwerk jurgenwerk merged commit 4464827 into main May 5, 2026
28 checks passed
@jurgenwerk
Copy link
Copy Markdown
Contributor Author

I tested it - end-to-end prompt worked well too!

jurgenwerk added a commit that referenced this pull request May 5, 2026
…edTools

The `canUseTool` hook from PR #4633 was dead code: empirical testing
(scripts/canusetool-repro.ts) shows the SDK auto-approves any tool in
`allowedTools` and never invokes the hook. Because Write/Edit/etc were
in allowedTools under `permissionMode: 'dontAsk'`, the path-scoping
safety net never fired and the model could write to absolute paths
outside the factory workspace.

Fix:
  - permissionMode: 'dontAsk' → 'default' (only mode that actually
    invokes canUseTool for tools outside allowedTools)
  - Remove NATIVE_FS_TOOLS from allowedTools (they stay in `tools` so
    they remain available, just no longer auto-approved)
  - MCP factory tools stay in allowedTools — auto-approve is fine for
    our own in-process implementations

Verification:
  - scripts/canusetool-fix-verify.ts: hook fires for every Write call;
    in-workspace Write succeeds, absolute out-of-workspace Write is
    rejected
  - New unit test locks in both structural conditions (native fs out
    of allowedTools + permissionMode='default') so any future
    regression on either knob fails at unit-test time

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants