feat(evaluator): collect_artifacts glob to download workspace files on any outcome#85
Merged
Conversation
…es on any outcome Declare doublestar glob patterns (`*.md`, `a/**/b`, `**/*.json`) selecting workspace files to download as run artifacts. Matches are copied to <output-dir>/<case>/<config>/outputs/workspace/ preserving their workspace-relative path, regardless of whether the agent succeeded, failed, or timed out. - schema: cases.defaults.collect_artifacts (all cases) + per-case collect_artifacts, merged as a de-duplicated union. - evaluator: collectGlobArtifacts runs after the agent run and before the timeout/error early-return and the deferred rt.Close(); it detaches from the (possibly cancelled/deadline-exceeded) case context so collection still runs on timeout. It is read-only — only `find` + DownloadFile, never mutates git — so it stays orthogonal to the agent_judge workspace diff. - validator: reject empty/invalid globs via doublestar.ValidatePattern. - docs (en/zh) + skill reference + CHANGELOG updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-glob # Conflicts: # docs/guide/writing-evals.md # docs/zh/guide/writing-evals.md # internal/config/validator.go
Collaborator
Author
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a963a78efc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…llection Mirror prepareOutputDir behaviour: remove outputs/workspace before downloading the current matches so a reused --output-dir or a retry attempt cannot leave stale artifacts from an earlier run that the current workspace no longer produces or matches. The clear happens only after listing succeeds, so a transient list failure preserves prior output. Adds a regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
doublestar's **/* match dotfile paths, and agent_judge runs commit a baseline .git into the workspace (prepareWorkspaceDiffState), so a broad glob like "**" would sweep the entire VCS object store into the artifacts. Exclude .git in the find listing so collection captures agent output, not framework noise. Adds a regression test and a doc note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zpzjzj
added a commit
that referenced
this pull request
Jun 3, 2026
…ests CI on `windows-latest` surfaced four failures after the latest rebase onto main's 0.2.4 (which added new timeout/process-group tests under PR #85 and PR #24). `classifyExecError`: when ctx terminated the process, force exit code -1 *before* checking *exec.ExitError. On POSIX a killed `sleep` already surfaces as -1 (signal), but on Windows bash's `sleep 1` killed via ctx-cancel returns OS exit code 1, which the surrounding Exec switch would then misclassify as "real failure" instead of "killed by context" — so the log line "command exited with code 1: sleep 1" showed up instead of "command killed by context". Forcing -1 in the ctx-cancel branch fixes `TestNoneRuntime_ExecReturnsContextErrorOnTimeout` and the two `LogsDeadlineDelta` / `OmitsDeadlineDelta` tests on Windows without changing POSIX behaviour (POSIX still arrives via the same branch with -1 from the signal kill). `TestNoneRuntime_ExecKillsDescendantsOnTimeout` and `TestScriptJudge_Timeout`: both rely on POSIX process-group semantics — killing the parent shell takes the whole descendant tree with it. Windows has no equivalent; `configureProcessGroup` is a no-op there, so a backgrounded ping/sleep survives ctx-cancel and (in the script-judge case) holds t.TempDir as its cwd, breaking RemoveAll cleanup at test end. Skip both on Windows with a comment pointing at the cancellation-itself test as the remaining Windows coverage. `TestCustomAgent_CollectArtifacts_MixedSafeAndEscaping`: the safe-suffix check used `strings.HasSuffix(p, "/safe.txt")`, which can never match a Windows path like `C:\...\safe.txt`. Switch to `filepath.Base(p) == "safe.txt"` so the assertion is OS-neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zpzjzj
added a commit
that referenced
this pull request
Jun 4, 2026
…ests CI on `windows-latest` surfaced four failures after the latest rebase onto main's 0.2.4 (which added new timeout/process-group tests under PR #85 and PR #24). `classifyExecError`: when ctx terminated the process, force exit code -1 *before* checking *exec.ExitError. On POSIX a killed `sleep` already surfaces as -1 (signal), but on Windows bash's `sleep 1` killed via ctx-cancel returns OS exit code 1, which the surrounding Exec switch would then misclassify as "real failure" instead of "killed by context" — so the log line "command exited with code 1: sleep 1" showed up instead of "command killed by context". Forcing -1 in the ctx-cancel branch fixes `TestNoneRuntime_ExecReturnsContextErrorOnTimeout` and the two `LogsDeadlineDelta` / `OmitsDeadlineDelta` tests on Windows without changing POSIX behaviour (POSIX still arrives via the same branch with -1 from the signal kill). `TestNoneRuntime_ExecKillsDescendantsOnTimeout` and `TestScriptJudge_Timeout`: both rely on POSIX process-group semantics — killing the parent shell takes the whole descendant tree with it. Windows has no equivalent; `configureProcessGroup` is a no-op there, so a backgrounded ping/sleep survives ctx-cancel and (in the script-judge case) holds t.TempDir as its cwd, breaking RemoveAll cleanup at test end. Skip both on Windows with a comment pointing at the cancellation-itself test as the remaining Windows coverage. `TestCustomAgent_CollectArtifacts_MixedSafeAndEscaping`: the safe-suffix check used `strings.HasSuffix(p, "/safe.txt")`, which can never match a Windows path like `C:\...\safe.txt`. Switch to `filepath.Base(p) == "safe.txt"` so the assertion is OS-neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zpzjzj
added a commit
that referenced
this pull request
Jun 4, 2026
* ci: run build & test on a windows-latest runner Promote Windows toward a first-class platform (issue #31): the build job now runs on an ubuntu/windows OS matrix. The Windows leg is continue-on-error for now so its failures surface as a regression baseline without gating merges; codecov upload stays Linux-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: make the script judge cross-platform on Windows The script judge hardcoded a POSIX execution model: scripts were uploaded to /tmp without an extension and run via `chmod 700 && ./script` through a `bash -c` shell. On native Windows none of that works, so every script-judge test was skipped. Introduce internal/platform to centralize OS-conditional shell and binary discovery, and dispatch script execution by interpreter: - runtime.NewShellCmd selects the host shell — bash/sh on POSIX, cmd.exe on Windows (wrapping the command so cmd's outer-quote stripping leaves embedded quoted paths intact). - Runtime gains an optional TargetOSer interface; NoneRuntime reports the host OS, OpenSandboxRuntime always reports linux. - The script judge plans execution from the file extension (or shebang): POSIX targets keep the original chmod+shebang behavior; Windows targets dispatch .ps1 to PowerShell, .cmd/.bat to cmd, and .sh to a discovered bash (Git Bash / WSL / SKILL_UP_BASH), with a clear error when absent. - shellquote gains QuoteWindows (CommandLineToArgvW semantics) and QuoteFor. Every t.Skip("windows") in the script judge tests is replaced with platform-aware table-driven cases, plus new interpreter and quoting unit tests that run on all platforms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(agent): route quoting through shellquote, fix Check on Windows The agent package carried its own POSIX single-quote implementation, duplicating internal/shellquote. Collapse shellQuote into a thin delegator so the project keeps one quoting implementation; agent commands always run under bash, so POSIX quoting stays correct. CLIAgent.Check ran `command -v <bin>`, which cmd.exe cannot execute. checkCommandForOS now translates it to `where <bin>` for Windows targets, so agent availability checks work on a Windows host. Drop the unused platform.LookAgentBinary wrapper: agent binaries are resolved inside the runtime shell (which may be a remote sandbox), so a host-side exec.LookPath wrapper has no caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: add PowerShell tooling and example script for Windows GNU make and POSIX sh are not available out of the box on Windows. Add PowerShell counterparts under scripts/windows/ for the hooks, lint-tools, and verify Makefile targets, and a .ps1 version of the example judge-debug-eval script so Windows users can author script judges without a bash dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: pin line endings via .gitattributes Without explicit attributes, a Windows clone with core.autocrlf=true rewrites checked-in scripts with CRLF. A CRLF shebang breaks shell script execution and CRLF in .go files trips gofmt. Pin .go/.sh/.ps1 to LF and .cmd/.bat to CRLF. Path construction in internal/runner, internal/report, and internal/skill was audited: all of it already uses filepath.Join, so no code changes were needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add Windows support guide Add a Windows support page (English + Chinese) covering supported features, the script judge interpreter dispatch, SKILL_UP_BASH, the scripts/windows/ tooling, and known limitations. Register it in the VitePress sidebar and point AGENTS.md / CONTRIBUTING.md at it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify OpenSandbox runtime behavior on Windows Research outcome for the "Windows + opensandbox" question: the opensandbox runtime never spawns a host shell and already crosses the host->sandbox path boundary via filepath.ToSlash, so driving a remote (Linux) sandbox from a native Windows host works today. OpenSandbox has no Windows-container support — its SDK and execution API are Linux-only — and WSL2 is the recommended path for the full agent workflow without a remote sandbox. Document all three points. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: correct OpenSandbox Windows-sandbox availability Earlier I stated OpenSandbox has no Windows-container support. That was wrong: OpenSandbox does offer a Windows guest profile (dockur/windows in KVM/QEMU) selected via `platform.os=windows` on create. The actual gap is in the Go SDK, which has not yet added the Platform field to its CreateSandboxRequest. Note that as the upstream dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): use POSIX quoting and forward-slash paths where targets are POSIX Two Windows CI failures with the same root cause: * internal/evaluator/fixtures.go and internal/runtime/opensandbox.go build commands that run inside a POSIX shell (the runtime sandbox or bash session). They were using shellquote.Quote, which is build-tagged to host quoting and produced Windows double quotes when skill-up runs on a Windows host. Switch to the explicit QuotePOSIX so the quoting matches the target shell regardless of host OS. Fixes TestGitInitUploader_QuotesSpecialCharsInRemote, TestApplyDiffUploader, TestGitCheckoutUploader, and the two OpenSandbox upload tests. * internal/cli/import.go writes case file paths into the generated eval.yaml. eval.yaml is a portable config consumed by the loader on any OS; use path.Join with filepath.ToSlash so the value stays "cases/case-N.yaml" instead of becoming "cases\case-N.yaml" on Windows. Fixes TestGenerateEvalConfig, TestImportCmd_OutputIsSkillRoot, TestGenerateEvalConfig_UsesImportedCaseIDs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): force -1 on ctx-cancel and make tests OS-aware Round two of fixing the Windows CI failures. * classifyExecError now returns -1 unconditionally when ctx.Err() is set, so a killed cmd.exe (which surfaces exit code 1) does not look like a normal failure. TestNoneRuntime_ExecReturnsContextErrorOnTimeout also now uses `ping -n 3 127.0.0.1 > nul` on Windows to give the context time to fire before the process exits on its own. * TestNoneRuntime_ExecExpandsPathFromRuntimeEnv asserts POSIX `printf "$PATH"` expansion; Windows cmd.exe has neither, so skip. * TestIterationWorkspace_Paths now builds expected paths with filepath.Join so the assertions match the host separator. * TestContextFilesUploader_RejectsUnsafePaths's "absolute" case used `filepath.Join(string(filepath.Separator), "tmp", "secret.txt")`, which is `\tmp\secret.txt` on Windows and treated as relative. A small absoluteSecretPath() helper returns a true Windows-absolute path (`C:\tmp\secret.txt`) when running there. * TestCLIAgent_Run / RunExitError / InstallSkillWithCmd / InstallMCPUsesResolvedEndpointConfigRefAndEnv all bake POSIX-shell templates (`exit %d`, `mkdir -p ... && echo > ...`, `$VAR`) into their RunCmd/InstallSkillCmd/InstallMCPCmd. Native Windows agent execution is intentionally out of scope (users go through WSL2), so a shared skipIfNoPOSIXShell helper skips these on Windows. * TestFindQoderSessionFile_SelectsNewestByModTime hand-builds a workspace-key directory that embeds the workspace path; on Windows that path starts with `C:` and is not a legal directory component. Skipped along with the rest of native-Windows Qoder support. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): use POSIX quoting in git checkout helper and run tests under bash Two remaining Windows CI failures after rebase: * gitCheckoutUploader was added on main with shellquote.Quote(branch), the host-aware variant, which produced double-quotes when running on a Windows host. The script always runs in a POSIX shell inside the runtime (the `none` runtime on a POSIX host or the Linux OpenSandbox), so use QuotePOSIX explicitly. Fixes TestGitCheckoutUploader_ErrorPathDoesNotEvaluateBranch. * `go test -race -coverprofile=coverage.out ./...` produced `FAIL .out [setup failed]` on Windows: pwsh's legacy native-argument passing splits `-coverprofile=coverage.out` and feeds `.out` to go as a package import path. Force `shell: bash` on the test step so args reach go.exe verbatim; Git Bash is preinstalled on windows-latest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): address PR #33 review nits Round of fixes for the P2 nits left on the Windows-support PR: * Prefer Git Bash on Windows for NoneRuntime shell invocations and fall back to cmd /d /c when no bash is discoverable. This keeps the many internal POSIX command strings working (agent CLI prompts via shellQuote, `set -eu` git fixtures, workspace-diff `if ...; then` pipelines) and gives `%VAR%`-bearing inputs bash's literal double-quote semantics so cmd's mid-arg expansion stops mangling them. `/d` disables HKLM/HKCU AutoRun so the host's cmd AutoRun cannot prepend commands to every Exec. * DiscoverBash no longer probes `C:\Windows\System32\bash.exe`; WSL bash expects `/mnt/c/...` paths, so picking it up would let `.sh` script judges fail with file-not-found. Users with WSL pipelines can still point SKILL_UP_BASH at it after arranging path translation. * shellquote.QuoteWindows now treats `(`, `)`, and `%` as quoting triggers, so paths like `C:\tmp\(draft)\script.cmd` get wrapped instead of being passed as cmd syntax. The `%` case is documented as best-effort (cmd expands `%VAR%` even inside quotes; preferring bash via NewShellCmd is the real fix). * shebangExtension parses the interpreter basename instead of doing a substring match, so `#!/usr/bin/env fish`, `python3`, `swish`, etc. no longer get misclassified as `.sh`. `env -S <name>` is handled. * Agent's PATH override (`$HOME/.local/bin:$HOME/.nvm/current/bin:$PATH`) is build-tagged out on Windows so the inherited Windows `Path` reaches `where codex`/`where claude` checks intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): always quote QuoteWindows output and disable cmd AutoRun for judge Two follow-up nits from the latest Codex review pass: * P1: QuoteWindows previously returned the raw string for inputs without whitespace/metacharacters, so common script-judge paths such as `C:\tmp\skill-up-judge-N\script.cmd` were emitted unquoted. With NewShellCmd now routing through bash on Windows when Git Bash is available, unquoted backslashes are stripped by bash and the downstream `cmd /c`/`powershell -File` receives `C:tmpscript.cmd` — which made every script-judge test fail on the windows-latest runner. Drop the fast path and always wrap; double-quoted backslashes are literal under both bash and cmd, and CommandLineToArgvW unwraps them the same way. * P2: the script-judge `.cmd`/`.bat` invocation and the workspace cleanup both used plain `cmd /c`, so HKLM/HKCU `Command Processor\AutoRun` could prepend commands before the judge script and make results non-deterministic. Switch them to `cmd /d /c` to match the runtime shell fallback hardening. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): defang MSYS argv conversion and skip stdio-framing test The remaining Windows CI failures after the QuoteWindows always-quote change came from two unrelated places: * Git Bash / MSYS rewrites `/x`-shaped argv entries as POSIX paths before invoking native Windows binaries, so `bash -c "cmd /d /c X"` reached cmd.exe as `cmd "C:/Program Files/Git/d" ...` — cmd never saw its `/d /c` switches, dropped into the interactive prompt, and every script-judge test got the cmd banner as the script's stdout. Setting `MSYS_NO_PATHCONV=1` and `MSYS2_ARG_CONV_EXCL=*` in the env NoneRuntime hands the bash child is the standard escape hatch. The vars are no-ops when bash isn't the launched shell. * TestMockedGenericServer_FramedNonASCIIRoundTrip drives the mocked MCP server's Content-Length framing through a Node child whose stdout, on default Windows, is CRLF-translated and codepage-rewritten — that corrupts a non-ASCII byte stream. Verifying the framing on POSIX is enough; skip with a doc comment pointing at the underlying Node stdio behavior. Not related to the Windows-support work itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: make Windows job required now that all tests pass The windows-latest leg of the Build & Test matrix has been green for several CI runs in a row (script judge, agent, evaluator, runtime, mcp, all packages pass). Drop the continue-on-error gate so future Windows regressions block the PR instead of being silently absorbed. This is the final commit of the issue #31 Windows-support rollout sketched in the original plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): skip WSL bash even when PATH-discovered Even with C:\Windows\System32\bash.exe removed from knownWindowsBashPaths, DiscoverBash on Windows still returned whatever exec.LookPath finds first — and System32 is on PATH on every Windows host, so the WSL bash shim wins. The shim expects Linux-format `/mnt/c/...` paths, so the host-shell command strings and script-judge `.sh` invocations built from `C:/...` host paths fail with file-not-found despite "bash discovered" succeeding. Detect the System32-located shim by directory and fall through to the Git Bash well-known paths so PATH-based hosts behave the same as ones without bash on PATH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(mcp): skip RejectsSymlinkEscape on Windows for the same Node-stdio reason windows-latest creates symlinks fine (the runner runs with the required privilege), so this test no longer falls into the existing "symlinks not supported" skip path; it instead reaches the same Node stdio framing hang that TestMockedGenericServer_FramedNonASCIIRoundTrip hits. Skip on Windows with the same justification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(mcp): centralize Windows skip in startMockServer Every mocked MCP server test spawns a child Node process and exchanges Content-Length-framed JSON-RPC over its stdout — the same setup that default-config Node stdio corrupts on Windows (codepage + CRLF). Moving the Windows skip from per-test guards into the shared startMockServer helper covers all current and future mock-server tests in one place, and removes the two earlier per-test skips that this helper now subsumes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): reject WSL bash via SKILL_UP_BASH override too The override branch in DiscoverBash accepted any regular file at the configured path, including C:\Windows\System32\bash.exe (the WSL shim). The shim expects /mnt/<drive>/... paths, so script-judge `.sh` invocations built from C:/... host paths still fail even when the override is set. Apply the same isWSLBash check on the override path that PATH-discovery already uses; advanced WSL users must point the override at a non-WSL bash or arrange path translation upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: add a Windows e2e job The Windows leg of Build & Test exercises the unit tests, but the Windows-specific code paths added by issue #31 (NewShellCmd shell selection, MSYS argv handling, QuoteWindows, WaitDelay, script-judge interpreter dispatch) only really light up when skill-up runs as a subprocess through its own CLI — which is the e2e package. Add an E2E (none runtime, Windows) job that runs `go test -tags e2e ./e2e` without SKILL_UP_FULL_E2E, so the LLM-dependent tests self-skip and the mock-engine / script-judge contract tests land on Windows. This costs ~one extra CI job slot and no API keys; real-LLM coverage keeps living in the Linux e2e job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(e2e): emit skill-up.exe on Windows so TestMain build is executable The e2e harness's TestMain builds the skill-up CLI into a temp dir and then exec's it for every test. `go build -o <path>` writes the binary to exactly the path given, so on Windows the result has no extension and Windows refuses to execute it ("executable file not found in %PATH%"). Append .exe on Windows so every test in the suite can actually launch the binary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): normalize transcript path and honor shebang options for .sh Two follow-up nits from the latest Codex pass: * P1: EVAL_TRANSCRIPT_PATH was built with joinForGOOS, so on Windows it landed as `C:\...\transcript.json`. The `.sh` script-judge path runs under Git Bash where common idioms like `cat "$EVAL_TRANSCRIPT_PATH"` go through POSIX coreutils that only accept slash-style paths. Add an envPath translator to scriptPlan: identity for POSIX targets and `.ps1`/`.cmd`/`.bat` (native Windows paths); filepath.ToSlash for the `.sh` Windows plan. script.go applies it before injecting the env var. * P2: planWindowsScript invoked bash explicitly as `bash <script>`, silently dropping any shebang-encoded options (e.g. `#!/bin/bash -eu` or `#!/usr/bin/env -S bash -eu`). On POSIX the kernel would honor those flags via shebang dispatch, so the same script behaved differently across platforms — a script relying on `-e` could continue past failures on Windows and report PASS. Refactor parseShebang to return (interpreter, opts) and forward opts in the bash command for the Windows `.sh` plan. Tests cover the env -S, direct, and only-flags shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): escape bash actives in script paths; align WSL docs Two follow-up nits from the latest Codex pass: * P2: NewShellCmd prefers Git Bash on Windows, so the Windows script-judge commands ultimately run through bash -c. Paths that contain `$` or backtick (e.g. via an unusual %TMP%) would be expanded/command-substituted by bash inside the QuoteWindows-emitted double quotes, so the downstream powershell -File / cmd /d /c saw a rewritten path. Add quoteWindowsThroughBash that wraps QuoteWindows and additionally escapes those two characters; the extra leading backslash is collapsed harmlessly by Windows path normalization when the same command falls through to cmd. Used by every Windows script-judge command builder and removeDirCommand. * P2: docs/guide/windows.md (en + zh) advertised WSL bash as one of the discoverable options, but DiscoverBash now actively rejects any bash under %SystemRoot%\System32 (the WSL shim) at every step — SKILL_UP_BASH, PATH, and well-known locations — because it expects /mnt/c/... paths and silently fails on the Windows paths skill-up generates. Drop the WSL mention from the discovery table and add an explicit note documenting the rejection plus the recommended escape hatch (run skill-up inside WSL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): make cmd fallback reliable, shell-aware quoter, env -S parsing Codex flagged four P2s on the previous round; addressed: * P2: cmd /d /c could trigger cmd's "preserve quotes" branch for certain command shapes (single-token executable, exactly two quotes). Switch the cmd fallback in NewShellCmd to `cmd /d /s /c "<command>"` so the outer-wrap strip is deterministic regardless of the inner command's shape; matching `cmd /d /s /c` is now also used by the script-judge .cmd plan and the cleanup command. * P2: quoteWindowsThroughBash blindly escaped `$` and backtick into `\$` / `` \` `` even on hosts where NoneRuntime.Exec falls back to cmd (no bash). On the cmd path the extra backslash is preserved literally in the path, so e.g. `C:\tmp\$foo\script.cmd` became `C:\tmp\\$foo\…` -- a different file. Pick the quoter at plan time based on whether DiscoverBash succeeds (matching what NewShellCmd will choose), so the bash-active escapes only apply when the command actually goes through `bash -c`. * P2: parseShebang accepted `env -S bash -eu` (split form) but rejected the compact GNU forms `env -Sbash -eu` and `env --split-string=bash`. Add explicit handling for both, rejoining the trailing tokens with a space so the result matches what env's -S would receive on a real shebang. Tests cover the split / compact / long forms. * P2: scriptPlan now carries its own cleanupCommand (the global removeDirCommand is gone) so the cleanup goes through the same shell-aware quoter and the same `cmd /d /s /c` fallback as the run command. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): consume env value-flags and use shell-style env -S tokenizer Codex flagged two more parser-fidelity P2s plus a third about cmd percent-expansion (which has no quoting-layer fix; replied with the limitation): * P2: parseEnvShebang treated every `-prefixed` env token as a single flag, so shebangs like `#!/usr/bin/env -u VAR bash -eu` or `#!/usr/bin/env -C /tmp bash` consumed the flag's value as the interpreter token (and reported "cannot determine interpreter" or routed to the wrong binary). Add a small set of value-taking short flags (-u, -C) the parser advances past; the long forms use `=` syntax so they remain self-contained. * P2: splitStringInterpreter used strings.Fields, which breaks tokens inside quotes. A real shebang like `#!/usr/bin/env -S bash -c "echo ok"` would forward `["bash", "-c", "\"echo", "ok\""]` to bash on Windows instead of the three-token argv POSIX produces. Replace with a small shell-style tokenizer that respects single quotes, double quotes with `\"`/`\\` escapes, and outside-quote backslash escapes; unterminated quotes produce a clean "cannot determine interpreter" error rather than mis-split argv. Tests cover all three shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): bash-safe quoter, restrict shebang routing, /dev/null in where Codex flagged three issues this round; all addressed: * P1: windowsQuoter built on top of QuoteWindows then post-replaced every `$` with `\$`. For an input like `C:\tmp\$work\script.cmd` QuoteWindows preserved the literal `\$`, and the post-replace turned it into `\\$work` -- which under `bash -c` decoded to `\` + `$work` expansion, mangling the path. Replace the post-process scheme with a proper double-quote-with-bash-escapes quoter (every \ / " / $ / backtick doubled). Bash decodes them back to the original byte and cmd later collapses the resulting `\\` runs via Windows path normalization, so paths with `\$`-style segments survive both shells. * P2: shebangPOSIXShells routed dash/ksh/zsh/ash to `.sh`, but the Windows .sh plan always invokes Git Bash -- silently running a zsh script through bash changes semantics. Restrict the Windows shebang classifier to `sh` and `bash`; other POSIX shells now return the planner's "cannot determine interpreter" error. POSIX targets are unaffected because planScript ignores extension there. * P2: checkCommandForOS rewrote `command -v` to `where` but left POSIX `/dev/null` redirects untouched. cmd opens that as a literal path and the probe fails. Rewrite `/dev/null` to cmd's `nul` as part of the translation so `command -v codex >/dev/null 2>&1` becomes `where codex >/dev/null 2>&1`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): handle env long-flag values and `\_` separator in env -S Two more parser-fidelity P2s from the Codex pass: * P2: parseEnvShebang consumed `-u`/`-C` short flags' value tokens but not the long-form split shape (`--unset FOO bash`, `--chdir /tmp bash`). Add envValueTakingLongFlags and reuse the same +2 advance. The `=`-suffixed shapes (`--unset=FOO`) are still handled by the generic single-token skip. * P2: tokenizeShebangSplitString treated every unquoted `\X` as "escape next byte literally", so the documented env -S separator `\_` (along with `\t`, `\n`, ...) silently became part of the same token. Decode the whitespace escapes (`\_`, `\t`, `\n`, `\r`, `\v`, `\f`) as token separators in unquoted context, so a shebang like `#!/usr/bin/env -S bash\_-eu` produces `(bash, [-eu])` instead of one glued `bash_-eu` token. Other backslash escapes keep their existing literal behavior. Tests cover --unset/--chdir split forms and \_/\t separators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(windows): unify shell host, drop dead quote API, require TargetGOOS Self-review cleanup for the Windows-support PR. Five reviewer items were addressable in code; four are documented limitations or out-of-scope. Centralize shell decisions in platform.Host(): both NoneRuntime.Exec and the .sh script-judge planner now consume a single HostShell value that binds the shell command, quoter, and MSYS opt-out env together. This removes the duplicate DiscoverBash calls in interpreter.go and the goruntime.GOOS branch with inline MSYS env in NoneRuntime — the platform package once again owns all OS dispatch. Drop shellquote.Quote, shellquote.QuoteFor, and the windows/posix build-tag files. No call site referenced them; QuotePOSIX and QuoteWindows are explicit at every caller, so the dead symbols only risked future misuse. Promote TargetGOOS to a required Runtime interface method. The previous TargetOSer optional interface defaulted unknown runtimes to "linux", which would silently mis-route a future Windows-native runtime. Adding the method to all six mocks surfaces the requirement at compile time. Replace checkCommandForOS's strict prefix match with a regex so quiet probes like `command -v codex >/dev/null 2>&1` translate correctly to `where codex >/dev/null 2>&1` on Windows; the prior check only handled an exact `command -v <name>` form. Document two known limitations in the Windows guide: native agent CLI execution still requires bash (so the PR's "first-class" scope covers the runtime/judge layers, not agent bootstrap), and cmd.exe's %VAR% expansion inside double-quoted arguments has no command-line escape — install Git Bash to avoid the cmd fallback entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): cache platform.Host() and require bash for agent CLIs Self-review follow-up addressing M1 and M3 from PR #33. M3 (perf): platform.Host() now caches its HostShell via sync.OnceValue on both POSIX and Windows. Previously every NoneRuntime.Exec call (and every script-judge plan on Windows) re-ran exec.LookPath("bash") plus the well-known Git Bash stats. The result is process-stable by design -- SKILL_UP_BASH is documented as read-once at startup -- so this is a straight cache. Removes the hottest filesystem-IO cost on the Exec path. M1 (security): all CLIAgent / ClaudeCodeAgent / CodexAgent / QoderCLIAgent entry points now reject runs when rt.TargetGOOS() is Windows and platform.Host().IsBash is false. Without this guard, the cmd.exe fallback would still accept agent commands whose instruction string is POSIX-quoted; cmd treats `'` as literal, so metacharacters (& | " %VAR%) from arbitrary case messages could reach the host shell. Native agent execution on Windows already required bash for the nvm/Node bootstrap (documented in docs/guide/windows.md), so the new ErrAgentRequiresBash makes that requirement explicit at the API surface with a clear hint pointing at Git for Windows / SKILL_UP_BASH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(judge): clean .sh temp dirs via bash rm on Windows Self-review follow-up addressing M8 from PR #33. The .sh script-judge plan already runs through bash -c, but cleanup went through `cmd /d /s /c rd /s /q "<dir>"`. That meant the bash shell's double-quote escaping (\\, ", $, `) had to round-trip safely through cmd's strip-first-and-last-quote rule, which worked only because judgeTempDir paths happen to be free of spaces and metachars that cmd would mis-parse. Stay inside bash for cleanup too: emit `rm -rf <forward-slash dir>`. Git Bash's rm understands the forward-slash form of Windows temp paths natively, so the bash -> cmd hop is avoided entirely. .ps1 / .cmd / .bat continue to use the cmd-side cleanup since their command itself is dispatched via cmd anyway. Add TestCleanupCommand_Windows_ShellScriptUsesBashRm to lock in the new shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(windows): post-review cleanups (CI versions, docs, lint dead code) Self-review follow-up addressing M4/M5/M6 and several Info nits from PR #33. CI (M4): normalize actions/upload-artifact to @v5 across all jobs so the upload/download halves stay on one major (previously a mix of @v5 and @v7), and bring release-dryrun's actions/setup-go from @v5 up to @v6 to match every other job. Docs (M5, M6): add a CHANGELOG [Unreleased] entry covering Windows support, SKILL_UP_BASH, PowerShell tooling, and the agent bash requirement. Point both READMEs at docs/guide/windows.md so the "first-class Windows support" promise is discoverable without relying on the vitepress sidebar. Lint / code hygiene (I-series): - Hoist string literals: introduce platform.GOOSWindows / Linux / Darwin so OS-dispatch sites stop comparing against bare strings; silences goconst on the new checkCommandForOS table tests. - Drop the `_ = baseURL` noise in BaseAgent.logCredentialStatus; the baseURL value was never used after the binding. - Drop the `plan.envPath != nil` guard in script judge; every scriptPlan returned by planScript sets envPath, so the nil branch was dead. - Name the 10s WaitDelay magic number as execContextGracePeriod with a sourced comment. - CONTRIBUTING: document the Windows equivalent of `make e2e` (`go test -tags e2e -v ./e2e`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(judge): handle env -a, optional-arg signal flags, NAME=VALUE, and \c Codex auto-review on PR #33 flagged four GNU env corner cases the shebang parser mishandled. All four would have produced a misleading "cannot determine interpreter" error on Windows for shebangs that the kernel's real env(1) would have executed cleanly. Each gets a targeted fix plus a TestParseShebang table case. - `-a ARG` / `--argv0=ARG`: env's documented argv0-override short option DOES consume a separate value token. Previously `-a myargv0 bash` parsed as interpreter `myargv0`. Added `-a` to envValueTakingShortFlags. - `--default-signal` / `--ignore-signal` / `--block-signal`: these take an OPTIONAL `[=sig]` argument, not a separate value token. Previously `--ignore-signal bash` was parsed as flag with value `bash`, leaving no interpreter. Removed them from envValueTakingLongFlags; the `--name=value` form is still handled by the generic "starts with `-`" skip arm. - Leading `NAME=VALUE` assignments: env(1) accepts `env PYTHONPATH=/opt python3` and the equivalent `env -S PYTHONPATH=/opt python3`. Added isEnvAssignment + interpreterFromArgs so both parseEnvShebang's default branch and splitStringInterpreter walk past assignment tokens to the real interpreter. - `\c` terminator: GNU env -S documents unquoted `\c` as "ignore the rest of the split-string body". tokenizerState gains a `done` flag the unquoted handler sets on `\c`; the top-level loop honors it, flushing the current token and discarding the tail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(judge): honor pwsh shebang and forward PowerShell shebang flags Codex auto-review on PR #33 flagged two `.ps1` planning bugs that diverged from how a kernel shebang would invoke the script. `#!/usr/bin/env pwsh`: previously the extension dispatch always ran `powershell.exe` (Windows PowerShell 5.x), so an extensionless script asking for PowerShell Core 7+ silently got the legacy interpreter -- breaking pwsh-only syntax/modules. The planner now picks `pwsh` when the shebang names it explicitly and keeps `powershell` as the default. `#!/usr/bin/env -S pwsh -NoLogo`: previously the `.ps1` plan hard-coded `powershell -NoProfile -ExecutionPolicy Bypass -File ...` and dropped any shebang options, so a request like `-NoLogo` was silently lost. The planner now forwards shebang options when the shebang names a PowerShell interpreter, then appends the safety defaults so unsigned scripts still execute in CI. A `.ps1` file with no shebang (or a non-PowerShell shebang) keeps the old `powershell.exe` + `-NoProfile -ExecutionPolicy Bypass -File` behavior. Tests cover all three shapes plus the `pwsh` routing and option forwarding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): tag this branch's CHANGELOG entry as 0.3.0 Maintainer asked for a concrete version on the CHANGELOG section introduced by this PR. After rebasing onto main (now at 0.2.1), the next release is 0.3.0: - Adds first-class Windows support (substantial new feature surface). - Promotes Runtime.TargetGOOS() from optional to required, which is a breaking change for any out-of-tree Runtime implementer. Under semver this argues for a minor bump while the project is in 0.x. Also wires DockerRuntime (landed on main while this branch was open) into the required TargetGOOS interface: docker provisions a Linux guest, so TargetGOOS returns platform.GOOSLinux. Without this the rebased branch would no longer build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(agent): use platform.GOOSWindows constant in requireBashOnWindowsHost Every other TargetGOOS comparison in this package and judge/runtime/cli already routes through platform.GOOSWindows; only requireBashOnWindowsHost was still using the bare "windows" string literal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(judge): defensive copy in .ps1 plan.command to match .sh branch The .ps1 closure was returning strings.Join(append(psArgs, ...), " "), which mutates psArgs's backing array if cap exceeds len. Currently safe because plan.command is called exactly once per script run, but a latent footgun if anyone refactors to multi-invocation. The .sh branch right below already does the copy-then-append dance; mirror it for consistency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(runtime): correct DockerRuntime.TargetGOOS doc comment The PR commit that introduced TargetGOOS() on DockerRuntime accidentally duplicated the leading "// Start starts the container" line above the new method, which revive flags as a malformed exported-method comment. The pre-existing main lint job (cd41d6d) now fails the build on this class of violation; drop the orphan line so the doc comment starts with the method name as required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(release): bump 0.3.0 date to 2026-05-27 to follow main's 0.2.3 * fix(windows): reject rooted POSIX-style paths in workspace/sandbox guards `IterationWorkspace.WriteFile` and `safeLocalTarget` both used `filepath.IsAbs` alone to reject user-supplied absolute paths. On Windows `filepath.IsAbs` requires a volume name (`C:\...`), so a POSIX-style `/abs.txt` or `/absolute` slipped through and the underlying `os.WriteFile` / `filepath.Join` happily honored it — exactly the path-traversal escape the guard was supposed to block. Both call sites now additionally reject any path whose first byte is a separator (`/` or `\`). The unit tests `TestIterationWorkspace_WriteFile` and `TestOpenSandboxLocalHelpersRejectUnsafePathsAndPreserveRemoteScope`, which assert `/abs.txt` / `/absolute` are rejected, now pass on the Windows CI runner this PR introduces. Surfaced by adding `windows-latest` to the build matrix; main does not yet run these tests on Windows. Tests themselves are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(changelog): drop empty [0.3.0] header after main absorbed Windows into 0.2.4 The earlier "bump 0.3.0 date" commit on this branch staged a 0.3.0 section to hold the Windows entries. Main has since cut a 0.2.4 release that incorporated those entries directly, so the standalone [0.3.0] header on this branch is now empty and would confuse readers. Remove it; the Windows content is preserved under the just-released [0.2.4] section above [0.2.3]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows): force -1 exit on ctx-cancel; skip POSIX-process-group tests CI on `windows-latest` surfaced four failures after the latest rebase onto main's 0.2.4 (which added new timeout/process-group tests under PR #85 and PR #24). `classifyExecError`: when ctx terminated the process, force exit code -1 *before* checking *exec.ExitError. On POSIX a killed `sleep` already surfaces as -1 (signal), but on Windows bash's `sleep 1` killed via ctx-cancel returns OS exit code 1, which the surrounding Exec switch would then misclassify as "real failure" instead of "killed by context" — so the log line "command exited with code 1: sleep 1" showed up instead of "command killed by context". Forcing -1 in the ctx-cancel branch fixes `TestNoneRuntime_ExecReturnsContextErrorOnTimeout` and the two `LogsDeadlineDelta` / `OmitsDeadlineDelta` tests on Windows without changing POSIX behaviour (POSIX still arrives via the same branch with -1 from the signal kill). `TestNoneRuntime_ExecKillsDescendantsOnTimeout` and `TestScriptJudge_Timeout`: both rely on POSIX process-group semantics — killing the parent shell takes the whole descendant tree with it. Windows has no equivalent; `configureProcessGroup` is a no-op there, so a backgrounded ping/sleep survives ctx-cancel and (in the script-judge case) holds t.TempDir as its cwd, breaking RemoveAll cleanup at test end. Skip both on Windows with a comment pointing at the cancellation-itself test as the remaining Windows coverage. `TestCustomAgent_CollectArtifacts_MixedSafeAndEscaping`: the safe-suffix check used `strings.HasSuffix(p, "/safe.txt")`, which can never match a Windows path like `C:\...\safe.txt`. Switch to `filepath.Base(p) == "safe.txt"` so the assertion is OS-neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: align upload-artifact on @v7 to match main An earlier commit on this branch normalized all upload-artifact refs to @v5 for internal consistency. Main has since standardized on @v7 (dependabot bump #...), so four of this branch's refs (the e2e-none, opensandbox, and badge-on-main uploads) were effectively downgrading @v7 → @v5 after each rebase. Bring all six refs to @v7 so the branch matches main's convention and introduces no version drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Adds
collect_artifacts: declare doublestar glob patterns (*.md,a/**/b,**/*.json) that select files from the case workspace to download as run artifacts — regardless of whether the agent succeeded, failed, or timed out.Previously, files only landed in
outputs/when the agent itself reported them viaSessionResult.Artifacts.GeneratedFiles(basename-flattened), and on timeout/error that download path was skipped entirely. This gives evals an explicit, declarative way to capture workspace outputs even on failure.Behaviour
<output-dir>/<case>/<config>/outputs/workspace/<relative-path>, preserving the workspace-relative path (report/run-1/x.json→outputs/workspace/report/run-1/x.json).cases.defaults.collect_artifacts— applies to every case.collect_artifacts— appended on top.*matches within a path segment,**matches across directories.Implementation notes for reviewers
collectGlobArtifactsis invoked inexecuteCaseOnceafter the agent run and before both the timeout/error early-return and the deferredrt.Close(). The case context may already be cancelled /DeadlineExceeded, so collection runs on a detached context (context.WithoutCancel+ its own 60s deadline).find . -type fthrough the runtime, then matches in Go with doublestar — works uniformly acrossnone/docker/opensandbox.find+DownloadFile, never mutates git state. Distinct fromreport.artifacts(artifact types) and theagent_judgeworkspace diff (a string fed to the judge, not downloaded files). Placed after the diff finalize so it can't pollute the diff.skill-up validateviadoublestar.ValidatePattern.Tests
internal/evaluator: relative-path preservation, download even when the context is cancelled (timeout simulation), no-patterns no-op, union/dedup resolution.internal/config: valid globs accepted; empty + invalid globs rejected.Docs
docs/{guide,zh/guide}/writing-evals.md(full-reference + dedicated section), bothcli-reference.mdoutput trees (outputs/workspace/),skills/skill-upper/references/eval-yaml.md, andCHANGELOG.md.🤖 Generated with Claude Code