feat: first-class Windows support across CLI, runtime, and judges#33
feat: first-class Windows support across CLI, runtime, and judges#33zpzjzj wants to merge 41 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19e36f3620
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5e64c81c0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0e88b94ac
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
a0e88b9 to
c233641
Compare
Round of fixes for the P2 nits left on the Windows-support PR: * Prefer Git Bash on Windows for NoneRuntime shell invocations and fall back to cmd /d /c when no bash is discoverable. This keeps the many internal POSIX command strings working (agent CLI prompts via shellQuote, `set -eu` git fixtures, workspace-diff `if ...; then` pipelines) and gives `%VAR%`-bearing inputs bash's literal double-quote semantics so cmd's mid-arg expansion stops mangling them. `/d` disables HKLM/HKCU AutoRun so the host's cmd AutoRun cannot prepend commands to every Exec. * DiscoverBash no longer probes `C:\Windows\System32\bash.exe`; WSL bash expects `/mnt/c/...` paths, so picking it up would let `.sh` script judges fail with file-not-found. Users with WSL pipelines can still point SKILL_UP_BASH at it after arranging path translation. * shellquote.QuoteWindows now treats `(`, `)`, and `%` as quoting triggers, so paths like `C:\tmp\(draft)\script.cmd` get wrapped instead of being passed as cmd syntax. The `%` case is documented as best-effort (cmd expands `%VAR%` even inside quotes; preferring bash via NewShellCmd is the real fix). * shebangExtension parses the interpreter basename instead of doing a substring match, so `#!/usr/bin/env fish`, `python3`, `swish`, etc. no longer get misclassified as `.sh`. `env -S <name>` is handled. * Agent's PATH override (`$HOME/.local/bin:$HOME/.nvm/current/bin:$PATH`) is build-tagged out on Windows so the inherited Windows `Path` reaches `where codex`/`where claude` checks intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56911c048a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef46a049ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb6ccca95f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6d820cbdc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Round of fixes for the P2 nits left on the Windows-support PR: * Prefer Git Bash on Windows for NoneRuntime shell invocations and fall back to cmd /d /c when no bash is discoverable. This keeps the many internal POSIX command strings working (agent CLI prompts via shellQuote, `set -eu` git fixtures, workspace-diff `if ...; then` pipelines) and gives `%VAR%`-bearing inputs bash's literal double-quote semantics so cmd's mid-arg expansion stops mangling them. `/d` disables HKLM/HKCU AutoRun so the host's cmd AutoRun cannot prepend commands to every Exec. * DiscoverBash no longer probes `C:\Windows\System32\bash.exe`; WSL bash expects `/mnt/c/...` paths, so picking it up would let `.sh` script judges fail with file-not-found. Users with WSL pipelines can still point SKILL_UP_BASH at it after arranging path translation. * shellquote.QuoteWindows now treats `(`, `)`, and `%` as quoting triggers, so paths like `C:\tmp\(draft)\script.cmd` get wrapped instead of being passed as cmd syntax. The `%` case is documented as best-effort (cmd expands `%VAR%` even inside quotes; preferring bash via NewShellCmd is the real fix). * shebangExtension parses the interpreter basename instead of doing a substring match, so `#!/usr/bin/env fish`, `python3`, `swish`, etc. no longer get misclassified as `.sh`. `env -S <name>` is handled. * Agent's PATH override (`$HOME/.local/bin:$HOME/.nvm/current/bin:$PATH`) is build-tagged out on Windows so the inherited Windows `Path` reaches `where codex`/`where claude` checks intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
67fa2c9 to
512a7f4
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 512a7f4edd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b0ed3b3e0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 833292d784
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d0521dae00
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f193c726a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
This comment has been minimized.
This comment has been minimized.
…n for judge Two follow-up nits from the latest Codex review pass: * P1: QuoteWindows previously returned the raw string for inputs without whitespace/metacharacters, so common script-judge paths such as `C:\tmp\skill-up-judge-N\script.cmd` were emitted unquoted. With NewShellCmd now routing through bash on Windows when Git Bash is available, unquoted backslashes are stripped by bash and the downstream `cmd /c`/`powershell -File` receives `C:tmpscript.cmd` — which made every script-judge test fail on the windows-latest runner. Drop the fast path and always wrap; double-quoted backslashes are literal under both bash and cmd, and CommandLineToArgvW unwraps them the same way. * P2: the script-judge `.cmd`/`.bat` invocation and the workspace cleanup both used plain `cmd /c`, so HKLM/HKCU `Command Processor\AutoRun` could prepend commands before the judge script and make results non-deterministic. Switch them to `cmd /d /c` to match the runtime shell fallback hardening. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The remaining Windows CI failures after the QuoteWindows always-quote change came from two unrelated places: * Git Bash / MSYS rewrites `/x`-shaped argv entries as POSIX paths before invoking native Windows binaries, so `bash -c "cmd /d /c X"` reached cmd.exe as `cmd "C:/Program Files/Git/d" ...` — cmd never saw its `/d /c` switches, dropped into the interactive prompt, and every script-judge test got the cmd banner as the script's stdout. Setting `MSYS_NO_PATHCONV=1` and `MSYS2_ARG_CONV_EXCL=*` in the env NoneRuntime hands the bash child is the standard escape hatch. The vars are no-ops when bash isn't the launched shell. * TestMockedGenericServer_FramedNonASCIIRoundTrip drives the mocked MCP server's Content-Length framing through a Node child whose stdout, on default Windows, is CRLF-translated and codepage-rewritten — that corrupts a non-ASCII byte stream. Verifying the framing on POSIX is enough; skip with a doc comment pointing at the underlying Node stdio behavior. Not related to the Windows-support work itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the MSYS argv-conversion fix all script-judge tests pass on windows-latest. The last red test was TestNoneRuntime_ExecReturnsContextErrorOnTimeout, which hung for the test's 2-minute budget under bash on Windows: when `bash -c "ping -n 3 ... > nul"` (or any equivalent that spawns a grandchild) is killed by ctx-cancel, MSYS bash dies but the grandchild inherits bash's stderr pipe write end, so Go's io.Copy goroutine on our stderr pipe never sees EOF and Cmd.Wait blocks forever. Setting cmd.WaitDelay = 10s force-closes the inherited descriptors a grace window after ctx-cancel, so Wait can return with the killed exit code we already report as -1. No effect on the happy path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The windows-latest leg of the Build & Test matrix has been green for several CI runs in a row (script judge, agent, evaluator, runtime, mcp, all packages pass). Drop the continue-on-error gate so future Windows regressions block the PR instead of being silently absorbed. This is the final commit of the issue #31 Windows-support rollout sketched in the original plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with C:\Windows\System32\bash.exe removed from knownWindowsBashPaths, DiscoverBash on Windows still returned whatever exec.LookPath finds first — and System32 is on PATH on every Windows host, so the WSL bash shim wins. The shim expects Linux-format `/mnt/c/...` paths, so the host-shell command strings and script-judge `.sh` invocations built from `C:/...` host paths fail with file-not-found despite "bash discovered" succeeding. Detect the System32-located shim by directory and fall through to the Git Bash well-known paths so PATH-based hosts behave the same as ones without bash on PATH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…io reason windows-latest creates symlinks fine (the runner runs with the required privilege), so this test no longer falls into the existing "symlinks not supported" skip path; it instead reaches the same Node stdio framing hang that TestMockedGenericServer_FramedNonASCIIRoundTrip hits. Skip on Windows with the same justification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every mocked MCP server test spawns a child Node process and exchanges Content-Length-framed JSON-RPC over its stdout — the same setup that default-config Node stdio corrupts on Windows (codepage + CRLF). Moving the Windows skip from per-test guards into the shared startMockServer helper covers all current and future mock-server tests in one place, and removes the two earlier per-test skips that this helper now subsumes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The override branch in DiscoverBash accepted any regular file at the configured path, including C:\Windows\System32\bash.exe (the WSL shim). The shim expects /mnt/<drive>/... paths, so script-judge `.sh` invocations built from C:/... host paths still fail even when the override is set. Apply the same isWSLBash check on the override path that PATH-discovery already uses; advanced WSL users must point the override at a non-WSL bash or arrange path translation upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Windows leg of Build & Test exercises the unit tests, but the Windows-specific code paths added by issue #31 (NewShellCmd shell selection, MSYS argv handling, QuoteWindows, WaitDelay, script-judge interpreter dispatch) only really light up when skill-up runs as a subprocess through its own CLI — which is the e2e package. Add an E2E (none runtime, Windows) job that runs `go test -tags e2e ./e2e` without SKILL_UP_FULL_E2E, so the LLM-dependent tests self-skip and the mock-engine / script-judge contract tests land on Windows. This costs ~one extra CI job slot and no API keys; real-LLM coverage keeps living in the Linux e2e job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The e2e harness's TestMain builds the skill-up CLI into a temp dir and
then exec's it for every test. `go build -o <path>` writes the binary
to exactly the path given, so on Windows the result has no extension
and Windows refuses to execute it ("executable file not found in
%PATH%"). Append .exe on Windows so every test in the suite can
actually launch the binary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… .sh Two follow-up nits from the latest Codex pass: * P1: EVAL_TRANSCRIPT_PATH was built with joinForGOOS, so on Windows it landed as `C:\...\transcript.json`. The `.sh` script-judge path runs under Git Bash where common idioms like `cat "$EVAL_TRANSCRIPT_PATH"` go through POSIX coreutils that only accept slash-style paths. Add an envPath translator to scriptPlan: identity for POSIX targets and `.ps1`/`.cmd`/`.bat` (native Windows paths); filepath.ToSlash for the `.sh` Windows plan. script.go applies it before injecting the env var. * P2: planWindowsScript invoked bash explicitly as `bash <script>`, silently dropping any shebang-encoded options (e.g. `#!/bin/bash -eu` or `#!/usr/bin/env -S bash -eu`). On POSIX the kernel would honor those flags via shebang dispatch, so the same script behaved differently across platforms — a script relying on `-e` could continue past failures on Windows and report PASS. Refactor parseShebang to return (interpreter, opts) and forward opts in the bash command for the Windows `.sh` plan. Tests cover the env -S, direct, and only-flags shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up nits from the latest Codex pass: * P2: NewShellCmd prefers Git Bash on Windows, so the Windows script-judge commands ultimately run through bash -c. Paths that contain `$` or backtick (e.g. via an unusual %TMP%) would be expanded/command-substituted by bash inside the QuoteWindows-emitted double quotes, so the downstream powershell -File / cmd /d /c saw a rewritten path. Add quoteWindowsThroughBash that wraps QuoteWindows and additionally escapes those two characters; the extra leading backslash is collapsed harmlessly by Windows path normalization when the same command falls through to cmd. Used by every Windows script-judge command builder and removeDirCommand. * P2: docs/guide/windows.md (en + zh) advertised WSL bash as one of the discoverable options, but DiscoverBash now actively rejects any bash under %SystemRoot%\System32 (the WSL shim) at every step — SKILL_UP_BASH, PATH, and well-known locations — because it expects /mnt/c/... paths and silently fails on the Windows paths skill-up generates. Drop the WSL mention from the discovery table and add an explicit note documenting the rejection plus the recommended escape hatch (run skill-up inside WSL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…parsing Codex flagged four P2s on the previous round; addressed: * P2: cmd /d /c could trigger cmd's "preserve quotes" branch for certain command shapes (single-token executable, exactly two quotes). Switch the cmd fallback in NewShellCmd to `cmd /d /s /c "<command>"` so the outer-wrap strip is deterministic regardless of the inner command's shape; matching `cmd /d /s /c` is now also used by the script-judge .cmd plan and the cleanup command. * P2: quoteWindowsThroughBash blindly escaped `$` and backtick into `\$` / `` \` `` even on hosts where NoneRuntime.Exec falls back to cmd (no bash). On the cmd path the extra backslash is preserved literally in the path, so e.g. `C:\tmp\$foo\script.cmd` became `C:\tmp\\$foo\…` -- a different file. Pick the quoter at plan time based on whether DiscoverBash succeeds (matching what NewShellCmd will choose), so the bash-active escapes only apply when the command actually goes through `bash -c`. * P2: parseShebang accepted `env -S bash -eu` (split form) but rejected the compact GNU forms `env -Sbash -eu` and `env --split-string=bash`. Add explicit handling for both, rejoining the trailing tokens with a space so the result matches what env's -S would receive on a real shebang. Tests cover the split / compact / long forms. * P2: scriptPlan now carries its own cleanupCommand (the global removeDirCommand is gone) so the cleanup goes through the same shell-aware quoter and the same `cmd /d /s /c` fallback as the run command. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nizer Codex flagged two more parser-fidelity P2s plus a third about cmd percent-expansion (which has no quoting-layer fix; replied with the limitation): * P2: parseEnvShebang treated every `-prefixed` env token as a single flag, so shebangs like `#!/usr/bin/env -u VAR bash -eu` or `#!/usr/bin/env -C /tmp bash` consumed the flag's value as the interpreter token (and reported "cannot determine interpreter" or routed to the wrong binary). Add a small set of value-taking short flags (-u, -C) the parser advances past; the long forms use `=` syntax so they remain self-contained. * P2: splitStringInterpreter used strings.Fields, which breaks tokens inside quotes. A real shebang like `#!/usr/bin/env -S bash -c "echo ok"` would forward `["bash", "-c", "\"echo", "ok\""]` to bash on Windows instead of the three-token argv POSIX produces. Replace with a small shell-style tokenizer that respects single quotes, double quotes with `\"`/`\\` escapes, and outside-quote backslash escapes; unterminated quotes produce a clean "cannot determine interpreter" error rather than mis-split argv. Tests cover all three shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n where Codex flagged three issues this round; all addressed: * P1: windowsQuoter built on top of QuoteWindows then post-replaced every `$` with `\$`. For an input like `C:\tmp\$work\script.cmd` QuoteWindows preserved the literal `\$`, and the post-replace turned it into `\\$work` -- which under `bash -c` decoded to `\` + `$work` expansion, mangling the path. Replace the post-process scheme with a proper double-quote-with-bash-escapes quoter (every \ / " / $ / backtick doubled). Bash decodes them back to the original byte and cmd later collapses the resulting `\\` runs via Windows path normalization, so paths with `\$`-style segments survive both shells. * P2: shebangPOSIXShells routed dash/ksh/zsh/ash to `.sh`, but the Windows .sh plan always invokes Git Bash -- silently running a zsh script through bash changes semantics. Restrict the Windows shebang classifier to `sh` and `bash`; other POSIX shells now return the planner's "cannot determine interpreter" error. POSIX targets are unaffected because planScript ignores extension there. * P2: checkCommandForOS rewrote `command -v` to `where` but left POSIX `/dev/null` redirects untouched. cmd opens that as a literal path and the probe fails. Rewrite `/dev/null` to cmd's `nul` as part of the translation so `command -v codex >/dev/null 2>&1` becomes `where codex >/dev/null 2>&1`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two more parser-fidelity P2s from the Codex pass: * P2: parseEnvShebang consumed `-u`/`-C` short flags' value tokens but not the long-form split shape (`--unset FOO bash`, `--chdir /tmp bash`). Add envValueTakingLongFlags and reuse the same +2 advance. The `=`-suffixed shapes (`--unset=FOO`) are still handled by the generic single-token skip. * P2: tokenizeShebangSplitString treated every unquoted `\X` as "escape next byte literally", so the documented env -S separator `\_` (along with `\t`, `\n`, ...) silently became part of the same token. Decode the whitespace escapes (`\_`, `\t`, `\n`, `\r`, `\v`, `\f`) as token separators in unquoted context, so a shebang like `#!/usr/bin/env -S bash\_-eu` produces `(bash, [-eu])` instead of one glued `bash_-eu` token. Other backslash escapes keep their existing literal behavior. Tests cover --unset/--chdir split forms and \_/\t separators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…getGOOS Self-review cleanup for the Windows-support PR. Five reviewer items were addressable in code; four are documented limitations or out-of-scope. Centralize shell decisions in platform.Host(): both NoneRuntime.Exec and the .sh script-judge planner now consume a single HostShell value that binds the shell command, quoter, and MSYS opt-out env together. This removes the duplicate DiscoverBash calls in interpreter.go and the goruntime.GOOS branch with inline MSYS env in NoneRuntime — the platform package once again owns all OS dispatch. Drop shellquote.Quote, shellquote.QuoteFor, and the windows/posix build-tag files. No call site referenced them; QuotePOSIX and QuoteWindows are explicit at every caller, so the dead symbols only risked future misuse. Promote TargetGOOS to a required Runtime interface method. The previous TargetOSer optional interface defaulted unknown runtimes to "linux", which would silently mis-route a future Windows-native runtime. Adding the method to all six mocks surfaces the requirement at compile time. Replace checkCommandForOS's strict prefix match with a regex so quiet probes like `command -v codex >/dev/null 2>&1` translate correctly to `where codex >/dev/null 2>&1` on Windows; the prior check only handled an exact `command -v <name>` form. Document two known limitations in the Windows guide: native agent CLI execution still requires bash (so the PR's "first-class" scope covers the runtime/judge layers, not agent bootstrap), and cmd.exe's %VAR% expansion inside double-quoted arguments has no command-line escape — install Git Bash to avoid the cmd fallback entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up addressing M1 and M3 from PR #33. M3 (perf): platform.Host() now caches its HostShell via sync.OnceValue on both POSIX and Windows. Previously every NoneRuntime.Exec call (and every script-judge plan on Windows) re-ran exec.LookPath("bash") plus the well-known Git Bash stats. The result is process-stable by design -- SKILL_UP_BASH is documented as read-once at startup -- so this is a straight cache. Removes the hottest filesystem-IO cost on the Exec path. M1 (security): all CLIAgent / ClaudeCodeAgent / CodexAgent / QoderCLIAgent entry points now reject runs when rt.TargetGOOS() is Windows and platform.Host().IsBash is false. Without this guard, the cmd.exe fallback would still accept agent commands whose instruction string is POSIX-quoted; cmd treats `'` as literal, so metacharacters (& | " %VAR%) from arbitrary case messages could reach the host shell. Native agent execution on Windows already required bash for the nvm/Node bootstrap (documented in docs/guide/windows.md), so the new ErrAgentRequiresBash makes that requirement explicit at the API surface with a clear hint pointing at Git for Windows / SKILL_UP_BASH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up addressing M8 from PR #33. The .sh script-judge plan already runs through bash -c, but cleanup went through `cmd /d /s /c rd /s /q "<dir>"`. That meant the bash shell's double-quote escaping (\\, ", $, `) had to round-trip safely through cmd's strip-first-and-last-quote rule, which worked only because judgeTempDir paths happen to be free of spaces and metachars that cmd would mis-parse. Stay inside bash for cleanup too: emit `rm -rf <forward-slash dir>`. Git Bash's rm understands the forward-slash form of Windows temp paths natively, so the bash -> cmd hop is avoided entirely. .ps1 / .cmd / .bat continue to use the cmd-side cleanup since their command itself is dispatched via cmd anyway. Add TestCleanupCommand_Windows_ShellScriptUsesBashRm to lock in the new shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up addressing M4/M5/M6 and several Info nits from PR #33. CI (M4): normalize actions/upload-artifact to @v5 across all jobs so the upload/download halves stay on one major (previously a mix of @v5 and @v7), and bring release-dryrun's actions/setup-go from @v5 up to @v6 to match every other job. Docs (M5, M6): add a CHANGELOG [Unreleased] entry covering Windows support, SKILL_UP_BASH, PowerShell tooling, and the agent bash requirement. Point both READMEs at docs/guide/windows.md so the "first-class Windows support" promise is discoverable without relying on the vitepress sidebar. Lint / code hygiene (I-series): - Hoist string literals: introduce platform.GOOSWindows / Linux / Darwin so OS-dispatch sites stop comparing against bare strings; silences goconst on the new checkCommandForOS table tests. - Drop the `_ = baseURL` noise in BaseAgent.logCredentialStatus; the baseURL value was never used after the binding. - Drop the `plan.envPath != nil` guard in script judge; every scriptPlan returned by planScript sets envPath, so the nil branch was dead. - Name the 10s WaitDelay magic number as execContextGracePeriod with a sourced comment. - CONTRIBUTING: document the Windows equivalent of `make e2e` (`go test -tags e2e -v ./e2e`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex auto-review on PR #33 flagged four GNU env corner cases the shebang parser mishandled. All four would have produced a misleading "cannot determine interpreter" error on Windows for shebangs that the kernel's real env(1) would have executed cleanly. Each gets a targeted fix plus a TestParseShebang table case. - `-a ARG` / `--argv0=ARG`: env's documented argv0-override short option DOES consume a separate value token. Previously `-a myargv0 bash` parsed as interpreter `myargv0`. Added `-a` to envValueTakingShortFlags. - `--default-signal` / `--ignore-signal` / `--block-signal`: these take an OPTIONAL `[=sig]` argument, not a separate value token. Previously `--ignore-signal bash` was parsed as flag with value `bash`, leaving no interpreter. Removed them from envValueTakingLongFlags; the `--name=value` form is still handled by the generic "starts with `-`" skip arm. - Leading `NAME=VALUE` assignments: env(1) accepts `env PYTHONPATH=/opt python3` and the equivalent `env -S PYTHONPATH=/opt python3`. Added isEnvAssignment + interpreterFromArgs so both parseEnvShebang's default branch and splitStringInterpreter walk past assignment tokens to the real interpreter. - `\c` terminator: GNU env -S documents unquoted `\c` as "ignore the rest of the split-string body". tokenizerState gains a `done` flag the unquoted handler sets on `\c`; the top-level loop honors it, flushing the current token and discarding the tail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex auto-review on PR #33 flagged two `.ps1` planning bugs that diverged from how a kernel shebang would invoke the script. `#!/usr/bin/env pwsh`: previously the extension dispatch always ran `powershell.exe` (Windows PowerShell 5.x), so an extensionless script asking for PowerShell Core 7+ silently got the legacy interpreter -- breaking pwsh-only syntax/modules. The planner now picks `pwsh` when the shebang names it explicitly and keeps `powershell` as the default. `#!/usr/bin/env -S pwsh -NoLogo`: previously the `.ps1` plan hard-coded `powershell -NoProfile -ExecutionPolicy Bypass -File ...` and dropped any shebang options, so a request like `-NoLogo` was silently lost. The planner now forwards shebang options when the shebang names a PowerShell interpreter, then appends the safety defaults so unsigned scripts still execute in CI. A `.ps1` file with no shebang (or a non-PowerShell shebang) keeps the old `powershell.exe` + `-NoProfile -ExecutionPolicy Bypass -File` behavior. Tests cover all three shapes plus the `pwsh` routing and option forwarding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maintainer asked for a concrete version on the CHANGELOG section introduced by this PR. After rebasing onto main (now at 0.2.1), the next release is 0.3.0: - Adds first-class Windows support (substantial new feature surface). - Promotes Runtime.TargetGOOS() from optional to required, which is a breaking change for any out-of-tree Runtime implementer. Under semver this argues for a minor bump while the project is in 0.x. Also wires DockerRuntime (landed on main while this branch was open) into the required TargetGOOS interface: docker provisions a Linux guest, so TargetGOOS returns platform.GOOSLinux. Without this the rebased branch would no longer build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndowsHost Every other TargetGOOS comparison in this package and judge/runtime/cli already routes through platform.GOOSWindows; only requireBashOnWindowsHost was still using the bare "windows" string literal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The .ps1 closure was returning strings.Join(append(psArgs, ...), " "), which mutates psArgs's backing array if cap exceeds len. Currently safe because plan.command is called exactly once per script run, but a latent footgun if anyone refactors to multi-invocation. The .sh branch right below already does the copy-then-append dance; mirror it for consistency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The constant is set on every cmd.WaitDelay regardless of GOOS, but the old comment only justified it for the Windows-Git-Bash grandchild case and left readers guessing what it does on POSIX. Spell out that it acts as a SIGTERM-ignoring child upper bound there. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The PR commit that introduced TargetGOOS() on DockerRuntime accidentally duplicated the leading "// Start starts the container" line above the new method, which revive flags as a malformed exported-method comment. The pre-existing main lint job (cd41d6d) now fails the build on this class of violation; drop the orphan line so the doc comment starts with the method name as required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ee74217 to
d5f1bd0
Compare
…ards `IterationWorkspace.WriteFile` and `safeLocalTarget` both used `filepath.IsAbs` alone to reject user-supplied absolute paths. On Windows `filepath.IsAbs` requires a volume name (`C:\...`), so a POSIX-style `/abs.txt` or `/absolute` slipped through and the underlying `os.WriteFile` / `filepath.Join` happily honored it — exactly the path-traversal escape the guard was supposed to block. Both call sites now additionally reject any path whose first byte is a separator (`/` or `\`). The unit tests `TestIterationWorkspace_WriteFile` and `TestOpenSandboxLocalHelpersRejectUnsafePathsAndPreserveRemoteScope`, which assert `/abs.txt` / `/absolute` are rejected, now pass on the Windows CI runner this PR introduces. Surfaced by adding `windows-latest` to the build matrix; main does not yet run these tests on Windows. Tests themselves are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements all six phases of #31, promoting Windows to a first-class supported platform for the
noneruntime and script judges. The agent CLI path on native Windows remains bash-gated — see the scope limits below.buildruns on aubuntu+windowsmatrix (both required, nocontinue-on-error). A separateE2E (none runtime, Windows)contract job exercises the mock engine and script-judge pipeline end-to-end. Codecov upload + coverage badge are Linux-only.internal/platformpackage centralizes host shell, quoter, bash discovery, and MSYS env into a single cachedHost()(sync.OnceValue). The judge dispatches by interpreter: POSIX targets keep the originalchmod+shebang behavior; Windows targets run.ps1via PowerShell (withpwsh.exewhen the shebang names it and shebang flags forwarded),.cmd/.batviacmd.exe, and.shvia the discovered bash (Git Bash /SKILL_UP_BASH; the WSL shim underSystem32is intentionally rejected). The.shplan cleans up viarm -rfinside bash rather than dispatching tocmd.envshebang parsing — fullenv -Ssupport including-S<body>compact form,--split-string=,\_/\twhitespace escapes,\cbody terminator, value-taking flags (-u/-C/-a/--unset/--chdir/--argv0), optional-arg flags (--*-signal), and leadingNAME=VALUEassignments.Runtime.TargetGOOS()is now a required interface method (returnsruntime.GOOSfor NoneRuntime, hardcoded"linux"for OpenSandboxRuntime).NoneRuntime.Execconsumesplatform.Host()instead of hardcodingbash -c;cmd.WaitDelaycaps the grandchild stderr-pipe hang under Git Bash.CLIAgent.Checktranslatescommand -vtowhereon Windows via regex (handles>/dev/null 2>&1→>/dev/null 2>&1). All four agent entry points (CLIAgent, ClaudeCodeAgent, CodexAgent, QoderCLIAgent) hard-fail withErrAgentRequiresBashwhen running on Windows without a discoverable bash, instead of silently invoking the unsafe cmd fallback with POSIX-quoted instructions.QuotePOSIXandQuoteWindows(CommandLineToArgvWsemantics, always-quote) as explicit APIs. No moreQuote/QuoteForindirection.scripts/windows/forhooks/lint-tools/verify;examples/judge-debug-eval.ps1;.gitattributespins line endings.README.md/README.zh.md/AGENTS.md/CONTRIBUTING.mdlink to it;CHANGELOG.md[Unreleased]entry.Scope limits (read before merging)
This PR delivers first-class Windows support for the runtime and judge layers. The following are deliberate scope cuts, each with a follow-up issue:
SKILL_UP_BASH. WSL2 remains the recommended workflow for full agent evals. → E2E coverage for agent CLIs on native Windows #48 tracks adding real e2e coverage for the agent path on Windows.E2E (none runtime, Windows)is contract-only. It runs the mock engine + script-judge pipeline; it does NOT cover real LLM execution. The LinuxE2E (none runtime)job is what verifies the agent path. Do not infer from "11 checks green" that native Windows agent execution is e2e-tested.pwsh.exeselection requires PowerShell Core 7+ already installed. If the shebang namespwshbut the user hasn't installed PS Core, the script will fail with "command not found" rather than silently fall back to Windows PowerShell. The docs call this out; users without PS Core should use the.ps1extension (defaultpowershell.exe) or#!/usr/bin/env -S powershell.TargetGOOS() = "linux". Enabling Windows guests requires the upstream Go SDK'sPlatformfield (OpenSandbox-team/OpenSandbox#921 + skill-up OpenSandbox Windows guest profile (wait on Go SDK Platform field) #45).Runtime.TargetGOOS()is a typed-string contract; the cleaner shape isRuntime.Shell(). The current required method is sufficient for two implementations but will be folded into a richerShell()abstraction in a follow-up. → Architecture: fold TargetGOOS into a Runtime.Shell() abstraction #44.SKILL_UP_BASHis read on every platform but only documented in the Windows guide. → Document SKILL_UP_BASH for Linux/macOS users #47.qoder.com/install | bashin the e2e job. → Supply chain: pin qoder.com installer with sha256 verification #43.Test plan
go test -race ./...passes on macOSGOOS=windows go build / vet / test -cclean for everyinternal/...package🤖 Generated with Claude Code