Skip to content

fix: harden artifact store, skills loader, hooks, shell and agent warnings#2480

Merged
dgageot merged 1 commit intodocker:mainfrom
dgageot:board/code-review-find-bugs-and-security-issue-d9ce6d2b
Apr 21, 2026
Merged

fix: harden artifact store, skills loader, hooks, shell and agent warnings#2480
dgageot merged 1 commit intodocker:mainfrom
dgageot:board/code-review-find-bugs-and-security-issue-d9ce6d2b

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented Apr 21, 2026

A code review of the repository surfaced several bugs and security issues. This PR groups the fixes for the ones that can be addressed without touching unrelated files.

Fixes

pkg/content/store.go — path traversal via crafted digest

resolveIdentifier previously accepted any sha256:-prefixed string verbatim, so an identifier like sha256:../../etc/passwd would flow through filepath.Join(baseDir, digest+".tar") in GetArtifactImage / GetArtifactPath / DeleteArtifact, allowing reads or deletions outside the artifact store. Digests are now strictly validated (sha256: + 64 lowercase hex characters) both for the bare-digest and digest-reference (repo@sha256:...) forms, on the value read back from the refs index, and in StoreArtifact (defense-in-depth before the write path). Malformed digests are rejected with a new ErrInvalidDigest sentinel.

pkg/skills/remote.go — path traversal via remote skill name

The remote-skills loader already validated entry.Files against path traversal, but not entry.Name. The skill name flows into filepath.Join(cacheBase, urlHash, skillName) and into the returned Skill's FilePath / BaseDir, so a hostile remote index could use a name like ../evil or a/b to cache files outside the skills cache directory and later redirect SKILL.md reads to attacker-chosen filesystem locations. Skill names now must match a conservative regex (^[A-Za-z0-9_-][A-Za-z0-9._-]{0,127}$).

pkg/hooks/executor.go — PreToolUse hooks failed open

aggregateResults silently skipped hook results whose process failed to run (timeout, context cancellation, shell not found, binary missing…), leaving finalResult.Allowed == true. For PreToolUse hooks — which are used as a security gate around tool calls — this was a fail-open: a hook configured to deny e.g. rm -rf could be bypassed by anything that made it error out before reporting a verdict.

Now:

  • executeHook normalizes a fired timeout / parent-context cancellation to a plain execution error, so the condition is reliably detectable by the caller.
  • aggregateResults, when eventType == EventPreToolUse and a hook returned an execution error, sets Allowed = false and surfaces the reason in Message / Stderr. Other event types (PostToolUse, SessionStart, Stop, Notification, …) keep the log-and-continue behavior since they are observational, not gating.

pkg/agent/agent.go — data race on pendingWarnings

addToolWarning and DrainWarnings mutate a.pendingWarnings (append / read + reset) from any goroutine that calls Agent.Tools(), StartedTools() or ensureToolSetsAreStarted(). Those are reached from the runtime loop, the MCP server, the session manager, the TUI and tests — often in parallel — so the concurrent writes were a data race and could corrupt the slice or produce torn reads. A dedicated warningsMu sync.Mutex now guards both paths.

pkg/tools/builtin/shell.go — zombie child on createProcessGroup failure

RunShellBackground and runNativeCommand both call cmd.Start() and then createProcessGroup(cmd.Process). If createProcessGroup failed, the child had already been started — but the existing code either leaked it (runNativeCommand: returned without killing or waiting) or only called kill without a subsequent cmd.Wait() (RunShellBackground), leaving a zombie and holding the stdout/stderr pipes open until GC. A new reapSpawnedChild helper sends SIGTERM via the process group, escalates to SIGKILL after a short grace period, and always calls cmd.Wait(). Both error paths now use it.

Testing

  • Targeted tests added for each fix, including a race-aware test for pendingWarnings.
  • mise lint — 0 issues across 754 files.
  • go test ./... — all packages pass.
  • go test -race — all affected packages pass.

Known issues intentionally not fixed here

The review also surfaced several issues outside the footprint of this PR (SSRF in the fetch built-in tool and in OAuth resource-metadata discovery, env-var injection in the script_shell tool, file-permission widening in filesystem edit/write, and unsynchronized in-memory session metadata). They're best addressed in follow-up PRs scoped to those subsystems.

…nings

A code review of the repository surfaced several bugs and security

issues that this commit addresses together:

  * content/store: resolveIdentifier and StoreArtifact now strictly

    validate digests ("sha256:" + 64 hex chars) before using them as

    filesystem path components. Without this, an identifier such as

    "sha256:../../etc/passwd" flowed through filepath.Join and let

    GetArtifact*/DeleteArtifact read or delete files outside the

    store directory. A new ErrInvalidDigest sentinel is returned and

    the refs-file path revalidates the digest on read.

  * skills/remote: the remote-skills loader now rejects index entries

    whose name is not a plain single path component (matched by a

    conservative regex). Previously only entry.Files was validated,

    so a hostile index could use a name like "../evil" to write

    cache files outside the cache directory and later redirect

    SKILL.md reads to attacker-chosen filesystem locations.

  * hooks/executor: PreToolUse hooks that fail to run to completion

    (timeout, parent-context cancellation, spawn error, missing

    binary, ...) now deny the tool call instead of silently allowing

    it. executeHook normalizes a fired context to a plain execution

    error so aggregateResults can reliably fail closed on that event.

    PostToolUse and other observational events keep their log-and-

    continue behavior.

  * agent: Agent.pendingWarnings is now guarded by a dedicated

    sync.Mutex. addToolWarning and DrainWarnings are called from the

    runtime loop, the MCP server, the TUI and the session manager in

    parallel, and the concurrent append/read+clear was a data race.

  * tools/builtin/shell: when cmd.Start() succeeds but the follow-up

    createProcessGroup call fails, a new reapSpawnedChild helper

    sends SIGTERM (with a SIGKILL escalation after a grace period)

    and calls cmd.Wait() so the child is reaped and its stdout/

    stderr pipes are closed. Both affected error paths now use it.

Each change ships with targeted tests (including -race where

relevant). mise lint is clean and the full test suite passes.

Assisted-By: docker-agent
@dgageot dgageot requested a review from a team as a code owner April 21, 2026 08:22
@dgageot dgageot merged commit 93617e1 into docker:main Apr 21, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants