fix: harden artifact store, skills loader, hooks, shell and agent warnings by dgageot · Pull Request #2480 · docker/docker-agent

dgageot · 2026-04-21T08:22:29Z

A code review of the repository surfaced several bugs and security issues. This PR groups the fixes for the ones that can be addressed without touching unrelated files.

Fixes

`pkg/content/store.go` — path traversal via crafted digest

resolveIdentifier previously accepted any sha256:-prefixed string verbatim, so an identifier like sha256:../../etc/passwd would flow through filepath.Join(baseDir, digest+".tar") in GetArtifactImage / GetArtifactPath / DeleteArtifact, allowing reads or deletions outside the artifact store. Digests are now strictly validated (sha256: + 64 lowercase hex characters) both for the bare-digest and digest-reference (repo@sha256:...) forms, on the value read back from the refs index, and in StoreArtifact (defense-in-depth before the write path). Malformed digests are rejected with a new ErrInvalidDigest sentinel.

`pkg/skills/remote.go` — path traversal via remote skill name

The remote-skills loader already validated entry.Files against path traversal, but not entry.Name. The skill name flows into filepath.Join(cacheBase, urlHash, skillName) and into the returned Skill's FilePath / BaseDir, so a hostile remote index could use a name like ../evil or a/b to cache files outside the skills cache directory and later redirect SKILL.md reads to attacker-chosen filesystem locations. Skill names now must match a conservative regex (^[A-Za-z0-9_-][A-Za-z0-9._-]{0,127}$).

`pkg/hooks/executor.go` — PreToolUse hooks failed open

aggregateResults silently skipped hook results whose process failed to run (timeout, context cancellation, shell not found, binary missing…), leaving finalResult.Allowed == true. For PreToolUse hooks — which are used as a security gate around tool calls — this was a fail-open: a hook configured to deny e.g. rm -rf could be bypassed by anything that made it error out before reporting a verdict.

Now:

executeHook normalizes a fired timeout / parent-context cancellation to a plain execution error, so the condition is reliably detectable by the caller.
aggregateResults, when eventType == EventPreToolUse and a hook returned an execution error, sets Allowed = false and surfaces the reason in Message / Stderr. Other event types (PostToolUse, SessionStart, Stop, Notification, …) keep the log-and-continue behavior since they are observational, not gating.

`pkg/agent/agent.go` — data race on `pendingWarnings`

addToolWarning and DrainWarnings mutate a.pendingWarnings (append / read + reset) from any goroutine that calls Agent.Tools(), StartedTools() or ensureToolSetsAreStarted(). Those are reached from the runtime loop, the MCP server, the session manager, the TUI and tests — often in parallel — so the concurrent writes were a data race and could corrupt the slice or produce torn reads. A dedicated warningsMu sync.Mutex now guards both paths.

`pkg/tools/builtin/shell.go` — zombie child on `createProcessGroup` failure

RunShellBackground and runNativeCommand both call cmd.Start() and then createProcessGroup(cmd.Process). If createProcessGroup failed, the child had already been started — but the existing code either leaked it (runNativeCommand: returned without killing or waiting) or only called kill without a subsequent cmd.Wait() (RunShellBackground), leaving a zombie and holding the stdout/stderr pipes open until GC. A new reapSpawnedChild helper sends SIGTERM via the process group, escalates to SIGKILL after a short grace period, and always calls cmd.Wait(). Both error paths now use it.

Testing

Targeted tests added for each fix, including a race-aware test for pendingWarnings.
mise lint — 0 issues across 754 files.
go test ./... — all packages pass.
go test -race — all affected packages pass.

Known issues intentionally not fixed here

The review also surfaced several issues outside the footprint of this PR (SSRF in the fetch built-in tool and in OAuth resource-metadata discovery, env-var injection in the script_shell tool, file-permission widening in filesystem edit/write, and unsynchronized in-memory session metadata). They're best addressed in follow-up PRs scoped to those subsystems.

…nings A code review of the repository surfaced several bugs and security issues that this commit addresses together: * content/store: resolveIdentifier and StoreArtifact now strictly validate digests ("sha256:" + 64 hex chars) before using them as filesystem path components. Without this, an identifier such as "sha256:../../etc/passwd" flowed through filepath.Join and let GetArtifact*/DeleteArtifact read or delete files outside the store directory. A new ErrInvalidDigest sentinel is returned and the refs-file path revalidates the digest on read. * skills/remote: the remote-skills loader now rejects index entries whose name is not a plain single path component (matched by a conservative regex). Previously only entry.Files was validated, so a hostile index could use a name like "../evil" to write cache files outside the cache directory and later redirect SKILL.md reads to attacker-chosen filesystem locations. * hooks/executor: PreToolUse hooks that fail to run to completion (timeout, parent-context cancellation, spawn error, missing binary, ...) now deny the tool call instead of silently allowing it. executeHook normalizes a fired context to a plain execution error so aggregateResults can reliably fail closed on that event. PostToolUse and other observational events keep their log-and- continue behavior. * agent: Agent.pendingWarnings is now guarded by a dedicated sync.Mutex. addToolWarning and DrainWarnings are called from the runtime loop, the MCP server, the TUI and the session manager in parallel, and the concurrent append/read+clear was a data race. * tools/builtin/shell: when cmd.Start() succeeds but the follow-up createProcessGroup call fails, a new reapSpawnedChild helper sends SIGTERM (with a SIGKILL escalation after a grace period) and calls cmd.Wait() so the child is reaped and its stdout/ stderr pipes are closed. Both affected error paths now use it. Each change ships with targeted tests (including -race where relevant). mise lint is clean and the full test suite passes. Assisted-By: docker-agent

dgageot requested a review from a team as a code owner April 21, 2026 08:22

trungutt approved these changes Apr 21, 2026

View reviewed changes

dgageot merged commit 93617e1 into docker:main Apr 21, 2026
9 checks passed

BrewTestBot mentioned this pull request Apr 21, 2026

docker-agent 1.49.0 Homebrew/homebrew-core#278620

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden artifact store, skills loader, hooks, shell and agent warnings#2480

fix: harden artifact store, skills loader, hooks, shell and agent warnings#2480
dgageot merged 1 commit intodocker:mainfrom
dgageot:board/code-review-find-bugs-and-security-issue-d9ce6d2b

dgageot commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dgageot commented Apr 21, 2026

Fixes

pkg/content/store.go — path traversal via crafted digest

pkg/skills/remote.go — path traversal via remote skill name

pkg/hooks/executor.go — PreToolUse hooks failed open

pkg/agent/agent.go — data race on pendingWarnings

pkg/tools/builtin/shell.go — zombie child on createProcessGroup failure

Testing

Known issues intentionally not fixed here

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`pkg/content/store.go` — path traversal via crafted digest

`pkg/skills/remote.go` — path traversal via remote skill name

`pkg/hooks/executor.go` — PreToolUse hooks failed open

`pkg/agent/agent.go` — data race on `pendingWarnings`

`pkg/tools/builtin/shell.go` — zombie child on `createProcessGroup` failure