Summary
#30840 was closed COMPLETED with the v0.75.0 release, and #33777 is tracking one specific follow-up (unix-socket DOCKER_HOST from a sibling DinD pod). On v0.75.4 the closed-issue scenario (tcp://-shaped DOCKER_HOST from a runner pod with a DinD sidecar in the same ARC RunnerScaleSet pod) is still not first-class: getting a working agent run on Copilot + AWF chroot mode required six distinct workflow/infra-level workarounds.
This issue enumerates those remaining gaps in v0.75.4 with concrete reproduction details.
Repro
- gh-aw v0.75.4 (AWF v0.25.53), Copilot engine.
- ARC scale-set on EKS. RunnerScaleSet template has the runner container and a dind sidecar in the same pod.
- Runner env: DOCKER_HOST=tcp://localhost:2375 (matches AWF's TCP-detection regex, so --docker-host-path-prefix /tmp/gh-aw is engaged correctly — this part works).
- Stock helm: DinD sidecar uses docker:dind (Alpine).
- Shared gh-aw-tmp emptyDir mounted at /tmp on both containers (standard gh-aw recommendation for ARC).
- Workflow: stock Copilot agent, sandbox.agent.id: awf, tools: { github: {toolsets: [all]}, bash: true }.
The full workaround set we currently ship is:
name: 'ARC GAW Bootstrap (workaround)'
description: |
  Stages /etc/passwd, /etc/group, /etc/hosts overrides into the shared DinD
  /tmp volume and copies the runner-installed copilot binary into the DinD
  daemon's /usr/local/bin so AWF's /usr:/host/usr:ro system mount exposes it
  to the chrooted agent. Node is baked into the DinD image.
  Workaround for upstream gh-aw bugs on Kubernetes/ARC runners:
    - https://github.com/github/gh-aw/issues/30838
    - https://github.com/github/gh-aw/issues/30840
  Once both issues are fixed upstream:
    1. Delete this whole workflows/arc-gaw-bootstrap/ directory upstream
       (or .github/workflows/arc-gaw-bootstrap/ in consumer repos).
    2. In each workflow .md, delete every block between
       `# WORKAROUND-START` and `# WORKAROUND-END` (grep-friendly markers).
    3. Drop the `resources:` list from each workflow .md.
    4. Run `gh aw compile` to regenerate the lock files.
runs:
  using: composite
  steps:
    - name: Prepare ARC DinD temp directories and stage copilot into the daemon
      shell: bash
      env:
        DOCKER_HOST: tcp://localhost:2375
      run: bash -eo pipefail "${GITHUB_ACTION_PATH}/prepare-dind-dirs.sh"
    # gh-aw v0.75.4 still hardcodes `github` as an "internal" MCP server in
    # mount_mcp_as_cli.cjs, so it never gets mounted as a CLI shim and
    # workers invoked with `copilot --disable-builtin-mcps` can't see
    # github_* tools. Remove it from INTERNAL_SERVERS so the agent can
    # read issues, list PRs, etc. via the github MCP server.
    - name: Patch mount_mcp_as_cli.cjs to expose github MCP as a CLI tool
      shell: bash
      run: |
        set -euo pipefail
        script="${RUNNER_TEMP}/gh-aw/actions/mount_mcp_as_cli.cjs"
        if [ ! -f "$script" ]; then
          echo "mount_mcp_as_cli.cjs not found at $script — gh-aw version drift?" >&2
          exit 0
        fi
        if grep -q 'INTERNAL_SERVERS = new Set(\["github"\])' "$script"; then
          sed -i 's|INTERNAL_SERVERS = new Set(\["github"\])|INTERNAL_SERVERS = new Set([])|' "$script"
          echo "Patched: removed github from INTERNAL_SERVERS"
        elif grep -q 'INTERNAL_SERVERS = new Set(\[\])' "$script"; then
          echo "Patch already applied or not needed"
        else
          echo "WARN: unrecognized INTERNAL_SERVERS pattern in $script — gh-aw upstream may have changed" >&2
        fi
The prepare-dind-dirs.sh script invoked by the first step:
mkdir -p /tmp/gh-aw/.cache /tmp/gh-aw/.config /tmp/gh-aw/.local/state /tmp/gh-aw/home
chmod -R 0777 /tmp/gh-aw/.cache /tmp/gh-aw/.config /tmp/gh-aw/.local /tmp/gh-aw/home
docker run --rm --user 0:0 --entrypoint /bin/sh \
-v /tmp:/host-tmp:rw \
ghcr.io/github/gh-aw-mcpg:v0.3.6@sha256:2bb8eef86006a4c5963c55616a9c51c32f27bfdecb023b8aa6f91f6718d9171c \
-c 'mkdir -p /host-tmp/gh-aw/.cache /host-tmp/gh-aw/.config /host-tmp/gh-aw/.local/state /host-tmp/gh-aw/home /host-tmp/gh-aw/mcp-logs /host-tmp/gh-aw/mcp-payloads /host-tmp/gh-aw/sandbox/firewall/logs /host-tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs /host-tmp/gh-aw/sandbox/firewall/logs/cli-proxy-logs && chmod -R 0777 /host-tmp/gh-aw'
mkdir -p /tmp/gh-aw/arc-etc
printf '%s\n' 'runner:x:1001:1001:GitHub Actions Runner:/tmp/gh-aw/home:/bin/bash' 'nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin' > /tmp/gh-aw/arc-etc/passwd
printf '%s\n' 'runner:x:1001:' 'nobody:x:65534:' > /tmp/gh-aw/arc-etc/group
printf '%s\n' '127.0.0.1 localhost' '::1 localhost ip6-localhost ip6-loopback' '172.30.0.1 host.docker.internal' > /tmp/gh-aw/arc-etc/hosts
chmod a+r /tmp/gh-aw/arc-etc/passwd /tmp/gh-aw/arc-etc/group /tmp/gh-aw/arc-etc/hosts
# Stage the runner's installed copilot CLI into the DinD daemon's
# /usr/local/bin/copilot so AWF's /usr:/host/usr:ro system mount exposes it
# inside the chrooted agent. Bind-mounting copilot directly into the chroot
# from the runner fails: the daemon doesn't see the runner's filesystem, and
# /host/usr is read-only so a file mountpoint can't be created there.
#
# Node is already pre-installed in the DinD image at /usr/local/bin/node.
# Copilot is staged at runtime rather than baked into the image so its
# version stays bound to whatever gh-aw installs on the runner.
COPILOT_BIN="$(command -v copilot 2>/dev/null || true)"
if [ -n "$COPILOT_BIN" ] && [ -x "$COPILOT_BIN" ]; then
  mkdir -p /tmp/gh-aw/copilot-stage
  cp -Lf "$COPILOT_BIN" /tmp/gh-aw/copilot-stage/copilot.real
  chmod a+rx /tmp/gh-aw/copilot-stage/copilot.real
  # AWF chroot mode passes the AWF agent container's HOME=/home/runner,
  # USER=root, LOGNAME=root through to the agent exec regardless of
  # engine.env (the XDG_* overrides come through; the identity vars do not).
  # With HOME=/home/runner copilot can't write to ~/.copilot in the chrooted
  # DinD filesystem and exits silently with status 1. The shim forces the
  # identity vars to the writable /tmp/gh-aw/home tree before exec'ing the
  # real binary.
  cat > /tmp/gh-aw/copilot-stage/copilot <<'SHIM_EOF'
#!/bin/bash
exec env HOME=/tmp/gh-aw/home USER=runner LOGNAME=runner /usr/local/bin/copilot.real "$@"
SHIM_EOF
  chmod a+rx /tmp/gh-aw/copilot-stage/copilot
  docker run --rm --user 0:0 --entrypoint /bin/sh \
    -v /usr/local/bin:/daemon-usr-local-bin:rw \
    -v /tmp:/host-tmp:ro \
    ghcr.io/github/gh-aw-mcpg:v0.3.6@sha256:2bb8eef86006a4c5963c55616a9c51c32f27bfdecb023b8aa6f91f6718d9171c \
    -c '
      cp /host-tmp/gh-aw/copilot-stage/copilot.real /daemon-usr-local-bin/copilot.real
      chmod 0755 /daemon-usr-local-bin/copilot.real
      cp /host-tmp/gh-aw/copilot-stage/copilot /daemon-usr-local-bin/copilot
      chmod 0755 /daemon-usr-local-bin/copilot
    '
fi
tar -C /tmp/gh-aw -cf - arc-etc | docker run --rm -i --user 0:0 --entrypoint /bin/sh \
-v /tmp:/host-tmp:rw \
ghcr.io/github/gh-aw-mcpg:v0.3.6@sha256:2bb8eef86006a4c5963c55616a9c51c32f27bfdecb023b8aa6f91f6718d9171c \
-c 'mkdir -p /host-tmp/gh-aw && tar -C /host-tmp/gh-aw -xf - && chmod -R a+rX /host-tmp/gh-aw/arc-etc'
if [ -d /tmp/gh-aw/aw-prompts ]; then
  tar -C /tmp/gh-aw -cf - aw-prompts | docker run --rm -i --user 0:0 --entrypoint /bin/sh \
    -v /tmp:/host-tmp:rw \
    ghcr.io/github/gh-aw-mcpg:v0.3.6@sha256:2bb8eef86006a4c5963c55616a9c51c32f27bfdecb023b8aa6f91f6718d9171c \
    -c 'mkdir -p /host-tmp/gh-aw && tar -C /host-tmp/gh-aw -xf - && chmod -R a+rX /host-tmp/gh-aw/aw-prompts'
fi
Remaining gaps in v0.75.4
Gap 1 — AWF chroot rejects Alpine/musl daemon hosts but the official ARC DinD is Alpine
Symptom (with stock docker:dind):
[entrypoint][WARN] one-shot-token.so failed to load on host dynamic linker (host libc incompatibility, e.g. musl/Alpine)
chroot: failed to run command '/bin/sh': No such file or directory
[entrypoint][ERROR] capsh not found on host system
What we had to do: Build a custom Ubuntu-22.04 DinD image with docker-ce, libcap2-bin (with /usr/sbin/capsh symlinked into /usr/bin), and Node.js installed at /usr/local/bin/node.
Expected: Either gh-aw ships a glibc DinD image as a documented companion, or AWF stages capsh / node / /bin/sh into the daemon filesystem itself from a known agent-image bundle (rather than requiring them to be pre-present in the daemon's rootfs).
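Until one of those lands, a preflight check would at least turn the confusing chroot failure above into an explicit message. The following is a minimal sketch, not AWF code: detect_rootfs_libc is a hypothetical helper, and it relies only on the conventional dynamic-linker file names (ld-musl-<arch>.so.1 for musl, ld-linux*.so.* for glibc).

```shell
#!/bin/sh
# Sketch only: detect_rootfs_libc is a hypothetical helper, not part of AWF.
# Prints "musl", "glibc", or "unknown" for the root filesystem rooted at $1.
detect_rootfs_libc() {
  rootfs=${1:-/}
  # musl installs its dynamic linker as /lib/ld-musl-<arch>.so.1.
  for candidate in "$rootfs"/lib/ld-musl-*.so.1; do
    # -L also accepts symlinks that dangle when the rootfs is viewed from outside.
    if [ -e "$candidate" ] || [ -L "$candidate" ]; then
      echo musl
      return 0
    fi
  done
  # glibc's linker is named ld-linux*.so.* and lives in /lib, /lib64,
  # or a multiarch directory such as /lib/x86_64-linux-gnu.
  for candidate in "$rootfs"/lib/ld-linux*.so.* \
                   "$rootfs"/lib64/ld-linux*.so.* \
                   "$rootfs"/lib/*-linux-gnu/ld-linux*.so.*; do
    if [ -e "$candidate" ] || [ -L "$candidate" ]; then
      echo glibc
      return 0
    fi
  done
  echo unknown
}
```

Run against the filesystem that will become /host, a result of musl would be the signal to stop with a clear error (or to stage a runtime bundle) before the chroot is attempted.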
Gap 2 — engine.env HOME / USER / LOGNAME are silently overridden
Symptom: Copilot exits with status 1 after 9s, producing zero bytes on stdout and stderr. Diagnostic shim placed in /usr/local/bin/copilot captured the actual exec env:
HOME=/home/runner # we set HOME=/tmp/gh-aw/home in engine.env
USER=root # we set USER=runner in engine.env
LOGNAME=root # we set LOGNAME=runner in engine.env
XDG_CACHE_HOME=/tmp/gh-aw/.cache # this DID come from engine.env
So engine.env partially propagates (XDG_* survives) but the identity triple is clobbered to the AWF agent container's pre-capsh values. Copilot then can't write ~/.copilot/state and dies.
Workaround: A one-line shim at /usr/local/bin/copilot on the daemon: exec env HOME=/tmp/gh-aw/home USER=runner LOGNAME=runner /usr/local/bin/copilot.real "$@".
Expected: engine.env should be the authoritative source for HOME/USER/LOGNAME of the agent exec — applied AFTER capsh's user-switch, not before.
Gap 3 — AWF chroot needs a unix-socket DOCKER_HOST that lives on the daemon's own filesystem
Symptom (with engine.env.DOCKER_HOST left at the runner-pod default tcp://localhost:2375): AWF can't locate the daemon's filesystem to bind-mount as /host, falls back to the awf-agent container's own (Alpine) rootfs, and the chroot probe fails as in Gap 1.
Workaround:
- Add a second listener on the DinD daemon: --host=unix:///dind-sock/docker.sock
- Mount a shared dind-sock emptyDir at /dind-sock on both the runner container and the DinD sidecar.
- Override the engine env: engine.env.DOCKER_HOST=unix:///dind-sock/docker.sock
Expected: Either AWF should be able to chroot correctly with a TCP DOCKER_HOST (since the daemon container ID is discoverable from the Docker API), or the chroot setup should accept a configuration knob like awf.chroot.daemon_filesystem_path instead of inferring it from DOCKER_HOST's URL scheme.
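To make the coupling concrete, here is a rough sketch of the inference as we understand it from the observed behavior: a unix:// DOCKER_HOST yields a mount root (the socket's parent directory), while a tcp:// one yields nothing to mount. This is our paraphrase, not AWF's actual code, and docker_host_mount_root is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: paraphrases the DOCKER_HOST-based inference described in this
# issue. docker_host_mount_root is a hypothetical name, not an AWF function.
# Prints the socket's parent directory for unix:// hosts; returns 1 otherwise.
docker_host_mount_root() {
  case "${1:-}" in
    unix://*)
      socket_path=${1#unix://}   # strip the scheme, keep the absolute path
      dirname "$socket_path"
      ;;
    *)
      # tcp://, empty, or anything else carries no filesystem path to mount.
      return 1
      ;;
  esac
}
```

With the workaround value unix:///dind-sock/docker.sock this prints /dind-sock; with the runner default tcp://localhost:2375 it fails, which is the case a configuration knob or Docker API discovery would need to cover.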
Gap 4 — Copilot CLI is installed on the runner pod, but the chroot only sees the daemon pod's /usr/local/bin
The Install GitHub Copilot CLI step in the gh-aw lock writes copilot to /home/runner/.npm-global/bin/copilot on the runner. The AWF chroot mounts the daemon's /usr read-only as /host/usr — so the runner-installed copilot binary is not visible inside chroot.
Workaround: A pre-agent composite action that copies the runner-installed copilot binary into the DinD daemon's /usr/local/bin/copilot.real via a helper container with -v /usr/local/bin:/daemon-usr-local-bin:rw -v /tmp:/host-tmp:ro.
Expected: The Install GitHub Copilot CLI step should be aware of ARC/DinD and install into the daemon's filesystem when chroot mode is active. Or the agent image bundle should ship copilot already.
Gap 5 — mount_mcp_as_cli.cjs hardcodes github as INTERNAL_SERVERS, so --disable-builtin-mcps hides github_* tools
Symptom: When the agent harness invokes copilot with --disable-builtin-mcps (the AWF default for chroot mode), the github MCP server doesn't get mounted as a CLI shim because it's in the hardcoded INTERNAL_SERVERS = new Set(["github"]) set in ${RUNNER_TEMP}/gh-aw/actions/mount_mcp_as_cli.cjs. The agent can't list issues, read PRs, etc.
Workaround: A pre-agent sed -i patch:
sed -i 's|INTERNAL_SERVERS = new Set(\["github"\])|INTERNAL_SERVERS = new Set([])|' \
"${RUNNER_TEMP}/gh-aw/actions/mount_mcp_as_cli.cjs"
Expected: This is a straight bug — either INTERNAL_SERVERS should be empty when --disable-builtin-mcps is set, or github should be made an opt-in entry rather than hardcoded.
Gap 6 — /etc/passwd, /etc/group, /etc/hosts overrides still need workflow-level mounts
The daemon's DinD image rootfs has no UID 1001 in /etc/passwd (runner doesn't exist), no runner group, and no host.docker.internal in /etc/hosts. capsh's user switch + HOME resolution + MCP gateway resolution all break.
Workaround: sandbox.agent.mounts with daemon-staged files:
sandbox:
  agent:
    mounts:
      - /tmp/gh-aw/arc-etc/passwd:/etc/passwd:ro
      - /tmp/gh-aw/arc-etc/group:/etc/group:ro
      - /tmp/gh-aw/arc-etc/hosts:/etc/hosts:ro
with a pre-agent step synthesizing those files via a helper container into a daemon-visible path.
Expected: AWF should synthesize a minimal /etc/passwd / /etc/group containing the awfuser/runner UID it's about to switch to, and an /etc/hosts entry for host.docker.internal (the host-gateway IP), without requiring the workflow to ship them.
Gap 7 — safe-outputs.threat-detection job runs without the workflow's pre-agent-steps and re-hits Gap 4
Symptom: Even with all of the above workarounds applied, the auto-generated detection job fails inside AWF chroot:
[copilot-harness] attempt 1: spawning: /usr/local/bin/copilot ...
[copilot-harness] attempt 1: failed to start process '/usr/local/bin/copilot':
spawn /usr/local/bin/copilot ENOENT (code=ENOENT syscall=spawn /usr/local/bin/copilot)
[copilot-harness] attempt 1: process closed exitCode=-2 duration=0s stdout=0B stderr=0B hasOutput=false
...
📄 No lines containing THREAT_DETECTION_RESULT found in 132 lines
##[error]ERR_PARSE: ❌ No THREAT_DETECTION_RESULT found in detection log.
The detection job:
- Runs gh-aw's Install GitHub Copilot CLI step, which installs to /home/runner/.npm-global/bin/copilot on the runner pod.
- Starts AWF chroot — the daemon-level workarounds (Ubuntu DinD, capsh, node in image, DOCKER_HOST unix socket) all carry over because they're cluster/helm-level.
- AWF chroots into the daemon's filesystem and tries to spawn /usr/local/bin/copilot — never staged there for the detection job → ENOENT.
The job is silently marked successful because GH_AW_DETECTION_CONTINUE_ON_ERROR !== 'false' (gh-aw's default), which means the overall workflow goes green with no threat detection having actually run. That's a security regression: a workflow author who has correctly configured safe-outputs.threat-detection will believe their outputs were screened, when in fact the detector no-op'd.
Per docs/src/content/docs/reference/steps-jobs.md, gh-aw exposes pre-agent-steps, post-steps, and jobs.<id>.pre-steps, but NO pre-detection-steps or safe-outputs.threat-detection.pre-steps hook. The detection job is an auto-generated job with no public injection point, so the same workaround we applied for the agent job cannot be applied here from the workflow level.
Expected: The fix for Gap 4 (install copilot CLI into the chroot-runtime overlay when chroot mode is active) is sufficient if it covers both the agent and detection jobs. As a contingency, expose a safe-outputs.threat-detection.pre-steps hook so users can apply ARC-specific staging until the runtime overlay ships.
Files to touch:
pkg/workflow/compiler_threat_detection.go (or wherever the detection job YAML is emitted) — inject the same staging logic used by the agent job, OR add a frontmatter pre-steps field to the threat-detection block.
pkg/parser/schemas/frontmatter.json — schema entry for the new pre-steps field, if the frontmatter pre-steps option is chosen over injecting the staging logic.
pkg/workflow/threat_detection_test.go — new test asserting that on ARC/DinD the detection job successfully invokes copilot.
Default GH_AW_DETECTION_CONTINUE_ON_ERROR should also be reconsidered: the current behavior masks setup failures as successful no-op detections.
Root cause summary
The v0.75.0 fix (referenced in #30840's closing comments) addressed the bind-mount-source split-filesystem problem and the squid log-dir ownership issue. It did not address:
- Daemon filesystem libc requirements (Gap 1)
- engine.env identity-var propagation through capsh (Gap 2)
- Daemon-filesystem discovery from a TCP DOCKER_HOST (Gap 3)
- Runner-installed agent binaries not being visible in chroot (Gap 4)
- INTERNAL_SERVERS hardcoding (Gap 5 — independent of chroot, but blocks the same use case)
- Minimal identity/hosts synthesis in chroot (Gap 6)
- Threat-detection job runs without pre-agent-steps and silently no-ops on chroot setup failures (Gap 7)
Proposed implementation plan
1. Bundle an AWF chroot runtime tarball in the agent image
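In gh-aw-firewall, build a small "chroot-runtime" tarball at image build time containing:
- capsh (static or with vendored libcap)
- /bin/sh, /bin/bash, busybox applets used by AWF entrypoint (mkdir, chmod, cat, head, tee)
- libutil.so.1 (already mentioned in #30840, "[ARC-DinD] GAW should provide first-class ARC runner support for AWF-backed workflows")
- [...] (/awf/engine/bin) that the agent harness exec's through
Stage this tarball from the AWF agent image into the daemon's filesystem at startup via a Docker API helper-container pattern (daemon-visible path → extract into /awf/runtime → chroot reads from there). This removes the requirement that users provide a glibc daemon image, and removes Gap 1.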
Files to touch:
gh-aw-firewall/containers/agent/Dockerfile — assemble tarball
gh-aw-firewall/containers/agent/entrypoint.sh — extract on startup, chroot into /awf/runtime overlay
gh-aw-firewall/src/services/agent-volumes.ts — add the staging mount
2. Preserve engine.env identity vars across capsh
gh-aw-firewall/containers/agent/entrypoint.sh performs the capsh user switch. After the switch, the environment is rebuilt from /etc/passwd (HOME), and the AWF agent container's pre-switch USER/LOGNAME survive, so engine.env's overrides are lost. Apply HOME/USER/LOGNAME from engine.env (passed as AWF_ENGINE_ENV_* docker env vars by gh-aw) after the capsh exec, before exec'ing the engine binary.
Files to touch:
gh-aw-firewall/containers/agent/entrypoint.sh
pkg/workflow/engine_env.go (gh-aw side: ensure engine.env is forwarded as AWF_ENGINE_ENV_*)
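A minimal sketch of that re-application step, assuming the AWF_ENGINE_ENV_* forwarding proposed above exists (it does not today) and that the function runs after the user switch, immediately before the engine binary is exec'd. apply_engine_identity is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: assumes gh-aw forwards engine.env entries as AWF_ENGINE_ENV_<NAME>
# (the convention proposed in this issue, not an existing interface).
# apply_engine_identity is a hypothetical helper name.
apply_engine_identity() {
  for _name in HOME USER LOGNAME; do
    # Read AWF_ENGINE_ENV_<NAME>; the names are fixed, so eval is safe here.
    eval "_value=\${AWF_ENGINE_ENV_${_name}:-}"
    # Override only when a forwarded value exists; otherwise keep the current one.
    if [ -n "$_value" ]; then
      export "${_name}=${_value}"
    fi
  done
  unset _name _value
}
```

Because unset overrides are skipped, workflows that do not set these variables in engine.env would see no change in behavior.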
3. Daemon-filesystem discovery without DOCKER_HOST coupling
Today AWF infers the daemon's bind-mountable filesystem from the DOCKER_HOST URL scheme (unix:// path's parent dir → daemon-visible mount root). Instead, use the Docker API to inspect the daemon container itself and discover its MergedDir/UpperDir (overlay2) or equivalent. This makes chroot work with TCP DOCKER_HOST and removes the need for the extra unix-socket listener + shared emptyDir (Gap 3).
Files to touch:
gh-aw-firewall/src/host-env.ts — add resolveDaemonFilesystem()
gh-aw-firewall/src/services/agent-service.ts — use the resolved path instead of inferring from DOCKER_HOST
4. Install Copilot CLI into the chroot-runtime overlay
The gh-aw Install GitHub Copilot CLI step currently runs npm i -g on the runner. When runs-on selects an ARC runner where AWF chroot mode will be enabled, install instead into the chroot-runtime overlay (Gap 1's bundle), via a Docker helper container that writes into the daemon's filesystem. The detection signal is identical to what --docker-host-path-prefix already detects.
Files to touch:
pkg/workflow/agent_install.go (or equivalent — the step is generated by gh-aw compile, lookup needed)
gh-aw-firewall/src/services/agent-volumes.ts — expose the chroot-runtime mountpoint to the install step
5. Fix INTERNAL_SERVERS hardcoding
In pkg/workflow/mcp_internal.go (or wherever the INTERNAL_SERVERS = new Set(["github"]) literal is generated into mount_mcp_as_cli.cjs), remove the hardcoded entry. With --disable-builtin-mcps, all MCP servers should be eligible for CLI mounting, including github.
Files to touch:
pkg/workflow/mcp_internal.go (or the template that emits mount_mcp_as_cli.cjs)
pkg/workflow/mcp_internal_test.go
6. Synthesize minimal /etc/passwd, /etc/group, /etc/hosts in chroot
gh-aw-firewall/containers/agent/entrypoint.sh already writes /host/etc/resolv.conf; extend the same machinery to write minimal identity files containing only the AWF user (UID AWF_USER_UID, group name matching awfuser) and an /etc/hosts containing host.docker.internal (resolved from the extra_hosts host-gateway).
Files to touch:
gh-aw-firewall/containers/agent/entrypoint.sh
gh-aw-firewall/src/services/agent-service.ts (drop the requirement that workflows mount these themselves)
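As a rough sketch of what that synthesis could look like, here is a parameterized version of the printf lines from our workaround script. write_minimal_identity_files and its argument order are hypothetical; how AWF would obtain the UID, GID, home directory, and gateway IP is left open.

```shell
#!/bin/sh
# Sketch only: write_minimal_identity_files is a hypothetical helper.
# Usage: write_minimal_identity_files OUT_DIR USER UID GID HOME GATEWAY_IP
write_minimal_identity_files() {
  out_dir=$1; user=$2; uid=$3; gid=$4; home=$5; gateway_ip=$6
  mkdir -p "$out_dir" || return 1
  # passwd fields: name:password:UID:GID:GECOS:home:shell
  printf '%s:x:%s:%s::%s:/bin/sh\n' "$user" "$uid" "$gid" "$home" > "$out_dir/passwd"
  # group fields: name:password:GID:members
  printf '%s:x:%s:\n' "$user" "$gid" > "$out_dir/group"
  # hosts: loopback entries plus the Docker host-gateway alias
  printf '%s\n' \
    '127.0.0.1 localhost' \
    '::1 localhost ip6-localhost ip6-loopback' \
    "$gateway_ip host.docker.internal" > "$out_dir/hosts"
  chmod a+r "$out_dir/passwd" "$out_dir/group" "$out_dir/hosts"
}
```

Called as write_minimal_identity_files /tmp/gh-aw/arc-etc runner 1001 1001 /tmp/gh-aw/home 172.30.0.1, it produces files equivalent in purpose to the ones our workaround stages today, minus the nobody entries.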
7. Make threat-detection ARC-aware and fail loud on setup errors
Two independent changes:
a. The auto-generated detection job must apply the same chroot prerequisites as the agent job. With the chroot-runtime overlay (Step 1) and chroot-routed Copilot install (Step 4), the detection job inherits the same setup for free — confirm with an integration test.
b. Change GH_AW_DETECTION_CONTINUE_ON_ERROR semantics so that AWF chroot setup failures (spawn ENOENT, capsh missing, etc.) propagate to the job conclusion. A model-output-format mismatch is one thing; a copilot binary not being spawnable is another. Differentiate "model produced unparseable output" from "the detection engine never started" and only continue-on-error for the former.
Files to touch:
pkg/workflow/compiler_threat_detection.go — emit the staging prerequisites (or rely on Step 4's overlay).
pkg/workflow/threat_detection_parser.go (or similar) — distinguish parse-failure vs spawn-failure.
pkg/workflow/threat_detection_test.go.
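The distinction in (b) can be sketched as a log classifier. This is only an illustration in shell, assuming the marker and failure strings seen in the logs quoted in this issue; classify_detection_log is a hypothetical name and the real change would presumably live in gh-aw's own parser.

```shell
#!/bin/sh
# Sketch only: classify_detection_log is a hypothetical helper. The patterns are
# taken from the log excerpts in this issue and may not cover every setup failure.
# Prints one of: result-present, engine-never-started, unparseable-output.
classify_detection_log() {
  log_file=$1
  if grep -q 'THREAT_DETECTION_RESULT' "$log_file"; then
    # The detector ran and emitted a result line for the parser.
    echo result-present
  elif grep -Eq 'failed to start process|ENOENT|capsh not found on host system' "$log_file"; then
    # Setup failure: the engine binary never ran, so nothing was screened.
    echo engine-never-started
  else
    # The engine ran but produced nothing the parser recognizes.
    echo unparseable-output
  fi
}
```

Under this split, only unparseable-output would remain eligible for continue-on-error; engine-never-started would fail the job.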
Test plan
Following the patterns in gh-aw-firewall/tests/integration/chroot-*.test.ts:
- Glibc-free daemon integration test: spin up an Alpine docker:dind as the test daemon, point AWF at it, and confirm the agent step runs to completion (chroot-runtime overlay supplies capsh / sh / etc.).
- engine.env propagation unit test: assert that engine.env HOME=/foo, USER=u, LOGNAME=u reach the engine exec inside chroot (verified via a probe binary that prints id, $HOME, $USER, $LOGNAME).
- TCP DOCKER_HOST integration test: identical to the existing ARC integration but with DOCKER_HOST=tcp://docker-daemon:2375 and no shared filesystem mount — agent step must still succeed.
- Copilot install routing test: in chroot mode, the Install GitHub Copilot CLI step writes to the chroot-runtime overlay and /usr/local/bin/copilot is discoverable from inside chroot.
- github MCP CLI mount test: with --disable-builtin-mcps, mount_mcp_as_cli emits a github shim.
- Synthesized identity-files test: chroot has id runner → UID matching AWF_USER_UID, and getent hosts host.docker.internal returns the host-gateway IP, without any user-supplied sandbox.agent.mounts.
- Threat-detection ARC test: with safe-outputs.threat-detection enabled and runs-on: arc-gaw, the detection job successfully spawns copilot, produces a parseable THREAT_DETECTION_RESULT, and a deliberate spawn failure (e.g. unstaged copilot) causes the detection job to fail rather than silently no-op.
make agent-finish must pass.
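For the engine.env propagation test, the probe can be very small. A possible shape, assuming the test stages it in place of the engine binary and compares its output against the configured values; identity_probe is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: identity_probe is a hypothetical stand-in for the engine binary.
# It prints the effective UID and the identity variables exactly as the exec sees them.
identity_probe() {
  printf 'uid=%s\n' "$(id -u)"
  printf 'HOME=%s\n' "${HOME:-}"
  printf 'USER=%s\n' "${USER:-}"
  printf 'LOGNAME=%s\n' "${LOGNAME:-}"
}
```

The test would then assert that the HOME, USER, and LOGNAME lines match engine.env and that the uid line matches AWF_USER_UID.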
Acceptance criteria
A consumer can write the workflow below and have it run on an ARC RunnerScaleSet with a stock docker:dind sidecar, with no sandbox.agent.mounts, no engine.env overrides, no pre-agent-steps, and no custom DinD image:
---
on: { workflow_dispatch: }
runs-on: arc-gaw
engine:
  id: copilot
  model: gpt-5.4
safe-outputs:
  threat-detection:
    runs-on: arc-gaw
tools:
  github: { toolsets: [all] }
  bash: true
---
# Hello agent
Both the agent job AND the threat-detection job must run end-to-end.
A workflow that successfully runs the agent but silently no-ops
threat-detection (today's behavior) does not satisfy this criterion.
Labels
area:awf, area:arc-dind, area:engine-copilot, type:bug, scope:cross-cutting.
Related
The current issue is scoped to remaining gaps after the v0.75.0 fixes, with tcp:// DOCKER_HOST + DinD-sidecar topology.
Summary
#30840 was closed COMPLETED with the v0.75.0 release, and #33777 is tracking one specific follow-up (unix-socket DOCKER_HOST from a sibling DinD pod). On v0.75.4 the closed-issue scenario (
tcp://-shaped DOCKER_HOST from a runner pod with a DinD sidecar in the same ARC RunnerScaleSet pod) is still not first-class: getting a working agent run on Copilot + AWF chroot mode required six distinct workflow/infra-level workarounds.This issue enumerates those remaining gaps in v0.75.4 with concrete reproduction details.
Repro
RunnerScaleSettemplate has the runner container and adindsidecar in the same pod.DOCKER_HOST=tcp://localhost:2375(matches AWF's TCP-detection regex, so--docker-host-path-prefix /tmp/gh-awis engaged correctly — this part works).docker:dind(Alpine).gh-aw-tmpemptyDir mounted at/tmpon both containers (standard gh-aw recommendation for ARC).sandbox.agent.id: awf,tools: { github: {toolsets: [all]}, bash: true }.The full workaround set we currently ship is
Remaining gaps in v0.75.4
Gap 1 — AWF chroot rejects Alpine/musl daemon hosts but the official ARC DinD is Alpine
Symptom (with stock
docker:dind):What we had to do: Build a custom Ubuntu-22.04 DinD image with
docker-ce,libcap2-bin(with/usr/sbin/capshsymlinked into/usr/bin), Node.js installed at/usr/local/bin/node.Expected: Either gh-aw ships a glibc DinD image as a documented companion, or AWF stages
capsh/node//bin/shinto the daemon filesystem itself from a known agent-image bundle (rather than requiring them to be pre-present in the daemon's rootfs).Gap 2 — engine.env
HOME/USER/LOGNAMEare silently overriddenSymptom: Copilot exits with status 1 after 9s, producing zero bytes on stdout and stderr. Diagnostic shim placed in
/usr/local/bin/copilotcaptured the actual exec env:So engine.env partially propagates (
XDG_*survives) but the identity triple is clobbered to the AWF agent container's pre-capsh values. Copilot then can't write~/.copilot/stateand dies.Workaround: A one-line shim at
/usr/local/bin/copiloton the daemon:exec env HOME=/tmp/gh-aw/home USER=runner LOGNAME=runner /usr/local/bin/copilot.real "$@".Expected:
engine.envshould be the authoritative source forHOME/USER/LOGNAMEof the agent exec — applied AFTER capsh's user-switch, not before.Gap 3 — AWF chroot needs a unix-socket DOCKER_HOST that lives on the daemon's own filesystem
Symptom (with
engine.env.DOCKER_HOSTleft at the runner-pod defaulttcp://localhost:2375): AWF can't locate the daemon's filesystem to bind-mount as/host, falls back to the awf-agent container's own (Alpine) rootfs, and the chroot probe fails as in Gap 1.Workaround:
--host=unix:///dind-sock/docker.sockdind-sockemptyDir at/dind-sockon both the runner container and the DinD sidecar.engine.env.DOCKER_HOST=unix:///dind-sock/docker.sockExpected: Either AWF should be able to chroot correctly with a TCP DOCKER_HOST (since the daemon container ID is discoverable from the Docker API), or the chroot setup should accept a configuration knob like
awf.chroot.daemon_filesystem_pathinstead of inferring it from DOCKER_HOST's URL scheme.Gap 4 — Copilot CLI is installed on the runner pod, but the chroot only sees the daemon pod's
/usr/local/binThe
Install GitHub Copilot CLIstep in the gh-aw lock writescopilotto/home/runner/.npm-global/bin/copiloton the runner. The AWF chroot mounts the daemon's/usrread-only as/host/usr— so the runner-installed copilot binary is not visible inside chroot.Workaround: A pre-agent composite action that copies the runner-installed copilot binary into the DinD daemon's
/usr/local/bin/copilot.realvia a helper container with-v /usr/local/bin:/daemon-usr-local-bin:rw -v /tmp:/host-tmp:ro.Expected: The
Install GitHub Copilot CLIstep should be aware of ARC/DinD and install into the daemon's filesystem when chroot mode is active. Or the agent image bundle should ship copilot already.Gap 5 —
mount_mcp_as_cli.cjshardcodesgithubas INTERNAL_SERVERS, so--disable-builtin-mcpshides github_* toolsSymptom: When the agent harness invokes copilot with
--disable-builtin-mcps(the AWF default for chroot mode), the github MCP server doesn't get mounted as a CLI shim because it's in the hardcodedINTERNAL_SERVERS = new Set(["github"])set in${RUNNER_TEMP}/gh-aw/actions/mount_mcp_as_cli.cjs. The agent can't list issues, read PRs, etc.Workaround: A pre-agent
sed -ipatch:Expected: This is a straight bug — either
INTERNAL_SERVERSshould be empty when--disable-builtin-mcpsis set, orgithubshould be made an opt-in entry rather than hardcoded.Gap 6 —
/etc/passwd,/etc/group,/etc/hostsoverrides still need workflow-level mountsThe daemon's DinD image rootfs has no UID 1001 in
/etc/passwd(runnerdoesn't exist), norunnergroup, and nohost.docker.internalin/etc/hosts. capsh's user switch + HOME resolution + MCP gateway resolution all break.Workaround:
sandbox.agent.mountswith daemon-staged files:with a pre-agent step synthesizing those files via a helper container into a daemon-visible path.
Expected: AWF should synthesize a minimal
/etc/passwd//etc/groupcontaining theawfuser/runnerUID it's about to switch to, and an/etc/hostsentry forhost.docker.internal(the host-gateway IP), without requiring the workflow to ship them.Gap 7 —
safe-outputs.threat-detectionjob runs without the workflow's pre-agent-steps and re-hits Gap 4Symptom: Even with all of the above workarounds applied, the auto-generated
detectionjob fails inside AWF chroot:The detection job:
Install GitHub Copilot CLIstep, which installs to/home/runner/.npm-global/bin/copiloton the runner pod.capsh,nodein image, DOCKER_HOST unix socket) all carry over because they're cluster/helm-level./usr/local/bin/copilot— never staged there for the detection job →ENOENT.The job is silently marked successful because
GH_AW_DETECTION_CONTINUE_ON_ERROR !== 'false'(gh-aw's default), which means the overall workflow goes green with no threat detection having actually run. That's a security regression: a workflow author who has correctly configuredsafe-outputs.threat-detectionwill believe their outputs were screened, when in fact the detector no-op'd.Per
docs/src/content/docs/reference/steps-jobs.md, gh-aw exposespre-agent-steps,post-steps, andjobs.<id>.pre-steps, but NOpre-detection-stepsorsafe-outputs.threat-detection.pre-stepshook. The detection job is an auto-generated job with no public injection point, so the same workaround we applied for the agent job cannot be applied here from the workflow level.Expected: The fix for Gap 4 (install copilot CLI into the chroot-runtime overlay when chroot mode is active) is sufficient if it covers both the agent and detection jobs. As a contingency, expose a
safe-outputs.threat-detection.pre-stepshook so users can apply ARC-specific staging until the runtime overlay ships.Files to touch:
pkg/workflow/compiler_threat_detection.go(or wherever the detection job YAML is emitted) — inject the same staging logic used by the agent job, OR add a frontmatterpre-stepsfield to the threat-detection block.pkg/parser/schemas/frontmatter.json— schema entry for the newpre-stepsfield if option B is taken.pkg/workflow/threat_detection_test.go— new test asserting that on ARC/DinD the detection job successfully invokes copilot.Default
GH_AW_DETECTION_CONTINUE_ON_ERRORshould also be reconsidered: the current behavior masks setup failures as successful no-op detections.Root cause summary
The v0.75.0 fix (referenced in #30840's closing comments) addressed the bind-mount-source split-filesystem problem and the squid log-dir ownership issue. It did not address:
blocks the same use case)
pre-agent-stepsand silently no-ops on chroot setup failures (Gap 7)Proposed implementation plan
1. Bundle an AWF chroot runtime tarball in the agent image
In
gh-aw-firewall, build a small "chroot-runtime" tarball at image build time containing:capsh(static or with vendored libcap)/bin/sh,/bin/bash, busybox applets used by AWF entrypoint (mkdir,chmod,cat,head,tee)libutil.so.1(already mentioned in [ARC-DinD] GAW should provide first-class ARC runner support for AWF-backed workflows #30840)/awf/engine/bin) that the agent harness exec's throughStage this tarball from the AWF agent image into the daemon's filesystem at startup via a Docker API helper-container pattern (daemon-visible path → extract into
/awf/runtime→ chroot reads from there). This removes the requirement that users provide a glibc daemon image, and removes Gap 1.Files to touch:
gh-aw-firewall/containers/agent/Dockerfile— assemble tarballgh-aw-firewall/containers/agent/entrypoint.sh— extract onstartup, chroot into
/awf/runtimeoverlaygh-aw-firewall/src/services/agent-volumes.ts— add thestaging mount
2. Preserve engine.env identity vars across capsh
gh-aw-firewall/containers/agent/entrypoint.shperforms the capsh user switch. After the switch, environment is rebuilt from/etc/passwd(HOME), and the AWF agent container's pre-switch USER/ LOGNAME survive —engine.env's overrides are lost. ApplyHOME/USER/LOGNAMEfrom engine.env (passed asAWF_ENGINE_ENV_*docker env vars by gh-aw) after the capsh exec, before exec'ing the engine binary.Files to touch:
gh-aw-firewall/containers/agent/entrypoint.shpkg/workflow/engine_env.go(gh-aw side: ensure engine.env is forwarded asAWF_ENGINE_ENV_*)3. Daemon-filesystem discovery without DOCKER_HOST coupling
Today AWF infers the daemon's bind-mountable filesystem from the DOCKER_HOST URL scheme (unix://path's parent dir → daemon-visible mount root). Instead, use the Docker API to inspect the daemon container itself and discover its MergedDir/UpperDir (overlay2) or equivalent. This makes chroot work with TCP DOCKER_HOST and removes the need for the extra unix-socket listener + shared emptyDir (Gap 3).
Files to touch:
- gh-aw-firewall/src/host-env.ts: add resolveDaemonFilesystem()
- gh-aw-firewall/src/services/agent-service.ts: use the resolved path instead of inferring from DOCKER_HOST
4. Install Copilot CLI into the chroot-runtime overlay
The gh-aw Install GitHub Copilot CLI step currently runs npm i -g on the runner. When runs-on selects an ARC runner where AWF chroot mode will be enabled, install instead into the chroot-runtime overlay (Gap 1's bundle), via a Docker helper container that writes into the daemon's filesystem. The detection signal is identical to what --docker-host-path-prefix already detects.
Files to touch:
- pkg/workflow/agent_install.go (or equivalent; the step is generated by gh-aw compile, lookup needed)
- gh-aw-firewall/src/services/agent-volumes.ts: expose the chroot-runtime mountpoint to the install step
5. Fix INTERNAL_SERVERS hardcoding
In pkg/workflow/mcp_internal.go (or wherever the INTERNAL_SERVERS = new Set(["github"]) literal is generated into mount_mcp_as_cli.cjs), remove the hardcoded entry. With --disable-builtin-mcps, all MCP servers should be eligible for CLI mounting, including github.
Files to touch:
- pkg/workflow/mcp_internal.go (or the template that emits mount_mcp_as_cli.cjs)
- pkg/workflow/mcp_internal_test.go
6. Synthesize minimal /etc/passwd, /etc/group, /etc/hosts in chroot
gh-aw-firewall/containers/agent/entrypoint.sh already writes /host/etc/resolv.conf; extend the same machinery to write minimal identity files containing only the AWF user (UID AWF_USER_UID, groupname matching awfuser) and an /etc/hosts containing host.docker.internal (resolved from the extra_hosts host-gateway).
Files to touch:
- gh-aw-firewall/containers/agent/entrypoint.sh
- gh-aw-firewall/src/services/agent-service.ts (drop the requirement that workflows mount these themselves)
7. Make threat-detection ARC-aware and fail loud on setup errors
Two independent changes:
a. The auto-generated detection job must apply the same chroot prerequisites as the agent job. With the chroot-runtime overlay (Step 1) and chroot-routed Copilot install (Step 4), the detection job inherits the same setup for free; confirm with an integration test.
b. Change GH_AW_DETECTION_CONTINUE_ON_ERROR semantics so that AWF chroot setup failures (spawn ENOENT, capsh missing, etc.) propagate to the job conclusion. A model-output-format mismatch is one thing; a copilot binary not being spawnable is another. Differentiate "model produced unparseable output" from "the detection engine never started" and only continue-on-error for the former.
Files to touch:
- pkg/workflow/compiler_threat_detection.go: emit the staging prerequisites (or rely on Step 4's overlay).
- pkg/workflow/threat_detection_parser.go (or similar): distinguish parse-failure vs spawn-failure.
- pkg/workflow/threat_detection_test.go.
Test plan
Following the patterns in gh-aw-firewall/tests/integration/chroot-*.test.ts:
- docker:dind as the test daemon, point AWF at it, and confirm the agent step runs to completion (chroot-runtime overlay supplies capsh / sh / etc.).
- HOME=/foo, USER=u, LOGNAME=u reach the engine exec inside chroot (verified via a probe binary that prints id, $HOME, $USER, $LOGNAME).
- DOCKER_HOST=tcp://docker-daemon:2375 and no shared filesystem mount; the agent step must still succeed.
- Install GitHub Copilot CLI step writes to the chroot-runtime overlay and /usr/local/bin/copilot is discoverable from inside chroot.
- --disable-builtin-mcps, mount_mcp_as_cli emits a github shim.
- id runner → UID matching AWF_USER_UID, and getent hosts host.docker.internal returns the host-gateway IP, without any user-supplied sandbox.agent.mounts.
- safe-outputs.threat-detection enabled and runs-on: arc-gaw, the detection job successfully spawns copilot, produces a parseable THREAT_DETECTION_RESULT, and a deliberate spawn failure (e.g. unstaged copilot) causes the detection job to fail rather than silently no-op.
- make agent-finish must pass.
Acceptance criteria
A consumer can write the workflow below and have it run on an ARC RunnerScaleSet with a stock docker:dind sidecar, with no sandbox.agent.mounts, no engine.env overrides, no pre-agent-steps, and no custom DinD image:
Both the agent job AND the threat-detection job must run end-to-end. A workflow that successfully runs the agent but silently no-ops threat-detection (today's behavior) does not satisfy this criterion.
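The distinction this criterion depends on (step 7b: "the detection engine never started" versus "model produced unparseable output") can be sketched as a small classifier. Everything here is hypothetical: the EngineRun shape, the function names, and the assumption that the result is a JSON payload on a line starting with THREAT_DETECTION_RESULT: are illustrative choices, not the current gh-aw parser.

```typescript
// Hypothetical sketch: the types, names, and result-line format are assumptions.

type DetectionOutcome = "ok" | "parse-failure" | "spawn-failure";

interface EngineRun {
  spawnError?: string;     // e.g. "spawn copilot ENOENT" when the binary is missing
  exitCode: number | null; // null when the process never started
  output: string;          // captured engine stdout
}

// Assumed line prefix; the issue only requires the result to be parseable.
const RESULT_PREFIX = "THREAT_DETECTION_RESULT:";

function classifyDetection(run: EngineRun): DetectionOutcome {
  // 127 is the shell convention for "command not found" (e.g. capsh missing).
  if (run.spawnError !== undefined || run.exitCode === null || run.exitCode === 127) {
    return "spawn-failure";
  }
  for (const line of run.output.split("\n")) {
    if (line.indexOf(RESULT_PREFIX) !== 0) continue;
    try {
      JSON.parse(line.slice(RESULT_PREFIX.length));
      return "ok";
    } catch (_err) {
      return "parse-failure"; // result line present but not valid JSON
    }
  }
  return "parse-failure"; // engine ran but emitted no result line
}

/** Only a parse failure may be tolerated by continue-on-error. */
function shouldFailJob(outcome: DetectionOutcome, continueOnError: boolean): boolean {
  if (outcome === "spawn-failure") return true;
  if (outcome === "parse-failure") return !continueOnError;
  return false;
}
```

Under this split, GH_AW_DETECTION_CONTINUE_ON_ERROR would keep its current leniency for malformed model output, while an unstaged copilot binary or a missing capsh always fails the detection job.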
Labels
area:awf, area:arc-dind, area:engine-copilot, type:bug, scope:cross-cutting.
Related
The current issue is scoped to remaining gaps after the v0.75.0 fixes, with tcp:// DOCKER_HOST + DinD-sidecar topology.