feat: Docker-in-Docker job execution via host socket #18
Pull request overview
Adds Docker-based job execution for the local PanDA/Harvester compose stack by introducing Harvester plugins that submit and monitor jobs as Docker containers via the host Docker socket.
Changes:
- Added `DockerSubmitter`/`DockerMonitor` Harvester plugins to run and poll worker containers through `/var/run/docker.sock`.
- Updated `docker-compose.yml` to run Harvester in foreground with a PID healthcheck, mount the Docker socket, and load the custom plugins via `PYTHONPATH`.
- Updated the CI workflow and `pandajob-submit` script to submit an inline Alpine `sh -c ...` job and poll until it finishes.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `scripts/pandajob-submit` | Adds inline `--transformation`/`--params` submission mode (and makes `--script` optional). |
| `docker-compose.yml` | Starts Harvester daemon in foreground, installs docker SDK at startup, mounts docker.sock, adds healthcheck, mounts plugin dir. |
| `config/harvester/plugins/docker_submitter.py` | New submitter plugin that launches detached containers for workers. |
| `config/harvester/plugins/docker_monitor.py` | New monitor plugin that maps container status/exit code to Harvester worker states. |
| `config/harvester/panda_queues.cfg` | Switches queue to DockerSubmitter/DockerMonitor and uses BaseMessenger. |
| `.github/workflows/ci.yml` | Expands smoke test to submit an Alpine job and poll for terminal status. |
```sh
mkdir -p /var/run/panda /var/log/panda /harvester/workers /harvester/wdir
# Allow the unprivileged atlpan process to reach the Docker socket
chmod 666 /var/run/docker.sock || true
```
`chmod 666 /var/run/docker.sock` makes the host Docker socket world-writable inside the container, which effectively grants root-equivalent access to the host. Prefer granting access via group permissions (e.g., run the container with the host docker group GID via `group_add`, or run as root and keep the socket at 660), and avoid widening permissions on the socket.
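A minimal docker-compose sketch of the `group_add` approach suggested above. The GID `999` is an assumption, not a fixed value; use the host's actual docker group GID (e.g. from `stat -c %g /var/run/docker.sock`):

```yaml
services:
  harvester:
    # ...existing image/command/environment config...
    group_add:
      # Host docker group GID -- assumed 999 here; determine the real value with:
      #   stat -c %g /var/run/docker.sock
      - "999"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

This keeps the socket at its default 660 permissions while still letting the unprivileged process connect.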
```python
container_name = f"harvester-worker-{workSpec.workerID}"
wLog.debug(f"running container image={self.containerImage} command={command}")

container = client.containers.run(
    self.containerImage,
    command=command,
    name=container_name,
    detach=True,
    remove=False,
)
```
container_name is deterministic per workerID and remove=False leaves containers behind. If Harvester retries submission or reuses a workerID after restart, containers.run(..., name=...) will fail with a name conflict, and old containers will accumulate. Consider enabling auto-removal (remove=True), generating a unique name (e.g., include a UUID/batchID), and/or adding labels and a sweeper to clean up containers for finished/failed workers.
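One way to address the name-conflict and cleanup concerns, sketched as standalone helpers (`make_worker_container_name` and `make_worker_labels` are illustrative names, not part of the PR):

```python
import uuid


def make_worker_container_name(worker_id):
    """Build a collision-free container name for a Harvester worker.

    The random UUID suffix avoids Docker name conflicts when Harvester
    retries submission or reuses a workerID after a restart.
    """
    return f"harvester-worker-{worker_id}-{uuid.uuid4().hex[:8]}"


def make_worker_labels(worker_id):
    """Labels let a sweeper find stale containers later, e.g.:
    client.containers.list(all=True, filters={"label": f"harvester.workerID={worker_id}"})
    """
    return {"harvester.workerID": str(worker_id)}
```

Passing `labels=make_worker_labels(workSpec.workerID)` to `containers.run(...)` then gives a sweeper a reliable handle even if names change format.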
```python
elif c_status == "exited":
    if exit_code == 0:
        new_status = WorkSpec.ST_finished
        workSpec.nativeExitCode = 0
        workSpec.nativeStatus = "exited(0)"
    else:
        new_status = WorkSpec.ST_failed
        workSpec.nativeExitCode = exit_code
        workSpec.nativeStatus = f"exited({exit_code})"
else:
    new_status = WorkSpec.ST_failed
    workSpec.nativeStatus = c_status

retList.append((new_status, ""))
```
For non-zero exit codes and other failure states you set `nativeStatus`/`nativeExitCode`, but you always append `(new_status, "")`, which drops useful diagnostics. Returning a non-empty message (e.g., include `c_status`/`exit_code`) will make failures debuggable upstream.
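A small helper along these lines would keep successful workers quiet while surfacing failure details (`diag_message` is an illustrative name, not in the PR):

```python
def diag_message(container_status, exit_code):
    """Return a non-empty diagnostic string for failed containers, "" otherwise.

    Intended for the message half of Harvester's (status, message) result
    tuples, so the cause of a failure survives upstream instead of being
    dropped as an empty string.
    """
    if container_status in ("created", "running"):
        return ""
    if container_status == "exited" and exit_code == 0:
        return ""
    return f"container status={container_status} exit_code={exit_code}"
```

The monitor's append then becomes `retList.append((new_status, diag_message(c_status, exit_code)))`.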
```python
if c_status in ("created", "running"):
    new_status = WorkSpec.ST_running
elif c_status == "exited":
    if exit_code == 0:
        new_status = WorkSpec.ST_finished
        workSpec.nativeExitCode = 0
        workSpec.nativeStatus = "exited(0)"
    else:
        new_status = WorkSpec.ST_failed
        workSpec.nativeExitCode = exit_code
        workSpec.nativeStatus = f"exited({exit_code})"
else:
    new_status = WorkSpec.ST_failed
    workSpec.nativeStatus = c_status
```
Docker can report container status as `restarting` (and some daemon versions report `removing`). Those currently fall into the `else` branch and mark the worker failed immediately. Consider treating transient statuses like `restarting` as `ST_running` (or at least not failing hard until the container reaches a terminal state).
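A possible status-mapping shape that treats transient states as still-running. The function and set names are hypothetical, and the returned strings stand in for the `WorkSpec.ST_*` constants so the logic can be shown standalone:

```python
# Statuses that should map to a running worker. "restarting" and "removing"
# are transient Docker states, not terminal failures.
RUNNING_LIKE = {"created", "running", "restarting", "removing"}


def map_container_status(c_status, exit_code):
    """Map a Docker container status to a Harvester-style worker state name."""
    if c_status in RUNNING_LIKE:
        return "running"
    if c_status == "exited":
        return "finished" if exit_code == 0 else "failed"
    # Anything else (e.g. "dead") is treated as a hard failure.
    return "failed"
```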
```python
if args.script is not None:
    script_path = os.path.abspath(args.script)
    if not os.path.isfile(script_path):
        print(f"ERROR: script not found: {script_path}", file=sys.stderr)
        sys.exit(1)
    transformation = "/bin/bash"
    job_parameters = script_path
elif args.transformation is not None or args.params is not None:
    transformation = args.transformation or "sh"
    job_parameters = args.params or ""
```
In --script mode you submit an absolute path from the machine running pandajob-submit as job.jobParameters. With the new DockerSubmitter executing inside a Docker container, that path almost certainly doesn't exist inside the job container, and transformation=/bin/bash also won't exist on the default alpine image. Either clarify in the CLI help that --script requires the script to be present in the worker/container filesystem (and use /bin/sh for alpine), or implement staging/mounting so the script is actually available to the worker.
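If the staging/mounting route is taken, one sketch is to bind-mount the script into the job container and run it with `/bin/sh` (which alpine actually ships). This assumes the submitting host's filesystem is visible to the Docker daemon, which holds for a local compose stack but not for a remote daemon; `build_script_run_kwargs` is a hypothetical helper, not part of the PR:

```python
import os


def build_script_run_kwargs(script_path, image="alpine:latest"):
    """Build kwargs for docker-py's containers.run() that stage a local
    script into the container via a read-only bind mount.
    """
    target = f"/stage/{os.path.basename(script_path)}"
    return {
        "image": image,
        # alpine ships /bin/sh, not /bin/bash
        "command": ["/bin/sh", target],
        "volumes": {os.path.abspath(script_path): {"bind": target, "mode": "ro"}},
    }
```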
```sh
# Install Python docker SDK into the harvester virtualenv
/opt/harvester/bin/pip install -q docker
set +e
```
The container startup script uses `set -e` and then runs `/opt/harvester/bin/pip install -q docker` before `set +e`. If the pip install fails (network/PyPI outage), the whole container will exit immediately despite the later `set +e`. Either keep `set -e` and remove the misleading `set +e`, or explicitly handle the install failure (e.g., retry / fail with a clearer message / make it non-fatal if appropriate). Also consider pinning the docker SDK version to avoid CI breakage from upstream releases.
Suggested change:

```sh
# Install Python docker SDK into the harvester virtualenv.
# Retry transient network/PyPI failures and pin the version for reproducible startup.
for attempt in 1 2 3; do
  if /opt/harvester/bin/pip install -q 'docker==7.1.0'; then
    break
  fi
  if [ "$attempt" -eq 3 ]; then
    echo "ERROR: Failed to install Python package 'docker==7.1.0' after ${attempt} attempts" >&2
    exit 1
  fi
  echo "WARNING: pip install of 'docker==7.1.0' failed on attempt ${attempt}; retrying..." >&2
  sleep 5
done
```
Replace SubprocessSubmitter/SubprocessMonitor with custom DockerSubmitter and DockerMonitor plugins that run PanDA jobs as Docker containers on the host daemon via a bind-mounted socket (/var/run/docker.sock).

Changes:
- config/harvester/plugins/docker_submitter.py: new plugin — starts a detached Alpine container per worker; reads transformation/jobPars from the job spec and builds the container command via shlex.split
- config/harvester/plugins/docker_monitor.py: new plugin — polls container state; maps running→ST_running, exited(0)→ST_finished, else→ST_failed
- docker-compose.yml (harvester): install docker SDK at startup, chmod socket, write PID file for healthcheck; add socket and plugin volume mounts
- config/harvester/panda_queues.cfg: switch to DockerSubmitter/DockerMonitor/BaseMessenger (job completion driven by worker status, no jobReport needed)
- .github/workflows/ci.yml: wait for Harvester healthy before submitting; run alpine sh -c 'echo hello from alpine; uname -a' as integration test; raise job/CI timeout to 30 min

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
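The commit message says the submitter builds the container command via `shlex.split`; a minimal illustration of that step (`build_container_command` is a hypothetical name, and `transformation`/`job_pars` mirror the PanDA job spec fields):

```python
import shlex


def build_container_command(transformation, job_pars):
    """Combine a job's transformation and parameters into a command list
    for the Docker SDK. shlex.split respects shell-style quoting, so a
    quoted argument in jobPars stays a single list element.
    """
    return shlex.split(transformation) + shlex.split(job_pars)
```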
Add --container IMAGE to pandajob-submit and wire it through to DockerSubmitter,
mirroring PanDA's production container_name job spec field.
In production PanDA, jobs carry a container_name attribute that submitter plugins
use to select the container image per-job (e.g. via prun --containerImage). Harvester
surfaces this as job.jobParams["container_name"]. This commit replicates that pattern
in the panda-compose dev stack.
Changes:
- scripts/pandajob-submit: add --container IMAGE optional arg; when provided it sets
job.container_name on the submitted JobSpec
- config/harvester/plugins/docker_submitter.py: extract _resolve_image() helper that
reads job.jobParams.get("container_name"), strips any "docker://" prefix for Docker
SDK compatibility, and falls back to the queue-level containerImage default
- .github/workflows/ci.yml: use --container python:3.12-alpine in the integration test
to demonstrate per-job container selection (different from the queue default alpine:latest)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds Docker-in-Docker (DinD) support so Harvester executes real container jobs on a local Alpine image rather than relying on a local submitter that doesn't exist in the image.
Changes
New files
- `config/harvester/plugins/docker_submitter.py` — Harvester submitter plugin that launches detached Alpine containers via the host Docker socket, using `transformation` + `jobPars` from the PanDA job spec as the container command
- `config/harvester/plugins/docker_monitor.py` — polls container status and maps to `ST_running`/`ST_finished`/`ST_failed`

Modified files
- `docker-compose.yml` — harvester service now starts the Harvester daemon (`master.py --foreground`); installs the Python `docker` SDK at startup; mounts `/var/run/docker.sock`; adds PID-based healthcheck
- `config/harvester/panda_queues.cfg` — switches from non-existent `LocalSubmitter`/`LocalMonitor` to `DockerSubmitter`/`DockerMonitor` (`alpine:latest`); switches messenger to `BaseMessenger` so job completion is driven by worker status alone, without requiring a `jobReport.json` file
- `scripts/pandajob-submit` — adds `--transformation`/`--params` flags for self-contained inline commands alongside the existing `--script` mode
- `.github/workflows/ci.yml` — submits an Alpine `sh -c 'echo hello; uname -a'` job to `PANDA_COMPOSE_LOCAL`, waits for Harvester to become healthy, polls up to 10 min for `jobStatus == finished`

Depends on
- PR #1 (`pip-install-user-panda-client`) — panda-client installation fix