feat: Docker-in-Docker job execution via host socket #18

Open: wdconinc wants to merge 2 commits into main from feat/docker-in-docker
Conversation

@wdconinc (Collaborator)

Summary

Adds Docker-in-Docker (DinD) support so that Harvester executes real container jobs against a local Alpine image, instead of relying on a local submitter plugin that does not exist in the Harvester image.

Changes

New files

  • config/harvester/plugins/docker_submitter.py — Harvester submitter plugin that launches detached Alpine containers via the host Docker socket, using transformation + jobPars from the PanDA job spec as the container command
  • config/harvester/plugins/docker_monitor.py — polls container status and maps to ST_running/ST_finished/ST_failed
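The submitter's command construction described above can be sketched as pure logic (the helper name is illustrative; the actual plugin wraps this in a Docker SDK `containers.run` call):

```python
import shlex

def build_container_command(transformation, job_pars):
    """Combine a PanDA transformation and its jobPars string into the argv
    list passed to the container (illustrative helper, not the PR's code)."""
    # shlex.split respects shell-style quoting inside jobPars
    return [transformation] + shlex.split(job_pars)

print(build_container_command("sh", "-c 'echo hello; uname -a'"))
# → ['sh', '-c', 'echo hello; uname -a']
```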

Modified files

  • docker-compose.yml — harvester service now starts the Harvester daemon (master.py --foreground); installs the Python docker SDK at startup; mounts /var/run/docker.sock; adds PID-based healthcheck
  • config/harvester/panda_queues.cfg — switches from non-existent LocalSubmitter/LocalMonitor to DockerSubmitter/DockerMonitor (alpine:latest); switches messenger to BaseMessenger so job completion is driven by worker status alone without requiring a jobReport.json file
  • scripts/pandajob-submit — adds --transformation/--params flags for self-contained inline commands alongside the existing --script mode
  • .github/workflows/ci.yml — submits an Alpine sh -c 'echo hello; uname -a' job to PANDA_COMPOSE_LOCAL, waits for Harvester to become healthy, polls up to 10 min for jobStatus == finished

Depends on

PR #1 (pip-install-user-panda-client) — panda-client installation fix

Copilot AI review requested due to automatic review settings April 23, 2026 19:11
@wdconinc wdconinc force-pushed the pip-install-user-panda-client branch from 912a09d to c91f885 Compare April 23, 2026 19:11
Copilot AI (Contributor) left a comment

Pull request overview

Adds Docker-based job execution for the local PanDA/Harvester compose stack by introducing Harvester plugins that submit and monitor jobs as Docker containers via the host Docker socket.

Changes:

  • Added DockerSubmitter/DockerMonitor Harvester plugins to run and poll worker containers through /var/run/docker.sock.
  • Updated docker-compose.yml to run Harvester in foreground with a PID healthcheck, mount the Docker socket, and load the custom plugins via PYTHONPATH.
  • Updated the CI workflow and pandajob-submit script to submit an inline Alpine sh -c ... job and poll until it finishes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Summary per file:

  • scripts/pandajob-submit: Adds inline --transformation/--params submission mode (and makes --script optional).
  • docker-compose.yml: Starts Harvester daemon in foreground, installs docker SDK at startup, mounts docker.sock, adds healthcheck, mounts plugin dir.
  • config/harvester/plugins/docker_submitter.py: New submitter plugin that launches detached containers for workers.
  • config/harvester/plugins/docker_monitor.py: New monitor plugin that maps container status/exit code to Harvester worker states.
  • config/harvester/panda_queues.cfg: Switches queue to DockerSubmitter/DockerMonitor and uses BaseMessenger.
  • .github/workflows/ci.yml: Expands smoke test to submit an Alpine job and poll for terminal status.


Comment thread on docker-compose.yml (outdated), lines +233 to +235:
mkdir -p /var/run/panda /var/log/panda /harvester/workers /harvester/wdir
# Allow the unprivileged atlpan process to reach the Docker socket
chmod 666 /var/run/docker.sock || true
Copilot AI (Apr 23, 2026)

chmod 666 /var/run/docker.sock makes the host Docker socket world-writable inside the container, which effectively grants root-equivalent access to the host. Prefer granting access via group permissions (e.g., run the container with the host docker group GID via group_add, or run as root and keep the socket at 660), and avoid widening permissions on the socket.
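A compose-level version of the reviewer's group_add suggestion might look like the following (the GID 999 is a placeholder; the actual docker group GID varies per host and can be read with `stat -c %g /var/run/docker.sock`):

```yaml
services:
  harvester:
    # Join the host docker group instead of chmod 666 on the socket
    group_add:
      - "999"   # placeholder: host docker group GID
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```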

Comment on lines +57 to +66 (config/harvester/plugins/docker_submitter.py):
container_name = f"harvester-worker-{workSpec.workerID}"
wLog.debug(f"running container image={self.containerImage} command={command}")

container = client.containers.run(
self.containerImage,
command=command,
name=container_name,
detach=True,
remove=False,
)
Copilot AI (Apr 23, 2026)

container_name is deterministic per workerID and remove=False leaves containers behind. If Harvester retries submission or reuses a workerID after restart, containers.run(..., name=...) will fail with a name conflict, and old containers will accumulate. Consider enabling auto-removal (remove=True), generating a unique name (e.g., include a UUID/batchID), and/or adding labels and a sweeper to clean up containers for finished/failed workers.
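The reviewer's unique-naming suggestion could be sketched as follows (illustrative helper, not part of the PR):

```python
import uuid

def make_worker_container_name(worker_id):
    """Append a random suffix so retried submissions for the same workerID
    cannot collide on the container name (illustrative sketch)."""
    return f"harvester-worker-{worker_id}-{uuid.uuid4().hex[:8]}"

# Two attempts for the same worker get distinct names
name_a = make_worker_container_name(42)
name_b = make_worker_container_name(42)
assert name_a != name_b
```

Combined with `remove=True` (or labels plus a sweeper), this avoids both name conflicts and accumulating exited containers.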

Comment on lines +55 to +68 (config/harvester/plugins/docker_monitor.py):
elif c_status == "exited":
if exit_code == 0:
new_status = WorkSpec.ST_finished
workSpec.nativeExitCode = 0
workSpec.nativeStatus = "exited(0)"
else:
new_status = WorkSpec.ST_failed
workSpec.nativeExitCode = exit_code
workSpec.nativeStatus = f"exited({exit_code})"
else:
new_status = WorkSpec.ST_failed
workSpec.nativeStatus = c_status

retList.append((new_status, ""))
Copilot AI (Apr 23, 2026)

For non-zero exit codes and other failure states you set nativeStatus/nativeExitCode, but you always append (new_status, ""), which drops useful diagnostics. Returning a non-empty message (e.g., include c_status/exit_code) will make failures debuggable upstream.
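A sketch of a non-empty diagnostic message along these lines (function name and exact wording are illustrative):

```python
def diag_message(c_status, exit_code):
    """Build the message half of the (status, message) tuple Harvester
    expects, instead of always returning "" (illustrative sketch)."""
    if c_status == "exited" and exit_code not in (None, 0):
        return f"container exited with code {exit_code}"
    if c_status not in ("created", "running", "exited"):
        return f"container in unexpected state: {c_status}"
    return ""

print(diag_message("exited", 137))  # → container exited with code 137
```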

Comment on lines +53 to +66 (config/harvester/plugins/docker_monitor.py):
if c_status in ("created", "running"):
new_status = WorkSpec.ST_running
elif c_status == "exited":
if exit_code == 0:
new_status = WorkSpec.ST_finished
workSpec.nativeExitCode = 0
workSpec.nativeStatus = "exited(0)"
else:
new_status = WorkSpec.ST_failed
workSpec.nativeExitCode = exit_code
workSpec.nativeStatus = f"exited({exit_code})"
else:
new_status = WorkSpec.ST_failed
workSpec.nativeStatus = c_status
Copilot AI (Apr 23, 2026)

Docker can report container status as restarting (and some daemon versions report removing). Those currently fall into the else branch and mark the worker failed immediately. Consider treating transient statuses like restarting as ST_running (or at least not failing hard until it becomes terminal).
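One way to fold the transient statuses into the mapping (the ST_* strings below stand in for Harvester's WorkSpec constants):

```python
# Hypothetical stand-ins for WorkSpec.ST_running / ST_finished / ST_failed
ST_RUNNING, ST_FINISHED, ST_FAILED = "running", "finished", "failed"

def map_container_status(c_status, exit_code):
    """Map a Docker container status to a Harvester worker state, treating
    transient statuses ('restarting', 'removing') as still running."""
    if c_status in ("created", "running", "restarting", "removing"):
        return ST_RUNNING
    if c_status == "exited":
        return ST_FINISHED if exit_code == 0 else ST_FAILED
    return ST_FAILED  # e.g. 'dead'
```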

Comment thread on scripts/pandajob-submit (outdated), lines +50 to +59:
if args.script is not None:
script_path = os.path.abspath(args.script)
if not os.path.isfile(script_path):
print(f"ERROR: script not found: {script_path}", file=sys.stderr)
sys.exit(1)
transformation = "/bin/bash"
job_parameters = script_path
elif args.transformation is not None or args.params is not None:
transformation = args.transformation or "sh"
job_parameters = args.params or ""
Copilot AI (Apr 23, 2026)

In --script mode you submit an absolute path from the machine running pandajob-submit as job.jobParameters. With the new DockerSubmitter executing inside a Docker container, that path almost certainly doesn't exist inside the job container, and transformation=/bin/bash also won't exist on the default alpine image. Either clarify in the CLI help that --script requires the script to be present in the worker/container filesystem (and use /bin/sh for alpine), or implement staging/mounting so the script is actually available to the worker.
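The staging/mounting option could be sketched as a helper that builds `containers.run` kwargs binding the local script into the container (the `/stage` path and helper name are illustrative, not the PR's code):

```python
import os

def script_run_kwargs(script_path):
    """Build containers.run(...) kwargs that bind-mount a local script into
    the worker container, so --script paths actually exist there (sketch)."""
    abs_path = os.path.abspath(script_path)
    bind = "/stage/job.sh"
    return {
        # /bin/sh exists on alpine; /bin/bash does not
        "command": ["/bin/sh", bind],
        "volumes": {abs_path: {"bind": bind, "mode": "ro"}},
    }
```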

Comment thread on docker-compose.yml (outdated), lines +229 to +231:
# Install Python docker SDK into the harvester virtualenv
/opt/harvester/bin/pip install -q docker
set +e
Copilot AI (Apr 23, 2026)

The container startup script uses set -e and then runs /opt/harvester/bin/pip install -q docker before set +e. If the pip install fails (network/pypi outage), the whole container will exit immediately despite the later set +e. Either keep set -e and remove the misleading set +e, or explicitly handle the install failure (e.g., retry / fail with a clearer message / make it non-fatal if appropriate). Also consider pinning the docker SDK version to avoid CI breakage from upstream releases.

Suggested change. Replace:

    # Install Python docker SDK into the harvester virtualenv
    /opt/harvester/bin/pip install -q docker
    set +e

with:

    # Install Python docker SDK into the harvester virtualenv.
    # Retry transient network/PyPI failures and pin the version for reproducible startup.
    for attempt in 1 2 3; do
      if /opt/harvester/bin/pip install -q 'docker==7.1.0'; then
        break
      fi
      if [ "$attempt" -eq 3 ]; then
        echo "ERROR: Failed to install Python package 'docker==7.1.0' after ${attempt} attempts" >&2
        exit 1
      fi
      echo "WARNING: pip install of 'docker==7.1.0' failed on attempt ${attempt}; retrying..." >&2
      sleep 5
    done

@wdconinc wdconinc changed the base branch from pip-install-user-panda-client to main April 23, 2026 20:08
Replace SubprocessSubmitter/SubprocessMonitor with custom DockerSubmitter
and DockerMonitor plugins that run PanDA jobs as Docker containers on the
host daemon via a bind-mounted socket (/var/run/docker.sock).

Changes:
- config/harvester/plugins/docker_submitter.py: new plugin — starts a
  detached Alpine container per worker; reads transformation/jobPars from
  job spec and builds the container command via shlex.split
- config/harvester/plugins/docker_monitor.py: new plugin — polls container
  state; maps running→ST_running, exited(0)→ST_finished, else→ST_failed
- docker-compose.yml (harvester): install docker SDK at startup, chmod
  socket, write PID file for healthcheck; add socket and plugin volume mounts
- config/harvester/panda_queues.cfg: switch to DockerSubmitter/DockerMonitor/
  BaseMessenger (job completion driven by worker status, no jobReport needed)
- .github/workflows/ci.yml: wait for Harvester healthy before submitting;
  run alpine sh -c 'echo hello from alpine; uname -a' as integration test;
  raise job/CI timeout to 30 min

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wdconinc wdconinc force-pushed the feat/docker-in-docker branch from 0b9a976 to 2f2e541 Compare April 23, 2026 20:18
Add --container IMAGE to pandajob-submit and wire it through to DockerSubmitter,
mirroring PanDA's production container_name job spec field.

In production PanDA, jobs carry a container_name attribute that submitter plugins
use to select the container image per-job (e.g. via prun --containerImage). Harvester
surfaces this as job.jobParams["container_name"]. This commit replicates that pattern
in the panda-compose dev stack.

Changes:
- scripts/pandajob-submit: add --container IMAGE optional arg; when provided it sets
  job.container_name on the submitted JobSpec
- config/harvester/plugins/docker_submitter.py: extract _resolve_image() helper that
  reads job.jobParams.get("container_name"), strips any "docker://" prefix for Docker
  SDK compatibility, and falls back to the queue-level containerImage default
- .github/workflows/ci.yml: use --container python:3.12-alpine in the integration test
  to demonstrate per-job container selection (different from the queue default alpine:latest)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
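The _resolve_image() behavior described in the commit message can be sketched as follows (the constant name is illustrative; the fallback value follows the queue default mentioned above):

```python
QUEUE_DEFAULT_IMAGE = "alpine:latest"  # queue-level containerImage default

def resolve_image(job_params):
    """Pick the per-job container image from jobParams['container_name'],
    stripping any 'docker://' prefix for Docker SDK compatibility, and
    falling back to the queue default (illustrative sketch)."""
    image = job_params.get("container_name") or QUEUE_DEFAULT_IMAGE
    if image.startswith("docker://"):
        image = image[len("docker://"):]
    return image

print(resolve_image({"container_name": "docker://python:3.12-alpine"}))
# → python:3.12-alpine
```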
