Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine client: fix env propagation to docker subcommand #7006

Merged
merged 1 commit into from
Apr 4, 2024

Conversation

sipsma
Copy link
Contributor

@sipsma sipsma commented Apr 3, 2024

Fixes #6989, confirmed with a (particularly gnarly) integ test the integ test didn't work only in the test workflow on our EKS engine so I had to remove it but still confident this fixes it, see comments below

Signed-off-by: Erik Sipsma <erik@dagger.io>
@sipsma sipsma added this to the v0.11.0 milestone Apr 3, 2024
@sipsma sipsma requested review from jedevc and vito April 4, 2024 00:28
@sipsma
Copy link
Contributor Author

sipsma commented Apr 4, 2024

The integ test is passing locally but failing in CI due to docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: failed to write "a *:* rwm": write /sys/fs/cgroup/devices/docker/b175032dc2eda803243528897adb744ba9957ed50bca77896bd78a68030cf52a/devices.allow: operation not permitted: unknown 🤮

It's probably fixable with the right magic configuration to disable cgroups somewhere in the dockerd service. Will push a fix and merge after.

@sipsma sipsma force-pushed the fix-docker-host-with-cloud-token branch 3 times, most recently from f64527b to bfd45f9 Compare April 4, 2024 04:20
@sipsma
Copy link
Contributor Author

sipsma commented Apr 4, 2024

Uh okay, the cgroup thing is a real nightmare. Unfortunately I'm just going to fall back to rm'ing the test for now. I know the issue is fixed and there's at least a comment on the modified code to reduce the chances of the fix getting lost someday.

The failure here only happens in the test workflow (not locally and not in the testdev workflow). My best guess is that:

  • Something in the oci runtime config created for our engine in K8s (the one that gets hit in the test workflow) is not granting full permissions to all devices in its cgroup
    • could also be a kernel config or a seccomp/apparmor/etc. thing
  • That results in dockerd's attempt to configure rwm for all devices on the privileged dagger engine container to fail because you can't increase permissions beyond your parent cgroup

I really went down a path of true insanity, to the point of configuring the dockerd service with a custom oci runtime implemented as a shell script that rewrote the oci runtime config of containers to remove that device cgroup permission, but even that didn't work for some reason?

So this isn't worth blocking on. Unfortunately I don't really see any paths to unit testing this that wouldn't just devolve into their own forms of insanity, so I think merging without the test is the least worst option atm.

@sipsma sipsma force-pushed the fix-docker-host-with-cloud-token branch from bfd45f9 to 0fb69e7 Compare April 4, 2024 04:38
@sipsma sipsma merged commit bbe70d5 into dagger:main Apr 4, 2024
80 of 82 checks passed
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🐞 DOCKER_HOST is not respected when DAGGER_CLOUD_TOKEN is set
2 participants