
core: support automatic installation of custom CA certs. #7067

Merged
sipsma merged 9 commits into dagger:main on Apr 19, 2024

Conversation

@sipsma (Contributor) commented on Apr 9, 2024

Still some cleanup left, but good enough for feedback. TODOs:

  • finish description
  • dedupe some of the distro ca installation code
  • make CA installation/uninstallation non-fatal and cleaned up in case of failure partway through
  • make final call on whether to only selectively enable or just apply to all containers when custom CAs are installed in the engine (at /usr/local/share/ca-certificates)
    • I'm personally good with this now. These codepaths only get hit when users have a custom engine w/ CA certs installed and I made the install best-effort with cleanup in case of errors. Still open to feedback from others on this though.
  • expand test coverage a bit more (more debian/rhel derivatives, cover cleanup in case of install failure)
  • test ubuntu gpu image (manually for now since we don't have any automated tests for that yet)
  • double check umask clearing mentioned here: #7067 (comment)
  • add more comments to extra gnarly/subtle parts of code

This adds support for automatic installation of custom CA certs into containers. Originally this was gonna just be the first step outlined here, but as I started implementing the CA certs part it involved enough complexity that it seemed best to just implement full support for automatic CA installation for all containers.

Helpful docs/prior art

CA Certs

I learned a lot about the various nuances surrounding CAs as part of this (and I'm sure I've only scratched the surface), braindump below.

CA cert configuration is not standardized in any way; how certs end up configured depends on a few (sometimes orthogonal) factors:

  1. Consuming software/TLS library
    • "Generic" - includes openssl and many/most language stdlibs
      • The closest thing to a generic default is a "bundle" file: a single file containing every CA cert that should be trusted, alongside a dir of individual files for each of those CAs plus hash-named symlinks pointing at them (created with the c_rehash command)
      • Typically there will also be a command that can be used to automatically update the bundle file and CA dir when custom CAs are installed in another directory.
      • The location of the bundle file and other dirs, and the name of the command to update CAs, depends on the distro. Typically the distro will compile packages like curl with options that point to the right bundle file and/or CA dir to use (see the sketch after this list).
    • PKCS #11 and p11-kit
      • Have not gone too deep into this yet, but my limited understanding is that PKCS #11 is an attempt at a standard around trust/key management that includes CA certs. Not all distros/libraries use the standard, but those that do can load CA certs configured via the trust or update-ca-trust command.
      • RHEL-based distros typically use this for example
      • The trust command has an extract subcommand that creates a layout compatible with the above "generic" setup, for better compatibility with libs that don't use libp11-kit
    • Other TLS implementations like GnuTLS and LibreSSL do their own thing
    • The JVM has its own CA cert store
    • Browsers typically also have their own CA store
    • Different versions of a given TLS lib may also have their own CA trust store (e.g. the alpine base image has both /etc/ssl/certs and /etc/ssl1.1/certs)
  2. Distro
    • As mentioned above, all the paths used for each TLS implementation (bundle file, CA dir, etc.) depend on compile-time options set by each distro when building packages
    • CA functionality is typically implemented as optional packages (rather than as pseudo-required packages like libc) and whether said package is installed in base images varies from distro to distro:
      • the RHEL-based base images seem to usually come with ca-certificates fully installed
      • debian comes without it installed
      • alpine does not have it installed, but nonetheless ships a base bundle file (though no separate cert files)
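To make the distro differences concrete, here's an illustrative table of the typical layouts for the three families handled so far. This is a sketch of common defaults, not necessarily what the engine hardcodes, and exact paths/commands can vary by distro version:

// Illustrative only: typical CA layouts for the distro families handled so
// far. Exact paths, package names, and update commands can vary by version;
// treat these as assumptions rather than the values the engine hardcodes.
package cacerts

type caLayout struct {
	BundleFile    string   // concatenated PEM bundle most TLS libs are compiled to read
	CustomCertDir string   // where custom CA certs get dropped before running UpdateCmd
	UpdateCmd     []string // regenerates the bundle + hashed symlink dir from CustomCertDir
}

var commonLayouts = map[string]caLayout{
	"debian": { // Debian/Ubuntu and derivatives (needs the ca-certificates package)
		BundleFile:    "/etc/ssl/certs/ca-certificates.crt",
		CustomCertDir: "/usr/local/share/ca-certificates",
		UpdateCmd:     []string{"update-ca-certificates"},
	},
	"alpine": { // base image ships a bundle file but not the update tooling
		BundleFile:    "/etc/ssl/certs/ca-certificates.crt",
		CustomCertDir: "/usr/local/share/ca-certificates",
		UpdateCmd:     []string{"update-ca-certificates"},
	},
	"rhel": { // RHEL/Fedora/CentOS derivatives, via p11-kit
		BundleFile:    "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem",
		CustomCertDir: "/etc/pki/ca-trust/source/anchors",
		UpdateCmd:     []string{"update-ca-trust", "extract"},
	},
}

Note the rhel entry goes through p11-kit's extracted layout, which is what keeps the trust/update-ca-trust flow compatible with the "generic" consumers mentioned above.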

Implementation Notes

  • Automatic install
    • If the user puts certs at /usr/local/share/ca-certificates in their custom engine image, the engine will attempt to install them in all containers (but only if it can identify the container's distro; the fallback is a no-op right now)
    • I'd like to leave this in, but it requires that any failure during install/uninstall can be made non-fatal and cleaned up. Otherwise I'll change this to be configurable somehow.
  • Uninstall + cleanup
    • It's important that we follow the "leave no trace" rule here; any CAs that we autoinstall in a container should be auto-uninstalled so they don't pollute the final container result unexpectedly
    • This becomes extra complicated in cases where the user's exec actually modifies the state of bundle files/CA dirs, in which case we also want to be very careful not to uninstall anything that the exec purposefully modified.
      • This isn't an obscure case; e.g. if you apk add curl off the base alpine image, the ca-certificates package will be installed as a dep and will modify the bundle file and CA dir.
  • Execution location
    • At first, I implemented all this in the shim process that we currently run as the init of each container. However, this didn't end up working/was not ideal:
      1. If the container user was not root, we would not have permissions to modify certain files/dirs and couldn't run update commands like update-ca-certificates
      2. We have a long-term goal of removing the need to inject our own init to each container (Remove need to run shim inside Exec container #4989), but this approach buries us even deeper into that dependency
    • Instead, we had the option to either A) run it exclusively in the part of the shim that executes outside the container or B) implement our own custom buildkit Executor (which runs in the dagger-engine process). I ended up going with the Executor approach for a few reasons:
      1. In the long-term, we can migrate all of the current shim functionality to this Executor. This should result in a much simpler overall architecture since we will be able to avoid all the messy hacks around passing special env vars to the shim process and trying to sneak uncached data into the container via ftp_proxy
      2. As mentioned below, ca cert install/uninstall may need to run actual commands as root inside the container, which is much simpler to implement from the Executor relative to the shim (which would need to have new logic for creating a separate bundle for each command, whereas the Executor already has all that logic)
    • The biggest annoyance here is that we need to read/write certain files/dirs from the container before/after the container has actually run, which meant either A) creating all the container mounts temporarily (and unmounting them before the container starts to avoid various corner cases around multiple mounts of overlay) or B) a "vfs" approach that maps read/write operations on files/dirs to the correct underlying source mount even though those haven't been mounted yet.
      • I went with B) because A) is a bit too fraught with weird cases and felt like more performance overhead (mount/unmount can be expensive in some cases). This is implemented in cmd/engine/cacerts/fs.go as containerFS. The main complication comes from needing to very carefully handle symlinks so that any symlinks in parent dirs of a given path are always resolved correctly to the final mount where they actually point.
  • Commands
    • Have to support running commands like update-ca-certificates by actually starting a separate container that reuses the same rootfs+mounts as the "actual" user exec container.
    • This is done at the Executor level so there are no extra layers and overhead is relatively minimal, but it still does impose some latency unfortunately. This currently only happens when the engine actually has custom certs installed, so it's relatively safe and won't slow down all users immediately.
    • I think we can optimize this in time to skip running these commands in cases where we are sure we can manually implement the same functionality via file/dir operations directly. Went with this approach for now to be extra safe to start.
  • Distro detection
    • We need to positively identify which distro is in use because some base images (e.g. debian) don't come with any bundle file/CA dir/etc., but other packages may still try to check those paths, so we need to manually set them up in those cases.
    • This requires checking various /etc/ files. Used https://github.com/dekobon/distro-detect/blob/master/linux/distrochecks.go as a reference
    • Currently only implemented support for alpine/debian-derivatives/rhel-derivatives (a minimal detection sketch follows below)
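As a rough illustration of what the detection boils down to (this is a sketch, not the actual implementation, which also consults other /etc files and handles more edge cases), matching ID/ID_LIKE from /etc/os-release is the core of it:

package cacerts

import (
	"bufio"
	"os"
	"strings"
)

// detectDistroFamily is an illustrative sketch of distro detection via
// /etc/os-release (ID and ID_LIKE); field handling and fallbacks are
// deliberately simplified.
func detectDistroFamily(rootfs string) string {
	f, err := os.Open(rootfs + "/etc/os-release")
	if err != nil {
		return "" // unknown distro -> CA install becomes a no-op
	}
	defer f.Close()

	fields := map[string]string{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if k, v, ok := strings.Cut(scanner.Text(), "="); ok {
			fields[k] = strings.Trim(v, `"`)
		}
	}

	// Check ID first, then ID_LIKE, so derivatives (e.g. ubuntu -> debian,
	// rocky -> rhel) map onto a known family.
	for _, id := range append([]string{fields["ID"]}, strings.Fields(fields["ID_LIKE"])...) {
		switch id {
		case "alpine", "debian", "rhel", "fedora", "centos":
			return id
		}
	}
	return ""
}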

@sipsma requested review from vito and jedevc on Apr 9, 2024 18:13
Review thread on cmd/shim/main.go (outdated, resolved)
@sipsma added this to the v0.11.1 milestone on Apr 9, 2024
@sipsma marked this pull request as ready for review on Apr 10, 2024 00:09
@vito (Contributor) commented on Apr 10, 2024

Uninstall + cleanup

[...]

Oof, yeah that sounds rough. Is there any way we can isolate "our" changes to a separate layer and then rebase it out somehow? (Haven't looked how it works yet, maybe it's not so bad, just thinking out loud)

Commands

[...]

Is the update-ca-certificates operation cached? i.e. if I repeatedly use the same base image, will it only have to have run once?

Or does it install-then-uninstall on each withExec or something like that? (Guessing that's the case based on the description?)

Would it be too weird to install certs as part of .from and then somehow "drop" the layer at export/publish time? (Building on the previous question)

@sipsma (Contributor, Author) commented on Apr 10, 2024

Uninstall + cleanup
[...]

Oof, yeah that sounds rough. Is there any way we can isolate "our" changes to a separate layer and then rebase it out somehow? (Haven't looked how it works yet, maybe it's not so bad, just thinking out loud)

Not that I am aware of. At the beginning I was trying to use overlay mounts to do that but quickly realized that:

  1. overlay is not always available
  2. we'd need to do some annoying stuff to make sure we always use the right set of overlay options (which can vary host to host and depending on the setup; doable but complicated)
  3. it doesn't work for bundle files since those can be re-written with new contents by the user's exec, but overlay's granularity is at the whole-file level, whereas we need to look at the "diff" in terms of contents in this case (see the sketch below)
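To illustrate the contents-level comparison (just a sketch, not the code in this PR): treat the bundle as a set of PEM blocks, so the certs we injected can be stripped afterwards without touching any certs the exec added on its own:

package cacerts

import (
	"bytes"
	"encoding/pem"
)

// pemBlocks splits a bundle into its individual PEM-encoded certs.
func pemBlocks(bundle []byte) [][]byte {
	var blocks [][]byte
	for {
		block, rest := pem.Decode(bundle)
		if block == nil {
			break
		}
		blocks = append(blocks, pem.EncodeToMemory(block))
		bundle = rest
	}
	return blocks
}

// removeInjected returns the post-exec bundle with only the auto-installed
// certs stripped, preserving any other changes (e.g. certs added because the
// exec installed the ca-certificates package itself).
func removeInjected(bundleAfterExec []byte, injected [][]byte) []byte {
	isInjected := map[string]bool{}
	for _, b := range injected {
		isInjected[string(b)] = true
	}
	var out bytes.Buffer
	for _, b := range pemBlocks(bundleAfterExec) {
		if !isInjected[string(b)] {
			out.Write(b)
		}
	}
	return out.Bytes()
}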

Commands
[...]

Is the update-ca-certificates operation cached? i.e. if I repeatedly use the same base image, will it only have to have run once?

Or does it install-then-uninstall on each withExec or something like that? (Guessing that's the case based on the description?)

The update commands have to re-run on each withExec, yeah; but:

  1. they will be cached in the same way the withExec is cached since buildkit's solver doesn't know about any of this
  2. changing the CA certs won't invalidate the cache (mainly matters for remote cache cases)
  3. there's only one layer for the exec overall; we aren't adding more layers each time these update commands run
  4. this will all only get triggered if the user has a custom engine with custom CAs installed, so won't leak out to every user yet

Would it be too weird to install certs as part of .from and then somehow "drop" the layer at export/publish time? (Building on the previous question)

I considered that too but ruled it out since I want to avoid leaving those files around even in the intermediate layers. Leaving them in will still leak them to the layers of the published image (even if we append an extra layer on top cleaning them up).

  • There's also not really any way to hack around that, since the snapshotter correctly prevents you from modifying layers after they are committed.

I think the best we can do is optimize in follow-ups to not always run these update commands and instead selectively fix up files/dirs directly in the same way the update command would. We can obviously only do this when we are really sure we know what the update command does (i.e. probably have to match on distro versions too), so we still need to support running these commands in each withExec for the other cases.

bklog.G(ctx).Debugf("> creating %s %v", id, process.Meta.Args)
defer bklog.G(ctx).Debugf("> container done %s %v", id, process.Meta.Args)

if installCACerts {
@sipsma (Contributor, Author) commented on Apr 10, 2024:
Just a note for reviewers: the code in this file is mostly the same as upstream's runc executor, except for this section that handles CA cert installation and some other minor rearrangement/cleanup of irrelevant codepaths.

@vito (Contributor) left a review:
Verified tests fail if I comment out an important line (the one that installs the CA into the engine). Obviously there's a lot going on here but it's fully justified afaict. :laughcry:

Review thread on cmd/shim/main.go (outdated, resolved)
WithMountedFile("/bin/dagger", daggerCliFile(t, c)).
WithEnvVariable("_EXPERIMENTAL_DAGGER_CLI_BIN", "/bin/dagger").
WithEnvVariable("_EXPERIMENTAL_DAGGER_RUNNER_HOST", "tcp://engine:1234").
WithEnvVariable(executeTestEnvName, "ya").
Review comment from a contributor:
😁

Review thread on cmd/engine/executor.go (resolved)
@sipsma (Contributor, Author) commented on Apr 10, 2024

Just FYI that I was adding more tests last night and did find some cases where we were incorrectly leaving around some parent dirs. This is pretty minor, so should be fine to address in a follow-up.

I am getting close to finishing the fix there, but am mentioning this since apparently my plane's wifi is gonna cut off once we are too far into the Pacific Ocean... so we'll see if I have enough time to fix it before my vacation actually starts. Probably should work out.

@jedevc (Member) commented on Apr 15, 2024

I'm gonna take this out of the milestone - given @sipsma is now on holiday, I'd rather we wait to merge this.

@jedevc removed this from the v0.11.1 milestone on Apr 15, 2024
sipsma and others added 8 commits April 18, 2024 14:18
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
I had temporarily added a unique exit code per shim error because I was
hitting errors there during development but could not actually see the
error message, only the exit code (likely due to the integ test needing
to run against a dev engine service, thus increasing nesting).

However, that wasn't really maintainable. This changes the shim to
always try to write errors, including panics, to the best available
error writer at the time, which is usually enough to actually see the
error.

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma (Contributor, Author) commented on Apr 19, 2024

Fixed the aforementioned issue of leaving parent dirs around and not resetting mtime on files (which resulted in the diff being non-empty when it should be empty), and added test coverage for all that. Will merge this once green. There will be a few more immediate follow-ups to add GOPROXY and http proxy support too.
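(For reference, the mtime part boils down to something like this sketch; illustrative only, not the exact helper in the PR:)

package cacerts

import (
	"os"
	"time"
)

// restoreMtime puts a file's modification time back to its pre-exec value so
// an otherwise-identical file doesn't show up in the snapshot diff.
func restoreMtime(path string, original os.FileInfo) error {
	// atime isn't tracked here, so just set it to "now".
	return os.Chtimes(path, time.Now(), original.ModTime())
}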

…o more funcs is worth it

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma (Contributor, Author) commented on Apr 19, 2024

The test failure is #6111, which is being debugged separately in #7128, so gonna merge as-is rather than keep rerolling forever.

@sipsma merged commit 5e933ab into dagger:main on Apr 19, 2024
45 of 46 checks passed