install to-disk with LUKS + TPM broken #421

Closed
jmpolom opened this issue Mar 22, 2024 · 17 comments
Labels: area/install (Issues related to `bootc install`), area/osintegration (Relates to an external OS/distro base image), triaged (This looks like a valid issue)

Comments

@jmpolom

jmpolom commented Mar 22, 2024

Does bootc install to-disk --block-setup tpm2-luks /dev/diskX actually work? I tried this in a qemu virtual machine with emulated TPM (via swtpm) and while it installed successfully, upon rebooting the VM into the freshly installed OS the systemd-cryptsetup units failed to decrypt the LUKS volume. Has this actually been tested or otherwise known to work? I will try on real hardware but this has me concerned this feature is not really in a functional state.

I tested with vanilla Fedora 39 Server to try and rule out this being related to the use of a virtual machine with emulated TPM. After installing tpm2-tools, adding the tpm2-tss dracut modules, and running systemd-cryptenroll for the LUKS volume I had an installation that repeatedly would unlock automatically via the TPM at boot (no password and no failures). Also tried with Fedora 39 Silverblue (added modules to initrd and enabled custom initramfs generation with rpm-ostree) -- same results. In both cases the LUKS volume was enrolled after the installed OS was provisioned and booted for the first time although I really doubt that has any effect on anything. I do not believe the test setup (IE: emulated TPM) is the problem here though.
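
As a rough sketch of the post-install enrollment described above, the helper below only prints the commands rather than running them. The device path and the dracut drop-in name are assumptions for illustration; the report does not give them.

```shell
# Print (do not run) the steps for binding an existing LUKS2 volume to the TPM.
# The drop-in name tpm2.conf is a hypothetical choice.
print_tpm2_enroll_steps() {
    luks_dev="$1"
    # Userspace TPM2 tools, as mentioned in the report.
    printf '%s\n' "dnf install -y tpm2-tools"
    # Pull the TPM2 stack into the initramfs, then rebuild it.
    printf '%s\n' "echo 'add_dracutmodules+=\" tpm2-tss \"' > /etc/dracut.conf.d/tpm2.conf"
    printf '%s\n' "dracut --force"
    # Bind the volume to the TPM; with no --tpm2-pcrs= given this uses PCR 7.
    printf '%s\n' "systemd-cryptenroll --tpm2-device=auto $luks_dev"
}

print_tpm2_enroll_steps /dev/vda3   # hypothetical device
```

On an rpm-ostree based system such as Silverblue, the initramfs would presumably be regenerated through rpm-ostree (for example `rpm-ostree initramfs --enable`) instead of calling dracut directly.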

Eventually dracut times out and drops into a rescue shell in the initrd. The cryptsetup unit failed with a "Current policy digest does not match stored policy digest, cancelling TPM2 authentication attempt." error. Further, an error of "No passphrase or recovery key registered" is also printed. I don't think this is a PCR issue.

Some observations:

  • The latter of these issues (lack of failover way to unlock) is most certainly a bug in this install path. If the means for unlocking the LUKS volume will be the TPM, a recovery key must be set to allow the system to boot in the event PCRs change. Alternatively, allow the user to provide a normal password. One or both of these failover methods needs to be supported at install time (not after).
  • It is not clear from the documentation or the CLI what PCRs the volume gets bound to in the TPM. The defaults should be documented and also user configurable. Right now, based on reading the source, it looks like the LUKS volume binds to PCR 7, the systemd-cryptenroll default.
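
To see which PCRs an enrolled volume is actually bound to, one option is to inspect the LUKS2 token metadata. The sketch below assumes the systemd-tpm2 token stores its PCR list in a `tpm2-pcrs` JSON field and that the installed cryptsetup supports `luksDump --dump-json-metadata`; the device path in the usage comment is hypothetical.

```shell
# Read LUKS2 JSON metadata on stdin and print the PCR list of the
# systemd-tpm2 token (for example "7"). Prints nothing if the field is absent.
tpm2_pcrs_from_luks_json() {
    # Strip whitespace so pretty-printed JSON matches a single pattern.
    tr -d ' \t\n' |
        grep -o '"tpm2-pcrs":\[[0-9,]*]' |
        sed -e 's/.*\[//' -e 's/]$//'
}

# Usage on a real system (needs root):
#   cryptsetup luksDump --dump-json-metadata /dev/vda3 | tpm2_pcrs_from_luks_json
```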
jmpolom changed the title from "install to-disk with LUKS + TPM" to "install to-disk with LUKS + TPM broken" Mar 24, 2024
@jmpolom
Author

jmpolom commented Mar 24, 2024

I am able to unlock the LUKS volume via TPM on a live system (FCOS) booted on the same VM I used to perform the bootc install to-disk --block-setup tpm2-luks from. It looks like the TPM2 binding on the volume is valid.

I think my proposal is if it remains possible to install to a block device and into a LUKS volume, a user supplied password ought to be an option (it appears this would be a straightforward modification). The LUKS password can be changed or removed later on once the system is installed. If a user opts in to foregoing a password by not specifying one, then you would get the current behavior with no means of recovery. This would necessitate dropping the headless karg which should not be default either as it could severely hinder developmental debugging. Perhaps a second improvement would be to support a recovery key method but I see that as completely secondary to supporting a means to setup the volume with a plain ol password.

cgwalters added the area/install and triaged labels Mar 25, 2024
@jmpolom
Author

jmpolom commented Mar 25, 2024

Looks like this was observed with bootc 0.1.7 on an image created from the treefile configuration at this point that resulted in this container image

@cgwalters
Collaborator

Thanks for filing this, indeed we have a CI gap on this here.

(The integration of bootc install needs to be at least partly owned by the particular OS you're using; we will try to be generic where we can of course).

@jmpolom
Author

jmpolom commented Mar 25, 2024

There needs to be a failover way to unlock any LUKS volumes bound to tokens at install time. Right now bootc actually creates a temp password and then discards it. At a very basic level this could just be output to the console as a very crude method to provide a way in if the TPM method isn't working. This is a basic issue with the to-disk install path. There really should be some user configurable options though.

Unless the thought is bootc will only handle the bottom half of the logic (installing to a pre-made filesystem) and to-disk install path gets removed??

> The integration of bootc install needs to be at least partly owned by the particular OS you're using

This was tested with Fedora 39 which has plenty of ostree based releases so I don't think that is necessarily the issue. Are there specific additional integrations needed (link to docs please)?

Based on my detailed and exhaustive review of the existing state of the art (IE: everything here), I'm really coming up short for why this specific configuration isn't working. Looking at how the LUKS configuration is passed to the initrd via kargs and my double checking that the necessary libs are present in the initrd, the failure to unlock really doesn't make sense.

Are any modifications beyond adding a few dracut modules needed to support TPM2 based unlocking with systemd-cryptsetup in the initrd? This is about what I'd expect to need to do to preconfigure an initrd with the proper libs. I've done this before on Debian and prior versions of Fedora successfully.

Hazard a guess at what else should be checked here?

@cgwalters
Collaborator

> There needs to be a failover way to unlock any LUKS volumes bound to tokens at install time.

For some use cases, all data on disk is effectively a cache, and having a failover adds risk and management overhead - questions around how is the failover secret rotated, etc.

> Unless the thought is bootc will only handle the bottom half of the logic (installing to a pre-made filesystem) and to-disk install path gets removed??

There is some strong inherent tension here but the overall thought on the design is that to-disk is designed for the simple cases (obviously, "plain filesystem" is very simple), and the current tpm2-luks is arguably on the bounds of "simple" - and you can use to-filesystem to set things up however you want for all the other cases.

@jmpolom
Author

jmpolom commented Mar 26, 2024

> > There needs to be a failover way to unlock any LUKS volumes bound to tokens at install time.

> For some use cases, all data on disk is effectively a cache, and having a failover adds risk and management overhead - questions around how is the failover secret rotated, etc.

Valid concern for that specific case but that has got to be one hell of an edge case. Even if that is a main use case it creates a situation that is so unnecessarily difficult to debug that there needs to be an alternative to enable basic development. Such behavior that discards/skips/doesn't support a manual failover mechanism needs to be explicitly opted into by the user. Or the silly default must have an opt out. Right now it is forced opt in which is a problem.

> and the current tpm2-luks is arguably on the bounds of "simple"

I'd encourage you to review all the logic in the supporting libraries and systemd itself for this to work. It is anything but simple and it increases the dependency footprint substantially (proper libs need to end up in the initrd, for instance). Far more complex than a password.

@jmpolom
Author

jmpolom commented Mar 26, 2024

Tried to bootc install to-disk --block-setup tpm2-luks with the latest centos-bootc fedora-eln image (sha256: 174fb00e242e7aaa2d9c5f34056caea7fd726433949c0dedd12158aa5e6b1d0f) and it fails to unlock the root volume upon reboot. Failed systemd-cryptsetup unit trying to unlock via TPM2.

I tried the build from last week previously, and it successfully unlocked the LUKS root volume on reboot (sha256:1c5e91ab395665ca11e2c1a17df18beec39d63fd2948f15add9fd95e45c0c85b). Clearly there have been some package changes in the past few days that broke this??

While waiting for this update, I also tried both F39 and F40/next based builds with the latest bootc from the copr. Same story with those -- failed systemd-cryptsetup units trying to unlock via TPM.

@cgwalters
Collaborator

cgwalters commented Mar 26, 2024

Thanks Jon, this is very valid feedback and thanks for looking at this.

> Valid concern for that specific case but that has got to be one hell of an edge case.

For example in cloud environments I may want to enable LUKS to be very sure my data is encrypted, and binding to the virtualized TPM2 is a generic baseline for that that at least helps ensure that if e.g. someone gets access somehow to an underlying block store they can't read the data. I am sure some people want a fallback password even in cloud, but it's not very "cloud native" to log in interactively on a console in an IaaS.

> Even if that is a main use case it creates a situation that is so unnecessarily difficult to debug that there needs to be an alternative to enable basic development. Such behavior that discards/skips/doesn't support a manual failover mechanism needs to be explicitly opted into by the user.

Yes, fair enough.

OK so I think my inclination here is to make tpm2-luks require a flag in the install config in the container image. This way, the OS container image creator must opt-in to saying they support it (e.g. they have the required components in the initramfs, etc.)

Then in parallel of course, we should:

  • Make this work
  • Expand on doing things like this via bootc install to-filesystem to be clear that arbitrarily complex storage setups can be owned by codebases other than bootc

@jmpolom
Author

jmpolom commented Mar 26, 2024

> For example in cloud environments I may want to enable LUKS to be very sure my data is encrypted, and binding to the virtualized TPM2 is a generic baseline for that that at least helps ensure that if e.g. someone gets access somehow to an underlying block store they can't read the data. I am sure some people want a fallback password even in cloud, but it's not very "cloud native" to log in interactively on a console in an IaaS.

A great use for the recovery key option. If bootc could cleanly output the recovery key a secondary process could store it in a secrets vault. You really cannot have a TPM2 only binding. Even systems like Windows will provide recovery keys for when the PCRs change (as they are designed to do). Some deployments will want to bind to a lot more than PCR 7 and some of those PCRs may change even on OS update.

> OK so I think my inclination here is to make tpm2-luks require a flag in the install config in the container image. This way, the OS container image creator must opt-in to saying they support it (e.g. they have the required components in the initramfs, etc.)

There would also need to be a flag that could either have the system create and provide at install time a temp password or enroll a recovery key. Ideally both should be supported (recovery keys are lengthy and would be particularly obnoxious to deal with for repetitive early stage testing) at the user's or image builder's discretion.
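
For comparison, this is roughly how either fallback can be enrolled by hand with systemd-cryptenroll today. It is a sketch of a manual step, not something bootc exposes; the helper name is hypothetical, and both commands need an existing credential to authorize the new enrollment.

```shell
# Build the systemd-cryptenroll command for a fallback unlock method.
# $1: "recovery-key" or "password"; $2: LUKS device path (caller-supplied).
fallback_enroll_cmd() {
    case "$1" in
        recovery-key) printf '%s\n' "systemd-cryptenroll --recovery-key $2" ;;
        password)     printf '%s\n' "systemd-cryptenroll --password $2" ;;
        *)            echo "unknown fallback kind: $1" >&2; return 1 ;;
    esac
}
```

The recovery key variant prints a generated key on the console, which is the piece a secondary process would have to capture and store.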

Specifically as to the initramfs components, the requirements need to be documented somewhere. Just having other works (like the centos-bootc) to reference isn't a great experience. It isn't entirely straightforward exactly what dracut modules should be added in. It's a bit different configuring the initramfs here because the image is being composed off board from the system it will run on, so any auto detection of things isn't going to work.

> Expand on doing things like this via bootc install to-filesystem to be clear that arbitrarily complex storage setups can be owned by codebases other than bootc

Personally I would perhaps consider just removing the to-disk workflow. I'm not sure what value it brings if it can only handle trivially simple deployments. Is the juice from the added complexity actually worth it? I think it might be hard to avoid feature/complexity creep on what looks like a bare metal installer feature.

Might be easier to just document a bit better how to use to-filesystem with an external workflow to prepare the disks and have a reference shell script that can be included in an image build to do this. Something to think about.
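
A reference script along those lines might look something like the following. This is an untested sketch that only prints a plan. It assumes the disk is already partitioned into an ESP, a separate /boot, and a root partition, that XFS and ext4 are acceptable filesystems, and that bootc is invoked from the target image's environment with /mnt visible to it (that wrapper is omitted). The mapper name `cryptroot` is arbitrary.

```shell
# Print (do not run) a plan for preparing a LUKS root by hand and then
# handing the mounted tree to bootc. All device paths are caller-supplied.
print_luks_to_filesystem_plan() {
    esp="$1"; boot="$2"; root="$3"
    cat <<EOF
cryptsetup luksFormat --type luks2 $root
cryptsetup open $root cryptroot
mkfs.xfs /dev/mapper/cryptroot
mkfs.ext4 $boot
mkfs.fat -F 32 $esp
mount /dev/mapper/cryptroot /mnt
mkdir -p /mnt/boot
mount $boot /mnt/boot
mkdir -p /mnt/boot/efi
mount $esp /mnt/boot/efi
bootc install to-filesystem --karg=rd.luks.uuid=\$(cryptsetup luksUUID $root) /mnt
EOF
}
```

Here `cryptsetup luksFormat` prompts for a passphrase, which then serves as the manual fallback. TPM2 enrollment could be added after first boot with systemd-cryptenroll, as in the Fedora tests earlier in this thread.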

cgwalters added a commit to cgwalters/bootc that referenced this issue Mar 26, 2024
This allows the container image builder more control over
`bootc install to-disk` in the installation config.  Per discussion in
containers#421
this one definitely requires integration by the base image,
and not all of them will want it.

(Or if they do want LUKS, they may want more control over it)

The default value is `block: ["direct"]` which only enables
the simple filesystem install.

This change allows two different things:

`block: []`

With this, `bootc install to-disk` will just error out.  It's
a way to effectively disable it for those that want to use
an external installer always.

Another possibility is:

`block: ["direct", "tpm2-luks"]`

To explicitly re-enable the builtin tpm2-luks flow.

Or, one could do just `block: ["tpm2-luks"]` to enforce encrypted installs.

Signed-off-by: Colin Walters <walters@verbum.org>
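
For illustration, an image build step that opts back in to the tpm2-luks flow might look like the sketch below. It assumes the install config is read from TOML drop-ins under /usr/lib/bootc/install/ with an `[install]` table; the drop-in name 50-block.toml and the helper name are hypothetical.

```shell
# Write a bootc install config drop-in into an image root directory.
# $1: root of the image filesystem being built (use "" for "/").
write_block_config() {
    rootdir="$1"
    mkdir -p "$rootdir/usr/lib/bootc/install"
    cat > "$rootdir/usr/lib/bootc/install/50-block.toml" <<'EOF'
[install]
block = ["direct", "tpm2-luks"]
EOF
}
```

Writing `block = []` instead would correspond to the case in the commit message where `bootc install to-disk` is disabled entirely.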
@cgwalters
Collaborator

#445 will effectively turn this off by default for now.

> Personally I would perhaps consider just removing the to-disk workflow.

I find it extremely useful as it provides a generic baseline, allowing a container image to self-install onto a block device without any other externally versioned infrastructure. (It also tries hard to force configuration to come from the container image by default).

Now, I did also file #440 which would make it much easier for containers to configure things in arbitrary ways.

@jmpolom
Author

jmpolom commented Mar 27, 2024

Looks like between tags eln-1710868505 and eln-1711401621 the fedora-bootc image began exhibiting the same failures with the systemd-cryptsetup units on boot after install. Any ideas what changes may have caused this? I saw the same issue with a pretty plain F39 based image.

cgwalters added the area/osintegration label Mar 29, 2024
cgwalters added a commit to cgwalters/bootc that referenced this issue Apr 2, 2024
@jmpolom
Author

jmpolom commented Apr 16, 2024

@cgwalters any ideas on this one? What might be causing systemd-cryptsetup to fail at unlocking the LUKS volume bound to the TPM? I do not notice this issue on non-bootc Fedora ostree systems when binding the root LUKS volume to the TPM with systemd-cryptenroll.

My personal opinion is adding an option to "opt into" supporting LUKS volumes is a bandaid/completely wrong response to the issue described here. I do not view this as a functional improvement. TPM2 bound LUKS volumes with systemd-cryptenroll work in other/normal Fedora distros. It should also work here. If there are specific additional steps needed for it to work, those need to be documented. A failover means of unlocking via a recovery key or plain pre-set random password also must be included here by default.

@jmpolom
Author

jmpolom commented Apr 18, 2024

Update: The culprit appears to be a shim-x64 package update. Downgrading shim-x64 to 15.6-2 resolves this failure to unlock the LUKS root volume. This was on a system (vm) that does not support secure boot which makes no sense to me. The issue was also observed on metal/hardware with a TPM.

Ultimately the failure to unlock was caused by disagreeing PCR 7 hashes, so the failure itself was valid (ref to systemd issue). I ran the bootc install from a live version of Fedora CoreOS that had shim-x64-15.6-2 while the images I was deploying all had shim-x64-15.8-3. Is shim version agreement necessary between the installation environment and booted system in order to ensure TPM PCRs do not change? Testing from a more recent Fedora CoreOS ISO with shim-x64-15.8-3 strongly suggests shim version agreement is necessary between the installation environment and the installed system to prevent a change to PCR 7 hashes.

I am not a system firmware and TPM expert so I do not really know what is normal behavior here. It seems odd that a shim update would cause a TPM PCR to roll particularly on a system that does not support secure boot. If version agreement is required though, bootc needs to check for that between the install environment and deployed image.
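
One way to check for this kind of PCR 7 skew is to record the digest in the install environment and again in the booted system (or its initrd rescue shell), then compare the two. A minimal sketch, assuming a kernel that exposes PCR values under /sys/class/tpm/tpm0/pcr-sha256/; the `tpm0` device name and the helper names are assumptions.

```shell
# Normalize a hex digest: drop a leading 0x and lowercase it.
normalize_digest() {
    printf '%s' "$1" | sed 's/^0[xX]//' | tr 'A-F' 'a-f'
}

# Succeed if two PCR digests are the same value.
pcr7_matches() {
    [ "$(normalize_digest "$1")" = "$(normalize_digest "$2")" ]
}

# Example, with values captured by hand in each environment:
#   installer=$(cat /sys/class/tpm/tpm0/pcr-sha256/7)   # in the installer
#   booted=$(cat /sys/class/tpm/tpm0/pcr-sha256/7)      # in the installed system
#   pcr7_matches "$installer" "$booted" || echo "PCR 7 differs"
```

Comparing `rpm -q shim-x64` output between the two environments would be a cheaper first check, since a shim mismatch is what triggered the difference here.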

@cgwalters
Collaborator

I think shim was recently re-signed in Fedora, and PCR 7 contains all the certificates involved. So skew between the host and target is definitely the cause. While arguably podman run ... bootc install is going to make it easy to hit things like this, note that today Fedora does not respin installer ISOs apart from major releases, and would be equally prone to this - but Anaconda doesn't seem to use/expose systemd-cryptenroll.

Onto the next bit: I would agree with your implication that if one is not using Secure Boot it doesn't make sense to bind to PCR 7 at all... maybe that could be changed in systemd. bootc is just exposing the systemd-cryptenroll default here (although of course we could also detect this and pass --tpm2-pcrs= ourselves)
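
The detection idea could be sketched as follows. This is a hypothetical policy, not what bootc does: it maps the output of `mokutil --sb-state` to a `--tpm2-pcrs=` argument, relying on systemd-cryptenroll treating an empty value as "bind to no PCRs".

```shell
# Map `mokutil --sb-state` output to a --tpm2-pcrs= argument.
# Hypothetical policy: bind to PCR 7 only when Secure Boot is enabled.
tpm2_pcrs_arg_for_sb_state() {
    case "$1" in
        *"SecureBoot enabled"*) echo "--tpm2-pcrs=7" ;;
        *)                      echo "--tpm2-pcrs=" ;;
    esac
}

# Usage on a real system (device path is hypothetical):
#   systemd-cryptenroll --tpm2-device=auto \
#       "$(tpm2_pcrs_arg_for_sb_state "$(mokutil --sb-state)")" /dev/vda3
```

Binding to no PCRs means the volume unlocks whenever the same TPM is present, so this trades the skew problem for weaker protection.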

> My personal opinion is adding an option to "opt into" supporting LUKS volumes is a bandaid/completely wrong response to the issue described here. I do not view this as a functional improvement. TPM2 bound LUKS volumes with systemd-cryptenroll work in other/normal Fedora distros.

Note that again bootc install to-filesystem supports whatever you want, this is just about the much more opinionated to-disk. (Also, it is expected that distributions will continue to maintain more sophisticated installers that hopefully use to-filesystem at least)

@jmpolom
Author

jmpolom commented Apr 18, 2024

> While arguably podman run ... bootc install is going to make it easy to hit things like this, note that today Fedora does not respin installer ISOs apart from major releases, and would be equally prone to this - but Anaconda doesn't seem to use/expose systemd-cryptenroll.

Guaranteed to hit it, as I have demonstrated. Normal Fedora gets away with ignoring this detail because by and large systemd-cryptenroll is not supported out of the box. However bootc appears to make a bold claim that it is supported given the available installation options.

> Onto the next bit: I would agree with your implication that if one is not using Secure Boot it doesn't make sense to bind to PCR 7 at all... maybe that could be changed in systemd. bootc is just exposing the systemd-cryptenroll default here (although of course we could also detect this and pass --tpm2-pcrs= ourselves)

I don't think systemd is going to change their defaults. That is the wrong venue to deal with this. I also don't think they intend for their default PCR selection to be a "production ready" "universal" configuration everyone should run. It's a default because they needed one and it is relatively unoffensive, but as I have shown here, not always workable. Only binding a LUKS volume to PCR 7 affords little in the way of actual security since a lot can change before that PCR hash changes.

The correct answer is that bootc needs to make the LUKS encryption and TPM binding aspects of the installation process more configurable by downstream users. A lot of users may wish to bind to additional TPM PCRs for added security and others will desire backup passwords. The current implementation, though usable mostly, is simply too naive. If the intent is to keep this feature around it needs to be more configurable.

> Note that again bootc install to-filesystem supports whatever you want, this is just about the much more opinionated to-disk. (Also, it is expected that distributions will continue to maintain more sophisticated installers that hopefully use to-filesystem at least)

Message received; however, I think there are a few assumptions there that are neither fair, reasonable, nor substantiated. Yes, you probably don't want to build a sophisticated partitioning interface into the to-disk workflow, but I am merely highlighting that currently the to-disk workflow lacks basic features and configurability it ought to have.

There is scant documentation on how one might employ installing to filesystem with a LUKS root volume. Given the difficulty I encountered with what should be the "easy mode" install, I am really hesitant to sink more time into an even less documented installation path. "DIY your install procedure" is not a resolution to this issue until at a minimum there is documentation showing the permutations of such usage.

@cgwalters
Collaborator

> I also don't think they intend for their default PCR selection to be a "production ready" "universal" configuration everyone should run. It's a default because they needed one and it is relatively unoffensive,

"relatively unoffensive" is definitely a hinge point here.

> but as I have shown here, not always workable. Only binding a LUKS volume to PCR 7 affords little in the way of actual security since a lot can change before that PCR hash changes.

Yes. I had thought PCR7 wouldn't be a problem but basically it doesn't provide much value, and only causes problems in practice.

> There is scant documentation on how one might employ installing to filesystem with a LUKS root volume.

Yes, we will work to improve the to-filesystem examples. However, part of the bigger picture here is that bootc is also intended to be a lower level backend for higher level existing distribution installers. For example, Anaconda today can install bootc-compatible containers via the ostreecontainer verb, and has a rich array of partitioning support, including LUKS (but as you know, not via systemd-cryptenroll). Anaconda doesn't technically use bootc install to-filesystem though, yet.

@jmpolom
Author

jmpolom commented Apr 18, 2024

> I had thought PCR7 wouldn't be a problem but basically it doesn't provide much value, and only causes problems in practice.

There is systemd-pcrlock coming that appears to address some of the inherent issues when binding to plain PCRs. Ultimately any type of policy driven thing would need to be user configurable as well.

I've created issues for the enhancements install to-disk needs in #476 and #477.
