Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle nested whiteouts #172

Closed
cgwalters opened this issue Jul 21, 2023 · 12 comments · Fixed by #175
Closed

handle nested whiteouts #172

cgwalters opened this issue Jul 21, 2023 · 12 comments · Fixed by #175
Labels
enhancement New feature or request

Comments

@cgwalters
Copy link
Contributor

See ostreedev/ostree#2712

Basically...we're now shipping support in ostree for embedding whiteouts. This allows c/image (podman etc.) to be directly pointed at this alternative root in a read-only fashion. Container images shipped this way are "lifecycle bound" with the host (and gain benefits of dedup actually and the efficient ostree on-the-wire deltas).

But this only works because ostree itself doesn't use overlayfs (ostree actually predates overlayfs).

In a composefs future, because overlayfs doesn't nest, we're going to need to figure out how to handle this fun special case.

In a unified storage world things are inherently better here, but hard requring that would actually be an "API break".

I guess another way to say this is that everyone turning on the ostree composefs support is just going to break if they have nested containers today.

@cgwalters
Copy link
Contributor Author

So...half baked thoughts. We can clearly detect this situation in ostree when we're generating the composefs image. I think probably what we need to do here is fall back to continuing to create that sub-portion of the filesystem tree as hardlinks outside of a composefs mount (which we already do today actually). Then, we set up a dynamic bind mount from there to that hardlinked subtree.

I think that'd work...

@cgwalters cgwalters added the enhancement New feature or request label Jul 22, 2023
@alexlarsson
Copy link
Collaborator

I don't think this is the only issue we're going to have, is it? I mean, you also can't store overlayfs xattrs in the composefs image.

@amir73il Is there a way to "recurse" overlayfs dirs for this? I.e. have an overlayfs mount that looks like it contains files such as whiteouts and regular files with overlayfs xattrs set on them?

@alexlarsson
Copy link
Collaborator

I wonder actually if we should have libcomposefs error out if the source data specifies such files. Currently they will produce pretty broken resulting images.

@amir73il
Copy link

amir73il commented Aug 7, 2023

Generally, overlayfs mount doesn't accept an overlayfs as the upper/work dir.
There are scattered assumptions about upper/work fs not being "wierd" and I don't know for sure what all these assumptions are. Need to consult with Miklos if this is important enough.
Note that the requirement that upper/work fs are not "remote" have already been relaxed to allow to support fuse as upper fs.

@alexlarsson
Copy link
Collaborator

@amir73il These would be lower dirs though.

Basically, suppose /usr is on overlayfs (via composefs), and you're shipping a copy of some /var/lib/containers/storage/overlay instance in /usr/share/additional-images. You then want podman to combine the lower dirs in /usr/share/additional-images with upper and work dirs in /var/lib/containers/storage/overlay.

In this case, if a file in the container storage has e.g. a metacopy xattr set, we want the container overlayfs mount to "interpret" that, not the composefs overlayfs mount.

Basically, we would want something like mkcomposefs converting the overlay.metacopy xattr to overlay.escape.metacopy and then have overlayfs expose xattrs called overlay.escape.foo as overlay.foo in the mount.

@alexlarsson
Copy link
Collaborator

And same with whiteouts, we would like a way to store a file in a lower that overlayfs doesn't consider a whiteout, but looks like a whiteout in the resulting overlayfs mountpoint.

@amir73il
Copy link

amir73il commented Aug 7, 2023

Since ovl already has a mount option userxattr to replace trusted.overlay prefix with user.overlay and all the code used helpers to build the actual xattr names, easiest would would be to support option xattr_prefix="user.cfs" or something like that (could also be trusted.cfs).
Regarding whiteout, simplest would be to support xattr overlay.whiteout. no need to support this for creating whiteouts, only for lookup in lower layers, so should be pretty simple. (I think )

@alexlarsson
Copy link
Collaborator

Since ovl already has a mount option userxattr to replace trusted.overlay prefix with user.overlay and all the code used helpers to build the actual xattr names, easiest would would be to support option xattr_prefix="user.cfs" or something like that (could also be trusted.cfs).

That sounds like its easy to implement, but it would mean we would have to use the new prefix for all the files in the lower, and that would add a hard requirement on the new kernel with this support to be able to mount such a file. I'll give it some thought.

Regarding whiteout, simplest would be to support xattr overlay.whiteout. no need to support this for creating whiteouts, only for lookup in lower layers, so should be pretty simple. (I think )

Ideally I would like to avoid checking for a new xattr on all files, as that would slow down everything. I'm thinking we can use an xattr for regular whiteout files, which if sets means that we don't actually treat them as whiteouts.

@amir73il
Copy link

amir73il commented Aug 8, 2023 via email

@cgwalters
Copy link
Contributor Author

This issue also relates a bit to containers/storage#1608 in that ideally we support fully general transformations between what is interpreted by the Linux kernel (e.g. the security.selinux xattr as read by the in-kernel SELinux LSM) versus what happens when userspace calls getxattr() and setxattr()).

Regarding whiteout, simplest would be to support xattr overlay.whiteout

This is short for trusted.overlay.whiteout right? But yes reducing this problem domain to just xattr nesting makes sense to me.

Maybe. Not sure how you plan to do this escaping in a nested way.

I'm uncertain whether it makes sense to leap from not supporting this to a fully general nesting. I think this problem is a bit like nested KVM - the driving use cases for one level of nesting became big enough that it really made sense to productize. But having three levels (or more) is far more obscure, enough that I don't know if it's used for anything real.

That said, wouldn't general nesting be implementable by having overlayfs detect the case where it's reading from an overlayfs lower and process "one level" of trusted.overlay.escape?

Anyways all this aside...hmm...so I think my original half baked thought of "un-nesting" would likely work except that we'd need to synthesize a local cfs image. In theory we could pre-compute the digests of these nested cfs image and include a signature covering them in the main cfs image. Or we can just rely on chain-of-trust from the main rootfs to code synthesizing the nested cfs images.

@alexlarsson
Copy link
Collaborator

Some initial work on nesting are here: https://github.com/alexlarsson/linux/tree/ovl-nesting

@alexlarsson
Copy link
Collaborator

This issue also relates a bit to containers/storage#1608 in that ideally we support fully general transformations between what is interpreted by the Linux kernel (e.g. the security.selinux xattr as read by the in-kernel SELinux LSM) versus what happens when userspace calls getxattr() and setxattr()).

I'm not sure exactly how this applies to composefs. There are two files, the file on the erofs image (A), and the backing file in the CAS (B).

A can have whatever security.selinux xattr we want (we just write it to the image) as overlayfs will expose it to the user as is. The user must pass security checks against this context to open the file

B needs to have a pretty widely accessible context (at least for reading), so we can share these files between images. It will be checked against the context of the user that mounts the composefs filesystem.

Maybe. Not sure how you plan to do this escaping in a nested way

Nesting escapes is easy. The first layer will convert overlay.escape.escape.metacopy to overlay.escape.metacopy, the second with convert overlay.escape.metacopy to overlay.metacopy, and the thrird will interpret overlay.metacopy. However, this is really limited by the current stacking limit of 2 levels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants