Rootless overlay whiteouts throw error when stored #1709

Open
computator opened this issue Jul 7, 2019 · 15 comments

@computator
commented Jul 7, 2019

Description

Deleting existing files in lower layers causes errors when running buildah/podman in rootless mode with the overlay storage driver. The deletion itself works fine, but a later step that tries to store the resulting whiteout files fails with "operation not permitted". The same failure occurs when pulling a container in which a file from a lower layer is deleted. None of these problems occur with the vfs driver.

Steps to reproduce the issue:

$ ctr=$(buildah from alpine)
$ buildah run $ctr rm /etc/alpine-release
$ buildah commit --rm $ctr
Getting image source signatures
Copying blob 256a7af3acb1 skipped: already exists
Copying blob d722205207d3 done
Copying config eb5739d8d7 done
Writing manifest to image destination
Storing signatures
error committing container "alpine-working-container" to "": error copying layers and metadata for container "5c77e22cdb9772b46833bf727d0c6f0139a5e4fad137286ac432fb6f72a1a520": Error committing the finished image: error adding layer with blob "sha256:d722205207d33a3bdaee8e1ab92a2bb66d31ac78dc06fb836b9d46e490dd8faa": Error processing tar file(exit status 1): operation not permitted
ERRO[0002] exit status 1                                
$ 

Here is some relevant strace output from pulling/importing another container with the same issue. The failure happens when it tries to create the whiteout node:

read(0, "var/cache/apt/.wh.pkgcache.bin\0\0"..., 512) = 512
lstat("/var/cache/apt", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/cache/apt/.wh.pkgcache.bin", 0xc420bf52e8) = -1 ENOENT (No such file or directory)
mknodat(AT_FDCWD, "/var/cache/apt/pkgcache.bin", S_IFCHR|000, makedev(0, 0)) = -1 EPERM (Operation not permitted)
write(2, "operation not permitted", 23) = 23
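
A minimal sketch of what the trace above shows, written in Python for illustration only (containers/storage is written in Go, and this is not its code). It assumes the usual layer-tar convention in which an entry named `.wh.<name>` marks `<name>` as deleted, and that native overlayfs represents a whiteout as a character device with device number 0/0, matching the `mknodat` call in the trace. The helper names `whiteout_target` and `create_whiteout` are hypothetical.

```python
import os
import stat

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"  # marks an opaque directory, not a single deleted file


def whiteout_target(tar_name):
    """Return the path hidden by a whiteout tar entry, or None for other entries."""
    directory, base = os.path.split(tar_name)
    if base == OPAQUE_MARKER or not base.startswith(WHITEOUT_PREFIX):
        return None
    return os.path.join(directory, base[len(WHITEOUT_PREFIX):])


def create_whiteout(path):
    """Create an overlayfs-style whiteout: a character device with device 0/0.

    This mirrors the mknodat() call in the trace and raises PermissionError
    when the caller is not allowed to create device nodes.
    """
    os.mknod(path, stat.S_IFCHR | 0o000, os.makedev(0, 0))
```

For the entry in the trace, `whiteout_target("var/cache/apt/.wh.pkgcache.bin")` gives `var/cache/apt/pkgcache.bin`, which is the path passed to `mknodat`. Calling `create_whiteout` as an unprivileged user would be expected to fail with EPERM in the same way.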

buildah info:

$ buildah info
{
    "host": {
        "Distribution": {
            "distribution": "ubuntu",
            "version": "18.04"
        },
        "MemTotal": 16775352320,
        "MenFree": 6280101888,
        "SwapFree": 2147479552,
        "SwapTotal": 2147479552,
        "arch": "amd64",
        "cpus": 8,
        "hostname": "computator",
        "kernel": "4.15.0-54-generic",
        "os": "linux",
        "rootless": true,
        "uptime": "4h 40m 4.82s (Approximately 0.17 days)"
    },
    "store": {
        "ContainerStore": {
            "number": 0
        },
        "GraphDriverName": "overlay",
        "GraphOptions": null,
        "GraphRoot": "/home/computator/.local/share/containers/storage",
        "GraphStatus": {
            "Backing Filesystem": "extfs",
            "Native Overlay Diff": "false",
            "Supports d_type": "true",
            "Using metacopy": "false"
        },
        "ImageStore": {
            "number": 1
        },
        "RunRoot": "/run/user/1000"
    }
}
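
The combination that triggers the problem can be read off the `buildah info` output above: `rootless` is true, `GraphDriverName` is `overlay`, and `GraphOptions` is null. The following is a rough sketch of such a check, not part of buildah. The function name is hypothetical, and it assumes that a configured mount program would appear in `GraphOptions` as a string containing `mount_program=`.

```python
def native_overlay_without_helper(info):
    """Report whether parsed `buildah info` output describes rootless native overlay.

    Assumption: when a mount program such as fuse-overlayfs is configured,
    GraphOptions is a list of "key=value" strings that includes a
    "mount_program=" entry. In the output above it is null.
    """
    host = info.get("host", {})
    store = info.get("store", {})
    options = store.get("GraphOptions") or []
    has_mount_program = any("mount_program=" in opt for opt in options)
    return (
        bool(host.get("rootless"))
        and store.get("GraphDriverName") == "overlay"
        and not has_mount_program
    )
```

Feeding it the JSON above (for example via `json.loads`) returns true, since all three conditions hold.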
@rhatdan

Member

commented Jul 7, 2019

@giuseppe Is this a problem with fuse-overlayfs?

@giuseppe

Member

commented Jul 7, 2019

It is trying to use native overlay without fuse-overlayfs. That works on Ubuntu kernels, but something seems to be failing. @rlifshay could you try using fuse-overlayfs, as we do by default on Fedora?

@computator

Author

commented Jul 8, 2019

If I use fuse-overlayfs it works without issue. I am hoping that this can be fixed so native overlay works. That way it theoretically has better performance and I also don't have to manually download and install fuse-overlayfs on Ubuntu.

I was looking through some code a day or two ago in an attempt to debug this, and this looks like it might be somewhat related:
https://github.com/containers/storage/blob/95718d665e695ab1b233f12308c7e2846871ac90/drivers/overlay/overlay.go#L1046

@computator

Author

commented Jul 8, 2019

I couldn't find a related issue before when I posted this, but I just now ran across this issue: containers/libpod#2998

@computator

Author

commented Jul 8, 2019

Sorry for the repeated posts. It appears as if the kernel overlayfs itself is the only one that can do mknod or set xattrs in a layer for whiteouts or opaque directories when running in rootless mode. Currently we try and directly add the whiteouts with mknod, which only works with root permissions. Possibly we could solve this by detecting this case and using a different strategy. Rather than making the nodes ourselves, we could just delete the corresponding files or directories in the correct layer to cause overlayfs to create the whiteout nodes itself, thus bypassing the permissions issue.

@giuseppe

Member

commented Jul 8, 2019

We need to create the whiteouts in the same format the kernel (or the FUSE program) expects them to be, so we cannot tweak it to be in a different format. If we use native overlay we need to create the whiteout files either with mknod or mounting an overlay file system and operating on it.
The latter is not as trivial as it sounds: to create a whiteout we would first need to create the file in the lower layer and then remove it from the mount point. Also, this operation cannot be done while the file system is mounted, so we would need two passes to create the whiteouts.

I don't think the extra complexity is worth it just to support a custom Ubuntu kernel patch.

If you still want to use it, could you open a PR? It would help us fully understand the costs and complexity of supporting it.

@computator

Author

commented Jul 13, 2019

I would like to find a feasible way to solve this if possible, because even if it's only usable in Ubuntu (or possibly everywhere, see third option below), it enables a significant performance increase. I saw a nearly 20% improvement in build time for a large container when using native overlay vs fuse-overlayfs. I have several ideas below and would try implementing one of them myself, but I am not familiar enough with Go or the internals of these projects to be able to do so.

  • Unless I am missing something, I think creating whiteouts via overlayfs might be a little simpler than you described. The way I understood it, you were saying that to create whiteouts, we would need to create files corresponding to the whiteouts, mount that into an overlayfs, and then delete the files. However, whiteouts are only used to mark existing files in the lower layers as removed. As such, all that would need to be done is mount the lower layer(s) as normal. Since the lower layers already have the files in them, we wouldn't need to create the files. All that would need to be done is delete the existing files in the mounted overlayfs, thus creating whiteouts in the upperdir. This could be done at the same time as, or just before extracting files to a layer (as I believe the current mknod based code does) since extracting the files to the mounted overlayfs also puts them into the upperdir. I believe that this procedure (or a variation of it) would avoid needing the two passes you mentioned.

  • If that is still too complicated, what would be the feasibility of using a small external setuid root helper to call mknod to create the necessary whiteouts? Possibly this could be something optional like with the mount_program storage setting.

  • Another option that might justify the complexity of the first overlayfs/mounting method is to use a setuid root helper to mount overlayfs on non ubuntu distributions, thus allowing all distributions to benefit from the increased performance of native overlayfs.

If none of these are feasible, buildah (and podman, etc) should probably display a warning when using native overlayfs in rootless mode. In addition it would be nice to have fuse-overlayfs included in the buildah/podman ubuntu packages and enabled by default, rather than having to manually install it.
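
The first option above can be sketched roughly as follows. This is only an illustration of the idea, not code from containers/storage. It assumes `merged_dir` is an overlay that has already been mounted over the lower layers, so that deleting a path there makes the kernel record the whiteout in the upperdir. Mounting, opaque directories, and path validation are all left out, and the function name is hypothetical.

```python
import os
import shutil

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def delete_whiteout_targets(entry_names, merged_dir):
    """Delete the target of each `.wh.` entry under merged_dir.

    Returns the entries that are not plain whiteouts and still need to be
    extracted normally. Opaque markers are passed through unhandled.
    """
    remaining = []
    for name in entry_names:
        directory, base = os.path.split(name)
        if base == OPAQUE_MARKER or not base.startswith(WHITEOUT_PREFIX):
            remaining.append(name)
            continue
        victim = os.path.join(merged_dir, directory, base[len(WHITEOUT_PREFIX):])
        if os.path.isdir(victim) and not os.path.islink(victim):
            shutil.rmtree(victim)   # deleted directory in a lower layer
        elif os.path.lexists(victim):
            os.remove(victim)       # deleted file or symlink
    return remaining
```

On a plain directory this only removes files. Whether it produces usable whiteouts depends entirely on `merged_dir` being an overlay mount the caller is allowed to create.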

@giuseppe

Member

commented Jul 13, 2019

> I would like to find a feasible way to solve this if possible, because even if it's only usable in Ubuntu (or possibly everywhere, see third option below), it enables a significant performance increase. I saw a nearly 20% improvement in build time for a large container when using native overlay vs fuse-overlayfs. I have several ideas below and would try implementing one of them myself, but I am not familiar enough with Go or the internals of these projects to be able to do so.

What container are you trying to build? There should not be such a big difference, unless you are doing many parallel readdirs or copyups.

> • If that is still too complicated, what would be the feasibility of using a small external setuid root helper to call mknod to create the necessary whiteouts? Possibly this could be something optional like with the mount_program storage setting.
>
> • Another option that might justify the complexity of the first overlayfs/mounting method is to use a setuid root helper to mount overlayfs on non ubuntu distributions, thus allowing all distributions to benefit from the increased performance of native overlayfs.

I don't think we should use setuid programs to circumvent kernel restrictions.

@computator

Author

commented Jul 13, 2019

> What container are you trying to build? There should not be such a big difference, unless you are doing many parallel readdirs or copyups.

It's a big container that installs WINE and some other things (https://github.com/rlifshay/dng-converter-ctr), although I would guess that even the smaller performance difference with a more typical container could add up in a CI environment or something.

> I don't think we should use setuid programs to circumvent kernel restrictions.

I can definitely understand why someone might be hesitant to do that, especially if it was the default. However personally I think it would be alright if it was left up to the administrator to explicitly choose to enable and specify such a program. Also, what I was envisioning wouldn't be the same as giving unlimited access to mknod, a definite security risk (and the reason it is restricted to root). Instead, it would be a minimal program that would only make nodes usable as whiteouts and that's it, thus alleviating the security risk. It could even go beyond that, and verify the caller and/or paths to make sure that it was being used as expected.

@rhatdan

Member

commented Jul 13, 2019

I would prefer the distributions and upstream kernel to make those decisions. If Ubuntu's kernel patch is blocking the creation of the whiteout device nodes, then you need to work with them to get that allowed. We don't want to add setuid programs to Podman to get around security restrictions, especially when dealing with a patch that even the upstream User Namespace team has not allowed into the upstream kernel.

Please open an issue with Ubuntu to allow the creation of the device nodes.

As for the performance issues, I would prefer to fix fuse-overlayfs if possible so that it handles your workload better.

@giuseppe

Member

commented Jul 15, 2019

It seems most of the cost of running apt comes from fsync. In general I think it is safer to disable it for containers, especially if all you do is build a new image that you'll push immediately, or to use a single sync call after the container has exited.
I've also started to work on multithreading support. The lookups are still done in single-threaded mode, but read/write/setattr/sync already happen on different threads and don't block other FUSE requests.

I am working on it here: containers/fuse-overlayfs#88

@rlifshay, the PR seems to significantly improve your test script. Would it be possible for you to try it out and let me know?

@computator

Author

commented Jul 17, 2019

Yes it seems to help a lot, at least when I also enable the mount options you mention in the PR.

norm - 5 iters: 240.72s avg
new - 4 iters: 259.23s avg
new-opt - 4 iters: 204.28s avg

It seems as if after our discussions this has really turned into several separate issues:

  • fuse-overlayfs performance
  • warning of potential issues when using native overlayfs without mknod/xattr permissions
  • better error messages when failing due to mknod/xattr permission denied
  • packaging fuse-overlayfs for ubuntu/debian by default
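
As a rough illustration of the third item, an error path could translate the bare EPERM into a message that points at the likely cause. This is a sketch only. The function name and wording are hypothetical, and the suggested alternatives are simply the two setups reported as working earlier in this thread (the vfs driver and fuse-overlayfs).

```python
import errno


def explain_whiteout_error(exc, path):
    """Turn an OSError raised while creating a whiteout into a clearer message."""
    if exc.errno == errno.EPERM:
        return (
            f"cannot create whiteout for {path}: operation not permitted; "
            "native overlay in rootless mode may lack mknod/xattr permissions, "
            "consider the vfs driver or fuse-overlayfs"
        )
    # Any other error is reported with its normal description.
    return f"cannot create whiteout for {path}: {exc.strerror}"
```

A caller would catch the `OSError` from the whiteout creation step and pass it here together with the path being processed.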
@giuseppe

Member

commented Jul 17, 2019

@rlifshay thanks a lot for the tests. This is very helpful feedback.

Do you also have a measure for native overlay (as root) on the same host?

> • packaging fuse-overlayfs for ubuntu/debian by default

@lsm5 could we add fuse-overlayfs as a dependency to the Ubuntu package?

@computator

Author

commented Jul 18, 2019

@giuseppe I ran it with native overlayfs in rootless mode. I was going to do native as root too, but I kept having internet issues. I will probably try again tomorrow and will update you if it's much different.

native-rootless - 4 iter: 214.23s avg

(I just changed my username from rlifshay if anyone is confused)

@computator

Author

commented Jul 19, 2019

I did some more testing with more stable conditions (mostly internet) and a lot less variance:

native-root - 5 iter: 192.95s avg
native-rootless - 5 iter: 181.96s avg
fuse - 5 iter: 206.82s avg
fuse-new - 5 iter: 183.28s avg
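
To put the averages above on a common scale, the relative differences can be worked out against the plain `fuse` run. This is a small sketch that uses only the four numbers listed above. The helper name `percent_faster` is made up for this example.

```python
def percent_faster(baseline_s, candidate_s):
    """Percentage by which candidate_s is shorter than baseline_s."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Average build times in seconds, copied from the results above.
timings = {
    "native-root": 192.95,
    "native-rootless": 181.96,
    "fuse": 206.82,
    "fuse-new": 183.28,
}

for name, seconds in timings.items():
    gain = percent_faster(timings["fuse"], seconds)
    print(f"{name}: {gain:.1f}% faster than fuse")
```

By this measure fuse-new is about 11% faster than fuse and within roughly 1% of native-rootless on this workload.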