Unprivileged Rootless Buildah on Kubernetes fails due to newgidmap/newuidmap: Operation not permitted error #4049

Closed
ckehoe opened this issue Jun 9, 2022 · 21 comments

@ckehoe

ckehoe commented Jun 9, 2022

Description
I have an unprivileged rootless Buildah container running on Kubernetes/CRI-O on a CentOS 7.9 host using VFS storage. When running any buildah command, I receive the following output:

WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping      

The buildah bud command eventually fails due to the following error:

ERRO[0001] Error while applying layer: ApplyLayer exit status 1 stdout:  stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 192:192 for /run/systemd/netif): Check /etc/subuid and /etc/subgid: lchown /run/systemd/netif: invalid argument 
error creating build container: writing blob: adding layer with blob "sha256:b297047bc4b568dedc27cbeda009a435b0714fd2f681870601890a4b1c7531e8": ApplyLayer exit status 1 stdout:  stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 192:192 for /run/systemd/netif): Check /etc/subuid and /etc/subgid: lchown /run/systemd/netif: invalid argument

This issue is not present when running as user build with privileged: true enabled on the pod or when running as the root user without privileged mode enabled.

Steps to reproduce the issue:

  1. Deploy buildah to kubernetes running as user build
  2. Run buildah bud

Describe the results you received:
Buildah is unable to create new gid or uid maps:

[build@runner-le6vnazt-project-140-concurrent-026xwb /]$ buildah info
WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               

Describe the results you expected:
Expected buildah to be able to create mappings

Output of rpm -q buildah or apt list buildah:

buildah-1.23.3-2.fc35.x86_64

Output of buildah version:

buildah version 1.23.3 (image-spec 1.0.1-dev, runtime-spec 1.0.2-dev)

Output of cat /etc/*release:

Fedora release 35 (Thirty Five)
NAME="Fedora Linux"
VERSION="35 (Container Image)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Container Image"
VARIANT_ID=container
Fedora release 35 (Thirty Five)
Fedora release 35 (Thirty Five)

Output of uname -a:

Linux runner-le6vnazt-project-140-concurrent-026xwb 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

# This file is is the configuration file for all tools
# that use the containers/storage library. The storage.conf file
# overrides all other storage.conf files. Container engines using the
# container/storage library do not inherit fields from other storage.conf
# files.
#
#  Note: The storage.conf file overrides other storage.conf files based on this precedence:
#      /usr/containers/storage.conf
#      /etc/containers/storage.conf
#      $HOME/.config/containers/storage.conf
#      $XDG_CONFIG_HOME/containers/storage.conf (If XDG_CONFIG_HOME is set)
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver, Must be set for proper operation.
driver = "vfs"

# Temporary storage location
runroot = "/run/containers/storage"

# Primary Read/Write location of container storage
# When changing the graphroot location on an SELINUX system, you must
# ensure  the labeling matches the default locations labels with the
# following commands:
# semanage fcontext -a -e /var/lib/containers/storage /NEWSTORAGEPATH
# restorecon -R -v /NEWSTORAGEPATH
graphroot = "/var/lib/containers/storage"


# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
"/var/lib/shared",
]

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to the UIDs/GIDs as they should appear outside of the container,
# and the length of the range of UIDs/GIDs.  Additional mapped sets can be
# listed and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
#remap-uids = 0:1668442479:65536
#remap-gids = 0:1668442479:65536

# Remap-User/Group is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and then a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped in-container ID,
# until all of the entries have been used for maps.
#
# remap-user = "containers"
# remap-group = "containers"

# Root-auto-userns-user is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid and /etc/subgid file.  These ranges will be partitioned
# to containers configured to create automatically a user namespace.  Containers
# configured to automatically create a user namespace can still overlap with containers
# having an explicit mapping set.
# This setting is ignored when running as rootless.
# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the minimum size for a user namespace created automatically.
# auto-userns-max-size=65536

[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids.  Note multiple UIDs will be
# squashed down to the default uid in the container.  These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
#ignore_chown_errors = "false"

# Inodes is used to set a maximum inodes of the container image.
# inodes = ""

# Path to an helper program to use for mounting the file system instead of mounting it
# directly.
mount_program = "/usr/bin/fuse-overlayfs"

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev,fsync=0"

# Set to skip a PRIVATE bind mount on the storage home directory.
# skip_mount_home = "false"

# Size is used to set a maximum size of the container image.
# size = ""

# ForceMask specifies the permissions mask that is used for new files and
# directories.
#
# The values "shared" and "private" are accepted.
# Octal permission masks are also accepted.
#
#  "": No value specified.
#     All files/directories, get set with the permissions identified within the
#     image.
#  "private": it is equivalent to 0700.
#     All files/directories get set with 0700 permissions.  The owner has rwx
#     access to the files. No other users on the system can access the files.
#     This setting could be used with networked based homedirs.
#  "shared": it is equivalent to 0755.
#     The owner has rwx access to the files and everyone else can read, access
#     and execute them. This setting is useful for sharing containers storage
#     with other users.  For instance have a storage owned by root but shared
#     to rootless users as an additional store.
#     NOTE:  All files within the image are made readable and executable by any
#     user on the system. Even /etc/shadow within your image is now readable by
#     any user.
#
#   OCTAL: Users can experiment with other OCTAL Permissions.
#
#  Note: The force_mask Flag is an experimental feature, it could change in the
#  future.  When "force_mask" is set the original permission mask is stored in
#  the "user.containers.override_stat" xattr and the "mount_program" option must
#  be specified. Mount programs like "/usr/bin/fuse-overlayfs" present the
#  extended attribute permissions to processes within containers rather then the
#  "force_mask"  permissions.
#
# force_mask = ""

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper.
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem.
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# metadata_size is used to set the `pvcreate --metadatasize` options when
# creating thin devices. Default is 128k
# metadata_size = ""

# Size is used to set a maximum size of the container image.
# size = ""

# use_deferred_removal marks devicemapper block device for deferred removal.
# If the thinpool is in use when the driver attempts to remove it, the driver
# tells the kernel to remove it as soon as possible. Note this does not free
# up the disk space, use deferred deletion to fully remove the thinpool.
# use_deferred_removal = "True"

# use_deferred_deletion marks thinpool device for deferred deletion.
# If the device is busy when the driver attempts to delete it, the driver
# will attempt to delete device every 30 seconds until successful.
# If the program using the driver exits, the driver will continue attempting
# to cleanup the next time the driver is used. Deferred deletion permanently
# deletes the device and all data stored in device will be lost.
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"

Output of buildah info:

WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
{
    "host": {
        "CgroupVersion": "v1",
        "Distribution": {
            "distribution": "fedora",
            "version": "35"
        },
        "MemFree": 468131840,
        "MemTotal": 16259915776,
        "OCIRuntime": "crun",
        "SwapFree": 8232628224,
        "SwapTotal": 8254386176,
        "arch": "amd64",
        "cpus": 8,
        "hostname": "runner-le6vnazt-project-140-concurrent-026xwb",
        "kernel": "3.10.0-1160.25.1.el7.x86_64",
        "os": "linux",
        "rootless": true,
        "uptime": "310h 24m 16.22s (Approximately 12.92 days)"
    },
    "store": {
        "ContainerStore": {
            "number": 0
        },
        "GraphDriverName": "vfs",
        "GraphOptions": null,
        "GraphRoot": "/home/build/.local/share/containers/storage",
        "GraphStatus": {},
        "ImageStore": {
            "number": 0
        },
        "RunRoot": "/var/tmp/containers-user-1000/containers"
    }
}

Output of capsh --print:

Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: cap_chown,cap_dac_override,!cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,!cap_linux_immutable,cap_net_bind_service,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,cap_mknod,!cap_lease,cap_audit_write,!cap_audit_control,cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend
Securebits: 00/0x0/1'b0 (no-new-privs=1)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1000(build) euid=1000(build)
gid=1000(build)
groups=1000(build)
Guessed mode: UNCERTAIN (0)

Output of /etc/subgid and /etc/subuid:

build:2000:50000
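
For readers unfamiliar with the format, each /etc/subuid or /etc/subgid entry is name:start:count. A minimal sketch of what the entry above grants, assuming a hypothetical helper named subid_range:

```shell
# subid_range ENTRY: print the first and last subordinate ID granted by a
# "name:start:count" entry. The helper name is illustrative only.
subid_range() {
    echo "$1" | awk -F: '{ printf "%d-%d\n", $2, $2 + $3 - 1 }'
}

subid_range "build:2000:50000"   # prints 2000-51999
```

So the build user is allowed host IDs 2000 through 51999, which would be enough to cover ID 192 from the failing layer if the mapping could be written.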

Output of rpm -qa | grep shadow:

shadow-utils-4.9-9.fc35.x86_64

Output of getcap /usr/bin/newuidmap /usr/bin/newgidmap:

/usr/bin/newuidmap cap_setuid=ep
/usr/bin/newgidmap cap_setgid=ep
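
One detail in the two outputs above may be relevant: capsh reports no-new-privs=1 and lists the capabilities only in the inheritable set (=i). When no_new_privs is set, the kernel does not add file capabilities on exec, so cap_setuid=ep on newuidmap would have no effect. The sketch below, under that assumption, reads the NoNewPrivs and Cap* fields from /proc/<pid>/status text. The helper names status_field and has_cap and the sample values are illustrative; the mask a80405fb was computed by hand from the capability names in the capsh output, not copied from the host.

```shell
# Capability bit numbers from linux/capability.h: CAP_SETGID=6, CAP_SETUID=7.

# status_field NAME: read /proc/<pid>/status text on stdin, print field NAME.
status_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# has_cap HEXMASK BIT: succeed if bit BIT is set in the hex capability mask.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Sample status text (assumed values, modelled on the capsh output above).
sample='NoNewPrivs:	1
CapEff:	0000000000000000
CapBnd:	00000000a80405fb'

nnp=$(printf '%s\n' "$sample" | status_field NoNewPrivs)
eff=$(printf '%s\n' "$sample" | status_field CapEff)
bnd=$(printf '%s\n' "$sample" | status_field CapBnd)

echo "NoNewPrivs=$nnp"
if has_cap "$bnd" 7; then echo "CAP_SETUID is in the bounding set"; fi
if ! has_cap "$eff" 7; then echo "CAP_SETUID is not in the effective set"; fi
```

Inside the pod, the same helpers can be fed with cat /proc/self/status instead of the sample text.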

Output of cat /proc/sys/user/max_user_namespaces on the CentOS 7.9 host:

15000
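
A limit of 0 in this file means no user namespaces can be created, so a non-zero value such as the one above rules that cause out. A small sketch of the check, assuming a hypothetical helper named userns_allowed that takes an optional file path so it can be tried against a sample file:

```shell
# userns_allowed [FILE]: succeed if the user-namespace limit in FILE is above
# zero. FILE defaults to the real sysctl path; the helper name is illustrative.
userns_allowed() {
    limit=$(cat "${1:-/proc/sys/user/max_user_namespaces}" 2>/dev/null) || return 1
    [ "${limit:-0}" -gt 0 ]
}

# Try it against a sample file holding the value reported above.
tmp=$(mktemp)
echo 15000 > "$tmp"
if userns_allowed "$tmp"; then echo "user namespaces allowed"; fi
rm -f "$tmp"
```
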
@flouthoc
Collaborator

Hi @ckehoe, thanks for creating the issue. Could you share how you are using buildah inside k8s? Are you using the buildah image from quay? Could you share the image the pod uses to run buildah? I suspect something is wrong with the configured environment. Try using the official buildah image from upstream, i.e. quay.io/buildah/stable:latest or quay.io/buildah/upstream:latest.

@rhatdan
Member

rhatdan commented Jun 13, 2022

Also, using buildah build --isolation=chroot ... might be helpful.

@ColonelBundy

We are running into the same issue, and have tried both the upstream image and the latest tag.

We are, however, running in OpenShift:

time="2022-06-14T09:13:45Z" level=warning msg="error reading allowed ID mappings: error reading subuid mappings for user \"1000750000\" and subgid mappings for group \"1000750000\": No subuid ranges found for user \"1000750000\" in /etc/subuid"
time="2022-06-14T09:13:45Z" level=warning msg="Found no UID ranges set aside for user \"1000750000\" in /etc/subuid."
time="2022-06-14T09:13:45Z" level=warning msg="Found no GID ranges set aside for user \"1000750000\" in /etc/subgid."
time="2022-06-14T09:13:45Z" level=warning msg="error running newgidmap: fork/exec /usr/bin/newgidmap: operation not permitted: "
time="2022-06-14T09:13:45Z" level=warning msg="falling back to single mapping"
time="2022-06-14T09:13:45Z" level=warning msg="error running newuidmap: fork/exec /usr/bin/newuidmap: operation not permitted: "
time="2022-06-14T09:13:45Z" level=warning msg="falling back to single mapping"
time="2022-06-14T09:13:45Z" level=warning msg="Error loading container config when searching for local runtime: no such file or directory"
time="2022-06-14T09:13:45Z" level=error msg="failed to setup From and Build flags: failed to get container config: no such file or directory"
time="2022-06-14T09:13:45Z" level=error msg="exit status 1"

Any ideas?

@rhatdan
Member

rhatdan commented Jun 14, 2022

newuidmap and newgidmap are probably failing because you are running in a confined environment without the SETUID and SETGID capabilities, or because the newuidmap and newgidmap binaries do not have file capabilities set.

@ckehoe
Author

ckehoe commented Jun 20, 2022

Could you share how are you using buildah inside k8s ?

The buildah pod is spun up as a GitLab CI runner to run our container build pipelines.

Are you using buildah image from quay ? Could you share the image being used by pod to run buildah. ? I suspect something is wrong with the configured environment. Try using official buildah image from upstream i.e quay.io/buildah/stable:latest or quay.io/buildah/upstream:latest

Yes, the official buildah image from quay. Pulled down about 3 days before I posted the issue.

I suspect something is wrong with the configured environment

Anything specific to look for? Do you think it's the CentOS 7.9 host or the container itself?

@ckehoe
Author

ckehoe commented Jun 20, 2022

buildah build --isolation=chroot ... Might be helpful

No change when adding --isolation=chroot

@flouthoc
Collaborator

No subuid ranges found for us

@ColonelBundy In your case, I think the warning is telling you that there is no entry for your user in /etc/subuid and /etc/subgid.
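
A quick way to confirm a missing entry is to look for a line whose first field matches the user name or numeric UID. The sketch below assumes a hypothetical helper named has_subid_entry and uses a temporary sample file instead of the real /etc/subuid:

```shell
# has_subid_entry NAME FILE: succeed if FILE contains a "NAME:start:count"
# line. NAME may be a user name or a numeric UID. Helper name is illustrative.
has_subid_entry() {
    awk -F: -v name="$1" '$1 == name && NF == 3 { found = 1 } END { exit !found }' "$2"
}

# Sample file modelled on the entry shown earlier in this issue.
tmp=$(mktemp)
printf 'build:2000:50000\n' > "$tmp"

for user in build 1000750000; do
    if has_subid_entry "$user" "$tmp"; then
        echo "$user: entry found"
    else
        echo "$user: no entry"
    fi
done
rm -f "$tmp"
```

On OpenShift the pod typically runs with an arbitrary UID such as 1000750000, which is why no matching line exists in the image's files.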

@flouthoc
Collaborator

@ckehoe Could you share the contents of /etc/subuid and /etc/subgid, and also the output of getcap /usr/bin/newuidmap /usr/bin/newgidmap?

@ColonelBundy

No subuid ranges found for us

@ColonelBundy In your case I think as warning tells that there is no entry for your user in /etc/subuid and /etc/subgid.

Solved it for now by running as user 1000 with the nonroot SCC in OpenShift.

@flouthoc
Collaborator

No subuid ranges found for us

@ColonelBundy In your case I think as warning tells that there is no entry for your user in /etc/subuid and /etc/subgid.

Solved it for now running as user 1000 and the nonroot scc in openshift.

Thanks for confirming.

@ckehoe
Author

ckehoe commented Jun 22, 2022

Could you share output of /etc/subuid and /etc/subgid and also getcap /usr/bin/newuidmap /usr/bin/newgidmap

@flouthoc

Output of /etc/subuid and /etc/subgid:

build:2000:50000

Output of getcap /usr/bin/newuidmap /usr/bin/newgidmap:

/usr/bin/newuidmap cap_setuid=ep
/usr/bin/newgidmap cap_setgid=ep

@flouthoc
Collaborator

@ckehoe Could you also please share what the output of buildah unshare cat /proc/self/uid_map looks like?

@flouthoc
Collaborator

@ckehoe Since you mentioned GitLab Runner, are you running in a doubly nested setup, i.e. Kubernetes -> container (rootless) -> buildah (inside the first rootless container)?

@ckehoe
Author

ckehoe commented Jun 29, 2022

@flouthoc

Could you also please share buildah unshare cat /proc/self/uid_map looks like.

[build@runner-qamcr1da-project-140-concurrent-0nh684 /]$ buildah unshare cat /proc/self/uid_map
WARN[0000] Error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] Falling back to single mapping               
WARN[0000] Error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] Falling back to single mapping               
         0       1000          1
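
Each uid_map line reads "inside-ID outside-ID count", so the single line above maps only container ID 0 to host ID 1000. That is consistent with the earlier lchown failure for 192:192, since ID 192 has no mapping. A minimal sketch of the lookup, assuming a hypothetical helper named map_id; the two-line map in the second call is the usual rootless layout built from the build:2000:50000 entry and is an assumption, not output from this host:

```shell
# map_id ID: read uid_map-style lines ("inside outside count") on stdin and
# print the outside ID that ID maps to, or "unmapped" if no line covers it.
map_id() {
    awk -v id="$1" '
        NF == 3 && id >= $1 && id < $1 + $3 {
            printf "%d\n", $2 + (id - $1); found = 1; exit
        }
        END { if (!found) print "unmapped" }
    '
}

# The single mapping reported above: ID 192 is not covered.
printf '0 1000 1\n' | map_id 192                    # prints unmapped

# Assumed full mapping if newuidmap had succeeded: 192 lands in the subuid range.
printf '0 1000 1\n1 2000 50000\n' | map_id 192      # prints 2191
```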

@ckehoe
Author

ckehoe commented Jun 29, 2022

@flouthoc

Since you mentioned gitlab runner are you running in a double nested setup ? i.e kubernetes -> container (rootless) -> buildah ( that is inside first rootless container ) ?

I don't believe so. It's a Kubernetes runner, so the buildah container runs on top of CRI-O (built in to RKE2).

@ckehoe
Author

ckehoe commented Jul 11, 2022

@flouthoc Any thoughts on how to get this working?

@flouthoc
Collaborator

flouthoc commented Jul 12, 2022

@ckehoe After checking this again with teammates, I think the outer container is missing CAP_SETGID / CAP_SETUID, so this comment holds true: #4049 (comment)

Could you try adding these capabilities to the outer container, i.e. the outermost buildah container?

     securityContext:
       capabilities:
         add:
           # Kubernetes manifests list capabilities without the CAP_ prefix
           - SETGID
           - SETUID

cc @giuseppe Do you think the above is worth trying?

@kittydoor

From my understanding, CAP_SETUID is equivalent to running as root, since a process with CAP_SETUID can set its UID to anything, including 0 (root). Is there any way to build containers inside a container (such as in Kubernetes) without this capability?

@giuseppe
Member

From my understanding CAP_SETUID is equivalent to running as root, as a process with CAP_SETUID can set their UID to anything including 0 i.e. root. Are there any ways to build containers inside of a container (such as in kubernetes) without this capability?

No, you'd need to use user namespaces. CRI-O has support for them, and support for user namespaces is being added to upstream Kubernetes as well.

@rhatdan
Member

rhatdan commented Jul 20, 2022

CAP_SETUID is very powerful, correct, but the container is still confined by SELinux, seccomp, other missing capabilities, namespaces, and so on. CAP_SETUID is given to all containers by default in Podman, Docker, and containerd. If you trust the container, the processes running within the build will not be running with CAP_SETUID in the outer namespaces.

@giuseppe
Member

I am closing the issue since this is not something that can be addressed in buildah.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 31, 2023