
Running privileged systemd container in namespaced OpenShift pod fails #21008

Open
adelton opened this issue Dec 13, 2023 · 32 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@adelton
Contributor

adelton commented Dec 13, 2023

Issue Description

I am trying to get Kind (with podman) to run in OpenShift rootless pods: https://github.com/adelton/kind-in-pod

I have minimized the problem to running a privileged podman container with --cgroupns=private inside a privileged OpenShift Pod. It passes when run as uid 0 but fails when run user-namespaced.

Steps to reproduce the issue

  1. Have an OpenShift cluster with a regular account (user) and an admin account (admin) with the cluster-admin role.
  2. As the regular user, oc new-project test-1
  3. As the admin, make it possible to run Pods in that namespace test-1 as privileged: oc adm policy add-scc-to-user privileged -z default -n test-1
  4. Have YAML file test-podman.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-podman
# annotations:
#   io.openshift.builder: "true"
#   io.kubernetes.cri-o.userns-mode: auto
spec:
  restartPolicy: Never
# securityContext:
#   runAsUser: 300000
  containers:
  - name: container
    image: quay.io/podman/stable
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    command:
    - bash
    - -c
    - 'set -x ; id ; cat /proc/self/uid_map ; mount | grep cgroup ; podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c "id ; cat /proc/self/uid_map ; mount | grep cgroup ; exec /usr/sbin/init --show-status"'
  5. As the regular user, create the Pod from this YAML: oc apply -f test-podman.yaml
  6. After a short while, check that oc logs -f pod/test-podman shows systemd running in that Pod, started by podman inside the OpenShift Pod:
+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0          0 4294967295
+ mount
+ grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c 'id ; cat /proc/self/uid_map ; mount | grep cgroup ; exec /usr/sbin/init --show-status'
time="2023-12-13T19:21:10Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
Trying to pull quay.io/podman/stable:latest...
Getting image source signatures
Copying blob sha256:dca987133639956bbd8d83e9b09d881728d87e139191425d1023025611e50f1b
Copying blob sha256:4084d87f5b2bb0b45e49b9bc40f508bcfef81ec79e2178c7167df536c9ae1bb6
Copying blob sha256:e3ed84217cde78aacc15a69b14acb2c046e57a17c2bac6168e88cf14e1149297
Copying blob sha256:718a00fe32127ad01ddab9fc4b7c968ab2679c92c6385ac6865ae6e2523275e4
Copying blob sha256:12305653be8076c8dd92c10ca264b988347d5d5bb8324f3fef50931ec7a98dec
Copying blob sha256:cf3714e02e6303da92bef7ebdad64221c4d74fe5d8fb7a699486d9dece20600e
Copying blob sha256:4d25ca59348522bfa3a23819387d88f6d72cc6f444647c545a28861a2464b52c
Copying blob sha256:abe0a3f653ea06c6fd277c2b63728084f6a55968fc051a68e7fe1e723f3e1527
Copying blob sha256:5c52e6a6dbdd507d19e2df35836e4221f0da040595cb3653ff059ee98ed88481
Copying config sha256:fcf4ae1744dedf14bd09d45d5b8af39b936c50eb288e82ad69c100e87388f3b1
Writing manifest to image destination
time="2023-12-13T19:21:15Z" level=warning msg="Path \"/run/secrets/etc-pki-entitlement\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
uid=0(root) gid=0(root) groups=0(root)
         0          0 4294967295
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
systemd 254.7-1.fc39 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Fedora Linux 39 (Container Image)!

Couldn't move remaining userspace processes, ignoring: Input/output error
bpf-lsm: BPF LSM hook not enabled in the kernel, BPF LSM not supported
Queued start job for default target graphical.target.
[  OK  ] Created slice system-getty.slice - Slice /system/getty.
[  OK  ] Created slice system-modprobe.slice - Slice /system/modprobe.
[  OK  ] Created slice user.slice - User and Session Slice.
[...]
         Starting systemd-update-utmp-runle…- Record Runlevel Change in UTMP...
[  OK  ] Finished systemd-update-utmp-runle…e - Record Runlevel Change in UTMP.

Fedora Linux 39 (Container Image)
Kernel 5.14.0-284.43.1.el9_2.x86_64 on an x86_64 (console)

  7. Delete the Pod: oc delete -f test-podman.yaml
  8. Edit the YAML file and uncomment the annotations and securityContext:
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: auto
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 300000
  9. As the regular user, create the Pod from this YAML: oc apply -f test-podman.yaml
  10. After a short while, check what oc logs -f pod/test-podman reports about the Pod.

Describe the results you received

The output shows that the OpenShift Pod now runs user-namespaced (uid 0 in the Pod is uid 300000 on the worker host) ... but then systemd fails, even though /sys/fs/cgroup is shown mounted read-write both in the OpenShift Pod and in the podman container:

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         1     200000      65535
         0     300000          1
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c 'id ; cat /proc/self/uid_map ; mount | grep cgroup ; exec /usr/sbin/init --show-status'
time="2023-12-13T19:22:31Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
Trying to pull quay.io/podman/stable:latest...
Getting image source signatures
Copying blob sha256:dca987133639956bbd8d83e9b09d881728d87e139191425d1023025611e50f1b
Copying blob sha256:cf3714e02e6303da92bef7ebdad64221c4d74fe5d8fb7a699486d9dece20600e
Copying blob sha256:718a00fe32127ad01ddab9fc4b7c968ab2679c92c6385ac6865ae6e2523275e4
Copying blob sha256:4084d87f5b2bb0b45e49b9bc40f508bcfef81ec79e2178c7167df536c9ae1bb6
Copying blob sha256:12305653be8076c8dd92c10ca264b988347d5d5bb8324f3fef50931ec7a98dec
Copying blob sha256:e3ed84217cde78aacc15a69b14acb2c046e57a17c2bac6168e88cf14e1149297
Copying blob sha256:4d25ca59348522bfa3a23819387d88f6d72cc6f444647c545a28861a2464b52c
Copying blob sha256:abe0a3f653ea06c6fd277c2b63728084f6a55968fc051a68e7fe1e723f3e1527
Copying blob sha256:5c52e6a6dbdd507d19e2df35836e4221f0da040595cb3653ff059ee98ed88481
Copying config sha256:fcf4ae1744dedf14bd09d45d5b8af39b936c50eb288e82ad69c100e87388f3b1
Writing manifest to image destination
time="2023-12-13T19:22:37Z" level=warning msg="Path \"/run/secrets/etc-pki-entitlement\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
uid=0(root) gid=0(root) groups=0(root)
         1     200000      65535
         0     300000          1
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
systemd 254.7-1.fc39 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Fedora Linux 39 (Container Image)!

Failed to open /dev/tty0: Permission denied
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Describe the results you expected

No error, systemd running in that user namespace (created by OpenShift's CRI-O).

podman info output

host:
  arch: amd64
  buildahVersion: 1.33.2
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-2.fc39.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: '
  cpuUtilization:
    idlePercent: 98.05
    systemPercent: 0.56
    userPercent: 1.39
  cpus: 8
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: container
    version: "39"
  eventLogger: file
  freeLocks: 2048
  hostname: test-podman
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-284.43.1.el9_2.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 18642235392
  memTotal: 33100300288
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.9.0-1.fc39.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.9.0
    package: netavark-1.9.0-1.fc39.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.9.0
  ociRuntime:
    name: crun
    package: crun-1.12-1.fc39.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.12
      commit: ce429cb2e277d001c2179df1ac66a470f00802ae
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20231119.g4f1709d-1.fc39.x86_64
    version: |
      pasta 0^20231119.g4f1709d-1.fc39.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-1.fc39.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 11h 50m 23.00s (Approximately 0.46 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.imagestore: /var/lib/shared
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.12-2.fc39.x86_64
      Version: |-
        fusermount3 version: 3.16.1
        fuse-overlayfs: version 1.12
        FUSE library version 3.16.1
        using FUSE kernel interface version 7.38
    overlay.mountopt: nodev,fsync=0
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 321517498368
  graphRootUsed: 16494317568
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.8.1
  Built: 1701777650
  BuiltTime: Tue Dec  5 12:00:50 2023
  GitCommit: ""
  GoVersion: go1.21.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.8.1

Podman in a container

Yes

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Note that the "Privileged Or Rootless" selection that the bug report form forces me to make does not really make sense here -- I am trying to run it privileged and rootless at the same time; it's just that the user namespacing is done by CRI-O, not by podman.

Additional information

Deterministic.

@adelton adelton added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2023
@adelton
Contributor Author

adelton commented Dec 13, 2023

I get the same result even if I amend the podman run parameters to include --systemd=always.

@giuseppe
Member

You have processes running in the container's root cgroup, which (on cgroup v2) prevents creating sub-cgroups and moving processes into them. One thing you could try in the container is:

# mkdir /sys/fs/cgroup/init
# echo 1 > /sys/fs/cgroup/init/cgroup.procs

and make sure no other processes are running in the root cgroup (i.e. /sys/fs/cgroup/cgroup.procs must be empty)

@adelton
Contributor Author

adelton commented Dec 18, 2023

Thanks @giuseppe for those hints.

When I added

cat /sys/fs/cgroup/cgroup.procs ; mkdir /sys/fs/cgroup/init ; echo 1 > /sys/fs/cgroup/init/cgroup.procs

to the shell command, I got

+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ cat /sys/fs/cgroup/cgroup.procs
0
0
0
[... 146 more lines of "0" ...]
0
+ mkdir /sys/fs/cgroup/init
mkdir: cannot create directory '/sys/fs/cgroup/init': Permission denied
+ echo 1
bash: line 1: /sys/fs/cgroup/init/cgroup.procs: No such file or directory

So the /sys/fs/cgroup/cgroup.procs is not empty (but it's full of zeroes) and the process running in the user namespace cannot create that /sys/fs/cgroup/init.

@giuseppe
Member

thanks, the 0s mean that there are already processes in the cgroup, but they are not part of the current PID namespace, so you cannot access/reference them -- but they are still there :/

That looks like a Kubernetes/CRI-O error. There should not be any process in your cgroup except the ones from your container. Not sure how the privileged flag is affecting this.

What is the output of cat /proc/self/cgroup in the container? If you get something other than '/', then you might need to do something like:

# mkdir /sys/fs/cgroup/$CGROUP_YOU_GOT/init
# echo 1 > /sys/fs/cgroup/$CGROUP_YOU_GOT/init/cgroup.procs

@adelton
Contributor Author

adelton commented Dec 18, 2023

What is the output of cat /proc/self/cgroup in the container?

0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod67711498_0b76_435e_8929_2ec772bd35ac.slice/crio-f3e20663b3226133b465d1cf61bda915272154af56a51a7939531b87ea0b07bd.scope

Is that expected?

Shouldn't --cgroupns=private have made it a bit more private?

If you get something other than '/', then you might need to do something like:

I've used

cat /proc/self/cgroup ; CGROUP_YOU_GOT=$( sed "s%^0::/%%" /proc/self/cgroup ) ; mkdir /sys/fs/cgroup/$CGROUP_YOU_GOT/init ; echo 1 > /sys/fs/cgroup/$CGROUP_YOU_GOT/init/cgroup.procs

and got

+ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode1f87c0a_28a9_4c27_bcaf_1eb30c6488fc.slice/crio-05f1510a7c9549b477a7689f964588b6db2de9666a28edefe1ce06421d83e5f7.scope
++ sed 's%^0::/%%' /proc/self/cgroup
+ CGROUP_YOU_GOT=kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode1f87c0a_28a9_4c27_bcaf_1eb30c6488fc.slice/crio-05f1510a7c9549b477a7689f964588b6db2de9666a28edefe1ce06421d83e5f7.scope
+ mkdir /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode1f87c0a_28a9_4c27_bcaf_1eb30c6488fc.slice/crio-05f1510a7c9549b477a7689f964588b6db2de9666a28edefe1ce06421d83e5f7.scope/init
mkdir: cannot create directory '/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode1f87c0a_28a9_4c27_bcaf_1eb30c6488fc.slice/crio-05f1510a7c9549b477a7689f964588b6db2de9666a28edefe1ce06421d83e5f7.scope/init': Permission denied
+ echo 1
bash: line 1: /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode1f87c0a_28a9_4c27_bcaf_1eb30c6488fc.slice/crio-05f1510a7c9549b477a7689f964588b6db2de9666a28edefe1ce06421d83e5f7.scope/init/cgroup.procs: No such file or directory

@giuseppe
Member

Weird that a privileged container has access to the host cgroup, but in read-only mode :/

Then you may need to set up a volume from the host /sys/fs/cgroup, or you can try to mount a fresh cgroup hierarchy inside the container:

# umount /sys/fs/cgroup
# mount -t cgroup2 cgroup2 /sys/fs/cgroup

@adelton
Contributor Author

adelton commented Dec 18, 2023

That sadly yields

+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ umount /sys/fs/cgroup
+ mount -t cgroup2 cgroup2 /sys/fs/cgroup
mount: /sys/fs/cgroup: permission denied.
       dmesg(1) may have more information after failed mount system call.

and there is nothing cgroup-related in dmesg -- it ends with

[  317.358613] IPv6: ADDRCONF(NETDEV_CHANGE): 1aad4ad83d1bf21: link becomes ready
[  317.375708] device 1aad4ad83d1bf21 entered promiscuous mode
[  329.910163] device 1aad4ad83d1bf21 left promiscuous mode
[  391.214847] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  391.214889] IPv6: ADDRCONF(NETDEV_CHANGE): 337e4fe67588f99: link becomes ready
[  391.231305] device 337e4fe67588f99 entered promiscuous mode
[  403.111496] device 337e4fe67588f99 left promiscuous mode
[  451.763258] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  451.763311] IPv6: ADDRCONF(NETDEV_CHANGE): 1dc9efc6fc0e63d: link becomes ready
[  451.777866] device 1dc9efc6fc0e63d entered promiscuous mode

@adelton
Contributor Author

adelton commented Dec 18, 2023

I realized I might have misunderstood where you wanted me to make these changes. I made them in the shell script in the "outer" container, created by OpenShift / CRI-O, before running that podman run.

When I put them into the podman run shell script, in the inner container, I get

+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2cf3c47b_be28_4452_bc2d_9649a1888e5e.slice/crio-2eba818d8cf2b3783458b31a0b00dbe22643bc4fcc2a76501c7196b05a3f43a2.scope
+ podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c 'set -x ; id ; cat /proc/self/uid_map ; mount | grep cgroup ; cat /proc/self/cgroup ; cat /sys/fs/cgroup/cgroup.procs ; umount /sys/fs/cgroup ; mount -t cgroup2 cgroup2 /sys/fs/cgroup ; cat /proc/self/cgroup ; cat /sys/fs/cgroup/cgroup.procs ; mkdir /sys/fs/cgroup/init ; echo 1 > /sys/fs/cgroup/init/cgroup.procs ; exec /usr/sbin/init --show-status'
time="2023-12-18T20:33:28Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
Trying to pull quay.io/podman/stable:latest...
Getting image source signatures
Copying blob sha256:d00615445cd619571cdbf3ecb3539b942a172bebde42997a34f245537d1b02e4
Copying blob sha256:88fb5b06f151d51781938a021b646166999bf4bda415997a70f783aae6786f4d
Copying blob sha256:718a00fe32127ad01ddab9fc4b7c968ab2679c92c6385ac6865ae6e2523275e4
Copying blob sha256:9c8aa943ada9627bc58c1f3148c38a1b0dd77be4b57b11361fcbd3d973f83d5c
Copying blob sha256:0cbdea6f14c1949859bb7abf787a973cc5d8ab4b931ad69fcdcf37bc937a5a7f
Copying blob sha256:79f17d1ec90f889477df5d97fb8d5f304ea0ecaf3ad180be5d302854cde8c568
Copying blob sha256:17ebbd7273cbba618cd50fa6ac3084e54faf10a507fd084e89fb75ed5b99d071
Copying blob sha256:3bb12921afd4ab04074c5a20fcf66c22a54cc36760a0735970e99acaa3462e99
Copying blob sha256:7d5557dc6408c937f05b6aee5a581aa57648316fab0601746709a08becee2ea8
Copying config sha256:aa5f91fceffc4dd87e8ebe90e291dfe4e639eda2831ba9a942f20dbb1597f6e3
Writing manifest to image destination
time="2023-12-18T20:33:33Z" level=warning msg="Path \"/run/secrets/etc-pki-entitlement\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         1     200000      65535
         0     300000          1
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ cat /proc/self/cgroup
0::/
+ cat /sys/fs/cgroup/cgroup.procs
0
0
0
1
7
+ umount /sys/fs/cgroup
+ mount -t cgroup2 cgroup2 /sys/fs/cgroup
+ cat /proc/self/cgroup
0::/
+ cat /sys/fs/cgroup/cgroup.procs
0
0
0
1
11
+ mkdir /sys/fs/cgroup/init
mkdir: cannot create directory '/sys/fs/cgroup/init': Permission denied
+ echo 1
sh: line 1: /sys/fs/cgroup/init/cgroup.procs: No such file or directory
+ exec /usr/sbin/init --show-status
systemd 254.7-1.fc39 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Fedora Linux 39 (Container Image)!

Failed to open /dev/tty0: Permission denied
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

So in the container created by podman run I see 0::/ in /proc/self/cgroup but there are still some processes in /sys/fs/cgroup/cgroup.procs. I can umount and mount the cgroup2 filesystem but the content of /sys/fs/cgroup/cgroup.procs stays mostly the same, and then mkdir /sys/fs/cgroup/init still fails.

@adelton
Contributor Author

adelton commented Dec 18, 2023

I'm probably also looking for some guidance on whether podman run's --cgroups option should be used in some way, and possibly also the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation. Ideally -- do you have a working example?

@giuseppe
Member

I don't have a working example, as I've never tried this combination.

I was not aware of the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation, but that sounds like something you need, so that the /sys/fs/cgroup hierarchy is mounted writable and you do not need to create a new mount.

@adelton
Contributor Author

adelton commented Dec 20, 2023

The problem is, I don't see any change in behaviour when I use that (so perhaps it is the default already?). The mount output shows that the mountpoint is rw already.

@giuseppe If I got you access to an OpenShift cluster set up for this, would you be willing to investigate the behaviour directly?

@giuseppe
Member

@giuseppe If I got you access to an OpenShift cluster set up for this, would you be willing to investigate the behaviour directly?

yes, that will probably help so I can investigate what is going on. Does tomorrow morning (Europe time) work for you?

@adelton
Contributor Author

adelton commented Dec 20, 2023

Yes. I'll ping you. Thank you!

@giuseppe
Member

After the investigation, it turned out that the root cause of these failures is that CRI-O doesn't delegate the cgroup to any user in the user namespace. So even if the cgroup file system is mounted writable, no user in the user namespace can write to it.

I've opened an issue with CRI-O to request this feature: cri-o/cri-o#7623
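
For context: on cgroup v2, delegation conventionally means chowning the container's cgroup directory and a few of its control files to the target user (this is what the systemd delegation documentation describes). A rough sketch of what that would look like on the host, assuming the uid_map above where host uid 300000 is uid 0 in the pod (scope path abbreviated):

# on the worker node, for the container's scope cgroup
CG=/sys/fs/cgroup/kubepods.slice/.../crio-<container-id>.scope
chown 300000 "$CG" "$CG/cgroup.procs" "$CG/cgroup.subtree_control" "$CG/cgroup.threads"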

@adelton
Contributor Author

adelton commented Jan 5, 2024

With @rata's help in cri-o/cri-o#7623, I was able to make some progress with the investigation on OpenShift.

It turns out that the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true" annotation is crucial, at least on OpenShift based on Kubernetes 1.27. (On the other hand, hostUsers: false did not have any effect.)

To use io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw, it needs to be added on the OpenShift worker nodes to /etc/crio/crio.conf.d/00-default as

[crio.runtime.workloads.openshift-builder]
activation_annotation = "io.openshift.builder"
allowed_annotations = [
  "io.kubernetes.cri-o.userns-mode",
  "io.kubernetes.cri-o.Devices",
  "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw"
]

and the crio systemd service restarted.

The

    securityContext:
      privileged: true

then needs to be replaced with

    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_ADMIN"]

Keeping that privileged: true there actually prevents /sys/fs/cgroup from getting the correct ownership, so from inside the container it appears owned by nobody (which means it is owned by the host's root, the real uid 0).
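
The nobody owner is the kernel's overflow uid at work: any file owned by a uid that has no mapping in the current user namespace is presented as the overflow uid. A quick way to confirm, as a sketch:

# inside the container: unmapped owners show up as the overflow uid
cat /proc/sys/kernel/overflowuid   # typically 65534, displayed as "nobody"
ls -lan /sys/fs/cgroup | head      # numeric listing makes the 65534 visible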

To help debugging, I added ls -la /sys/fs/cgroup to that list of bash commands, to better see what changes have what effect.

With these changes, I'm able to run podman in OpenShift user-namespaced, and /sys/fs/cgroup seems writable by the root uid in the container.

However, since the Pod's container is now not privileged, running podman fails with

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         1     200000      65535
         0     300000          1
+ mount
+ grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -la /sys/fs/cgroup
total 0
drwxr-xr-x. 2 root   nobody 0 Jan  5 17:44 .
drwxr-xr-x. 9 nobody nobody 0 Jan  5 16:32 ..
-r--r--r--. 1 nobody nobody 0 Jan  5 17:44 cgroup.controllers
-r--r--r--. 1 nobody nobody 0 Jan  5 17:44 cgroup.events
-rw-r--r--. 1 nobody nobody 0 Jan  5 17:44 cgroup.freeze
[...]
+ podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c 'id ; cat /proc/self/uid_map ; mount | grep cgroup ; ls -la /sys/fs/cgroup ; exec /usr/sbin/init --show-status'
time="2024-01-05T17:44:42Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
Trying to pull quay.io/podman/stable:latest...
Getting image source signatures
Copying blob sha256:718a00fe32127ad01ddab9fc4b7c968ab2679c92c6385ac6865ae6e2523275e4
Copying blob sha256:5271bbb74c5f8de7d10a701c855f8008addb156e89ad8cf5bb52c70fcdeceec3
Copying blob sha256:df0aedc32d90516001a0e7f610ae9d7ae0331b5250a5e2817bde179ed2b2f1ce
Copying blob sha256:12bb734364b9210805dd5da002937cd6228cc01b5cb23fdb17c8ef8d404a606c
Copying blob sha256:c6931ba07be02717b061da4f778d9163ac27467d7f4fe0ad129c64bc40ac1715
Copying blob sha256:77b5e14eecc646ac7ac04ad60f29271f68eb72db489788a986fd6a155372b826
Copying blob sha256:f6c436956692e15fe91c4131634445bffcf5d1ec90705bfa6374af60eef412fc
Copying blob sha256:7042cf026512fef7c6225582354b4944102f76c5b5b4ade65284c5cf1692b4ac
Copying blob sha256:ecf5af26e43f79bbcdf65232a15764bebea53feeb431db329017df0db50130cf
Copying config sha256:2c35043cb2d70b012627d97f78a9c2297003649374bbe003e7acf185f3755b50
Writing manifest to image destination
time="2024-01-05T17:44:47Z" level=error msg="Unmounting /var/lib/containers/storage/overlay/8c63941c301041ea5fea6c41e4d259678d8d6459d83ad28c16fbcc89ead21b5a/merged: invalid argument"
Error: mounting storage for container 3fb13570f56e776e0e22068e072ef3e68108d576de61e09be07545fc02f3fbf4: creating overlay mount to /var/lib/containers/storage/overlay/8c63941c301041ea5fea6c41e4d259678d8d6459d83ad28c16fbcc89ead21b5a/merged, mount_data="lowerdir=/var/lib/containers/storage/overlay/l/VILPOB75NLSTKKKORRBIQQ23FM:/var/lib/containers/storage/overlay/l/K2MSVP5TUAEEXW4VDNVOMDELV6:/var/lib/containers/storage/overlay/l/TYMYUYBQ7EPZKAPMDWGEZGJNU6:/var/lib/containers/storage/overlay/l/J7W4PHIISTOASU6Q5PFL6UUUQ6:/var/lib/containers/storage/overlay/l/WOOABCOSK3JDMYHXV3XH6EDW5X:/var/lib/containers/storage/overlay/l/FLOVW5VUPQ36I7PBZCDBFTZPR4:/var/lib/containers/storage/overlay/l/MDBJFUMGL6OGQAYTM4IW52Q6YS:/var/lib/containers/storage/overlay/l/626R4BH7L4RFJFC53FRUP22UHH:/var/lib/containers/storage/overlay/l/5Q4ZFODUGVBB2BWGFXPOCU422P,upperdir=/var/lib/containers/storage/overlay/8c63941c301041ea5fea6c41e4d259678d8d6459d83ad28c16fbcc89ead21b5a/diff,workdir=/var/lib/containers/storage/overlay/8c63941c301041ea5fea6c41e4d259678d8d6459d83ad28c16fbcc89ead21b5a/work,fsync=0,volatile": using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first
fuse-overlayfs: cannot mount: No such file or directory

@rata

rata commented Jan 8, 2024

@adelton in the non-privileged case, does it work if you mount an emptyDir (tmpfs) on the pod at /var/lib/containers/storage? Overlayfs inside overlayfs is problematic; that way you should avoid it. But I'm not sure if that would be enough.

@rata

rata commented Jan 8, 2024

cc @dgl, in case you have some time and have already hit & fixed this issue

@dgl

dgl commented Jan 9, 2024

Hi, yes, we experienced something similar in our environment (not using OpenShift, but using cri-o as the runtime).

To confirm the overlayfs issue you can check dmesg as the kernel will log something like:

overlayfs: filesystem on '...' not supported as upperdir

There are two options to fix this: as @rata says, you can use an emptyDir (or other volume) mounted on the container storage location, or you can make fuse-overlayfs work (add /dev/fuse to the cri-o devices annotation, though using native overlayfs is obviously better).
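
A sketch of the /dev/fuse route, assuming the io.kubernetes.cri-o.Devices annotation is in the node's allowed_annotations list (it already is in the CRI-O workload config quoted earlier):

metadata:
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.Devices: "/dev/fuse"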

@adelton
Contributor Author

adelton commented Jan 9, 2024

Adding

    volumeMounts:
    - mountPath: /var/lib/containers
      name: var-lib-containers
  volumes:
  - name: var-lib-containers
    emptyDir: {}

does not seem to change the outcome -- I still get the fuse-overlayfs: cannot mount: No such file or directory error.

Adding mount | grep /var/lib/containers shows that the filesystem mounted in the Pod container is

/dev/nvme0n1p4 on /var/lib/containers type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota)

and ls -la /var/lib/containers shows

total 4
drwxrwxrwx. 2 nobody nobody    6 Jan  9 19:42 .
drwxr-xr-x. 1 root   root   4096 Jan  9 15:22 ..

@adelton
Contributor Author

adelton commented Jan 9, 2024

Defining a PVC podman-pvc and changing that emptyDir: {} to

    persistentVolumeClaim:
      claimName: podman-pvc

changes the mount output to

/dev/nvme2n1 on /var/lib/containers type ext4 (rw,relatime,seclabel)

(ext4 instead of xfs), the ls output to

total 24
drwxr-xr-x. 3 nobody nobody  4096 Jan  9 19:42 .
drwxr-xr-x. 1 root   root    4096 Jan  9 15:22 ..
drwx------. 2 nobody nobody 16384 Jan  9 19:42 lost+found

(the permissions are now 0755 instead of 0777 in the case of the emptyDir), and then the error changes to

+ podman run --privileged --cgroupns=private --rm -ti quay.io/podman/stable sh -c 'id ; cat /proc/self/uid_map ; mount | grep cgroup ; ls -la /sys/fs/cgroup ; exec /usr/sbin/init --show-status'
Error: creating runtime static files directory "/var/lib/containers/storage/libpod": mkdir /var/lib/containers/storage: permission denied

So the question is, with a volume, what is the recommended way to get CRI-O (?) to chown the volume to match the user-namespaced uid 0 in the Pod's container, in my case

+ cat /proc/self/uid_map
         1     331071      65535
         0     300000          1

?

@dgl

dgl commented Jan 9, 2024

(An aside to the main issue: we do successfully use overlayfs backed by xfs; per the Docker docs you need to format with ftype=1.)
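
A sketch of that, assuming /dev/sdX is the backing device (recent xfsprogs enable ftype by default):

mkfs.xfs -n ftype=1 /dev/sdX
# check an existing filesystem:
xfs_info /var/lib/containers | grep ftype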

Setting securityContext.fsGroup to one of the mapped GIDs may work here; although, depending on whether you configure a reduced set of capabilities, you may also need to add CAP_DAC_OVERRIDE (this adds it to the root user within the user namespace, not host root, so it's not a hugely scary permission).

(This is because CRI-O's annotation-based userns-mode doesn't have any explicit support for volumes, whereas the hostUsers-based support does do idmapping for volumes on recent Kubernetes, so that the ownership will be correct.)

@rata

rata commented Jan 10, 2024

@dgl +1

@adelton so does that mean it works using an emptyDir for that path, then?

@adelton
Contributor Author

adelton commented Jan 10, 2024

@rata No, emptyDir: {} yields the fuse-overlayfs: cannot mount: No such file or directory error.

Adding for example

    fsGroup: 200000
    fsGroupChangePolicy: OnRootMismatch

changes the ownership on the PVC-backed volume to

+ ls -la /var/lib/containers
total 24
drwxrwsr-x. 3 nobody root  4096 Jan 10 11:42 .
drwxr-xr-x. 1 root   root  4096 Jan  9 15:22 ..
drwxrws---. 2 nobody root 16384 Jan 10 11:42 lost+found

and that in turn brings the behaviour on par with the emptyDir: {} approach, back to the

time="2024-01-10T11:42:54Z" level=error msg="Unmounting /var/lib/containers/storage/overlay/e74864eaeca32b53ecc9c12edef624cf404fb485ddd634705efd9612e0c16370/merged: invalid argument"
Error: mounting storage for container 896fbd87605c5fc5304b6353d569231f439803c899c6708a2b16f0aed8cb5415: creating overlay mount to /var/lib/containers/storage/overlay/e74864eaeca32b53ecc9c12edef624cf404fb485ddd634705efd9612e0c16370/merged, mount_data="lowerdir=/var/lib/containers/storage/overlay/l/EOOTMFG4XGGKSO6YTDXO7DM4D7:/var/lib/containers/storage/overlay/l/F2WVVGW77NWBSVT3EVN3NAKUTE:/var/lib/containers/storage/overlay/l/RO2R4VRMMI3QYUSUINIYDJ2KWV:/var/lib/containers/storage/overlay/l/QAUEJNGUU6CJ7RXF5ZF4E5VCNR:/var/lib/containers/storage/overlay/l/BHSBMOUZJZR2HMFUC2FZFH55CY:/var/lib/containers/storage/overlay/l/OF2FUHOHO7VWPSV723PDXEHGQC:/var/lib/containers/storage/overlay/l/45R5Z2QNV2EFBBOT6KCMNELZLY:/var/lib/containers/storage/overlay/l/JUMGH5IWJMQ7XS5VHARUD4A6RI:/var/lib/containers/storage/overlay/l/IQI2TD7HAGLIUCPOMSQKLJNG5E,upperdir=/var/lib/containers/storage/overlay/e74864eaeca32b53ecc9c12edef624cf404fb485ddd634705efd9612e0c16370/diff,workdir=/var/lib/containers/storage/overlay/e74864eaeca32b53ecc9c12edef624cf404fb485ddd634705efd9612e0c16370/work,fsync=0,volatile": using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first
fuse-overlayfs: cannot mount: No such file or directory
: exit status 1

error.

@adelton
Contributor Author

adelton commented Jan 10, 2024

With that

    fsGroup: 200000
    fsGroupChangePolicy: OnRootMismatch

and adding

    mkdir /var/lib/containers/storage

to the command list before running that podman run --privileged --cgroupns=private quay.io/podman/stable shows that the directory can be created correctly:

+ mkdir /var/lib/containers/storage
+ ls -la /var/lib/containers
total 28
drwxrwsr-x. 4 nobody root  4096 Jan 10 11:53 .
drwxr-xr-x. 1 root   root  4096 Jan  9 15:22 ..
drwxrws---. 2 nobody root 16384 Jan 10 11:53 lost+found
drwxr-sr-x. 2 root   root  4096 Jan 10 11:53 storage

So I guess the question now is -- if I seem to have root in a user namespace, with /var/lib/containers/storage writable by that root, how should I run podman in that user namespace so that it does not insist on using fuse-overlayfs? What options are there?

(For the record, adding CAP_DAC_OVERRIDE to the spec.containers[*].securityContext.capabilities.add list does not seem to change the behaviour in any way.)

@adelton
Contributor Author

adelton commented Jan 10, 2024

Removing /etc/containers/storage.conf before running podman in the Pod's container seems to help, with both the PVC and emptyDir setups. I then get

time="2024-01-10T12:16:28Z" level=error msg="Streaming contents of container 3f18321b837de201335d600ea6ed8956c892e2dfc2e958f33895f17996845baf directory for volume copy-up: error in copier subprocess: chrooting to directory \"/var/lib/containers/storage/overlay/5a61a1de97d209edfb20ca5899a8e2a331eb04126ee785ae07b572388955482d/merged/var/lib/containers\": operation not permitted"

which is easily fixed by adding "CAP_SYS_CHROOT" to the capabilities list.
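
As an alternative to deleting the file outright, a minimal /etc/containers/storage.conf that selects the native kernel overlay driver should have the same effect; a sketch, assuming the stock file only differs by forcing fuse-overlayfs via mount_program:

# /etc/containers/storage.conf
[storage]
driver = "overlay"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"

[storage.options.overlay]
# leave mount_program unset so podman uses kernel overlayfs directly
mountopt = "nodev"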

The next hurdle is then

time="2024-01-10T12:18:50Z" level=warning msg="Path \"/run/secrets/etc-pki-entitlement\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
Error: crun: mount `sysfs` to `sys`: Permission denied: OCI permission denied

@dgl

dgl commented Jan 10, 2024

Error: crun: mount `sysfs` to `sys`: Permission denied: OCI permission denied

This is likely due to having a masked /proc and /sys. ProcMountType Unmasked (set on the pod, with the alpha feature gate enabled) will allow the mount.

Although, since you have CAP_SYS_ADMIN, you can also just umount the masking paths.
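
A sketch of the pod-side change, assuming a cluster where the alpha ProcMountType feature gate can be enabled:

spec:
  containers:
  - name: container
    securityContext:
      procMount: Unmasked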

@adelton
Contributor Author

adelton commented Jan 10, 2024

Thanks for the hints. My understanding is that alpha features are not available on OpenShift.

When I added

     mount | grep sysfs ; umount /sys

to the list of commands, I got

+ mount
+ grep sysfs
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime,seclabel)
+ umount /sys
umount: /sys: block devices are not permitted on filesystem.

Is that ro,nosuid,nodev,noexec,relatime,seclabel the masking? Did you mean unmounting via umount, or something else?

@rata

rata commented Jan 10, 2024

@adelton are you using crun or runc? I think the sysfs error will go away with crun, as it falls back to a bind-mount of /sys when it can't mount a fresh sysfs.

@adelton
Contributor Author

adelton commented Jan 10, 2024

The message says

Error: crun: mount `sysfs` to `sys`: Permission denied: OCI permission denied

so I would assume crun is used. Adding podman info to the list of commands confirms that:

  ociRuntime:
    name: crun
    package: crun-1.12-1.fc39.x86_64
    path: /usr/bin/crun

@adelton
Contributor Author

adelton commented Jan 10, 2024

For the record, when I add runc to the image and try with runc, the error message is

Error: runc: runc create failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/8), flags: 0xe: permission denied: OCI permission denied

@adelton
Contributor Author

adelton commented Jan 10, 2024

Breakthrough: with

      seLinuxOptions:
        type: spc_t

and adding --tmpfs /sys to the podman run, not only does podman run, but even the systemd inside it runs fine (the status output is there, with all green OKs).

I need to go back and re-reproduce to make sure that the setup is really as confined as I wanted it to be.

@adelton
Contributor Author

adelton commented Jan 10, 2024

So on a fresh OpenShift cluster

Client Version: 4.14.0-202310201027.p0.g0c63f9d.assembly.stream-0c63f9d
Kustomize Version: v5.0.1
Kubernetes Version: v1.27.8+4fab27b

I've verified that the following steps work:

  1. Have an OpenShift cluster with a regular account (user) and an admin account (admin) with the cluster-admin role.
  2. As the regular user, oc new-project test-1
  3. As the admin, make it possible to run Pods in that namespace test-1 with extra privileges (albeit not privileged: true): oc adm policy add-scc-to-user privileged -z default -n test-1
  4. As the admin, configure CRI-O on the worker nodes to allow the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation, by applying the following with oc apply -f -:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cri-o-cgroup2-mount-hierarchy-rw
spec:
  selector:
    matchLabels:
      name: cri-o-cgroup2-mount-hierarchy-rw
  template:
    metadata:
      labels:
        name: cri-o-cgroup2-mount-hierarchy-rw
    spec:
      hostPID: true
      containers:
      - name: cri-o-reconfigure
        image: registry.fedoraproject.org/fedora
        securityContext:
          privileged: true
        command:
        - /bin/bash
        - -c
        - set -x ; date ; grep io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw /host/etc/crio/crio.conf.d/00-default && exit 0 ; sed -i '/allowed_annotations/a "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw",' /host/etc/crio/crio.conf.d/00-default ; chroot /host systemctl restart crio ; sleep infinity
        volumeMounts:
        - mountPath: /host
          name: host
      volumes:
      - hostPath:
          path: /
          type: ''
        name: host
  5. As a regular user, create a Pod with podman, by applying the following with oc apply -f -:
apiVersion: v1
kind: Pod
metadata:
  name: test-podman
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: auto
    io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 300000
    fsGroup: 200000
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: container
    image: quay.io/podman/stable
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["SYS_ADMIN", "CAP_SYS_CHROOT"]
      seLinuxOptions:
        type: spc_t
    command:
    - bash
    - -c
    - set -x ; id ; cat /proc/self/uid_map ; mount | grep cgroup ; rm /etc/containers/storage.conf ; podman run --privileged --tmpfs /sys --rm -ti quay.io/podman/stable sh -c "id ; cat /proc/self/uid_map ; mount | grep cgroup ; exec /usr/sbin/init --show-status"
    volumeMounts:
    - mountPath: /var/lib/containers
      name: var-lib-containers
  volumes:
  - name: var-lib-containers
    emptyDir: {}

Run oc logs -f pod/test-podman and observe that

+ cat /proc/self/uid_map
         1     200000      65535
         0     300000          1

shows that we run in a user namespace and

systemd 254.7-1.fc39 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Fedora Linux 39 (Container Image)!

Couldn't move remaining userspace processes, ignoring: Input/output error
bpf-lsm: BPF LSM hook not enabled in the kernel, BPF LSM not supported
Queued start job for default target graphical.target.
[  OK  ] Created slice system-getty.slice - Slice /system/getty.
[  OK  ] Created slice system-modprobe.slice - Slice /system/modprobe.
[  OK  ] Created slice user.slice - User and Session Slice.
[  OK  ] Started systemd-ask-password-conso…equests to Console Directory Watch.
[  OK  ] Started systemd-ask-password-wall.…d Requests to Wall Directory Watch.
[  OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Reached target paths.target - Path Units.
[...]
[  OK  ] Reached target multi-user.target - Multi-User System.
[  OK  ] Reached target graphical.target - Graphical Interface.
         Starting systemd-update-utmp-runle…- Record Runlevel Change in UTMP...
[  OK  ] Finished systemd-update-utmp-runle…e - Record Runlevel Change in UTMP.

Fedora Linux 39 (Container Image)
Kernel 5.14.0-284.43.1.el9_2.x86_64 on an x86_64 (console)

shows that the systemd run by podman in the Pod's container works.

To sum up, the changes needed in the Pod were

  1. add the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true" annotation to get /sys/fs/cgroup mounted rw and writable by the root in the container;
  2. emptyDir: {} mounted at /var/lib/containers to enable the use of something other than FUSE overlayfs;
  3. rm /etc/containers/storage.conf to get rid of the configuration which explicitly forces FUSE overlayfs;
  4. fsGroup: 200000 and fsGroupChangePolicy: OnRootMismatch to make /var/lib/containers writable by the processes in the container;
  5. capabilities ["SYS_ADMIN", "CAP_SYS_CHROOT"] instead of privileged: true;
  6. the podman run argument --tmpfs /sys to fix/work around the Error: crun: mount `sysfs` to `sys`: Operation not permitted: OCI permission denied;
  7. seLinuxOptions with type: spc_t to fix/work around the same for /proc, /dev/pts, ...

Based on the recommendation in cri-o/cri-o#7623, I tried to avoid using privileged: true, because that seems to use completely different logic for cgroups and possibly other bits.

On the other hand, it would be nice to be able to get rid of type: spc_t, as (IIUIC) that is one of the important aspects of privileged containers. Should we be able to get things running without spc_t?
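
One way to find out what spc_t is masking would be to reproduce with the default container SELinux type and inspect the AVC denials on the worker node; a sketch, assuming auditd runs there:

# on the worker node, right after reproducing the failure:
ausearch -m AVC,USER_AVC -ts recent
# or, without auditd:
dmesg | grep -i -e avc -e selinux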
