--userns=keep-id storage-chown-by-maps kills machine with large images #16541

Open
Hi-Angel opened this issue Nov 17, 2022 · 84 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Hi-Angel

Hi-Angel commented Nov 17, 2022

/kind bug

Description

While migrating a CI from Docker to Podman, I'm occasionally stumbling upon freezes of Podman commands. They may take dozens (!!!) of minutes, with Podman not doing anything at all.

The hangs aren't specific to any command. E.g. right as I'm writing this text, I see two jobs, one with podman run … and another with podman inspect, both frozen. So I connected to the server over ssh and tried running time podman inspect foobar (literally a request for a non-existent foobar image), and it hung as well. podman ps hangs, and even podman version hangs!!

Basically, to be able to create this report I had to kill a podman process. I had 2 podman run processes and 2 podman inspects. I killed one of the podman inspect processes, and a little later CI finally proceeded and podman commands started working.

Steps to reproduce the issue:

I'm afraid I couldn't find any. It seems to happen when multiple podman processes run at once, but my attempts at simulating that in different ways didn't succeed. It just happens from time to time as part of CI, in which case CI basically breaks completely.

Steps to reproduce were found as part of this duplicate issue and are copied below:

  1. This is the "fairly large" image:

    podman pull ghcr.io/martinpitt/swaypod:latest
    time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:latest
    
  2. This is the image that adds TeXlive (which makes it a few hundred MB larger):

    podman pull ghcr.io/martinpitt/swaypod:allpkgs
    time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:allpkgs
    

Describe the results you received:

Step 1 takes 4 s on a Fedora 37 cloud VM (2 CPUs, 4 GiB RAM) with the default btrfs. On a standard RHEL 9.2 VM with XFS and on my laptop's Fedora 37 VM with /home being on ext4, it takes about 20 seconds. In top I see a process called "exe" which is taking 100% CPU:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
972 admin     20   0 1351936  65172  28028 S  96.0   1.7   0:12.33 exe

That is really this:

admin       1972 95.0  1.3 1351680 49344 pts/0   Sl+  04:04   0:01 storage-chown-by-maps /home/admin/.local/share/containers/storage/overlay/3cc2d72c07248c18a9185b6a5bba0e7932b0ce5c26dbc763e476eb50c2a7ea94/merged

With the larger image in step 2, the Fedora 37 btrfs VM takes merely 6 s. However, both on the RHEL 9.2 XFS VM and on my bare-metal ext4 Fedora 37 laptop, the storage-chown-by-maps process never ends. After maybe half a minute it kills the VM (ssh dead, cannot log into the virsh console either), and my laptop becomes really sluggish; I cannot even start top any more. Trying to kill -9 or even sudo kill -9 (!) that storage-chown-by-maps does not work either; it's just unkillable.

Describe the results you expected:

The storage-chown-by-maps process should finish eventually, but ideally reasonably fast. This is more or less a glorified chown -R, no? That shouldn't take more than a few seconds.

Output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.18.1
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64
Output of podman info:
host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2:2.1.5-0ubuntu22.04+obs14.3_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 96.22
    systemPercent: 0.76
    userPercent: 3.03
  cpus: 4
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: file
  hostname: node29
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 5.15.0-52-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 1288302592
  memTotal: 67404197888
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun_1.7-0ubuntu22.04+obs47.1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.7
      commit: 40d996ea8a827981895ce22886a9bac367f87264
      rundir: /run/user/998/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/998/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.0-0ubuntu22.04+obs10.15_amd64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 8575172608
  swapTotal: 8589930496
  uptime: 223h 19m 11.00s (Approximately 9.29 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/gitlab-runner/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/gitlab-runner/.local/share/containers/storage
  graphRootAllocated: 983350071296
  graphRootUsed: 645746360320
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 202
  runRoot: /tmp/podman-run-998/containers
  volumePath: /home/gitlab-runner/.local/share/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 0
  BuiltTime: Thu Jan  1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.18.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

Package info:

$ apt list podman
Listing... Done
podman/unknown,now 4:4.3.1-0ubuntu22.04+obs64.3 amd64 [installed]
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 arm64
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 armhf
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 s390x

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2022
@mheon
Member

mheon commented Nov 17, 2022

More details on the commands being run (particularly the podman run commands) would help. This sounds like a potential deadlock.

@Hi-Angel
Author

More details on the commands being run (particularly the podman run commands) would help. This sounds like a potential deadlock.

Looking at the scrollback from when I had the hang, the 4 podman commands that were running are below (I edited out some paths and IDs). I am not sure why there are identical pairs… Looking at the paths, both podman runs come from a single job, and I don't have anything that would make two identical podman runs execute as part of one job 🤷‍♂️ Either way, here they are.

gitlab-+  288652  0.0  0.0 1409668 34596 ?       Sl   14:27   0:00 podman run --rm --userns=keep-id -v /home/user1/.ccache:/home/my_builder/.ccache -v /home/user1/.web_modules/v4/node_modules/:/home/my_builder/.web_modules/v4/node_modules/ -v /home/user1/builds/path_to_project/project/:/home/my_builder/project --user my_builder -t my-builder-ba7f7b1bf09c880df52be235960424ff8e5987a5f2bba341d0947a5004e41e6b:1 /bin/sh -c cd /home/my_builder/project && ./full_build.sh
gitlab-+  288662  0.1  0.0 1410180 39944 ?       Sl   14:27   0:03 podman run --rm --userns=keep-id -v /home/user1/.ccache:/home/my_builder/.ccache -v /home/user1/.web_modules/v4/node_modules/:/home/my_builder/.web_modules/v4/node_modules/ -v /home/user1/builds/path_to_project/project/:/home/my_builder/project --user my_builder -t my-builder-ba7f7b1bf09c880df52be235960424ff8e5987a5f2bba341d0947a5004e41e6b:1 /bin/sh -c cd /home/my_builder/project && ./full_build.sh
gitlab-+  288995  0.0  0.0 1409668 34084 ?       Sl   14:27   0:00 podman inspect base-iso-5cf95cc968e717578e97eb82b422534a6539a7c77718f7079d5344235978bcb8:1
gitlab-+  289004  0.0  0.0 1409584 33944 ?       Sl   14:27   0:00 podman inspect base-iso-5cf95cc968e717578e97eb82b422534a6539a7c77718f7079d5344235978bcb8:1

@Hi-Angel
Author

Hi-Angel commented Nov 17, 2022

I am not sure why are there identical pairs…

Oh, apparently that's just how it works. Running a mere podman run --rm -it test creates two identical processes as well.

@Hi-Angel
Author

FTR: running the same podman run and podman inspect commands from the same directory manually doesn't hang. So, as I said, it's sporadic.

@Surfrdan

Surfrdan commented Nov 18, 2022

I'm experiencing the same hanging behaviour when I run a tomcat container. I know the container is doing a LOT of processing on startup (indexing 14 TB of JSON files into a Postgres DB) and I expect it to un-hang in the future (possibly days from now), as I have run the same container with smaller datasets and it releases in time (maybe minutes or hours). While this is happening I can get nothing from podman ps, podman top, podman logs and probably others. They all hang.

podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z  -it fcrepo_tomcat

iotop on the host OS does show disk IO for the container in question, so it's doing something:

 337148 be/4 root      127.79 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337150 be/4 root       40.36 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337151 be/4 root       26.90 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337158 be/4 root       13.45 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337159 be/4 root       40.36 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337161 be/4 root       13.45 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337175 be/4 root       13.45 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat
 337176 be/4 root       40.36 K/s    0.00 B/s podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat

@mheon
Member

mheon commented Nov 18, 2022

Podman should not hang because a container is processing IO. A Podman hang is almost certainly because something is holding a lock; we don't hold locks during steady-state, only during certain setup operations.

@mheon
Member

mheon commented Nov 18, 2022

Are the containers running (as in, do you see Tomcat in ps? Or just podman?)

@Surfrdan

On closer inspection @mheon, I cannot actually see a tomcat process running yet.

Is there anything else I can do to help debug the situation?

From the host:

$ ps aux | grep fcrepo_tomcat_migration
root      337148 15.7  0.1 6582404 76264 pts/0   Sl+  Nov17 227:00 podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root:z -it fcrepo_tomcat

Obviously I can't run podman top, as it's still hanging from hours ago:

$ ps aux | grep 'podman top'
root      347816  0.0  0.0 413772 10608 pts/3    S+   09:06   0:00 sudo podman top fcrepo_tomcat_migration
root      347817  0.2  0.0 2819320 64512 pts/3   Sl+  09:06   1:19 podman top fcrepo_tomcat_migration

@mheon
Member

mheon commented Nov 18, 2022

How big is the volume (/fedora-storage/delivery/migrated/data/ocfl-root)? Does it have a lot of small files? I'm wondering if we aren't trying to chattr millions of files in there because of the :z on the volume mount... Omitting the :z and using --security-opt label=disable instead would be interesting, to see if the container starts faster.
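For clarity, here is roughly what that suggestion looks like applied to the command from the earlier comment (a sketch only; paths are copied verbatim from that comment, and --security-opt label=disable turns off SELinux label separation for this one container instead of recursively relabelling the volume):

    podman run -d --name fcrepo_tomcat_migration --pod fcrepo_migration \
        --security-opt label=disable \
        -v /fedora-storage/delivery/migrated/data/ocfl-root:/usr/local/fedora/data/ocfl-root \
        -it fcrepo_tomcat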

@Hi-Angel
Author

The hang I got right now is ridiculous. There is only one command running:

gitlab-+ 1197025  0.0  0.0 1409660 34788 pts/0   Sl+  06:44   0:00 podman run --rm --userns=keep-id -v /home/user1/.ccache:/home/my_builder/.ccache -v /home/user1/.web_modules/v4/node_modules/:/home/my_builder/.web_modules/v4/node_modules/ -v /home/user1/builds/J1a__jgb/0/developers/project/:/home/my_builder/project --user my_builder -t project-builder-db12a84caddb49eb7219b2256a64a8a5eb07feef7fb1c5c9c16641d547139b66:1 /bin/echo hello
gitlab-+ 1197035  0.3  0.0 1336184 43964 pts/0   Sl+  06:44   0:00 podman run --rm --userns=keep-id -v /home/user1/.ccache:/home/my_builder/.ccache -v /home/user1/.web_modules/v4/node_modules/:/home/my_builder/.web_modules/v4/node_modules/ -v /home/user1/builds/J1a__jgb/0/developers/project/:/home/my_builder/project --user my_builder -t project-builder-db12a84caddb49eb7219b2256a64a8a5eb07feef7fb1c5c9c16641d547139b66:1 /bin/echo hello

(as we found out earlier, a podman command duplicates itself in the process list, so this is just one command).

And it somehow managed to hang 😄 🤷‍♂️ Running podman version while these are stuck won't return, for example.

@Hi-Angel
Author

Oh, I don't know if this points to something, but after I Ctrl+C'ed the command above (I had run it manually), a few seconds later I got this message:

Error: creating container storage: creating an ID-mapped copy of layer "b42cb22c5974757a983a1d4e635f7ff1eba06eb691fd1a4d1a2842a551827958": signal: interrupt

@Hi-Angel
Author

It hung again as part of CI, and sending a kill so the other processes could continue does nothing; the process won't quit.

@Hi-Angel
Author

Got a stack trace from the process that hung first and now doesn't react to kill:

#0  runtime.futex () at /usr/lib/go-1.18/src/runtime/sys_linux_amd64.s:553
#1  0x0000556e9d7419d6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=93933576630435) at /usr/lib/go-1.18/src/runtime/os_linux.go:66
#2  0x0000556e9d71b3c7 in runtime.notesleep (n=0xfffffffffffffe00) at /usr/lib/go-1.18/src/runtime/lock_futex.go:159
#3  0x0000556e9d74aac5 in runtime.mPark () at /usr/lib/go-1.18/src/runtime/proc.go:1449
#4  0x0000556e9d74c645 in runtime.stoplockedm () at /usr/lib/go-1.18/src/runtime/proc.go:2422
#5  0x0000556e9d74e11d in runtime.schedule () at /usr/lib/go-1.18/src/runtime/proc.go:3119
#6  0x0000556e9d74e86d in runtime.park_m (gp=0xc000015a00) at /usr/lib/go-1.18/src/runtime/proc.go:3336
#7  0x0000556e9d776f05 in runtime.mcall () at /usr/lib/go-1.18/src/runtime/asm_amd64.s:425
#8  0x0000556e9d77b8a7 in runtime.newproc (fn=0x11) at <autogenerated>:1
#9  0x0000556e9f834d40 in ?? ()
#10 0x0000000000000011 in ?? ()
#11 0x0000556e9d776e85 in runtime.mstart () at /usr/lib/go-1.18/src/runtime/asm_amd64.s:367
#12 0x0000556e9d776e13 in runtime.rt0_go () at /usr/lib/go-1.18/src/runtime/asm_amd64.s:331
#13 0x0000000000000011 in ?? ()
#14 0x00007ffd9e018b78 in ?? ()
#15 0x0000000000000000 in ?? ()

@Surfrdan

How big is the volume (/fedora-storage/delivery/migrated/data/ocfl-root)? Does it have a lot of small files? I'm wondering if we aren't trying to chattr millions of files in there because of the :z on the volume mount... Omitting the :z and using --security-opt label=disable instead would be interesting, to see if the container starts faster.

It's a 13 TB volume with at least 50 million small JSON and text files and maybe 10 million larger binary files. I let the process run all weekend but it was still hung this morning. I Ctrl+C'ed the podman run command, and the other hanging podman ps, podman top, etc. commands completed instantly as soon as I'd killed the podman run.

I followed your advice and replaced the :z with the --security-opt label=disable args and the container came up in an instant! Thanks for identifying the cause of the hang!

@mheon
Member

mheon commented Nov 21, 2022

I'm going to go ahead and close this then. @Hi-Angel can open a fresh issue for their hang (which seems like a deadlock and not :z being really, really slow).

I'll also look into :z - it should probably log what it's doing if it's taking forever, and there's no reason the container has to be locked when we do it (unlocking the container would let commands like podman ps work at the same time).

@mheon mheon closed this as completed Nov 21, 2022
@Hi-Angel
Author

@mheon err. What? 😄 Why? I don't get it, why do you want me to create a duplicate? Do you want me to copy the comments over there as well? Err, I am so confused.

@mheon
Member

mheon commented Nov 21, 2022

Because it won't be a duplicate. A new bug requires that you provide the full bug template (including full podman info), which we find essential for debugging efforts. Further, your issue is unrelated to this one: effects may be similar, but causes are not, given you don't have a single relabel in your command.

@Hi-Angel
Author

A new bug requires that you provide the full bug template (including full podman info), which we find essential for debugging efforts.

I did that, didn't I? I provided all that info when I created this report.

Further, your issue is unrelated to this one: effects may be similar, but causes are not, given you don't have a single relabel in your command.

Right. So… why close an unrelated bug report?? 😂😂😂 I mean, don't get me wrong, I'm glad that @Surfrdan's issues are getting resolved, but they never even created a bug report. This one is about something similar to their problem, but, as you noted, not really the same one.

Because it won't be a duplicate.

So, based on my replies above, why wouldn't it?

@Hi-Angel
Author

Perhaps you confused the author? @Surfrdan wasn't the author here; I was.

@mheon
Member

mheon commented Nov 21, 2022

Yep, sorry for the runaround. Need more coffee. Reopening.

@Surfrdan Mind opening an RFE issue for additional logging around :z so this doesn't confuse other folks?

@mheon mheon reopened this Nov 21, 2022
@Hi-Angel
Author

Yep, sorry for the runaround. Need more coffee. Reopening.

No problem, it happens to the best of us ;)

@Surfrdan

Sorry for the confusion guys. I genuinely thought we had the same issue at the outset. I'll take mine to a new ticket as requested.

@lukasmrtvy

lukasmrtvy commented Nov 22, 2022

I have a similar problem: #16062 (podman 4.3.1, kernel 6.0.8, fcos 37.20221121.20.0).

@dkelsey

dkelsey commented Nov 23, 2022

I'm having the same issue.
I've restarted the podman machine a few times.
Containers run for a bit but don't show up in Podman Desktop, even after a refresh; however, I can connect to them on localhost.
I've enabled completions.
Basic commands hang: podman ps, podman start [tab] (the container names don't show up).
The terminal window becomes unresponsive; Ctrl-z, Ctrl-c, Ctrl-d, Ctrl-u have no effect. I have to close the window and start again.

These work:

  • podman machine stop
  • podman machine rm
  • podman machine init ...

I've recreated the machine a few times. It works for a few minutes, though the behaviour is strange as described above.
Recreating the machine only helps for about a minute.

I'm on macOS Monterey 12.6

@mheon
Member

mheon commented Nov 23, 2022

Can you SSH into the machine with podman machine ssh and see if there are any Podman processes running there that seem to be stalled?

@dkelsey

dkelsey commented Nov 23, 2022

I did attempt to ssh into the machine at one point, when podman was in the "hanging" state. I was able to ssh in; however, no shell prompt appeared.
The only way to exit the machine was by typing ~.

@rhatdan
Member

rhatdan commented Feb 19, 2023

In order to make this fast for a rootless user, it would take getting idmap support for rootless users.

Another option would be to pull the image in --userns=keep-id mode.
@giuseppe are we capable of doing this? If the user sets userns=keep-id in the containers.conf, we should pull in the correct user namespace, and not have to chown.

@giuseppe
Member

We had that in the past and it was causing trouble; I don't recall exactly what, but I can probably find it in the commit logs.

I'd not worry too much about this problem for now; there is already a workaround with fuse-overlayfs, and the last time I talked to Christian Brauner, he said he was interested in getting idmapped mounts working from a user namespace too, and that it was something he was going to work on in the near future.

Similarly, the extreme slowness is also caused by unprivileged overlay not supporting metacopy from a user namespace, so we end up copying whole files (even though I'd expect reflinks to be used on btrfs); that is also something that should be fixed in the kernel. Overlay developers are aware of the issue, as it came up recently in the composefs LKML thread.
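For reference, the fuse-overlayfs workaround mentioned above is normally configured in the rootless user's storage configuration; a minimal sketch (the mount_program path can differ per distribution, and existing storage may need a podman system reset to pick up the change cleanly):

    # ~/.config/containers/storage.conf
    [storage]
    driver = "overlay"

    [storage.options]
    mount_program = "/usr/bin/fuse-overlayfs"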

@Hi-Angel
Author

Another option would be to pull the image in --userns=keep-id mode.
@giuseppe are we capable of doing this? If the user sets userns=keep-id in the containers.conf, we should pull in the correct user namespace, and not have to chown.

we had that in the past and it was causing troubles, I don't recall exactly which ones but I can probably find it out in the commit logs.

I can comment on that. I tried making that the default and described my experience in that comment some time ago. The quote below summarizes the experience:

Adding the userns = "keep-id" option globally breaks containers that don't even try to access bind mounts: merely an ls /root/ in an unmodified ubuntu:20.04 results in Permission denied.

@lukasmrtvy

I have similar problems with --userns=auto.

@bendem

bendem commented Apr 5, 2023

Just came across this. keep-id was introduced because it's a valid use case (#2898); it's problematic that it has become a footgun now.

@oseiberts11

oseiberts11 commented May 16, 2023

Here is a similar report; I first reported it at #16062 (comment) but maybe it belongs here (too).

In #16062 (comment) this issue is considered a locking bug inside podman. We seem to have problems with a deadlock in the XFS file system, triggered by podman. I am wondering whether these have the same explanation or not.

The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels when the native overlayfs is used and not fuse-overlayfs.

One thing that is sub-optimal is that RabbitMQ needs its home directory mounted in the container (/var/lib/rabbitmq) but this is also where Podman stores all the container files; so effectively the container files are mounted into the container.

The kernel warns about processes being stuck; here is one of them:

2023-05-10 15:37:39 in4 kernel[-] warning: [   69.953197] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/compat278498705/lower1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.975395] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/metacopy-check184881131/l1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.980772] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/opaque-bug-check964911957/l2' does not support file handles, falling back to xino=off.
2023-05-10 15:37:44 in4 kernel[-] warning: [   74.073926] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/l/PJRXH6F3G7CISWKDG472AW6PUO' does not support file handles, falling back to xino=off.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.356952] INFO: task kworker/9:0:73 blocked for more than 120 seconds.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.363676]       Not tainted 5.15.0-71-generic #78~20.04.1-Ubuntu
2023-05-10 15:40:35 in4 kernel[-] err: [  245.369880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377744] task:kworker/9:0     state:D stack:    0 pid:   73 ppid:     2 flags:0x00004000
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377748] Workqueue: xfs-conv/dm-1 xfs_end_io [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377841] Call Trace:
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377844]  <TASK>
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377846]  __schedule+0x2cd/0x890
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377850]  ? xfs_buf_find.isra.0+0x331/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377915]  schedule+0x69/0x110
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377917]  schedule_timeout+0x206/0x2d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377919]  ? xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377983]  __down+0x84/0xf0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377985]  down+0x53/0x70
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377987]  xfs_buf_lock+0x32/0xc0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378051]  xfs_buf_find.isra.0+0x34a/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378116]  xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378179]  xfs_buf_read_map+0x53/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378243]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378295]  xfs_trans_read_buf_map+0x130/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378368]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378421]  xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378473]  xfs_alloc_read_agf+0x39/0x1b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378523]  ? xfs_perag_get+0x45/0xa0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378574]  xfs_refcount_finish_one+0xee/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378636]  xfs_refcount_update_finish_item+0x4a/0xb0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378706]  ? xfs_trans_get_cud+0x62/0x70 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378777]  xfs_defer_finish_noroll+0x1f4/0x600 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378835]  __xfs_trans_commit+0x17d/0x3b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378905]  xfs_trans_commit+0x10/0x20 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378974]  xfs_reflink_end_cow_extent+0x230/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379043]  xfs_reflink_end_cow+0x7c/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379112]  xfs_end_ioend+0xfc/0x190 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379185]  xfs_end_io+0xb0/0xe0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379244]  process_one_work+0x228/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379246]  worker_thread+0x4d/0x3f0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379248]  ? process_one_work+0x3d0/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379250]  kthread+0x127/0x150
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379252]  ? set_kthread_struct+0x50/0x50
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379254]  ret_from_fork+0x1f/0x30
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379258]  </TASK>

After this, more and more processes get blocked as they try to access the file system.

We are currently working around it by forcing the use of fuse-overlayfs instead of the native overlay.

The version of Podman we have is fairly old (3.4.2), but Ubuntu doesn't seem to have packaged a newer version for Ubuntu 20.04, so we could not check whether this fix [updating to the latest 4.x version] also works for us.

@Hi-Angel
Author

The version of Podman we have is fairly old (3.4.2) but Ubuntu doesn't seem to have packaged a newer version for Ubuntu 20.04. Therefore we could not try if this fix [updating to the latest 4.x version] also works for us.

Ubuntu always has outdated packages, even on the latest release (excluding perhaps browsers). The only way to get the latest version of whatever software you're interested in is either to use a PPA or to package it yourself.

For Ubuntu there's a PPA mentioned in the Podman installation instructions.

@0xF4CED
Contributor

0xF4CED commented May 20, 2023

I encountered issues with this on low-end hardware in the past, but now it has manifested in a different and more severe manner that showcases the potential impact of the problem.

TL;DR: I managed to bork my SSD with this issue.

I had a rootless Podman container running through a systemd service within a pod using the "--userns nomap" flag. About seven days ago, the "podman-auto-update.service" updated the container. However, something went awry during this process, causing the Podman/container service to enter an endless loop of storage-chown-by-maps operations (see log). Consequently, my SSD suffered a massive data write of ~23TB over the past week.

Repeatedly restarting the service only resulted in the same problematic state. However, when I manually stopped the service and ran the container directly, it successfully completed the storage-chown-by-maps operation, ultimately resolving the issue.

container-in_app.service.log (unfortunately only the last few hours)

systemd units
# pod-invoiceninja.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022

[Unit]
Description=Podman pod-invoiceninja.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/user/1000/containers
Wants=container-in_app.service container-in_db.service container-in_web.service
Before=container-in_app.service container-in_db.service container-in_web.service

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
	-f %t/pod-invoiceninja.pid %t/pod-invoiceninja.pod-id
ExecStartPre=/usr/bin/podman pod create \
	--infra-conmon-pidfile %t/pod-invoiceninja.pid \
	--pod-id-file %t/pod-invoiceninja.pod-id \
	--exit-policy=stop \
	--replace \
	--name invoiceninja \
	--infra-name in-infra \
	--network proxy,invoiceninja \
	--userns nomap
ExecStart=/usr/bin/podman pod start \
	--pod-id-file %t/pod-invoiceninja.pod-id
ExecStop=/usr/bin/podman pod stop \
	--ignore \
	--pod-id-file %t/pod-invoiceninja.pod-id  \
	-t 10
ExecStopPost=/usr/bin/podman pod rm \
	--ignore \
	-f \
	--pod-id-file %t/pod-invoiceninja.pod-id
PIDFile=%t/pod-invoiceninja.pid
Type=forking

[Install]
WantedBy=default.target

# container-in_app.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022

[Unit]
Description=Podman container-in_app.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
TimeoutStopSec=70
ExecStartPre=/bin/rm \
	-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
	--cidfile=%t/%n.ctr-id \
	--cgroups=no-conmon \
	--rm \
	--pod-id-file %t/pod-invoiceninja.pod-id \
	--sdnotify=conmon \
	-d \
	--replace \
	-t \
	--name=in_app \
	--env-file /var/home/puser/podman/invoiceninja/env \
	-v /var/home/puser/podman/invoiceninja/docker/app/public:/var/www/app/public:z,rw \
	-v /var/home/puser/podman/invoiceninja/docker/app/storage:/var/www/app/storage:Z,rw \
	--label io.containers.autoupdate=image docker.io/invoiceninja/invoiceninja:5
ExecStop=/usr/bin/podman stop \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
	-f \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target
# container-in_db.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022

[Unit]
Description=Podman container-in_db.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
	-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
	--cidfile=%t/%n.ctr-id \
	--cgroups=no-conmon \
	--rm \
	--pod-id-file %t/pod-invoiceninja.pod-id \
	--sdnotify=conmon \
	-d \
	--replace \
	-t \
	--name=in_db \
	-v /var/home/puser/podman/invoiceninja/docker/mysql/data:/var/lib/mysql:Z,rw \
	--label io.containers.autoupdate=image docker.io/mariadb:10.4
ExecStop=/usr/bin/podman stop \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
	-f \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target
# container-in_web.service
# autogenerated by Podman 4.3.1
# Sat Dec 10 16:10:57 CET 2022

[Unit]
Description=Podman container-in_web.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers
BindsTo=pod-invoiceninja.service
After=pod-invoiceninja.service

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm \
	-f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
	--cidfile=%t/%n.ctr-id \
	--cgroups=no-conmon \
	--rm \
	--pod-id-file %t/pod-invoiceninja.pod-id \
	--sdnotify=conmon \
	-d \
	--replace \
	-t \
	--name=in_web \
	-v /var/home/puser/podman/invoiceninja/docker/app/public:/var/www/app/public:z,ro \
	--label io.containers.autoupdate=image docker.io/nginx:alpine
ExecStop=/usr/bin/podman stop \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
	-f \
	--ignore -t 10 \
	--cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target
podman info
host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 97.47
    systemPercent: 1.89
    userPercent: 0.64
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: iot
    version: "38"
  eventLogger: journald
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.2.15-300.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 679092224
  memTotal: 8059166720
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.4-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.4
      commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8049127424
  swapTotal: 8058302464
  uptime: 3h 16m 14.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /var/home/puser/.config/containers/storage.conf
  containerStore:
    number: 17
    paused: 0
    running: 8
    stopped: 9
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/puser/.local/share/containers/storage
  graphRootAllocated: 170259980288
  graphRootUsed: 67244826624
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 73
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/puser/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.0
  Built: 1681486942
  BuiltTime: Fri Apr 14 17:42:22 2023
  GitCommit: ""
  GoVersion: go1.20.2
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.0

It might be beneficial to warn users in the documentation that these (userns/idmap) options can be problematic when using the (default) native overlay storage driver.

@arroyoj

arroyoj commented Jul 20, 2023

I came across this issue when I started testing out podman and wanted to use the --userns=keep-id feature. Even with a modest sized image (ubi9), it took several minutes for the initial run. However, while testing out other command line options, I think I may have discovered a workaround, but I'm not familiar enough with podman to know if this will really work the same. I tried replacing --userns=keep-id with what seems to be the equivalent --uidmap/--gidmap/--user options, and then my container started right away. I think this is related to @rhatdan's comment about pulling the image directly into the desired user namespace, except that I couldn't get that to work with userns=keep-id in containers.conf. Anyway, here are some comparisons of using podman run with --userns=keep-id vs with --uidmap/--gidmap/--user. From my very limited testing, the resulting containers seem to work the same way, but the second option is much faster and seems to avoid the chown copying of the image. All of these tests were run using podman version 4.4.1 and using the native overlay driver.

Here was my initial result of running a fresh ubi9 container with --userns=keep-id, and outputting the container's uid/gid maps and user information:

[user123@testing ~]$ podman system reset
WARNING! This will remove:
        - all containers
        - all pods
        - all images
        - all networks
        - all build cache
        - all machines
        - all volumes
Are you sure you want to continue? [y/N] y
[user123@testing ~]$ time podman run --userns keep-id ubi9 /bin/bash -c "cat /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
Resolved "ubi9" as an alias (/etc/containers/registries.conf.d/001-rhel-shortnames.conf)
Trying to pull registry.access.redhat.com/ubi9:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 7b3dd25bf011 done  
Copying config 663a35613b done  
Writing manifest to image destination
Storing signatures
         0          1       1000
      1000          0          1
      1001       1001      64536
         0          1       1000
      1000          0          1
      1001       1001      64536
uid=1000(user123) gid=1000(user123) groups=1000(user123)
systemd-oom:x:995:995:systemd Userspace OOM Killer:/:/usr/sbin/nologin
user123:*:1000:1000:User:/:/bin/sh

real	3m35.660s
user	0m3.961s
sys	0m5.205s

Running with --userns=keep-id, the first run took ~3.5 minutes, and it also created two full copies of the image: one owned by the user who ran the command and a second copy owned by the container's root user (plus an additional overlay):

[user123@testing ~]$ ls -l ~/.local/share/containers/storage/overlay
total 0
drwx------. 6 user123 user123  69 Jul 19 22:13 027bc8fdccbd38573cd9926b34078f040106011813eed2dbde06ca238275475e
drwx------. 4  100000  100000  55 Jul 19 22:16 2ea334e9348196693c8e257e7e3ca99170a7f912fc341d41b4a73f45c7744832
drwx------. 5  100000  100000  68 Jul 19 22:16 b5c6227bb9c10116dd1a2755cb68b6b9ab3ca5d333b3b33fecb4061d13344fde
drwxr-xr-x. 2 user123 user123 108 Jul 19 22:16 l

For comparison, here is the result of using a combination of --uidmap, --gidmap, and --user to create the same final user namespace and container user:

[user123@testing ~]$ podman system reset
WARNING! This will remove:
        - all containers
        - all pods
        - all images
        - all networks
        - all build cache
        - all machines
        - all volumes
Are you sure you want to continue? [y/N] y
[user123@testing ~]$ time podman run --uidmap 0:1:1000 --uidmap 1000:0:1 --uidmap 1001:1001:64536 --gidmap 0:1:1000 --gidmap 1000:0:1 --gidmap 1001:1001:64536 --user 1000:1000 ubi9 /bin/bash -c "cat /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
Resolved "ubi9" as an alias (/etc/containers/registries.conf.d/001-rhel-shortnames.conf)
Trying to pull registry.access.redhat.com/ubi9:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 7b3dd25bf011 done  
Copying config 663a35613b done  
Writing manifest to image destination
Storing signatures
         0          1       1000
      1000          0          1
      1001       1001      64536
         0          1       1000
      1000          0          1
      1001       1001      64536
uid=1000(user123) gid=1000(1000) groups=1000(1000)
systemd-oom:x:995:995:systemd Userspace OOM Killer:/:/usr/sbin/nologin
user123:*:1000:1000:User:/:/bin/sh

real	0m14.086s
user	0m3.704s
sys	0m0.953s

Now, the first run only took 14 seconds, and the container appears to be the same from the user namespace perspective. One difference is that this approach does not copy in the host user's group name, but that seems to be doable with the --group-entry option that is present in a more recent version of podman. Also, looking at the overlay directory, it appears the ubi9 image (027...) was pulled directly into the desired user namespace, without a second copy or a long chown process:

[user123@testing ~]$ ls -l ~/.local/share/containers/storage/overlay
total 0
drwx------. 6  100000  100000 68 Jul 19 22:22 027bc8fdccbd38573cd9926b34078f040106011813eed2dbde06ca238275475e
drwx------. 4  100000  100000 55 Jul 19 22:22 c27dfa4742ab61b7830f4887ab41dd35c52ceebfa57e0f8d2a576356997e4e16
drwxr-xr-x. 2 user123 user123 74 Jul 19 22:22 l

After having run the image once with the "long" user namespace setup, I could now use the --userns=keep-id option on a subsequent run and get a fast start up time with no additional image copying (this is continued from above with no system reset):

[user123@testing ~]$ time podman run --userns keep-id ubi9 /bin/bash -c "cat /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
         0          1       1000
      1000          0          1
      1001       1001      64536
         0          1       1000
      1000          0          1
      1001       1001      64536
uid=1000(user123) gid=1000(user123) groups=1000(user123)
systemd-oom:x:995:995:systemd Userspace OOM Killer:/:/usr/sbin/nologin
user123:*:1000:1000:User:/:/bin/sh

real	0m0.755s
user	0m0.150s
sys	0m0.059s

Now the startup time was under 1 second and only one additional overlay was created, and still no overlay image owned by the host user:

[user123@testing ~]$ ls -l ~/.local/share/containers/storage/overlay
total 0
drwx------. 6  100000  100000  68 Jul 19 22:22 027bc8fdccbd38573cd9926b34078f040106011813eed2dbde06ca238275475e
drwx------. 4  100000  100000  55 Jul 19 22:22 65f95377e4d29859428b68680d3383731fcc4b7b987ab6624a9c400cc492fc41
drwx------. 4  100000  100000  55 Jul 19 22:22 c27dfa4742ab61b7830f4887ab41dd35c52ceebfa57e0f8d2a576356997e4e16
drwxr-xr-x. 2 user123 user123 108 Jul 19 22:22 l

However, the slowdown does reappear if creating a new container with the default user namespace, which leads to the long chown copying, so there is still an issue if you need the same image with two different user namespaces:

[user123@testing ~]$ time podman run ubi9 /bin/bash -c "cat /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
         0       1000          1
         1     100000      65536
         0       1000          1
         1     100000      65536
uid=0(root) gid=0(root) groups=0(root)
tss:x:59:59:Account used for TPM access:/dev/null:/sbin/nologin
systemd-oom:x:995:995:systemd Userspace OOM Killer:/:/usr/sbin/nologin

real	2m43.235s
user	0m0.836s
sys	0m4.506s

I would be interested to know whether this is a viable workaround for --userns=keep-id. I haven't done much testing, but the two approaches seem to produce the same containers (except for the group name missing for now). More importantly, if this does work as desired, would it be possible to implement --userns=keep-id to use the same steps that occur with the --uidmap/--gidmap/--user version, so that the initial pull is done into the desired namespace and the extra copy/chown isn't needed?

@martinpitt
Contributor

Linux 5.12 introduced id mapped mounts. If that is available, that could also avoid the costly chown? The latter would have to be kept as a fallback for a few years for older kernels.

@rhatdan
Member

rhatdan commented Jul 22, 2023

Idmapped file systems are not allowed in rootless mode yet. Podman supports them in rootful mode.
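As a rough illustration of the rootful support (not a fix for the rootless case discussed in this issue), podman-run(1) documents an idmap volume option that creates an idmapped mount into the container's user namespace instead of chowning files; the mapping and paths below are examples only:

    sudo podman run --rm --uidmap 0:100000:65536 --gidmap 0:100000:65536 \
        -v /srv/data:/data:idmap alpine ls -ln /data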

@mhoran

mhoran commented Oct 24, 2023

I ran into this issue when trying to set up a ZFS ZVOL specifically for Podman to get better performance than with fuse-overlayfs. Since I can easily swap filesystems and test mount options, I was able to try out a few different configurations. Unfortunately the version of ZFS I am using at the moment cannot be used with overlayfs, but as of ZFS 2.2 which hopefully will be backported to my Linux distribution soon, overlayfs is supported -- so I'll re-run my tests then.

With an ext4 volume for Podman storage, I saw the most abysmal performance with default options. My tests were with the ghcr.io/home-assistant/home-assistant:stable image, a 1.93 GB image.

With default options, a pre-pulled home-assistant image took an additional 314 seconds to start up. During this time, all podman commands I ran on my machine hung. The load of a 4-core, 8-thread system rose to above 22. The system was mostly responsive for the duration of the copy, but Podman was not.

I tested disabling the ext4 journal as well as other options. The only option I found that made any meaningful difference was adding nobarrier to the mount options. In my case, this reduced initial container start time from 314 seconds to 66 seconds. Note that this option is probably not safe in most circumstances, though if you have a dedicated filesystem for container image storage, it's probably okay.

Then I moved on to testing btrfs. btrfs performed the best. Starting the Home Assistant container with a pre-pulled image on btrfs took 11 seconds (and 13 seconds on a subsequent test.) btrfs also had the best deduplication performance. df showed no difference in disk usage from pre-pulling the image to after the container was started.

Finally, I tested XFS. XFS performance was in between ext4 and btrfs. I did not see any deadlock like that reported by other users. But I'm not entirely sure XFS was using reflinks with Podman. My first test with XFS started a pre-pulled Home Assistant container in 18 seconds, but a subsequent test took nearly twice as long. Also, the df output did not seem to indicate any deduplication. Yet in a separate test, creating a 1 GB file, cp --reflink=always to another name, and a chown of the new file did not show any increase in disk usage -- so I'm not sure what happened here. Perhaps an issue with overlayfs or Podman in my configuration.
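For anyone wanting to repeat that check, a rough sketch of the reflink test described above (file names and the target owner are placeholders):

    dd if=/dev/urandom of=big.img bs=1M count=1024   # create a 1 GiB test file
    df -h .                                          # note usage
    cp --reflink=always big.img big.copy             # fails outright if reflinks are unsupported
    sudo chown 100000:100000 big.copy                # chown the copy, roughly what storage-chown-by-maps does per file
    df -h .                                          # usage should stay flat if the reflink survives the chown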

I'm hopeful that ZFS 2.2 will perform similarly to btrfs. But it's possible that I'll see similar issues to that of XFS, given the interaction with overlayfs.

I understand that this is a complicated problem, and that fuse-overlayfs can be used as a workaround. But the lockup of podman is a bit of an issue, and it'd be nice if there were some way to speed up the process when it is known ahead of time that a chown of the image will be needed. I've found fuse-overlayfs performance to be pretty poor, especially on larger images (like Home Assistant).

It sounds like rootless idmapped file systems are also a potential solution -- is there somewhere to follow along on that progress? Would that be a kernel change, or a Podman change?

@giuseppe
Member

@mhoran I guess you are running rootless? In that case, idmapped mounts won't help, since the kernel doesn't support them there. At the moment, with rootless, the only way to have a dynamic ID mapping at runtime is to use fuse-overlayfs for the storage. It is a FUSE file system, so it is inherently slower than native overlay, but it works quite well if you use the image only for reading files and use volumes for I/O-intensive operations.

Idmapped mounts should work fine for root; note that support is file-system dependent, but BTRFS, XFS and EXT4 should work fine with it. I am not sure about ZFS.

@mvmn

mvmn commented Jan 31, 2024

How do I even pass something like --userns=keep-id to podman compose? It doesn't recognize this flag at all.

@rhatdan
Member

rhatdan commented Jan 31, 2024

podman compose just executes docker-compose, which communicates with the podman service, so the change would have to be made on that side.

If you want keep-id as the default, you can change your containers.conf to set it as the default.
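A minimal sketch of that containers.conf change (user-level file shown; the same key also works in /etc/containers/containers.conf):

    # ~/.config/containers/containers.conf
    [containers]
    userns = "keep-id"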

@RyanGreenup

RyanGreenup commented Feb 8, 2024

I just ran into this problem as well. I can confirm that native overlay causes HUGE impacts to container build time when paired with --userns=keep-id on a ZFS system; this doesn't occur on btrfs (although it's still a bit slower, at 10 vs 1 s). I can't comment on xfs or ext4.

Here's what we tried and found:


Problem

We had two Gentoo systems on ZFS with the same OS (from the same tarball), but one ran into this issue (A) and the other didn't (B). We determined that (A) was using native overlay while (B) was using fuse-overlay; B had pre-existing containers from 2024-09-23. We compared with a third machine on Suse Aeon (btrfs).

If I'm not mistaken, fuse overlay and native overlay correspond to the Native Overlay Diff value, which can be checked with:

podman info --debug | grep Na
  • fuse overlay

    graphDriverName: overlay
      Native Overlay Diff: "false"
    
  • Native overlay

    graphDriverName: overlay
      Native Overlay Diff: "true"
    

What we Found

We experimented with the following command:

podman pull debian
podman image save -o /tmp/debian.tar debian

podman system reset
podman load -i /tmp/debian.tar
time podman run --rm --userns keep-id -v /home/:/home/ debian date

We found the time to create that debian container varied a lot:

ID  Driver  Time (s)  FS     OS
A   Native  17        ZFS    Gentoo
B   Fuse    0.2       ZFS    Gentoo
B   Native  30        ZFS    Gentoo
C   Native  0.8       btrfs  Suse Aeon

When I ran podman system reset on Machine B, the Native Overlay Diff switched to true. After that, (B) had the exact same problem (~30 s). I reverted back, which fixed it, like so:

conf=$HOME/.config/containers/storage.conf
mv  $conf $conf.bak
cat <<EOF >> $conf
[storage]
driver = "overlay"

[storage.options]
mount_program = "/usr/bin/fuse-overlayfs"
EOF

We repeated with some larger images and roughly observed:

ID  Image Size  Time (min)  FS     Overlay  OS
A   3 GB        6           ZFS    Native   Gentoo
A   7 GB        25          ZFS    Native   Gentoo
B   3 GB        0.7         ZFS    Fuse     Gentoo
C   7 GB        10          btrfs  Fuse     Suse

The time taken to load the images wasn't impacted by the FS or by native overlay.

Alternative User Mapping

I should note that the advice given above by @arroyoj 1 didn't work on my end; it still took 17 s to load Debian.

time podman run --uidmap 0:1:1000 --uidmap 1000:1000:64536 --gidmap 0:1:1000 --gidmap 1000:1000:64536 --user 1000:1000 debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
time podman run --uidmap 0:1:1000 --uidmap 1000:1000:64536 --gidmap 0:1:1000 --gidmap 1000:1000:64536 --user 1000:1000 debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
Error: creating container storage: the specified UID and/or GID mapping 1000:1000:64536 conflicts with other mappings: invalid mappings specified

________________________________________________________
Executed in   16.10 secs    fish           external
   usr time    0.11 secs  215.00 micros    0.11 secs
   sys time    1.54 secs  187.00 micros    1.54 secs

Had it worked, we could have created containers like that and cloned them into distrobox.

We finally tried without any ID maps at all; this, as expected, worked:

time podman run debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
shai@gentoo ~/A/C/c/base (master)> time podman run debian /bin/bash -c "cat /proc/self/uid_map /proc/self/uid_map /proc/self/gid_map; id; tail -n 2 /etc/passwd"
         0       1000          1
         1     165536      65536
     65537 1001000000    1000000
         0       1000          1
         1     165536      65536
     65537 1001000000    1000000
         0       1000          1
         1     165536      65536
     65537 1001000000    1000000
uid=0(root) gid=0(root) groups=0(root)
_apt:x:42:65534::/nonexistent:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin

________________________________________________________
Executed in  111.87 millis    fish           external
   usr time   32.39 millis  247.00 micros   32.15 millis
   sys time   17.06 millis  146.00 micros   16.91 millis

Conclusion

Using --userns=keep-id with native overlay does not work on ZFS, but it's fine on btrfs. So, if you want to use distrobox with native overlay, you might as well use btrfs, which is what we did.

See Also

Footnotes

  1. https://github.com/containers/podman/issues/16541#issuecomment-1643045592

@Hi-Angel
Author

Hi-Angel commented Feb 8, 2024

@RyanGreenup is machine C (btrfs) different in hardware from machine B? I'm asking because, if it is, it's hard to judge how much of a difference there is between native overlay and FUSE when BTRFS is involved.

Using --userns=keep-id with Native Overlay does not work on ZFS but it's fine on btrfs.

Barring my previous comment, I'd presume the reason for the difference in performance may be that BTRFS supports reflinks, whereas ZFS, as of today and to my knowledge, still doesn't.

@RyanGreenup

RyanGreenup commented Feb 8, 2024

Yes, B and C are different hardware, so these numbers aren’t going to be sharp enough to split hairs.

The lack of reflink support in ZFS is unfortunate indeed; as a result, I suspect it will be a while before we can use native overlay with userns on ZFS.


Edit: I should clarify though,

  • Machine B (Laptop)

    • i5 11th gen
    • 970 EVO m.2 nvme
      • ZFS
  • Machine C (Desktop)

    • Ryzen 5600x
    • 970 EVO m.2 nvme
      • btrfs

I would have expected Machine C to have about equal performance in terms of hardware; the difference between 1 and 10 s is arguably significant.

@Rudd-O

Rudd-O commented Jun 18, 2024

Same issue. A chown of a single file takes seconds (verified using strace). Yes, there is no reflink support in this ZFS version, probably causing enormous amounts of data copying when chowning a file.

EDIT to add command line that causes the issue:

[ollama@roxanne ~]$ podman run --replace -it --userns=keep-id -p 3000:8080 --add-host=host.docker.internal:127.0.0.1 -v /tmp/shit:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

@Rudd-O

Rudd-O commented Jun 18, 2024

I enabled block cloning on ZFS. It seems to have solved the issue.

@rhatdan
Member

rhatdan commented Jun 24, 2024

Can we close this issue?

@Ramblurr

Ramblurr commented Jun 24, 2024

Long-time watcher (and victim) of this issue here, chiming in: this issue always bites me when running podman on low-end hardware like the Raspberry Pi 3 and 4 (with ext4 fs, FWIW).

Idmapped file systems are not allowed in rootless mode yet. Podman supports them in rootful mode.
(comment link)

I subscribed to this issue when @rhatdan made that comment last year. If I understand the various threads, supporting id mapped file systems in rootless mode would solve/mitigate this issue, right?

Is there an issue I can subscribe to that will indicate when that feature is available? If so, I'm all for closing this issue because it's grown quite gnarly with lots of input and no clear direction.

Edit: also this comment succinctly summarized the issue from my POV #16541 (comment)

@Hi-Angel
Author

Can we close this issue?

Have there been changes that should fix it?

@rhatdan
Member

rhatdan commented Jun 24, 2024

No idea; I saw someone mention a potential workaround/fix.

@Hi-Angel
Author

No idea; I saw someone mention a potential workaround/fix.

I presume you meant the comment about block cloning on ZFS? "Block cloning", I presume, means reflinks, and in my older steps-to-reproduce I used XFS, which supports them. So it's likely the comment author just has a fast SSD/small image, but the problem is still there.

@Romain-Geissler-1A
Contributor

Romain-Geissler-1A commented Jun 24, 2024

@rhatdan Last week in our meeting, when speaking specifically about idmap in rootless mode, you told me it will never be implemented, as it would allow a user to write files on disk as root, and obviously that would be a huge security issue. Thinking a bit more since then: do we really have no way in the kernel to restrict rootless idmapped mounts to only the mappings which "make sense" from a security point of view? I mean, each time I start a container it already creates a uid_map, as a rootless user. Playing with the --userns flag I can create a different user ID mapping. So if I can "show" the kernel that my user has a running process with a given uid_map, why wouldn't it be OK to create another idmapped mount using that specific uid_map? Maybe I'm not using the right words, sorry for that; my knowledge in this area is not huge.

@rhatdan
Member

rhatdan commented Jun 24, 2024

@giuseppe what do you think of the chances of the kernel allowing idmapped file mappings only for UIDs defined within the user namespace?

@giuseppe
Member

Idmapped mounts inside a user namespace are restricted to file systems whose superblock is owned by the user namespace itself (like a tmpfs, or a FUSE file system, created inside the user namespace). My impression is that it is unlikely that the restriction will be lifted to allow idmapped mounts for other file systems.
