gitlab container volume frequently blocks io (hangs) until a mounted volume is accessed from within the container. #5498

Closed
NHRedAnt opened this issue Mar 14, 2020 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless stale-issue

Comments

@NHRedAnt

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

During create (or run) of a complex container like "gitlab", multiple hang (IO block?) events occur when running on CentOS 8.1.1911. The container does not unblock when left untouched for 12 hours. However, it unblocks immediately when the "ls /opt/gitlab" command is run from inside the container.

This behavior exists in the official podman 1.6.4 available on CentOS 8.1.
The same behavior occurs under CentOS 8.1 when updated to the most recent podman 1.8.1.

However we experience no such behavior when testing on Fedora 30 or Fedora 31 using their default podman 1.8.0.

The CentOS platforms had SELinux enabled and enforcing.
The Fedora 30 system had SELinux disabled. The Fedora 31 system had SELinux enabled.

We assume this is related to the major kernel differences, rather than the unlikely case of a fix before 1.8.0 followed by a reversion in 1.8.1.

CentOS 8.1 kernel: 4.18.0-147.5.1.el8_1.x86_64
Fedora30 kernel: 5.5.7-100.fc30.x86_64

Steps to reproduce the issue:

DAT=$HOME/vols/gitlab/data

mkdir -p $DAT/config $DAT/data $DAT/logs

IMG=gitlab/gitlab-ce:12.8.6-ce.0

podman run \
       --name gitlab \
       --hostname gitlab.example.com \
       --restart always \
       --detach \
       --publish 8443:443 \
       --publish 8480:80 \
       --volume $DAT/config:/etc/gitlab:Z  \
       --volume $DAT/logs:/var/log/gitlab:Z \
       --volume $DAT/data:/var/opt/gitlab:Z \
       --ulimit nofile=10240:10240 \
       $IMG

Note that the --ulimit option isn't needed to show the behavior, but it is needed to actually complete the build. Also note that it requires the host system limit to be set sufficiently high before execution.
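
The host-limit requirement can be checked before launching the container. The following is a minimal sketch, assuming the 10240 figure from the --ulimit flag above is the value that matters. The function name nofile_ok is hypothetical; `ulimit -Hn` is the shell builtin that reports the hard limit on open files.

```shell
# Hypothetical helper: succeed if a hard nofile limit satisfies a requirement.
# $1 = required number of descriptors
# $2 = hard limit as printed by `ulimit -Hn` (a number or "unlimited")
nofile_ok() {
    required=$1
    hard=$2
    # "unlimited" always satisfies the requirement.
    [ "$hard" = "unlimited" ] && return 0
    [ "$hard" -ge "$required" ]
}

# Compare the current shell's hard limit with the value passed to --ulimit.
if nofile_ok 10240 "$(ulimit -Hn)"; then
    echo "host nofile hard limit is sufficient"
else
    echo "host nofile hard limit is below 10240; raise it before podman run"
fi
```

How the host limit is raised (for example through limits.conf or a systemd setting) depends on the system and is not covered by this sketch.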

Describe the results you received:

Container stops building after a dozen seconds of logged activity. The following command will allow further building, but will hang again within a few minutes. The same "ls /opt/gitlab" always releases the block.

podman exec gitlab ls /opt/gitlab

The build completes normally if given sufficient volume access, for example by using:

while true; do sleep 2; podman exec gitlab ls /opt/gitlab; done

After a successful build, "podman start gitlab" shows the same blocking behavior (and responds to the same workaround), though less frequently than during the build. Aside from the IO block, the container appears to perform normally.
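
The endless loop above can be bounded so that it stops on its own once the build is done or the container has gone away. This is only a sketch of the workaround, not a fix. The function name poke_container and the PODMAN override variable are hypothetical; the `podman exec gitlab ls /opt/gitlab` call is the one from this report.

```shell
# Hypothetical helper: run the unblocking command a fixed number of times.
# $1 = container name, $2 = number of pokes, $3 = seconds between pokes
# PODMAN may be set to another command for testing; it defaults to podman.
poke_container() {
    name=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early if the exec fails, e.g. because the container exited.
        "${PODMAN:-podman}" exec "$name" ls /opt/gitlab >/dev/null || return 1
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: poke every 2 seconds, 150 times (roughly 5 minutes):
#   poke_container gitlab 150 2
```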

Describe the results you expected:

Normal processing for a few minutes until container instance provides a working gitlab.

Additional information you deem important (e.g. issue happens only occasionally):

Exploring in a shell, or running commands that access the mounted volume, also appears to release the block.

Fails similarly for all gitlab versions tested:
gitlab/gitlab-ce:12.8.6-ce.0
gitlab/gitlab-ce:12.8.5-ce.0
gitlab/gitlab-ce:12.6.8-ce.0

Output of podman version:

[rea@broken ~]$ podman --version
podman version 1.8.1

[rea@ipod ~]$ podman --version
podman version 1.6.4

[rea@q rcc_podman]$ podman --version
podman version 1.8.0

Output of podman info --debug:

[rea@broken ~]$ podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.12.12
  podman version: 1.8.1
host:
  BuildahVersion: 1.14.2
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.11-1.1.el8.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.11, commit: 196be609e5496079d86682c737f9e0af764d4df8'
  Distribution:
    distribution: '"centos"'
    version: "8"
  IDMappings:
    gidmap:
    - container_id: 0
      host_id: 10
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 3489
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  MemFree: 2843439104
  MemTotal: 4131905536
  OCIRuntime:
    name: runc
    package: runc-1.0.0-15.1.el8.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc10
      commit: 2e01c9e4dfdab4b4c993a3698059fb824cd3286a
      spec: 1.0.1-dev
  SwapFree: 4269797376
  SwapTotal: 4269797376
  arch: amd64
  cpus: 4
  eventlogger: journald
  hostname: broken.sr.unh.edu
  kernel: 4.18.0-147.5.1.el8_1.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.3-23.1.el8.x86_64
    Version: |-
      slirp4netns version 0.4.3-beta.1
      commit: b04291ba84ca35ccc60bd009372a28f9ea7ef841
  uptime: 9h 30m 52.71s (Approximately 0.38 days)
registries:
  search:
  - registry.fedoraproject.org
  - quay.io
  - docker.io
store:
  ConfigFile: /home/rea/.config/containers/storage.conf
  ContainerStore:
    number: 0
  GraphDriverName: overlay
  GraphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.6-6.2.el8.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 0.7.6
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  GraphRoot: /home/rea/.local/share/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 0
  RunRoot: /run/user/3489/containers
  VolumePath: /home/rea/.local/share/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

[rea@broken ~]$ rpm -q podman
podman-1.8.1-2.1.el8.x86_64

Additional environment details (AWS, VirtualBox, physical, etc.):

The CentOS 8.1 podman 1.8.1 information given for "broken" is from a VirtualBox VM on a Mac, as is the Fedora 31 system that worked using podman 1.8.0.
Two separate CentOS 8.1 podman 1.6.4 physical Dell rack servers fail similarly.
The working Fedora 30 podman 1.8.0 is a physical Dell desktop system.

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 14, 2020
@mheon
Member

mheon commented Mar 14, 2020

Are you running as root, or rootless? If rootless, fuse-overlayfs sounds like a likely culprit.

I somewhat doubt this is related to volumes. /opt/gitlab is not a volume in the command you gave; you're just poking the container's root filesystem. Exec also does many other things (modifies cgroups for the joining process, for example) so it could be affecting system state in other ways.
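
The point about /opt/gitlab not being a volume can be checked mechanically against the --volume destinations from the reproduction command. The following is a minimal sketch; the function name under_volume and the example path /var/log/gitlab/example are hypothetical.

```shell
# Hypothetical helper: succeed if a container path lies inside one of the
# given mount destinations.
# $1 = path to test, remaining arguments = mount destinations
under_volume() {
    path=$1
    shift
    for dest in "$@"; do
        case "$path" in
            # Match the destination itself or anything below it.
            "$dest"|"$dest"/*) return 0 ;;
        esac
    done
    return 1
}

# Destinations taken from the --volume flags in the reproduction steps.
for p in /opt/gitlab /var/opt/gitlab /var/log/gitlab/example; do
    if under_volume "$p" /etc/gitlab /var/log/gitlab /var/opt/gitlab; then
        echo "$p: inside a mounted volume"
    else
        echo "$p: container root filesystem"
    fi
done
```

Under this check /opt/gitlab falls on the container root filesystem, which for a rootless container in this report is served by fuse-overlayfs.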

@NHRedAnt
Author

All deployments listed above are run as normal non-root users. Our main reason to use podman is for better security.
I will run a new test as root and report back.

@NHRedAnt
Author

Root podman does work on CentOS 8.1 podman 1.6.4. So the issue does appear to affect only "rootless" instances.

A separate podman exec for each unblock is not necessary. The unblocking can also come from a single "podman exec -ti gitlab bash" session in which "ls /opt/gitlab" is run when needed.

Also noted that other direct volume "ls" commands, or listings of other directories, do not appear to unblock. Not even "ls -ld /opt/gitlab" or "ls -ld /opt/gitlab/bin" does. Any subdirectory listing has no effect, but "ls -ld /opt/gitlab/*" does unblock.

This hang may always be related to the starting of a new service by the "runsvdir" process (noted below).
One other way we have found to unblock is to run a second initialization daemon like this:
"/opt/gitlab/embedded/bin/runsvdir-start &"

@mheon
Member

mheon commented Mar 16, 2020

My suspicion is definitely fuse-overlayfs now. It sounds like it's blocking on something, and then unblocking when another operation comes in.

@giuseppe PTAL

@mheon mheon added the rootless label Mar 16, 2020
@NHRedAnt
Author

Well here's another useful update.
Appears to work correctly on CentOS 8.1 podman 1.6.4 with updated kernel 4.18.0-177.el8.x86_64.

dnf install centos-release-stream
dnf --refresh update
reboot

Works as expected with the new kernel.

broken> cat /proc/version
Linux version 4.18.0-177.el8.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Feb 12 19:59:38 UTC 2020

This seems like the nicest solution for us at the moment.

@mheon
Member

mheon commented Mar 16, 2020 via email

@NHRedAnt
Author

This is from one of the CentOS 8.1 rack servers. Unfortunately, the preview upgrade of this system to centos-release-stream had to be reversed (because it requires kmod-mptsas for its older SAS controller). This is the information from that system after reverting.

Once we solve the SAS issue on this server, or once I'm back at work with access to the Mac VM, I can check whether this information is the same after the "centos-release-stream" upgrade.

[rea@ipod ~]$ podman info  
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module_el8.1.0+272+3e64ee36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 7a4f0dd7b20a3d4bf9ef3e5cbfac05606b08eac0'
  Distribution:
    distribution: '"centos"'
    version: "8"
  IDMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 3489
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
  MemFree: 5714972672
  MemTotal: 12571811840
  OCIRuntime:
    name: runc
    package: runc-1.0.0-64.rc9.module_el8.1.0+272+3e64ee36.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 8585736192
  SwapTotal: 8585736192
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: ipod.sr.unh.edu
  kernel: 4.18.0-147.5.1.el8_1.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.2-2.git21fdece.module_el8.1.0+272+3e64ee36.x86_64
    Version: |-
      slirp4netns version 0.4.2+dev
      commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
  uptime: 3h 8m 16.14s (Approximately 0.12 days)
registries:
  blocked: null
  insecure: null
  search:
  - registry.access.redhat.com
  - registry.fedoraproject.org
  - registry.centos.org
  - docker.io
store:
  ConfigFile: /home/rea/.config/containers/storage.conf
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.2-1.module_el8.1.0+272+3e64ee36.x86_64
      Version: |-
        fuse-overlayfs: version 0.7.2
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  GraphRoot: /home/rea/.local/share/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 1
  RunRoot: /run/user/3489
  VolumePath: /home/rea/.local/share/containers/storage/volumes

@mheon
Member

mheon commented Mar 16, 2020

The first podman info has fuse-overlayfs 0.7.6, the new one has fuse-overlayfs 0.7.2.

I'd be interested in a test of the new, 1.8.0 Podman with the old 0.7.2 fuse-overlayfs - I suspect that combination will work fine
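
Comparing hosts by eye across long "podman info" dumps is error-prone, so the relevant fields can be pulled out with a small filter. This is a minimal sketch that assumes the field names shown in the dumps in this thread (kernel:, rootless:, and the Package: line under overlay.mount_program); other podman versions may format these differently. The function name info_summary is hypothetical.

```shell
# Hypothetical helper: reduce `podman info` text on stdin to the fields
# compared in this thread.
info_summary() {
    awk '
        $1 == "kernel:"   { print "kernel=" $2 }
        $1 == "rootless:" { print "rootless=" $2 }
        $1 == "Package:" && $2 ~ /^fuse-overlayfs-/ { print "fuse-overlayfs=" $2 }
    '
}

# Sample lines copied from the "broken" host output earlier in this issue.
info_summary <<'EOF'
  kernel: 4.18.0-147.5.1.el8_1.x86_64
  rootless: true
    Package: slirp4netns-0.4.3-23.1.el8.x86_64
      Package: fuse-overlayfs-0.7.6-6.2.el8.x86_64
EOF
```

On a live system the same filter could be fed with `podman info | info_summary` and the results from two hosts compared with diff.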

@giuseppe
Member

I've tried locally on Fedora with different versions of podman and fuse-overlayfs and I am not able to reproduce the hang. It could be the kernel.

Can you strace the container process when the hang happens?
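
One way to gather the requested trace is sketched below, under stated assumptions: the container is named gitlab as in the reproduction steps, strace is installed on the host, and the rootless user is permitted to attach to its own container processes. The function name proc_state and the output path /tmp/gitlab.strace are hypothetical. The helper only reads the process state from /proc; the podman and strace invocations are shown as comments because they need the hung container to be present.

```shell
# Hypothetical helper: print the state letter of a process from /proc
# (for example S = sleeping, D = uninterruptible wait, often on IO).
# $1 = process ID on the host
proc_state() {
    awk '$1 == "State:" { print $2 }' "/proc/$1/status"
}

# Demonstrate the helper on the current shell.
echo "state of this shell: $(proc_state $$)"

# For the hung container, the host PID of its main process could be taken from
#   pid=$(podman inspect --format '{{.State.Pid}}' gitlab)
# its state checked with
#   proc_state "$pid"
# and the trace, following child processes, captured with
#   strace -f -p "$pid" -o /tmp/gitlab.strace
```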

@NHRedAnt
Author

> I've tried locally on Fedora with different versions of podman and fuse-overlayfs and I am not able to reproduce the hang. It could be the kernel.
>
> Can you strace the container process when the hang happens?

We saw no problems in Fedora 30 or 31 with and without selinux.

The problem occurs only on CentOS 8.1 with podman 1.6.4/1.8.0/1.8.1 running the rootless gitlab image (the three most recent versions, up to gitlab/gitlab-ce:12.8.6-ce.0) with the default 8.1 kernel 4.18.0-147.5.1.el8_1.x86_64.

Newer kernel (4.18.0-177.el8.x86_64) or root podman works.
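
A host can be checked against the working kernel with a version-aware comparison. This is a minimal sketch that assumes GNU `sort -V` is available. The threshold 4.18.0-177 is only the one kernel reported as working in this thread, not a confirmed first fixed version, and the function name kernel_at_least is hypothetical.

```shell
# Hypothetical helper: succeed if kernel release $1 sorts at or above $2.
# Relies on GNU sort's version ordering (sort -V).
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    # If the minimum sorts first (or both are equal), $1 is new enough.
    [ "$lowest" = "$2" ]
}

# Compare the two kernels named in this report with the assumed threshold.
for k in 4.18.0-147.5.1.el8_1.x86_64 4.18.0-177.el8.x86_64; do
    if kernel_at_least "$k" 4.18.0-177; then
        echo "$k: at or above the kernel reported as working"
    else
        echo "$k: older than the kernel reported as working"
    fi
done
```

On a live host the first argument could be `"$(uname -r)"`.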

@NHRedAnt
Author

Here is the requested "podman info" from a working, kernel-upgraded (via centos-release-stream) CentOS 8.1 system with podman 1.6.4:

[rea@ipod ~]$ podman info
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module_el8.1.0+272+3e64ee36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 7a4f0dd7b20a3d4bf9ef3e5cbfac05606b08eac0'
  Distribution:
    distribution: '"centos"'
    version: "8"
  IDMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 3489
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
  MemFree: 3985772544
  MemTotal: 12569260032
  OCIRuntime:
    name: runc
    package: runc-1.0.0-64.rc9.module_el8.1.0+272+3e64ee36.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 8585736192
  SwapTotal: 8585736192
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: ipod.sr.unh.edu
  kernel: 4.18.0-177.el8.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.2-2.git21fdece.module_el8.1.0+272+3e64ee36.x86_64
    Version: |-
      slirp4netns version 0.4.2+dev
      commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
  uptime: 2h 26m 8.21s (Approximately 0.08 days)
registries:
  blocked: null
  insecure: null
  search:
  - registry.access.redhat.com
  - registry.fedoraproject.org
  - registry.centos.org
  - docker.io
store:
  ConfigFile: /home/rea/.config/containers/storage.conf
  ContainerStore:
    number: 1
  GraphDriverName: overlay
  GraphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.2-1.module_el8.1.0+272+3e64ee36.x86_64
      Version: |-
        fuse-overlayfs: version 0.7.2
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  GraphRoot: /home/rea/.local/share/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 1
  RunRoot: /run/user/3489
  VolumePath: /home/rea/.local/share/containers/storage/volumes

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Apr 17, 2020

Since this works on a newer kernel, I take it we can close this bug? Reopen if I am mistaken.

@rhatdan rhatdan closed this as completed Apr 17, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023