Podman socket performance issues #14941

Closed
jdoss opened this issue Jul 14, 2022 · 10 comments
Labels: kind/performance, locked - please file new issue/PR, stale-issue

Comments

jdoss (Contributor) commented Jul 14, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

This is kind of a cross-post issue to see if there is anything that can be done to improve the performance of the Podman socket under high concurrency.

I opened hashicorp/nomad-driver-podman#175 on the Nomad Podman driver project to see if we can track down why Podman on my Nomad client nodes becomes overwhelmed and unresponsive under high concurrency. This seems to be a common issue for other users of the Nomad Podman driver.

Is there anything that can be done to help improve the performance of the Podman socket? Are there any tips from the Podman team on how to better debug this issue to get more information?
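For extra detail, one option (a sketch, assuming the systemd-managed podman.socket/podman.service units behind /run/podman/podman.sock on Fedora CoreOS) is to follow the API service journal while the load is applied, or to stop the socket-activated units and run the service in the foreground at debug log level:

# Follow the API service journal while containers are being launched
journalctl -fu podman.service

# Or run the service in the foreground at debug level so every request is logged
systemctl stop podman.service podman.socket
podman --log-level=debug system service --time=0 unix:///run/podman/podman.sock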

Steps to reproduce the issue:

  1. Launch hundreds of containers per client node with Nomad

  2. Watch the Podman socket become unavailable and my Nomad job allocations start failing

Additional information you deem important (e.g. issue happens only occasionally):

Podman is being run as root on these client nodes on Fedora CoreOS 36.20220618.3.1 on Google Compute VMs.

Output of podman version:

# podman version
Client:       Podman Engine
Version:      4.1.0
API Version:  4.1.0
Go Version:   go1.18.2
Built:        Mon May 30 16:03:28 2022
OS/Arch:      linux/amd64

Output of podman info --debug:

# podman info --debug
host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpuUtilization:
    idlePercent: 99.77
    systemPercent: 0.1
    userPercent: 0.13
  cpus: 4
  distribution:
    distribution: fedora
    variant: coreos
    version: "36"
  eventLogger: journald
  hostname: nomad-ephemeral-production-0.internal.step.plumbing
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.18.5-200.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 32708116480
  memTotal: 33506603008
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.5-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.5
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 53m 28.18s
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 106825756672
  graphRootUsed: 4567363584
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.0
  Built: 1653926608
  BuiltTime: Mon May 30 16:03:28 2022
  GitCommit: ""
  GoVersion: go1.18.2
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.0

Package info (e.g. output of rpm -q podman or apt list podman):

# rpm -q podman
podman-4.1.0-8.fc36.x86_64
openshift-ci bot added the kind/bug label on Jul 14, 2022
baude (Member) commented Jul 14, 2022

Have you tried removing the Nomad client and flooding the socket without it?

baude (Member) commented Jul 14, 2022

@jwhonce any thoughts?

baude added the kind/performance label and removed the kind/bug label on Jul 14, 2022
jdoss (Contributor, Author) commented Jul 14, 2022

Hey @baude! Thanks for taking a look.

Have you tried removing the Nomad client and flooding the socket without it?

No, I haven't tried that. I am trying to think about how I would go about getting the same conditions without the Nomad client running. The driver uses the socket to stream logs for each container, so I think there are a lot of things going on that build up to the socket getting overloaded.
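One rough way to approximate it without Nomad (just a sketch, assuming root, the default /run/podman/podman.sock socket, a pre-pulled alpine image, and made-up container names) would be to drive the libpod REST API directly with parallel create/start/log-follow requests:

# Pre-pull the image so the create calls don't also trigger pulls
podman pull docker.io/library/alpine:latest

SOCK=/run/podman/podman.sock
for i in $(seq 1 200); do
  (
    # Create and start a container through the REST API
    curl -s --unix-socket "$SOCK" -X POST \
      -H "Content-Type: application/json" \
      -d "{\"image\":\"docker.io/library/alpine:latest\",\"name\":\"stress-$i\",\"command\":[\"sleep\",\"300\"]}" \
      http://d/v4.1.0/libpod/containers/create
    curl -s --unix-socket "$SOCK" -X POST \
      "http://d/v4.1.0/libpod/containers/stress-$i/start"
    # Hold a streaming logs connection open, like the driver does per container
    curl -s --unix-socket "$SOCK" \
      "http://d/v4.1.0/libpod/containers/stress-$i/logs?follow=true&stdout=true&stderr=true" >/dev/null
  ) &
done
wait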

baude (Member) commented Jul 14, 2022

Is it possible to exactly reproduce what you are doing? Otherwise, this is a lot to ask.

Luap99 (Member) commented Jul 14, 2022

If it's just the log endpoint, it is tracked here: #14879

jdoss (Contributor, Author) commented Jul 14, 2022

Is it possible to exactly reproduce what you are doing? Otherwise, this is a lot to ask.

@baude Not without launching your own Nomad cluster and loading each client node with 200+ containers. I understand it's a lot to ask, and I am willing to do whatever I can on my end to provide more information.

If it's just the log endpoint, it is tracked here: #14879

@Luap99 Yeah, the driver does use the log endpoint. Here is where I believe it is doing that:

https://github.com/hashicorp/nomad-driver-podman/blob/main/api/container_logs.go#L16
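For reference, each of those streams amounts to roughly one follow-mode request against the logs endpoint, held open for the life of the container (a sketch against the libpod REST API; "mytask" is a hypothetical container name):

curl --unix-socket /run/podman/podman.sock \
  "http://d/v4.1.0/libpod/containers/mytask/logs?follow=true&stdout=true&stderr=true"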

jdoss (Contributor, Author) commented Jul 14, 2022

It looks like I can disable log collection in the Nomad Podman driver.

plugin "nomad-driver-podman" {
          config {
            socket_path = "unix://var/run/podman/podman.sock"
            disable_log_collection = false
            volumes {
              enabled      = true
              selinuxlabel = "z"
            }
          }
        }

I am going to test that out on my client nodes and see if I have better performance when deploying a lot of containers at once.
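A simple way to check whether that change actually takes pressure off the socket (a sketch, assuming iproute2's ss is available on the node) is to compare the number of open connections to podman.sock before and after:

# Each followed log stream holds one connection to the socket open,
# so this count should drop sharply with log collection disabled
ss -xp | grep -c podman.sock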

github-actions (bot) commented

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Aug 15, 2022

Since we have heard nothing back in a month, I am guessing that the issue is resolved. Reopen if I am mistaken.

rhatdan closed this as completed Aug 15, 2022
jdoss (Contributor, Author) commented Aug 15, 2022

I am still seeing issues but I haven't been able to dig into it more. I will respond back once I have more info.

github-actions bot added the locked - please file new issue/PR label on Sep 19, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 19, 2023