
Network issues when using pasta instead of slirp4netns in podman 5.0.2 #22593

Closed
jadonclegg opened this issue May 3, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. network Networking related issue or feature pasta pasta(1) bugs or features

Comments

@jadonclegg

Issue Description

This is a somewhat complex scenario: I have a Java Spring web app that uses Flyway for database migrations. My database (postgres:14) runs in Podman using the official Docker image. When running a larger migration (400k lines) that uses COPY ... FROM STDIN; (I can't give a lot of details, it's an enterprise app), it consistently freezes and never completes. Switching Podman to --network host or --network slirp4netns fixes the issue, and the migration then completes almost instantly.

I'm not sure where to find any useful logs for more information: running podman logs -f postgres shows nothing relevant, and neither do the Tomcat logs. I'm also not sure where to get logs for the inner workings of Podman.

Steps to reproduce the issue

  1. Run a massive migration using COPY ... FROM STDIN in flyway on a db running postgres:14 in podman 5.0.2 using pasta as the network backend (rootless)

Describe the results you received

The database migration hangs. I'm assuming the network hangs or something similar, but I'm not sure how to get more information on this. I waited 10+ minutes with no change.

Describe the results you expected

The network shouldn't hang; the migration should complete.

podman info output

host:
  arch: amd64
  buildahVersion: 1.35.3
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc40.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 94.13
    systemPercent: 0.94
    userPercent: 4.94
  cpus: 20
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: workstation
    version: "40"
  eventLogger: journald
  freeLocks: 2045
  hostname: fedora
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.8.8-300.fc40.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 8644038656
  memTotal: 33302413312
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-1.fc40.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-3.fc40.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.14.4-1.fc40.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.14.4
      commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240426.gd03c4e2-1.fc40.x86_64
    version: |
      pasta 0^20240426.gd03c4e2-1.fc40.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-2.fc40.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 1h 9m 35.00s (Approximately 0.04 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/jadon/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/jadon/.local/share/containers/storage
  graphRootAllocated: 298802216960
  graphRootUsed: 38957015040
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/jadon/.local/share/containers/storage/volumes
version:
  APIVersion: 5.0.2
  Built: 1713312000
  BuiltTime: Tue Apr 16 18:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.1
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.2

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Fedora 40, all packages up to date as of May 3

Additional information

No response

@jadonclegg jadonclegg added the kind/bug Categorizes issue or PR as related to a bug. label May 3, 2024
@sbrivio-rh sbrivio-rh added network Networking related issue or feature pasta pasta(1) bugs or features labels May 3, 2024
@sbrivio-rh
Collaborator

A packet capture would be nice to have; you can ask pasta to save one with e.g. --network=pasta:--pcap,/tmp/postgres.pcap.

You can drop packets you don't want to share using pcapedit or even Wireshark, or take a look yourself and share what seems relevant as tshark output.

@jadonclegg
Author

jadonclegg commented May 6, 2024

Running a packet capture seems to pick up only a few Multicast Listener Report v2 and Router Solicitation packets, without any of the postgres traffic. Are there options I'm missing? After letting the container run for several minutes and running the Java app that performs the DB migrations, I can see the app contacting the DB and doing various things, but I only get about 12 packets total in the pcap.

podman run --rm -it --name postgres --network pasta:--pcap,/tmp/postgres.pcap -v pg:/var/lib/postgresql/data -p 5432:5432 -e POSTGRES_PASSWORD=postgres docker.io/postgres:14

pcap.zip

@sbrivio-rh
Collaborator

Running a packet capture seems to only pick up a few multicast listener report message v2 and router solicitation packets, without picking up any of the postgres traffic. Are there options I'm missing?

Ah, no, my bad: pasta implements a tap bypass path for local connections that splices TCP sockets between container and host directly, to improve throughput by avoiding Layer-2 / Layer-4 translations where not needed, see the appendage in this diagram. Packets on that path are not captured (because they're not really packets -- we just move the Layer-4 payload around).

See also #22575 (comment) (that issue might be related, but we're not sure there's an issue in pasta yet) and the following comment.

To capture traffic in this case, tcpdump -i lo -w lo.pcap in the container should do the trick.

@jadonclegg
Author

Here's the tshark output of a pcap run for a few minutes in the container using tcpdump -i lo -w lo.pcap while the migration was running. I can't share the full pcap because of the amount of unencrypted SQL/data going through it; I'd rather not have to check every packet for sensitive data and edit it out.

You can see that around 2.7s in (near the end of the file) all TCP traffic stops until around 121s in. The migration never finishes and I stop things manually, although I haven't let it run for more than about 15 minutes. Usually the hundreds of migrations take less than 2 minutes total with slirp4netns, whereas this one migration gets stuck for over 10 minutes with pasta, with seemingly nothing actually happening.

tshark.txt

@sbrivio-rh
Collaborator

You can see around 2.7s in (near the end of the file) all tcp traffic stops until around 121s in.

Thanks a lot for the capture, this seems to be compatible with the current findings from #22575 -- I'm trying to reproduce that at the moment.

@sbrivio-rh
Collaborator

@jadonclegg, assuming that #22575 is in fact the same issue, I just posted a patch to fix this at https://archives.passt.top/passt-dev/20240508090338.2735208-1-sbrivio@redhat.com/ (see also #22575 (comment)).

It would be great if you could test it by building from source (something along the lines of git clone git://passt.top/passt; cd passt; curl https://archives.passt.top/passt-dev/20240508090338.2735208-1-sbrivio@redhat.com/raw | git am; prefix=/usr sudo make install).

@jadonclegg
Author

@sbrivio-rh your patch appears to have fixed the issue. After compiling and installing it, I ran the migration three times in a row and it worked every time. I then reverted to the commit before yours, installed again, and the migration got stuck at the same place it was happening before installing your patched version.

@sbrivio-rh
Collaborator

This is now fixed in passt version 2024_05_10.7288448 and the corresponding Fedora 40 update.

hswong3i pushed a commit to alvistack/passt-top-passt that referenced this issue May 11, 2024
In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(),
from pipe to receiving socket, that doesn't necessarily mean that the
pipe is empty: the receiver buffer might be full instead.

Hence, we can't use the 'never_read' flag to decide that there's
nothing to wait for: even if we didn't read anything from the sending
side in a given iteration, we might still have data to send in the
pipe. Use read/written counters, instead.

This fixes an issue where large bulk transfers would occasionally
hang. From a corresponding strace:

     0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1
     0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0
     0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0
     0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 1048576
     0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = 520415
     0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
     0.000044 epoll_wait(4, [], 8, 1000) = 0

we're reading from socket 211 into the pipe end numbered 151,
which connects to pipe end 150, and from there we're writing into
socket 115.

We initially drop EPOLLOUT from the set of monitored flags for socket
115, because it already signaled it's ready for output. Then we read
nothing from socket 211 (the sender had nothing to send), and we keep
emptying the pipe into socket 115 (first 1048576 bytes, then 520415
bytes).

This call of tcp_splice_sock_handler() ends with EAGAIN on the writing
side, and we just exit this function without setting the OUT_WAIT_1
flag (and, in turn, EPOLLOUT for socket 115). However, it turns out,
the pipe wasn't actually emptied, and while socket 211 had nothing
more to send, we should have waited on socket 115 to be ready for
output again.

As a further step, we could consider not clearing EPOLLOUT at all,
unless the read/written counters match, but I'm first trying to fix
this ugly issue with a minimal patch.

Link: containers/podman#22575
Link: containers/podman#22593
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
@rhatdan rhatdan closed this as completed May 13, 2024