[5.0.0] Rootless networking with custom network is broken #22146

Closed
maxi0604 opened this issue Mar 23, 2024 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. network Networking related issue or feature pasta pasta(1) bugs or features

@maxi0604

maxi0604 commented Mar 23, 2024

Issue Description

When using rootless podman and a network created with podman network create foo, the container doesn't have internet access. The issue is not specific to IPv4-only networks and also occurs with podman network create --ipv6 bar.

Steps to reproduce the issue

  1. podman network create foo
  2. podman run -it --rm --network=foo alpine wget google.com

Describe the results you received

The hostname resolves to an IP address, but the command hangs. ping (and ping6) work as expected.

Describe the results you expected

The command goes through.

podman info output

host:
  arch: amd64
  buildahVersion: 1.35.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.10-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: 2dcd736e46ded79a53339462bc251694b150f870'
  cpuUtilization:
    idlePercent: 98.38
    systemPercent: 0.56
    userPercent: 1.06
  cpus: 12
  databaseBackend: sqlite
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  freeLocks: 2047
  hostname: hermes
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.8.1-arch1-1
  linkmode: dynamic
  logDriver: journald
  memFree: 8785002496
  memTotal: 16076029952
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: /usr/lib/podman/aardvark-dns is owned by aardvark-dns 1.10.0-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: /usr/lib/podman/netavark is owned by netavark 1.10.3-1
    path: /usr/lib/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.14.4-1
    path: /usr/bin/crun
    version: |-
      crun version 1.14.4
      commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_03_20.71dd405-1
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 0h 19m 48.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/maxi/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/maxi/.local/share/containers/storage
  graphRootAllocated: 511554093056
  graphRootUsed: 71551737856
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/maxi/.local/share/containers/storage/volumes
version:
  APIVersion: 5.0.0
  Built: 1711060217
  BuiltTime: Thu Mar 21 23:30:17 2024
  GitCommit: e71ec6f1d94d2d97fb3afe08aae0d8adaf8bddf0-dirty
  GoVersion: go1.22.1
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

@maxi0604 maxi0604 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 23, 2024
@maxi0604 maxi0604 changed the title Rootless networking with custom network is broken [5.0.0] Rootless networking with custom network is broken Mar 23, 2024
@sbrivio-rh sbrivio-rh added the pasta pasta(1) bugs or features label Mar 24, 2024
@sbrivio-rh
Collaborator

Faking a --pcap option in pasta, for wget google.com I see a RST from the container just after the request and a TCP window update from pasta, see frame 19 below:

$ tshark -r /tmp/hack.pcap
    1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
    2   0.520126           :: → ff02::1:ff00:2 ICMPv6 86 Neighbor Solicitation for 2a01:4f8:222:904::2
    3   0.584131           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
    4   0.616134           :: → ff02::1:ffdb:35eb ICMPv6 86 Neighbor Solicitation for fe80::6417:aaff:fedb:35eb
    5   1.640101 fe80::6417:aaff:fedb:35eb → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
    6   1.640122 fe80::6417:aaff:fedb:35eb → ff02::2      ICMPv6 70 Router Solicitation from 66:17:aa:db:35:eb
    7   2.600041 fe80::6417:aaff:fedb:35eb → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
    8   4.353807 66:17:aa:db:35:eb → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
    9   4.353843 ASRockIn_8e:d7:b6 → 66:17:aa:db:35:eb ARP 42 88.198.0.161 is at a8:a1:59:8e:d7:b6
   10   4.353851 88.198.0.164 → 185.12.64.1  DNS 70 Standard query 0x495e A google.com
   11   4.353855 88.198.0.164 → 185.12.64.1  DNS 70 Standard query 0x30d9 AAAA google.com
   12   4.354172  185.12.64.1 → 88.198.0.164 DNS 98 Standard query response 0x30d9 AAAA google.com AAAA 2a00:1450:4001:829::200e
   13   4.354183  185.12.64.1 → 88.198.0.164 DNS 86 Standard query response 0x495e A google.com A 142.250.184.206
   14   4.354416 88.198.0.164 → 142.250.184.206 TCP 74 44824 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=262450325 TSecr=0 WS=4096
   15   4.359743 142.250.184.206 → 88.198.0.164 TCP 62 80 → 44824 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256
   16   4.359749 88.198.0.164 → 142.250.184.206 TCP 54 44824 → 80 [ACK] Seq=1 Ack=1 Win=65536 Len=0
   17   4.359788 88.198.0.164 → 142.250.184.206 HTTP 127 GET / HTTP/1.1 
   18   4.359813 142.250.184.206 → 88.198.0.164 TCP 54 [TCP Window Update] 80 → 44824 [<None>] Seq=1 Win=65280 Len=0
   19   4.359817 88.198.0.164 → 142.250.184.206 TCP 54 44824 → 80 [RST, ACK] Seq=2622007341 Ack=1 Win=0 Len=0
   20   4.568208 88.198.0.164 → 142.250.184.206 TCP 127 [TCP Retransmission] 44824 → 80 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=73
   21   5.000035 88.198.0.164 → 142.250.184.206 TCP 127 [TCP Retransmission] 44824 → 80 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=73
   22   5.736172 fe80::6417:aaff:fedb:35eb → ff02::2      ICMPv6 70 Router Solicitation from 66:17:aa:db:35:eb
   23   5.864084 88.198.0.164 → 142.250.184.206 TCP 127 [TCP Retransmission] 44824 → 80 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=73
   24   6.980008 88.198.0.164 → 142.250.184.206 TCP 54 44824 → 80 [FIN, ACK] Seq=74 Ack=1 Win=65536 Len=0
   25   7.560166 88.198.0.164 → 142.250.184.206 TCP 127 [TCP Retransmission] 44824 → 80 [FIN, PSH, ACK] Seq=1 Ack=1 Win=65536 Len=73
   26  11.112077 88.198.0.164 → 142.250.184.206 TCP 127 [TCP Retransmission] 44824 → 80 [FIN, PSH, ACK] Seq=1 Ack=1 Win=65536 Len=73
   27  13.928009 fe80::6417:aaff:fedb:35eb → ff02::2      ICMPv6 70 Router Solicitation from 66:17:aa:db:35:eb

...if I enter the target namespace and capture traffic from there, I don't see that segment, though:

10:47:34.893690 IP (tos 0x0, ttl 64, id 46395, offset 0, flags [DF], proto TCP (6), length 60)
    10.89.0.9.39992 > 142.250.185.110.80: Flags [S], cksum 0x52f9 (incorrect -> 0xde27), seq 3706848507, win 64240, options [mss 1460,sackOK,TS val 1860972213 ecr 0,nop,wscale 12], length 0
10:47:34.898732 IP (tos 0x0, ttl 254, id 0, offset 0, flags [none], proto TCP (6), length 48)
    142.250.185.110.80 > 10.89.0.9.39992: Flags [S.], cksum 0x02cf (correct), seq 873828756, ack 3706848508, win 65535, options [mss 61440,nop,wscale 8], length 0
10:47:34.898756 IP (tos 0x0, ttl 64, id 46396, offset 0, flags [DF], proto TCP (6), length 40)
    10.89.0.9.39992 > 142.250.185.110.80: Flags [.], cksum 0x52e5 (incorrect -> 0x18d8), ack 1, win 16, length 0
10:47:34.898783 IP (tos 0x0, ttl 64, id 46397, offset 0, flags [DF], proto TCP (6), length 113)
    10.89.0.9.39992 > 142.250.185.110.80: Flags [P.], cksum 0x532e (incorrect -> 0x1f9b), seq 1:74, ack 1, win 16, length 73: HTTP, length: 73
	GET / HTTP/1.1
	Host: google.com
	User-Agent: Wget
	Connection: close

Is it from netfilter? It doesn't look like netavark is configuring anything that might lead to that:

# nft list ruleset
table ip nat {
	chain NETAVARK-F7FBBA6E0636F {
		ip daddr 10.89.0.0/24 counter packets 0 bytes 0 accept
		ip daddr != 224.0.0.0/4 counter packets 1 bytes 60 # xt_MASQUERADE
	}

	chain POSTROUTING {
		type nat hook postrouting priority srcnat; policy accept;
		counter packets 8 bytes 512 jump NETAVARK-HOSTPORT-MASQ
		ip saddr 10.89.0.0/24 counter packets 2 bytes 100 jump NETAVARK-F7FBBA6E0636F
	}

	chain NETAVARK-HOSTPORT-SETMARK {
		counter packets 0 bytes 0 # xt_MARK
	}

	chain NETAVARK-HOSTPORT-MASQ {
		# xt_comment meta mark & 0x00002000 == 0x00002000 counter packets 0 bytes 0 # xt_MASQUERADE
	}

	chain NETAVARK-HOSTPORT-DNAT {
	}

	chain PREROUTING {
		type nat hook prerouting priority dstnat; policy accept;
		# xt_addrtype counter packets 1 bytes 56 jump NETAVARK-HOSTPORT-DNAT
	}

	chain OUTPUT {
		type nat hook output priority -100; policy accept;
		# xt_addrtype counter packets 0 bytes 0 jump NETAVARK-HOSTPORT-DNAT
	}
}
table ip filter {
	chain NETAVARK_FORWARD {
		ip daddr 10.89.0.0/24 # xt_conntrack counter packets 1 bytes 48 accept
		ip saddr 10.89.0.0/24 counter packets 11 bytes 1117 accept
	}

	chain FORWARD {
		type filter hook forward priority filter; policy accept;
		# xt_comment counter packets 12 bytes 1165 jump NETAVARK_FORWARD
	}
}

@Luap99
Member

Luap99 commented Mar 24, 2024

@maxi0604 Is this ipv4 or ipv6 traffic that is not working? I only have access to ipv4 systems so I cannot test v6.
Does it work with --network pasta?

For the custom rootless network case the setup is more complicated, as it involves both pasta and netavark, so it is not easy to tell where things go wrong. You can enter our rootless netns with podman unshare --rootless-netns; if the container is running, you should see both the pasta interface (which should have the same name as your external interface) and the podman/netavark bridge interface (called podmanX).
So, to get a full packet dump, run something like podman unshare --rootless-netns tcpdump -nn -i any in another terminal and then run your reproducer again; we should then see where the packets are getting lost.

@Luap99 Luap99 added the network Networking related issue or feature label Mar 24, 2024
@maxi0604
Author

maxi0604 commented Mar 24, 2024

@Luap99

Is this ipv4 or ipv6 traffic that is not working? I only have access to ipv4 systems so I cannot test v6.

I've explicitly tested both and both show the same hang. If the network was not created with --ipv6, then trying an ipv6-only connection does fail fast as expected.

Does it work with --network pasta?

Yes, that seems to work with v4 and v6.

For the custom rootless network case the setup is more complicated, as it involves both pasta and netavark, so it is not easy to tell where things go wrong. You can enter our rootless netns with podman unshare --rootless-netns; if the container is running, you should see both the pasta interface (which should have the same name as your external interface) and the podman/netavark bridge interface (called podmanX). So, to get a full packet dump, run something like podman unshare --rootless-netns tcpdump -nn -i any in another terminal and then run your reproducer again; we should then see where the packets are getting lost.

$ podman unshare --rootless-netns tcpdump -nn -i any
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:26:26.973408 veth0 B   ARP, Request who-has 10.89.2.2 tell 10.89.2.2, length 28
16:26:26.978505 podman3 Out IP 10.89.2.1 > 224.0.0.22: igmp v3 report, 1 group record(s)
16:26:26.978516 veth0 Out IP 10.89.2.1 > 224.0.0.22: igmp v3 report, 1 group record(s)
16:26:26.981891 veth0 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 2 group record(s), length 48
16:26:26.981920 podman3 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:26.981941 veth0 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:26.981956 veth0 M   IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:26.981967 podman3 M   IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:26.991893 veth0 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 2 group record(s), length 48
16:26:27.048404 podman3 Out IP 10.89.2.1 > 224.0.0.22: igmp v3 report, 1 group record(s)
16:26:27.048422 veth0 Out IP 10.89.2.1 > 224.0.0.22: igmp v3 report, 1 group record(s)
16:26:27.121090 veth0 B   ARP, Request who-has 10.89.2.1 tell 10.89.2.2, length 28
16:26:27.121098 podman3 B   ARP, Request who-has 10.89.2.1 tell 10.89.2.2, length 28
16:26:27.121123 podman3 Out ARP, Reply 10.89.2.1 is-at ea:97:e8:10:7f:5f, length 28
16:26:27.121128 veth0 Out ARP, Reply 10.89.2.1 is-at ea:97:e8:10:7f:5f, length 28
16:26:27.121138 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [S], seq 507767311, win 32120, options [mss 1460,sackOK,TS val 307039792 ecr 0,nop,wscale 7], length 0
16:26:27.121140 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [S], seq 507767311, win 32120, options [mss 1460,sackOK,TS val 307039792 ecr 0,nop,wscale 7], length 0
16:26:27.121190 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [S], seq 507767311, win 32120, options [mss 1460,sackOK,TS val 307039792 ecr 0,nop,wscale 7], length 0
16:26:27.129336 wlan0 In  IP 142.250.74.206.80 > 172.17.61.166.56748: Flags [S.], seq 2288725518, ack 507767312, win 65535, options [mss 61440,nop,wscale 8], length 0
16:26:27.129373 podman3 Out IP 142.250.74.206.80 > 10.89.2.2.56748: Flags [S.], seq 2288725518, ack 507767312, win 65535, options [mss 61440,nop,wscale 8], length 0
16:26:27.129378 veth0 Out IP 142.250.74.206.80 > 10.89.2.2.56748: Flags [S.], seq 2288725518, ack 507767312, win 65535, options [mss 61440,nop,wscale 8], length 0
16:26:27.129408 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [.], ack 1, win 251, length 0
16:26:27.129410 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [.], ack 1, win 251, length 0
16:26:27.129421 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [.], ack 1, win 251, length 0
16:26:27.129499 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.129502 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.129566 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.129707 wlan0 In  IP 142.250.74.206.80 > 172.17.61.166.56748: Flags [none], win 255, length 0
16:26:27.129738 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [R.], seq 3787199985, ack 1, win 0, length 0
16:26:27.261774 veth0 M   IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:27.261789 podman3 M   IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:27.341741 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.341750 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.341802 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.368433 veth0 Out IP6 :: > ff02::1:ff4f:949a: ICMP6, neighbor solicitation, who has fe80::5cce:1cff:fe4f:949a, length 32
16:26:27.528188 podman3 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:27.528206 veth0 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:27.638446 veth0 M   IP6 :: > ff02::1:ff3a:267b: ICMP6, neighbor solicitation, who has fe80::bc6e:c9ff:fe3a:267b, length 32
16:26:27.638456 podman3 M   IP6 :: > ff02::1:ff3a:267b: ICMP6, neighbor solicitation, who has fe80::bc6e:c9ff:fe3a:267b, length 32
16:26:27.718555 podman3 Out IP6 :: > ff02::1:ff10:7f5f: ICMP6, neighbor solicitation, who has fe80::e897:e8ff:fe10:7f5f, length 32
16:26:27.718577 veth0 Out IP6 :: > ff02::1:ff10:7f5f: ICMP6, neighbor solicitation, who has fe80::e897:e8ff:fe10:7f5f, length 32
16:26:27.768217 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.768228 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:27.768270 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:28.381582 veth0 Out IP6 fe80::5cce:1cff:fe4f:949a > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:28.391525 veth0 Out IP6 fe80::5cce:1cff:fe4f:949a > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:28.528441 veth0 Out IP6 fe80::5cce:1cff:fe4f:949a > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:28.621747 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:28.621755 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:28.621801 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:28.648248 veth0 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:28.648274 podman3 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:28.648309 veth0 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:28.648311 podman3 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:28.728483 veth0 Out IP6 fe80::5cce:1cff:fe4f:949a > ff02::16: HBH ICMP6, multicast listener report v2, 3 group record(s), length 68
16:26:28.728515 podman3 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 4 group record(s), length 88
16:26:28.728540 veth0 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 4 group record(s), length 88
16:26:28.738407 podman3 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:28.738413 veth0 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:29.581568 podman3 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 4 group record(s), length 88
16:26:29.581589 veth0 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 4 group record(s), length 88
16:26:29.635135 podman3 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:29.635160 veth0 Out IP6 fe80::e897:e8ff:fe10:7f5f > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:29.661510 veth0 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:29.661522 podman3 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
16:26:30.328262 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:30.328269 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:30.328309 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:32.194847 wlan0 Out ARP, Request who-has 172.17.60.1 tell 172.17.61.166, length 28
16:26:32.194855 podman3 Out ARP, Request who-has 10.89.2.2 tell 10.89.2.1, length 28
16:26:32.194865 veth0 Out ARP, Request who-has 10.89.2.2 tell 10.89.2.1, length 28
16:26:32.194930 veth0 P   ARP, Reply 10.89.2.2 is-at be:6e:c9:3a:26:7b, length 28
16:26:32.194931 wlan0 In  ARP, Reply 172.17.60.1 is-at 40:1a:58:6c:2f:87, length 28
16:26:32.194939 podman3 In  ARP, Reply 10.89.2.2 is-at be:6e:c9:3a:26:7b, length 28
16:26:32.835226 veth0 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:32.835236 podman3 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:33.901550 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:33.901560 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:33.901616 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:40.728234 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:40.728244 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:40.728288 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [P.], seq 1:78, ack 1, win 251, length 77: HTTP: GET / HTTP/1.1
16:26:41.158185 veth0 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:41.158194 podman3 M   IP6 fe80::bc6e:c9ff:fe3a:267b > ff02::2: ICMP6, router solicitation, length 16
16:26:42.336506 veth0 P   IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [F.], seq 78, ack 1, win 251, length 0
16:26:42.336513 podman3 In  IP 10.89.2.2.56748 > 142.250.74.206.80: Flags [F.], seq 78, ack 1, win 251, length 0
16:26:42.336543 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [F.], seq 78, ack 1, win 251, length 0

142.250.74.206:80 is google.com.

I think the IPv6 output is unrelated, the network was created without --ipv6.

@sbrivio-rh
Collaborator

I don't think there's any packet getting lost, @maxi0604's output is consistent with mine, here is the RST segment:

16:26:27.129738 wlan0 Out IP 172.17.61.166.56748 > 142.250.74.206.80: Flags [R.], seq 3787199985, ack 1, win 0, length 0

I strace'd pasta, and it close()s the "host" socket as it gets this, as expected.
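
For anyone reading the captures above, tcpdump's bracketed flag notation can be decoded mechanically. The sketch below is only an illustration: the helper name decode_tcpdump_flags is made up here and is not part of any tool. It relies on tcpdump's convention that "." stands for ACK and "none" means no flags are set at all.

```python
# Hypothetical helper, for illustration only (not part of tcpdump or pasta).
TCPDUMP_FLAG_CHARS = {
    "F": "FIN",
    "S": "SYN",
    "R": "RST",
    "P": "PSH",
    ".": "ACK",
    "U": "URG",
    "E": "ECE",
    "W": "CWR",
}


def decode_tcpdump_flags(field: str) -> set[str]:
    """Turn a tcpdump flags field such as '[P.]' into a set of flag names."""
    inner = field.strip("[]")
    if inner == "none":
        # tcpdump prints 'none' when the TCP header carries no flags.
        return set()
    return {TCPDUMP_FLAG_CHARS[ch] for ch in inner}


if __name__ == "__main__":
    # Flag fields that appear in the capture posted in this thread.
    for field in ("[S]", "[S.]", "[P.]", "[none]", "[R.]"):
        print(field, sorted(decode_tcpdump_flags(field)))
```

Read this way, the segment at 16:26:27.129707 is the only one in the capture with an empty flag set, and the [R.] segment follows it immediately.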

@sbrivio-rh
Collaborator

sbrivio-rh commented Mar 24, 2024

[Distractedly thinking about this, sorry for the rain of comments] On second thought, we can't exclude that the kernel sees the window update frame (18 in my first capture, #22146 (comment)) as somewhat strange and resets the connection because of it.

The acknowledgement sequence is increased by one compared to the SYN, ACK segment, but the ACK flag is not set (because we want to update the window) -- that should be legitimate but somewhat unusual.

@sbrivio-rh
Collaborator

Tagging @dgibson in case that rings a bell.

@sbrivio-rh
Collaborator

Confirmed, the kernel doesn't seem to like (anymore?) a segment that just updates the window, without any flag set, and with the acknowledgement sequence matching the previous one. If I force the ACK flag in pasta, here:

diff --git a/tcp.c b/tcp.c
index a1860d1..7785ab3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1679,7 +1679,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
        } else {
                th->ack = !!(flags & (ACK | DUP_ACK)) ||
                          conn->seq_ack_to_tap != prev_ack_to_tap ||
-                         !prev_wnd_to_tap;
+                         !prev_wnd_to_tap || 1;
        }
 
        th->doff = (sizeof(*th) + optlen) / 4;

then we don't get a reset and wget completes.

@sbrivio-rh
Collaborator

This smells like a kernel issue to me and we should look into that. Probably reasonable workaround meanwhile: if we just completed the three-way handshake, with a connection started from the tap side (container), reset our own value of the window we sent to the container, in order to force an ACK flag on the next segment (including a possible window update, as it happens here):

diff --git a/tcp.c b/tcp.c
index a1860d1..1135c71 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2629,6 +2629,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
                        goto reset;
 
                conn_event(c, conn, ESTABLISHED);
+               conn->wnd_to_tap = 0;
 
                if (th->fin) {
                        conn->seq_from_tap++;

lightly tested, this seems to work as well.
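
To make the effect of the condition in tcp_send_flag() easier to follow, here is a rough Python model of the expression visible in the diff context. It is a sketch under assumptions, not pasta's code: the function name and flag constants are illustrative, prev_wnd_to_tap is assumed to hold the previously stored wnd_to_tap value, and everything else the real function does is ignored.

```python
# Illustrative constants; pasta's real flag values may differ.
ACK = 1 << 0
DUP_ACK = 1 << 1


def ack_flag(flags, seq_ack_to_tap, prev_ack_to_tap, prev_wnd_to_tap):
    """Model of the condition that decides whether th->ack is set."""
    return (
        bool(flags & (ACK | DUP_ACK))          # caller asked for an ACK
        or seq_ack_to_tap != prev_ack_to_tap   # acknowledged sequence moved
        or not prev_wnd_to_tap                 # no window advertised so far
    )


# Plain window update right after the handshake: no flag requested,
# acknowledged sequence unchanged, a window was already advertised.
window_update_without_ack = ack_flag(0, 1, 1, 65536)

# Same situation, but with the stored window reset to 0 beforehand,
# as the workaround does when the connection becomes ESTABLISHED.
window_update_after_reset = ack_flag(0, 1, 1, 0)

if __name__ == "__main__":
    print(window_update_without_ack)  # False
    print(window_update_after_reset)  # True
```

In this model the first case is the problematic segment (a window update with no ACK flag), and zeroing the stored window makes the last term true, so the next segment carries ACK.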

@dgibson
Collaborator

dgibson commented Mar 25, 2024

I don't think it is a kernel issue. @sbrivio-rh pointed out this kernel commit. It states that RFC 793 requires that packets without an ACK be dropped, and my reading of RFC 793 and its successors concurs. See for example here.

I think we should be setting ACK on all non-SYN, non-RST packets. What we do for RST packets is a bit more complicated.

Currently trying to figure out how to correct this without excessive churn. I've also filed an upstream pasta bug to track it.

@sbrivio-rh
Collaborator

While we fix this in pasta and make updated packages available, I tested this nftables-based workaround:

nft 'add chain ip filter input { type filter hook input priority 0; }'
nft add rule filter input 'tcp flags & (syn | rst | ack) == 0 counter drop'

from the target network namespace (for pasta itself).

For some reason podman unshare --rootless-netns didn't bring me there, so I entered it with nsenter -U -n -t $(pidof aardvark-dns).

The idea is to drop any TCP segment that has none of the SYN, RST, and ACK flags set, before some kernel component (we haven't figured that out yet) resets the connection. @dgibson also points out that RFC 9293 says those segments should be discarded, but not that they should cause a reset. This part looks like a kernel issue to me.
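
The match expression in the nft rule is a bitmask test on the TCP flags byte. A minimal sketch of which segments it would drop, using the standard TCP header flag bit values (the function name rule_drops is illustrative only):

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def rule_drops(tcp_flags: int) -> bool:
    """Mimic 'tcp flags & (syn | rst | ack) == 0': true if none of the three is set."""
    return (tcp_flags & (SYN | RST | ACK)) == 0


if __name__ == "__main__":
    samples = {
        "no flags (bare window update)": 0,
        "SYN": SYN,
        "SYN, ACK": SYN | ACK,
        "PSH, ACK": PSH | ACK,
        "RST, ACK": RST | ACK,
    }
    for name, flags in samples.items():
        print(f"{name}: {'drop' if rule_drops(flags) else 'pass'}")
```

Only the flagless window update is dropped; handshake, data and reset segments pass through untouched.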

@KirilMihaylov

I can confirm that on 5.0 it is broken with the default bridged network adapter when running on WSL. Unless a custom DNS server is added, e.g. Cloudflare's 1.1.1.1, DNS requests fail.

@maxi0604
Author

I can confirm that on 5.0 it is broken with the default bridged network adapter when running on WSL. Unless a custom DNS server is added, e.g. Cloudflare's 1.1.1.1, DNS requests fail.

This seems different. In my case, DNS and ping work but the actual TCP transfer fails.

@KirilMihaylov

I'm sorry then. I must have misunderstood the reported issue. My apologies!

@dgibson
Collaborator

dgibson commented Mar 25, 2024

@KirilMihaylov, which pasta version do you have installed? There was a DNS-related issue fixed recently, which you might be seeing.

@dgibson dgibson self-assigned this Mar 26, 2024
@dgibson
Collaborator

dgibson commented Mar 26, 2024

I have something I hope is a fix, essentially a polished version of Stefano's suggestion. Unfortunately I haven't been able to test it against the specific problem here, because I wasn't able to reproduce. I don't know quite what's different about my setup, but the wget from an alpine container is working fine for me with podman 5.0.0 and existing pasta binaries.

@dgibson
Collaborator

dgibson commented Mar 26, 2024

Ok, tree with the draft fix is here. I believe @sbrivio-rh will be able to make a release, and we can test from there.

@sbrivio-rh
Collaborator

Unfortunately I haven't been able to test it against the specific problem here, because I wasn't able to reproduce.

I'm able to reproduce the issue reliably, and your series fixes it for me. Testing and releasing now.

I don't know quite what's different about my setup, but the wget from an alpine container is working fine for me with podman 5.0.0 and existing pasta binaries.

I think it's pretty much a combination of two factors, which might be unlikely or impossible to reproduce on some setups: first off we get a slightly different window value from the socket (65280 instead of 65536 bytes in my case) between three-way handshake and just after it, and we reflect it to the container, hence the problematic packet.

Second, we write the HTTP request to the socket, but we don't see it being acknowledged right away (hence no increase of acknowledged sequence and no ACK flag in the problematic packet).

@sbrivio-rh
Collaborator

This should now be fixed in the new version 2024_03_26.4988e2b.

As the Arch Linux maintainer just happened to merge a change two hours ago, I guess you'll get an updated package for Arch rather soon.

@maxi0604
Author

maxi0604 commented Mar 26, 2024

This should now be fixed in the new version 2024_03_26.4988e2b.

As the Arch Linux maintainer just happened to merge a change two hours ago, I guess you'll get an updated package for Arch rather soon.

I've flagged the package in the Arch repository. Thanks for the quick fix, everyone.

@maxi0604
Author

The update has been released and works, so I'm closing this.

hswong3i pushed a commit to alvistack/passt-top-passt that referenced this issue Mar 27, 2024
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window.  This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct.  RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.

The reason for the existing logic, rather than always forcing an ACK seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit.  However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.

[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1

Link: containers/podman#22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>