Pods cannot talk to cluster IPs on Ubuntu 2204 #2103

Closed · Tracked by #14140
olemarkus opened this issue Oct 5, 2022 · 42 comments

@olemarkus

What happened:

After upgrading clusters to use Ubuntu 22.04 by default, the kOps e2e tests started failing for this CNI: https://testgrid.k8s.io/kops-network-plugins#kops-aws-cni-amazon-vpc

What seems to happen is that Pods do receive IPs, but they fail to talk across nodes. Calling e.g. a ClusterIP service from the host works, but not from a Pod, so kube-proxy itself should be working just fine.

I cannot see anything wrong in any of the logs. What I do see is that there are AWS-related rules in legacy iptables, while kube-proxy uses nftables. My guess is that this mix is the cause of the behavior; nft and legacy iptables must not be mixed anyway.

Attach logs
Example logs here: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/e2e-kops-aws-cni-amazon-vpc/1577618499142946816/artifacts/i-0d90e121da8bff687/

How to reproduce it (as minimally and precisely as possible):

kops create cluster --name test.kops-dev.srsandbox.io --cloud aws --networking=amazonvpc --zones=eu-central-1a,eu-central-1b,eu-central-1c --channel=alpha --master-count=3 --yes --kubernetes-version 1.25.0 --discovery-store=$KOPS_STATE_STORE/discovery --image=099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220921.1
@jayanthvn
Contributor

This looks similar to this issue - #1847 (comment).

This workaround - #1847 (comment) has helped.

As suggested by @achevuru - Amazon Linux 2 images use iptables-legacy by default as well. We will check and update if there is something we can do to address this scenario.

@olemarkus
Author

Let me try that workaround. If it works, I think it would be helpful if an iptables-nft image could be published. I imagine it wouldn't be too much work to do that.

@jayanthvn
Contributor

Thanks, please let us know if it works.

@olemarkus
Author

olemarkus commented Oct 5, 2022

Unfortunately, no luck. The workaround does remove the rules from iptables-legacy and I do see them now in nftables. Pods still cannot talk to cluster IPs.

Can also confirm I still see nothing interesting in the logs, and Pods do get their IPs.

@olemarkus
Author

Any idea how to progress on this? Anywhere we should look for potential issues?

@jayanthvn
Contributor

@olemarkus - Sorry for the delay. Since you mentioned pod-to-pod communication is broken, and in case you haven't already verified this, can you please run tcpdump on the sender pod's host-side veth, the sender node, the receiving node, and the receiving node's host-side veth? This should provide context on where the traffic is getting dropped.

@olemarkus
Author

As far as I can tell, pod-to-pod communication works when the pods talk to each other directly. It's pod -> ClusterIP that does not work, except when the Pod is running in hostNetwork mode.

@achevuru
Contributor

achevuru commented Oct 18, 2022

I would expect pod -> ClusterIP to work from within the pod if it works fine from the node (or a hostNetwork pod), because the DNAT rules that replace the ClusterIP with one of the backend endpoints are installed by kube-proxy in the root (host) network namespace, not inside the pod network namespace.

Were you able to track the packet via tcpdump?

@olemarkus
Author

Right.
To make this simple, I created one pod (A) without host networking and one pod with (B).
Pod A tries to reach pod B behind a clusterIP service. This should minimize the levels of indirection in the networking to reproduce this.

Running tcpdump against A's veth, I see the packets going from the pod IP to the service IP. However, running tcpdump on any of the host interfaces shows no packets coming in from A's IP.

This is with a custom build of aws-node using NFT iptables.

The DNAT rules seem to be working fine, since connecting from the host to the cluster IP works.

@achevuru
Contributor

So, if I understood it right - connecting to a Pod via ClusterIP from a pod without hostNetworking fails, but the same works the other way around?

When you say tcpdump against the Pod's veth - are you referring to the veth interface inside the pod's network namespace, or the veth interface in the host network namespace? Would you be able to share the iptables output (both the NAT and filter tables) with us at k8s-awscni-triage@amazon.com, from both the host and pod network namespaces?

@olemarkus
Author

I have not tested from B to A.
I referred to the host-side veth interface. Since I saw traffic on the host side, I assumed it had already exited the pod namespace.
I'll send you the output.

Also worth mentioning that this is very easy to reproduce with the latest kOps using e.g. --image=099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220921.1

@achevuru
Contributor

Interesting - if you see the packet on the host-side veth, then we know it landed on the host network end, and the behavior should now be similar to a connection initiated from the node. Are there any active network policies on the node? We will check the logs/iptables output once we receive them and will update here.

We will see if we can reproduce with the above image as well.

@olemarkus
Author

Sent the iptables output.

Also tried disabling rp_filter on the veth interface, but it didn't seem to have much effect.

There are no network policies or similar on the node, other than what Ubuntu 22.04 may be doing by default.

@veshij
Contributor

veshij commented Oct 22, 2022

Troubleshooting a similar issue on our cluster. A configuration which works on u20 doesn't work after an upgrade to u22.
So far I have found that outbound packets from pods are dropped in the kernel, likely here.

@veshij
Contributor

veshij commented Oct 22, 2022

I think I found the issue.
Looks like the CNI plugin adds an incorrect static ARP entry for 169.254.1.1 inside the pod network namespace (or the veth MAC changes later on).

@achevuru
Contributor

@veshij VPC CNI does add a static ARP entry for the default GW (169.254.1.1, pointing to the host-side veth) inside the pod network namespace. So, it is essentially for the host-side veth.

https://github.com/aws/amazon-vpc-cni-k8s/blob/314625892e91ce36fba87211694537071b590c92/docs/cni-proposal.md#inside-a-pod

// Add a connected route to a dummy next hop (169.254.1.1 or fe80::1)

// add static ARP entry for default gateway

Are you saying the packet is dropped at the host veth because of an L2 header discrepancy (i.e., a mismatch with the host veth's MAC)? As you can see, we derive the hostVeth MAC and use it, so the veth MAC must be changing. We can check the veth MAC recorded inside the pod network namespace and compare it against the current value (a sketch of that check follows the snippet below).

neigh := &netlink.Neigh{
		LinkIndex:    contVeth.Attrs().Index,
		State:        netlink.NUD_PERMANENT,
		IP:           gwNet.IP,
		HardwareAddr: hostVeth.Attrs().HardwareAddr,
	}
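
A minimal sketch of that check, using the same netlink and ns packages the CNI uses; the host veth name and pod netns path below are only illustrative examples, not values from this issue:

package main

import (
	"fmt"
	"net"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

func main() {
	// Illustrative names: host-side veth and pod network namespace path.
	const hostVethName = "eni2b77055a6e0"
	const podNSPath = "/var/run/netns/cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1"

	// Read the host veth's current MAC in the root namespace.
	hostVeth, err := netlink.LinkByName(hostVethName)
	if err != nil {
		panic(err)
	}
	hostMAC := hostVeth.Attrs().HardwareAddr

	gw := net.IPv4(169, 254, 1, 1)
	err = ns.WithNetNSPath(podNSPath, func(ns.NetNS) error {
		eth0, err := netlink.LinkByName("eth0")
		if err != nil {
			return err
		}
		// List the neighbor (ARP) entries on the pod's eth0 and find the
		// permanent entry for the 169.254.1.1 gateway.
		neighs, err := netlink.NeighList(eth0.Attrs().Index, netlink.FAMILY_V4)
		if err != nil {
			return err
		}
		for _, n := range neighs {
			if n.IP.Equal(gw) {
				fmt.Printf("ARP entry: %s, host veth MAC now: %s, match: %v\n",
					n.HardwareAddr, hostMAC,
					n.HardwareAddr.String() == hostMAC.String())
			}
		}
		return nil
	})
	if err != nil {
		panic(err)
	}
}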

@veshij
Contributor

veshij commented Oct 22, 2022

Yes, that's exactly what happens on my system. I can confirm that on u22 (running a newer kernel) the MAC address of the host's veth doesn't match the static ARP record inside the pod. What's more, that MAC address is not used on any other interface. Exactly the same CNI binary running on u20 (with an older kernel) has no issues.

I'm troubleshooting it a bit further. I don't think it's a bug in the CNI code; currently I suspect either an issue with the netlink implementation/kernel netlink interface, or the MAC address changing over time on the veth interface (something similar to IPv6's privacy extensions).

root@bwi1a-i-0fee9155633983366:~# ip netns exec cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1 ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
...
3: eth0@if127: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 4a:84:9e:7e:1a:40 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.13.51/32 scope global eth0
    ...

root@bwi1a-i-0fee9155633983366:~# ping -c 1 10.244.13.51
PING 10.244.13.51 (10.244.13.51) 56(84) bytes of data.

--- 10.244.13.51 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

root@bwi1a-i-0fee9155633983366:~# ip netns exec cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1 arp -na
? (169.254.1.1) at 9e:67:e4:1f:87:74 [ether] PERM on eth0
? (10.244.9.159) at 32:10:f0:bf:bd:44 [ether] on eth0
root@bwi1a-i-0fee9155633983366:~# ip link | grep cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1
    link/ether 32:10:f0:bf:bd:44 brd ff:ff:ff:ff:ff:ff link-netns cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1
root@bwi1a-i-0fee9155633983366:~# ip netns exec cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1 arp -d 169.254.1.1
root@bwi1a-i-0fee9155633983366:~# ip netns exec cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1 arp -s 169.254.1.1 32:10:f0:bf:bd:44
root@bwi1a-i-0fee9155633983366:~# ping -c 1 10.244.13.51
PING 10.244.13.51 (10.244.13.51) 56(84) bytes of data.
64 bytes from 10.244.13.51: icmp_seq=1 ttl=64 time=0.023 ms

--- 10.244.13.51 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms

@veshij
Contributor

veshij commented Oct 22, 2022

This test case triggers the issue both in AWS and on-prem on kernel 5.15.
It doesn't trigger the issue on 5.13.

package main

import (
	"flag"
	"fmt"
	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
	"net"
	"path/filepath"
	"time"
)

const (
	vethName = "veth0"
)

func main() {
	namespace := flag.String("ns", "", "")
	flag.Parse()

	err := ns.WithNetNSPath(filepath.Join("/var/run/netns", *namespace), func(hostNs ns.NetNS) error {
		veth := &netlink.Veth{
			LinkAttrs: netlink.LinkAttrs{
				Name:  "eth0",
				Flags: net.FlagUp,
				MTU:   1234,
			},
			PeerName: "veth0",
		}
		if err := netlink.LinkAdd(veth); err != nil {
			panic(err)
		}

		hostLink, err := netlink.LinkByName(vethName)
		if err != nil {
			panic(err)
		}
		fmt.Printf("Host link mac address inside netns: %+v\n", hostLink.Attrs().HardwareAddr)

		// Move it to root namespace.
		if err = netlink.LinkSetNsFd(hostLink, int(hostNs.Fd())); err != nil {
			panic(err)
		}

		return nil
	})
	if err != nil {
		panic(err)
	}

	hostLink, err := netlink.LinkByName(vethName)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Host link (id=%d) mac address in root ns: %+v\n", hostLink.Attrs().Index, hostLink.Attrs().HardwareAddr)
	time.Sleep(time.Second)

	hostLink, err = netlink.LinkByName("veth0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Host link (id=%d) mac address in root ns after sleep: %+v\n", hostLink.Attrs().Index, hostLink.Attrs().HardwareAddr)
}
root@iad8a-rk36-17a:~# ip netns del test >/dev/null;  ip netns add test; sleep 1; ./arp -ns test
I1022 06:03:04.589008 96147 main.go:34] Host link mac address inside netns: 8a:65:fc:48:7b:b2
I1022 06:03:04.633549 96147 main.go:71] Host link mac address in root ns: 8a:65:fc:48:7b:b2
I1022 06:03:05.633731 96147 main.go:78] Host link mac address in root ns after sleep: 7a:0f:ba:38:59:59

@veshij
Contributor

veshij commented Oct 22, 2022

Looks like it's udev.

root@iad8a-rk36-17a:~# cat /usr/lib/systemd/network/99-default.link | grep -v '^#'

[Match]
OriginalName=*

[Link]
NamePolicy=keep kernel database onboard slot path
AlternativeNamesPolicy=database onboard slot path
MACAddressPolicy=persistent

https://www.freedesktop.org/software/systemd/man/systemd.link.html

MACAddressPolicy=persistent

This feature depends on ID_NET_NAME_* properties to exist for the link. On hardware where these properties are not set, the generation of a persistent MAC address will fail.

u20:

root@sjc8d-rl10-7a:~# udevadm test /devices/virtual/net/veth0 |& grep ID_NET_NAME
root@sjc8d-rl10-7a:~#

u22:

root@iad8a-rk36-17a:/usr/lib/systemd/network# udevadm test /devices/virtual/net/veth0 |& grep ID_NET_NAME
ID_NET_NAME=veth0

With udevadm control --stop-exec-queue, the MAC address remains constant.

We likely want to fix the implementation on the CNI side; I suppose changing the order to create the veth pair in the root namespace first and then move the device to the pod netns should be a reasonable workaround (a rough sketch of that ordering follows).
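
A minimal sketch of that reordering, in the same style as the test case above (names are illustrative and this is not the eventual CNI patch; the container end is created under a temporary name to avoid clashing with the host's eth0):

package main

import (
	"net"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

func main() {
	// Illustrative values only.
	const podNSPath = "/var/run/netns/test"
	const hostVethName = "veth0"
	const tmpContVethName = "tmpcon0"

	// 1. Create the veth pair in the root (host) namespace, not inside the
	//    pod namespace, so the host end is processed by udev where it lives.
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: hostVethName, Flags: net.FlagUp},
		PeerName:  tmpContVethName,
	}
	if err := netlink.LinkAdd(veth); err != nil {
		panic(err)
	}

	contVeth, err := netlink.LinkByName(tmpContVethName)
	if err != nil {
		panic(err)
	}

	podNS, err := ns.GetNS(podNSPath)
	if err != nil {
		panic(err)
	}
	defer podNS.Close()

	// 2. Move only the container end into the pod namespace; the host end
	//    never leaves the root namespace.
	if err := netlink.LinkSetNsFd(contVeth, int(podNS.Fd())); err != nil {
		panic(err)
	}

	// 3. Rename the container end to eth0 inside the pod namespace.
	err = podNS.Do(func(ns.NetNS) error {
		link, err := netlink.LinkByName(tmpContVethName)
		if err != nil {
			return err
		}
		return netlink.LinkSetName(link, "eth0")
	})
	if err != nil {
		panic(err)
	}
}

As the later comments suggest, this ordering alone may not be enough, since systemd-udevd can still rewrite the host veth's MAC asynchronously after creation; that is what motivated the onlink-route configuration and the MACAddressPolicy=none workaround further down in the thread.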

@veshij
Contributor

veshij commented Oct 24, 2022

@jayanthvn @achevuru what do you think?

More conventional approach:

  • create veth pair in host namespace
  • move pod's veth to pod namespace
  • sleep a bit to make sure udev is done
  • configure pod's namespace with a correct mac address

Another option is to leave almost everything as is:

  • create veth pair in namespace
  • move host veth to host namespace
  • return to host namespace
  • sleep a bit to make sure udev is done
  • return back to pod namespace and configure arp entry/route

Unfortunately I'm not sure how to make it work without a sleep of some magic duration (it takes 100-200 ms on my system, but it can be worse if the host is heavily loaded).

@veshij
Contributor

veshij commented Oct 24, 2022

Actually, I can't even repro the issue with the first approach, so I will probably implement that. Update: I can repro it.

@veshij
Contributor

veshij commented Oct 25, 2022

> More conventional approach:
> Another option is to leave almost everything as is:

Scratch that. A more correct configuration, which does not require messing with ARP entries or udev (a netlink sketch of these two steps follows the list):

  • configure the link-local address 169.254.1.1 on the host's veth:
ip addr add 169.254.1.1/32 dev eni2b77055a6e0 scope link
  • configure an onlink default route inside the container:
ip ro add default via 169.254.1.1 dev eth0 onlink
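
For reference, a rough netlink equivalent of the same two steps - a sketch only, not the code from #2118; the host veth name and pod netns path are illustrative, and the route is added from inside the pod's network namespace:

package main

import (
	"net"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

func main() {
	// Illustrative names from the examples above.
	const hostVethName = "eni2b77055a6e0"
	const podNSPath = "/var/run/netns/cni-ba6e84fd-1673-afbb-fc11-fb995dc2c5b1"

	gw := net.IPv4(169, 254, 1, 1).To4()

	// ip addr add 169.254.1.1/32 dev eni2b77055a6e0 scope link  (host namespace)
	hostVeth, err := netlink.LinkByName(hostVethName)
	if err != nil {
		panic(err)
	}
	addr := &netlink.Addr{
		IPNet: &net.IPNet{IP: gw, Mask: net.CIDRMask(32, 32)},
		Scope: int(netlink.SCOPE_LINK),
	}
	if err := netlink.AddrAdd(hostVeth, addr); err != nil {
		panic(err)
	}

	// ip ro add default via 169.254.1.1 dev eth0 onlink  (pod namespace)
	err = ns.WithNetNSPath(podNSPath, func(ns.NetNS) error {
		eth0, err := netlink.LinkByName("eth0")
		if err != nil {
			return err
		}
		route := &netlink.Route{
			LinkIndex: eth0.Attrs().Index,
			Gw:        gw,
			Flags:     int(netlink.FLAG_ONLINK),
		}
		return netlink.RouteAdd(route)
	})
	if err != nil {
		panic(err)
	}
}

This approach avoids depending on the host-side veth's MAC address at all, so it no longer matters whether udev rewrites it.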

@veshij
Contributor

veshij commented Oct 25, 2022

#2118

@kwohlfahrt
Contributor

FWIW, I found that applying this workaround during node setup resolved the issue:

mkdir -p /etc/systemd/network/99-default.link.d/
cat <<EOF > /etc/systemd/network/99-default.link.d/aws-cni-workaround.conf
[Link]
MACAddressPolicy=none
EOF

@jayanthvn
Contributor

Thanks @kwohlfahrt

The proposed PR #2118 changes the order in the CNI to create the veth pair in the root namespace and then move the device to the pod netns.

@heybronson

Was this resolved?

@kishorj
Contributor

kishorj commented Jul 21, 2023

/reopen

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Sep 20, 2023
@hakman

hakman commented Oct 2, 2023

This still needs a fix

@jdn5126
Contributor

jdn5126 commented Oct 2, 2023

@hakman unless #2118 gets revived and made usable, the fix is to set MACAddressPolicy=none on Ubuntu 22.x

@hakman

hakman commented Oct 2, 2023

> @hakman unless #2118 gets revived and made usable, the fix is to set MACAddressPolicy=none on Ubuntu 22.x

Thanks @jdn5126!

@github-actions github-actions bot removed the stale Issue or PR is stale label Oct 3, 2023
@pmankad96

pmankad96 commented Oct 13, 2023

It also prevents a new kOps cluster with networking=amazonvpc from coming up healthy. In my case the core-dns-xx and ebs-csi-node pods kept crashing. For core-dns the log read: plugin/error timeout when trying to connect to the Amazon-provided DNS server. For ebs-csi-node the error was about being unable to get the Node (it was trying 100.64. - not sure why). The workaround is to use the 20.04 image instead. The error messages are so cryptic that it took me a while to figure this out.

@btalbot

btalbot commented Oct 18, 2023

So running in AWS with spec.networking.amazonvpc and also using awsEBSCSIDriver is broken? I'm trying to upgrade my test cluster from 1.25 to 1.26, and the ebs-csi-node pod's ebs-plugin container on the new masters keeps crash-looping with this log:

+ kube-system ebs-csi-node-sjkv7 › ebs-plugin
kube-system ebs-csi-node-sjkv7 ebs-plugin I1018 23:46:11.891261       1 metadata.go:101] kubernetes api is available
kube-system ebs-csi-node-sjkv7 ebs-plugin panic: error getting Node i-04bddcf2fcb369bae: Get "https://100.64.0.1:443/api/v1/nodes/i-04bddcf2fcb369bae": dial tcp 100.64.0.1:443: i/o timeout
kube-system ebs-csi-node-sjkv7 ebs-plugin
kube-system ebs-csi-node-sjkv7 ebs-plugin goroutine 1 [running]:
kube-system ebs-csi-node-sjkv7 ebs-plugin github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00003f540)
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:94 +0x345
kube-system ebs-csi-node-sjkv7 ebs-plugin github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc00054df30, 0x8, 0x3684458?})
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:95 +0x393
kube-system ebs-csi-node-sjkv7 ebs-plugin main.main()
kube-system ebs-csi-node-sjkv7 ebs-plugin 	/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:46 +0x37d
- kube-system ebs-csi-node-sjkv7 › ebs-plugin

Running amazonvpc networking with ebs-csi seems like a pretty common use case to be so broken.

@jdn5126
Contributor

jdn5126 commented Oct 19, 2023

@pmankad96 @btalbot I suggest filing a support case for this so that it can be investigated further


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Dec 19, 2023
@btalbot

btalbot commented Dec 19, 2023

I haven't seen any comments or commits on this, so I presume that Ubuntu 22.04 is still broken on AWS running amazonvpc?

@hakman

hakman commented Dec 19, 2023

> I haven't seen any comments or commits on this, so I presume that Ubuntu 22.04 is still broken on AWS running amazonvpc?

Not yet 🥲

@jdn5126 jdn5126 removed the stale Issue or PR is stale label Dec 19, 2023
@jdn5126
Contributor

jdn5126 commented Dec 19, 2023

@btalbot Ubuntu 22.04 works on EKS, you just have to set MACAddressPolicy=none like the official EKS AMI does: https://github.com/awslabs/amazon-eks-ami/blob/master/scripts/install-worker.sh#L104

@jdn5126
Contributor

jdn5126 commented Jan 25, 2024

Closing this as complete, since the troubleshooting doc informs people to set MACAddressPolicy=none. AL2023 also does this automatically in the AMI, so it can be referenced by other distros: https://github.com/awslabs/amazon-eks-ami/blob/master/scripts/install-worker.sh#L104

@jdn5126 jdn5126 closed this as completed Jan 25, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
