# CI: K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing #21735
Found this flake also on a couple of 1.12 backport PRs:
Based on this being IPsec + timing, the likely culprit is one of:
### Initial Triage

Seems to fail about 1 in 5 times it runs. Skipped on net-next; failing on 4.9 and 5.4, but not on 4.19. Unclear why.

### Understanding What is Going On

Using https://jenkins.cilium.io/job/cilium-master-k8s-1.23-kernel-5.4/3366/, let's take a shortcut and check the IPsec error counters first. Many IPsec issues can be spotted that way:
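As a side note, these counters live in `/proc/net/xfrm_stat` and can be checked from a captured dump without a live node. A minimal sketch (the counter values below are made up for illustration, not taken from the actual sysdump) that filters out the nonzero counters:

```shell
# Print only the nonzero XFRM error counters from a captured
# /proc/net/xfrm_stat. Sample values are illustrative, not from
# the actual failure.
cat > xfrm_stat.sample <<'EOF'
XfrmInError              0
XfrmInNoPols             13
XfrmInTmplMismatch       0
XfrmOutError             0
XfrmOutPolBlock          0
EOF
awk '$2 != 0 { print }' xfrm_stat.sample   # prints the XfrmInNoPols line
```

`XfrmInNoPols` counts inbound packets dropped because no matching XFRM policy was found, which is exactly the "missing XFRM policy" symptom discussed below.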
Jackpot! The issue is somehow a missing XFRM policy.
The Jenkins output has:

### Tracing the Packets

Let's trace the packet from the Linux state we collected. The reply packet is 10.0.0.117 -> 10.0.1.188, leaving from k8s2.
We're using the IPsec key corresponding to SPI 6. There are no proxy redirects in place, so no L7 policies, and we don't need to care about XFRM policies for that. We'll match:
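To find which XFRM state the flow uses, we can filter a captured `ip xfrm state` dump by the outer source address. A sketch over abbreviated, hypothetical output (the real dump is much longer, and the mark values here are made up):

```shell
# Select the XFRM state block whose outer src is 10.0.0.81 from a
# captured `ip xfrm state` dump (hypothetical, abbreviated sample).
cat > xfrm_state.sample <<'EOF'
src 10.0.0.81 dst 10.0.0.74
    proto esp spi 0x00000006 reqid 1 mode tunnel
    mark 0x3e00/0xff00
src 10.0.1.205 dst 10.0.0.81
    proto esp spi 0x00000006 reqid 1 mode tunnel
    mark 0xd00/0xf00
EOF
# `keep` is toggled on each "src ..." header line; the indented lines
# that follow belong to the same state and are printed while keep is set.
awk '/^src/ { keep = ($2 == "10.0.0.81") } keep' xfrm_state.sample
```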
XFRM state matched:
So the packet goes out encrypted with the above key and arrives on k8s1 with outer IPs src 10.0.0.81, dst 10.0.0.74. Something's weird already. We're using ipam=cluster-pool here, so the outer destination IP should have the same /24 as the pods on k8s1; it corresponds to the cilium_host IP address. But we see 10.0.0.74 instead of e.g. 10.0.1.74. What's the IP address of k8s1's cilium_host interface? That's not good 😱
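The /24 check above can be written as a tiny shell helper (a sketch that only compares the first three octets, which is enough here since all pod CIDRs in this test are /24s):

```shell
# Do two IPv4 addresses fall in the same /24? Sufficient for this
# triage, where all pod CIDRs are /24s.
same_slash24() {
    [ "${1%.*}" = "${2%.*}" ]
}
# Outer dst from the trace vs. the destination pod on k8s1:
if same_slash24 10.0.0.74 10.0.1.188; then
    echo "outer dst is in the k8s1 pod /24"
else
    echo "outer dst is NOT in the k8s1 pod /24"   # this branch is taken
fi
```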
The XFRM state on k8s2 is incorrect: it should be src 10.0.0.81 dst 10.0.1.205.

### Tracing the Agent Bug

The Node and CiliumNode objects agree on the cilium_host IP address:
and they agree with k8s2. But that doesn't match the reality:
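In other words, the control plane and the datapath disagree about k8s1's cilium_host address. The concrete values in this sketch are inferred from the trace above, not quoted from the sysdump:

```shell
# What the Node/CiliumNode objects report vs. what is actually configured
# on k8s1's cilium_host interface (values inferred from the trace above).
k8s_says=10.0.0.74      # CiliumInternalIP in the Node/CiliumNode objects
iface_has=10.0.1.205    # address actually present on cilium_host
if [ "$k8s_says" != "$iface_has" ]; then
    echo "mismatch: k8s reports $k8s_says but cilium_host has $iface_has"
fi
```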
In the Cilium agent logs, we have:
So the router IP got changed because it didn't match the pod CIDR, which means the pod CIDR itself changed. Using the Cilium logs collected for previous, succeeding tests, we can go back to the test where the pod CIDR changed: K8sUpdates. It turns out K8sUpdates deletes the CiliumNode objects and cleans the filesystem. Cilium then gets the CiliumInternalIP from the k8s annotations and the Cilium operator assigns a pod CIDR. If the pod CIDR changed, the CiliumInternalIP doesn't belong to the new pod CIDR and we get the mismatch shown in the logs just above.
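The consistency check the agent effectively performs at startup can be sketched as follows. This is a simplification assuming /24 pod CIDRs (the cluster-pool default used in this test); the helper name is made up:

```shell
# Sketch of the restore-time check: does the router IP restored from the
# k8s annotation (CiliumInternalIP) fall inside the newly allocated pod
# CIDR? Only /24 CIDRs are handled; values are from the failure above.
in_cidr24() {
    ip=$1; cidr=$2
    [ "${cidr##*/}" = "24" ] || { echo "only /24 supported" >&2; return 2; }
    [ "${ip%.*}" = "${cidr%.*/24}" ]
}
restored_router_ip=10.0.0.74   # CiliumInternalIP from the k8s annotation
new_pod_cidr=10.0.1.0/24       # pod CIDR re-allocated after K8sUpdates
if ! in_cidr24 "$restored_router_ip" "$new_pod_cidr"; then
    echo "mismatch: $restored_router_ip not in $new_pod_cidr; agent picks a new router IP"
fi
```

When the check fails, the agent allocates a fresh router IP from the new CIDR, which is exactly the "router IP got changed" log message above, while k8s2 keeps building XFRM states from the stale annotation.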
### Failure Output

FAIL: Connectivity test between nodes failed