# CI: K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing #21735
Found this flake also on a couple of 1.12 backport PRs:
Based on this being IPsec + timing, the likely culprit is one of:
### Initial Triage

Seems to fail about 1 in 5 times it runs. Skipped on net-next; failing on 4.9 and 5.4, but not on 4.19. Unclear why.

### Understanding What is Going On

Using https://jenkins.cilium.io/job/cilium-master-k8s-1.23-kernel-5.4/3366/, let's take a shortcut and check the IPsec error counters first. Many IPsec issues can be spotted that way:
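As a side note, these counters live in `/proc/net/xfrm_stat` and can be checked from a captured dump without a live node. A minimal sketch (the counter values below are made up for illustration, not taken from the actual sysdump) that filters out the nonzero counters:

```shell
# Print only the nonzero XFRM error counters from a captured
# /proc/net/xfrm_stat. Sample values are illustrative, not from
# the actual failure.
cat > xfrm_stat.sample <<'EOF'
XfrmInError              0
XfrmInNoPols             13
XfrmInTmplMismatch       0
XfrmOutError             0
XfrmOutPolBlock          0
EOF
awk '$2 != 0 { print }' xfrm_stat.sample   # prints the XfrmInNoPols line
```

`XfrmInNoPols` counts inbound packets dropped because no matching XFRM policy was found, which is exactly the "missing XFRM policy" symptom discussed below.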
Jackpot! The issue is somehow a missing XFRM policy.
The Jenkins output has:

### Tracing the Packets

Let's trace the packet from the Linux state we collected. The reply packet is 10.0.0.117 -> 10.0.1.188, leaving from k8s2.
We're using the IPsec key corresponding to SPI 6. There are no proxy redirects in place, so no L7 policies, and we don't need to care about XFRM policies for that. We'll match:
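To find which XFRM state the flow uses, we can filter a captured `ip xfrm state` dump by the outer source address. A sketch over abbreviated, hypothetical output (the real dump is much longer, and the mark values here are made up):

```shell
# Select the XFRM state block whose outer src is 10.0.0.81 from a
# captured `ip xfrm state` dump (hypothetical, abbreviated sample).
cat > xfrm_state.sample <<'EOF'
src 10.0.0.81 dst 10.0.0.74
    proto esp spi 0x00000006 reqid 1 mode tunnel
    mark 0x3e00/0xff00
src 10.0.1.205 dst 10.0.0.81
    proto esp spi 0x00000006 reqid 1 mode tunnel
    mark 0xd00/0xf00
EOF
# `keep` is toggled on each "src ..." header line; the indented lines
# that follow belong to the same state and are printed while keep is set.
awk '/^src/ { keep = ($2 == "10.0.0.81") } keep' xfrm_state.sample
```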
XFRM state matched:
So the packet goes out encrypted with the above key and arrives on k8s1 with outer IPs src 10.0.0.81, dst 10.0.0.74. Something's weird already. We're using ipam=cluster-pool here, so the outer destination IP should have the same /24 as the pods on k8s1; it corresponds to the cilium_host IP address. But we see 10.0.0.74 instead of e.g. 10.0.1.74. What's the IP address of k8s1's cilium_host interface? That's not good 😱
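The /24 check above can be written as a tiny shell helper (a sketch that only compares the first three octets, which is enough here since all pod CIDRs in this test are /24s):

```shell
# Do two IPv4 addresses fall in the same /24? Sufficient for this
# triage, where all pod CIDRs are /24s.
same_slash24() {
    [ "${1%.*}" = "${2%.*}" ]
}
# Outer dst from the trace vs. the destination pod on k8s1:
if same_slash24 10.0.0.74 10.0.1.188; then
    echo "outer dst is in the k8s1 pod /24"
else
    echo "outer dst is NOT in the k8s1 pod /24"   # this branch is taken
fi
```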
The XFRM state on k8s2 is incorrect: it should be src 10.0.0.81 dst 10.0.1.205.

### Tracing the Agent Bug

The Node and CiliumNode objects agree on the cilium_host IP address:
and they agree with k8s2. But that doesn't match the reality:
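In other words, the control plane and the datapath disagree about k8s1's cilium_host address. The concrete values in this sketch are inferred from the trace above, not quoted from the sysdump:

```shell
# What the Node/CiliumNode objects report vs. what is actually configured
# on k8s1's cilium_host interface (values inferred from the trace above).
k8s_says=10.0.0.74      # CiliumInternalIP in the Node/CiliumNode objects
iface_has=10.0.1.205    # address actually present on cilium_host
if [ "$k8s_says" != "$iface_has" ]; then
    echo "mismatch: k8s reports $k8s_says but cilium_host has $iface_has"
fi
```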
In the Cilium agent logs, we have:
So the router IP got changed because it didn't match the pod CIDR, which means the pod CIDR itself changed. Using the Cilium logs collected for previous, succeeding tests, we can go back to the test where the pod CIDR changed: K8sUpdates. It turns out K8sUpdates deletes the CiliumNode objects and cleans the filesystem. Cilium then gets the CiliumInternalIP from the k8s annotations and the Cilium operator assigns a pod CIDR. If the pod CIDR changed, the CiliumInternalIP doesn't belong to the new pod CIDR and we get the mismatch shown in the logs just above.
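The consistency check the agent effectively performs at startup can be sketched as follows. This is a simplification assuming /24 pod CIDRs (the cluster-pool default used in this test); the helper name is made up:

```shell
# Sketch of the restore-time check: does the router IP restored from the
# k8s annotation (CiliumInternalIP) fall inside the newly allocated pod
# CIDR? Only /24 CIDRs are handled; values are from the failure above.
in_cidr24() {
    ip=$1; cidr=$2
    [ "${cidr##*/}" = "24" ] || { echo "only /24 supported" >&2; return 2; }
    [ "${ip%.*}" = "${cidr%.*/24}" ]
}
restored_router_ip=10.0.0.74   # CiliumInternalIP from the k8s annotation
new_pod_cidr=10.0.1.0/24       # pod CIDR re-allocated after K8sUpdates
if ! in_cidr24 "$restored_router_ip" "$new_pod_cidr"; then
    echo "mismatch: $restored_router_ip not in $new_pod_cidr; agent picks a new router IP"
fi
```

When the check fails, the agent allocates a fresh router IP from the new CIDR, which is exactly the "router IP got changed" log message above, while k8s2 keeps building XFRM states from the stale annotation.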
### Failure Output

FAIL: Connectivity test between nodes failed