CI: K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing #21735

Closed
qmonnet opened this issue Oct 14, 2022 · 3 comments · Fixed by #22127
Assignees
Labels
area/CI Continuous Integration testing issue or flake area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. ci/flake This is a known failure that occurs in the tree. Please investigate me! sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

qmonnet commented Oct 14, 2022

Test Name

- K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing
  (first to fail; output in the following section is for that test)

- K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing with bpf_host

Failure Output

FAIL: Connectivity test between nodes failed

Stack Trace

/home/jenkins/workspace/Cilium-PR-K8s-1.16-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:527
Connectivity test between nodes failed
Expected
    <bool>: false
to be true
/home/jenkins/workspace/Cilium-PR-K8s-1.16-kernel-4.9/src/github.com/cilium/cilium/test/k8s/datapath_configuration.go:615

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
⚠️  Number of "level=warning" in logs: 6
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
Top 4 errors/warnings:
CONFIG_CGROUP_BPF optional kernel parameter is not in kernel (needed for: Host Reachable Services and Sockmap optimization)
CONFIG_LWTUNNEL_BPF optional kernel parameter is not in kernel (needed for: Lightweight Tunnel hook for IP-in-IP encapsulation)
Key allocation attempt failed
Unable to install direct node route {Ifindex: 0 Dst: fd02::100/120 Src: <nil> Gw: <nil> Flags: [] Table: 0 Realm: 0}
Cilium pods: [cilium-85fkb cilium-ckdvx]
Netpols loaded: 
CiliumNetworkPolicies loaded: 202210131819k8sdatapathconfigtransparentencryptiondirectrouting::l3-policy-demo 
Endpoint Policy Enforcement:
Pod                          Ingress   Egress
testclient-tc895             false     false
testds-62dw4                 false     false
testds-qfzkm                 false     false
coredns-8cfc78c54-4pdk9      false     false
test-k8s2-5b756fd6c5-ptfbt   false     false
testclient-2-74r55           false     false
testclient-2-nmb4b           false     false
testclient-mm5mb             false     false
Cilium agent 'cilium-85fkb': Status: Ok  Health: Ok Nodes "" ContainerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 41 Failed 0
Cilium agent 'cilium-ckdvx': Status: Ok  Health: Ok Nodes "" ContainerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 33 Failed 0

Standard Error

18:18:24 STEP: Running BeforeAll block for EntireTestsuite K8sDatapathConfig Transparent encryption DirectRouting
18:18:25 STEP: Deploying ipsec_secret.yaml in namespace kube-system
18:18:25 STEP: Installing Cilium
18:18:26 STEP: Waiting for Cilium to become ready
18:19:09 STEP: Validating if Kubernetes DNS is deployed
18:19:09 STEP: Checking if deployment is ready
18:19:09 STEP: Checking if kube-dns service is plumbed correctly
18:19:09 STEP: Checking if pods have identity
18:19:09 STEP: Checking if DNS can resolve
18:19:25 STEP: Kubernetes DNS is not ready: unable to resolve service name kubernetes.default.svc.cluster.local with DNS server 10.96.0.10 by running 'dig +short kubernetes.default.svc.cluster.local @10.96.0.10' Cilium pod: Exitcode: 9 
Err: exit status 9
Stdout:
 	 ;; connection timed out; no servers could be reached
	 
	 
Stderr:
 	 command terminated with exit code 9
	 

18:19:25 STEP: Restarting Kubernetes DNS (-l k8s-app=kube-dns)
18:19:25 STEP: Waiting for Kubernetes DNS to become operational
18:19:25 STEP: Checking if deployment is ready
18:19:25 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:26 STEP: Checking if deployment is ready
18:19:26 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:27 STEP: Checking if deployment is ready
18:19:27 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:28 STEP: Checking if deployment is ready
18:19:28 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:29 STEP: Checking if deployment is ready
18:19:29 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:30 STEP: Checking if deployment is ready
18:19:30 STEP: Kubernetes DNS is not ready yet: only 0 of 1 replicas are available
18:19:31 STEP: Checking if deployment is ready
18:19:31 STEP: Checking if kube-dns service is plumbed correctly
18:19:31 STEP: Checking if pods have identity
18:19:31 STEP: Checking if DNS can resolve
18:19:31 STEP: Validating Cilium Installation
18:19:31 STEP: Performing Cilium controllers preflight check
18:19:31 STEP: Performing Cilium health check
18:19:31 STEP: Checking whether host EP regenerated
18:19:31 STEP: Performing Cilium status preflight check
18:19:32 STEP: Performing Cilium service preflight check
18:19:32 STEP: Performing K8s service preflight check
18:19:34 STEP: Waiting for cilium-operator to be ready
18:19:34 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
18:19:34 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
18:19:34 STEP: Making sure all endpoints are in ready state
18:19:35 STEP: Creating namespace 202210131819k8sdatapathconfigtransparentencryptiondirectrouting
18:19:35 STEP: Deploying demo_ds.yaml in namespace 202210131819k8sdatapathconfigtransparentencryptiondirectrouting
18:19:36 STEP: Applying policy /home/jenkins/workspace/Cilium-PR-K8s-1.16-kernel-4.9/src/github.com/cilium/cilium/test/k8s/manifests/l3-policy-demo.yaml
18:19:40 STEP: Waiting for 4m0s for 7 pods of deployment demo_ds.yaml to become ready
18:19:40 STEP: WaitforNPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="")
18:19:46 STEP: WaitforNPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="") => <nil>
18:19:46 STEP: Checking pod connectivity between nodes
18:19:46 STEP: WaitforPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="-l zgroup=testDSClient")
18:19:46 STEP: WaitforPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="-l zgroup=testDSClient") => <nil>
18:19:46 STEP: WaitforPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="-l zgroup=testDS")
18:19:46 STEP: WaitforPods(namespace="202210131819k8sdatapathconfigtransparentencryptiondirectrouting", filter="-l zgroup=testDS") => <nil>
FAIL: Connectivity test between nodes failed
Expected
    <bool>: false
to be true
=== Test Finished at 2022-10-13T18:19:55Z====
18:19:55 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
===================== TEST FAILED =====================
18:19:55 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathConfig
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE                                                         NAME                               READY   STATUS    RESTARTS   AGE   IP              NODE   NOMINATED NODE   READINESS GATES
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   test-k8s2-5b756fd6c5-ptfbt         2/2     Running   0          22s   10.0.1.161      k8s2   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testclient-2-74r55                 1/1     Running   0          22s   10.0.0.235      k8s1   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testclient-2-nmb4b                 1/1     Running   0          22s   10.0.1.49       k8s2   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testclient-mm5mb                   1/1     Running   0          22s   10.0.1.118      k8s2   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testclient-tc895                   1/1     Running   0          22s   10.0.0.32       k8s1   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testds-62dw4                       2/2     Running   0          22s   10.0.0.57       k8s1   <none>           <none>
	 202210131819k8sdatapathconfigtransparentencryptiondirectrouting   testds-qfzkm                       2/2     Running   0          22s   10.0.1.62       k8s2   <none>           <none>
	 cilium-monitoring                                                 grafana-7fd557d749-89499           0/1     Running   0          30m   10.0.0.145      k8s1   <none>           <none>
	 cilium-monitoring                                                 prometheus-d87f8f984-w9bn7         1/1     Running   0          30m   10.0.0.205      k8s1   <none>           <none>
	 kube-system                                                       cilium-85fkb                       1/1     Running   0          91s   192.168.56.12   k8s2   <none>           <none>
	 kube-system                                                       cilium-ckdvx                       1/1     Running   0          91s   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       cilium-operator-5c955b98b8-45g5f   1/1     Running   0          91s   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       cilium-operator-5c955b98b8-t79r5   1/1     Running   0          91s   192.168.56.12   k8s2   <none>           <none>
	 kube-system                                                       coredns-8cfc78c54-4pdk9            1/1     Running   0          32s   10.0.1.104      k8s2   <none>           <none>
	 kube-system                                                       etcd-k8s1                          1/1     Running   0          33m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       kube-apiserver-k8s1                1/1     Running   0          33m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       kube-controller-manager-k8s1       1/1     Running   3          33m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       kube-proxy-cwltp                   1/1     Running   0          34m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       kube-proxy-g5dd9                   1/1     Running   0          31m   192.168.56.12   k8s2   <none>           <none>
	 kube-system                                                       kube-scheduler-k8s1                1/1     Running   2          34m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       log-gatherer-c8lwj                 1/1     Running   0          30m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       log-gatherer-r2hzb                 1/1     Running   0          30m   192.168.56.12   k8s2   <none>           <none>
	 kube-system                                                       registry-adder-4bjgj               1/1     Running   0          31m   192.168.56.11   k8s1   <none>           <none>
	 kube-system                                                       registry-adder-9x6l2               1/1     Running   0          31m   192.168.56.12   k8s2   <none>           <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-85fkb cilium-ckdvx]
cmd: kubectl exec -n kube-system cilium-85fkb -c cilium-agent -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                 Ok   Disabled
	 Kubernetes:              Ok   1.16 (v1.16.15) [linux/amd64]
	 Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:    Disabled   
	 Host firewall:           Disabled
	 CNI Chaining:            none
	 Cilium:                  Ok   1.12.90 (v1.12.90-fdcb68fb)
	 NodeMonitor:             Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:    Ok   
	 IPAM:                    IPv4: 7/254 allocated from 10.0.1.0/24, IPv6: 7/254 allocated from fd02::100/120
	 IPv6 BIG TCP:            Disabled
	 BandwidthManager:        Disabled
	 Host Routing:            Legacy
	 Masquerading:            IPTables [IPv4: Enabled, IPv6: Enabled]
	 Controller Status:       41/41 healthy
	 Proxy Status:            OK, ip 10.0.1.197, 0 redirects active on ports 10000-20000
	 Global Identity Range:   min 256, max 65535
	 Hubble:                  Ok   Current/Max Flows: 2065/65535 (3.15%), Flows/s: 25.94   Metrics: Disabled
	 Encryption:              IPsec
	 Cluster health:          1/2 reachable   (2022-10-13T18:19:44Z)
	   Name                   IP              Node      Endpoints
	   k8s2 (localhost)       192.168.56.12   unknown   unreachable
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-85fkb -c cilium-agent -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                                       IPv6        IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                                                             
	 236        Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s2                                                                                                 ready   
	                                                            reserved:host                                                                                                                      
	 329        Disabled           Disabled          56438      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::162   10.0.1.49    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                    
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                    
	                                                            k8s:zgroup=testDSClient2                                                                                                           
	 384        Disabled           Disabled          20699      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::122   10.0.1.161   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                    
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                    
	                                                            k8s:zgroup=test-k8s2                                                                                                               
	 698        Enabled            Disabled          56483      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::161   10.0.1.62    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                    
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                    
	                                                            k8s:zgroup=testDS                                                                                                                  
	 803        Disabled           Disabled          53574      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::185   10.0.1.118   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                    
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                    
	                                                            k8s:zgroup=testDSClient                                                                                                            
	 902        Disabled           Disabled          4          reserved:health                                                                                   fd02::13e   10.0.1.168   ready   
	 3578       Disabled           Disabled          15138      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::132   10.0.1.104   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=coredns                                                                                    
	                                                            k8s:io.kubernetes.pod.namespace=kube-system                                                                                        
	                                                            k8s:k8s-app=kube-dns                                                                                                               
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-ckdvx -c cilium-agent -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                 Ok   Disabled
	 Kubernetes:              Ok   1.16 (v1.16.15) [linux/amd64]
	 Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:    Disabled   
	 Host firewall:           Disabled
	 CNI Chaining:            none
	 Cilium:                  Ok   1.12.90 (v1.12.90-fdcb68fb)
	 NodeMonitor:             Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:    Ok   
	 IPAM:                    IPv4: 5/254 allocated from 10.0.0.0/24, IPv6: 5/254 allocated from fd02::/120
	 IPv6 BIG TCP:            Disabled
	 BandwidthManager:        Disabled
	 Host Routing:            Legacy
	 Masquerading:            IPTables [IPv4: Enabled, IPv6: Enabled]
	 Controller Status:       33/33 healthy
	 Proxy Status:            OK, ip 10.0.0.211, 0 redirects active on ports 10000-20000
	 Global Identity Range:   min 256, max 65535
	 Hubble:                  Ok   Current/Max Flows: 4678/65535 (7.14%), Flows/s: 57.00   Metrics: Disabled
	 Encryption:              IPsec
	 Cluster health:          0/2 reachable   (2022-10-13T18:19:40Z)
	   Name                   IP              Node      Endpoints
	   k8s1 (localhost)       192.168.56.11   unknown   unreachable
	   k8s2                   192.168.56.12   unknown   unreachable
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-ckdvx -c cilium-agent -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                                       IPv6       IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                                                            
	 501        Disabled           Disabled          53574      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::ae   10.0.0.32    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                   
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                   
	                                                            k8s:zgroup=testDSClient                                                                                                           
	 1083       Disabled           Disabled          56438      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::f9   10.0.0.235   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                   
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                   
	                                                            k8s:zgroup=testDSClient2                                                                                                          
	 1211       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s1                                                                                                ready   
	                                                            k8s:node-role.kubernetes.io/master                                                                                                
	                                                            reserved:host                                                                                                                     
	 2661       Enabled            Disabled          56483      k8s:io.cilium.k8s.policy.cluster=default                                                          fd02::7    10.0.0.57    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                   
	                                                            k8s:io.kubernetes.pod.namespace=202210131819k8sdatapathconfigtransparentencryptiondirectrouting                                   
	                                                            k8s:zgroup=testDS                                                                                                                 
	 3367       Disabled           Disabled          4          reserved:health                                                                                   fd02::d    10.0.0.168   ready   
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
18:20:17 STEP: Running AfterEach for block EntireTestsuite K8sDatapathConfig
18:20:17 STEP: Deleting deployment demo_ds.yaml
18:20:17 STEP: Deleting deployment ipsec_secret.yaml
18:20:18 STEP: Deleting namespace 202210131819k8sdatapathconfigtransparentencryptiondirectrouting
18:20:33 STEP: Running AfterEach for block EntireTestsuite

Resources

@qmonnet qmonnet added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Oct 14, 2022
jibi commented Oct 17, 2022

@pchaigno pchaigno added the area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. label Oct 18, 2022
@aanm aanm added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Oct 26, 2022
@pchaigno pchaigno self-assigned this Nov 11, 2022
pchaigno commented

Initial Triage

Seems to fail in about 1 out of 5 runs.

Skipped on net-next; failing on 4.9 and 5.4, but not on 4.19. Unclear why.

Understanding What is Going On

Using https://jenkins.cilium.io/job/cilium-master-k8s-1.23-kernel-5.4/3366/ and
ipsec-flake.zip

Let's take a shortcut and check the IPsec error counters first; many IPsec issues show up there:

$ grep -v "0$" bugtool-cilium-*/cmd/cat--proc-net-xfrm_stat.md 
bugtool-cilium-665vh/cmd/cat--proc-net-xfrm_stat.md:XfrmInNoPols            	5

Jackpot! XfrmInNoPols counts inbound packets dropped because no matching XFRM policy was found, so an XFRM policy is somehow missing.
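
For reference, the same counters can be read on a live cluster rather than from a sysdump (a sketch; <cilium-pod> is a placeholder for one of the cilium agent pods):

# Print only the non-zero XFRM error counters on the node.
$ kubectl -n kube-system exec <cilium-pod> -c cilium-agent -- awk '$2 != 0' /proc/net/xfrm_stat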

From test-output.log:

time="2022-11-11T01:17:11Z" level=error msg="Error executing command 'kubectl exec -n 202211110116k8sdatapathconfigtransparentencryptiondirectrouting testclient-84dnc -- ping -W 5 -c 5 10.0.0.117'" error="exit status 1"
cmd: "kubectl exec -n 202211110116k8sdatapathconfigtransparentencryptiondirectrouting testclient-84dnc -- ping -W 5 -c 5 10.0.0.117" exitCode: 1 duration: 9.194782591s stdout:
PING 10.0.0.117 (10.0.0.117) 56(84) bytes of data.

--- 10.0.0.117 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4091ms


err:
exit status 1
stderr:
command terminated with exit code 1

FAIL: Connectivity test between nodes failed

The Jenkins output includes a kubectl get pods -o wide --all-namespaces dump, which tells us that the client is on k8s1 with IP 10.0.1.188, whereas the server (IP 10.0.0.117 above) is testds-f6dbm on k8s2.
The non-zero error counter was in bugtool-cilium-665vh, which was collected on k8s1. Thus, the failure affects the reply traffic as it enters k8s1.

Tracing the Packets

Let's trace the packet from the Linux state we collected.

The reply packet is 10.0.0.117 -> 10.0.1.188, leaving k8s2.
First, let's check which IPsec key is in use according to k8s2's ipcache:

$ git grep 10.0.1.188/32 bugtool-cilium-xnmwz/cmd/cilium-bpf-ipcache-list.md
bugtool-cilium-xnmwz/cmd/cilium-bpf-ipcache-list.md:10.0.1.188/32                             identity=1706 encryptkey=6 tunnelendpoint=192.168.56.11

We're using the IPsec key corresponding to SPI 6.
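
The same lookup can be done on the running node (a sketch; cilium-xnmwz is the agent pod on k8s2 in this run):

# Confirm the encryption key (SPI) and tunnel endpoint used for the destination IP.
$ kubectl -n kube-system exec cilium-xnmwz -c cilium-agent -- cilium bpf ipcache list | grep 10.0.1.188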

No proxy redirects are in place, hence no L7 policies, so we don't need to take those into account when looking at the XFRM policies.

On k8s2, the reply packet will match the following XFRM policy:

src 10.0.0.0/24 dst 10.0.1.0/24 uid 0
	dir out action allow index 57 priority 0 share any flag  (0x00000000)
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  0(bytes), 0(packets)
	  add 2022-11-11 01:16:53 use -
	mark 0x6e00/0xff00 
	tmpl src 10.0.0.81 dst 10.0.0.74
		proto esp spi 0x00000006(6) reqid 1(0x00000001) mode tunnel
		level required share any 
		enc-mask ffffffff auth-mask ffffffff comp-mask ffffffff

XFRM state matched:

$ git grep -A7 "src 10.0.0.81 dst 10.0.0.74" bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md:src 10.0.0.81 dst 10.0.0.74
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   proto esp spi 0x00000006(6) reqid 1(0x00000001) mode tunnel
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   replay-window 0 seq 0x00000000 flag  (0x00000000)
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   mark 0x6e00/0xff00 output-mark 0xe00/0xf00
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   aead rfc4106(gcm(aes)) [hash:fb57b7468d2bb142de56893974c4cef49d018a1d6a9bb6dbc0a6bf9efd243314] (160 bits) 128
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
bugtool-cilium-xnmwz/cmd/ip--s-xfrm-state.md-   lifetime config:
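
For reference, the same information can be pulled straight from the kernel on the node (a sketch; the entries of interest carry mark 0x6e00/0xff00, i.e. SPI 6):

# Dump the node's full XFRM policy and state tables.
$ kubectl -n kube-system exec cilium-xnmwz -c cilium-agent -- ip -s xfrm policy
$ kubectl -n kube-system exec cilium-xnmwz -c cilium-agent -- ip -s xfrm state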

So the packet goes out encrypted with the above key and arrives at k8s1 with outer IPs src 10.0.0.81 dst 10.0.0.74.

Something's weird already. We're using ipam=cluster-pool here, so the outer destination IP, which corresponds to the cilium_host IP address on k8s1, should be in the same /24 as the pods on k8s1. But we see 10.0.0.74 instead of e.g. 10.0.1.74.

What's the IP address of k8s1's cilium_host interface? That's not good 😱

$ git grep "link cilium_host" bugtool-cilium-665vh/cmd/ip-a.md
bugtool-cilium-665vh/cmd/ip-a.md:    inet 10.0.1.205/32 scope link cilium_host

The XFRM state on k8s2 is therefore incorrect: it should be src 10.0.0.81 dst 10.0.1.205.

Tracing the Agent Bug

The Node and CiliumNode objects agree on the cilium_host IP address:

$ jq '.items[].metadata.annotations."io.cilium.network.ipv4-cilium-host"' nodes.json 
"10.0.0.74"
"10.0.1.34"
$ cat api-resource-ciliumnodes.txt 
NAME   CILIUMINTERNALIP   INTERNALIP      AGE
k8s1   10.0.0.74          192.168.56.11   6m10s
k8s2   10.0.1.34          192.168.56.12   6m10s

and they agree with the XFRM state installed on k8s2 (dst 10.0.0.74).

But that doesn't match reality:

$ git grep "link cilium_host"
bugtool-cilium-665vh/cmd/ip-a.md:    inet 10.0.1.205/32 scope link cilium_host
bugtool-cilium-xnmwz/cmd/ip-a.md:    inet 10.0.0.81/32 scope link cilium_host
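
On a live cluster, the same comparison can be made directly (a sketch; cilium-665vh is the agent pod on k8s1 in this run, and the backslashes escape the dots in the annotation key for kubectl's jsonpath):

# Control-plane view: router IP advertised via the Node annotation and the CiliumNode object.
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.io\.cilium\.network\.ipv4-cilium-host}{"\n"}{end}'
$ kubectl get ciliumnodes
# Node view: address actually configured on the cilium_host interface.
$ kubectl -n kube-system exec cilium-665vh -c cilium-agent -- ip -4 addr show cilium_host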

In the Cilium agent logs, we have:

$ git grep 10.0.0.74 pod-kube-system-cilium-665vh-cilium-agent.log
pod-kube-system-cilium-665vh-cilium-agent.log:2022-11-11T01:16:42.632640174Z level=debug msg="Add NodeCiliumInternalIP: 10.0.0.74" k8sNodeID=15dec14a-0888-43a8-943c-08575728b78f nodeName=k8s1 subsys=k8s
pod-kube-system-cilium-665vh-cilium-agent.log:2022-11-11T01:16:42.639208067Z level=info msg="Restored router IPs from node information" ipv4=10.0.0.74 ipv6="fd02::87" subsys=k8s
pod-kube-system-cilium-665vh-cilium-agent.log:2022-11-11T01:16:42.645891571Z level=info msg="The router IP (10.0.0.74) considered for restoration does not belong in the Pod CIDR of the node. Discarding old router IP." cidrs="[10.0.1.0/24]" subsys=node
pod-kube-system-cilium-665vh-cilium-agent.log:2022-11-11T01:16:42.662387924Z level=debug msg="Add NodeCiliumInternalIP: 10.0.0.74" k8sNodeID=15dec14a-0888-43a8-943c-08575728b78f nodeName=k8s1 subsys=k8s

So the router IP was discarded because it didn't match the pod CIDR, which means the pod CIDR itself changed.

Using the Cilium logs collected for the earlier, successful tests, we can go back to the test during which the pod CIDR changed: K8sUpdates.

It turns out K8sUpdates deletes the CiliumNode objects and cleans the filesystem. Cilium then gets the CiliumInternalIP from the Kubernetes annotations, while the Cilium operator assigns a new pod CIDR. If the pod CIDR changed, the CiliumInternalIP no longer belongs to the new pod CIDR, and we end up with the mismatch seen in the logs just above.
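
One way to spot the resulting inconsistency on a cluster is to print each node's CiliumInternalIP next to its allocated pod CIDRs and check that the former falls inside the latter (a sketch, assuming the usual CiliumNode spec layout with .spec.addresses and .spec.ipam.podCIDRs):

# If the CiliumInternalIP is not inside one of the podCIDRs, the agent will log the
# "does not belong in the Pod CIDR" message seen above and pick a new router IP.
$ kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.addresses[?(@.type=="CiliumInternalIP")].ip}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'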
