[v1.10] daemon, node: Fix faulty router IP restoration logic #16675

Merged
merged 1 commit into cilium:v1.10 from pr/christarazi/v1.10-backport-16672 on Jun 29, 2021

Conversation

christarazi
Member

@christarazi christarazi commented Jun 28, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 16672; do contrib/backporting/set-labels.py $pr done 1.10; done

@maintainer-s-little-helper maintainer-s-little-helper bot added the backport/1.10 and kind/backports (This PR provides functionality previously merged into master.) labels on Jun 28, 2021
@christarazi christarazi changed the title daemon, node: Fix faulty router IP restoration logic [v1.10] daemon, node: Fix faulty router IP restoration logic Jun 28, 2021
@christarazi christarazi force-pushed the pr/christarazi/v1.10-backport-16672 branch from 14de0c9 to 890ffaa on June 28, 2021 22:30
@christarazi
Member Author

christarazi commented Jun 28, 2021

test-backport-1.10

this hit #16659

@aanm
Member

aanm commented Jun 29, 2021

@aditighag
Member

k8s-1.21-kernel-4.9 failure - #16659

```
"app2-58757b7dd5-465jl" cannot curl fqdn target "vagrant-cache.ci.cilium.io"
Expected command: kubectl exec -n 202106282353k8shubbletesthubbleobservetestl3l4flow app2-58757b7dd5-465jl -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://vagrant-cache.ci.cilium.io -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'"
To succeed, but it failed:
Exitcode: 28
Err: exit status 28
Stdout:
time-> DNS: '0.022253()', Connect: '0.000000',Transfer '0.000000', total '5.000267'
Stderr:
command terminated with exit code 28
```

[ upstream commit ff63b07 ]

When running in ENI, Alibaba, or any other CRD-backed IPAM mode ("crd"), it
was very likely that `cilium_host` was assigned an additional IP upon each
Cilium restart. Below is a case where Cilium was restarted 3 times, hence
getting 3 additional router IPs:

```
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 66:03:3c:07:8c:47 brd ff:ff:ff:ff:ff:ff
    inet 192.168.35.9/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet 192.168.34.37/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet 192.168.57.107/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::6403:3cff:fe07:8c47/64 scope link
       valid_lft forever preferred_lft forever
```

This was because in CRD-backed IPAM modes, we wait until we have fully synced
with K8s in order to derive the VPC CIDR, which becomes the pod CIDR on the
node. Since the router IP restoration logic was using a different pod CIDR
during the router IP validation check, it erroneously discarded the restored
router IP. This was observed with:

```
2021-06-25T13:59:47.816069937Z level=info msg="The router IP (192.168.135.3) considered for restoration does not belong in the Pod CIDR of the node. Discarding old router IP." cidr=10.8.0.0/16 subsys=node
```
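
For illustration, the check behind that log line boils down to a simple CIDR membership test. A minimal sketch in Go (the names `validateRestoredRouterIP` and `restoredIP`, and the 192.168.0.0/16 VPC CIDR, are assumptions for this example, not the actual Cilium symbols or values):

```
package main

import (
	"fmt"
	"net"
)

// validateRestoredRouterIP mirrors the validation described above: a
// previously used router IP is kept only if it falls inside the pod CIDR
// known at the time of the check. If that CIDR is not yet the final
// VPC-derived range, a perfectly valid router IP gets discarded.
func validateRestoredRouterIP(restoredIP net.IP, podCIDR *net.IPNet) bool {
	return podCIDR != nil && restoredIP != nil && podCIDR.Contains(restoredIP)
}

func main() {
	restored := net.ParseIP("192.168.135.3")

	// Pod CIDR seen before the K8s/CRD sync completed (from the log line above).
	_, earlyCIDR, _ := net.ParseCIDR("10.8.0.0/16")
	// Pod CIDR derived from the VPC after the sync (assumed value).
	_, vpcCIDR, _ := net.ParseCIDR("192.168.0.0/16")

	fmt.Println(validateRestoredRouterIP(restored, earlyCIDR)) // false -> router IP discarded
	fmt.Println(validateRestoredRouterIP(restored, vpcCIDR))   // true  -> router IP restored
}
```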

This is problematic because the extraneous router IPs could also be assigned
to pods, which would break pod connectivity.

The fix is to break up the router IP restoration process into 2 parts. The
first is to attempt a restoration of the IP from the filesystem
(`node_config.h`). We also fetch the router IPs from the Kubernetes
resources, since they were already retrieved earlier inside
k8s.WaitForNodeInformation().

Then, after the CRD-backed IPAM is initialized and started, i.e.
(*Daemon).startIPAM() is called, we attempt the second part. This includes
evaluating which IPs (either from the filesystem or from K8s) should be set
as the router IPs. The IPs from the filesystem take precedence. In case the
node was rebooted, the filesystem will have been wiped, so we then rely on
the IPs from the K8s resources. At this point in the daemon initialization,
we have the correct CIDR range as the pod CIDR range with which to validate
the chosen IP.
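
As a rough sketch of how this second part could be expressed (hypothetical names and structure, not the actual daemon code), with the filesystem IP taking precedence, the K8s IP as the reboot fallback, and the final choice validated against the now-correct pod CIDR:

```
package main

import (
	"fmt"
	"net"
)

// selectRouterIP implements the precedence described above: prefer the IP
// restored from the filesystem (node_config.h); if the node was rebooted and
// the filesystem was wiped, fall back to the IP from the K8s resource. The
// chosen IP is accepted only if it belongs to the pod CIDR, which by this
// point (after (*Daemon).startIPAM()) is the correct VPC-derived range.
// All names here are hypothetical, for illustration only.
func selectRouterIP(fromFS, fromK8s net.IP, podCIDR *net.IPNet) net.IP {
	candidate := fromFS
	if candidate == nil {
		candidate = fromK8s
	}
	if candidate == nil || podCIDR == nil || !podCIDR.Contains(candidate) {
		return nil // nothing valid to restore
	}
	return candidate
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("192.168.0.0/16") // assumed VPC-derived pod CIDR

	// Normal restart: node_config.h survived, so its IP wins.
	fmt.Println(selectRouterIP(net.ParseIP("192.168.35.9"), net.ParseIP("192.168.34.37"), podCIDR))

	// Reboot: filesystem wiped, fall back to the IP from the K8s resource.
	fmt.Println(selectRouterIP(nil, net.ParseIP("192.168.34.37"), podCIDR))
}
```

The two calls in main() correspond to the two cases called out in the commit message: a normal restart where `node_config.h` survived, and a reboot where only the K8s resource remains.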

Fixes: beb8bde ("k8s, node: Restore router IPs (`cilium_host`) from
K8s resource")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/v1.10-backport-16672 branch from 890ffaa to 2ee471c on June 29, 2021 21:09
@christarazi christarazi marked this pull request as ready for review June 29, 2021 21:09
@christarazi christarazi requested a review from a team as a code owner June 29, 2021 21:09
@christarazi
Member Author

christarazi commented Jun 29, 2021

CI has passed previously. Force pushed to update the commit reference as the upstream PR was merged.

@aanm aanm merged commit c349779 into cilium:v1.10 Jun 29, 2021
@christarazi christarazi deleted the pr/christarazi/v1.10-backport-16672 branch June 29, 2021 22:55