
nodediscovery: Fix bug where CiliumInternalIP was flapping #29964

Conversation

@gandro gandro commented Dec 18, 2023

This fixes a bug in UpdateCiliumNodeResource where the CiliumInternalIP (aka cilium_host IP, aka router IP) was flapping in the node manager during restoration (i.e. during cilium-agent restarts).

In particular, in cluster-pool mode, UpdateCiliumNodeResource is called before the cilium_host IP has been restored, due to a circular dependency: the restored IP can only be fully validated once the IPAM subsystem is ready, but the IPAM subsystem in turn only becomes ready once the CiliumNode object has been created. However, UpdateCiliumNodeResource only announces the cilium_host IP once it has been restored.

This commit attempts to break that cycle by not overwriting any already existing CiliumInternalIP in the CiliumNode resource.

Overall, this change is rather hacky; in particular, it does not address the fact that other, less crucial node information (like the health IP) also flaps (see also #28299). But since we want to backport this bugfix to older stable branches too, the change is intentionally kept as minimal as possible.
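
The core idea can be sketched as follows. This is a minimal, illustrative Go sketch with simplified types and function names, not the actual implementation (which lives in pkg/nodediscovery and operates on the CiliumNode CRD): if the locally computed address list has no CiliumInternalIP yet, a previously announced one is carried over instead of being dropped.

```go
// Illustrative sketch of the fix, with simplified stand-in types.
package main

import "fmt"

type nodeAddress struct {
	Type string // e.g. "InternalIP" or "CiliumInternalIP"
	IP   string
}

// mergeAddresses builds the address list to announce. If the locally
// computed list has no CiliumInternalIP yet (restoration not finished),
// any CiliumInternalIP already present in the stored resource is kept
// rather than overwritten.
func mergeAddresses(stored, local []nodeAddress) []nodeAddress {
	for _, a := range local {
		if a.Type == "CiliumInternalIP" {
			return local // router IP already restored, nothing to preserve
		}
	}
	out := local
	for _, a := range stored {
		if a.Type == "CiliumInternalIP" {
			out = append(out, a) // keep the previously announced router IP
		}
	}
	return out
}

func main() {
	stored := []nodeAddress{
		{Type: "InternalIP", IP: "172.18.0.4"},
		{Type: "CiliumInternalIP", IP: "10.0.1.245"},
	}
	local := []nodeAddress{{Type: "InternalIP", IP: "172.18.0.4"}}
	fmt.Println(mergeAddresses(stored, local))
	// [{InternalIP 172.18.0.4} {CiliumInternalIP 10.0.1.245}]
}
```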

Example node event (as observed by other nodes) before this change:

```
2023-12-18T12:58:20.070330814Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T12:58:20.208082226Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],..." subsys=nodemanager
```

After this change (note the CiliumInternalIP present in both events):

```
2023-12-18T15:38:23.695653876Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"},{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T15:38:23.838604573Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],...}" subsys=nodemanager
```
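
As a sanity check, a hypothetical helper (not part of this PR; types are trimmed to the fields used here) can parse the node JSON embedded in such an event and report whether a CiliumInternalIP entry is present, which is exactly what flapped before this change:

```go
// Hypothetical log-verification helper, illustrative only.
package main

import (
	"encoding/json"
	"fmt"
)

type address struct {
	Type string
	IP   string
}

type nodeEvent struct {
	Name        string
	IPAddresses []address
}

// hasCiliumInternalIP reports whether the node JSON from a debug log
// event contains a CiliumInternalIP address entry.
func hasCiliumInternalIP(nodeJSON string) (bool, error) {
	var ev nodeEvent
	if err := json.Unmarshal([]byte(nodeJSON), &ev); err != nil {
		return false, err
	}
	for _, a := range ev.IPAddresses {
		if a.Type == "CiliumInternalIP" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	event := `{"Name":"kind-worker","IPAddresses":[{"Type":"InternalIP","IP":"172.18.0.4"},{"Type":"CiliumInternalIP","IP":"10.0.1.245"}]}`
	ok, err := hasCiliumInternalIP(event)
	if err != nil {
		panic(err)
	}
	fmt.Println("CiliumInternalIP present:", ok) // true in both events after the fix
}
```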

Reported-by: Paul Chaignon <paul.chaignon@gmail.com>

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Dec 18, 2023
@gandro gandro requested a review from a team as a code owner December 18, 2023 15:59
@gandro gandro requested a review from jibi December 18, 2023 15:59
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in v1.15.0-rc.1 Dec 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.6 Dec 18, 2023
gandro commented Dec 18, 2023

I have also reproduced this on v1.14. It likely requires a manual backport on that branch. I'll check v1.13 and v1.12 next.

@gandro gandro requested review from giorio94 and removed request for jibi December 18, 2023 16:06
@gandro gandro added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Dec 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.11 Dec 18, 2023
@gandro gandro marked this pull request as draft December 18, 2023 16:21
gandro commented Dec 18, 2023

Converting back to draft after discussing offline with Marco: this also needs to be fixed in kvstore mode. Also, the bug is present in all supported Cilium versions.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.18 Dec 18, 2023
@gandro gandro marked this pull request as ready for review December 18, 2023 17:10
gandro commented Dec 18, 2023

Apparently kvstore mode should also be fixed by this commit. I will verify manually tomorrow.

@giorio94 giorio94 left a comment


Thanks! The fix looks reasonable to me, especially considering the requirement to backport it to all stable versions. Moving forward, we'll likely want to refactor the IPAM CiliumNode creation logic to avoid triggering unnecessary updates before the full local node information has been restored (and to also fix the bouncing of the other IP addresses), but that would be a more invasive change.

I've also double-checked the kvstore case in combination with cluster-pool IPAM mode, and the CiliumInternalIP no longer flips (the flipping was caused by the CiliumNode-to-kvstore synchronization process implemented by the Cilium operator):

Full kvstore events received after restart

```
PUT
cilium/state/nodes/v1/default/kind-worker
{"Name":"kind-worker","Cluster":"default","IPAddresses":[{"Type":"CiliumInternalIP","IP":"10.0.1.199"},{"Type":"CiliumInternalIP","IP":"fd00::147"},{"Type":"InternalIP","IP":"172.19.0.3"},{"Type":"InternalIP","IP":"fc00:c111::3"}],"IPv4AllocCIDR":{"IP":"10.0.1.0","Mask":"////AA=="},"IPv4SecondaryAllocCIDRs":null,"IPv6AllocCIDR":{"IP":"fd00::100","Mask":"////////////////////AA=="},"IPv6SecondaryAllocCIDRs":null,"IPv4HealthIP":"","IPv6HealthIP":"","IPv4IngressIP":"","IPv6IngressIP":"","ClusterID":0,"Source":"custom-resource","EncryptionKey":0,"Labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"kind-worker","kubernetes.io/os":"linux"},"Annotations":null,"NodeIdentity":0,"WireguardPubKey":""}
PUT
cilium/state/nodes/v1/default/kind-worker
{"Name":"kind-worker","Cluster":"default","IPAddresses":[{"Type":"InternalIP","IP":"172.19.0.3"},{"Type":"InternalIP","IP":"fc00:c111::3"},{"Type":"CiliumInternalIP","IP":"10.0.1.199"},{"Type":"CiliumInternalIP","IP":"fd00::147"}],"IPv4AllocCIDR":{"IP":"10.0.1.0","Mask":"////AA=="},"IPv4SecondaryAllocCIDRs":null,"IPv6AllocCIDR":{"IP":"fd00::100","Mask":"////////////////////AA=="},"IPv6SecondaryAllocCIDRs":null,"IPv4HealthIP":"10.0.1.63","IPv6HealthIP":"fd00::11a","IPv4IngressIP":"","IPv6IngressIP":"","ClusterID":0,"Source":"local","EncryptionKey":0,"Labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"kind-worker","kubernetes.io/os":"linux"},"Annotations":{},"NodeIdentity":1,"WireguardPubKey":""}
PUT
cilium/state/nodes/v1/default/kind-worker
{"Name":"kind-worker","Cluster":"default","IPAddresses":[{"Type":"InternalIP","IP":"172.19.0.3"},{"Type":"InternalIP","IP":"fc00:c111::3"},{"Type":"CiliumInternalIP","IP":"10.0.1.199"},{"Type":"CiliumInternalIP","IP":"fd00::147"}],"IPv4AllocCIDR":{"IP":"10.0.1.0","Mask":"////AA=="},"IPv4SecondaryAllocCIDRs":null,"IPv6AllocCIDR":{"IP":"fd00::100","Mask":"////////////////////AA=="},"IPv6SecondaryAllocCIDRs":null,"IPv4HealthIP":"10.0.1.63","IPv6HealthIP":"fd00::11a","IPv4IngressIP":"","IPv6IngressIP":"","ClusterID":0,"Source":"custom-resource","EncryptionKey":0,"Labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"kind-worker","kubernetes.io/os":"linux"},"Annotations":null,"NodeIdentity":0,"WireguardPubKey":""}
```

Review thread on pkg/nodediscovery/nodediscovery.go (outdated, resolved)
This commit improves the debug logging of node update events by using
the JSON representation instead of the Go syntax representation of the
node. This makes it easier to parse the log message, as IP addresses are
now printed as strings instead of byte arrays.

Before:

```
level=debug msg="Received node update event from custom-resource: types.Node{Name:\"kind-worker\", Cluster:\"default\", IPAddresses:[]types.Address{types.Address{Type:\"InternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xac, 0x12, 0x0, 0x3}}, types.Address{Type:\"InternalIP\", IP:net.IP{0xfc, 0x0, 0xc1, 0x11, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3}}, types.Address{Type:\"CiliumInternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0xd2}}}, IPv4AllocCIDR:(*cidr.CIDR)(0xc000613180), IPv4SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv6AllocCIDR:(*cidr.CIDR)(nil), IPv6SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv4HealthIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0x30}, IPv6HealthIP:net.IP(nil), IPv4IngressIP:net.IP(nil), IPv6IngressIP:net.IP(nil), ClusterID:0x0, Source:\"custom-resource\", EncryptionKey:0x0, Labels:map[string]string{\"beta.kubernetes.io/arch\":\"amd64\", \"beta.kubernetes.io/os\":\"linux\", \"kubernetes.io/arch\":\"amd64\", \"kubernetes.io/hostname\":\"kind-worker2\", \"kubernetes.io/os\":\"linux\"}, Annotations:map[string]string(nil), NodeIdentity:0x0, WireguardPubKey:\"\"}" subsys=nodemanager
```

After:

```
level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.3\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::3\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],\"IPv4AllocCIDR\":{\"IP\":\"10.0.1.0\",\"Mask\":\"////AA==\"},\"IPv4SecondaryAllocCIDRs\":null,\"IPv6AllocCIDR\":null,\"IPv6SecondaryAllocCIDRs\":null,\"IPv4HealthIP\":\"10.0.1.120\",\"IPv6HealthIP\":\"\",\"IPv4IngressIP\":\"\",\"IPv6IngressIP\":\"\",\"ClusterID\":0,\"Source\":\"custom-resource\",\"EncryptionKey\":0,\"Labels\":{\"beta.kubernetes.io/arch\":\"amd64\",\"beta.kubernetes.io/os\":\"linux\",\"kubernetes.io/arch\":\"amd64\",\"kubernetes.io/hostname\":\"kind-worker\",\"kubernetes.io/os\":\"linux\"},\"Annotations\":null,\"NodeIdentity\":0,\"WireguardPubKey\":\"\"}" subsys=nodemanager
```

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
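
For illustration, here is a self-contained sketch of the difference (the struct and log messages are simplified stand-ins, not the actual Cilium types or logging setup): net.IP implements encoding.TextMarshaler, so the JSON encoder prints addresses such as 172.18.0.3 as strings, while the %#v verb prints the 16-byte backing array.

```go
// Illustrative sketch of the logging change, with simplified types.
package main

import (
	"encoding/json"
	"log"
	"net"
)

type address struct {
	Type string
	IP   net.IP
}

type node struct {
	Name        string
	IPAddresses []address
}

func main() {
	n := node{
		Name:        "kind-worker",
		IPAddresses: []address{{Type: "InternalIP", IP: net.ParseIP("172.18.0.3")}},
	}

	// Before: the Go-syntax representation renders net.IP as raw bytes.
	log.Printf("Received node update event from custom-resource: %#v", n)

	// After: the JSON representation renders IPs as readable strings.
	b, err := json.Marshal(n)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Received node update event from custom-resource node=%s", b)
}
```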
@gandro gandro force-pushed the pr/gandro/fix-router-ip-being-temporarily-removed branch from c8cc069 to 125ae7a Compare December 18, 2023 17:52
gandro commented Dec 18, 2023

/test

@julianwiedmann julianwiedmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 18, 2023
@julianwiedmann julianwiedmann merged commit 263e689 into cilium:main Dec 18, 2023
61 of 62 checks passed
@pchaigno pchaigno added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Dec 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.15 in v1.15.0-rc.1 Dec 18, 2023
@pchaigno pchaigno added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Dec 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.11 Dec 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.18 Dec 19, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Dec 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.14 in 1.14.6 Dec 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.14 in 1.14.6 Dec 19, 2023
@github-actions github-actions bot added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Dec 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.13 in 1.13.11 Dec 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.13 in 1.13.11 Dec 19, 2023
@julianwiedmann julianwiedmann added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Dec 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.15 to Backport done to v1.15 in v1.15.0-rc.1 Dec 19, 2023
@github-actions github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Dec 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.12 in 1.12.18 Dec 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.12 in 1.12.18 Dec 20, 2023