Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.12: nodediscovery: Fix bug where CiliumInternalIP was flapping #29979

Merged
merged 2 commits into from
Dec 20, 2023

Conversation

pchaigno
Copy link
Member

Manual backport of #29964. See commits for conflict notes (same ones as for v1.13).

Once this PR is merged, you can update the PR labels via:

$ for pr in 29964; do contrib/backporting/set-labels.py $pr done 1.12; done

[ upstream commit 7084c17 ]

[ backporter's notes: Trivial conflicts in includes and due to a new
  functions. ]

This commit improves the debug logging of node update events by using
the JSON representation instead of the Go syntax representation of the
node. This makes it easier to parse the log message, as IP addresses are
now printed as strings instead of byte arrays.

Before:

```
level=debug msg="Received node update event from custom-resource: types.Node{Name:\"kind-worker\", Cluster:\"default\", IPAddresses:[]types.Address{types.Address{Type:\"InternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xac, 0x12, 0x0, 0x3}}, types.Address{Type:\"InternalIP\", IP:net.IP{0xfc, 0x0, 0xc1, 0x11, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3}}, types.Address{Type:\"CiliumInternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0xd2}}}, IPv4AllocCIDR:(*cidr.CIDR)(0xc000613180), IPv4SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv6AllocCIDR:(*cidr.CIDR)(nil), IPv6SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv4HealthIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0x30}, IPv6HealthIP:net.IP(nil), IPv4IngressIP:net.IP(nil), IPv6IngressIP:net.IP(nil), ClusterID:0x0, Source:\"custom-resource\", EncryptionKey:0x0, Labels:map[string]string{\"beta.kubernetes.io/arch\":\"amd64\", \"beta.kubernetes.io/os\":\"linux\", \"kubernetes.io/arch\":\"amd64\", \"kubernetes.io/hostname\":\"kind-worker2\", \"kubernetes.io/os\":\"linux\"}, Annotations:map[string]string(nil), NodeIdentity:0x0, WireguardPubKey:\"\"}" subsys=nodemanager
```

After:

```
level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.3\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::3\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],\"IPv4AllocCIDR\":{\"IP\":\"10.0.1.0\",\"Mask\":\"////AA==\"},\"IPv4SecondaryAllocCIDRs\":null,\"IPv6AllocCIDR\":null,\"IPv6SecondaryAllocCIDRs\":null,\"IPv4HealthIP\":\"10.0.1.120\",\"IPv6HealthIP\":\"\",\"IPv4IngressIP\":\"\",\"IPv6IngressIP\":\"\",\"ClusterID\":0,\"Source\":\"custom-resource\",\"EncryptionKey\":0,\"Labels\":{\"beta.kubernetes.io/arch\":\"amd64\",\"beta.kubernetes.io/os\":\"linux\",\"kubernetes.io/arch\":\"amd64\",\"kubernetes.io/hostname\":\"kind-worker\",\"kubernetes.io/os\":\"linux\"},\"Annotations\":null,\"NodeIdentity\":0,\"WireguardPubKey\":\"\"}" subsys=nodemanager
```

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 263e689 ]

[ backporter's notes: Various conflicts. net.IPFamilyOfString(x) doesn't
  exist so it had to be replaced with ParseIPSloppy(x).To4().
  slices.DeleteFunc doesn't exist so it had to be defined in our own
  slices package as done on v1.14. mutateNodeResource's logic was
  significantly refactored so the new code to clean node IP addresses
  had to be moved before all appending of node IP addresses. ]

This fixes a bug in `UpdateCiliumNodeResource` where the
`CiliumInternalIP` (aka `cilium_host` IP, aka router IP) was flapping in
the node manager during restoration (i.e. during cilium-agent restarts).

In particular in `cluster-pool` mode, `UpdateCiliumNodeResource` is
called before the `cilium_host` IP has been restored, as there are some
circular dependencies: The restored IP can only be fully validated after
the IPAM subsystem is ready, but that in turn can only happen if the
`CiliumNode` object has been created. The `UpdateCiliumNodeResource`
function however will only announce the `cilium_host` IP if it has been
restored.

This commit attempts to break that cycle by not overwriting any already
existing `CiliumInternalIP` in the CiliumNode resource.

Overall, this change is rather hacky, in particular it does not address
the fact that other less crucial node information (like the health IP)
also flaps. But since we want to backport this bugfix to older stable
branches too, this change is intentionally kept as minimal as possible.

Example node event (as observed by other nodes) before this change:

```
2023-12-18T12:58:20.070330814Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T12:58:20.208082226Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],..." subsys=nodemanager
```

After this change (note the `CiliumInternalIP` present in both events):

```
2023-12-18T15:38:23.695653876Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"},{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T15:38:23.838604573Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],...}" subsys=nodemanager
```

Reported-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno added kind/backports This PR provides functionality previously merged into master. backport/1.12 This PR represents a backport for Cilium 1.12.x of a PR that was merged to main. labels Dec 19, 2023
@pchaigno
Copy link
Member Author

/test-backport-1.12

@pchaigno pchaigno marked this pull request as ready for review December 19, 2023 10:40
@pchaigno pchaigno requested a review from a team as a code owner December 19, 2023 10:40
@pchaigno pchaigno merged commit f68f463 into v1.12 Dec 20, 2023
112 checks passed
@pchaigno pchaigno deleted the pr/v1.12-backport-2023-12-19 branch December 20, 2023 08:07
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 20, 2023
jaredledvina added a commit to DataDog/cilium that referenced this pull request Jan 23, 2024
…available"

This reverts commit 532590f.
Instead we'll pick upstreams cilium#29979
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.12 This PR represents a backport for Cilium 1.12.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants