Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14: nodediscovery: Fix bug where CiliumInternalIP was flapping #29972

Merged
merged 2 commits into from Dec 19, 2023

Conversation

pchaigno
Copy link
Member

Manual backport of #29964. See commits for conflict notes.

Once this PR is merged, you can update the PR labels via:

$ for pr in 29964; do contrib/backporting/set-labels.py $pr done 1.14; done

[ upstream commit 7084c17 ]

[ backporter's notes: Trivial conflicts on includes. ]

This commit improves the debug logging of node update events by using
the JSON representation instead of the Go syntax representation of the
node. This makes it easier to parse the log message, as IP addresses are
now printed as strings instead of byte arrays.

Before:

```
level=debug msg="Received node update event from custom-resource: types.Node{Name:\"kind-worker\", Cluster:\"default\", IPAddresses:[]types.Address{types.Address{Type:\"InternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xac, 0x12, 0x0, 0x3}}, types.Address{Type:\"InternalIP\", IP:net.IP{0xfc, 0x0, 0xc1, 0x11, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3}}, types.Address{Type:\"CiliumInternalIP\", IP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0xd2}}}, IPv4AllocCIDR:(*cidr.CIDR)(0xc000613180), IPv4SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv6AllocCIDR:(*cidr.CIDR)(nil), IPv6SecondaryAllocCIDRs:[]*cidr.CIDR(nil), IPv4HealthIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0x0, 0x0, 0x30}, IPv6HealthIP:net.IP(nil), IPv4IngressIP:net.IP(nil), IPv6IngressIP:net.IP(nil), ClusterID:0x0, Source:\"custom-resource\", EncryptionKey:0x0, Labels:map[string]string{\"beta.kubernetes.io/arch\":\"amd64\", \"beta.kubernetes.io/os\":\"linux\", \"kubernetes.io/arch\":\"amd64\", \"kubernetes.io/hostname\":\"kind-worker2\", \"kubernetes.io/os\":\"linux\"}, Annotations:map[string]string(nil), NodeIdentity:0x0, WireguardPubKey:\"\"}" subsys=nodemanager
```

After:

```
level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.3\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::3\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],\"IPv4AllocCIDR\":{\"IP\":\"10.0.1.0\",\"Mask\":\"////AA==\"},\"IPv4SecondaryAllocCIDRs\":null,\"IPv6AllocCIDR\":null,\"IPv6SecondaryAllocCIDRs\":null,\"IPv4HealthIP\":\"10.0.1.120\",\"IPv6HealthIP\":\"\",\"IPv4IngressIP\":\"\",\"IPv6IngressIP\":\"\",\"ClusterID\":0,\"Source\":\"custom-resource\",\"EncryptionKey\":0,\"Labels\":{\"beta.kubernetes.io/arch\":\"amd64\",\"beta.kubernetes.io/os\":\"linux\",\"kubernetes.io/arch\":\"amd64\",\"kubernetes.io/hostname\":\"kind-worker\",\"kubernetes.io/os\":\"linux\"},\"Annotations\":null,\"NodeIdentity\":0,\"WireguardPubKey\":\"\"}" subsys=nodemanager
```

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno added the kind/backports This PR provides functionality previously merged into master. label Dec 18, 2023
@pchaigno pchaigno requested a review from gandro December 18, 2023 22:11
@maintainer-s-little-helper maintainer-s-little-helper bot added the backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. label Dec 18, 2023
@pchaigno pchaigno marked this pull request as ready for review December 18, 2023 23:02
@pchaigno pchaigno requested a review from a team as a code owner December 18, 2023 23:02
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in terms of backported logic, thanks for taking care of this. I have not tested this branch yet however, might be worth doing before we merge.

pkg/nodediscovery/nodediscovery.go Show resolved Hide resolved
@gandro

This comment was marked as outdated.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was too quick to approve. I noticed in the v1.13 review that wee need to remove the nodeResource.Spec.Addresses = []ciliumv2.NodeAddress{} initialization on line 411. Otherwise the IP addresses we want to preserve are removed.

[ upstream commit 263e689 ]

[ backporter's notes: Trivial conflicts on includes. Non-trivial
  conflicts in the mutateNodeResource function since it was
  significantly refactored. Conflict resolved by moving the
  conflicting code before all appending to the slice of node IP
  addresses. slices.DeleteFunc also doesn't exist in Go 1.20 so this
  commit introduces it in our own slices package. ]

This fixes a bug in `UpdateCiliumNodeResource` where the
`CiliumInternalIP` (aka `cilium_host` IP, aka router IP) was flapping in
the node manager during restoration (i.e. during cilium-agent restarts).

In particular in `cluster-pool` mode, `UpdateCiliumNodeResource` is
called before the `cilium_host` IP has been restored, as there are some
circular dependencies: The restored IP can only be fully validated after
the IPAM subsystem is ready, but that in turn can only happen if the
`CiliumNode` object has been created. The `UpdateCiliumNodeResource`
function however will only announce the `cilium_host` IP if it has been
restored.

This commit attempts to break that cycle by not overwriting any already
existing `CiliumInternalIP` in the CiliumNode resource.

Overall, this change is rather hacky, in particular it does not address
the fact that other less crucial node information (like the health IP)
also flaps. But since we want to backport this bugfix to older stable
branches too, this change is intentionally kept as minimal as possible.

Example node event (as observed by other nodes) before this change:

```
2023-12-18T12:58:20.070330814Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T12:58:20.208082226Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],..." subsys=nodemanager
```

After this change (note the `CiliumInternalIP` present in both events):

```
2023-12-18T15:38:23.695653876Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"},{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"}],..." subsys=nodemanager
2023-12-18T15:38:23.838604573Z level=debug msg="Received node update event from custom-resource" node="{\"Name\":\"kind-worker\",\"Cluster\":\"default\",\"IPAddresses\":[{\"Type\":\"InternalIP\",\"IP\":\"172.18.0.4\"},{\"Type\":\"InternalIP\",\"IP\":\"fc00:c111::4\"},{\"Type\":\"CiliumInternalIP\",\"IP\":\"10.0.1.245\"}],...}" subsys=nodemanager
```

Reported-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno
Copy link
Member Author

/test-backport-1.14

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 19, 2023
@pchaigno pchaigno merged commit e4d877d into v1.14 Dec 19, 2023
209 checks passed
@pchaigno pchaigno deleted the pr/v1.14-backport-2023-12-18 branch December 19, 2023 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants