Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipam/multipool: Fix bug where allocator was unable to update CiliumNode #27963

Merged
merged 2 commits into from Sep 7, 2023

Conversation

gandro
Copy link
Member

@gandro gandro commented Sep 6, 2023

This commit fixes an issue where the multi-pool allocator was unable to
update a CiliumNode resource because of concurrent writes. This
manifested in the following error being emitted repeatedly:

level=debug msg="Controller run failed" consecutiveErrors=48
error="failed to update spec: Operation cannot be fulfilled on
ciliumnodes.cilium.io \"kind-worker\": the object has been modified;
please apply your changes to the latest version and try again"
name=ipam-multi-pool-sync-kind-worker subsys=controller
uuid=12ba9a52-d36f-48fe-a7b7-3cf97c2cdb26

This would happen because the operator CiliumNode watcher does not call
the Upsert function if only the metadata (e.g. resource version,
labels, annotations, etc) of a node changes. This meant that the
allocator was working with a stale resourceVersion of the CiliumNode
object, causing any updates to fail until Upsert would be called again
because some non-metadata field changed.

This commit fixes that issue by having the controller fetch the most
recent version of the CiliumNode if the Kubernetes API reports that
there have been concurrent changes. This behavior matches the behavior
of the cluster-pool and ENI/Azure/AlibabaCloud implementation, which
already correctly fetched the resource again upon conflicts.

In addition, this commit also adds a unit test to test this new
behavior.

The IPAM allocator must not allocate new CIDRs before restoration has
finished, i.e. before the K8s CiliumNode cache has synced and we have
observed all nodes.

Before this commit, we already started a controller attempting to
allocate new CIDRs even if the allocator was not yet ready. This led to
the controller being run unnecessarily, as it cannot succeed until
`Resync` is called. Creating a controller before `Resync` is called is
also not needed, because `Resync` itself (re-)creates a controller for
each pending node.

Therefore, this commit changes the logic to not start any controller
before `restoreFinished` is true, as the controller will be created by
`Resync` once everything is ready.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/ipam Impacts IP address management functionality. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 6, 2023
@gandro gandro requested a review from a team as a code owner September 6, 2023 12:02
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.2 Sep 6, 2023
@gandro gandro requested a review from tklauser September 6, 2023 12:03
This commit fixes an issue where the multi-pool allocator was unable to
update a CiliumNode resource because of concurrent writes. This
manifested in the following error being emitted repeatedly:

```
level=debug msg="Controller run failed" consecutiveErrors=48
error="failed to update spec: Operation cannot be fulfilled on
ciliumnodes.cilium.io \"kind-worker\": the object has been modified;
please apply your changes to the latest version and try again"
name=ipam-multi-pool-sync-kind-worker subsys=controller
uuid=12ba9a52-d36f-48fe-a7b7-3cf97c2cdb26
```

This would happen because the operator CiliumNode watcher does not call
the `Upsert` function if only the metadata (e.g. resource version,
labels, annotations, etc) of a node changes. This meant that the
allocator was working with a stale `resourceVersion` of the CiliumNode
object, causing any updates to fail until `Upsert` would be called again
because some non-metadata field changed.

This commit fixes that issue by having the controller fetch the most
recent version of the `CiliumNode` if the Kubernetes API reports that
there have been concurrent changes. This behavior matches the behavior
of the cluster-pool and ENI/Azure/AlibabaCloud implementation, which
already correctly fetched the resource again upon conflicts.

In addition, this commit also adds a unit test to test this new
behavior.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/multi-pool-fix-operator-desync branch from f5ffacc to 0c69aaf Compare September 6, 2023 12:03
@gandro
Copy link
Member Author

gandro commented Sep 6, 2023

/test

Edit:

  • ci-e2e failed in ❌ all-ingress-deny-knp/pod-to-cidr/external-1001-1: cilium-test/client2-646b88fb9b-8j27l (10.244.3.23) -> external-1001 (1.0.0.1:443), restarting
  • ci-ipsec-upgrade failed with Error while dialing: dial tcp: lookup localhost on 1.1.1.1:53: no such host", which I think is also caused by external infra, though I'm not 100% certain

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find. Thanks for adding a unit test as well!

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice fix, changes look good on my end

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 7, 2023
@gandro gandro merged commit 525b1ea into cilium:main Sep 7, 2023
62 checks passed
@michi-covalent michi-covalent added this to Needs backport from main in 1.14.3 Sep 9, 2023
@michi-covalent michi-covalent removed this from Needs backport from main in 1.14.2 Sep 9, 2023
@gandro gandro mentioned this pull request Sep 12, 2023
15 tasks
@gandro gandro added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 12, 2023
@gandro gandro added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Sep 25, 2023
@jrajahalme jrajahalme moved this from Needs backport from main to Backport done to v1.14 in 1.14.3 Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ipam Impacts IP address management functionality. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.14.3
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

3 participants