Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource: Fix race with event retries of deleted and subsequently recreated objects #27340

Merged
merged 3 commits into from Oct 6, 2023

Conversation

joamaki
Copy link
Contributor

@joamaki joamaki commented Aug 8, 2023

See commits for detailed description.

resource: Fix race condition in handling of Kubernetes object delete event retrying. In the very rare case when an object was created, deleted and re-created with the same name and the handling of the first deletion failed, the handling of delete event may have been retried even though the object was re-created. Only affected features using the Resource-library (LB IPAM, Mutual Auth and ClusterMesh).

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 8, 2023
Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a race.

pkg/k8s/resource/resource_test.go Outdated Show resolved Hide resolved
pkg/k8s/resource/resource_test.go Outdated Show resolved Hide resolved
pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
@odinuge
Copy link
Member

odinuge commented Aug 8, 2023

Is there an issue tracking the issues this is causing, and the (potential) consequences of this @joamaki?

Also, do you know what versions of cilium this affects?

@joamaki
Copy link
Contributor Author

joamaki commented Aug 8, 2023

Is there an issue tracking the issues this is causing, and the (potential) consequences of this @joamaki?

Also, do you know what versions of cilium this affects?

No issue yet as this was just discovered by reviewing the code. The consequences are luckily
limited since this only affects uses of Resource[T] that retry deletions and only cases where
an object is deleted and then recreated with the same key before the failed delete is successfully
retried.

Currently the potentially affected code paths according to "References" for Event[T].Done with non-nil
argument to a Delete event in v1.14 are:

None of these exist in earlier versions of Cilium. So the impact is only to v1.14 and to features ClusterMesh, LBIPAM, Multi-pool IPAM and Service Mesh / L7 proxy.

@joamaki joamaki changed the title pr/joamaki/fix resource delete race resource: Fix race with event retries of deleted and subsequently recreated objects Aug 9, 2023
@joamaki joamaki added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Aug 25, 2023
@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch from 26801fc to 49258fc Compare August 25, 2023 07:36
@joamaki joamaki requested a review from bimmlerd August 28, 2023 11:25
@joamaki joamaki marked this pull request as ready for review August 28, 2023 11:26
@joamaki joamaki requested a review from a team as a code owner August 28, 2023 11:26
@joamaki joamaki requested a review from youngnick August 28, 2023 11:26
@joamaki joamaki added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.2 Aug 28, 2023
@joamaki
Copy link
Contributor Author

joamaki commented Aug 28, 2023

/test

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits, but LGTM!

pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
pkg/k8s/resource/resource_test.go Show resolved Hide resolved
pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch 2 times, most recently from 97e1d2e to 3c84e87 Compare August 31, 2023 12:05
@aanm aanm self-requested a review September 1, 2023 08:46
pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
@christarazi christarazi added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. kind/bug This is a bug in the Cilium logic. labels Sep 1, 2023
Copy link
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with @aanm's changes.

@joamaki joamaki requested a review from aanm September 4, 2023 11:29
@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch from 3c84e87 to 8e06adc Compare September 5, 2023 13:05
@joamaki
Copy link
Contributor Author

joamaki commented Sep 5, 2023

/test

@bimmlerd
Copy link
Member

bimmlerd commented Sep 5, 2023

Is there a way #27549 could be triggered by this?

@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch from 8e06adc to 5f73cc4 Compare September 6, 2023 08:26
@joamaki
Copy link
Contributor Author

joamaki commented Sep 6, 2023

Is there a way #27549 could be triggered by this?

Don't think so. Probably same as with #23292.

We should perhaps look into doing what pkg/k8s/informer/informer.go does and implement the
Process function ourselves to make sure we mutate the cache.Store and queue work under the same lock to avoid the situation where a new subscriber will see a key via IterKeys(), process it, and then reprocess the same event due to the ResourceEventHandler queueing it.

@joamaki
Copy link
Contributor Author

joamaki commented Sep 12, 2023

Is there a way #27549 could be triggered by this?

Don't think so. Probably same as with #23292.

We should perhaps look into doing what pkg/k8s/informer/informer.go does and implement the Process function ourselves to make sure we mutate the cache.Store and queue work under the same lock to avoid the situation where a new subscriber will see a key via IterKeys(), process it, and then reprocess the same event due to the ResourceEventHandler queueing it.

I ended up doing exactly this: Resource[T] now implements its own Process function. This removes the race where a new subscriber can list the keys in the store while a key is being queued for it which led to the double-upsert event.

I also redid the delete object logic to keep a map of last known state of objects per subscriber and not use the "delete object" that informer sees. This keeps things consistent and the code easier to follow.

@joamaki
Copy link
Contributor Author

joamaki commented Sep 12, 2023

/test

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, nice simplification.

pkg/k8s/resource/resource.go Outdated Show resolved Hide resolved
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we adapt the pkg/k8s/informer.NewInformerWithStore for the pkg/k8s/resource use cases and use it?

Also, we should keep the mutator cache detector code, it was accidentally removed. I'm going to re-add it.

@joamaki
Copy link
Contributor Author

joamaki commented Sep 15, 2023

Can we adapt the pkg/k8s/informer.NewInformerWithStore for the pkg/k8s/resource use cases and use it?

Also, we should keep the mutator cache detector code, it was accidentally removed. I'm going to re-add it.

How do I do this with it: https://github.com/cilium/cilium/pull/27340/files#diff-ae3d7e3fabf6dbbdcdd0d826fe38147c28883c23eb2f1ae43f5b80f54f830a7bR699 ?

@joamaki
Copy link
Contributor Author

joamaki commented Sep 26, 2023

/test

@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch 3 times, most recently from 83cd574 to 11ff2cf Compare September 29, 2023 11:02
Resource[T] does not correctly handle the events:
  Upsert -> Delete with Done(not-nil) -> Upsert (recreate)
    -> Delete (retry)

The retried delete event carries the old initial version of the
object causing the recreated object to be incorrectly deleted.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Fix double upserts that were caused by store being manipulated without
synchronization with the subscriber queues by processing the deltas
under the resource read-lock and doing the initial key listing for
new subscriber with a write-lock. This way we cannot accidentally
see a key in the store and process it just before the key is queued.

As shown by test case in previous commit, the delete events are retried
with an old incorrect version of the object causing a recreated object
to be deleted.

Fix the deletion retrying by always queueing upserts and deletes by key
and keeping the last known state of objects emitted to the subscriber.
Only emit a delete event if the subscriber has seen its creation and
only use a version of the object that the subscriber has observed.

Fixes: 4101e2c ("k8s: Add resource package")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
The semantics around retrying and the Sync event are subtle. Spell out the properties
in a comment for Events().

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/fix-resource-delete-race branch from 11ff2cf to b47a92c Compare October 4, 2023 10:33
@joamaki
Copy link
Contributor Author

joamaki commented Oct 4, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 4, 2023
@joamaki joamaki merged commit 2d00e56 into cilium:main Oct 6, 2023
59 of 61 checks passed
@joamaki joamaki mentioned this pull request Oct 10, 2023
4 tasks
@joamaki joamaki added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Oct 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.3 Oct 10, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Oct 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.14 in 1.14.3 Oct 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.14 in 1.14.3 Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
1.14.3
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

6 participants