resource: Fix race with event retries of deleted and subsequently recreated objects #27340
I think there's a race.
Is there an issue tracking the problems this is causing and their (potential) consequences, @joamaki? Also, do you know which versions of Cilium this affects?
No issue yet as this was just discovered by reviewing the code. The consequences are luckily Currently the potentially affected code paths according to "References" for
None of these exist in earlier versions of Cilium, so the impact is limited to v1.14 and to the features ClusterMesh, LBIPAM, Multi-pool IPAM and Service Mesh / L7 proxy.
26801fc to 49258fc
/test
Couple of nits, but LGTM!
97e1d2e to 3c84e87
LGTM with @aanm's changes.
3c84e87 to 8e06adc
/test
Is there a way #27549 could be triggered by this?
8e06adc to 5f73cc4
Don't think so. Probably same as with #23292. We should perhaps look into doing what
I ended up doing exactly this: Resource[T] now implements its own I also redid the delete object logic to keep a map of the last known state of objects per subscriber and not use the "delete object" that the informer sees. This keeps things consistent and makes the code easier to follow.
/test
Wow, nice simplification.
eb36475 to 316751a
316751a to 2737615
Can we adapt the pkg/k8s/informer.NewInformerWithStore for the pkg/k8s/resource use cases and use it?
Also, we should keep the mutator cache detector code, it was accidentally removed. I'm going to re-add it.
How do I do this with it: https://github.com/cilium/cilium/pull/27340/files#diff-ae3d7e3fabf6dbbdcdd0d826fe38147c28883c23eb2f1ae43f5b80f54f830a7bR699 ?
/test
83cd574 to 11ff2cf
Resource[T] does not correctly handle the events:

  Upsert -> Delete with Done(not-nil) -> Upsert (recreate) -> Delete (retry)

The retried delete event carries the old initial version of the object causing the recreated object to be incorrectly deleted.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Fix double upserts that were caused by the store being manipulated without synchronization with the subscriber queues. The deltas are now processed under the resource read-lock and the initial key listing for a new subscriber is done with a write-lock. This way we cannot accidentally see a key in the store and process it just before the key is queued.

As shown by the test case in the previous commit, delete events are retried with an old, incorrect version of the object, causing a recreated object to be deleted. Fix the deletion retrying by always queueing upserts and deletes by key and keeping the last known state of objects emitted to the subscriber. Only emit a delete event if the subscriber has seen the object's creation, and only use a version of the object that the subscriber has observed.

Fixes: 4101e2c ("k8s: Add resource package")

Signed-off-by: Jussi Maki <jussi@isovalent.com>
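For readers following along, the approach described in this commit message can be sketched roughly as below. This is a simplified illustration, not the code from this PR: the `Key`, `Object`, `Event` and `subscriber` types, the `next` and `done` methods, and the plain map standing in for the store are all hypothetical names, and locking and the actual work queue are left out. It assumes events are queued by key only, and that the event is built from the store and the subscriber's last known state at processing time.

```go
package main

import (
	"errors"
	"fmt"
)

// Key, Object and Event are hypothetical stand-ins, not Cilium types.
type Key string

type Object struct {
	Key     Key
	Version int
}

type EventKind int

const (
	Upsert EventKind = iota
	Delete
)

func (k EventKind) String() string {
	if k == Upsert {
		return "Upsert"
	}
	return "Delete"
}

type Event struct {
	Kind   EventKind
	Object Object
}

// errHandler simulates a subscriber handler failing, i.e. Done(not-nil).
var errHandler = errors.New("handler failed")

// subscriber keeps the last object state that was emitted to it, per key.
type subscriber struct {
	lastKnown map[Key]Object
}

func newSubscriber() *subscriber {
	return &subscriber{lastKnown: map[Key]Object{}}
}

// next builds the event for a queued key from the current store contents.
// A delete is only emitted for an object this subscriber has seen, and it
// carries the version the subscriber last observed.
func (s *subscriber) next(store map[Key]Object, key Key) (Event, bool) {
	if obj, ok := store[key]; ok {
		s.lastKnown[key] = obj
		return Event{Kind: Upsert, Object: obj}, true
	}
	if seen, ok := s.lastKnown[key]; ok {
		return Event{Kind: Delete, Object: seen}, true
	}
	return Event{}, false // never seen by this subscriber: nothing to emit
}

// done forgets the object only after a delete was handled successfully.
// On error the caller is expected to requeue the key for a retry.
func (s *subscriber) done(ev Event, err error) {
	if err == nil && ev.Kind == Delete {
		delete(s.lastKnown, ev.Object.Key)
	}
}

func main() {
	store := map[Key]Object{}
	sub := newSubscriber()

	store["a"] = Object{Key: "a", Version: 1}
	ev, _ := sub.next(store, "a")
	fmt.Println(ev.Kind, ev.Object.Version)
	sub.done(ev, nil)

	delete(store, "a")
	ev, _ = sub.next(store, "a")
	fmt.Println(ev.Kind, ev.Object.Version)
	sub.done(ev, errHandler) // delete fails, key gets requeued

	store["a"] = Object{Key: "a", Version: 2} // object is recreated
	ev, _ = sub.next(store, "a")              // the retry re-reads the store
	fmt.Println(ev.Kind, ev.Object.Version)
	sub.done(ev, nil)
}
```

In this sketch the retry after the failed delete looks the key up again, finds the recreated object and emits an upsert for it instead of replaying a delete that carries the stale object.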
The semantics around retrying and the Sync event are subtle. Spell out the properties in a comment for Events().

Signed-off-by: Jussi Maki <jussi@isovalent.com>
11ff2cf to b47a92c
/test
See commits for detailed description.