Fixes out-of-sync CEP update #17001
LGTM overall. Can you add a Fixes: tag pointing out which commit it is fixing?
Thanks for the quick review! I'm a bit confused: what do you mean by which commit it fixes? Did you mean which GitHub issue? If so, there's no open issue for this; we encountered it internally.
See step 5 of https://docs.cilium.io/en/v1.10/contributing/development/contributing_guide/#submitting-a-pull-request
Today the endpoint synchronizer code won't update the upstream CEP upon initialization if it finds that the CEP to be created already exists in the api-server, but it still updates the local cache. This creates issues when CEPs become out-of-sync while the agent is down. For example, we see endpoints become out-of-sync because the endpoint synchronizer won't update the CEP when a node is preempted and comes back quickly with GKE preemptible VMs.

The chain of events is:
- Node is preempted and comes back quickly
- CEPs for old pods are still present in the apiserver
- Agent starts to regenerate endpoints
- Endpoint synchronizer does not update the CEP upon initialization, but the local cache *lastMdl* is updated with the new CEP
- Remote nodes have the old CEP with the old IP
- Traffic from (reinstated) pods with the new IP becomes *unmanaged* to Cilium

This fixes the above issue by setting the local cache to the upstream CEP when initialization fails due to an existing CEP.

Fixes: 4f958ad
Signed-off-by: Weilong Cui <cuiwl@google.com>
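For readers following the thread, the failure mode in the commit message above can be illustrated with a small, self-contained Go sketch. This is not Cilium's actual code: the `CEP` struct, `fakeAPIServer`, `synchronizer`, the `fixed` switch, and the IP addresses are all hypothetical stand-ins. Only the name `lastMdl` comes from the commit message. The sketch assumes the synchronizer skips an upstream update whenever the desired CEP equals its cached `lastMdl`.

```go
package main

import "errors"

// CEP is a hypothetical, stripped-down stand-in for a CiliumEndpoint object.
type CEP struct {
	Name string
	IP   string
}

var errAlreadyExists = errors.New("CEP already exists")

// fakeAPIServer is an in-memory stand-in for the Kubernetes api-server.
type fakeAPIServer struct {
	ceps map[string]CEP
}

// Create stores a new CEP and fails if one with the same name exists.
func (a *fakeAPIServer) Create(c CEP) error {
	if _, ok := a.ceps[c.Name]; ok {
		return errAlreadyExists
	}
	a.ceps[c.Name] = c
	return nil
}

func (a *fakeAPIServer) Get(name string) (CEP, bool) {
	c, ok := a.ceps[name]
	return c, ok
}

func (a *fakeAPIServer) Update(c CEP) { a.ceps[c.Name] = c }

// synchronizer caches the CEP it believes is upstream in lastMdl.
type synchronizer struct {
	api     *fakeAPIServer
	lastMdl *CEP
	fixed   bool // hypothetical switch: false = old behaviour, true = behaviour described by the fix
}

// initCEP tries to create the CEP upstream on agent start.
func (s *synchronizer) initCEP(desired CEP) error {
	err := s.api.Create(desired)
	switch {
	case err == nil:
		s.lastMdl = &desired
		return nil
	case errors.Is(err, errAlreadyExists):
		if !s.fixed {
			// Old behaviour: the cache claims the new CEP is upstream, but it is not.
			s.lastMdl = &desired
			return nil
		}
		// Fixed behaviour: seed the cache with what is really upstream.
		upstream, ok := s.api.Get(desired.Name)
		if !ok {
			return errors.New("CEP disappeared between create and get")
		}
		s.lastMdl = &upstream
		return nil
	default:
		return err
	}
}

// sync pushes the desired CEP upstream only when it differs from the cache.
func (s *synchronizer) sync(desired CEP) {
	if s.lastMdl != nil && *s.lastMdl == desired {
		return // cache says upstream is already current
	}
	s.api.Update(desired)
	s.lastMdl = &desired
}

// upstreamIPAfterRestart simulates a node coming back with a stale CEP
// upstream and returns the IP stored upstream after one sync pass.
func upstreamIPAfterRestart(fixed bool) string {
	api := &fakeAPIServer{ceps: map[string]CEP{
		"pod-a": {Name: "pod-a", IP: "10.0.0.1"}, // stale CEP from before preemption
	}}
	s := &synchronizer{api: api, fixed: fixed}
	desired := CEP{Name: "pod-a", IP: "10.0.0.2"} // pod came back with a new IP
	if err := s.initCEP(desired); err != nil {
		return ""
	}
	s.sync(desired)
	c, _ := api.Get("pod-a")
	return c.IP
}
```

Calling `upstreamIPAfterRestart(false)` leaves the stale address upstream because the cache already matches the desired state, while `upstreamIPAfterRestart(true)` detects the difference and pushes the new address. To run it standalone, call both from a `main` function and print the results.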
Ah, I wasn't aware of this, thanks for pointing it out! Updated the description and commit message.
Friendly ping, PTAL :)
test-me-please
Cilium L4LB XDP was hit by #17002
All required CI passed.