Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes out-of-sycn CEP update #17001

Merged
merged 1 commit into from Aug 2, 2021
Merged

Fixes out-of-sycn CEP update #17001

merged 1 commit into from Aug 2, 2021

Conversation

Weil0ng
Copy link
Contributor

@Weil0ng Weil0ng commented Jul 26, 2021

Today endpoint synchronizer code won't update the upstream CEP upon
initialization if it finds the CEP to be created exists in the
api-server but still updates local cache. This creates issues when
CEPs become out-of-sync while agent is down. For example, we see
endpoints become out-of-sync because endpoint sychronizer won't
update the CEP when a node is preempted and comes back quickly with
GKE preemptables. The chain of event is:

  • Node preempted and comes back quickly
  • CEPs for old pods present in apiserver
  • Agent starts to regen endpoints
  • Endpointsynchronizer does not update CEP upon initilization but local
    cache lastMdl is updated with new CEP
  • Remote nodes have old CEP with old IP
  • Traffic from (reinstated) pods with new IP becomes unmanaged to
    Cilium.

This fixes the above issue by setting local cache to upstream when
initilization fails due to existing CEP.

Fixes: 4f958ad

Signed-off-by: Weilong Cui cuiwl@google.com

@Weil0ng Weil0ng requested review from a team and brb July 26, 2021 22:26
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 26, 2021
@Weil0ng Weil0ng requested a review from aanm July 26, 2021 22:26
@Weil0ng Weil0ng added the release-note/misc This PR makes changes that have no direct user impact. label Jul 26, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 26, 2021
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Can you add a Fixes pointing out which commit it is fixing?

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 26, 2021

LGTM overall. Can you add a Fixes pointing out which commit it is fixing?

Thanks for quick reviewing! A bit confused, what do you mean by which commit it fixes? Did you mean which github issue? If so there's no open issue for this, we encountered this internally.

@aanm
Copy link
Member

aanm commented Jul 26, 2021

LGTM overall. Can you add a Fixes pointing out which commit it is fixing?

Thanks for quick reviewing! A bit confused, what do you mean by which commit it fixes? Did you mean which github issue? If so there's no open issue for this, we encountered this internally.

See step 5. of https://docs.cilium.io/en/v1.10/contributing/development/contributing_guide/#submitting-a-pull-request

Today endpoint synchronizer code won't update the upstream CEP upon
initialization if it finds the CEP to be created exists in the
api-server but still updates local cache. This creates issues when
CEPs become out-of-sync while agent is down. For example, we see
endpoints become out-of-sync because endpoint sychronizer won't
update the CEP when a node is preempted and comes back quickly with
GKE preemptables. The chain of event is:
- Node preempted and comes back quickly
- CEPs for old pods present in apiserver
- Agent starts to regen endpoints
- Endpointsynchronizer does not update CEP upon initilization but local
cache *lastMdl* is updated with new CEP
- Remote nodes have old CEP with old IP
- Traffic from (reinstated) pods with new IP becomes *unmanaged* to
Cilium.

This fixes the above issue by setting local cache to upstream when
initilization fails due to existing CEP.

Fixes: 4f958ad

Signed-off-by: Weilong Cui <cuiwl@google.com>
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 26, 2021

LGTM overall. Can you add a Fixes pointing out which commit it is fixing?

Thanks for quick reviewing! A bit confused, what do you mean by which commit it fixes? Did you mean which github issue? If so there's no open issue for this, we encountered this internally.

See step 5. of https://docs.cilium.io/en/v1.10/contributing/development/contributing_guide/#submitting-a-pull-request

Ah wasn't aware of this, thanks for pointing out! Updated description and commit msg.

@Weil0ng Weil0ng changed the title Fixes out-of-sycn CEP update. Fixes out-of-sycn CEP update Jul 26, 2021
@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 27, 2021

Friendly ping, PTAL :)

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 27, 2021

test-me-please

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 27, 2021

Cilium L4LB XDP was hit by #17002

@Weil0ng
Copy link
Contributor Author

Weil0ng commented Jul 27, 2021

All required CI passed.

@Weil0ng Weil0ng added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Jul 29, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2021
@Weil0ng Weil0ng added needs-backport/1.10 ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 31, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.4 Jul 31, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 31, 2021
@Weil0ng Weil0ng added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.10.4 Aug 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 2, 2021
@Weil0ng Weil0ng added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 2, 2021
@nathanjsweet nathanjsweet merged commit ed37faf into cilium:master Aug 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants