Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

egressgateway: fix initial reconciliation #18325

Merged
merged 1 commit into from
Jan 5, 2022

Conversation

jibi
Copy link
Member

@jibi jibi commented Dec 22, 2021

When a new egress gateway manager is created, it will wait for the k8s
cache to be fully synced before running the first reconciliation.

Currently the logic is based on the WaitUntilK8sCacheIsSynced method
of the Daemon object, which waits on the k8sCachesSynced channel to be
closed (which indicates that the cache has been indeed synced).

The issue with this approach is that Daemon object is passed to
the NewEgressGatewayManager method before its k8sCachesSynced
channel is properly initialized. This in turn causes the
WaitUntilK8sCacheIsSynced method to never return.

Since NewEgressGatewayManager must be called before that channel is
initialized, we need to switch to a polling approach, where the
k8sCachesSynced is checked periodically.

@jibi jibi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.10 feature/egress-gateway Impacts the egress IP gateway feature. labels Dec 22, 2021
@jibi jibi requested review from a team and kkourt December 22, 2021 14:14
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.7 Dec 22, 2021
Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
(Not a big fun of using Sleep, but I can't think of a better approach at this moment.)

if manager.k8sCacheSyncedChecker.K8sCacheIsSynced() {
break
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking comment: If there is a bug that does not allow the cache to be synced, we would get stuck in an infinite loop. Should we add an info message here? We can print it after N iterations if we think it would be too verbose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are already logging this info in the watchers package:

go func() {
select {
case <-cachesSynced:
log.Info("All pre-existing resources have been received; continuing")
case <-time.After(option.Config.K8sSyncTimeout):
log.Fatal("Timed out waiting for pre-existing resources to be received; exiting")
}
}()

so probably no need to log it also here

@jibi
Copy link
Member Author

jibi commented Dec 23, 2021

(Not a big fun of using Sleep, but I can't think of a better approach at this moment.)

Agree 👍 the alternative would be to use a time.Ticker (which was my initial approach before switching to WaitUntilK8sCacheIsSynced, see #17813 (comment)) but that would just make the logic more verbose without any benefit, as you need to Stop() it once you are done

@jibi jibi force-pushed the pr/jibi/fix-egressgw-initial-reconciliation branch from f04be2b to 91879e2 Compare December 23, 2021 09:02
When a new egress gateway manager is created, it will wait for the k8s
cache to be fully synced before running the first reconciliation.

Currently the logic is based on the WaitUntilK8sCacheIsSynced method
of the Daemon object, which waits on the k8sCachesSynced channel to be
closed (which indicates that the cache has been indeed synced).

The issue with this approach is that Daemon object is passed to
the NewEgressGatewayManager method _before_ its k8sCachesSynced
channel is properly initialized. This in turn causes the
WaitUntilK8sCacheIsSynced method to never return.

Since NewEgressGatewayManager must be called before that channel is
initialized, we need to switch to a polling approach, where the
k8sCachesSynced is checked periodically.

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi force-pushed the pr/jibi/fix-egressgw-initial-reconciliation branch from 91879e2 to f9140cd Compare January 4, 2022 10:27
@jibi
Copy link
Member Author

jibi commented Jan 4, 2022

/test

@jibi
Copy link
Member Author

jibi commented Jan 5, 2022

l4lb is failing consistently (will be fixed with #18370)

marking as ready to merge

@jibi jibi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 5, 2022
@pchaigno pchaigno merged commit ab9bfd7 into master Jan 5, 2022
@pchaigno pchaigno deleted the pr/jibi/fix-egressgw-initial-reconciliation branch January 5, 2022 17:33
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.7 Jan 12, 2022
@jibi jibi mentioned this pull request Jan 12, 2022
@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.10 and removed backport-pending/1.11 labels Jan 18, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.7 Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.7
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

None yet

4 participants