Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k8s: Provide fallback for EndpointSlices detection if discovery API is not available #11253

Merged
merged 2 commits into from
May 4, 2020

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Apr 30, 2020

Discovery of API groups requires the API services of the apiserver to be healthy. Such API services can depend on the readiness of regular pods which require Cilium to function correctly. By treating failure to discover API groups as fatal, a critial loop can be entered in which Cilium cannot start because the API groups can't be discovered and the API groups will only become discoverable once Cilium is up.

Warn about the lack of discovery ability and fall back to probing the API directly.

@tgraf tgraf added kind/bug This is a bug in the Cilium logic. pending-review release-note/bug This PR fixes an issue in a previous release of Cilium. labels Apr 30, 2020
@tgraf tgraf requested a review from a team April 30, 2020 10:40
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Apr 30, 2020
Copy link
Contributor

@errordeveloper errordeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

nit: could you clarify if Update would get called again? The description comment for this function doesn't necessarily suggest what is the case... Perhaps copying the info from commit message into a comment would be quite helpful also, especially as this is a pretty special case.

@coveralls
Copy link

coveralls commented Apr 30, 2020

Coverage Status

Coverage decreased (-0.02%) to 44.428% when pulling 9f48402 on pr/tgraf/server-version-bootstrap into 3ecb413 on master.

@tgraf
Copy link
Member Author

tgraf commented Apr 30, 2020

nit: could you clarify if Update would get called again?

I'll update the comment. It doesn't really make sense to call Update() again as the capabilities are primarily used during bootstrapping. What could be possible is to keep running to stabilize and then restart a couple of minutes later to detect capabilities again.

@tgraf tgraf force-pushed the pr/tgraf/server-version-bootstrap branch from 67f41e6 to 51b78b6 Compare April 30, 2020 13:47
@tgraf tgraf requested review from Weil0ng and aanm April 30, 2020 13:47
@tgraf
Copy link
Member Author

tgraf commented Apr 30, 2020

test-me-please

00:08:31.763  failed to get destination image "sha256:799bab65f89532154bfa2d34fc49bc036f88c52f9fa63864080714f65d9ac37a": No such image: sha256:799bab65f89532154bfa2d34fc49bc036f88c52f9fa63864080714f65d9ac37a

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Thomas offline. This requires a flag that does the discovery on a conditional basis

@tgraf tgraf force-pushed the pr/tgraf/server-version-bootstrap branch from 51b78b6 to eb53fe1 Compare May 1, 2020 09:35
@tgraf tgraf requested a review from a team May 1, 2020 09:35
@tgraf tgraf requested review from a team as code owners May 1, 2020 09:35
@tgraf tgraf changed the title k8s: Don't treat failure discover API groups as fatal k8s: Provide fallback for EndpointSlices detection if discovery API is not available May 1, 2020
@tgraf
Copy link
Member Author

tgraf commented May 1, 2020

test-me-please

@tgraf tgraf force-pushed the pr/tgraf/server-version-bootstrap branch from eb53fe1 to 46403af Compare May 1, 2020 12:46
@tgraf
Copy link
Member Author

tgraf commented May 1, 2020

test-me-please

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair I only noticed the typos because I opened the PR on my IDE.

pkg/k8s/version/version.go Outdated Show resolved Hide resolved
pkg/k8s/version/version.go Outdated Show resolved Hide resolved
pkg/k8s/version/version.go Outdated Show resolved Hide resolved
// K8sAPIDiscoveryEnabled returns true if API discovery of API groups and
// resources is enabled
func (d *DefaultConfiguration) K8sAPIDiscoveryEnabled() bool {
return defaults.K8sEnableAPIDiscovery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? Shouldn't this be the value specified by the user?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this represents the default configuration when no user options are available. This also clearly documents in which cases that happens instead of just relying on pkg/option which is uninitialized in several cases.

pkg/k8s/version/version.go Outdated Show resolved Hide resolved
pkg/k8s/version/version.go Outdated Show resolved Hide resolved
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI/docs changes LGTM.

tgraf added 2 commits May 3, 2020 13:38
Discovery of API groups requires the API services of the apiserver to be
healthy. Such API services can depend on the readiness of regular pods
which require Cilium to function correctly. By treating failure to
discover API groups as fatal, a critial loop can be entered in which
Cilium cannot start because the API groups can't be discovered and th
API groups will only become discoverable once Cilium is up.

Warn about the lack of discovery ability and fall back to probing the
API directly.

Signed-off-by: Thomas Graf <thomas@cilium.io>
The discovery request consists of several API calls to the apiserver. It
has already been observed that the k8s client can stsrt throttling
during startup phase.

Disable the use of the discovery API by default and rely on individual
API probing but allow users to benefit from it by opting into it.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/server-version-bootstrap branch from 46403af to 9f48402 Compare May 3, 2020 11:44
@tgraf tgraf requested a review from aanm May 3, 2020 11:44
@tgraf
Copy link
Member Author

tgraf commented May 3, 2020

test-me-please

@tgraf tgraf merged commit 6fcf0cd into master May 4, 2020
1.8.0 automation moved this from In progress to Merged May 4, 2020
@tgraf tgraf deleted the pr/tgraf/server-version-bootstrap branch May 4, 2020 08:16
@joestringer
Copy link
Member

joestringer commented May 5, 2020

@tgraf @aanm did we intend to mark this for backport to v1.7?

Context:
#11233 (comment)

Not sure if this is the right place or if it is too late to ask, but can we backport #11253 too? Without it #11206 might cause cilium to not start correctly on environments like GKE (where discovery API would fail).

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.4 May 6, 2020
@aanm
Copy link
Member

aanm commented May 6, 2020

@joestringer yes, it should be marked for backport.

@vadorovsky
Copy link
Member

I will backport it in #11233

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.4 May 6, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.4 May 13, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.4 May 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.7.4
Backport done to v1.7
1.8.0
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

7 participants