Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: switch probes to only startupProbe #4861

Merged
merged 1 commit into from
Oct 23, 2023

Conversation

phisco
Copy link
Contributor

@phisco phisco commented Oct 20, 2023

Description of your changes

The probes introduced in #4748 could be too aggressive and lead to some undesired restarts under heavier operations, e.g. in my case installing platform-ref-aws, if the resources are constrained.

The original issue we were trying to solve was that we were seeing some failures in the e2es at startup due to the webhooks not being yet ready to receive requests. So, a startupProbe could be enough, although it could lead to the pod being restarted after the failureThreshold * periodSeconds time window, I've set it to be 60s (30*2s), but we could be even more permissive and increase the failureThreshold if needed. To compensate, I've configured the probe to be just a tcpProbe, which should be less affected even if for some reason the pod was under heavier usage.

I hit this while testing it locally on the latest commit, by just applying the following manifest:

apiVersion: pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: platform-ref-aws
spec:
  package: xpkg.upbound.io/upbound/platform-ref-aws:v0.6.0

Seeing a few restarts of the Crossplane pod after that.

With this change I couldn't see any restart, and even killing the crossplane pod while it was trying to deploy platform-ref-aws didn't result in any issue at restart. Has to be seen whether it's actually solving the flakiness we were seeing.

I have:

Need help with this checklist? See the cheat sheet.

@phisco phisco requested a review from a team as a code owner October 20, 2023 14:51
@phisco phisco requested a review from bobh66 October 20, 2023 14:51
@phisco
Copy link
Contributor Author

phisco commented Oct 20, 2023

NB: I've tried disabling schema aware validation but nothing changed

@phisco
Copy link
Contributor Author

phisco commented Oct 20, 2023

3 runs all green except a single failure for the realtime-composition job, which is flapping also on master. So looks like is enough to address the issue we were originally trying to resolve.

// Add default health probe
if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
return errors.Wrap(err, "cannot create ping health check")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we leave these, even if we don't use them for now? I could see them being handy in future. Or even just handy so that folks can configure their Helm chart to use them if they want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, make sense to me, reverted these changes

Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>
@negz negz merged commit 59f4aa1 into crossplane:master Oct 23, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants