feat: switch probes to only startupProbe #4861
Merged
Description of your changes
The probes introduced in #4748 could be too aggressive and lead to undesired restarts under heavier operations, e.g. in my case installing `platform-ref-aws`, if resources are constrained.

The original issue we were trying to solve was that we were seeing some failures in the e2e tests at startup because the webhooks were not yet ready to receive requests. So a `startupProbe` could be enough, although it could still lead to the pod being restarted after the `failureThreshold * periodSeconds` time window. I've set that window to 60s (30 * 2s), but we could be even more permissive and increase the `failureThreshold` if needed. To compensate, I've configured the probe to be just a TCP (`tcpSocket`) probe, which should be less affected even if for some reason the pod is under heavier usage.

I hit this while testing locally on the latest commit, by just applying the following manifest:

I saw a few restarts of the Crossplane pod after that.
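For reference, the probe settings described above could look roughly like the following fragment of a container spec. This is a minimal sketch rather than the exact chart change: the `webhooks` port name is a placeholder assumption, and only the `failureThreshold` and `periodSeconds` values come from this description.

```yaml
# Sketch only: container-level startup probe using the values mentioned above.
startupProbe:
  tcpSocket:
    port: webhooks        # hypothetical port name; use the port the webhook server listens on
  periodSeconds: 2        # probe every 2 seconds
  failureThreshold: 30    # 30 failures * 2s = 60s before the container is restarted
```

Raising `failureThreshold` widens the startup window without changing how often the port is checked.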
With this change I couldn't see any restarts, and even killing the Crossplane pod while it was trying to deploy `platform-ref-aws` didn't result in any issues at restart. It remains to be seen whether this actually solves the flakiness we were seeing.

I have:

- Run `make reviewable` to ensure this PR is ready for review.
- Added or updated unit tests.
- Added or updated e2e tests.
- Linked a PR or a docs tracking issue to document this change.
- Added `backport release-x.y` labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.