chore: remove controller liveness probe #9557
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9557 +/- ##
=======================================
Coverage 45.77% 45.77%
=======================================
Files 222 222
Lines 26372 26372
=======================================
Hits 12072 12072
Misses 12651 12651
Partials 1649 1649
Continue to review full report at Codecov.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8082
  initialDelaySeconds: 5
  periodSeconds: 10
Has setting a failureThreshold been considered? This is set to 1 by default. If set to 5, for example, it would take 5 consecutive failures to restart the pod, and a successful probe between failures resets the counter. This would give the end user a choice to make an adjustment based on their workloads.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8082
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 5
@zachaller can you remember if we discussed setting a failure threshold?
So we did talk about it. Also, failureThreshold actually defaults to 3, with a minimum of 1. I think if we want to keep the liveness check we have to bump timeoutSeconds to something like 3 to 5 seconds; that is mainly where the issue is. The reasoning for removal was that restarting can make things worse: under load conditions, the timeout gets hit because of the load itself.
I do think keeping it and bumping the timeout could be a first pass, but it might eventually still make sense to remove it.
@zachaller You are absolutely right, failureThreshold does default to 3. I think it might also be worth trying to set failureThreshold: 5 and bumping up timeoutSeconds.
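Under that proposal, the probe might look something like this (the values are illustrative suggestions from this thread, not agreed-upon defaults):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8082
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5    # bumped from the 1s default so load spikes don't fail the probe
  failureThreshold: 5  # require 5 consecutive failures before a restart
```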
The reasoning for removing it relates to the nature of the controller's watch loop versus how the liveness probe is currently implemented. Currently the /healthz endpoint is exposed by the metrics server, which runs in a separate goroutine. This can cause the controller to be restarted for reasons unrelated to the watch loop. When the watch queues accumulate lots of resource updates, the controller is busy processing them; killing the controller in that state makes the situation worse, as the queue is just as large after the restart. We noticed this behaviour in one of our internal instances, which led to this discussion challenging the effectiveness of the current liveness probe implementation in the Argo CD controller.
@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?
Got it, thanks for your detailed explanation. I'm fine with this change.
Thanks for the extra background info as well @leoluz. I was not aware that the check also runs in a separate goroutine, which, as you mentioned, makes the liveness probe even less useful.
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
lgtm!
LGTM. The liveness probe used to make sense a long time ago: the controller used to verify k8s connectivity in the healthz handler. We updated the healthz handler but forgot to remove the liveness probe.
The current controller liveness probe is a no-op that just checks whether the metrics server is able to receive requests.
When the controller is overloaded reconciling large queues, restarting the Pod is more harmful than letting it run. After discussing with @alexmt, we concluded that the liveness probe should be removed from the controller to prevent it from being restarted.
Signed-off-by: Leonardo Luz Almeida leonardo_almeida@intuit.com
Note on DCO:
If the DCO action in the integration test fails, one or more of your commits are not signed off. Please click on the Details link next to the DCO action for instructions on how to resolve this.