chore: remove controller liveness probe #9557

leoluz · 2022-06-01T17:20:53Z

The current controller liveness probe is a no-op: it only checks whether the metrics server is able to receive requests.

When the controller is overloaded reconciling large queues, restarting the Pod is more harmful than letting it keep running. After discussing this with @alexmt, we concluded that the liveness probe should be removed from the controller to prevent it from being restarted.

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>

codecov · 2022-06-01T17:34:19Z

Codecov Report

Merging #9557 (ac8e591) into master (b2fe209) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #9557   +/-   ##
=======================================
  Coverage   45.77%   45.77%           
=======================================
  Files         222      222           
  Lines       26372    26372           
=======================================
  Hits        12072    12072           
  Misses      12651    12651           
  Partials     1649     1649

Impacted Files	Coverage Δ
util/settings/settings.go	`48.16% <0.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2fe209...ac8e591.

34fathombelow · 2022-06-02T12:58:35Z

manifests/base/application-controller/argocd-application-controller-statefulset.yaml

-        livenessProbe:
-          httpGet:
-            path: /healthz
-            port: 8082
-          initialDelaySeconds: 5
-          periodSeconds: 10


Has setting a failureThreshold been considered? This is set to 1 by default. If set to 5, for example, it would take 5 consecutive failures to restart the pod; if a probe succeeds between failures, the failure count is reset. This would give end users the option to adjust it based on their workloads.

        livenessProbe:
          httpGet:
            path: /healthz
            port: 8082
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 5

crenshaw-dev · 2022-06-02

@zachaller, can you remember if we discussed setting a failure threshold?

zachaller · 2022-06-02

We did talk about it. Note that failureThreshold actually defaults to 3, with a minimum of 1. I think if we want to keep the liveness check, we have to bump the timeout to something like 3 to 5 seconds; that is mainly where the issue is. The reasoning for removal, though, was that restarting can make things worse when the timeout is hit because of load.

I do think keeping it and bumping the timeout can be a first pass, but it might eventually still make sense to remove it.

34fathombelow · 2022-06-02

@zachaller You are absolutely right, failureThreshold does default to 3. I think it might also be worth trying failureThreshold=5 together with a higher timeoutSeconds.

leoluz · 2022-06-02

The reasoning for removing it is related to the nature of the controller watch loop versus how the liveness probe is currently implemented. Currently the /healthz endpoint is exposed by the metrics server, which runs in a separate goroutine. This can cause the controller to be restarted for reasons unrelated to the watch loop. When the watch queues are accumulating lots of resource updates, the controller will be busy trying to process them. Killing the controller in this case makes the situation worse, as the queue remains the same after the restart. We noticed this behaviour in one of our internal instances, which led to this discussion challenging the effectiveness of the current liveness probe implementation in the Argo CD controller.

@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?

34fathombelow · 2022-06-02

Got it, thanks for your detailed explanation. I'm fine with this change.

zachaller · 2022-06-02

Thanks for the extra background info as well, @leoluz. I was not aware that the check also runs in a separate goroutine, which, as you mentioned, makes the liveness probe even more useless.

crenshaw-dev

lgtm!

alexmt

LGTM. The liveness probe used to make sense a long time ago, when the controller verified k8s connectivity in the healthz handler. We updated the healthz handler but forgot to remove the liveness probe.

leoluz requested a review from alexmt June 1, 2022 17:21

leoluz requested a review from crenshaw-dev June 1, 2022 17:34

34fathombelow reviewed Jun 2, 2022

leoluz added 3 commits June 2, 2022 10:09

chore: remove controller liveness probe

6e5054c

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>

Trigger build

986d809

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>

Update generated manifests

79ab2c4

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>

leoluz force-pushed the remove-liveness-probe branch from 8600faf to 79ab2c4 Compare June 2, 2022 14:10

trigger build

ac8e591

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>

crenshaw-dev approved these changes Jun 2, 2022

alexmt approved these changes Jun 2, 2022

crenshaw-dev enabled auto-merge (squash) June 2, 2022 18:21

crenshaw-dev merged commit 895f9cc into argoproj:master Jun 2, 2022

saumeya mentioned this pull request Jul 27, 2022

fix: remove liveness probe from application controller statefulset argoproj-labs/argocd-operator#742

Merged

2 tasks

mkilchhofer mentioned this pull request Oct 25, 2022

feat(argo-cd): Upgrade Argo CD to 2.5.0 argoproj/argo-helm#1568

Merged

6 tasks

pdrastil mentioned this pull request Oct 26, 2022

chore(argo-cd): Remove liveness probe from application controller argoproj/argo-helm#1581

Merged

6 tasks
