Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: remove controller liveness probe #9557

Merged
merged 4 commits into from Jun 2, 2022

Conversation

leoluz
Copy link
Collaborator

@leoluz leoluz commented Jun 1, 2022

Current controller liveness probe is a noop just checking if the metric-server is able to receive requests.

In cases when the controller is overloaded reconciling large queues restarting the Pod is more harmful than letting it running. In discussion with @alexmt we understand that liveness probe should be removed from the controller to prevent it from being restarted.

Signed-off-by: Leonardo Luz Almeida leonardo_almeida@intuit.com

Note on DCO:

If the DCO action in the integration test fails, one or more of your commits are not signed off. Please click on the Details link next to the DCO action for instructions on how to resolve this.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • Optional. My organization is added to USERS.md.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).

@leoluz leoluz requested a review from alexmt June 1, 2022 17:21
@codecov
Copy link

codecov bot commented Jun 1, 2022

Codecov Report

Merging #9557 (ac8e591) into master (b2fe209) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #9557   +/-   ##
=======================================
  Coverage   45.77%   45.77%           
=======================================
  Files         222      222           
  Lines       26372    26372           
=======================================
  Hits        12072    12072           
  Misses      12651    12651           
  Partials     1649     1649           
Impacted Files Coverage Δ
util/settings/settings.go 48.16% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2fe209...ac8e591. Read the comment docs.

@leoluz leoluz requested a review from crenshaw-dev June 1, 2022 17:34
Comment on lines -139 to -144
livenessProbe:
httpGet:
path: /healthz
port: 8082
initialDelaySeconds: 5
periodSeconds: 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has setting a failureThreshold been considered? This is set to 1 by default. If set to 5 for example it would need 5 failures to restart the pod. If a probe is successful between failures the failureThreshold is reset. This would give the EU a choice to make an adjustment based on their workloads.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8082
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 5

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachaller can you remember if we discussed setting a failure threshold?

Copy link
Contributor

@zachaller zachaller Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we did talk about it, also failure threshold actually defaults to 3 with a minimum of 1. I think if we want to keep the liveness check we have to bump the timeout to something like 3 to 5 seconds. That is mainly where the issue is at but the reasoning for removal was because restarting can make things worse under load conditions when the timeout is hit due to load.

I do think keeping it and bumping timeout can be a first pass but it might eventually still make sense to remove.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachaller You are absolutely right failureThreshold does default to 3. I think it also might be worth trying to set failureThreshol=5 and bumping up the timeoutSeconds.

Copy link
Collaborator Author

@leoluz leoluz Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning about removing it is related to the nature of the controller watch loop versus how the liveness probe is currently implemented. Currently the /healthz endpoint is exposed by the metrics server which runs in a separate goroutine. This can cause the controller to be restarted by reasons unrelated to the watch loop. In cases when the watch queues are accumulating lots of resource updates the controller will be busy trying to process them. Killing the controller in this case makes the situation worse as the queue remains the same after the restart. We noticed this behaviour in one of our internal instances which led to this discussion challenging the effectiveness of the current liveness probe implementation in ArgoCD controller.

@34fathombelow do you have a real use case where the current liveness-probe implementation is useful?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for your detailed explanation. I'm fine with this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra background info as well @leoluz I was not aware that the check was also in a seperate go routine which makes the liveness prob as you mentioned even more useless.

Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Signed-off-by: Leonardo Luz Almeida <leonardo_almeida@intuit.com>
Copy link
Collaborator

@crenshaw-dev crenshaw-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Collaborator

@alexmt alexmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Liveness probe use to make sense a long time ago. Controller used to verify k8s connectivity in healthz handler. We updated healthz handler but forgot to remove liveness probe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants