
Slow healthz and livez endpoints cause liveness and readiness probe failures #5137

Closed
mbrancato opened this issue May 17, 2022 · 6 comments
Labels: kind/bug, lifecycle/rotten, priority/important-soon
Milestone: v1.9

Comments

@mbrancato

Describe the bug:

After upgrading from 1.1 to 1.7.2, I have noticed increased webhook timeouts. We use fluxcd, which triggers frequent reconciliation using server-side dry-run.

Looking at the pod, I can see frequent probe failures, and even restarts with exit code 0.

...
    State:          Running
      Started:      Tue, 17 May 2022 14:19:20 -0400
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 17 May 2022 13:13:00 -0400
      Finished:     Tue, 17 May 2022 14:19:20 -0400
...
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  5m49s (x23 over 80m)  kubelet  Liveness probe failed: Get "http://10.28.11.22:6080/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  44s (x60 over 81m)    kubelet  Readiness probe failed: Get "http://10.28.11.22:6080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Exploring the probes, I started probing the endpoints manually using kubectl port-forward. With some requests, the response could be delayed by a few seconds. The default probe timeout is 1s, so I suspect that is the cause of the failures.
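For reference, roughly how I probed (the namespace and deployment name below are the Helm chart defaults, so adjust if yours differ):

```shell
# Forward the webhook's healthz port (6080, per the probe events above) to localhost.
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080:6080 &

# Time repeated requests to the probe endpoints; anything over 1s already exceeds
# the default probe timeout and is counted as a failure by the kubelet.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w 'livez   %{time_total}s\n' http://127.0.0.1:6080/livez
  curl -s -o /dev/null -w 'healthz %{time_total}s\n' http://127.0.0.1:6080/healthz
done
```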

Expected behaviour:

Faster probe response times.

Steps to reproduce the bug:

Install cert-manager 1.7.2, trigger a webhook validation dry run every minute.
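Something like this approximates what fluxcd's reconciliation does (certificate.yaml here is just a placeholder for any manifest handled by the cert-manager webhook, e.g. a Certificate):

```shell
# Server-side dry-run apply once a minute; each apply goes through the
# cert-manager validating webhook.
while true; do
  kubectl apply --dry-run=server -f certificate.yaml
  sleep 60
done
```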

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.22
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.7.2
  • Install method: helm
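A possible stopgap (untested) while the slow responses are investigated would be to raise the probe timeouts on the webhook Deployment above the 1s default, e.g.:

```shell
# Illustrative only: bump both probe timeouts to 5s. The Deployment name and
# namespace are the Helm chart defaults, and a later helm upgrade would revert
# a manual patch like this.
kubectl -n cert-manager patch deployment cert-manager-webhook --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5}
]'
```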

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 17, 2022
@irbekrm irbekrm added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 1, 2022

irbekrm commented Jun 1, 2022

Thanks for opening the issue.

We actually see some issues with the webhook in our e2e tests too (https://prow.build-infra.jetstack.net/view/gs/jetstack-logs/logs/ci-cert-manager-e2e-v1-23/1533685236070617088), which could potentially be related.

There is also an issue with webhooks reported upstream for newer versions of Kubernetes (kubernetes/kubernetes#109022); not sure yet whether that might be related.


irbekrm commented Jun 1, 2022

/milestone v1.9

@jetstack-bot jetstack-bot added this to the v1.9 milestone Jun 1, 2022
@jetstack-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022
@jetstack-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 4, 2022
@jetstack-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
