
Slow healthz and livez endpoints cause liveness and readiness probe failures #5137

Closed
mbrancato opened this issue May 17, 2022 · 6 comments
Labels: kind/bug, lifecycle/rotten, priority/important-soon
Milestone: v1.9

Comments

@mbrancato

Describe the bug:

After upgrading from 1.1 to 1.7.2, I have noticed increased webhook timeouts. We use fluxcd, which triggers frequent reconciliation using server-side dry-run.

Looking at the pod, I can see frequent probe failures, and even restarts with exit code 0.

...
    State:          Running
      Started:      Tue, 17 May 2022 14:19:20 -0400
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 17 May 2022 13:13:00 -0400
      Finished:     Tue, 17 May 2022 14:19:20 -0400
...
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  5m49s (x23 over 80m)  kubelet  Liveness probe failed: Get "http://10.28.11.22:6080/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  44s (x60 over 81m)    kubelet  Readiness probe failed: Get "http://10.28.11.22:6080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Exploring the probes, I started probing the endpoints manually using kubectl port-forward. With some requests, the response could be delayed by a few seconds. The default probe timeout is 1s, so I suspect that is the cause of the failures.
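For reference, roughly how I probed (the namespace and deployment name below are the Helm chart defaults, so adjust if yours differ):

```shell
# Forward the webhook's healthz port (6080, per the probe events above) to localhost.
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080:6080 &

# Time repeated requests to the probe endpoints; anything over 1s already exceeds
# the default probe timeout and is counted as a failure by the kubelet.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w 'livez   %{time_total}s\n' http://127.0.0.1:6080/livez
  curl -s -o /dev/null -w 'healthz %{time_total}s\n' http://127.0.0.1:6080/healthz
done
```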

Expected behaviour:

Faster probe response times.

Steps to reproduce the bug:

Install cert-manager 1.7.2, trigger a webhook validation dry run every minute.
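Something like this approximates what fluxcd's reconciliation does (certificate.yaml here is just a placeholder for any manifest handled by the cert-manager webhook, e.g. a Certificate):

```shell
# Server-side dry-run apply once a minute; each apply goes through the
# cert-manager validating webhook.
while true; do
  kubectl apply --dry-run=server -f certificate.yaml
  sleep 60
done
```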

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.22
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.7.2
  • Install method: helm
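A possible stopgap (untested) while the slow responses are investigated would be to raise the probe timeouts on the webhook Deployment above the 1s default, e.g.:

```shell
# Illustrative only: bump both probe timeouts to 5s. The Deployment name and
# namespace are the Helm chart defaults, and a later helm upgrade would revert
# a manual patch like this.
kubectl -n cert-manager patch deployment cert-manager-webhook --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5}
]'
```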

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 17, 2022
@irbekrm irbekrm added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 1, 2022

irbekrm commented Jun 1, 2022

Thanks for opening the issue.

We actually see some issues with the webhook in our e2e tests too (https://prow.build-infra.jetstack.net/view/gs/jetstack-logs/logs/ci-cert-manager-e2e-v1-23/1533685236070617088), which could potentially be related.

There is also an issue with webhooks reported upstream for newer versions of Kubernetes (kubernetes/kubernetes#109022); not sure yet whether that might be related.


irbekrm commented Jun 1, 2022

/milestone v1.9

@jetstack-bot jetstack-bot added this to the v1.9 milestone Jun 1, 2022
@jetstack-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022
@jetstack-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 4, 2022
@jetstack-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
