
Increase default liveness probe timeout #19003

Merged

merged 1 commit into apache:main from dstandish:increase-liveness-timeout on Oct 15, 2021

Conversation

dstandish
Contributor

@dstandish dstandish commented Oct 15, 2021

In practice the liveness probe can regularly take longer than 5 seconds.

10 seems like a better default.

Because it seems to take longer than initially expected, it also seems reasonable to reduce the check frequency so it doesn't consume as much CPU. Though perhaps there's a reason for checking more frequently, so whatever you think on that...

To keep the max downtime at 5 minutes after increasing the check period, I reduced the number of failed checks to 5.

closes #19001
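
For reference, the settings this change lands on, sketched as the chart's scheduler.livenessProbe values (the key path is my assumption here; the numbers themselves are quoted later in this thread):

scheduler:
  livenessProbe:
    timeoutSeconds: 10      # up from 5; the check imports airflow, which alone takes ~5s
    periodSeconds: 60       # probe less often so the check burns less CPU
    failureThreshold: 5     # 5 failures x 60s period = ~5 minutes max downtime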

@boring-cyborg boring-cyborg bot added the area:helm-chart (Airflow Helm Chart) label Oct 15, 2021
Member

@kaxil kaxil left a comment


@github-actions github-actions bot added the okay to merge (It's ok to merge this PR as it does not require more tests) label Oct 15, 2021
@github-actions

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@jedcunningham jedcunningham merged commit 866c764 into apache:main Oct 15, 2021
@dstandish
Contributor Author

dstandish commented Oct 15, 2021

So what's slow about the liveness probe is the import of airflow itself, which seems to take around 5 seconds.
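
A quick way to see this for yourself, from a shell inside the scheduler image:

time python -c "import airflow"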

Is the "right" way to do this to add a scheduler-health endpoint? I guess from a separation-of-concerns standpoint maybe that's not possible, since the API probably runs on the webserver, and we may not want scheduler health to depend on webserver health.

Looking at this contrived example, I thought of an alternative: run a sidecar with a long-running python process that executes the health check in a loop. After a successful check it ensures a scheduler-health file exists; after a failure it removes the file (or we could write the status into an always-there file). The scheduler liveness probe could then just be cat /scheduler-health, with no need to import airflow. This seems like a lighter-weight solution overall, with more predictable timing, since airflow would only be imported at startup and the process would be sleeping most of the time, though it would use more memory. It is hackier than a scheduler-health endpoint, and it requires a long-running sidecar. (A sketch follows after the example below.)

Here's the referenced example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

From that page:

For the first 30 seconds of the container's life, there is a /tmp/healthy file. So during the first 30 seconds, the command cat /tmp/healthy returns a success code. After 30 seconds, cat /tmp/healthy returns a failure code.
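
Adapting that pattern to the scheduler, here's a rough sketch of the sidecar idea. Everything in it is illustrative (container names, image tag, intervals), and airflow jobs check is assumed to be available as the underlying health check (it is in recent Airflow 2.x CLIs). Note that a fully faithful version would keep one long-running python process so airflow is imported only once; this shell loop still re-imports each cycle, just off the probe's critical path:

apiVersion: v1
kind: Pod
metadata:
  name: airflow-scheduler
spec:
  containers:
  - name: scheduler
    image: apache/airflow:2.2.0
    args: ["scheduler"]
    livenessProbe:
      exec:
        command: ["cat", "/probe/scheduler-health"]  # constant-time; no airflow import
      initialDelaySeconds: 10
      periodSeconds: 30
      failureThreshold: 5
    volumeMounts:
    - name: probe
      mountPath: /probe
  - name: health-checker            # hypothetical sidecar name
    image: apache/airflow:2.2.0     # same image as the scheduler; tag illustrative
    command:
    - /bin/sh
    - -c
    # On success, ensure the marker file exists; on failure, remove it.
    - |
      while true; do
        if airflow jobs check --job-type SchedulerJob --hostname "$(hostname)"; then
          touch /probe/scheduler-health
        else
          rm -f /probe/scheduler-health
        fi
        sleep 30
      done
    volumeMounts:
    - name: probe
      mountPath: /probe
  volumes:
  - name: probe
    emptyDir: {}

The emptyDir volume gives both containers a shared location for the marker file, so the probe itself is a constant-time cat with the predictable timing described above.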

Thoughts?

@dstandish dstandish deleted the increase-liveness-timeout branch October 15, 2021 20:52
@jedcunningham
Member

It's pretty similar (and semi-related) to an issue for workers I opened a while back: #17191

I think having an endpoint makes sense, and we toggle whether the log endpoint works or not based on the Executor.
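
If such an endpoint existed, the liveness probe could become a plain httpGet, avoiding the exec-and-import cost entirely. A sketch, with a hypothetical path and port:

livenessProbe:
  httpGet:
    path: /health     # hypothetical scheduler health endpoint
    port: 8974        # hypothetical port for a scheduler health server
  periodSeconds: 30
  failureThreshold: 5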

@andormarkus
Contributor

Hi @dstandish and @jedcunningham

I'm running on Helm chart 1.2 and manually increased the limits to the values that were in this PR, and I still got problems.

    timeoutSeconds: 10
    failureThreshold: 5
    periodSeconds: 60

I set higher thresholds and now my probes are passing:

  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 30
    failureThreshold: 5
    periodSeconds: 60

If you agree, I suggest raising the default timeout even more, or documenting the slow probe.

Thanks,
Andor

Labels
area:helm-chart (Airflow Helm Chart), okay to merge (It's ok to merge this PR as it does not require more tests)
Development

Successfully merging this pull request may close these issues.

Slow liveness probe causes frequent restarts (scheduler and triggerer)
5 participants