
Increase default liveness probe timeout #19003

Merged

merged 1 commit into apache:main from dstandish:increase-liveness-timeout on Oct 15, 2021

Conversation

dstandish
Contributor

@dstandish dstandish commented Oct 15, 2021

In practice the liveness probe can regularly take longer than 5 seconds.

10 seems like a better default.

Because it seems to take longer than initially expected, it also seems reasonable to reduce the check frequency so it doesn't consume as much CPU. Though perhaps there's a reason for checking more frequently, so whatever you think on that...

To keep the max downtime at 5 minutes after increasing the check period, I reduced the number of failed checks to 5.

closes #19001
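
For reference, the settings this change lands on, sketched as the chart's scheduler.livenessProbe values (the key path is my assumption here; the numbers themselves are quoted later in this thread):

scheduler:
  livenessProbe:
    timeoutSeconds: 10      # up from 5; the check imports airflow, which alone takes ~5s
    periodSeconds: 60       # probe less often so the check burns less CPU
    failureThreshold: 5     # 5 failures x 60s period = ~5 minutes max downtime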

@boring-cyborg boring-cyborg bot added the area:helm-chart (Airflow Helm Chart) label Oct 15, 2021
Member

@kaxil kaxil left a comment


@github-actions github-actions bot added the okay to merge (It's ok to merge this PR as it does not require more tests) label Oct 15, 2021
@github-actions

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@jedcunningham jedcunningham merged commit 866c764 into apache:main Oct 15, 2021
@dstandish
Contributor Author

dstandish commented Oct 15, 2021

So what's slow about the liveness probe is the import of airflow itself, which seems to take around 5 seconds.
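
A quick way to see this for yourself, from a shell inside the scheduler image:

time python -c "import airflow"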

Is the "right" way to do this to add a scheduler-health endpoint? I guess from a separation-of-concerns standpoint maybe that's not possible, since the API probably runs on the webserver, and we may not want scheduler health to depend on webserver health.

Looking at this contrived example, I thought of an alternative: run a sidecar with a long-running python process that executes the health check in a loop. After a successful check it ensures a scheduler-health file exists; after a failure it removes the file (or we could write the status into an always-there file). The scheduler liveness probe could then just be cat /scheduler-health, with no need to import airflow. This seems like a lighter-weight solution overall, with more predictable timing, since airflow would only be imported at startup and the process would be sleeping most of the time, though it would use more memory. It is hackier than a scheduler-health endpoint, and it requires a long-running sidecar. (A sketch follows after the example below.)

Here's the referenced example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

From that page:

For the first 30 seconds of the container's life, there is a /tmp/healthy file. So during the first 30 seconds, the command cat /tmp/healthy returns a success code. After 30 seconds, cat /tmp/healthy returns a failure code.
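
Adapting that pattern to the scheduler, here's a rough sketch of the sidecar idea. Everything in it is illustrative (container names, image tag, intervals), and airflow jobs check is assumed to be available as the underlying health check (it is in recent Airflow 2.x CLIs). Note that a fully faithful version would keep one long-running python process so airflow is imported only once; this shell loop still re-imports each cycle, just off the probe's critical path:

apiVersion: v1
kind: Pod
metadata:
  name: airflow-scheduler
spec:
  containers:
  - name: scheduler
    image: apache/airflow:2.2.0
    args: ["scheduler"]
    livenessProbe:
      exec:
        command: ["cat", "/probe/scheduler-health"]  # constant-time; no airflow import
      initialDelaySeconds: 10
      periodSeconds: 30
      failureThreshold: 5
    volumeMounts:
    - name: probe
      mountPath: /probe
  - name: health-checker            # hypothetical sidecar name
    image: apache/airflow:2.2.0     # same image as the scheduler; tag illustrative
    command:
    - /bin/sh
    - -c
    # On success, ensure the marker file exists; on failure, remove it.
    - |
      while true; do
        if airflow jobs check --job-type SchedulerJob --hostname "$(hostname)"; then
          touch /probe/scheduler-health
        else
          rm -f /probe/scheduler-health
        fi
        sleep 30
      done
    volumeMounts:
    - name: probe
      mountPath: /probe
  volumes:
  - name: probe
    emptyDir: {}

The emptyDir volume gives both containers a shared location for the marker file, so the probe itself is a constant-time cat with the predictable timing described above.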

Thoughts?

@dstandish dstandish deleted the increase-liveness-timeout branch October 15, 2021 20:52
@jedcunningham
Member

It's pretty similar (and semi-related) to an issue for workers I opened a while back: #17191

I think having an endpoint makes sense, and we toggle whether the log endpoint works or not based on the Executor.
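
If such an endpoint existed, the liveness probe could become a plain httpGet, avoiding the exec-and-import cost entirely. A sketch, with a hypothetical path and port:

livenessProbe:
  httpGet:
    path: /health     # hypothetical scheduler health endpoint
    port: 8974        # hypothetical port for a scheduler health server
  periodSeconds: 30
  failureThreshold: 5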

@andormarkus
Contributor

Hi @dstandish and @jedcunningham

I'm running on Helm chart 1.2 and manually increased the limits to the values that were in this PR, and I still got problems.

    timeoutSeconds: 10
    failureThreshold: 5
    periodSeconds: 60

I set higher thresholds and now my probes are passing:

  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 30
    failureThreshold: 5
    periodSeconds: 60

If you agree, I suggest raising the default timeout even more, or documenting the slow probe.

Thanks,
Andor

Labels
area:helm-chart (Airflow Helm Chart), okay to merge (It's ok to merge this PR as it does not require more tests)
Development

Successfully merging this pull request may close these issues.

Slow liveness probe causes frequent restarts (scheduler and triggerer)
5 participants