Increase default liveness probe timeout #19003
Conversation
In practice the liveness probe can regularly take longer than 5 seconds, so 10 seems like a better default. Because each check now takes twice as long, we can also reduce the check frequency so that we do not waste as many CPU cycles. And to keep the maximum downtime at 5 minutes, we reduce the number of failed checks to 5.
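The values described above could be expressed roughly like this (a sketch only; the timing fields follow the Kubernetes probe spec, and the surrounding probe command is omitted since it is not part of this description):

```yaml
# Sketch of the proposed defaults. Only the timing values come from this
# PR's description; everything else about the probe is unchanged.
livenessProbe:
  timeoutSeconds: 10    # was 5; the probe regularly exceeds 5 seconds
  periodSeconds: 60     # check less often to save CPU cycles
  failureThreshold: 5   # 5 failures x 60s period = 5 minutes max downtime
```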
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
So what's slow about the liveness probe is the import of airflow, which seems to take around 5 seconds. Looking at this contrived example, I thought of an alternative. It seems that another way would be to run a sidecar that runs the health check in a loop in a long-running Python process and, after each success, touches a marker file; the liveness probe then only has to check that file. Here's the referenced example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
```

From that page: the probe succeeds as long as `cat /tmp/healthy` returns 0, and starts failing once the file is removed.

Thoughts?
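The sidecar idea above could be sketched as a second container that keeps refreshing a marker file (a sketch under stated assumptions: `airflow jobs check` is assumed as the health command, and the container name, image tag, and sleep interval are all illustrative, not part of this PR):

```yaml
# Hypothetical sidecar sketch: a long-running loop refreshes /tmp/healthy
# after each successful health check, so the liveness probe itself only
# needs to `cat` the file and stays cheap.
- name: scheduler-health-sidecar       # illustrative name
  image: apache/airflow                # illustrative image
  command: ["/bin/sh", "-c"]
  args:
    - |
      while true; do
        # assumed health command; the chart's real check may differ
        airflow jobs check --job-type SchedulerJob \
          && touch /tmp/healthy
        sleep 30
      done
```

Note that for the main container's probe to see the file, `/tmp` would have to be a shared `emptyDir` volume mounted in both containers, and a robust probe would check the file's mtime rather than mere existence, so a wedged sidecar is also detected.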
It's pretty similar (and semi-related) to an issue for workers I opened a while back: #17191. I think having an endpoint makes sense, and we toggle whether the log endpoint works or not based on the Executor.
Hi @dstandish and @jedcunningham, I'm running Helm chart 1.2 and manually increased the limits to the values that were in this PR, and I still had the problem:

```yaml
timeoutSeconds: 10
failureThreshold: 5
periodSeconds: 60
```

I have set higher thresholds and now my probes are passing:

```yaml
livenessProbe:
  initialDelaySeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
  periodSeconds: 60
```

If you agree, I suggest raising the timeout even more, or documenting the slow probe in the documentation. Thanks,
In practice the liveness probe can regularly take longer than 5 seconds.
10 seems like a better default.
Because it seems to take longer than perhaps initially expected, it seems reasonable to reduce the check frequency so it does not consume as much CPU. Though perhaps there's a reason for checking more frequently, so whatever you think on that...
To keep the max downtime to 5 minutes after increasing the check period, I reduce the number of failed checks to 5.
closes #19001