Slow liveness probe causes frequent restarts (scheduler and triggerer) #19001
Comments
Thank you so much for this PR, but it seems that the issue is still there.
In my `values.yaml` I have:

```yaml
# Airflow scheduler settings
scheduler:
  # If the scheduler stops heartbeating for 5 minutes (5*60s) kill the
  # scheduler and let Kubernetes restart it
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 10
    failureThreshold: 5
    periodSeconds: 60
  # Airflow 2.0 allows users to run multiple schedulers,
  # however this feature is only recommended for MySQL 8+ and Postgres
  replicas: 1
```

And I am using Helm chart v1.3.0 |
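As an aside on how these settings interact: Kubernetes only restarts the container after `failureThreshold` consecutive probe failures, spaced `periodSeconds` apart. A minimal sketch of that arithmetic (the function name is made up for illustration, and this simplified model ignores the probe's own run time):

```python
# Simplified model: Kubernetes restarts a container whose liveness probe
# keeps failing only after `failure_threshold` consecutive failures,
# spaced `period_seconds` apart, following `initial_delay_seconds`.
def seconds_until_restart(period_seconds: int, failure_threshold: int,
                          initial_delay_seconds: int = 0) -> int:
    return initial_delay_seconds + period_seconds * failure_threshold

# With the values above: 10 + 60 * 5 = 310 seconds, roughly the
# "5 minutes" mentioned in the config comment.
print(seconds_until_restart(60, 5, 10))  # prints 310
```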
@V0lantis, you can try bumping your |
Thank you @jedcunningham, it seems to be working for now. I set the value to 30 seconds; I hope it is not too much. |
For the triggerer, I have the following config:

```yaml
triggerer:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 30
    failureThreshold: 5
    periodSeconds: 60
```

I have exactly the same for the scheduler, unfortunately, and it is still restarting. Should I increase the |
@V0lantis, you wouldn't happen to still be on 2.2.0? There was an issue with the CLI command used in the health check in that version. |
Hello @jedcunningham, thank you for getting back to me! Here are my new parameters for the scheduler and triggerer:
I am using 2 replicas for most of my pods now, hoping that it will avoid any downtime :) |
3 minutes, yeah, definitely shouldn't be taking anywhere near that long! Can you try exec'ing into the pod and timing the command? (The probe command is defined in airflow/chart/templates/scheduler/scheduler-deployment.yaml, lines 160 to 178 at c20ad79.)
As for what is "too long", well, that's kinda up to you. If you are on slower and/or busy hardware it might need to be higher than 30s, but it doesn't really "do" much either (start python, load modules, query the db). |
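One way to time the probe from inside the pod, if the shell's `time` builtin is not handy, is a small Python stopwatch around `subprocess.run` (a sketch; in the pod you would replace the placeholder command with the actual probe invocation, e.g. `python -Wignore tmp.py`):

```python
import subprocess
import sys
import time

def time_command(argv):
    """Run a command and report its exit code and wall-clock duration."""
    start = time.monotonic()
    result = subprocess.run(argv)
    return result.returncode, time.monotonic() - start

# Placeholder command standing in for the liveness-probe script.
code, secs = time_command([sys.executable, "-c", "pass"])
print(f"exit={code} took={secs:.2f}s")
```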
There is definitely a mistake on my side, since the following command:

```bash
cat <<EOF > tmp.py
import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()
    sys.exit(0 if job.is_alive() else 1)
EOF
/entrypoint python -Wignore tmp.py
```

returns:

```
Traceback (most recent call last):
  File "/opt/airflow/tmp.py", line 5, in <module>
    from airflow.jobs.scheduler_job import SchedulerJob
ModuleNotFoundError: No module named 'airflow'
```

I am using a custom image to import private dependencies. Therefore, my Airflow image is defined by:

```dockerfile
FROM python:3.9-slim AS compile-image
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev
RUN python -m venv /opt/venv
# Make sure we use the virtualenv:
ENV PATH="/opt/venv/bin:$PATH"
RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    -r requirements.txt

FROM apache/airflow:2.2.1-python3.9
COPY --from=compile-image --chown=airflow /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
ENV PYTHONPATH="/opt/venv/lib/python3.9/site-packages/:$PYTHONPATH"
```

I'll try with the normal deployment setup.

**Update**

Thank you for your comment. Since it is not normal that I cannot import airflow from the CLI, I looked back at how I defined my image and updated it following Airflow's guidelines. I can now time the command.
I think the root cause was my image. I am going to set the livenessProbe back to its previous values. Here is the new definition of my image:

```dockerfile
FROM apache/airflow:2.2.1-python3.9
USER root
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    -r requirements.txt
USER airflow
```

**Final update**

That was definitely the reason. I wonder why it was working correctly in my DAGs 🤔 |
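A broken image like the first one above can be caught before deployment with a tiny import check run during the image build or in CI (a sketch; `importlib.util.find_spec` probes `sys.path` without fully importing the package, and in practice you would check `"airflow"`):

```python
import importlib.util

def is_importable(name: str) -> bool:
    """True if `name` can be found on the current sys.path."""
    return importlib.util.find_spec(name) is not None

# Checking "airflow" inside the image would have surfaced the
# ModuleNotFoundError at build time rather than via the liveness probe.
print(is_importable("os"))                   # stdlib, always True
print(is_importable("no_such_package_xyz"))  # False unless installed
```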
Sorry for this comment, but actually my last Dockerfile wasn't working properly. I had to write this dirty Dockerfile:

```dockerfile
FROM apache/airflow:2.2.1-python3.9
USER root
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    --target /home/airflow/.local/lib/python3.9/site-packages \
    --no-user \
    -r requirements.txt
USER airflow
```

I wonder if this is the correct way, but I am sure this is not the place for this subject. I am just commenting here in case it could be useful to anyone. |
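A note on why the `--target /home/airflow/.local/lib/python3.9/site-packages` trick works at all (my reading, not stated in the thread): that directory is the `airflow` user's per-user site-packages, which Python puts on `sys.path` automatically. The stdlib `site` module can confirm the location for any interpreter:

```python
import site

# The per-user site-packages directory that Python adds to sys.path.
# For the `airflow` user on Python 3.9 this resolves to
# /home/airflow/.local/lib/python3.9/site-packages, which is why packages
# installed there with `pip install --target` are importable.
print(site.getusersitepackages())
```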
A comment: this approach will produce a much bigger Docker image than it needs to be (because of the build essentials; extending the image like that is not the best idea). Take a look at https://airflow.apache.org/docs/docker-stack/build.html#customizing-the-image, where you will find a description and a lot of examples of how to customise the image instead, to get a 30%-50% smaller image as a result (with slight additional complexity in the build process). |
Thank you @potiuk for your comment. I actually took a look at this documentation, as I wrote in my previous comment, but unfortunately I didn't find a clean way to install private dependencies from those examples. I am installing a custom Python package from a private Git repository, which forced me to do this trick with the |
Didn't this one work: https://airflow.apache.org/docs/docker-stack/build.html#using-custom-installation-sources ? It is exactly an example of installing from private repos. |
Thank you so much @potiuk, I finally managed to make it work, but not exactly as the doc suggests. Should I submit a PR to add some information? |
Absolutely! |
Apache Airflow version
2.2.0 (latest released)
Operating System
official docker image
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
What happened
I noticed the scheduler was restarting a lot and often ended up in a CrashLoopBackOff state, apparently due to a failed liveness probe.
The triggerer also has this issue and frequently enters the CrashLoopBackOff state as well.
It turns out the liveness probe takes too long to run, so it failed continuously and the scheduler would restart every 10 minutes.
What you expected to happen
No response
How to reproduce
No response
Anything else
I ran the liveness probe code in a container on k8s and found that it generally takes longer than 5 seconds.
We should probably increase the default timeout to 10 seconds, and possibly reduce the probe frequency so that it does not waste as much CPU.
Are you willing to submit PR?
Code of Conduct