
Slow liveness probe causes frequent restarts (scheduler and triggerer) #19001

Closed
dstandish opened this issue Oct 15, 2021 · 14 comments · Fixed by #19003
Labels
affected_version:2.2 Issues Reported for 2.2 area:core area:helm-chart Airflow Helm Chart kind:bug This is a clearly a bug

Comments

@dstandish
Contributor

dstandish commented Oct 15, 2021

Apache Airflow version

2.2.0 (latest released)

Operating System

official docker image

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

I noticed the scheduler was restarting a lot, and it often ended up in CrashLoopBackOff state, apparently due to a failed liveness probe:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  BackOff    19m (x25 over 24m)     kubelet  Back-off restarting failed container
  Warning  Unhealthy  4m15s (x130 over 73m)  kubelet  Liveness probe failed:

The triggerer also has this issue and frequently enters CrashLoopBackOff state.

e.g.

NAME                                      READY   STATUS    RESTARTS   AGE
airflow-prod-redis-0                      1/1     Running   0          2d7h
airflow-prod-scheduler-75dc64bc8-m8xdd    2/2     Running   14         77m
airflow-prod-triggerer-7897c44dd4-mtnq9   1/1     Running   126        12h
airflow-prod-webserver-7bdfc8ff48-gfnvs   1/1     Running   0          12h
airflow-prod-worker-659b566588-w8cd2      1/1     Running   0          147m
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  BackOff    18m (x398 over 11h)     kubelet  Back-off restarting failed container
  Warning  Unhealthy  3m32s (x1262 over 12h)  kubelet  Liveness probe failed:

It turns out the liveness probe takes too long to run, so it failed continuously and the scheduler would simply restart every 10 minutes.
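For context on that "every 10 minutes" cadence: the kubelet restarts a container after failureThreshold consecutive probe failures spaced periodSeconds apart, so a check that always exceeds timeoutSeconds produces a fixed restart interval. A small illustrative sketch (the probe values below are assumptions for illustration, not the chart's actual defaults):

```python
# Sketch of how kubelet liveness settings translate into time-to-restart.
# Field names mirror the Kubernetes probe spec; the example values are
# illustrative assumptions, not the chart's actual defaults.

def seconds_until_restart(period_seconds: int, failure_threshold: int) -> int:
    """Worst case: every probe fails (the check always exceeds
    timeoutSeconds), so the kubelet kills the container after
    failure_threshold consecutive failures, one per period_seconds."""
    return period_seconds * failure_threshold

# e.g. periodSeconds=60 and failureThreshold=10 -> a restart every 600s,
# i.e. every 10 minutes.
print(seconds_until_restart(60, 10))  # 600
```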

What you expected to happen

No response

How to reproduce

No response

Anything else

I ran the liveness probe code in a container on k8s and found that it generally takes longer than 5 seconds.

We should probably increase the default timeout to 10 seconds, and possibly reduce the probe frequency so it doesn't waste as much CPU.

❯ keti airflow-prod-scheduler-6956684c7f-swfgb -- bash
Defaulted container "scheduler" out of: scheduler, scheduler-log-groomer, wait-for-airflow-migrations (init)
airflow@airflow-prod-scheduler-6956684c7f-swfgb:/opt/airflow$ time /entrypoint python -Wignore -c "import os
> os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
> os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
>
> from airflow.jobs.scheduler_job import SchedulerJob
> from airflow.utils.db import create_session
> from airflow.utils.net import get_hostname
> import sys
>
> with create_session() as session:
>     job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
>         SchedulerJob.latest_heartbeat.desc()).limit(1).first()
>
> print(0 if job.is_alive() else 1)
> "

0

real	0m5.696s
user	0m4.989s
sys	0m0.375s
airflow@airflow-prod-scheduler-6956684c7f-swfgb:/opt/airflow$ time /entrypoint python -Wignore -c "import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()

print(0 if job.is_alive() else 1)
"

0

real	0m7.261s
user	0m5.273s
sys	0m0.411s
airflow@airflow-prod-scheduler-6956684c7f-swfgb:/opt/airflow$

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@V0lantis
Contributor

V0lantis commented Dec 8, 2021

Thank you so much for this PR, but it seems that the issue is still there.

│ airflow                airflow-postgresql-0                               ●            1/1                               7 Running             10.1.23.63               serverhome             13d              │
│ airflow                airflow-redis-0                                    ●            1/1                               7 Running             10.1.23.25               serverhome             13d              │
│ airflow                airflow-scheduler-6ffbb755d9-29b2k                 ●            3/3                            1048 Running             10.1.23.1                serverhome             12d              │
│ airflow                airflow-statsd-84f4f9898-k86qb                     ●            1/1                               8 Running             10.1.23.56               serverhome             13d              │
│ airflow                airflow-triggerer-5c684fbf65-t9r58                 ●            1/1                            1173 Running             10.1.23.45               serverhome             13d              │

In my values.yaml I have:

# Airflow scheduler settings
scheduler:
  # If the scheduler stops heartbeating for 5 minutes (5*60s) kill the
  # scheduler and let Kubernetes restart it
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 10
    failureThreshold: 5
    periodSeconds: 60
  # Airflow 2.0 allows users to run multiple schedulers,
  # However this feature is only recommended for MySQL 8+ and Postgres
  replicas: 1

And I am using Helm chart v1.3.0.
Any idea why?

@jedcunningham
Member

jedcunningham commented Dec 8, 2021

@V0lantis, you can try bumping your scheduler.livenessProbe.timeoutSeconds higher. These defaults should work fine for most people, but there are definitely situations where it'd need to be bumped higher.

@V0lantis
Contributor

V0lantis commented Dec 9, 2021

@V0lantis, you can try bumping your scheduler.livenessProbe.timeoutSeconds higher. These defaults should work fine for most people, but there are definitely situations where it'd need to be bumped higher.

Thank you @jedcunningham, it seems to be working for now. I set the value to 30 seconds. Hopefully that is not too much.

@V0lantis
Contributor

V0lantis commented Dec 9, 2021

For the triggerer, I have the following config:

triggerer:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 30
    failureThreshold: 5
    periodSeconds: 60

I have:

airflow            airflow-triggerer-696f47d7ff-8x2d7             ●         1/1                        9 Running         10.1.23.2            serverhome         51m   

Exactly the same for the scheduler unfortunately, and it is still restarting. Should I increase timeoutSeconds even more? How high is too high?

@jedcunningham
Member

@V0lantis, you wouldn't happen to still be on 2.2.0? There was an issue with the CLI command used in the health check in that version.

@V0lantis
Contributor

V0lantis commented Jan 5, 2022

Hello @jedcunningham, thanks for getting back to me!
Unfortunately no, I am using apache/airflow:2.2.1-python3.9.

Here are my new parameters for the scheduler and triggerer:

# Airflow scheduler settings
scheduler:
  # If the scheduler stops heartbeating for 5 minutes (5*60s) kill the
  # scheduler and let Kubernetes restart it
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 180
    failureThreshold: 20
    periodSeconds: 60
  # Airflow 2.0 allows users to run multiple schedulers,
  # However this feature is only recommended for MySQL 8+ and Postgres
  replicas: 2

I am using 2 replicas for most of my pods now, hoping that it will avoid any downtime :)

@jedcunningham
Member

3 minutes, yeah, definitely shouldn't be taking anywhere near that long!

Can you try exec'ing into the pod and timing the command?

- /entrypoint
- python
- -Wignore
- -c
- |
import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys
with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()
sys.exit(0 if job.is_alive() else 1)

As for what is "too long", well, that's kinda up to you. If you are on slower and/or busy hardware it might need to be higher than 30s, but it doesn't really "do" much either (start python, load modules, query the db).
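To see which of those phases dominates on given hardware, one can time them separately. Here is a minimal, generic sketch of that pattern (the stdlib stand-ins below are placeholders, not the real Airflow imports or the actual DB query):

```python
import time

def timed(label, fn):
    """Run fn, report its wall-clock duration, and return its result."""
    start = time.monotonic()
    result = fn()
    print(f"{label}: {time.monotonic() - start:.3f}s")
    return result

# Phase 1: module loading. In the real probe this is the heavy
# `from airflow.jobs.scheduler_job import SchedulerJob` import chain.
timed("imports", lambda: __import__("json"))

# Phase 2: the actual work. In the real probe, the heartbeat query
# against the metadata database.
timed("db query", lambda: sum(range(10_000)))
```

If "imports" dominates, the host is slow to start Python; if the second phase dominates, the database connection is the bottleneck.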

@V0lantis
Contributor

V0lantis commented Jan 6, 2022

There is definitely a mistake on my side, since the following command:

cat <<EOF > tmp.py
import os

os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

with create_session() as session:
   job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
       SchedulerJob.latest_heartbeat.desc()).limit(1).first()
sys.exit(0 if job.is_alive() else 1)
EOF


/entrypoint python -Wignore tmp.py

returns:

Traceback (most recent call last):
  File "/opt/airflow/tmp.py", line 5, in <module>
    from airflow.jobs.scheduler_job import SchedulerJob
ModuleNotFoundError: No module named 'airflow'

I am using a custom image to import private dependencies. My Airflow image is defined by:

FROM python:3.9-slim AS compile-image
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev

RUN python -m venv /opt/venv
# Make sure we use the virtualenv:
ENV PATH="/opt/venv/bin:$PATH"

RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    -r requirements.txt

FROM apache/airflow:2.2.1-python3.9

COPY --from=compile-image --chown=airflow /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
ENV PYTHONPATH="/opt/venv/lib/python3.9/site-packages/:$PYTHONPATH"
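With an image layered like this, a quick generic diagnostic is to print which interpreter actually runs and where it resolves imports from: since the venv's bin/ precedes Airflow's installation on PATH, `python` is the venv interpreter, and its search path may not include the site-packages directory where airflow lives. (This snippet is a generic check, not something from the chart.)

```python
# Generic diagnostic: show which interpreter is running and where it
# resolves imports from. Run it the same way the probe does (via
# /entrypoint) to check whether airflow's site-packages is on the path.
import sys

print("executable:", sys.executable)
print("search path:")
for entry in sys.path:
    print(" ", entry)
```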

I'll try with the normal deployment setup.

Update

Thank you for your comment. Since it is not normal that I cannot import airflow from the CLI, I looked back at how I defined my image, and I updated it following Airflow's guidelines.

I can now time /entrypoint python -Wignore tmp.py:

real	0m3.120s
user	0m2.916s
sys	0m0.183s

I think the root cause was my image. I am going to revert the livenessProbe to its previous values.

Here is the new definition of my image:

FROM apache/airflow:2.2.1-python3.9

USER root

RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*


RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    -r requirements.txt

USER airflow

Final Update

That was definitely the reason. I wonder why it was working correctly in my DAGs 🤔

@V0lantis
Contributor

V0lantis commented Jan 6, 2022

Sorry for this comment, but actually my last Dockerfile wasn't working properly. I had to write this dirty Dockerfile:

FROM apache/airflow:2.2.1-python3.9

USER root

RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential gcc git ssh libpq-dev python3-dev \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*


RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
COPY requirements.txt .
RUN --mount=type=ssh,id=github_ssh_key pip install \
    --no-cache \
    --target /home/airflow/.local/lib/python3.9/site-packages \
    -r requirements.txt --no-user

USER airflow

I wonder if this is the correct way, but I am sure this is not the place for that discussion. I am just leaving it here in case it is useful to anyone.

@potiuk
Member

potiuk commented Jan 6, 2022

A comment: this approach will produce a much bigger Docker image than it needs to be (because of the build essentials; extending the image like that is not the best idea).

You can take a look at https://airflow.apache.org/docs/docker-stack/build.html#customizing-the-image, where you can find a description and a lot of examples of how to customise the image instead, resulting in a 30%-50% smaller image (at the cost of slightly more complexity in the build process).

@V0lantis
Contributor

V0lantis commented Jan 6, 2022

Thank you @potiuk for your comment. I actually took a look at this documentation, as I wrote in my previous comment, but unfortunately I didn't find a clean way to install private dependencies from those examples.

I am installing a custom Python package from a private git repository, which forced me to do this trick with the root user. Apparently, I am not able to install it through a normal pip install as the airflow user.

@potiuk
Member

potiuk commented Jan 6, 2022

Didn't this one work: https://airflow.apache.org/docs/docker-stack/build.html#using-custom-installation-sources? It is exactly an example of installing from private repos.

@V0lantis
Contributor

Thank you so much @potiuk, I finally managed to make it work, though not exactly as the doc suggests. Should I submit a PR to add some information?

@potiuk
Member

potiuk commented Jan 10, 2022

Absolutely!
