Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add liveness probe for celery workers #766

Conversation

nickwood
Copy link
Contributor

@nickwood nickwood commented Aug 2, 2023

What issues does your PR fix?

What does your PR do?

Currently celery workers can enter a 'zombie' state where they become disconnected from celery and will no longer pick up new jobs.

This PR adds a iveness probe (enabled by default) to detect such pods so they can be killed by k8s and recreated.

Checklist

For all Pull Requests

Currently pods can enter a 'zombie' state where they become disconnected from celery and will no longer pick up new jobs. We create a liveness probe to detect such pods so they can be killed by k8s and recreated

Signed-off-by: Nick Wood <nwood@cloudflare.com>
@nickwood nickwood force-pushed the 600-liveness-probe-for-celery-workers branch from 40dbe4e to e1a5afa Compare August 2, 2023 15:57
@Talador12
Copy link

@thesuperzapper this one comes from our org, bringing a change that helped us upstream

Copy link

@Talador12 Talador12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thesuperzapper thesuperzapper added this to the airflow-8.7.2 milestone Aug 28, 2023
Signed-off-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>
@thesuperzapper thesuperzapper changed the title feat: Add liveness probes for celery workers feat: add liveness probe for celery workers Aug 29, 2023
thesuperzapper
thesuperzapper previously approved these changes Aug 29, 2023
Copy link
Member

@thesuperzapper thesuperzapper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickwood thanks for this!

I have made a significant change in 20e7546 to rewrite the probe in Python, so it should now be ready to include in the next release of the chart.

This is because the approach that you proposed did not work for all versions of Airflow (for example it did not work in Airflow 2.7.0 because the --app name changed).

@thesuperzapper thesuperzapper added this to Unsorted in Issue Triage and PR Tracking via automation Aug 29, 2023
@thesuperzapper thesuperzapper added the status/ready-to-merge status - this will be merged into next release label Aug 29, 2023
@thesuperzapper thesuperzapper moved this from Unsorted to PR | Ready to Merge in Issue Triage and PR Tracking Aug 29, 2023
Signed-off-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>
Issue Triage and PR Tracking automation moved this from PR | Ready to Merge to PR | Needs Review Aug 29, 2023
@thesuperzapper thesuperzapper merged commit 424b5ca into airflow-helm:main Aug 29, 2023
3 checks passed
Issue Triage and PR Tracking automation moved this from PR | Needs Review to Done Aug 29, 2023
@nglehuy
Copy link

nglehuy commented Nov 3, 2023

FYI, for anyone who has airflow worker liveness probe failure TypeError: argument of type 'NoneType' is not iterable.
This is because of celery inspect can not see the worker running.
We need to set the environment variable AIRFLOW__CELERY__WORKER_ENABLE_REMOTE_CONTROL=True instead of False to make the celery inspect ping works.

@Talador12
Copy link

Talador12 commented Nov 3, 2023 via email

@nglehuy
Copy link

nglehuy commented Nov 4, 2023

Could we set this AIRFLOW__CELERY__WORKER_ENABLE_REMOTE_CONTROL field as

It's default 'true' in airflow config: https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#worker-enable-remote-control
The FYI is just for anyone who somehow already set it to false.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/ready-to-merge status - this will be merged into next release
Development

Successfully merging this pull request may close these issues.

the celery worker indefinitely not processing any tasks after airflow-redis pod restarted
4 participants