
Celery worker indefinitely not processing any tasks after airflow-redis pod restarted #600

Closed
2 tasks done
anu251989 opened this issue Jun 7, 2022 · 3 comments · Fixed by #766
Labels
kind/bug kind - things not working properly
Milestone

Comments

@anu251989

Checks

Chart Version

8.6.0

Kubernetes Version

v1.20.0

Helm Version

Helm version 3

Description

As part of the deployment, the worker pod was deployed first, and after that the airflow-redis-master pod was recreated. In the airflow worker pod we can see the redis disconnect message; after a few retries it reconnects to redis, but a heartbeat is then missed from the worker node, and the worker indefinitely stops processing any tasks. The tasks stay in queued status until the worker pod is restarted.

[2022-05-30 17:53:06,186: ERROR/MainProcess] consumer: Cannot connect to redis://:**@airflow-redis-master.-XX-XXX--env1.svc.cluster.local:6379/1: Error 111 connecting to airflow-redis-master-XX-XX-env1.svc.cluster.local:6379. Connection refused..
Trying again in 12.00 seconds... (6/100)

Relevant Logs

[2022-05-30 17:53:18,213: INFO/MainProcess] Connected to redis://:**@airflow-redis-master-XX-XXX-env1.svc.cluster.local:6379/1
[2022-05-30 17:53:18,228: INFO/MainProcess] mingle: searching for neighbors
[2022-05-30 17:53:19,239: INFO/MainProcess] mingle: all alone
[2022-05-30 17:53:24,246: INFO/MainProcess] missed heartbeat from celery@airflow-worker-0

Custom Helm Values

celery:
celery_app_name = airflow.executors.celery_executor
worker_concurrency = 16
worker_umask = 0o077
broker_url = redis://:airflow@airflow-redis-master.airflow2-env1.svc.cluster.local:6379/1
result_backend = db+postgresql://airflow_airflow2_x:5432/airflow_db
flower_host = 0.0.0.0
flower_url_prefix = /flower
flower_port = 5555
flower_basic_auth = admin:admin,admin1:admin1
sync_parallelism = 0
celery_config_options = airflow.config_templates.default_celery.DEFAULT_CELERY_CONFIG
ssl_active = False
ssl_key =
ssl_cert =
ssl_cacert =
pool = prefork
operation_timeout = 1.0
task_track_started = True
task_adoption_timeout = 600
task_publish_max_retries = 3
worker_precheck = False
worker_log_server_port = 8793
@anu251989 anu251989 added the kind/bug kind - things not working properly label Jun 7, 2022
@thesuperzapper
Copy link
Member

@anu251989 can you please show us your helm values? You can either:

  1. include your custom-values.yaml file (if you have one)
  2. use the helm get values --namespace MY_NAMESPACE RELEASE_NAME command to get any non-default values (see the example below)

NOTE: be sure to redact any sensitive information before uploading/pasting to GitHub.
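
For example, assuming the release is named airflow and installed in the airflow namespace (adjust both to match your deployment):

    helm get values --namespace airflow airflow --output yaml > custom-values.yaml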

@thesuperzapper thesuperzapper added this to Triage | Waiting for Response in Issue Triage and PR Tracking Jun 22, 2022
@anu251989
Author

@thesuperzapper, thanks for responding. We didn't define any custom values.
I have resolved this issue by configuring a liveness check for the worker pod as shown below, following apache/airflow#22378.
The only problem here is that the worker containers get restarted when both workers miss their heartbeat and stop processing any messages.

command:
  {{- if .Values.workers.livenessProbe.command }}
  {{ toYaml .Values.workers.livenessProbe.command | nindent 16 }}
  {{- else }}
  - sh
  - -c
  - exec celery --app airflow.executors.celery_executor.app inspect ping
  {{- end }}
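
For reference, celery inspect ping without a destination queries every worker in the cluster, which is why one unresponsive worker can fail the probe on all of them. A minimal sketch of a pod-level livenessProbe that only pings the local worker (the probe timings and the use of $HOSTNAME are illustrative assumptions, not the chart's exact template):

    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 60
      timeoutSeconds: 60
      failureThreshold: 3
      exec:
        command:
          - sh
          - -c
          # -d limits the ping to this worker; celery's default node name is celery@<hostname>
          - exec celery --app airflow.executors.celery_executor.app inspect ping -d celery@$HOSTNAME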

@thesuperzapper thesuperzapper moved this from Triage | Waiting for Response to Triage | Needs Investigation in Issue Triage and PR Tracking Feb 7, 2023
@thesuperzapper thesuperzapper added this to the airflow-8.7.0 milestone Feb 7, 2023
Issue Triage and PR Tracking automation moved this from Triage | Needs Investigation to Done Aug 29, 2023
@thesuperzapper
Member

@anu251989 we now have a liveness probe for the workers as of chart version 8.8.0.

It will only restart the specific worker that is not responding, and was implemented with PR #766.
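
For anyone else hitting this, enabling it from your values should look roughly like the following; the exact keys and defaults under workers.livenessProbe are assumptions here, so check the values.yaml of your chart version (8.8.0+):

    workers:
      livenessProbe:
        enabled: true
        initialDelaySeconds: 10
        periodSeconds: 30
        timeoutSeconds: 60
        failureThreshold: 5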
