
Celery worker indefinitely not processing any tasks after airflow-redis pod restarted #600

Closed
2 tasks done
anu251989 opened this issue Jun 7, 2022 · 3 comments · Fixed by #766
Labels
kind/bug kind - things not working properly
Milestone

Comments

@anu251989

Checks

Chart Version

8.6.0

Kubernetes Version

v1.20.0

Helm Version

Helm version 3

Description

As part of the deployment, the worker pod was deployed first, and after that the airflow-redis-master pod was recreated. In the airflow worker pod we can see the redis disconnect message; after a few retries it reconnects to redis, but a heartbeat is then missed from the worker node, and the worker indefinitely stops processing any tasks. The tasks stay in queued status until the worker pod is restarted.

[2022-05-30 17:53:06,186: ERROR/MainProcess] consumer: Cannot connect to redis://:**@airflow-redis-master.-XX-XXX--env1.svc.cluster.local:6379/1: Error 111 connecting to airflow-redis-master-XX-XX-env1.svc.cluster.local:6379. Connection refused..
Trying again in 12.00 seconds... (6/100)

Relevant Logs

[2022-05-30 17:53:18,213: INFO/MainProcess] Connected to redis://:**@airflow-redis-master-XX-XXX-env1.svc.cluster.local:6379/1
[2022-05-30 17:53:18,228: INFO/MainProcess] mingle: searching for neighbors
[2022-05-30 17:53:19,239: INFO/MainProcess] mingle: all alone
[2022-05-30 17:53:24,246: INFO/MainProcess] missed heartbeat from celery@airflow-worker-0

Custom Helm Values

celery:
celery_app_name = airflow.executors.celery_executor
worker_concurrency = 16
worker_umask = 0o077
broker_url = redis://:airflow@airflow-redis-master.airflow2-env1.svc.cluster.local:6379/1
result_backend = db+postgresql://airflow_airflow2_x:5432/airflow_db
flower_host = 0.0.0.0
flower_url_prefix = /flower
flower_port = 5555
flower_basic_auth = admin:admin,admin1:admin1
sync_parallelism = 0
celery_config_options = airflow.config_templates.default_celery.DEFAULT_CELERY_CONFIG
ssl_active = False
ssl_key =
ssl_cert =
ssl_cacert =
pool = prefork
operation_timeout = 1.0
task_track_started = True
task_adoption_timeout = 600
task_publish_max_retries = 3
worker_precheck = False
worker_log_server_port = 8793
@anu251989 anu251989 added the kind/bug kind - things not working properly label Jun 7, 2022
@thesuperzapper
Copy link
Member

@anu251989 can you please show us your helm values? You can either:

  1. include your custom-values.yaml file (if you have one)
  2. use the helm get values --namespace MY_NAMESPACE RELEASE_NAME command to get any non-default values (see the example below)

NOTE: be sure to redact any sensitive information before uploading/pasting to GitHub.
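
For example, assuming the release is named airflow and installed in the airflow namespace (adjust both to match your deployment):

    helm get values --namespace airflow airflow --output yaml > custom-values.yaml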

@thesuperzapper thesuperzapper added this to Triage | Waiting for Response in Issue Triage and PR Tracking Jun 22, 2022
@anu251989
Author

@thesuperzapper, thanks for responding. We didn't define any custom values.
I have resolved this issue by configuring a liveness check for the worker pod as shown below, following apache/airflow#22378.
The only problem here is that the worker containers get restarted when both workers miss their heartbeat and stop processing any messages.

command:
  {{- if .Values.workers.livenessProbe.command }}
  {{ toYaml .Values.workers.livenessProbe.command | nindent 16 }}
  {{- else }}
  - sh
  - -c
  - exec celery --app airflow.executors.celery_executor.app inspect ping
  {{- end }}
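
For reference, celery inspect ping without a destination queries every worker in the cluster, which is why one unresponsive worker can fail the probe on all of them. A minimal sketch of a pod-level livenessProbe that only pings the local worker (the probe timings and the use of $HOSTNAME are illustrative assumptions, not the chart's exact template):

    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 60
      timeoutSeconds: 60
      failureThreshold: 3
      exec:
        command:
          - sh
          - -c
          # -d limits the ping to this worker; celery's default node name is celery@<hostname>
          - exec celery --app airflow.executors.celery_executor.app inspect ping -d celery@$HOSTNAME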

@thesuperzapper thesuperzapper moved this from Triage | Waiting for Response to Triage | Needs Investigation in Issue Triage and PR Tracking Feb 7, 2023
@thesuperzapper thesuperzapper added this to the airflow-8.7.0 milestone Feb 7, 2023
Issue Triage and PR Tracking automation moved this from Triage | Needs Investigation to Done Aug 29, 2023
@thesuperzapper
Member

@anu251989 we now have a liveness probe for the workers as of chart version 8.8.0.

It will only restart the specific worker that is not responding, and was implemented with PR #766.
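
For anyone else hitting this, enabling it from your values should look roughly like the following; the exact keys and defaults under workers.livenessProbe are assumptions here, so check the values.yaml of your chart version (8.8.0+):

    workers:
      livenessProbe:
        enabled: true
        initialDelaySeconds: 10
        periodSeconds: 30
        timeoutSeconds: 60
        failureThreshold: 5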
