feat: Add liveness probes for celery workers
Currently, worker pods can enter a 'zombie' state in which they become disconnected from Celery and no longer pick up new jobs. We add a liveness probe to detect such pods so that Kubernetes can kill and recreate them.

Signed-off-by: Nick Wood <nwood@cloudflare.com>
nickwood committed Aug 2, 2023
1 parent c3339af commit 40dbe4e
Showing 2 changed files with 24 additions and 0 deletions.
16 changes: 16 additions & 0 deletions charts/airflow/templates/worker/worker-statefulset.yaml
@@ -150,6 +150,22 @@ spec:
time.sleep(10)
active_tasks = i.active()[local_celery_host]
{{- end }}
          {{- if .Values.workers.celery.livenessProbe.enabled }}
          livenessProbe:
            initialDelaySeconds: {{ .Values.workers.celery.livenessProbe.initialDelaySeconds }}
            timeoutSeconds: {{ .Values.workers.celery.livenessProbe.timeoutSeconds }}
            failureThreshold: {{ .Values.workers.celery.livenessProbe.failureThreshold }}
            periodSeconds: {{ .Values.workers.celery.livenessProbe.periodSeconds }}
            exec:
              command:
                {{- if .Values.workers.celery.livenessProbe.command }}
                {{ toYaml .Values.workers.celery.livenessProbe.command | nindent 16 }}
                {{- else}}
                - sh
                - -c
                - exec /entrypoint python -m celery --app airflow.executors.celery_executor.app inspect ping -d celery@${HOSTNAME}
                {{- end }}
          {{- end }}
          ports:
            - name: wlog
              containerPort: 8793
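
For reference, here is a sketch of what the template block above should render to when `workers.celery.livenessProbe.enabled` is `true`, no `command` override is supplied, and the other settings keep the defaults added to `values.yaml` in this commit. The indentation relative to the surrounding container spec is omitted, and the output has not been generated with `helm template`, so treat it as an approximation:

```yaml
## Approximate rendered output (assumes enabled=true, default timings, no command override)
livenessProbe:
  initialDelaySeconds: 300
  timeoutSeconds: 20
  failureThreshold: 5
  periodSeconds: 60
  exec:
    command:
      - sh
      - -c
      - exec /entrypoint python -m celery --app airflow.executors.celery_executor.app inspect ping -d celery@${HOSTNAME}
```

With these defaults, Kubernetes waits 300 seconds before the first probe, then probes every 60 seconds and restarts the container after 5 consecutive failures. A worker that stops answering `inspect ping` would therefore be restarted after about four to five minutes of failed probes.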
8 changes: 8 additions & 0 deletions charts/airflow/values.yaml
@@ -963,6 +963,14 @@ workers:
    ##
    gracefullTerminationPeriod: 600

    ## config for celery worker liveness probes
    livenessProbe:
      enabled: false
      initialDelaySeconds: 300
      periodSeconds: 60
      timeoutSeconds: 20
      failureThreshold: 5

    ## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
    ## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
    ##   to understand with KubernetesPodOperator(), as Pods may continue running
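
Because the probe is disabled by default, users have to opt in. A minimal sketch of a values override follows, assuming the keys live under `workers.celery` as the template references them. The `command` key is read by the template but has no default in the `values.yaml` hunk above. The list shown here simply repeats the built-in default to illustrate the expected shape, and can be left out entirely:

```yaml
## Hypothetical user-supplied override file, e.g. passed with `helm upgrade -f`
workers:
  celery:
    livenessProbe:
      enabled: true
      initialDelaySeconds: 300
      periodSeconds: 60
      timeoutSeconds: 20
      failureThreshold: 5
      ## optional: replaces the default probe command when set
      command:
        - sh
        - -c
        - exec /entrypoint python -m celery --app airflow.executors.celery_executor.app inspect ping -d celery@${HOSTNAME}
```

When `command` is omitted, the template falls back to the `celery inspect ping` invocation from its `else` branch.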
