feat: add liveness probe for celery workers (#766)
* feat: Add liveness probes for celery workers

Currently, worker pods can enter a 'zombie' state in which they are disconnected from celery and no longer pick up new jobs. This adds a liveness probe that detects such pods so they can be killed by k8s and recreated.

Signed-off-by: Nick Wood <nwood@cloudflare.com>

* feat: rewrite celery probe with python

Signed-off-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>

* feat: add new worker liveness probe values to docs

Signed-off-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>

---------

Signed-off-by: Nick Wood <nwood@cloudflare.com>
Signed-off-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>
Co-authored-by: Mathew Wicks <thesuperzapper@users.noreply.github.com>
nickwood and thesuperzapper committed Aug 29, 2023
1 parent 773dbf8 commit 424b5ca
Showing 3 changed files with 49 additions and 1 deletion.
3 changes: 2 additions & 1 deletion charts/airflow/README.md
@@ -308,6 +308,7 @@ Parameter | Description | Default
`workers.celery.*` | configs for the celery worker Pods | `<see values.yaml>`
`workers.terminationPeriod` | how many seconds to wait after SIGTERM before SIGKILL of the celery worker | `60`
`workers.logCleanup.*` | configs for the log-cleanup sidecar of the worker Pods | `<see values.yaml>`
`workers.livenessProbe.*` | configs for the worker Pods' liveness probe | `<see values.yaml>`
`workers.extraPipPackages` | extra pip packages to install in the worker Pods | `[]`
`workers.extraVolumeMounts` | extra VolumeMounts for the worker Pods | `[]`
`workers.extraVolumes` | extra Volumes for the worker Pods | `[]`
@@ -333,7 +334,7 @@ Parameter | Description | Default
`triggerer.safeToEvict` | if we add the annotation: "cluster-autoscaler.kubernetes.io/safe-to-evict" = "true" | `true`
`triggerer.podDisruptionBudget.*` | configs for the PodDisruptionBudget of the triggerer Deployment | `<see values.yaml>`
`triggerer.capacity` | maximum number of triggers each triggerer will run at once (sets `AIRFLOW__TRIGGERER__DEFAULT_CAPACITY`) | `1000`
- `triggerer.livenessProbe.*` | liveness probe for the triggerer Pods | `<see values.yaml>`
+ `triggerer.livenessProbe.*` | configs for the triggerer Pods' liveness probe | `<see values.yaml>`
`triggerer.extraPipPackages` | extra pip packages to install in the triggerer Pods | `[]`
`triggerer.extraVolumeMounts` | extra VolumeMounts for the triggerer Pods | `[]`
`triggerer.extraVolumes` | extra Volumes for the triggerer Pods | `[]`
38 changes: 38 additions & 0 deletions charts/airflow/templates/worker/worker-statefulset.yaml
@@ -150,6 +150,44 @@ spec:
time.sleep(10)
active_tasks = i.active()[local_celery_host]
{{- end }}
{{- if .Values.workers.livenessProbe.enabled }}
livenessProbe:
  initialDelaySeconds: {{ .Values.workers.livenessProbe.initialDelaySeconds }}
  timeoutSeconds: {{ .Values.workers.livenessProbe.timeoutSeconds }}
  failureThreshold: {{ .Values.workers.livenessProbe.failureThreshold }}
  periodSeconds: {{ .Values.workers.livenessProbe.periodSeconds }}
  exec:
    command:
      {{- include "airflow.command" . | indent 16 }}
      - "python"
      - "-Wignore"
      - "-c"
      - |
        import os
        import sys
        import subprocess
        from celery import Celery
        from celery.app.control import Inspect
        from typing import List
        def run_command(cmd: List[str]) -> str:
            process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
            output, error = process.communicate()
            if error is not None:
                raise Exception(error)
            else:
                return output.decode(encoding="utf-8")
        broker_url = run_command(["bash", "-c", "eval $AIRFLOW__CELERY__BROKER_URL_CMD"])
        local_celery_host = f"celery@{os.environ['HOSTNAME']}"
        app = Celery(broker=broker_url)
        # ping the local celery worker to see if it's ok
        i = Inspect(app=app, destination=[local_celery_host], timeout=5.0)
        ping_responses = i.ping()
        if local_celery_host not in ping_responses:
            sys.exit(f"celery worker '{local_celery_host}' did not respond to ping")
{{- end }}
ports:
  - name: wlog
    containerPort: 8793
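
Editor's note on the probe script in this hunk: `run_command` pipes only stdout, so the `error` value returned by `communicate()` is always `None` and a failing broker-URL command goes unnoticed. The membership test on `ping_responses` also assumes a mapping is always returned. The following is a minimal sketch of how those two steps could be hardened. It is not part of the commit; the helper name `worker_responded` is hypothetical, and the idea that a ping with no replies may yield `None` is an assumption about celery's behaviour.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def run_command(cmd: List[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    # strip the trailing newline that shell commands usually emit
    return result.stdout.decode("utf-8").strip()


def worker_responded(ping_responses: Optional[Dict[str, dict]], host: str) -> bool:
    """Return True only if `host` appears in the ping reply mapping."""
    # assumption: a ping with no replies may be None rather than an empty dict
    return bool(ping_responses) and host in ping_responses


if __name__ == "__main__":
    # stand-in for the broker URL command; the real probe evaluates
    # $AIRFLOW__CELERY__BROKER_URL_CMD through bash instead
    url = run_command([sys.executable, "-c", "print('redis://example:6379/0')"])
    print(url)
    print(worker_responded({"celery@worker-0": {"ok": "pong"}}, "celery@worker-0"))
    print(worker_responded(None, "celery@worker-0"))
```

In the probe itself, the result of `worker_responded(i.ping(), local_celery_host)` would decide whether to call `sys.exit(...)` with the failure message, so a missing reply produces the intended message rather than a `TypeError`.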
9 changes: 9 additions & 0 deletions charts/airflow/values.yaml
@@ -994,6 +994,15 @@ workers:
##
intervalSeconds: 900

## configs for the worker Pods' liveness probe
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 60
  failureThreshold: 5

## extra pip packages to install in the worker Pod
##
## ____ EXAMPLE _______________
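
Editor's note on the defaults in this hunk: they determine roughly how long a disconnected worker keeps running before Kubernetes restarts it. The sketch below estimates that window under a simplified model that is an assumption, not kubelet's exact scheduling: probes start at most once per `periodSeconds`, a hung probe is cut off at `timeoutSeconds`, and the container restarts after `failureThreshold` consecutive failures. The function name is hypothetical.

```python
from typing import Tuple


def restart_window_seconds(
    period: int, timeout: int, failure_threshold: int
) -> Tuple[int, int]:
    """Rough (fast, slow) bounds, in seconds, from first failed probe to restart."""
    # fast case: every probe fails almost immediately, one per period
    fast = (failure_threshold - 1) * period
    # slow case: every probe hangs until it is cut off by the timeout
    slow = failure_threshold * max(period, timeout)
    return fast, slow


# chart defaults from values.yaml in this commit
fast, slow = restart_window_seconds(period=30, timeout=60, failure_threshold=5)
print(f"restart after roughly {fast}s to {slow}s of consecutive failures")
```

Under this model, lowering `failureThreshold` or `periodSeconds` shortens the window, at the cost of more restarts when the broker is only briefly unreachable.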