Apache Airflow version: 1.10.9
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.14
Environment:
What happened:
We have Airflow set up with the Celery executor, but our tasks are implemented using the KubernetesPodOperator. We create DAG runs with a set of tasks and run them as pods. Some tasks can run for 40 minutes or more. Pretty often we see that a task is still running and actively doing the required operations, but Airflow marks the task as failed and retries it, or, if there are no retries left, just leaves it failed. Sometimes pods are stuck in Running even though the task shows a success status. We currently have one worker pod, which basically starts task execution, and we have started to notice that the worker gets OOMKilled pretty often because of low memory. Sometimes, though, tasks run just fine.
This might be related to this bug: https://issues.apache.org/jira/browse/AIRFLOW-6580
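For reference, our tasks are defined roughly like the sketch below (the DAG id, image, and namespace are placeholders, not our real values; the import path is the Airflow 1.10 contrib one):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_jobs",          # placeholder name
    default_args=default_args,
    schedule_interval="0 2 * * *",  # runs every night
    start_date=datetime(2020, 1, 1),
    catchup=False,
) as dag:
    long_task = KubernetesPodOperator(
        task_id="long_running_task",
        name="long-running-task",
        namespace="airflow",                  # placeholder namespace
        image="our-registry/our-job:latest",  # placeholder image
        get_logs=True,
        is_delete_operator_pod=True,
        # The job legitimately runs 40+ minutes; no execution_timeout is set,
        # yet Airflow still sometimes marks the task failed mid-run.
    )
```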
What you expected to happen:
We expect the pod to run as long as needed and the task to reflect the real status of the underlying pod.
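When this happens we confirm the mismatch by hand, comparing the pod phase in the cluster against the task state in the Airflow UI, roughly like this sketch (the namespace and the label selector are assumptions about how the operator labels our pods):

```python
from kubernetes import client, config

# Sketch: list the pods the operator launched and print their phases,
# to compare against what the Airflow UI reports for the task.
config.load_kube_config()
v1 = client.CoreV1Api()

# The "airflow" namespace and "dag_id" label are assumptions for illustration.
pods = v1.list_namespaced_pod("airflow", label_selector="dag_id=nightly_jobs")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```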
Anything else we need to know:
We have tasks that run every night. The problem hits 2-3 tasks either every day or every other day; sometimes everything runs just fine.
This really impacts our production services and any help is highly appreciated!