Apache Airflow version: 1.10.9
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.14
Environment:
What happened:
We have Airflow set up with the Celery executor, but our tasks are implemented using the KubernetesPodOperator. We create DAG runs with a set of tasks and run them as pods. Some tasks can run for 40 minutes or more. Pretty often we see that a task is still running and actively doing the required operations, but Airflow marks the task as failed and retries it, or, if there are no retries left, just leaves it failed. Sometimes pods are stuck in Running even though the task shows a success status. We currently have one worker pod, which basically starts task execution, and we have started to notice that the worker gets OOMKilled pretty often because of low memory. Sometimes, though, tasks run just fine.
This might be related to this bug: https://issues.apache.org/jira/browse/AIRFLOW-6580
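For reference, our tasks are defined roughly like the sketch below (the DAG id, image, and namespace are placeholders, not our real values; the import path is the Airflow 1.10 contrib one):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_jobs",          # placeholder name
    default_args=default_args,
    schedule_interval="0 2 * * *",  # runs every night
    start_date=datetime(2020, 1, 1),
    catchup=False,
) as dag:
    long_task = KubernetesPodOperator(
        task_id="long_running_task",
        name="long-running-task",
        namespace="airflow",                  # placeholder namespace
        image="our-registry/our-job:latest",  # placeholder image
        get_logs=True,
        is_delete_operator_pod=True,
        # The job legitimately runs 40+ minutes; no execution_timeout is set,
        # yet Airflow still sometimes marks the task failed mid-run.
    )
```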
What you expected to happen:
We expect the pod to run as long as needed and the task to reflect the real status of the underlying pod.
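When this happens we confirm the mismatch by hand, comparing the pod phase in the cluster against the task state in the Airflow UI, roughly like this sketch (the namespace and the label selector are assumptions about how the operator labels our pods):

```python
from kubernetes import client, config

# Sketch: list the pods the operator launched and print their phases,
# to compare against what the Airflow UI reports for the task.
config.load_kube_config()
v1 = client.CoreV1Api()

# The "airflow" namespace and "dag_id" label are assumptions for illustration.
pods = v1.list_namespaced_pod("airflow", label_selector="dag_id=nightly_jobs")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```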
Anything else we need to know:
We have tasks that run every night. The problem hits 2-3 tasks either every day or every other day; sometimes everything runs just fine.
This really impacts our production services and any help is highly appreciated!