Tasks are being relaunched before the first task finishes its execution #28853
-
We can't see the DAG, and it's hard to reason about this without it. I believe you might have made a mistake on the dag_id/task_id, but it's hard to tell.
-
I had (and still have) a similar problem. In my case it was a database connection issue: if the connection is down for more than 50 seconds, Airflow decides the task's database connection is broken and restarts the task. Then the database comes back and the system ends up with two tasks running in parallel. I increased stale_dag_threshold, dag_file_processor_timeout, and scheduler_health_check_threshold in airflow.cfg, but I don't know whether that is a proper solution; IMHO it's a dirty workaround.
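For reference, a minimal sketch of the airflow.cfg tweaks described above, assuming Airflow 2.x. The 120-second values are illustrative only, not recommendations, and the section each option lives in can differ between Airflow versions, so check the configuration reference for your release:

```ini
# airflow.cfg -- illustrative values, tune for your deployment

[core]
# How long the DAG file processor may run before it is killed (default is 50s,
# which matches the ~50-second database outage window described above)
dag_file_processor_timeout = 120

[scheduler]
# How long without a scheduler heartbeat before the scheduler is
# considered unhealthy
scheduler_health_check_threshold = 120
# How long since a DAG was last parsed before it is treated as stale
stale_dag_threshold = 120
```

Note that raising these thresholds only widens the window Airflow tolerates before declaring things broken; it does not fix the underlying database connectivity problem.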
-
I've also had a similar issue, with scheduled tasks being marked as zombies even though the pod (KubernetesPodOperator) was actually still running fine. In my case, scaling up the cluster helped, especially worker CPU and memory. Running on MWAA.
-
Hello,
Apache Airflow version
The versions we currently run are Airflow 2.5.0 and PostgreSQL 15.1.
What happened
The problem we are encountering is the following: for some unknown reason, a task retry is being launched before the first attempt has finished, either successfully or unsuccessfully.
Consider the following example:
We have 4 retries:
The last logs from attempt 1 are at 04:42:02, and it finished correctly.
TIME: 04:42:02 when it finished
By that point the task had already been relaunched (retry 2), without waiting for attempt 1 to finish. As you can see, the second retry was launched before the first attempt ended.
Here are the logs from retry 2:
TIME: 04:41:12 when it started
The same thing happens between retries 1 and 2, 2 and 3, and 3 and 4: every retry starts before the previous attempt has finished.
We encountered this problem after upgrading Airflow from 2.3.0 to 2.5.0 and PostgreSQL from 13.0 to 15.1. In addition, due to external constraints, we migrated our infrastructure from Kubernetes to OpenShift.
How to reproduce
It occurs randomly in the executions of our tasks; we do not observe a logical pattern.
Deployment details
What you think should happen instead
Airflow should wait for the first attempt to complete, successfully or unsuccessfully, before launching a retry.
Versions of Apache Airflow Providers
Also note that for all these jobs we are using the operator from 'airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator'.
Are you willing to submit PR?
Yes I am willing to submit a PR!