Kubernetes version: v1.23 (EKS)

A few seconds after a DAG with a large number of parallel tasks (100+) starts, the Airflow scheduler begins trying to adopt many of those tasks, probably because they were triggered by the "mini-scheduler". It then fails to adopt a few of them with a 409 (Conflict) error. Finally, a subset (but usually all) of the tasks that became orphans and were reset by the scheduler receive a SIGTERM, as the task log shows.

This used to happen more often when the Airflow scheduler was restarted due to other issues and had to re-adopt all of its tasks. After we dealt with those restarts, it became less frequent. Unfortunately, it is not easily reproducible, which is why I didn't submit an issue, but it happens often enough to bother us: we lose time in the retry interval and the computation gets interrupted.

It is worth mentioning that we run several Airflow instances in the same Kubernetes cluster and namespace, which makes us suspect that the other schedulers might be interacting with the Pods and changing them.
Replies: 1 comment 5 replies
Mini schedulers do not start new pods. I think your "run several Airflow instances in the same Kubernetes cluster and namespace" statement explains it well. DON'T DO IT. It is what causes the problem; the idea of doing it makes very little sense, and it is not supported by Airflow.
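If you want to see which schedulers currently claim the worker pods in a shared namespace, a rough diagnostic along these lines might help. It is only a sketch: it assumes the KubernetesExecutor labels each worker pod with `airflow-worker` set to the owning scheduler's job id (verify this against your Airflow version), and the function name and sample data are made up for illustration. The input is the parsed output of `kubectl get pods -n <namespace> -o json`.

```python
from collections import defaultdict

# Assumption: KubernetesExecutor worker pods carry this label, set to the
# job id of the scheduler that owns them. Check it for your Airflow version.
SCHEDULER_LABEL = "airflow-worker"


def group_pods_by_scheduler(pod_list):
    """Map each scheduler label value to the names of the pods it claims.

    `pod_list` is the dict parsed from `kubectl get pods -o json`.
    Pods without the scheduler label are skipped.
    """
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        labels = metadata.get("labels") or {}
        owner = labels.get(SCHEDULER_LABEL)
        if owner is not None:
            groups[owner].append(metadata.get("name", "<unnamed>"))
    return dict(groups)


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "task-a", "labels": {"airflow-worker": "101"}}},
        {"metadata": {"name": "task-b", "labels": {"airflow-worker": "202"}}},
        {"metadata": {"name": "task-c", "labels": {"airflow-worker": "101"}}},
        {"metadata": {"name": "other-app", "labels": {"app": "nginx"}}},
    ]
}

claims = group_pods_by_scheduler(sample)
print(claims)
```

Several distinct label values are not proof of a conflict on their own, since one instance running more than one scheduler would also produce them. It is a starting point for checking whether pods from different Airflow instances are mixed in the namespace.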