Race condition when running multiple schedulers #19038
Hm, yes, this is probably a lot more likely after my commit, but it would also have been present before (since that function would still run on boot). I'm curious why it queues the task if the pod is there, though - by my understanding, the function was meant to move a task from QUEUED back to SCHEDULED only if there is no pod that would match. Is the pod around in this case when the task instance gets moved back to SCHEDULED?
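The intended behavior described above can be sketched in plain Python. This is a simplified, hypothetical model of the cleanup logic, not Airflow's actual code: the names `TaskInstance` and `reset_orphaned_queued_tasks` are illustrative, and the real function (`clear_not_launched_queued_tasks` in the KubernetesExecutor) also consults the Kubernetes API and the metadata database.

```python
# Hedged sketch: a QUEUED task instance is moved back to SCHEDULED only
# when no matching pod is visible. All names here are illustrative.
from dataclasses import dataclass

QUEUED, SCHEDULED = "queued", "scheduled"

@dataclass
class TaskInstance:
    key: str
    state: str

def reset_orphaned_queued_tasks(task_instances, visible_pod_keys):
    """Reset QUEUED tasks that have no matching pod back to SCHEDULED."""
    for ti in task_instances:
        if ti.state == QUEUED and ti.key not in visible_pod_keys:
            # No pod found for this queued task: assume it was never
            # launched and hand it back to the scheduler.
            ti.state = SCHEDULED

tis = [TaskInstance("dag.t1", QUEUED), TaskInstance("dag.t2", QUEUED)]
reset_orphaned_queued_tasks(tis, visible_pod_keys={"dag.t1"})
print([ti.state for ti in tis])  # ['queued', 'scheduled']
```

The question in the comment above is exactly about the guard in the `if`: if the pod *is* visible, the task should be left in QUEUED.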
Seeing this issue a lot with 3 schedulers on 2.2.2, KubernetesExecutor on Kubernetes 1.21. The code is a PythonOperator calling a web service. We do get some logs on startup, though - this is right before the job code starts:
Next try for this task started running, but then:
And then retry 3 finally goes through. We are only seeing this issue on pipelines with a high number of parallel tasks - in our case, three task pools with 48 + 48 + 90 total capacity. Also, with more than one scheduler, the scheduler pods sometimes print this to their logs:
This should be fixed in 2.2.3 via #19904.
Apache Airflow version
2.2.0 (latest released)
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==2.3.0
apache-airflow-providers-celery==2.1.0
apache-airflow-providers-cncf-kubernetes==2.0.3
apache-airflow-providers-docker==2.2.0
apache-airflow-providers-elasticsearch==2.0.3
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-google==6.0.0
apache-airflow-providers-grpc==2.0.1
apache-airflow-providers-hashicorp==2.1.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-microsoft-azure==3.2.0
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-odbc==2.0.1
apache-airflow-providers-postgres==2.3.0
apache-airflow-providers-redis==2.0.1
apache-airflow-providers-sendgrid==2.0.1
apache-airflow-providers-sftp==2.1.1
apache-airflow-providers-slack==4.1.0
apache-airflow-providers-sqlite==2.0.1
apache-airflow-providers-ssh==2.2.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Helm deployed using the official Apache Airflow Helm chart
What happened
We recently upgraded to 2.2.0 and have noticed some of our jobs being killed by the scheduler not long after they start.
We are using KubernetesPodOperator to launch all our tasks.
What I can see happening is:
and repeat
Tracking it back, it seems to have been introduced in #18152. Since this function now runs on a schedule, you can get into a situation where a job has been launched correctly, but the scheduler that kicked it off hasn't yet had time to update the state from queued to scheduled.
i.e. this function: airflow/airflow/executors/kubernetes_executor.py, line 438 (commit 5dd690b)
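The suspected race can be illustrated with a plain-Python timeline. This is a hypothetical simulation, not Airflow code: scheduler A launches a pod for a queued task, but the pod is not yet visible when another scheduler's cleanup routine runs, so the task gets reset even though its pod is on the way.

```python
# Hedged sketch of the race between two schedulers; all structures here
# stand in for the metadata DB and the Kubernetes API, which are the
# real shared state in Airflow.
QUEUED, SCHEDULED = "queued", "scheduled"

task_state = {"dag.t1": QUEUED}     # shared metadata DB (simplified)
visible_pods = set()                # pods the Kubernetes API reports
pending_pod_creations = ["dag.t1"]  # pod launched by scheduler A, not yet visible

def clear_not_launched_queued_tasks():
    # Queued task with no visible pod -> assume never launched, reset it.
    for key, state in task_state.items():
        if state == QUEUED and key not in visible_pods:
            task_state[key] = SCHEDULED

# Scheduler B's cleanup fires before the pod shows up ...
clear_not_launched_queued_tasks()
# ... and only afterwards does the pod become visible.
visible_pods.update(pending_pod_creations)

# The task was reset even though its pod launched successfully, so it can
# be re-scheduled and the late pod killed, matching the reported behavior.
print(task_state["dag.t1"])  # scheduled
```

With a single scheduler this window effectively closes, which is consistent with the workaround described below only one scheduler being affected-free.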
What you expected to happen
Tasks that have been scheduled shouldn't be killed
How to reproduce
Start up at least 2 schedulers
Launch a set of tasks using the KubernetesPodOperator (or something else that will cause a delay in a job moving from queued to scheduled)
Anything else
The workaround at the moment is to use just one scheduler, but it would be great if this could be patched.
Are you willing to submit PR?
Code of Conduct