-
Notifications
You must be signed in to change notification settings - Fork 16.6k
Description
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are using Airflow 2.5.2 to deploy python scripts in Azure containers.
In cases where the logger breaks (in our case because someone used tqdm for progress bars which are known to break it), Airflow failing to find the log keeps re-provisioning the container and restarting the job infinitely, even if all is set to not retry in Airflow. This incurs costs on API calls for us and thus is an impactful problem.
The issue could be because of the handling in _monitor_logging() in Azure cointainer_instances.py line 298 where it changes the state to provisioning, but then doesn't do anything with it when it continues to fail to get instance_view. Maybe some form of check like if state=="Provisioning" and last_state=="Running": return 1 if retries are off could help handle it?
Any insight would be appreciated. I am happy to help write a fix, if you can help me understand this flow a bit better.
What you think should happen instead
The job should fail/exit code 1 instead of reprovisioning/retrying.
How to reproduce
Run an airflow job in which a script is run in an Azure container, which employs tqdm progress bars, or otherwise overwhelms the logger and makes it fail.
Operating System
Ubuntu 20.04
Versions of Apache Airflow Providers
apache-airflow-providers-common-sql==1.3.4
apache-airflow-providers-ftp==2.1.1
apache-airflow-providers-http==2.1.1
apache-airflow-providers-imap==2.2.2
apache-airflow-providers-microsoft-azure==3.7.2
apache-airflow-providers-postgres==4.0.1
apache-airflow-providers-sqlite==2.1.2
apache-airflow-providers-ssh==3.1.0
Deployment
Virtualenv installation
Deployment details
Airflow running on a VM hosted in Azure.
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct