Jobs in Azure Containers restart infinitely if logger crashes, despite retries being set to off. #34516

@krisfur

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We are using Airflow 2.5.2 to deploy Python scripts in Azure containers.

When the logger breaks (in our case because someone used tqdm progress bars, which are known to break it), Airflow fails to find the log and keeps re-provisioning the container and restarting the job indefinitely, even when retries are disabled everywhere in Airflow. Each restart incurs API call costs for us, so this is an impactful problem.

The issue could be caused by the handling in _monitor_logging() in the Azure container_instances.py, line 298, where the state changes to Provisioning but nothing is done with it while instance_view keeps failing to load. Maybe a check along the lines of: if state == "Provisioning" and last_state == "Running": return 1, applied when retries are off, could help handle it?
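
To make the suggested check concrete, here is a rough, self-contained sketch. It is not the provider's actual code: monitor_states, the retries_enabled flag, the "Terminated" state name, and the use of None for a failed instance_view poll are all hypothetical stand-ins chosen for illustration.

```python
from typing import Iterable, Optional


def monitor_states(states: Iterable[Optional[str]], retries_enabled: bool = False) -> int:
    """Walk a sequence of polled container states and return an exit code.

    Hypothetical stand-in for the monitoring loop: each item is one poll
    result, and None models a poll where the instance view could not be read.
    """
    last_state: Optional[str] = None
    for state in states:
        if state is None:
            # Failed poll: keep the last known state and poll again.
            continue
        if state == "Terminated":
            # Assumed terminal state for a container that finished.
            return 0
        if state == "Provisioning" and last_state == "Running" and not retries_enabled:
            # The container was running and is now being provisioned again.
            # With retries off, treat this as a failure instead of a restart.
            return 1
        last_state = state
    # Polls ran out without reaching a terminal state; report failure.
    return 1
```

With retries off, a Running to Provisioning transition ends the loop with exit code 1, while the same transition with retries_enabled=True is allowed to continue.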

Any insight would be appreciated. I am happy to help write a fix, if you can help me understand this flow a bit better.

What you think should happen instead

The job should fail with exit code 1 instead of re-provisioning and retrying.

How to reproduce

Run an Airflow job that executes a script in an Azure container, where the script uses tqdm progress bars or otherwise overwhelms the logger and makes it fail.
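
A minimal sketch of the kind of script meant here, using only the standard library to imitate tqdm-style output (a progress line rewritten in place with carriage returns). This is an assumption-laden stand-in: progress_line and run are hypothetical names, and it has not been verified that this imitation triggers the same logger failure as real tqdm bars.

```python
import sys
from typing import TextIO


def progress_line(done: int, total: int, width: int = 20) -> str:
    """Build one progress-bar frame that overwrites the current line."""
    filled = width * done // total
    bar = "#" * filled + "-" * (width - filled)
    # The leading carriage return moves the cursor back to column 0,
    # so each frame replaces the previous one instead of adding a line.
    return f"\r[{bar}] {done}/{total}"


def run(total: int = 1000, stream: TextIO = sys.stderr) -> int:
    """Emit one frame per step and return how many frames were written."""
    writes = 0
    for done in range(1, total + 1):
        stream.write(progress_line(done, total))
        stream.flush()
        writes += 1
    stream.write("\n")
    return writes


if __name__ == "__main__":
    run()
```

Running this as the container's entry script produces a rapid stream of carriage-return updates on stderr with no newlines until the end.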

Operating System

Ubuntu 20.04

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql==1.3.4
apache-airflow-providers-ftp==2.1.1
apache-airflow-providers-http==2.1.1
apache-airflow-providers-imap==2.2.2
apache-airflow-providers-microsoft-azure==3.7.2
apache-airflow-providers-postgres==4.0.1
apache-airflow-providers-sqlite==2.1.2
apache-airflow-providers-ssh==3.1.0

Deployment

Virtualenv installation

Deployment details

Airflow running on a VM hosted in Azure.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
