Add deferrable mode to AzureContainerInstancesOperator by cruseakshay · Pull Request #62772 · apache/airflow

cruseakshay · 2026-03-03T08:29:16Z

Add deferrable=True support to AzureContainerInstancesOperator so
tasks release their worker slot while waiting for long-running containers,
offloading polling to the lightweight Triggerer process.

Changes

New AzureContainerInstanceAsyncHook — async counterpart to the
existing sync hook, using azure.mgmt.containerinstance.aio for
non-blocking state/log/delete calls.
New AzureContainerInstanceTrigger — polls ACI at a configurable
interval and yields a TriggerEvent when the container reaches a
terminal state.
AzureContainerInstancesOperator — three new parameters:
- deferrable (default from conf) — enables deferrable mode
- polling_interval (default 5.0s) — trigger poll frequency
- remove_on_success (default True) — controls cleanup on success
- Also fixes a bug where the finally cleanup block would delete a
  still-running container group immediately after self.defer() raised
  TaskDeferred.
provider.yaml + get_provider_info.py updated to register the
new trigger.
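
The polling behaviour described for the trigger can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's code: `poll_until_terminal`, `make_fake_state_source`, `collect_events`, and the `TERMINAL_STATES` set are hypothetical names, and a plain dict stands in for Airflow's TriggerEvent.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

# Assumed set of terminal states for this sketch; the real trigger may differ.
TERMINAL_STATES = {"Terminated", "Succeeded", "Failed"}


async def poll_until_terminal(
    get_state: Callable[[], Awaitable[str]],
    polling_interval: float,
) -> AsyncIterator[dict]:
    """Poll get_state() until a terminal state appears, then yield one event."""
    while True:
        state = await get_state()
        if state in TERMINAL_STATES:
            # A plain dict stands in for a TriggerEvent payload here.
            yield {"state": state}
            return
        # Sleeping with asyncio keeps the event loop free for other triggers.
        await asyncio.sleep(polling_interval)


def make_fake_state_source(states):
    """Hypothetical helper: returns an async callable that replays the given states."""
    iterator = iter(states)

    async def get_state() -> str:
        return next(iterator)

    return get_state


async def collect_events(states, polling_interval: float = 0.0):
    source = make_fake_state_source(states)
    return [event async for event in poll_until_terminal(source, polling_interval)]


events = asyncio.run(collect_events(["Pending", "Running", "Terminated"]))
```

Because the loop only awaits, many such pollers can share one triggerer event loop, which is the point of releasing the worker slot.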

closes: #62433

Was generative AI tooling used to co-author this PR?

Yes — Cursor, following the guidelines

...iders/microsoft/azure/src/airflow/providers/microsoft/azure/operators/container_instances.py

SameerMesiah97

The overall approach is solid, though the style is a bit different from the deferrable mode implementation for the ADF counterpart. I have left a few comments for you to address. On a slightly tangential point, the PR is quite large so, for future reference, I would advise splitting a PR like this into the following:

PR 1: Async Hook
PR 2: Deferrable mode implementation (Operator + Trigger)
PR 3: Bug fix

Combining all 3 in one PR makes it harder to review. You need not do this now but please keep this in mind the next time you submit a PR.

SameerMesiah97 · 2026-03-06T00:55:27Z

providers/microsoft/azure/src/airflow/providers/microsoft/azure/hooks/container_instance.py

+        :param name: the name of the container group
+        """
+        client = await self.get_async_conn()
+        await client.container_groups.begin_delete(resource_group, name)


Should the async hook mirror the existing sync behavior (fire-and-forget), or should it await the poller to ensure deletion completes and surface errors? Did you consider something like this:

    poller = await client.container_groups.begin_delete(resource_group, name)
    await poller.result()

The key thing to understand here is that begin_delete does not return the result; it returns an LRO poller object that waits for the result. I don't think it's necessarily wrong to replicate the mechanics of the sync hook here, but I was just wondering if you had considered this approach.

Yes, I kept it the same to remain consistent. LMK if you think we should do await poller.result() in async and then similarly change the sync hook as well.

On second thought, I think you should keep the current approach. After re-evaluating my suggestion in light of your comment, I cannot see a strong reason to break consistency with sync mode. Both hooks should be either fire-and-forget or delete-and-confirm, rather than one of each.
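
The difference between the two deletion styles discussed in this thread can be illustrated with stand-in objects. This is only a sketch: `FakePoller` and `FakeContainerGroups` are hypothetical fakes that mimic the shape of an async client whose begin_delete returns a poller, and no real Azure SDK call is made.

```python
import asyncio


class FakePoller:
    """Hypothetical stand-in for an async long-running-operation poller."""

    def __init__(self, error=None):
        self._error = error
        self.completed = False

    async def result(self):
        await asyncio.sleep(0)  # pretend to wait for the service
        if self._error is not None:
            raise self._error
        self.completed = True


class FakeContainerGroups:
    """Hypothetical stand-in for client.container_groups."""

    def __init__(self, error=None):
        self._error = error
        self.last_poller = None

    async def begin_delete(self, resource_group, name):
        # Starting the operation succeeds even if the deletion later fails.
        self.last_poller = FakePoller(self._error)
        return self.last_poller


async def delete_fire_and_forget(groups, resource_group, name):
    # Starts the deletion and returns without waiting for the outcome.
    await groups.begin_delete(resource_group, name)


async def delete_and_confirm(groups, resource_group, name):
    # Waits for the operation to finish, so a failure surfaces to the caller.
    poller = await groups.begin_delete(resource_group, name)
    await poller.result()


failing = FakeContainerGroups(error=RuntimeError("delete failed"))
asyncio.run(delete_fire_and_forget(failing, "rg", "cg"))  # no exception raised
```

With the failing fake, only `delete_and_confirm` raises; the fire-and-forget variant never observes the error, which is the trade-off both hooks currently accept.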

SameerMesiah97 · 2026-03-06T01:16:16Z

providers/microsoft/azure/src/airflow/providers/microsoft/azure/hooks/container_instance.py

+        """Close the async connection."""
+        if self._async_conn is not None:
+            await self._async_conn.close()
+            self._async_conn = None


The async credential created in get_async_conn() isn't stored or closed in close(). Some Azure async credentials support close(). Would it make sense to keep a reference and close it here too?

Yes, good point.
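
One way to act on this suggestion is to keep a reference to the credential next to the client and close both. The sketch below uses stand-ins only: `FakeAsyncResource` and `SketchAsyncHook` are hypothetical, and the attribute name `_async_credential` is taken from the follow-up commit message rather than from the hook's actual source.

```python
import asyncio


class FakeAsyncResource:
    """Hypothetical stand-in for an async client or credential with close()."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class SketchAsyncHook:
    def __init__(self):
        self._async_conn = None
        self._async_credential = None

    async def get_async_conn(self):
        if self._async_conn is None:
            # Keep a reference to the credential so close() can release it later.
            self._async_credential = FakeAsyncResource()
            self._async_conn = FakeAsyncResource()
        return self._async_conn

    async def close(self):
        """Close the async connection and the credential behind it."""
        if self._async_conn is not None:
            await self._async_conn.close()
            self._async_conn = None
        if self._async_credential is not None:
            await self._async_credential.close()
            self._async_credential = None


async def open_and_close():
    hook = SketchAsyncHook()
    conn = await hook.get_async_conn()
    credential = hook._async_credential
    await hook.close()
    return conn.closed, credential.closed, hook._async_conn, hook._async_credential


result = asyncio.run(open_and_close())
```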

SameerMesiah97 · 2026-03-06T01:23:11Z

...iders/microsoft/azure/src/airflow/providers/microsoft/azure/operators/container_instances.py

+                    if self.remove_on_success:
+                        self.on_kill()
+                elif self.remove_on_error:
+                    self.on_kill()


This can be made clearer like this:

    if _cleanup:
        if exit_code == 0 and self.remove_on_success:
            self.on_kill()
        elif exit_code != 0 and self.remove_on_error:
            self.on_kill()
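
To show how such a flag interacts with deferral, here is a small self-contained sketch. `FakeTaskDeferred` and `execute_sketch` are hypothetical stand-ins (the real operator calls self.defer(), which raises TaskDeferred), and the point is only that the finally block must skip cleanup once the task has been handed to the trigger.

```python
class FakeTaskDeferred(Exception):
    """Hypothetical stand-in for the exception raised when a task defers."""


def execute_sketch(on_kill, *, deferrable, exit_code=0,
                   remove_on_success=True, remove_on_error=True):
    cleanup = True
    try:
        if deferrable:
            # The container is still running, so cleanup must not happen here.
            cleanup = False
            raise FakeTaskDeferred()
        return exit_code
    finally:
        if cleanup:
            if exit_code == 0 and remove_on_success:
                on_kill()
            elif exit_code != 0 and remove_on_error:
                on_kill()


calls = []
try:
    execute_sketch(lambda: calls.append("deferred"), deferrable=True)
except FakeTaskDeferred:
    pass  # the finally block ran but did not call on_kill

execute_sketch(lambda: calls.append("success"), deferrable=False, exit_code=0)
execute_sketch(lambda: calls.append("error"), deferrable=False, exit_code=1,
               remove_on_error=False)
```

After these three calls, only the successful non-deferred run has triggered cleanup.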

SameerMesiah97 · 2026-03-06T01:30:55Z

providers/microsoft/azure/src/airflow/providers/microsoft/azure/triggers/container_instance.py

+                        exit_code = 0
+                        detail_status = "Provisioning"
+
+                    self.log.info("Container group %s/%s state: %s", self.resource_group, self.name, state)


This does not need to be logged on every iteration. I think you should only log the container state on transitions. With large numbers of concurrent deferrable tasks, per-poll logging will heavily pollute the triggerer logs.
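
A transition-only logging loop could look something like the sketch below. `log_transitions` is a hypothetical helper that takes an already-collected list of states for illustration; in a trigger the same comparison would sit inside the polling loop.

```python
import logging

log = logging.getLogger("aci_trigger_sketch")


def log_transitions(states, resource_group, name):
    """Log a state only when it differs from the previously seen one."""
    last_state = None
    logged = []
    for state in states:
        if state != last_state:
            log.info("Container group %s/%s state: %s", resource_group, name, state)
            logged.append(state)
            last_state = state
    return logged


logged = log_transitions(
    ["Pending", "Pending", "Running", "Running", "Running", "Terminated"],
    "my-resource-group",
    "my-container-group",
)
```

Six polls produce three log lines here, one per distinct state.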

SameerMesiah97 · 2026-03-06T01:43:16Z

...iders/microsoft/azure/src/airflow/providers/microsoft/azure/operators/container_instances.py

        priority: str | None = "Regular",
        identity: ContainerGroupIdentity | dict | None = None,
+        deferrable: bool = conf.getboolean("operators", "default_deferrable", fallback=False),
+        polling_interval: float = 5.0,


Why 5 seconds? How long do you expect Azure Container Instances to run? If it's seconds to minutes, this is fine. But if they can run for hours, I think 5 seconds is a bit too aggressive; 10-30 seconds would be more reasonable.

I agree, 5 sec is aggressive. Setting it to 30 sec.
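
As a rough back-of-the-envelope check on the two intervals, the number of status calls per task scales inversely with the interval. The two-hour run time below is an assumed example, not a figure from the PR, and `polls_for` is a hypothetical helper.

```python
import math


def polls_for(run_seconds: float, polling_interval: float) -> int:
    """Approximate number of status polls over a container's lifetime."""
    return math.ceil(run_seconds / polling_interval)


two_hours = 2 * 60 * 60
polls_at_5s = polls_for(two_hours, 5.0)    # 1440
polls_at_30s = polls_for(two_hours, 30.0)  # 240
```

For a single two-hour container that is a sixfold reduction in API calls, multiplied by however many deferred tasks run concurrently.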

SameerMesiah97 · 2026-03-06T09:44:59Z

...iders/microsoft/azure/src/airflow/providers/microsoft/azure/operators/container_instances.py

+    @cached_property
+    def _ci_hook(self) -> AzureContainerInstanceHook:
+        return AzureContainerInstanceHook(azure_conn_id=self.ci_conn_id)
+


Can you explain why you turned this into a cached property?

This is consistent with ADF/Synapse operators. The benefit is lazy initialization for execute() and execute_complete(), since both methods require the hook. With @cached_property, we define it once and avoid duplicating the instantiation logic across both methods.
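
The lazy-initialization behaviour described in this reply can be demonstrated with stand-ins. `FakeHook` and `SketchOperator` are hypothetical; the sketch only shows that, within one operator instance, the hook is built on first access and reused afterwards.

```python
from functools import cached_property


class FakeHook:
    """Hypothetical stand-in for the container instance hook."""

    instances = 0

    def __init__(self, azure_conn_id):
        FakeHook.instances += 1
        self.azure_conn_id = azure_conn_id


class SketchOperator:
    def __init__(self, ci_conn_id):
        self.ci_conn_id = ci_conn_id  # no hook is created yet

    @cached_property
    def _ci_hook(self):
        # Runs once per operator instance, on first attribute access.
        return FakeHook(azure_conn_id=self.ci_conn_id)

    def execute(self):
        return self._ci_hook

    def execute_complete(self):
        return self._ci_hook


op = SketchOperator("azure_default")
created_before_use = FakeHook.instances
same_hook = op.execute() is op.execute_complete()
created_after_use = FakeHook.instances
```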

SameerMesiah97 · 2026-03-06T09:50:08Z

providers/microsoft/azure/tests/unit/microsoft/azure/hooks/test_container_instance.py

+        hook._async_conn = mock_conn
+
+        conn = await hook.get_async_conn()
+        assert conn is mock_conn


It might be worth adding a test that calls get_async_conn() twice and verifies that the same client instance is returned, to confirm the caching behavior.
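
The shape of such a caching test might be something like the following sketch. It does not use the real hook: `SketchAsyncHook` and its `client_factory` argument are hypothetical, chosen so the example can count how many clients get created.

```python
import asyncio


class SketchAsyncHook:
    """Hypothetical hook that caches its async client in _async_conn."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._async_conn = None

    async def get_async_conn(self):
        if self._async_conn is None:
            self._async_conn = self._client_factory()
        return self._async_conn


async def check_caching():
    created = []

    def factory():
        client = object()
        created.append(client)
        return client

    hook = SketchAsyncHook(factory)
    first = await hook.get_async_conn()
    second = await hook.get_async_conn()
    # The same object should come back, and the factory should run only once.
    return first is second, len(created)


same_instance, clients_created = asyncio.run(check_caching())
```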

SameerMesiah97 · 2026-03-06T09:53:29Z

providers/microsoft/azure/tests/unit/microsoft/azure/hooks/test_container_instance.py

+    async def test_delete(self, async_conn_with_credentials):
+        hook = AzureContainerInstanceAsyncHook(azure_conn_id=async_conn_with_credentials.conn_id)
+        mock_client = MagicMock()
+        mock_client.container_groups.begin_delete = AsyncMock()


If you adjust the implementation for deleting containers, you will have to change this too.

We can change if we decide against fire-and-forget

No need to change now as I think your current approach is ideal.

Add async support to ACI

f5597ed

boring-cyborg bot added area:providers provider:microsoft-azure Azure-related issues labels Mar 3, 2026

cruseakshay marked this pull request as ready for review March 3, 2026 10:30

cruseakshay requested a review from dabla as a code owner March 3, 2026 10:30

Merge branch 'main' into feature/aci-deferrable-62433

a232cf9

cruseakshay mentioned this pull request Mar 3, 2026

Add parameter deferrable to AzureContainerInstanceOperator #62433

Open

2 tasks

Merge branch 'main' into feature/aci-deferrable-62433

b97da55

nailo2c reviewed Mar 3, 2026

...iders/microsoft/azure/src/airflow/providers/microsoft/azure/operators/container_instances.py (outdated)

avoid raising AirflowException directly

608b93f

SameerMesiah97 reviewed Mar 6, 2026

Address review

c810e28

- async credential: store and close _async_credential in close()
- Remove per-poll state logging in trigger
- Change default polling_interval from 5s to 30s
- Simplify finally cleanup block in execute()
- Add get_async_conn() caching test

SameerMesiah97 · Mar 7, 2026 · edited
