
Add deferrable mode to AzureContainerInstancesOperator #62772

Open
cruseakshay wants to merge 5 commits into apache:main from cruseakshay:feature/aci-deferrable-62433

Conversation

@cruseakshay
Contributor

Add deferrable=True support to AzureContainerInstancesOperator so
tasks release their worker slot while waiting for long-running containers,
offloading polling to the lightweight Triggerer process.

Changes

  • New AzureContainerInstanceAsyncHook — async counterpart to the
    existing sync hook, using azure.mgmt.containerinstance.aio for
    non-blocking state/log/delete calls.

  • New AzureContainerInstanceTrigger — polls ACI at a configurable
    interval and yields a TriggerEvent when the container reaches a
    terminal state.

  • AzureContainerInstancesOperator — three new parameters:

    • deferrable (default from conf) — enables deferrable mode
    • polling_interval (default 5.0s) — trigger poll frequency
    • remove_on_success (default True) — controls cleanup on success
  • Also fixes a bug where the finally cleanup block would delete a
    still-running container group immediately after self.defer() raised
    TaskDeferred.
  • provider.yaml + get_provider_info.py updated to register the
    new trigger.
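The polling behaviour described for the trigger can be sketched with plain asyncio. This is a minimal illustration, not the PR's code: `poll_until_terminal`, `TERMINAL_STATES`, and `fake_get_state` are hypothetical names, the set of terminal states is an assumption, and a plain dict stands in for the TriggerEvent payload.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable

# Assumed terminal states, for illustration only.
TERMINAL_STATES = {"Terminated", "Succeeded", "Failed"}


async def poll_until_terminal(
    get_state: Callable[[], Awaitable[str]],
    polling_interval: float,
) -> AsyncIterator[dict]:
    """Poll get_state until it reports a terminal state, then yield one event."""
    while True:
        state = await get_state()
        if state in TERMINAL_STATES:
            # A dict stands in for the event the trigger would emit.
            yield {"state": state}
            return
        # asyncio.sleep hands control back to the event loop between polls.
        await asyncio.sleep(polling_interval)


async def demo() -> dict:
    states = iter(["Pending", "Running", "Terminated"])

    async def fake_get_state() -> str:
        # Hypothetical replacement for an async call to the service.
        return next(states)

    async for event in poll_until_terminal(fake_get_state, polling_interval=0.01):
        return event
    raise RuntimeError("no event produced")
```

Running `asyncio.run(demo())` drives the loop through the three fake states and returns the single event produced at the terminal state.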

closes: #62433


Was generative AI tooling used to co-author this PR?

@cruseakshay marked this pull request as ready for review March 3, 2026 10:30
@cruseakshay requested a review from dabla as a code owner March 3, 2026 10:30
Contributor

@SameerMesiah97 left a comment

The overall approach is solid, though the style is a bit different from the deferrable mode implementation for the ADF counterpart. I have left a few comments for you to address. On a slightly tangential point, the PR is quite large so, for future reference, I would advise splitting a PR like this into the following:

  • PR 1: Async Hook
  • PR 2: Deferrable mode implementation (Operator + Trigger)
  • PR 3: Bug fix

Combining all 3 in one PR makes it harder to review. You need not do this now but please keep this in mind the next time you submit a PR.

:param name: the name of the container group
"""
client = await self.get_async_conn()
await client.container_groups.begin_delete(resource_group, name)
Contributor

Should the async hook mirror the existing sync behavior (fire-and-forget), or should it await the poller to ensure deletion completes and surface errors? Did you consider something like this:

poller = await client.container_groups.begin_delete(resource_group, name)
await poller.result()

The key thing to understand here is that begin_delete does not return the result but an LRO poller object, which waits for the result. I don't think it's necessarily wrong to replicate the mechanics of the sync hook here, but I was just wondering if you had thought of this approach.
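The difference between the two options can be sketched without the Azure SDK. Everything below is a stand-in: `FakePoller` and the module-level `begin_delete` are hypothetical and only mimic the shape of "start the operation, get a poller back".

```python
import asyncio


class FakePoller:
    """Hypothetical stand-in for the poller object returned by begin_delete."""

    def __init__(self) -> None:
        self.finished = False

    async def result(self) -> None:
        # Simulates waiting for the service to finish the delete.
        await asyncio.sleep(0.01)
        self.finished = True


async def begin_delete(resource_group: str, name: str) -> FakePoller:
    """Submit the delete and return immediately with a poller."""
    return FakePoller()


async def delete_fire_and_forget(resource_group: str, name: str) -> FakePoller:
    # The request is submitted, but completion is never confirmed.
    return await begin_delete(resource_group, name)


async def delete_and_confirm(resource_group: str, name: str) -> FakePoller:
    poller = await begin_delete(resource_group, name)
    # Waiting on the poller is where a failed delete would surface.
    await poller.result()
    return poller
```

With fire-and-forget the caller returns as soon as the request is accepted; with delete-and-confirm the caller holds on until the operation finishes, which costs time but reports failures.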

Contributor Author

Yes, I kept it the same to remain consistent. LMK if you think we should do await poller.result() in async and then similarly change the sync hook as well.

Contributor

@SameerMesiah97 Mar 7, 2026

"Yes, I kept it the same to remain consistent. LMK if you think we should do await poller.result() in async and then similarly change the sync hook as well."

On second thought, I think you should keep the current approach. After re-evaluating my suggestion in light of your comment, I cannot see a strong reason to break consistency with sync mode. Both should be either fire-and-forget or delete-and-confirm, rather than one being the former and the other the latter.

"""Close the async connection."""
if self._async_conn is not None:
    await self._async_conn.close()
    self._async_conn = None
Contributor

The async credential created in get_async_conn() isn't stored or closed in close(). Some Azure async credentials support close(). Would it make sense to keep a reference and close it here too?

Contributor Author

Yes, good point.
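One way the suggestion could look, sketched with stand-ins rather than real Azure classes: `FakeAsyncResource` and `SketchAsyncHook` are hypothetical, and the only idea carried over from the thread is keeping a reference to the credential so that close() can release it alongside the client.

```python
import asyncio


class FakeAsyncResource:
    """Hypothetical stand-in for an async client or credential with close()."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class SketchAsyncHook:
    def __init__(self) -> None:
        self._async_conn = None
        self._async_credential = None

    async def get_async_conn(self) -> FakeAsyncResource:
        if self._async_conn is None:
            # Keep the credential on the instance so close() can reach it later.
            self._async_credential = FakeAsyncResource()
            self._async_conn = FakeAsyncResource()
        return self._async_conn

    async def close(self) -> None:
        """Close the client first, then the credential it was built from."""
        if self._async_conn is not None:
            await self._async_conn.close()
            self._async_conn = None
        if self._async_credential is not None:
            await self._async_credential.close()
            self._async_credential = None


async def demo() -> tuple[bool, bool]:
    hook = SketchAsyncHook()
    conn = await hook.get_async_conn()
    credential = hook._async_credential
    await hook.close()
    return conn.closed, credential.closed
```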

if self.remove_on_success:
    self.on_kill()
elif self.remove_on_error:
    self.on_kill()
Contributor

This can be made clearer like this:

if _cleanup:
    if exit_code == 0 and self.remove_on_success:
        self.on_kill()
    elif exit_code != 0 and self.remove_on_error:
        self.on_kill()
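To show how such a `_cleanup` flag interacts with deferral, here is a self-contained sketch. It is an assumption about the shape of the fix, not the operator's actual code: `FakeTaskDeferred` and `CleanupSketchOperator` are hypothetical, and the exit code is passed in rather than read from a container.

```python
class FakeTaskDeferred(BaseException):
    """Hypothetical stand-in for the exception raised by self.defer()."""


class CleanupSketchOperator:
    def __init__(self, deferrable, exit_code=0, remove_on_success=True, remove_on_error=True):
        self.deferrable = deferrable
        self.exit_code = exit_code
        self.remove_on_success = remove_on_success
        self.remove_on_error = remove_on_error
        self.deleted = False

    def on_kill(self):
        # Stand-in for deleting the container group.
        self.deleted = True

    def defer(self):
        raise FakeTaskDeferred()

    def execute(self):
        _cleanup = True
        exit_code = None
        try:
            if self.deferrable:
                # The container group is still running, so skip cleanup.
                _cleanup = False
                self.defer()
            exit_code = self.exit_code
            return exit_code
        finally:
            # Runs even when defer() raises, hence the guard.
            if _cleanup:
                if exit_code == 0 and self.remove_on_success:
                    self.on_kill()
                elif exit_code != 0 and self.remove_on_error:
                    self.on_kill()
```

Because a finally block runs on every exit path, including the exception used to signal deferral, the flag is what keeps a still-running container group from being deleted.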

exit_code = 0
detail_status = "Provisioning"

self.log.info("Container group %s/%s state: %s", self.resource_group, self.name, state)
Contributor

This does not need to be logged in every iteration. I think you should only log container state on transitions. With large numbers of concurrent deferrable tasks, logging on every poll will result in extreme log pollution in the triggerer.
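Logging only on transitions can be sketched as follows. The function name `watch_states`, the logger name, and the terminal states are all hypothetical. The loop remembers the last state it saw and logs only when the new state differs.

```python
import asyncio
import logging

log = logging.getLogger("aci_trigger_sketch")


async def watch_states(get_state, polling_interval, terminal=("Terminated", "Failed")):
    """Poll until a terminal state; return the states that were actually logged."""
    last_state = None
    logged = []
    while True:
        state = await get_state()
        if state != last_state:
            # Log once per transition instead of once per poll.
            log.info("Container group state changed: %s -> %s", last_state, state)
            logged.append(state)
            last_state = state
        if state in terminal:
            return logged
        await asyncio.sleep(polling_interval)


async def demo():
    states = iter(["Pending", "Running", "Running", "Running", "Terminated"])

    async def fake_get_state():
        # Hypothetical replacement for an async state lookup.
        return next(states)

    return await watch_states(fake_get_state, polling_interval=0.001)
```

In the demo, five polls produce three log lines, since the repeated "Running" polls are skipped.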

priority: str | None = "Regular",
identity: ContainerGroupIdentity | dict | None = None,
deferrable: bool = conf.getboolean("operators", "default_deferrable", fallback=False),
polling_interval: float = 5.0,
Contributor

Why 5 seconds? How long do you expect Azure Container Instances to run? If it's seconds to minutes, this is fine. But if they can run for hours, I think 5 seconds is a bit too aggressive. 10-30 seconds would be more reasonable.

Contributor Author

I agree, 5 sec is aggressive. Setting it to 30 sec.

@cached_property
def _ci_hook(self) -> AzureContainerInstanceHook:
    return AzureContainerInstanceHook(azure_conn_id=self.ci_conn_id)

Contributor

Can you explain why you turned this into a cached property?

Contributor Author

This is consistent with ADF/Synapse operators. The benefit is lazy initialization for execute() and execute_complete(), since both methods require the hook. With @cached_property, we define it once and avoid duplicating the instantiation logic across both methods.
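The lazy-initialization point can be illustrated with `functools.cached_property` alone. `FakeHook` and `LazyHookOperator` are hypothetical stand-ins. The sketch only shows that the hook is built on first access and reused within one operator instance.

```python
from functools import cached_property


class FakeHook:
    """Hypothetical stand-in for the sync hook; counts how often it is built."""

    instances = 0

    def __init__(self, conn_id: str) -> None:
        FakeHook.instances += 1
        self.conn_id = conn_id


class LazyHookOperator:
    def __init__(self, ci_conn_id: str) -> None:
        # No hook is created here; construction stays cheap.
        self.ci_conn_id = ci_conn_id

    @cached_property
    def _ci_hook(self) -> FakeHook:
        # Built on first access, then stored on the instance.
        return FakeHook(self.ci_conn_id)

    def execute(self) -> FakeHook:
        return self._ci_hook

    def execute_complete(self) -> FakeHook:
        return self._ci_hook
```

Both methods share one definition of how the hook is created, and neither pays for it until the hook is actually needed.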

hook._async_conn = mock_conn

conn = await hook.get_async_conn()
assert conn is mock_conn
Contributor

It might be worth adding a test that calls get_async_conn() twice and verifies that the same client instance is returned, to confirm the caching behavior.
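A test along those lines might look like the sketch below. It does not use the real hook: `CachingSketchHook` is a hypothetical class whose `get_async_conn()` caches whatever its factory returns, and the factory is a `MagicMock` so the number of client constructions can be checked.

```python
import asyncio
from unittest.mock import MagicMock


class CachingSketchHook:
    """Hypothetical hook that builds its client once and reuses it."""

    def __init__(self, client_factory) -> None:
        self._client_factory = client_factory
        self._async_conn = None

    async def get_async_conn(self):
        if self._async_conn is None:
            self._async_conn = self._client_factory()
        return self._async_conn


async def check_conn_is_cached() -> int:
    # Each factory call would return a fresh object, so identity proves caching.
    factory = MagicMock(side_effect=lambda: MagicMock())
    hook = CachingSketchHook(factory)

    first = await hook.get_async_conn()
    second = await hook.get_async_conn()

    assert first is second
    factory.assert_called_once()
    return factory.call_count
```

Checking both identity and the factory call count guards against an implementation that returns equal-looking but separately constructed clients.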

async def test_delete(self, async_conn_with_credentials):
    hook = AzureContainerInstanceAsyncHook(azure_conn_id=async_conn_with_credentials.conn_id)
    mock_client = MagicMock()
    mock_client.container_groups.begin_delete = AsyncMock()
Contributor

If you adjust the implementation for deleting containers, you will have to change this too.

Contributor Author

We can change if we decide against fire-and-forget

Contributor

"We can change if we decide against fire-and-forget"

No need to change now as I think your current approach is ideal.

- async credential: store and close _async_credential in close()
- Remove per-poll state logging in trigger
- Change default polling_interval from 5s to 30s
- Simplify finally cleanup block in execute()
- Add get_async_conn() caching test
Development

Successfully merging this pull request may close these issues.

Add parameter deferrable to AzureContainerInstanceOperator

3 participants