Skip to content

Azure: Retry transient Azurite image-fetch failures in integration tests#16540

Open
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16530-azurite-fetch-retry
Open

Azure: Retry transient Azurite image-fetch failures in integration tests#16540
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16530-azurite-fetch-retry

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

Summary

Closes #16530.

The iceberg-azure integration tests (TestADLSFileIO, TestADLSInputStream, TestADLSOutputStream) start a shared Azurite container in AzuriteTestBase.@BeforeAll. They intermittently fail at startup with a ContainerFetchException caused by a NotFoundException (HTTP 404) while testcontainers pulls mcr.microsoft.com/azure-storage/azurite:3.35.0. The image tag is already pinned, so this is a transient registry/CDN failure (rate-limiting or CDN inconsistency), not a missing image — re-running usually succeeds, which is exactly what makes these tests flaky.

What changed

AzuriteContainer now overrides start() to retry transient image-fetch failures before giving up: up to 5 attempts with exponential backoff (2s → 4s → 8s → 16s). Each retry re-invokes start(), which re-triggers the image pull, because testcontainers' RemoteDockerImage does not cache a failed resolution.

Notably, testcontainers' own withStartupAttempts(...) does not help here: in testcontainers 2.0.5 the image is fetched (getDockerImageName()RemoteDockerImage) before the startupAttempts retry loop in GenericContainer.doStart(), so a fetch failure short-circuits that loop. Retrying start() itself is what actually re-attempts the pull.

The retry is deliberately narrow: it only retries when the failure (or its cause chain) is a ContainerFetchException. Genuine launch failures — e.g. a wait-strategy timeout surfaced as a ContainerLaunchException without a fetch cause — are rethrown immediately so real problems are never masked.

Tests

  • New TestAzuriteContainer exercises the retry helper without Docker: succeeds without retry, retries until success, rethrows after exhausting attempts, does not retry non-fetch launch failures, and retries a fetch failure wrapped in a ContainerLaunchException.
  • Confirmed the existing Azurite integration tests still pass through the overridden start() (TestADLSFileIO, TestADLSInputStream, TestADLSOutputStream) with Docker.

Note

This reduces the flakiness by tolerating transient fetch hiccups; it cannot mask a registry outage that persists across all retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the AZURE label May 23, 2026
…iner

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky tests in iceberg-azure integrationTest

1 participant