Fix flaky CI integration test failures #7336

Merged: CharlieTLe merged 1 commit into cortexproject:master from CharlieTLe:fix-flaky-integration-tests on Mar 9, 2026

@CharlieTLe (Member) commented Mar 7, 2026

Summary

Fix three flaky integration tests and Parquet bucket index race conditions:

  • TestStartStop: Add docker rm --force after docker wait in service Stop() and Kill() to prevent container name collisions. Docker removes --rm containers asynchronously after process exit, so an explicit removal prevents races on restart.

  • TestIngesterMetadataWithTenantFederation: Add WaitMissingMetrics option when waiting for ring member metrics. With tenant federation enabled, the metrics endpoint takes longer to expose ring metrics.

  • TestBackwardCompatibilityQueryFuzz: Filter out additional PromQL constructs that produce different results between Cortex versions: double negation (--), quantile, predict_linear, and atan2.

  • Parquet bucket index race condition (both single binary and microservices mode): Wait for the compactor's blocks cleaner to complete before querying. The parquet store-gateway discovers blocks on-demand through the bucket index (SyncBlocks/InitialSync are no-ops), so the bucket index must exist before queries can succeed. Affected tests:

    • TestQuerierWithBlocksStorageRunningInMicroservicesMode (Parquet shuffle sharding)
    • TestQuerierWithBlocksStorageRunningInSingleBinaryMode (Parquet sharding)
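
The waits described above (for ring metrics that appear late, and for the blocks cleaner to finish) both come down to polling until a metric exists and reaches a target value. The following is a minimal Go sketch of that idea, not the PR's actual code: `waitForMetric`, `errMetricMissing`, and the fake `fetch` function are hypothetical names introduced for illustration, whereas the PR uses the test framework's own wait options such as WaitMissingMetrics.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errMetricMissing is a hypothetical sentinel meaning "the endpoint answered,
// but the metric is not exposed yet".
var errMetricMissing = errors.New("metric not exposed yet")

// waitForMetric polls fetch until it returns a value >= want or the timeout
// expires. A missing metric is treated as "not ready yet" and retried; any
// other error aborts the wait immediately.
func waitForMetric(fetch func() (float64, error), want float64, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		got, err := fetch()
		switch {
		case err == nil && got >= want:
			return nil
		case err != nil && !errors.Is(err, errMetricMissing):
			return fmt.Errorf("fetching metric: %w", err)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s waiting for metric >= %v", timeout, want)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Fake metrics source (assumption for the demo): the metric is missing
	// for the first two polls, then reports one completed cleanup run.
	calls := 0
	fetch := func() (float64, error) {
		calls++
		if calls < 3 {
			return 0, errMetricMissing
		}
		return 1, nil
	}
	err := waitForMetric(fetch, 1, time.Second, time.Millisecond)
	fmt.Println(err, calls)
}
```

Retrying on a missing metric instead of failing on it is the design choice that matters here: a test that queries before the metric (or the bucket index behind it) exists fails only some of the time, which is what makes it flaky.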

Test plan

  • All CI checks pass (integration_querier, integration_query_fuzz on both amd64/arm64)
  • Verified each fix addresses the specific flaky failure mode
  • Reviewed recent CI history (~22 of 25 integration test failures covered)

🤖 Generated with Claude Code

@CharlieTLe force-pushed the fix-flaky-integration-tests branch 3 times, most recently from 769fc2c to e7bfe47, on March 8, 2026 00:59
Fix three flaky integration tests:

1. TestStartStop: Add `docker rm --force` after `docker wait` in
   service Stop() and Kill() to prevent container name collisions.
   Docker removes --rm containers asynchronously after process exit,
   so an explicit removal prevents races on restart.

2. TestIngesterMetadataWithTenantFederation: Add WaitMissingMetrics
   option when waiting for ring member metrics. With tenant federation
   enabled, the metrics endpoint takes longer to expose ring metrics.

3. TestBackwardCompatibilityQueryFuzz: Filter out additional PromQL
   constructs that produce different results between Cortex versions:
   double negation (--), quantile, predict_linear, and atan2.
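
The filtering in item 3 could look roughly like the Go sketch below. It is an illustration under stated assumptions, not the PR's implementation: `unstableConstructs`, `shouldSkipQuery`, and `filterQueries` are hypothetical names, and plain substring matching is assumed for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// unstableConstructs lists the PromQL constructs named in the commit message
// as producing different results between Cortex versions.
var unstableConstructs = []string{"--", "quantile", "predict_linear", "atan2"}

// shouldSkipQuery reports whether a generated query contains any construct
// from the list. Substring matching is deliberately coarse (an assumption of
// this sketch): "quantile" also matches quantile_over_time and
// histogram_quantile, so it may drop more queries than strictly needed.
func shouldSkipQuery(query string) bool {
	for _, c := range unstableConstructs {
		if strings.Contains(query, c) {
			return true
		}
	}
	return false
}

// filterQueries keeps only the queries that are safe to compare across
// versions.
func filterQueries(queries []string) []string {
	kept := make([]string, 0, len(queries))
	for _, q := range queries {
		if !shouldSkipQuery(q) {
			kept = append(kept, q)
		}
	}
	return kept
}

func main() {
	queries := []string{
		`sum(rate(http_requests_total[5m]))`,
		`--up`,
		`quantile(0.9, up)`,
		`predict_linear(up[5m], 60)`,
		`up atan2 up`,
	}
	// Only the first query survives the filter.
	fmt.Println(filterQueries(queries))
}
```

Over-filtering is acceptable in a fuzz test, since skipped queries only reduce coverage slightly, while a query that legitimately differs between versions causes a spurious failure.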

Also fix a pre-existing failure in the Parquet shuffle sharding test
by waiting for the compactor's blocks cleaner to complete before
querying. The parquet store-gateway discovers blocks on-demand through
the bucket index (SyncBlocks/InitialSync are no-ops), so the bucket
index must exist before queries can succeed.

Signed-off-by: Charlie Le <charlie.le@grafana.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
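
The ordering described in item 1 of the commit message (wait for the container to exit, then remove it explicitly) can be sketched in Go as follows. This is a hedged illustration, not the harness's real Stop() or Kill(): the `runner` type, `execRunner`, `stopAndRemove`, and the container name are hypothetical, and the choice to ignore errors from `docker wait` and `docker rm` is an assumption made because a `--rm` container may already have been removed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runner executes one command. It is injected so the sequence can be
// exercised without a Docker daemon.
type runner func(name string, args ...string) error

// execRunner runs the command for real via os/exec.
func execRunner(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// stopAndRemove stops a container, waits for it to exit, and then removes it
// by name so that a later start with the same name cannot collide with a
// container Docker has not finished cleaning up.
func stopAndRemove(run runner, container string) error {
	if err := run("docker", "stop", container); err != nil {
		return fmt.Errorf("docker stop: %w", err)
	}
	// Best effort (assumption): with --rm the container may already be gone,
	// in which case wait and rm can fail harmlessly.
	_ = run("docker", "wait", container)
	_ = run("docker", "rm", "--force", container)
	return nil
}

func main() {
	// Record the commands instead of running them, so the demo works
	// without Docker. Pass execRunner instead to run them for real.
	var issued []string
	record := func(name string, args ...string) error {
		issued = append(issued, name+" "+strings.Join(args, " "))
		return nil
	}
	if err := stopAndRemove(record, "example-container"); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	for _, cmd := range issued {
		fmt.Println(cmd)
	}
}
```

The explicit removal makes the cleanup synchronous from the caller's point of view, which is the property the fix depends on.
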
@CharlieTLe force-pushed the fix-flaky-integration-tests branch from e7bfe47 to 93092c0 on March 8, 2026 01:31
@SungJin1212 (Member) left a comment

thanks

@dosubot (bot) added the lgtm label on Mar 8, 2026
@CharlieTLe merged commit e62ef7a into cortexproject:master on Mar 9, 2026
61 of 64 checks passed
Shvejan pushed a commit to Shvejan/cortex that referenced this pull request Mar 25, 2026
Labels

lgtm (This PR has been approved by a maintainer), size/M, type/flaky-test

2 participants