fix(e2e): Fix flaky Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad #7502

Merged
yeya24 merged 1 commit into master from fix-runtime-config-flaky-test on May 12, 2026
yeya24 (Contributor) commented on May 11, 2026

What this PR does

Fixes a flaky integration test Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad that intermittently fails with:

another service with the same name 'distributor' has already been started

Root Cause

The test intentionally starts services (querier, ruler, distributor) with an invalid config (-distributor.shard-by-all-labels=false), expects them to fail, and then retries with a valid config. When StartAndWaitReady fails during WaitReady (the container starts but then crashes), the service remains registered in the scenario's services slice. The subsequent attempt to start a new service with the same name fails because isRegistered() returns true.

This is a race condition: if the container crashes fast enough that Start() itself fails, the service is never registered and the retry works. But if the container starts successfully and then crashes during WaitReady, it stays registered.
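
The two outcomes described above can be reproduced with a minimal sketch. The `scenario` and `service` types, the `failStart` and `failReady` flags, and the `Stop` method on the scenario are hypothetical stand-ins for illustration and are not the actual e2e framework types. The sketch assumes only what the description states: a service is registered once its container has started, before readiness is checked.

```go
package main

import (
	"errors"
	"fmt"
)

// service is a hypothetical stand-in for an e2e service (not the Cortex type).
type service struct {
	name      string
	failStart bool // simulates a container that crashes before Start() returns
	failReady bool // simulates a container that starts, then crashes during WaitReady
}

// scenario is a hypothetical stand-in that tracks started services in a slice.
type scenario struct {
	services []*service
}

// isRegistered reports whether a service with the given name is in the slice.
func (s *scenario) isRegistered(name string) bool {
	for _, svc := range s.services {
		if svc.name == name {
			return true
		}
	}
	return false
}

// StartAndWaitReady registers the service after a successful start and only
// then waits for readiness, so a readiness failure leaves it registered.
func (s *scenario) StartAndWaitReady(svc *service) error {
	if s.isRegistered(svc.name) {
		return fmt.Errorf("another service with the same name '%s' has already been started", svc.name)
	}
	if svc.failStart {
		return errors.New("start failed") // never registered, so a retry works
	}
	s.services = append(s.services, svc)
	if svc.failReady {
		return errors.New("service did not become ready") // still registered
	}
	return nil
}

// Stop removes the service from the slice so that its name can be reused.
func (s *scenario) Stop(svc *service) {
	for i, registered := range s.services {
		if registered == svc {
			s.services = append(s.services[:i], s.services[i+1:]...)
			return
		}
	}
}

func main() {
	s := &scenario{}
	bad := &service{name: "distributor", failReady: true}

	fmt.Println(s.StartAndWaitReady(bad))                           // readiness failure
	fmt.Println(s.StartAndWaitReady(&service{name: "distributor"})) // name still taken
	s.Stop(bad)                                                     // unregister before retrying
	fmt.Println(s.StartAndWaitReady(&service{name: "distributor"})) // retry succeeds
}
```

Setting `failStart` instead of `failReady` on the first service models the fast-crash case: nothing is registered, so the retry succeeds even without calling `Stop`.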

Fix

  1. Test fix: Call s.Stop() after each expected StartAndWaitReady failure to properly unregister the service before retrying.
  2. Framework fix: Make ConcreteService.Stop() and Kill() tolerant of already-removed containers (started with --rm flag) by treating "No such container" errors as successful operations.

How was this tested

  • go build ./integration/... passes
  • go test ./integration/e2e/... -count=1 -short passes

Observed in: https://github.com/cortexproject/cortex/actions/runs/25643063397/job/75267056134

…untimeConfigLoad

When a service fails during WaitReady (container starts but crashes due to
runtime config validation), it remains registered in the scenario's services
slice. The next attempt to start a service with the same name then fails with
"another service with the same name has already been started".

Fix by:
1. Calling s.Stop() after expected StartAndWaitReady failures to unregister
   the service before retrying with a new instance.
2. Making ConcreteService.Stop() and Kill() tolerant of already-removed
   containers (started with --rm flag) by treating "No such container"
   errors as successful stops.

Signed-off-by: Ben Ye <benye@amazon.com>
yeya24 force-pushed the fix-runtime-config-flaky-test branch from 7e8e0b0 to 1a11a5f on May 11, 2026 03:04
dosubot (bot) added the lgtm label on May 11, 2026
yeya24 merged commit 24e8624 into master on May 12, 2026
67 of 69 checks passed
yeya24 deleted the fix-runtime-config-flaky-test branch on May 12, 2026 06:23
Labels

lgtm (This PR has been approved by a maintainer), size/S, type/flaky-test
