Skip to content

test: fix flaky reserve worker startup#5384

Merged
acud merged 3 commits intomasterfrom
fix-reserve-worker-flake
Mar 5, 2026
Merged

test: fix flaky reserve worker startup#5384
acud merged 3 commits intomasterfrom
fix-reserve-worker-flake

Conversation

@acud
Copy link
Contributor

@acud acud commented Feb 27, 2026

Description

A data race in the reserve worker's startup was created in a way that it may get scheduled after the dependent tests already started. I've added a channel to synchronize this behavior. In node.go I am not using this synchronization as it appears to affect the node startup time in the integration tests.

This fixes the flakes in the TestReserveEvict and the TestReplaceOldIndex tests.

@martinconic
Copy link
Contributor

I have run the CI/CD tests on this PR multiple times, and one time on tests failed that I did not saw before:
https://github.com/ethersphere/bee/actions/runs/22489298892/job/65364049599?pr=5384

Could be because of these changes? can you please check it out?

@acud
Copy link
Contributor Author

acud commented Mar 2, 2026

Checked it out. Can't see how this is related. The goroutine dump in the CI suggest some problem in the CI runner. Note the IO waits etc.

db.events.Trigger(reserveOverCapacity)
}

close(ready)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it not read, and we are sure it is created, maybe it is better to put:

if ready != nil {
    close(ready)
}

storer.StartReserveWorker(context.Background(), pullerMock.NewMockRateReporter(0), networkRadiusFunc(0))
readyC := make(chan struct{})
storer.StartReserveWorker(context.Background(), pullerMock.NewMockRateReporter(0), networkRadiusFunc(0), readyC)
<-readyC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to avoid naked reads from the channel here. For example to have select on on t.Context().Done()?

This goes for all instances in this pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? it is a test. i can't see why you'd like to have a graceful context cancellation here. i can't really clearly see when a test context would be cancelled (ctrl-c? 600s timeout?), but in any case, if we are stuck on the read from the channel i'd actually like to see an ugly test failure with a goroutine dump (all goroutines blocked message) rather than having the test exit gracefully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failure on timeout would be much easier to read than to investigate the panic goroutine dump.

@acud acud merged commit 2f1dade into master Mar 5, 2026
16 checks passed
@acud acud deleted the fix-reserve-worker-flake branch March 5, 2026 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants