Description
Describe the bug
When scaling down extremely fast, a tombstone can still go missing. The TestSingleBinaryWithMemberlistScaling test occasionally reproduces this with the default values, e.g.:
integration_memberlist_single_binary_test.go:212: cortex-1: cortex_ring_members=4.000000 memberlist_client_kv_store_value_tombstones=16.000000
memberlist-tombstone-with-debug.log
What appears to be happening is that the final messages from the instance being scaled down are being sent the expected number of times, but the intended recipients are also shutting down. This is not trivial to fix because we do not get any feedback from memberlist as to whether our messages were actually received. Possible solutions:
- Somehow monitor for failed sends and re-send until some number of successful sends are achieved
- Send out tombstone messages more times (e.g. a form of retransmit multiplier specifically for tombstones)
To Reproduce
Run the TestSingleBinaryWithMemberlistScaling test a few times:
make ./cmd/cortex/.uptodate
go test -timeout=1h -count=20 -v -tags=requires_docker ./integration -run "^TestSingleBinaryWithMemberlistScaling$"
Tweaking the scaling numbers in the test makes it fail more often:
maxCortex := 8
minCortex := 1
Expected behavior
The test doesn't fail.
Environment:
- Infrastructure: N/A
- Deployment tool: N/A
Additional Context