
Memberlist: Aggressive scale down can cause lost tombstones #4360

Closed
stevesg opened this issue Jul 12, 2021 · 1 comment

@stevesg
Contributor

stevesg commented Jul 12, 2021

Describe the bug
When scaling down extremely fast, a tombstone can still go missing. The TestSingleBinaryWithMemberlistScaling test can reproduce this on occasion with the default values, e.g.:

integration_memberlist_single_binary_test.go:212: cortex-1: cortex_ring_members=4.000000 memberlist_client_kv_store_value_tombstones=16.000000

memberlist-tombstone-with-debug.log

What appears to be happening is that the final messages from the instance being scaled down are being sent the expected number of times, but the intended recipients are also shutting down. This is not trivial to fix because we do not get any feedback from memberlist as to whether our messages were actually received. Possible solutions:

  • Somehow monitor for failed sends and re-send until some number of successful sends is achieved
  • Send out tombstone messages more times (e.g. a form of retransmit multiplier specifically for tombstones)
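The second option could be sketched as below. The retransmitLimit function mirrors the formula hashicorp/memberlist uses to decide how many times a broadcast is retransmitted (multiplier times ceil(log10(n+1)) for n cluster members); tombstoneRetransmitMult is a hypothetical knob for illustration only, not an existing config option.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitLimit mirrors the retransmit formula used by
// hashicorp/memberlist: multiplier * ceil(log10(numNodes + 1)).
func retransmitLimit(retransmitMult, numNodes int) int {
	nodeScale := int(math.Ceil(math.Log10(float64(numNodes + 1))))
	return retransmitMult * nodeScale
}

func main() {
	const numNodes = 8

	// Default multiplier, as used for ordinary KV updates.
	regularMult := 4

	// Hypothetical larger multiplier applied only to tombstone
	// broadcasts, so they survive recipients shutting down.
	tombstoneRetransmitMult := 12

	fmt.Println(retransmitLimit(regularMult, numNodes))
	fmt.Println(retransmitLimit(tombstoneRetransmitMult, numNodes))
}
```

In an 8-node cluster the node-scale factor is 1, so the tombstone multiplier directly triples the number of retransmissions; the trade-off is extra gossip traffic for every deletion.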

To Reproduce
Run the TestSingleBinaryWithMemberlistScaling test a few times:

make ./cmd/cortex/.uptodate
go test -timeout=1h -count=20 -v -tags=requires_docker ./integration -run "^TestSingleBinaryWithMemberlistScaling$"

Tweaking the scaling numbers in the test makes it fail more often:

maxCortex := 8
minCortex := 1

Expected behavior
The test doesn't fail.

Environment:

  • Infrastructure: N/A
  • Deployment tool: N/A

Additional Context

stevesg added a commit to stevesg/cortex that referenced this issue Jul 12, 2021
This appears to be highlighting an issue - raised as cortexproject#4360.
This change just stops the test flaking until it can be fixed.

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
pracucci pushed a commit that referenced this issue Jul 13, 2021
This appears to be highlighting an issue - raised as #4360.
This change just stops the test flaking until it can be fixed.

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
stale bot commented Oct 10, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 10, 2021
@stale stale bot closed this as completed Oct 26, 2021
alvinlin123 pushed a commit to ac1214/cortex that referenced this issue Jan 14, 2022
…xproject#4361)

This appears to be highlighting an issue - raised as cortexproject#4360.
This change just stops the test flaking until it can be fixed.

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>