Do not run check in cets_discovery on nodedown #50
When a node goes down, global kicks it from the other nodes in the cluster. This process is not instant, and it happens even if the node is alive and could reconnect. Reconnecting too fast would interfere with the prevent_overlapping_partitions logic, though. We still allow this node to reconnect, but only during the regular checks. The change temporarily puts the node into the unavailable_nodes list.
This allows us to try to reconnect to the remote node some time after the netsplit. The process enters the regular phase once we contact the remote node.
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (main vs. #50):
  Coverage  98.33% -> 98.35% (+0.02%)
  Files     10 -> 10
  Lines     780 -> 792 (+12)
  Hits      767 -> 779 (+12)
  Misses    13 -> 13
Looks good, but CI failed. I added a minor comment.
Call peer to stop the processes
Use proc_lib to spawn processes for better error reporting in tests
More reliable disconnect_node in tests
Link disco process in tests
Looks good in general, but cets_SUITE is becoming huge, and for me it's really hard to follow the changes there. IMO we need to extract some helpers into separate modules, especially since this suite is full of generic helpers. That should be a separate story, so please don't extend this PR.
Patches for disco.
Do not call check on nodedown. This prevents immediate reconnect (which could confuse code inside global, trying to execute logic for prevent_overlapping_partitions).
If nodedown is received, wait 30 seconds before the next check. This allows the restarting node to initiate the connection on its own, without the whole cluster trying to connect to it.
Improve logging for when wait_for_ready fails.
Provide extra args into handle_down_fun: is_leader and remote_node
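
The nodedown handling described in the patches above can be sketched as a few pure functions. This is only an illustration under stated assumptions, not the code from cets_discovery: the module name, the function names, the map-based state, and the 5-second regular interval are all hypothetical. Only the 30-second wait and the unavailable_nodes list come from the PR text. As a simplification, any successful contact switches the delay back to the regular interval.

```erlang
-module(disco_nodedown_sketch).
-export([new/0, handle_nodedown/2, handle_nodeup/2,
         next_check_delay/1, unavailable_nodes/1]).

%% The 30-second wait is taken from the PR description.
-define(NODEDOWN_WAIT_MS, 30000).
%% Assumption for illustration only: the regular check interval.
-define(REGULAR_INTERVAL_MS, 5000).

%% Hypothetical state: a map instead of the real process state.
new() ->
    #{unavailable_nodes => [], next_check_in => ?REGULAR_INTERVAL_MS}.

%% On nodedown: do NOT run a check right away. Remember the node as
%% unavailable and postpone the next check, so the restarting node
%% can initiate the connection on its own.
handle_nodedown(Node, State = #{unavailable_nodes := Unavailable}) ->
    State#{unavailable_nodes := lists:usort([Node | Unavailable]),
           next_check_in := ?NODEDOWN_WAIT_MS}.

%% Once the remote node is contacted again, drop it from the list
%% and go back to the regular check interval.
handle_nodeup(Node, State = #{unavailable_nodes := Unavailable}) ->
    State#{unavailable_nodes := lists:delete(Node, Unavailable),
           next_check_in := ?REGULAR_INTERVAL_MS}.

%% Delay in milliseconds before the next check should be scheduled.
next_check_delay(#{next_check_in := Delay}) ->
    Delay.

unavailable_nodes(#{unavailable_nodes := Unavailable}) ->
    Unavailable.
```

In a gen_server, the value returned by next_check_delay/1 could be passed to erlang:send_after/3 to schedule the next check message, instead of calling the check directly from the nodedown handler.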