Do not run check in cets_discovery on nodedown #50

arcusfelis · 2024-01-31T20:28:02Z

Patches for disco.

Do not call check on nodedown. This prevents an immediate reconnect, which could confuse the code inside global that runs the prevent_overlapping_partitions logic.
If nodedown is received, wait 30 seconds before the next check. This lets the restarting node initiate the connection on its own, without the whole cluster trying to connect to it.
Improve logging when wait_for_ready fails.
Pass extra args to handle_down_fun: is_leader and remote_node.
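
For illustration, a handle_down callback could use the two new values roughly as follows. This is only a sketch: it assumes the callback receives a single map with keys named remote_node and is_leader, and the module name and returned tuples are made up for the example rather than taken from CETS.

```erlang
-module(handle_down_sketch).
-export([handle_down/1, demo/0]).

%% Assumption: the callback gets a map; the key names mirror the PR text.
%% Only the leader reacts, so cluster-wide cleanup is not repeated on every node.
handle_down(#{remote_node := Node, is_leader := true}) ->
    {cleanup, Node};
handle_down(#{remote_node := Node, is_leader := false}) ->
    {skip, Node}.

%% Small self-check with a hypothetical node name.
demo() ->
    {cleanup, 'b@host'} = handle_down(#{remote_node => 'b@host', is_leader => true}),
    {skip, 'b@host'} = handle_down(#{remote_node => 'b@host', is_leader => false}),
    ok.
```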

When a node goes down, global kicks it from the other nodes in the cluster. This process is not instant, and it happens even if the node is alive and could reconnect. Reconnecting too fast would interfere with the prevent_overlapping_partitions logic, though. We still allow the node to reconnect, but only during the regular checks. In the meantime, the change temporarily puts the node into the unavailable_nodes list.

This allows us to try to reconnect to the remote node some time after the netsplit. Discovery enters the regular phase once we contact the remote node.
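
The behaviour described above can be sketched as a small state transition. This is not the code from cets_discovery: the state map, the function names, and the 5 second regular interval are assumptions for the example. Only the 30 second delay, the after_nodedown retry type, and the unavailable_nodes list come from the description.

```erlang
-module(disco_nodedown_sketch).
-export([handle_nodedown/2, handle_contact/2, next_check_delay/1, demo/0]).

%% Hypothetical state: #{retry_type => atom(), unavailable_nodes => [node()]}.

%% On nodedown: do not check immediately. Remember the node as unavailable
%% and switch to the slower after_nodedown retry type.
handle_nodedown(Node, State = #{unavailable_nodes := Unavailable}) ->
    State#{unavailable_nodes := lists:usort([Node | Unavailable]),
           retry_type := after_nodedown}.

%% Once a later check reaches the node again, return to the regular phase.
handle_contact(Node, State = #{unavailable_nodes := Unavailable}) ->
    State#{unavailable_nodes := lists:delete(Node, Unavailable),
           retry_type := regular}.

%% Delay in milliseconds before the next check.
next_check_delay(#{retry_type := after_nodedown}) -> timer:seconds(30);
next_check_delay(#{retry_type := regular}) -> timer:seconds(5). %% assumed value

demo() ->
    S0 = #{retry_type => regular, unavailable_nodes => []},
    S1 = handle_nodedown('b@host', S0),
    30000 = next_check_delay(S1),
    S2 = handle_contact('b@host', S1),
    5000 = next_check_delay(S2),
    ok.
```

The point of the longer delay is that the restarting node gets a window to connect on its own before the rest of the cluster tries to reach it.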

codecov · 2024-01-31T22:33:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.35%. Comparing base (b58feaf) to head (fbaf1d8).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #50      +/-   ##
==========================================
+ Coverage   98.33%   98.35%   +0.02%     
==========================================
  Files          10       10              
  Lines         780      792      +12     
==========================================
+ Hits          767      779      +12     
  Misses         13       13

chrzaszcz

Looks good, but CI failed. I added a minor comment.

chrzaszcz

Looks good in general, but cets_SUITE is becoming huge, and it's really hard for me to follow the changes there. IMO we need to extract some helpers into separate modules, especially since this suite is full of generic helpers. That should be a separate story, so please don't extend this PR.

arcusfelis added 3 commits January 31, 2024 17:06

Do check manually in disco_node_down_timestamp_is_remembered testcase

c522d63

Add new retry_type after_nodedown

7cb7316

This allows us to try to reconnect to the remote node some time after the netsplit. Discovery enters the regular phase once we contact the remote node.

arcusfelis closed this Jan 31, 2024

arcusfelis reopened this Jan 31, 2024

arcusfelis force-pushed the less-checks branch from ba68b94 to 89f62d1 Compare January 31, 2024 22:31

arcusfelis added 4 commits January 31, 2024 23:35

Print status when wait_for_ready fails in cets_SUITE

b0f16de

Add spec for cets_discovery:get_time()

f704fda

Avoid any retries in global:trans in cets_join

87246cb

Pass remote_node and is_leader into handle_down

469f177

arcusfelis force-pushed the less-checks branch from 89f62d1 to 469f177 Compare February 1, 2024 18:12

arcusfelis added 2 commits February 1, 2024 19:23

Wait for get_fn2_called in status_unavailable_nodes_is_subset_of_discovery_nodes

85e91ab

Pass correct is_leader to handle_down

43d8ba7

arcusfelis changed the title ~~Less checks~~ Do not run check in cets_discovery on nodedown Feb 12, 2024

Fix spec for handle_down_fun()

cec51bf

arcusfelis marked this pull request as ready for review February 28, 2024 09:21

Test retry_type result

eddf756

arcusfelis force-pushed the less-checks branch from cc8a9b7 to eddf756 Compare February 28, 2024 09:25

chrzaszcz reviewed Feb 28, 2024

arcusfelis added 2 commits February 29, 2024 11:19

Make stopping CETS gen_servers more reliable

30f0452

Call peer to stop the processes.
Use proc_lib to spawn processes for better error reporting in tests.

Add cleanup table for tests

fbaf1d8

More reliable disconnect_node in tests.
Link disco process in tests.

chrzaszcz approved these changes Mar 4, 2024

chrzaszcz merged commit aac8ba7 into main Mar 4, 2024
9 checks passed

chrzaszcz deleted the less-checks branch March 4, 2024 09:16
