Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert dist blocker #4251

Merged
merged 1 commit into from
Mar 28, 2024
Merged

Revert dist blocker #4251

merged 1 commit into from
Mar 28, 2024

Conversation

chrzaszcz
Copy link
Member

@chrzaszcz chrzaszcz commented Mar 27, 2024

The dist_blocker feature was introduced to protect a disconnected node from reconnecting before other nodes finished the cleanup to avoid issues with inconsistent CETS tables. The solution was to change the cookie value on nodeup (sic!) for the particular node until the cleanup for that node is finished.

There are following issues with that solution:

  • Sometimes there is a delay before nodedown is received, and cleanup is started. However, when the node comes back online, nodedown would be received, followed by nodeup. The problem is that dist_blocker is preventing the node from connecting, which might delay reconnection. In MongooseHelm tests this was causing a 15-second delay before reconnection, but it could be more in some configurations. There is even potential for a deadlock, but I couldn't trigger it in tests.
  • When dist_blocker kicks in, multiple error messages would appear on several nodes, informing about the blocked connections. These unwanted and surprising logs could confuse the user.
  • Changing cookies looks like a very radical solution to me, and the recent improvement of cleanup makes it avoidable.

Because of these concerns, dist_blocker is disabled in this PR. It is enough to disable it in MIM, because CETS does not enable it by default.

The MongooseHelm tests have shown that after #4250 it is difficult to reproduce the issue - it occurred only once in about 50 tests, which try to quickly restart MIM to trigger the issue. Keep in mind, that with dist_blocker it has also failed at least twice on CI.

We can prevent the issues in different ways, and we can do so in separate PR's.

The dist_blocker feature was introduced to protect disconnected nodes
from reconnecting before other nodes finished their clenaup.

However, when dist_blocker kicks in, multiple unexpected error
messages would appear on several nodes.

This means that the solution cannot be accepted.
@mongoose-im
Copy link
Collaborator

mongoose-im commented Mar 27, 2024

elasticsearch_and_cassandra_26 / elasticsearch_and_cassandra_mnesia / 509759e
Reports root/ big
OK: 437 / Failed: 0 / User-skipped: 43 / Auto-skipped: 0


small_tests_25 / small_tests / 509759e
Reports root / small


small_tests_26 / small_tests / 509759e
Reports root / small


small_tests_26_arm64 / small_tests / 509759e
Reports root / small


ldap_mnesia_25 / ldap_mnesia / 509759e
Reports root/ big
OK: 2284 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0


dynamic_domains_mysql_redis_26 / mysql_redis / 509759e
Reports root/ big
OK: 4506 / Failed: 0 / User-skipped: 144 / Auto-skipped: 0


ldap_mnesia_26 / ldap_mnesia / 509759e
Reports root/ big
OK: 2284 / Failed: 0 / User-skipped: 895 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_26 / pgsql_mnesia / 509759e
Reports root/ big
OK: 4539 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0


internal_mnesia_26 / internal_mnesia / 509759e
Reports root/ big
OK: 2424 / Failed: 0 / User-skipped: 755 / Auto-skipped: 0


dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / 509759e
Reports root/ big
OK: 4535 / Failed: 1 / User-skipped: 114 / Auto-skipped: 0

graphql_SUITE:tls_enabled:tls_connect_admin_selfsigned_certificate
{error,{{assertMatch,[{module,graphql_SUITE},
            {line,259},
            {expression,"Result"},
            {pattern,"{ error , { tls_alert , { bad_certificate , _ } } }"},
            {value,{error,connection_closed}}]},
    [{graphql_SUITE,tls_connect_admin_selfsigned_certificate,1,
            [{file,"/home/circleci/project/big_tests/tests/graphql_SUITE.erl"},
             {line,259}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1302}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1234}]}]}}

Report log


pgsql_mnesia_25 / pgsql_mnesia / 509759e
Reports root/ big
OK: 4928 / Failed: 0 / User-skipped: 118 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 509759e
Reports root/ big
OK: 4539 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0


pgsql_mnesia_26 / pgsql_mnesia / 509759e
Reports root/ big
OK: 4928 / Failed: 0 / User-skipped: 118 / Auto-skipped: 0


mysql_redis_26 / mysql_redis / 509759e
Reports root/ big
OK: 4907 / Failed: 0 / User-skipped: 139 / Auto-skipped: 0


mssql_mnesia_26 / odbc_mssql_mnesia / 509759e
Reports root/ big
OK: 4925 / Failed: 0 / User-skipped: 121 / Auto-skipped: 0


pgsql_cets_26 / pgsql_cets / 509759e
Reports root/ big
OK: 4454 / Failed: 0 / User-skipped: 178 / Auto-skipped: 0


dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / 509759e
Reports root/ big
OK: 4536 / Failed: 0 / User-skipped: 114 / Auto-skipped: 0

Copy link

codecov bot commented Mar 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.43%. Comparing base (e284af4) to head (509759e).

Additional details and impacted files
@@                     Coverage Diff                     @@
##           consistent-node-cleanup    #4251      +/-   ##
===========================================================
- Coverage                    84.44%   84.43%   -0.02%     
===========================================================
  Files                          552      552              
  Lines                        33549    33546       -3     
===========================================================
- Hits                         28331    28324       -7     
- Misses                        5218     5222       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from consistent-node-cleanup to master March 27, 2024 12:03
@chrzaszcz chrzaszcz marked this pull request as ready for review March 28, 2024 09:13
Copy link
Contributor

@jacekwegr jacekwegr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! 👍

@jacekwegr jacekwegr merged commit f67e8f6 into master Mar 28, 2024
4 checks passed
@jacekwegr jacekwegr deleted the cets/revert-dist-blocker branch March 28, 2024 10:20
@jacekwegr jacekwegr added this to the 6.2.1 milestone Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants