1107 fix cluster conf sync wait loop #11897

Merged

@zmstone zmstone commented Nov 7, 2023

Fixes EMQX-11329

When EMQX boots up, it tries to get the latest config from peer (core type)
nodes. If none of the nodes reply, the node boots with its local config
(and replays the committed changes) only if the commit table was loaded
from the local disk (an indication that the data is the latest).
Otherwise it sleeps for 1-2 seconds and retries.
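
The boot-time decision described above can be modelled roughly as follows. This is a minimal Python sketch for illustration only, not the Erlang code in `emqx_conf_app`; the names `PeerReply`, `tnx_id` and `pre_fix_boot_decision` are hypothetical, and picking the peer with the highest id as "latest" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerReply:
    node: str    # name of the replying core node
    tnx_id: int  # hypothetical: id of the last change committed on that peer


def pre_fix_boot_decision(replies: List[PeerReply], loaded_from_local_disk: bool) -> str:
    """Model of the decision made before this fix."""
    if replies:
        # Assumption: "latest config" means the peer with the highest commit id.
        latest = max(replies, key=lambda r: r.tnx_id)
        return f"copy_from:{latest.node}"
    if loaded_from_local_disk:
        # Commit table came from local disk, so local data is taken as latest.
        return "boot_with_local_config"
    # No peer replied and the table was copied from another node: wait.
    return "sleep_and_retry"
```

The last branch is the one that matters for this PR: a node whose commit table was copied from a peer keeps retrying for as long as no peer replies.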

This led to a race condition, e.g. in a two-node cluster:

  1. node1 boots up
  2. node2 boots up and copies the mnesia table from node1
  3. node1 restarts before node2 can sync cluster.hocon from it
  4. node1 boots up and copies the mnesia table from node2

Now both node1 and node2 have the mnesia `load_node` pointing
to each other (i.e. neither is a local disk load).

Prior to this fix, the nodes would wait for each other indefinitely.
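
To make the mutual wait concrete, here is a small Python simulation of the pre-fix behavior. It is a sketch under stated assumptions, not EMQX code: `simulate_pre_fix_boot` is a hypothetical helper, and it assumes a node that is still waiting does not reply to config requests from its peers.

```python
from typing import Dict


def simulate_pre_fix_boot(load_node: Dict[str, str], max_rounds: int = 5) -> Dict[str, str]:
    """load_node maps each node to the node its commit table was loaded from."""
    state = {node: "waiting" for node in load_node}
    for _ in range(max_rounds):
        progressed = False
        for node, source in load_node.items():
            if state[node] == "booted":
                continue
            # Assumption: only a booted peer can reply with its config.
            peer_replied = any(state[p] == "booted" for p in load_node if p != node)
            # A node may boot from a peer reply, or from a local disk load.
            if peer_replied or source == node:
                state[node] = "booted"
                progressed = True
        if not progressed:
            break  # nothing can change any more
    return state


# The race from the steps above: each node's load_node points at the other.
stuck = simulate_pre_fix_boot({"node1": "node2", "node2": "node1"})
# A healthy start: node1 loaded from its own disk, node2 copied from node1.
healthy = simulate_pre_fix_boot({"node1": "node1", "node2": "node1"})
```

In this model `stuck` leaves both nodes in `"waiting"`, while `healthy` ends with both nodes `"booted"`.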

This commit fixes the issue by allowing a node to boot
with its local config if it is not lagging behind.
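
A rough Python model of the relaxed rule is shown below. The description does not spell out how lagging is detected, so the comparison here (a locally applied change id against the latest id in the commit table) is an assumption, and `post_fix_boot_decision`, `local_tnx_id` and `latest_known_tnx_id` are hypothetical names rather than identifiers from the PR.

```python
def post_fix_boot_decision(
    peer_replied: bool,
    loaded_from_local_disk: bool,
    local_tnx_id: int,
    latest_known_tnx_id: int,
) -> str:
    """Model of the boot decision after this fix."""
    if peer_replied:
        return "sync_from_peer"
    if loaded_from_local_disk:
        return "boot_with_local_config"
    # Assumption: a node is lagging when it has applied fewer committed
    # changes than the latest one recorded in the commit table.
    lagging = local_tnx_id < latest_known_tnx_id
    if not lagging:
        # New in this fix: no need to wait for a peer if nothing is missing.
        return "boot_with_local_config"
    return "sleep_and_retry"
```

Under this model, the two-node race no longer blocks: a node whose table was copied from its peer, but which has nothing left to catch up on, boots with its local config instead of retrying.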

Summary

馃 Generated by Copilot at 7bfad34

This pull request improves the cluster configuration synchronization and booting process by using a new module emqx_cluster_rpc that provides better functions to query and update the cluster state. It also refactors and cleans up the code in emqx_conf_app and enhances the logging messages.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@zmstone zmstone force-pushed the 1107-fix-cluster-conf-sync-wait-loop branch 2 times, most recently from 10eb70d to a722df7 Compare November 8, 2023 14:03
@zmstone zmstone force-pushed the 1107-fix-cluster-conf-sync-wait-loop branch from a722df7 to f9e9748 Compare November 8, 2023 14:06
@zmstone zmstone marked this pull request as ready for review November 8, 2023 17:52
@zmstone zmstone requested a review from a team as a code owner November 8, 2023 17:52
thalesmg previously approved these changes Nov 8, 2023
apps/emqx_conf/src/emqx_conf_app.erl (outdated, resolved)
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
@zmstone zmstone merged commit f95058a into emqx:release-53 Nov 8, 2023
155 checks passed
@zmstone zmstone deleted the 1107-fix-cluster-conf-sync-wait-loop branch November 8, 2023 22:37
@zmstone zmstone mentioned this pull request Dec 17, 2023

3 participants