Skip to content

SDSTOR-21438: Adjust commit_quorum reduction ordering#875

Merged
yuwmao merged 1 commit intoeBay:masterfrom
yuwmao:quorum
Apr 8, 2026
Merged

SDSTOR-21438: Adjust commit_quorum reduction ordering#875
yuwmao merged 1 commit intoeBay:masterfrom
yuwmao:quorum

Conversation

@yuwmao
Copy link
Copy Markdown
Contributor

@yuwmao yuwmao commented Apr 2, 2026

When two members are down, the raft leader yields leadership after leadership_expiry (20x heartbeat). reset_quorum_size must be called before the NOT_LEADER check so the node can maintain/reclaim leadership with election_quorum=1.

Previously, the TwoMemberDown UT called replace_member immediately, so the leadership doesn't expired. Sleep 10s to simulate leader expiry.

One thing to mention is that if a request fails after NOT_LEADER check, need retry twice, because it will reset commit_quorum to 0 before return.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 2, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 16.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.22%. Comparing base (1a0cef8) to head (50b89a6).
⚠️ Report is 325 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 16.66% 1 Missing and 9 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #875      +/-   ##
==========================================
- Coverage   56.51%   48.22%   -8.29%     
==========================================
  Files         108      110       +2     
  Lines       10300    12900    +2600     
  Branches     1402     6204    +4802     
==========================================
+ Hits         5821     6221     +400     
+ Misses       3894     2566    -1328     
- Partials      585     4113    +3528     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yuwmao
Copy link
Copy Markdown
Contributor Author

yuwmao commented Apr 2, 2026

When two members are down, the raft leader yields leadership after leadership_expiry (20x heartbeat). reset_quorum_size must be called before the NOT_LEADER check so the node can maintain/reclaim leadership with election_quorum=1. Add comment explaining this ordering constraint.

Previously, the TwoMemberDown UT called replace_member immediately, so the leadership doesn't expired. Sleep 10s to simulate leader expiry.

@yuwmao yuwmao closed this Apr 2, 2026
@yuwmao yuwmao reopened this Apr 2, 2026
@yuwmao yuwmao requested a review from xiaoxichen April 2, 2026 12:51
@yuwmao yuwmao changed the title Adjust commit_quorum reduction ordering SDSTOR-21438: Adjust commit_quorum reduction ordering Apr 8, 2026

if (m_my_repl_id != get_leader_id()) { return make_async_error<>(ReplServiceError::NOT_LEADER); }
// Check if leader itself is requested to move out.
if (m_my_repl_id == member_out.id) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that better to move if (m_my_repl_id == member_out.id) check at the very beginning?

bool is_leader = m_my_repl_id == get_leader_id());
// I am not the one to be removed
if (m_my_repl_id == member_out.id)  {
      if (is_leader). yield_leadership();
     return make_async_error<>(ReplServiceError::NOT_LEADER);
}
// not leader handling
if (!is_leader) {
    if (commit_quorum >=1) {
        // reduce quorum size ....
        reset_quorum_size(commit_quorum, trace_id);
    }
    return make_async_error<>(ReplServiceError::NOT_LEADER); 
}
// Now I am the leader

When two members are down, the raft leader yields leadership after
leadership_expiry (20x heartbeat). reset_quorum_size must be called
before the NOT_LEADER check so the node can maintain/reclaim leadership
with election_quorum=1.

Previously, the TwoMemberDown UT called replace_member immediately, so the leadership doesn't expired. Sleep 10s to simulate leader expiry.
@yuwmao yuwmao merged commit 6e2fa5c into eBay:master Apr 8, 2026
40 of 41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants