Skip to content
This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Don't commit entries from previous terms by counting replicas + noop entry upon leader election #336

Merged
merged 5 commits into from
Dec 16, 2022

Conversation

MathieuBordere
Copy link
Contributor

@MathieuBordere MathieuBordere commented Dec 13, 2022

An entry from a previous term will no longer be committed by counting votes (which violates the raft spec), instead upon leader election a leader applies a noop barrier entry to its log. When committing the noop barrier entry, all previous entries in the log will be committed.

The noop entry upon leader election is needed because raft, on startup doesn't know if its latest configuration loaded from disk is committed or not, therefore it must set configuration_uncommitted_index (to be added in following PR) and will reject any configuration changes requested by the client. To clear configuration_uncommitted_index the entry at that index must be committed. It's also needed when a follower becomes leader while still having a non-0 configuration_uncommitted_index.

This also includes a fix to update last_applied before firing an applied request's callback. The callback could contain logic dependent on last_applied which should have logically increased after applying the entry, but before firing the callback. This fixes an issue in dqlite where a barrier could fire another barrier in its callback.

fixes #220.

! This will break the dqlite tests, but not dqlite's functionality, I've updated raft's version so dqlite's configure step can test against it.

@codecov
Copy link

codecov bot commented Dec 13, 2022

Codecov Report

Merging #336 (5cc0bed) into master (85059c4) will increase coverage by 0.07%.
The diff coverage is 86.66%.

❗ Current head 5cc0bed differs from pull request most recent head b4bdc8a. Consider uploading reports for the commit b4bdc8a to get more accurate results

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
+ Coverage   84.12%   84.19%   +0.07%     
==========================================
  Files          50       50              
  Lines        9416     9460      +44     
  Branches     2500     2515      +15     
==========================================
+ Hits         7921     7965      +44     
+ Misses        940      936       -4     
- Partials      555      559       +4     
Impacted Files Coverage Δ
src/convert.c 86.66% <75.00%> (+5.21%) ⬆️
src/replication.c 78.12% <89.58%> (+0.38%) ⬆️
src/fixture.c 94.62% <0.00%> (-0.10%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

src/replication.c Outdated Show resolved Hide resolved
@MathieuBordere MathieuBordere marked this pull request as draft December 13, 2022 14:22
@MathieuBordere
Copy link
Contributor Author

Converted to draft to optimize the vote counting a bit first.

@MathieuBordere
Copy link
Contributor Author

MathieuBordere commented Dec 13, 2022

Added an optimization, need to test it a bit more to feel comfortable to release it for review. In short, raft will only try to commit log entry indices that are smaller than or equal to the largest match_index of another voter in the cluster (except in some corner cases).

This fixes a case where the loop that probes for new commit indices would run needlessly for e.g. a 1000 times when the leader's match index is for example 5000 and the next voter's match index is only 4000. In this case it only makes sense to start looking for a new commit index from 4000 and below.

@MathieuBordere
Copy link
Contributor Author

The overhaul of the replicationQuorum logic might seem like a bit of overkill, but the old implementation contained a bug I think.
It's uncovered in the code I added to the test here.
In short, when a leader continuously receives apply requests, it will continuously update its last_stored index, that will always remain higher than the match_index of the follower, resulting in a commit_index that will never be updated.

src/convert.c Outdated Show resolved Hide resolved
src/replication.c Outdated Show resolved Hide resolved
if (index <= r->commit_index) {
return;
/* In a single voter cluster where this node is part of the configuration,
* is a voter all stored entries can be committed. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is a voter can be dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will investigate if it can be dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will investigate if it can be dropped.

Not sure it's clear, I mean that the words is a voter can be dropped from the comment (probably they are a leftover? the full comment does quite sound grammatically correct to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, yes the sentence indeed looks a bit strange.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests also still pass when I remove r->configuration.servers[server_index].role == RAFT_VOTER from the check, I will investigate anyway to make sure my understanding is correct.

Copy link
Contributor

@freeekanayaka freeekanayaka Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests also still pass when I remove r->configuration.servers[server_index].role == RAFT_VOTER from the check, I will investigate anyway to make sure my understanding is correct.

I was a bit suspicious about that condition too, but didn't mention it actually. I would not expect that cases where there is a single server and that server is not a voter are possible, because otherwise there would be no way to make progress and commit further entries. So that condition could be turned into an assertion. However, I'm just hand-waving, not sure.

Copy link
Contributor Author

@MathieuBordere MathieuBordere Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check actually says that there is 1 voter in the cluster, not that there is a single server. Checking for r->configuration.servers[server_index].role == RAFT_VOTER then means that this server was not demoted to standby or spare in the latest configuration and is busy committing the configuration change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah sorry, I misread the configurationVoterCount call, that's the number of voters indeed, not servers. Makes sense then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I double-checked and I think the condition is fine. The test still succeed after removing the RAFT_VOTER condition because the test is less strict then and the start_index is chosen to be last_stored on the current leader, which will be the highest possible index that can be committed anyway.

src/replication.c Outdated Show resolved Hide resolved
Copy link
Contributor

@freeekanayaka freeekanayaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me at first sight. Just a new minor nits.

src/replication.c Outdated Show resolved Hide resolved
configure.ac Show resolved Hide resolved
src/convert.c Outdated Show resolved Hide resolved
* of matchIndex[i] >= N, and log[N].term == currentTerm:
* set commitIndex = N (5.3, 5.4).
* */
start_index = replicationCalculateCommitIndex(r);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought of a different approach here: sort the nodes' match indices from highest to lowest, then check for a majority at each of those indices only. Whatever the highest index we can commit is, it's going to be equal to the match index i of some node n, such that for i we have a majority but for i + 1 we don't (because we lose n). That way the number of runs of the loop body below is bounded by the number of nodes, instead of the length of the log. What do you think?

Copy link
Contributor

@cole-miller cole-miller Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it'll always be the median (rounded up) of that sorted array of match indices, right? So no looping at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's worth it to optimize this, let me think about it for a bit.

Copy link
Contributor Author

@MathieuBordere MathieuBordere Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we can keep the current (what's on master) implementation of replicationQuorum, while adding the check to not commit entries by counting if the term is older than the current term, on the condition that we call it here

replicationQuorum(r, r->last_stored);
with last_index instead of r->last_stored. last_index is equal to the match_index of the server from which we have just received an AppendEntriesResultRPC. (due to
if (!progressMaybeUpdate(r, i, last_index)) {
)

This works because a new commit_index is always equal to some server's match_index and all match_index updates trickle down to the replicationQuorum call. Unless this call

rv = triggerActualPromotion(r);
fails, but then we convert to RAFT_UNAVAILABLE anyway.

It's a lot simpler solution than the current one proposed in this PR (and still passes the new tests). Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reasoning seems sound to me, and I like the simplification. I guess that here

raft/src/replication.c

Lines 520 to 521 in 0ddb329

/* Check if we can commit some new entries. */
replicationQuorum(r, r->last_stored);

we'd still use r->last_stored, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reasoning seems sound to me, and I like the simplification. I guess that here

raft/src/replication.c

Lines 520 to 521 in 0ddb329

/* Check if we can commit some new entries. */
replicationQuorum(r, r->last_stored);

we'd still use r->last_stored, right?

yes

src/replication.c Show resolved Hide resolved
Mathieu Borderé added 5 commits December 15, 2022 11:11
The callback could contain logic that depends on the value of last_applied,
which logically should have increased after applying the log entry, but
before calling the callback.
Add test to ensure the commit_index is updated when a leader
continually receives apply requests. The test fails without the fix in this commit.

Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Adapt the tests to take into account the add barrier entries in the log.
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
@MathieuBordere
Copy link
Contributor Author

Implemented the simplification described here. Removed the test that checks that we don't commit entries from previous terms by counting votes, as it was really testing nothing anymore after this change (it was hacky anyway) and from the code it's clear we can't commit entries from an older term by counting voters.

@freeekanayaka
Copy link
Contributor

It's surely simpler :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Committing entries from previous terms
3 participants