This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed #386

Closed
cole-miller opened this issue Mar 14, 2023 · 9 comments · Fixed by #429 or #467
Assignees
Labels
Bug Confirmed to be a bug

Comments

@cole-miller
Contributor

Found by Jepsen, here's the job:

https://github.com/canonical/jepsen.dqlite/actions/runs/4418832275/jobs/7746473216

Hat tip to @nurturenature, whose recent improvements to our Jepsen harness helped uncover this.

@cole-miller changed the title to "recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed" Mar 14, 2023
@cole-miller
Contributor Author

Here's the failing assertion:

/* If we get here it means that the term in the request matches our current
* term or it was higher and we have possibly stepped down, because we
* discovered the current leader:
*
* From Figure 3.1:
*
* Rules for Servers: Candidates: if AppendEntries RPC is received from
* new leader: convert to follower.
*
* From Section 3.4:
*
* While waiting for votes, a candidate may receive an AppendEntries RPC
* from another server claiming to be leader. If the leader's term
* (included in its RPC) is at least as large as the candidate's current
* term, then the candidate recognizes the leader as legitimate and
* returns to follower state. If the term in the RPC is smaller than the
* candidate's current term, then the candidate rejects the RPC and
* continues in candidate state.
*
* From state diagram in Figure 3.3:
*
* [candidate]: discovers current leader -> [follower]
*
* Note that it should not be possible for us to be in leader state, because
* the leader that is sending us the request should have either a lower term
* (and in that case we reject the request above), or a higher term (and in
* that case we step down). It can't have the same term because at most one
* leader can be elected at any given term.
*/
assert(r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE);

If the invariants described in that comment are being violated, that seems pretty serious.

@cole-miller
Contributor Author

This continues to happen, here's a job where you can also see a consistency violation (stale reads): https://github.com/canonical/jepsen.dqlite/actions/runs/5187761474/jobs/9350528154

@freeekanayaka
Contributor

freeekanayaka commented Jun 6, 2023

This continues to happen, here's a job where you can also see a consistency violation (stale reads): https://github.com/canonical/jepsen.dqlite/actions/runs/5187761474/jobs/9350528154

Looking at the logs, it looks like term 11 had two leaders. The first term-11 leader was n2, elected at:

LIBRAFT   1686049635974788419 src/recv_request_vote_result.c:132 votes quorum reached -> convert to leader

The second term-11 leader was n3, elected about 8 seconds later at:

LIBRAFT   1686049643929871543 src/recv_request_vote_result.c:132 votes quorum reached -> convert to leader

Perhaps n3 was partitioned before n2 was elected and was stuck at term 10; at some point it became a candidate, bumped to term 11, and won the election. I will investigate further to see what happened exactly.

@freeekanayaka
Contributor

It seems that the data.tar.bz2 file in the GHA artifact saved for the Jepsen run is empty, for all nodes. Is that expected? That is, are we deliberately excluding the node's data dir from the artifact zip because of space concerns, or is this a bug in our Jepsen code?

@cole-miller
Contributor Author

@freeekanayaka I think it's a bug.

@MathieuBordere
Contributor

Happened again here
jepsen-data-append-partition-failure.zip

Not on a master run, but I don't think that matters ...

@cole-miller
Contributor Author

cole-miller commented Jul 31, 2023

It seems that what happens in the logs Mathieu posted is that n3 goes offline at some point, comes back, and gets its log up to speed, but somehow keeps operating with a very old configuration (from several ASSIGNs ago). So when the election for term 4 happens, n2 thinks the voters are {n1, n2, n4}, while n3 thinks the voters are {n1, n3, n5}. Each receives one other vote on top of its own and becomes leader, and some time later n2 gets an unexpected request from n3. I will have to dig further to figure out why n3 doesn't switch to a newer configuration as it updates its log.

@cole-miller
Contributor Author

cole-miller commented Jul 31, 2023

I think the root of the problem might be the code in appendFollowerCb (replication.c) that can update the last_stored index for a configuration log entry without applying that configuration locally, if the node has become a candidate since appendFollowerCb was installed. This makes it possible for n3 to appear up to date to other nodes based on the data in its RequestVote messages, while still operating with an old configuration.

That raises the question of what a candidate node should do if it finds itself applying an (uncommitted) configuration change during its candidacy. It seems like it would be best for it to simply end its candidacy in that case; otherwise counting votes turns into a mess. Alternatively, followers shouldn't convert to candidate until all their in-flight append requests have completed successfully, so that configuration changes get a chance to be applied.

@hifron

hifron commented Nov 24, 2023

I think this is a candidate for review by the corporate database vendors that have turned the replication problem, with all its theory, into real products on the market...

There is also Postgres, but I think they are a century behind...
