Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant information in LeaderId::node_id #652

Closed
schreter opened this issue Feb 7, 2023 · 7 comments
Closed

Redundant information in LeaderId::node_id #652

schreter opened this issue Feb 7, 2023 · 7 comments

Comments

@schreter
Copy link
Collaborator

schreter commented Feb 7, 2023

Currently, LeaderId contains not only term, which is sufficient to identify the leader's term and thus also the validity of log entries, but it also contains node_id of the leader.

This information is redundant, but nevertheless always transported via LogId (which subsumes LeaderId). This leads to unnecessary complexity, since either the node ID of the leader must be logged or otherwise recovered at log replay time.

The only location in code where LeaderId::node_id is evaluated seems to be in Engine::update_progress(), where it would likely suffice to compare the term (the same term cannot be used by two different leaders).

There is a TODO in leader_id.rs already. Most likely, it would be sufficient to simply drop node_id from this structure.

In any case, removing the comparison for testing purposes doesn't have any effect on test outcome - all green.

diff --git a/openraft/src/engine/handler/replication_handler.rs b/openraft/src/engine/handler/replication_handler.rs
index 4d0d86c..165a630 100644
--- a/openraft/src/engine/handler/replication_handler.rs
+++ b/openraft/src/engine/handler/replication_handler.rs
@@ -86,7 +86,7 @@ where
     pub(crate) fn try_commit_granted(&mut self, granted: Option<LogId<NID>>) {
         // Only when the log id is proposed by current leader, it is committed.
         if let Some(c) = granted {
-            if c.leader_id.term != self.state.vote.term || c.leader_id.node_id != self.state.vote.node_id {
+            if c.leader_id.term != self.state.vote.term {
                 return;
             }
         }

My suggestion is to remove the node_id member, since it's basically unused and not needed. Going further, maybe the entire LeaderId structure can be simply replaced by term?

Or is there some error in my thinking?

Thanks.

@github-actions
Copy link

github-actions bot commented Feb 7, 2023

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Copy link
Member

I added a node_id to LeaderId because LeaderId(term, node_id) reduces conflict during an election:

Given 3 nodes, if all of them start to elect at the same time, with the same term, without node_id, it is very likely there is no leader established in this term:

  • a voted for itself and sent vote requests to b, c;
  • b voted for itself and sent vote requests to a, c;
  • c voted for itself and sent vote requests to a, b;

Because they all have already voted for themselves, they just reject every received vote request from other nodes. This is the way the standard raft does

But with node_id, finally there will be a leader elected in this term. Because LeaderId(term=1, node_id=2) > LeaderId(term=1, node_id=1), the node with greatest node_id will finally succeed.

In any case, removing the comparison for testing purposes doesn't have any effect on test outcome - all green.

It has to compare node_id too here, because there might have been multiple leaders in the same term, e.g., LeaderId(term=3,node_id=a) then a new LeaderId(term=3,node_id=b) established. In this case, a log of LeaderId(term=3,node_id=a) being replicated by LeaderId(term=3,node_id=b) does not imply the log is committed.

It looks I have to add some tests for such a case:(

The only location in code where LeaderId::node_id is evaluated seems to be in Engine::update_progress(), where it would likely suffice to compare the term (the same term cannot be used by two different leaders).

This is the only location where it is explicitly used. When handling a vote request, a node just compares votes with req.vote >= self.vote, this is the most important usage of it.

@schreter
Copy link
Collaborator Author

schreter commented Feb 9, 2023

Hm, I still don't get it. The standard Raft uses only term, but no node_id for its election and it works correctly. Raft guarantees by its election process that there cannot be two leaders elected for the same term. So how come there are two leaders with the same term in openraft?

In the first case you described (all start voting for self at the exact time), there would be no leader elected for this term, which would trigger another election with a new term. The likelihood that even the second election again occurs at exact the same time on all nodes (and fails too) is virtually zero.

When I understand you correctly, you only store node_id to speed up the election? If so, I don't think it makes sense to optimize the "bad" case (new election occurring once in a while) by penalizing the "good" case all the time (e.g., additional info to be logged; occurring all the time).

The vote handling compares log ID (i.e., term + log index). That should be sufficient, since the same term cannot be used by two different nodes. I don't see how the node_id should influence anything there.

@drmingdrmer
Copy link
Member

Hm, I still don't get it. The standard Raft uses only term, but no node_id for its election and it works correctly. Raft guarantees by its election process that there cannot be two leaders elected for the same term. So how come there are two leaders with the same term in openraft?

In standard raft, a leader is not just identified by term. A leader is identified by a tuple LeaderId: (term, voted_for), e.g., a node has voted for node-1 in term 3 has its LeaderId stored as (term=3, voted_for=1).

image

The LeaderId in standard raft is a partial-order value, because voted_for=1 !> voted_for=2 and voted_for=1 !< voted_for=2. I had a slide explaining the conflict when candidate y in term 5 tries sending vote request to another node in term 5:

image

And this partial-order LeaderId allows only one leader to be granted in every term in the standard raft, term, without voted_for can be used as a leader identifier.

In openraft LeaderId(term, node_id) is a total-order value, i.e., LeaderId(term=3, node_id=2) > LeaderId(term=3,node_id=1), which makes it possible to have multiple leaders in one term. But still, at any time, there is only one leader is granted by a quorum. E.g., LeaderId(term=3,node_id=1) is granted by node-1 and node-2(3 nodes cluster), then LeaderId(term=3,node_id=2) is granted by node-1 and node-2. Because LeaderId is totally ordered, a node can grant LeaderId(term=3,node_id=2) even when it has already granted LeaderId(term=3,node_id=1).

image

In the first case you described (all start voting for self at the exact time), there would be no leader elected for this term, which would trigger another election with a new term. The likelihood that even the second election again occurs at exact the same time on all nodes (and fails too) is virtually zero.

Right, the chance there is another round of conflict is rare. But still, it takes two rounds of election. The time gap between these two rounds will be considered service downtime(about 2 RTT).

When I understand you correctly, you only store node_id to speed up the election?

Generally speaking, yes:). And it simplifies the logic. Adding another field will save several lines of code that handle term.

If so, I don't think it makes sense to optimize the "bad" case (new election occurring once in a while) by penalizing the "good" case all the time (e.g., additional info to be logged; occurring all the time).

Hm... to me, with or without a node_id in a log entry, the storage or transport efficiency can be well optimized when needed, e.g., extracting term, node_id and storing them separately. But the election conflict can not be solved without node_id in the LeaderId.

@schreter
Copy link
Collaborator Author

Wow, what a long explanation :-), thanks.

I was imprecise, sorry. Of course, (one) voted-for node ID needs to be stored and the node_id is sent as part of the Vote message.

So we can conclude that the node_id storage is for speeding up the election. I can live with that, I just have to find a way how to optimize log storage that we don't have to store the additional node_id information for each log entry unnecessarily. Maybe something like not storing the term/node_id per log entry at all and rather generate "switch" log entries when there is a change.

Alternatively, looking at the code again, it seems like Entry could use a simpler LogId only containing term and index, without the node_id. The node_id seems to be only relevant for replication reply, where it can be easily computed based on the last voted_for for Matching. For Conflict, it can be ignored, since it's not used. Of course, the default implementation of RaftLogReader would be more complex/impossible and quite a few changes would be needed elsewhere...

I'll probably go with the above mentioned storage optimization in the log ("switch" entries) or we'll store our 16B node IDs for each log entry anyway.

@schreter schreter closed this as not planned Won't fix, can't repro, duplicate, stale Feb 10, 2023
@drmingdrmer
Copy link
Member

Alternatively, looking at the code again, it seems like Entry could use a simpler LogId only containing term and index, without the node_id. The node_id seems to be only relevant for replication reply, where it can be easily computed based on the last voted_for for Matching. For Conflict, it can be ignored, since it's not used. Of course, the default implementation of RaftLogReader would be more complex/impossible and quite a few changes would be needed elsewhere...

LogId is also used on a follower to determine if a log entry is identical to the follower's local log entry at the same index, when a follower receives append-entries request.

I have a LogIdList structure that contains every log id from which a new LeaderId starts:
https://github.com/datafuselabs/openraft/blob/main/openraft/src/engine/log_id_list.rs

In fact, in my opinion, such a structure reflects more precisely the data structure in raft.
Logs should be a 2 dimension list entry = logs[leader_id][index].

May be I can introduce a feature to disable/enable to contain a node-id in LogId. 🤔

I'll probably go with the above mentioned storage optimization in the log ("switch" entries) or we'll store our 16B node IDs for each log entry anyway.

I did not know you have such a big node-id :(
Is it possible using a u32 node-id and store the 16B real id in a Node?
Membership stores every NodeId and Node for applications:

pub struct Membership<NID, N>
where
N: Node,
NID: NodeId,
{
/// Multi configs of members.
///
/// AKA a joint config in original raft paper.
configs: Vec<BTreeSet<NID>>,
/// Additional info of all nodes, e.g., the connecting host and port.
///
/// A node-id key that is in `nodes` but is not in `configs` is a **learner**.
nodes: BTreeMap<NID, N>,
}

@schreter
Copy link
Collaborator Author

LogId is also used on a follower to determine if a log entry is identical to the follower's local log entry at the same index, when a follower receives append-entries request.

Sure, but considering that Raft doesn't allow to have two leaders for the same term, the term comparison must be a sufficient condition - no need for node_id there.

May be I can introduce a feature to disable/enable to contain a node-id in LogId. 🤔

I wouldn't complicate the API and core algorithms. Either-or, I'd say.

We have plenty of optimization potential elsewhere (e.g., get rid of Vec<T> and similar on the API, replacing them with impl IntoIterator<T> or similar to prevent unnecessary memory allocation, where not required, etc.).

I did not know you have such a big node-id :( Is it possible using a u32 node-id and store the 16B real id in a Node?

Indeed, I'm thinking about doing exactly that. So far, we didn't use the Node, but maybe it's a better way; simply store Raft node index in the node_id and use Node to store the actual node ID (we use 16B GUIDs; that's a long story). That would actually make some other operations simpler (restoring a consensus domain replica in a catastrophic scenario on a completely unrelated node w/o the need to "renumber" everything).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants