Feature: add `Raft::ensure_linearizable()` to ensure linearizable read #964
Conversation
The `Raft::is_leader()` method does not fully ensure linearizable read operations and is deprecated in this version. Instead, applications should use the `Raft::ensure_linearizable()` method to guarantee linearizability.

Under the hood, `Raft::ensure_linearizable()` obtains a `ReadIndex` from `RaftCore` if it remains the leader, and blocks until the state machine applies up to the `ReadIndex`. This process ensures that the application observes all state visible to a preceding read operation.

- Fix: databendlabs#965

Upgrade tip: Replace `Raft::is_leader()` with `Raft::ensure_linearizable()`.
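A minimal usage sketch of the upgrade, assuming an application type config named `TypeConfig` and a hypothetical helper `read_from_state_machine()` (neither name is part of this PR):

```rust
// A hedged sketch, not the exact openraft API surface: `TypeConfig` and
// `read_from_state_machine()` are assumed application-side names.
async fn linearizable_read(raft: &openraft::Raft<TypeConfig>) -> anyhow::Result<u64> {
    // Previously: raft.is_leader().await?  (deprecated by this PR)
    // Now: block until the state machine has applied up to the ReadIndex
    // obtained from RaftCore, then serve the read locally.
    raft.ensure_linearizable().await?;
    Ok(read_from_state_machine().await)
}
```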
I don't see a fundamental problem with the approach. However, it's not useful for high-performance servers, where (linearizable) reads are frequent.
A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set a leader lease time of X, then each follower sets its election timeout to X + a random value and refuses to vote for a new leader when the vote request arrives within a period of X after the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it will continue to be the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).
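To make the proposal concrete, here is a minimal sketch of the follower-side rule described above; all names are assumptions and this is not openraft code:

```rust
use std::time::{Duration, Instant};

// Sketch only: refuse to grant votes while the current leader's lease (the "X"
// above) has not yet expired, measured from the last keep-alive received.
struct FollowerLeaseState {
    last_leader_contact: Instant,
    leader_lease: Duration, // e.g. 1s when keep-alives arrive every 300ms
}

impl FollowerLeaseState {
    fn may_grant_vote(&self, now: Instant) -> bool {
        // The election timeout is effectively lease + random backoff; within
        // the lease window, vote requests from other candidates are rejected.
        now.duration_since(self.last_leader_contact) >= self.leader_lease
    }
}
```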
@tvsfx
You are right. This PR is meant to stabilize an API for linearizable read based on the current implementation.
The lease-based read can be done with `clock_progress`, which tracks what wall clock time has been granted by a quorum to the leader.
Previously, schreter wrote…

> A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set a leader lease time of X, then each follower sets its election timeout to X + a random value and refuses to vote for a new leader when the vote request arrives within a period of X after the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it will continue to be the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).
@schreter; you are right that leader-based leases achieve better performance in most cases. A couple of notes, though:
- This approach does not rely on bounded clock drift for its correctness.
- It is easier to implement than leader leases.
- It can be used as a stepping stone for other types of reads: trading off leader clock cycles for network I/O by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately.
openraft/src/docs/protocol/read.md
line 60 at r1 (raw file):
When the `last_applied_log_id` meets or exceeds `read_log_id`, the state machine contains all state up to `A`. Therefore, a linearizable read is assured when `last_applied_log_id >= read_log_id`.
This is not true if `read_log_id` might still be reverted; see below.
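For context, the condition quoted from `read.md` amounts to a comparison like the following (a minimal sketch, generic over whatever ordered log identifier is used; the function name is an assumption):

```rust
// Sketch: a linearizable read may be served once the locally applied position
// has caught up with the read_log_id obtained from the leader.
fn linearizable_read_ready<T: PartialOrd>(last_applied: Option<&T>, read_log_id: &T) -> bool {
    matches!(last_applied, Some(applied) if applied >= read_log_id)
}
```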
openraft/src/raft/mod.rs
line 450 at r1 (raw file):
    &self,
) -> Result<
    (Option<LogId<C::NodeId>>, Option<LogId<C::NodeId>>),
Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use?

Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`; just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?
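A tiny sketch of the type being suggested here; this is hypothetical and not an existing openraft type:

```rust
// Hypothetical: once an entry is known to be committed, its index alone
// identifies it, so an index newtype could stand in for a full LogId in APIs
// such as save_committed/read_committed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CommittedLogIndex(pub u64);
```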
openraft/src/raft_state/mod.rs
line 244 at r1 (raw file):
let committed = self.committed();
std::cmp::max(leader_first, committed)
I like the idea of taking the `max`, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check whether an entry has been committed this term, because you pick at least `leader_first`.

However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index`, and not a `LogId` (see previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and be overwritten by a different entry at the same index.

I think the easiest fix might be to just check whether the leader has already committed an entry, and reject the request if that is not the case; the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.
openraft/src/raft/mod.rs
line 450 at r1 (raw file):
Previously, tvsfx wrote…

> Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use?
>
> Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`; just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?
I have a preference for `LogId` over `LogIndex`. Given the choice, I would consistently opt for `LogId` and steer clear of `LogIndex`.

The order of `index` doesn't necessarily represent the order of events. Although in Raft the order of indices is used to determine the order of events, each time I employ `index` I am compelled to re-evaluate a scenario to affirm its accuracy, which is quite tiresome. `LogId`, on the other hand, invariably represents the order of events.

P.S.

A more profound rationale is that within Openraft, the `vote` isn't required to implement the full `Ord` trait; it is sufficient for it to implement `PartialOrd` to function properly. Moreover, `LogId.committed_leader_id` doesn't need to fulfill the `Ord` trait either; implementing `PartialOrd` would be adequate in a consensus protocol that permits the existence of multiple leaders. These are not goals of Openraft, but to me it's clearer to think this way.
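A small illustration of the `PartialOrd` point, using stand-in types rather than openraft's actual `Vote`:

```rust
use std::cmp::Ordering;

// Stand-in only: a vote keyed by (term, voted_for). Two votes in the same term
// for different candidates are simply incomparable, so PartialOrd suffices and
// a total Ord is never required.
#[derive(PartialEq, Eq, Clone, Copy)]
struct SketchVote {
    term: u64,
    voted_for: u64,
}

impl PartialOrd for SketchVote {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match self.term.cmp(&other.term) {
            Ordering::Equal if self.voted_for == other.voted_for => Some(Ordering::Equal),
            Ordering::Equal => None, // same term, different candidate: incomparable
            ord => Some(ord),
        }
    }
}
```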
openraft/src/raft_state/mod.rs
line 244 at r1 (raw file):
Previously, tvsfx wrote…

> I like the idea of taking the `max`, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check whether an entry has been committed this term, because you pick at least `leader_first`.
>
> However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index`, and not a `LogId` (see previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and be overwritten by a different entry at the same index.
>
> I think the easiest fix might be to just check whether the leader has already committed an entry, and reject the request if that is not the case; the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.
The blank log may indeed be reverted, but I believe the invariant still holds:

- With the blank log's LogId as `B` and the last committed LogId as `C`, where `B > C`;
- If `B` is reverted, a subsequent blank log `B₁` will eventually be committed, where `B₁ > B`;
- Consequently, when the LogId of the last committed log updates to `B₁`, fulfilling the condition `C' >= B₁ > B > C`, the state machine will be up to date and ready for reading.
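A concrete instance of this argument, sketched with `(term, index)` pairs standing in for `LogId` (lexicographic tuple order is assumed to match `LogId` ordering):

```rust
fn main() {
    let c = (2u64, 9u64);   // last committed LogId C
    let b = (3u64, 10u64);  // leader's blank LogId B, not yet committed, B > C
    // Suppose B is reverted and a later leader's blank log B₁ gets committed:
    let b1 = (4u64, 10u64); // B₁ > B because its term is greater
    let c_prime = b1;       // the committed LogId eventually advances to C' >= B₁
    assert!(c_prime >= b1 && b1 > b && b > c);
    // Once C' >= B₁, the state machine is at least as fresh as the ReadIndex requires.
}
```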
Can you give me a counterexample if I missed anything?
Right, although I would argue that the fact that we allow the blank log to be reverted without invalidating the read index suggests that we do not actually care about the order of events in this case; we only care about the order of log entries.
Just came back here to say this: once I was away from my computer, I realized that what I said before is wrong. Apologies for the confusion :)
Great!
I appreciate the confirmation on this approach!
I'm gonna merge it!
openraft/src/docs/protocol/read.md
line 60 at r1 (raw file):
Previously, tvsfx wrote…
> This is not true if `read_log_id` might still be reverted; see below.
Done.
> This approach does not rely on bounded clock drift for its correctness.
Yes, that's the deficiency of a leader lease. OTOH, I don't think we have any practical problem with it, since clock drift will likely be <<1% on all modern systems, most likely always under RTT.
> It is easier to implement than leader leases.
Implementing leader leases is actually quite straightforward, as I described above (see the sketch after this list):
- The leader sends its lease time as a delta to the communication time (NOT the target wall clock time!) with each communication to its followers, or the lease time is configured per cluster.
- Each follower updates the earliest timestamp at which it may start an election and/or accept a vote request from someone other than the current leader (probably including some safety margin for clock drift) upon receiving a message from the leader.
- The leader then has a lease from the communication time to the communication time + lease time.
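A sketch of the leader-side bookkeeping implied by these steps; the names are assumptions, not openraft APIs:

```rust
use std::time::{Duration, Instant};

// Sketch only: the lease is anchored at the send time of a message that a
// quorum later acknowledged, never at a target wall-clock time.
struct LeaderLease {
    lease: Duration,
    quorum_acked_send_time: Option<Instant>,
}

impl LeaderLease {
    /// Record that a quorum acknowledged a message originally sent at `sent_at`.
    fn on_quorum_ack(&mut self, sent_at: Instant) {
        if self.quorum_acked_send_time.map_or(true, |t| sent_at > t) {
            self.quorum_acked_send_time = Some(sent_at);
        }
    }

    /// Lease-based reads may be served locally until this deadline (if any).
    fn lease_deadline(&self) -> Option<Instant> {
        self.quorum_acked_send_time.map(|t| t + self.lease)
    }
}
```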
> It can be used as a stepping stone for other types of reads: trading off leader clock cycles for network I/O by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately.
As I wrote, I don't have any problem with this approach (I gave my approval), exactly because of the reasons you are citing here - possible leader offload in the future (though that was more my gut feeling than something provable :-).
Most likely, we should simply combine the two, since the leader lease is practically zero-cost (if configured at the cluster level, equally for each replica).
@drmingdrmer Not sure what you meant with the link to the replication handler. That doesn't prevent the follower from starting an election and becoming a leader before the lease expires, right?
@schreter I agree that it shouldn't matter in most cases. However, our use case is one in which we do not have access to a trustworthy time source, so this implementation is very useful to us.
Changelog
Feature: add `Raft::ensure_linearizable()` to ensure linearizable read

The `Raft::is_leader()` method does not fully ensure linearizable read operations and is deprecated in this version. Instead, applications should use the `Raft::ensure_linearizable()` method to guarantee linearizability.

Under the hood, `Raft::ensure_linearizable()` obtains a `ReadIndex` from `RaftCore` if it remains the leader, and blocks until the state machine applies up to the `ReadIndex`. This process ensures that the application observes all state visible to a preceding read operation.

ReadIndex #965

Upgrade tip: Replace `Raft::is_leader()` with `Raft::ensure_linearizable()`.

This PR implements the very basic requirement proposed in: