Feature: add `Raft::ensure_linearizable()` to ensure linearizable read #964
Conversation
The `Raft::is_leader()` method does not fully ensure linearizable read operations and is deprecated in this version. Instead, applications should use the `Raft::ensure_linearizable()` method to guarantee linearizability.

Under the hood, `Raft::ensure_linearizable()` obtains a `ReadIndex` from `RaftCore` if it remains the leader, and blocks until the state machine applies up to the `ReadIndex`. This process ensures that the application observes all state visible to a preceding read operation.

- Fix: databendlabs#965

Upgrade tip: Replace `Raft::is_leader()` with `Raft::ensure_linearizable()`.
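A minimal usage sketch of the upgrade, assuming an application type config named `TypeConfig` and a hypothetical helper `read_from_state_machine()` (neither name is part of this PR):

```rust
// A hedged sketch, not the exact openraft API surface: `TypeConfig` and
// `read_from_state_machine()` are assumed application-side names.
async fn linearizable_read(raft: &openraft::Raft<TypeConfig>) -> anyhow::Result<u64> {
    // Previously: raft.is_leader().await?  (deprecated by this PR)
    // Now: block until the state machine has applied up to the ReadIndex
    // obtained from RaftCore, then serve the read locally.
    raft.ensure_linearizable().await?;
    Ok(read_from_state_machine().await)
}
```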
I don't see a fundamental problem with the approach. However, it's not useful for high-performance servers, where (linearizable) reads are frequent.
A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set a leader lease time of X, then each follower sets its election timeout to X + a random value and refuses to vote for a new leader when the vote request arrives within a period of X after the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it will continue to be the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).
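To make the proposal concrete, here is a minimal sketch of the follower-side rule described above; all names are assumptions and this is not openraft code:

```rust
use std::time::{Duration, Instant};

// Sketch only: refuse to grant votes while the current leader's lease (the "X"
// above) has not yet expired, measured from the last keep-alive received.
struct FollowerLeaseState {
    last_leader_contact: Instant,
    leader_lease: Duration, // e.g. 1s when keep-alives arrive every 300ms
}

impl FollowerLeaseState {
    fn may_grant_vote(&self, now: Instant) -> bool {
        // The election timeout is effectively lease + random backoff; within
        // the lease window, vote requests from other candidates are rejected.
        now.duration_since(self.last_leader_contact) >= self.leader_lease
    }
}
```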
@tvsfx
You are right. This PR is meant to stabilize an API for linearizable read based on the current implementation.
The lease-based read can be done with `clock_progress`, which tracks what wall clock time has been granted by a quorum to the leader.
Previously, schreter wrote…

> A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set a leader lease time of X, then each follower sets its election timeout to X + a random value and refuses to vote for a new leader when the vote request arrives within a period of X after the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it will continue to be the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).
@schreter; you are right that leader-based leases achieve better performance in most cases. A couple of notes, though:
- This approach does not rely on bounded clock drift for its correctness.
- It is easier to implement than leader leases.
- It can be used as a stepping stone for other types of reads: trading off leader clock cycles for network I/O by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately.
openraft/src/docs/protocol/read.md
line 60 at r1 (raw file):
When the `last_applied_log_id` meets or exceeds `read_log_id`, the state machine contains all state up to `A`. Therefore, a linearizable read is assured when `last_applied_log_id >= read_log_id`.
This is not true if `read_log_id` might still be reverted; see below.
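For context, the condition quoted from `read.md` amounts to a comparison like the following (a minimal sketch, generic over whatever ordered log identifier is used; the function name is an assumption):

```rust
// Sketch: a linearizable read may be served once the locally applied position
// has caught up with the read_log_id obtained from the leader.
fn linearizable_read_ready<T: PartialOrd>(last_applied: Option<&T>, read_log_id: &T) -> bool {
    matches!(last_applied, Some(applied) if applied >= read_log_id)
}
```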
openraft/src/raft/mod.rs
line 450 at r1 (raw file):
    &self,
) -> Result<
    (Option<LogId<C::NodeId>>, Option<LogId<C::NodeId>>),
Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use?

Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`; just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?
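A tiny sketch of the type being suggested here; this is hypothetical and not an existing openraft type:

```rust
// Hypothetical: once an entry is known to be committed, its index alone
// identifies it, so an index newtype could stand in for a full LogId in APIs
// such as save_committed/read_committed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CommittedLogIndex(pub u64);
```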
openraft/src/raft_state/mod.rs
line 244 at r1 (raw file):
let committed = self.committed();
std::cmp::max(leader_first, committed)
I like the idea of taking the `max`, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check whether an entry has been committed this term, because you pick at least `leader_first`.

However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index`, and not a `LogId` (see previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and be overwritten by a different entry at the same index.

I think the easiest fix might be to just check whether the leader has already committed an entry, and reject the request if that is not the case; the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.
openraft/src/raft/mod.rs
line 450 at r1 (raw file):
Previously, tvsfx wrote…

> Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use?
>
> Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`; just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?
I have a preference for `LogId` over `LogIndex`. Given the choice, I would consistently opt for `LogId` and steer clear of `LogIndex`.

The order of `index` doesn't necessarily represent the order of events. Although in Raft the order of indices is used to determine the order of events, each time I employ `index` I am compelled to re-evaluate a scenario to affirm its accuracy, which is quite tiresome. `LogId`, on the other hand, invariably represents the order of events.

P.S.

A more profound rationale is that within Openraft, the `vote` isn't required to implement the full `Ord` trait; it is sufficient for it to implement `PartialOrd` to function properly. Moreover, `LogId.committed_leader_id` doesn't need to fulfill the `Ord` trait either; implementing `PartialOrd` would be adequate in a consensus protocol that permits the existence of multiple leaders. These are not goals of Openraft, but to me it's clearer to think this way.
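A small illustration of the `PartialOrd` point, using stand-in types rather than openraft's actual `Vote`:

```rust
use std::cmp::Ordering;

// Stand-in only: a vote keyed by (term, voted_for). Two votes in the same term
// for different candidates are simply incomparable, so PartialOrd suffices and
// a total Ord is never required.
#[derive(PartialEq, Eq, Clone, Copy)]
struct SketchVote {
    term: u64,
    voted_for: u64,
}

impl PartialOrd for SketchVote {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match self.term.cmp(&other.term) {
            Ordering::Equal if self.voted_for == other.voted_for => Some(Ordering::Equal),
            Ordering::Equal => None, // same term, different candidate: incomparable
            ord => Some(ord),
        }
    }
}
```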
openraft/src/raft_state/mod.rs
line 244 at r1 (raw file):
Previously, tvsfx wrote…

> I like the idea of taking the `max`, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check whether an entry has been committed this term, because you pick at least `leader_first`.
>
> However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index`, and not a `LogId` (see previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and be overwritten by a different entry at the same index.
>
> I think the easiest fix might be to just check whether the leader has already committed an entry, and reject the request if that is not the case; the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.
The blank log may indeed be reverted, but I believe the invariant still holds:

- With the blank log's LogId as `B` and the last committed LogId as `C`, where `B > C`;
- If `B` is reverted, a subsequent blank log `B₁` will eventually be committed, where `B₁ > B`;
- Consequently, when the LogId of the last committed log updates to `B₁`, fulfilling the condition `C' >= B₁ > B > C`, the state machine will be up to date and ready for reading.
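A concrete instance of this argument, sketched with `(term, index)` pairs standing in for `LogId` (lexicographic tuple order is assumed to match `LogId` ordering):

```rust
fn main() {
    let c = (2u64, 9u64);   // last committed LogId C
    let b = (3u64, 10u64);  // leader's blank LogId B, not yet committed, B > C
    // Suppose B is reverted and a later leader's blank log B₁ gets committed:
    let b1 = (4u64, 10u64); // B₁ > B because its term is greater
    let c_prime = b1;       // the committed LogId eventually advances to C' >= B₁
    assert!(c_prime >= b1 && b1 > b && b > c);
    // Once C' >= B₁, the state machine is at least as fresh as the ReadIndex requires.
}
```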
Can you give me a counterexample if I missed anything?
Right, although I would argue that the fact that we allow the blank log to be reverted without invalidating the read index suggests that we do not actually care about the order of events in this case; we only care about the order of log entries.
Just came back here to say this: once I was away from my computer, I realized that what I said before is wrong. Apologies for the confusion :)
Great!
I appreciate the confirmation on this approach!
I'm gonna merge it!
openraft/src/docs/protocol/read.md
line 60 at r1 (raw file):
Previously, tvsfx wrote…
> This is not true if `read_log_id` might still be reverted; see below.
Done.
> This approach does not rely on bounded clock drift for its correctness.
Yes, that's the deficiency of a leader lease. OTOH, I don't think we have any practical problem with it, since clock drift will likely be <<1% on all modern systems, most likely always under RTT.
> It is easier to implement than leader leases.
Implementing leader leases is actually quite straightforward, as I described above (see the sketch after this list):
- The leader sends its lease time as a delta to the communication time (NOT the target wall clock time!) with each communication to its followers, or the lease time is configured per cluster.
- Each follower updates the earliest timestamp at which it may start an election and/or accept a vote request from someone other than the current leader (probably including some safety margin for clock drift) upon receiving a message from the leader.
- The leader then has a lease from the communication time to the communication time + lease time.
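A sketch of the leader-side bookkeeping implied by these steps; the names are assumptions, not openraft APIs:

```rust
use std::time::{Duration, Instant};

// Sketch only: the lease is anchored at the send time of a message that a
// quorum later acknowledged, never at a target wall-clock time.
struct LeaderLease {
    lease: Duration,
    quorum_acked_send_time: Option<Instant>,
}

impl LeaderLease {
    /// Record that a quorum acknowledged a message originally sent at `sent_at`.
    fn on_quorum_ack(&mut self, sent_at: Instant) {
        if self.quorum_acked_send_time.map_or(true, |t| sent_at > t) {
            self.quorum_acked_send_time = Some(sent_at);
        }
    }

    /// Lease-based reads may be served locally until this deadline (if any).
    fn lease_deadline(&self) -> Option<Instant> {
        self.quorum_acked_send_time.map(|t| t + self.lease)
    }
}
```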
> It can be used as a stepping stone for other types of reads: trading off leader clock cycles for network I/O by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately.
As I wrote, I don't have any problem with this approach (I gave my approval), exactly because of the reasons you are citing here - possible leader offload in the future (though that was more my gut feeling than something provable :-).
Most likely, we should simply combine the two, since the leader lease is practically zero-cost (if configured at the cluster level, equally for each replica).
@drmingdrmer Not sure what you meant with the link to the replication handler. That doesn't prevent the follower from starting an election and becoming a leader before the lease expires, right?
@schreter I agree that it shouldn't matter in most cases. However, our use case is one in which we do not have access to a trustworthy time source, so this implementation is very useful to us.
Changelog
Feature: add `Raft::ensure_linearizable()` to ensure linearizable read

The `Raft::is_leader()` method does not fully ensure linearizable read operations and is deprecated in this version. Instead, applications should use the `Raft::ensure_linearizable()` method to guarantee linearizability.

Under the hood, `Raft::ensure_linearizable()` obtains a `ReadIndex` from `RaftCore` if it remains the leader, and blocks until the state machine applies up to the `ReadIndex`. This process ensures that the application observes all state visible to a preceding read operation.

ReadIndex #965

Upgrade tip: Replace `Raft::is_leader()` with `Raft::ensure_linearizable()`.

This PR implements the very basic requirement proposed in: