
Feature: add Raft::ensure_linearizable() to ensure linearizable read #964

Merged: 1 commit into databendlabs:main on Dec 11, 2023

Conversation

@drmingdrmer (Member) commented on Dec 8, 2023

Changelog

Feature: add `Raft::ensure_linearizable()` to ensure linearizable read

The `Raft::is_leader()` method does not fully ensure linearizable read
operations and is deprecated in this version. Instead, applications
should use the `Raft::ensure_linearizable()` method to guarantee
linearizability.

Under the hood, `Raft::ensure_linearizable()` obtains a `ReadIndex` from
`RaftCore` if it remains the leader, and blocks until the state
machine applies up to the `ReadIndex`. This process ensures that the
application observes all state visible to a preceding read operation.

Upgrade tip:

Replace `Raft::is_leader()` with `Raft::ensure_linearizable()`.
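As an illustration, a minimal migration sketch; the single-generic `Raft<C>` signature is simplified and `read_kv` is a hypothetical application accessor, neither is defined by this PR:

```rust
use openraft::{Raft, RaftTypeConfig};

/// Hypothetical application handler running a linearizable read on the leader.
/// Error handling is elided; `read_kv` stands in for the app's state-machine accessor.
async fn linearizable_read<C, F>(raft: &Raft<C>, read_kv: F) -> Option<String>
where
    C: RaftTypeConfig,
    F: Fn() -> Option<String>,
{
    // Previously: `raft.is_leader().await` checked leadership only and is now deprecated.
    // `ensure_linearizable()` additionally blocks until the state machine has applied
    // up to the ReadIndex obtained from RaftCore.
    raft.ensure_linearizable()
        .await
        .expect("this node is not the leader");

    // Safe to serve the read: all state visible to preceding operations is applied.
    read_kv()
}
```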

This PR implements the very basic requirement proposed in:



- Fix: databendlabs#965
@schreter (Collaborator) left a comment

I don't see a fundamental problem with the approach. However, it's not useful for high-performance servers, where (linearizable) reads are frequent.

A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set the leader lease time to X, then each follower would set its election timeout to X plus a random value and refuse to vote for a new leader when the vote request comes within a period of X from the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it remains the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).
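To make the follower-side rule concrete, a minimal sketch using only std types (the struct and field names are illustrative, not openraft API):

```rust
use std::time::{Duration, Instant};

/// Illustrative follower state for the lease rule described above.
struct FollowerState {
    /// Arrival time of the last keep-alive from the current leader.
    last_leader_msg: Instant,
    /// Lease time X granted to the leader, e.g. 1s with 300ms keep-alives.
    lease: Duration,
}

impl FollowerState {
    /// Refuse to vote for anyone else while the leader's lease may still hold;
    /// the election timeout itself would be X plus a random value.
    fn may_grant_vote(&self, now: Instant) -> bool {
        now.duration_since(self.last_leader_msg) >= self.lease
    }
}
```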


@drmingdrmer (Member, Author) commented:

@tvsfx Please have a look at this approach :)

@drmingdrmer (Member, Author) left a comment

You are right. This PR is meant to stabilize an API for linearizable reads based on the current implementation.

Lease-based reads can be done with `clock_progress`, which tracks the wall-clock time that has been granted by a quorum to the leader.

https://github.com/datafuselabs/openraft/blob/14d621ebeea41e8ce20f4f5df82929dd70279533/openraft/src/engine/handler/replication_handler/mod.rs#L172
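The underlying idea, as a standalone sketch (a simple-majority quorum and integer timestamps are assumed for brevity; this is not the code at the link):

```rust
/// For each voter, take the send time of the latest request that voter has
/// acknowledged. The lease start is the greatest time `t` such that a quorum
/// has acknowledged a request sent at or after `t`.
fn quorum_granted_time(mut acked_send_times: Vec<u64>) -> Option<u64> {
    if acked_send_times.is_empty() {
        return None;
    }
    // Sort descending; the value at the quorum boundary is covered by a quorum.
    acked_send_times.sort_unstable_by(|a, b| b.cmp(a));
    let quorum = acked_send_times.len() / 2 + 1;
    acked_send_times.get(quorum - 1).copied()
}

fn main() {
    // Three voters: the leader itself plus two followers.
    // A quorum (leader + one follower) has acked requests sent at t >= 8.
    assert_eq!(quorum_granted_time(vec![10, 8, 3]), Some(8));
}
```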


@tvsfx (Contributor) left a comment

> A much better option would be a leader lease, where the leader leases time for itself, preventing another node from electing itself leader before this time expires. This can easily be piggybacked on regular keep-alive traffic. Effectively, if you set the leader lease time to X, then each follower would set its election timeout to X plus a random value and refuse to vote for a new leader when the vote request comes within a period of X from the last keep-alive from the leader. This way, if the leader could build a quorum, it knows it remains the leader for at least time X. No communication and no waiting are needed, especially if X is reasonably larger than the keep-alive timeout (e.g., 1s when keep-alives are sent every 300ms).

@schreter: you are right that leader-based leases achieve better performance in most cases. A couple of notes, though:

  1. This approach does not rely on bounded clock drift for its correctness.
  2. It is easier to implement than leader leases.
  3. It can be used as a stepping stone for other types of reads: trading leader clock cycles for network IO by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately (sketched below).
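A sketch of the follower-serving variant from note 3 (hypothetical traits, none of this is openraft API; async fns in traits need Rust 1.75+):

```rust
/// Hypothetical interfaces for a follower-served read.
trait ReadIndexSource {
    /// One round-trip to the leader, which still confirms its leadership.
    async fn read_index(&self) -> u64;
}

trait LocalStateMachine {
    /// Wait until this node has applied the log up to `index`.
    async fn wait_applied(&self, index: u64);
    fn get(&self, key: &str) -> Option<String>;
}

/// The apply-wait and the read itself move off the leader; skipping the
/// leader round-trip entirely would downgrade linearizability to
/// sequential consistency, as note 3 describes.
async fn follower_read(
    leader: &impl ReadIndexSource,
    follower: &impl LocalStateMachine,
    key: &str,
) -> Option<String> {
    let read_index = leader.read_index().await;
    follower.wait_applied(read_index).await;
    follower.get(key)
}
```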



openraft/src/docs/protocol/read.md line 60 at r1 (raw file):

> When the `last_applied_log_id` meets or exceeds `read_log_id`,
> the state machine contains all state up to `A`. Therefore, a linearizable read is assured
> when `last_applied_log_id >= read_log_id`.

This is not true if `read_log_id` might still be reverted; see below.


openraft/src/raft/mod.rs line 450 at r1 (raw file):

        &self,
    ) -> Result<
        (Option<LogId<C::NodeId>>, Option<LogId<C::NodeId>>),

Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use.

Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`. Just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?


openraft/src/raft_state/mod.rs line 244 at r1 (raw file):

        let committed = self.committed();

        std::cmp::max(leader_first, committed)

I like the idea of taking the max, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check of whether an entry has been committed this term, because you pick at least `leader_first`.

However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index` and not a `LogId` (see the previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and overwritten by a different entry at the same index.

I think the easiest fix might be to just check whether the leader has already committed an entry and reject the request if that is not the case: the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.

@drmingdrmer (Member, Author) left a comment


openraft/src/raft/mod.rs line 450 at r1 (raw file):

Previously, tvsfx wrote…

> Is there ever any benefit to returning a pair of `LogId`s over a pair of indexes here? If you do not always use a committed entry for the `ReadIndex` (see next comment), I agree that you need to use the `LogId` and not the log index (in this case, you cannot use `RaftMetrics` in `ensure_linearizable`, because they don't include `LogId`s for all entries that were applied since the last check). However, if you do stick to this principle, I wonder if there are any cases where having this extra info compared to a raw index is of use.
>
> Zooming out a bit: to uniquely represent a committed entry, you do not need a whole `LogId`. Just like how you have a `CommittedLeaderId`, the index I wanted to return here is essentially a `CommittedLogId`, which consists of an index only. Similarly, it feels like e.g. `save_committed` and `read_committed` in `v2.rs` could just store a `CommittedLogId` (i.e. an index into the log) instead of a whole `LogId`?

I have a preference for `LogId` over `LogIndex`. Given the choice, I would consistently opt for `LogId` and steer clear of `LogIndex`.

The order of indices doesn't necessarily represent the order of events. Although in Raft the order of indices is used to determine the order of events, each time I employ an index I am compelled to re-evaluate the scenario to affirm its accuracy, which is quite tiresome.

`LogId`, on the other hand, invariably represents the order of events.

P.S.
A more profound rationale is that within Openraft, the vote isn't required to implement the full `Ord` trait; it is sufficient for it to implement `PartialOrd` to function properly. Moreover, `LogId.committed_leader_id` doesn't need to fulfill the `Ord` trait either; implementing `PartialOrd` would be adequate in a consensus protocol that permits the existence of multiple leaders. These are not goals of Openraft, but to me it's clearer to think this way.


openraft/src/raft_state/mod.rs line 244 at r1 (raw file):

Previously, tvsfx wrote…

> I like the idea of taking the max, because it ensures that we don't miss any entries from prior terms of which the leader does not know they're committed, while at the same time avoiding the check of whether an entry has been committed this term, because you pick at least `leader_first`.
>
> However, as we've briefly discussed before, I don't think it is correct to pick the maximum here if you want to return a simple `log_index` and not a `LogId` (see the previous comment for the advantages of this approach). The reason it is not correct is that the leader's blank entry might still be reverted at some point, since it is not committed, and overwritten by a different entry at the same index.
>
> I think the easiest fix might be to just check whether the leader has already committed an entry and reject the request if that is not the case: the client can just retry if desired, and this seems easier than monitoring the logs to see whether a specific `LogId` has been applied. Alternatively, either `RaftMetrics` would have to be extended with `LogId`s (making it much more heavyweight), or there should be a better `ReadIndex` callback mechanism (like the queue you mentioned) that discards stale read indexes in case of log reverts.

The blank log may be reverted, but I believe the invariant still holds:

  • With the blank log's `LogId` as `B` and the last committed `LogId` as `C`, where `B > C`;
  • If `B` is reverted, a subsequent blank log `B₁` will eventually be committed, where `B₁ > B`;
  • Consequently, when the `LogId` of the last committed log updates to `B₁`, fulfilling the condition `C' >= B₁ > B > C`, the state machine will be up to date and ready for reading.
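The ordering can be sanity-checked with a toy `LogId` (made-up values; openraft's real `LogId` carries a `CommittedLeaderId` rather than a bare term):

```rust
/// Toy stand-in for openraft's LogId; derived Ord compares
/// (leader_term, index) lexicographically, like the real type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct LogId {
    leader_term: u64,
    index: u64,
}

fn main() {
    let c = LogId { leader_term: 2, index: 9 };  // last committed LogId C
    let b = LogId { leader_term: 3, index: 10 }; // leader's blank log B, B > C
    // B is reverted: a later leader overwrites index 10, so B1 > B holds
    // even though the index is unchanged.
    let b1 = LogId { leader_term: 4, index: 10 };
    // Once the last committed LogId reaches C' >= B1, the wait condition
    // `last_applied >= read_log_id` (with read_log_id = B) is satisfied.
    let c_prime = b1;
    assert!(c_prime >= b1 && b1 > b && b > c);
}
```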

Can you give me a counterexample if I missed anything?

@tvsfx (Contributor) commented on Dec 11, 2023

> `LogId`, on the other hand, invariably represents the order of events.

Right, although I would argue that the fact that we allow the blank log to be reverted without invalidating the read index suggests that we do not actually care about the order of events in this case; we only care about the order of log entries.

> The blank log may be reverted, but I believe the invariant still holds:

Just came back here to say this; once I was away from my computer, I realized that what I said before is wrong. Apologies for the confusion :)
It is fine if the blank log entry gets reverted; a sufficient condition for us to serve the read in a linearizable way is that all committed entries of which we currently don't know that they're committed are applied as well. Only entries from previous terms could have been committed unbeknownst to us. All we need to ensure, then, is that the read index is high enough, which you did by taking the maximum. So I think this is just a nice optimization over the classical Raft approach!

@drmingdrmer (Member, Author) left a comment

Great!
I appreciate the confirmation on this approach!

I'm gonna merge it!



openraft/src/docs/protocol/read.md line 60 at r1 (raw file):

Previously, tvsfx wrote…

> This is not true if `read_log_id` might still be reverted; see below.

Done.

@drmingdrmer merged commit 212a19c into databendlabs:main on Dec 11, 2023 (26 of 27 checks passed).
@drmingdrmer deleted the 53-is-leader branch on December 11, 2023 at 07:52.
@schreter (Collaborator) left a comment

@tvsfx

> This approach does not rely on bounded clock drift for its correctness.

Yes, that's the deficiency of the leader lease. OTOH, I don't think we have any practical problem with it, since clock drift will likely be <<1% on all modern systems, most likely always under the RTT.

> It is easier to implement than leader leases.

Implementing leader leases is actually quite straightforward, as I described above (a sketch follows the list):

  • The leader sends its lease time as a delta to the communication time (NOT the target wall-clock time!) with each communication to its followers, or the lease time is configured per cluster.
  • Each follower updates the earliest timestamp at which it may start an election and/or accept a vote request from someone other than the current leader (probably including some safety margin for clock drift) upon receiving a message from the leader.
  • The leader holds a lease from the communication time to the communication time plus the lease time.
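As a sketch of the piggybacking (types and field names are illustrative, not openraft's):

```rust
use std::time::{Duration, Instant};

/// Extra field piggybacked on an existing replication/keep-alive message.
/// Sent as a delta, not a target wall-clock time, so only clock rates matter.
struct LeaseGrant {
    lease: Duration,
}

struct Follower {
    /// Earliest instant at which this node may start an election or grant
    /// a vote to anyone other than the current leader.
    election_allowed_at: Instant,
}

impl Follower {
    fn on_leader_message(&mut self, grant: &LeaseGrant, drift_margin: Duration) {
        // Stamp with the local receive time and extend by the lease plus a
        // safety margin for clock drift.
        self.election_allowed_at = Instant::now() + grant.lease + drift_margin;
    }
}
```

On the leader side, the lease then runs from a message's send time to that send time plus the lease, once a quorum has acknowledged it (cf. the quorum-granted-time sketch above).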

> It can be used as a stepping stone for other types of reads: trading leader clock cycles for network IO by forwarding read indexes to followers, or offloading the leader even further by sacrificing linearizability for sequential consistency and serving a variant of read-index-based reads from followers immediately.

As I wrote, I don't have any problem with this approach (I gave my approval), exactly because of the reasons you are citing here: possible leader offload in the future (though that was more my gut feeling than something provable :-).

Most likely, we should simply combine the two, since the leader lease is practically zero-cost (if configured at the cluster level for each replica equally).

@drmingdrmer Not sure what you meant by the link to the replication handler. That doesn't prevent the follower from starting an election and becoming a leader before the lease expires, right?


@drmingdrmer (Member, Author) commented:

> @drmingdrmer Not sure what you meant by the link to the replication handler. That doesn't prevent the follower from starting an election and becoming a leader before the lease expires, right?

@schreter
This is not about the follower. This is on the leader side:

`clock_progress` records the earliest timestamp `t` from which a heartbeat has been granted by a quorum, ensuring that no new leader will be elected until `t + lease`.

https://github.com/datafuselabs/openraft/blob/14d621ebeea41e8ce20f4f5df82929dd70279533/openraft/src/engine/handler/replication_handler/mod.rs#L170-L174

@tvsfx (Contributor) commented on Dec 18, 2023

> Yes, that's the deficiency of the leader lease. OTOH, I don't think we have any practical problem with it, since clock drift will likely be <<1% on all modern systems, most likely always under the RTT.

I agree that it shouldn't matter in most cases. However, our use case is one in which we do not have access to a trustworthy time source, so this implementation is very useful to us.

@drmingdrmer mentioned this pull request on Mar 9, 2024.
Successfully merging this pull request may close these issues.

Linearizable read with ReadIndex