
Feature request: optimization for read-only requests #262

Open
MrCroxx opened this issue Apr 4, 2022 · 17 comments

@MrCroxx
Contributor

MrCroxx commented Apr 4, 2022

Section 6.4 of the Raft thesis proposes a way to optimize read-only requests with a read index, which etcd/raft and tikv/raft-rs already support (they additionally support a read-only optimization called lease read).

I haven't seen a plan for it in the roadmap. Will the optimization be supported?

The basic procedure to handle a read index with openraft could be (a rough sketch of steps 5 and 7 follows the list):

  1. A client sends a read-only request to the leader or a follower.
  2. (optional) If a follower receives a read-only request, the follower sends a ReadIndex request to the current leader (if there is one). (To support follower reads.)
  3. If the leader receives a read-only request, the leader sends a ReadIndex request to itself.
  4. (optional) If a newly elected leader receives a ReadIndex request before it has committed an empty entry, the leader saves the ReadIndex request to pending_read_indices. (To make sure the leader has the largest committed index of the group.)
  5. When the leader receives a ReadIndex request, or a newly elected leader successfully commits an empty entry, the leader uses the current committed index as the ReadIndex.
  6. (optional) If the ReadIndex request comes from a follower, the leader responds to the follower with the ReadIndex.
  7. When applying entries to the state machine, once the applied index reaches a ReadIndex, the related read-only request can be served.
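
A rough sketch of steps 5 and 7 on the leader side, assuming hypothetical types and names (PendingReads, request ids, etc.); this is not openraft's API, just an illustration of parking reads until the applied index catches up:

use std::collections::BTreeMap;

/// Read-only requests waiting for the state machine to catch up to their read index.
struct PendingReads {
    /// read_index -> ids of requests that may be served once applied_index >= read_index
    by_index: BTreeMap<u64, Vec<u64>>,
}

impl PendingReads {
    fn new() -> Self {
        Self { by_index: BTreeMap::new() }
    }

    /// Step 5: the leader records its current committed index as the read index.
    fn register(&mut self, request_id: u64, committed_index: u64) {
        self.by_index.entry(committed_index).or_default().push(request_id);
    }

    /// Step 7: after applying up to `applied_index`, return the requests that can now be served.
    fn ready(&mut self, applied_index: u64) -> Vec<u64> {
        // Entries with read_index > applied_index keep waiting.
        let still_waiting = self.by_index.split_off(&(applied_index + 1));
        let served = std::mem::replace(&mut self.by_index, still_waiting);
        served.into_values().flatten().collect()
    }
}

fn main() {
    let mut pending = PendingReads::new();
    pending.register(1, 10); // request 1 waits for applied_index >= 10
    pending.register(2, 12); // request 2 waits for applied_index >= 12
    assert_eq!(pending.ready(11), vec![1]);
    assert_eq!(pending.ready(12), vec![2]);
}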
@github-actions

github-actions bot commented Apr 4, 2022

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@MrCroxx
Contributor Author

MrCroxx commented Apr 4, 2022

I'd like to contribute to the optimization, but I'm busy with my master's degree project, my thesis, and my internship.

If time is not a big issue, I'd like to take on the challenge.

@drmingdrmer
Member

Currently, the extensible and decoupled overall structure of the framework has the highest priority, to make some performance-related refactoring possible, as suggested by @schreter.

Everything that is well defined in the Raft thesis has a relatively low priority
because there won't be surprises in implementing it.

Unless somebody needs it. :)

@schreter
Collaborator

schreter commented Apr 4, 2022

Well, I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably. OTOH, if I understand it correctly, there is an additional cost in the takeover path, where the lease time must be awaited. Picking too short a lease time leads to frequent reassertions; picking too long a lease time leads to delayed elections.

For our project, we'll likely use an independent external sequencer keeping "transaction time", which, as a bonus, will allow us to safely read from replicas as well. But that's nowhere near as general as read leases :-).

In other words, for me, time is not a big issue, so feel free to experiment with it at your own pace.

@drmingdrmer
Member

Well, I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably. OTOH, if I understand it correctly, there is an additional cost in the takeover path, where the lease time must be awaited. Picking too short a lease time leads to frequent reassertions; picking too long a lease time leads to delayed elections.

For our project, we'll likely use an independent external sequencer keeping "transaction time", which, as a bonus, will allow us to safely read from replicas as well. But that's nowhere near as general as read leases :-).

In other words, for me, time is not a big issue, so feel free to experiment with it at your own pace.

Agree.

Raft defines its own time with term and log index.
Although people try hard to synchronize this raft-defined-time with the wall-clock time,
the best way is to forget the wall-clock time.

@schreter
Collaborator

schreter commented Apr 4, 2022

Raft defines its own time with term and log index.

Hehe, I didn't intend to make a pun on "time" :-). The "transaction time" we use is a kind of MVCC, so as long as the follower replica knows that a certain time point from the sequencer was replicated, any request with an older "transaction time" than this time point can be answered directly. But again, that's relevant for our specific use case; it's not a general solution.

Although people try hard to synchronize this raft-defined-time with the wall-clock time, the best way is to forget the wall-clock time.

+1. We also don't use wall-clock time.

Of course, the wall-clock time is still used in Raft for timeouts... To get rid of that, we need something like this: https://web.stanford.edu/class/ee380/Abstracts/161116-slides.pdf.

@drmingdrmer
Member

Of course, the wall-clock time is still used in Raft for timeouts... To get rid of that, we need something like this: https://web.stanford.edu/class/ee380/Abstracts/161116-slides.pdf.

Maybe the election timeout can be replaced with some external event source that triggers re-election.
E.g., let a user install a metrics watcher to monitor the leader messages received by a raft node, whose duty is to tell the raft node to start an election if it does not receive a leader message within 10 seconds.

This way the raft core becomes purely event-driven.
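
A minimal sketch of what such an application-side watcher could look like; the election trigger it would drive (e.g. a `trigger_election()` call) is hypothetical and not an existing openraft API:

use std::time::{Duration, Instant};

/// Application-side watcher: the raft core stays event-driven, and the
/// application decides when "too long without a leader message" has passed.
struct ElectionWatcher {
    last_leader_msg: Instant,
    timeout: Duration, // e.g. 10 seconds
}

impl ElectionWatcher {
    fn new(timeout: Duration) -> Self {
        Self { last_leader_msg: Instant::now(), timeout }
    }

    /// Call whenever the metrics show a message from the current leader.
    fn on_leader_message(&mut self) {
        self.last_leader_msg = Instant::now();
    }

    /// Poll from the application; when this returns true, tell the raft node
    /// to start an election (via a hypothetical `raft.trigger_election()`).
    fn should_elect(&self) -> bool {
        self.last_leader_msg.elapsed() > self.timeout
    }
}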

@schreter
Collaborator

schreter commented Apr 4, 2022

Maybe the election timeout can be replaced with some external event source that triggers re-election.

That's indeed a good idea and it would help in our project too (we have a "liveness" indication between a pair of nodes, which can be used for any communication between this pair of nodes in multiple consensus domains which happen to have replicas on this pair). It would also help remove the non-determinism from tests.

But it will require a fairly large refactoring of various tests, I suppose.

@lichuang
Contributor

+1. We also don't use wall-clock time.

In the etcd raft implementation, the time event is triggered by the user via a tick() call in the interface, and the heartbeat and election intervals are defined in tick units rather than wall-clock time. See:

type Config struct {
	// ElectionTick is the number of Node.Tick invocations that must pass between
	// elections. That is, if a follower does not receive any message from the
	// leader of current term before ElectionTick has elapsed, it will become
	// candidate and start an election. ElectionTick must be greater than
	// HeartbeatTick. We suggest ElectionTick = 10 * HeartbeatTick to avoid
	// unnecessary leader switching.
	ElectionTick int

	// HeartbeatTick is the number of Node.Tick invocations that must pass between
	// heartbeats. That is, a leader sends heartbeat messages to maintain its
	// leadership every HeartbeatTick ticks.
	HeartbeatTick int
}

etcd raft time events such as election/heartbeat are fully triggered by the user. Maybe we can optimize openraft in this way; let me think about it.

@lichuang
Contributor


Hmm, after reading the source, I think refactoring the time-trigger policy the etcd way may be difficult in openraft.

In etcd, the raft core and the application communicate with message arrays (sent messages, committed messages, etc.), so the application level can send messages in its own way. This decouples the network from the raft core, which is why etcd can be driven by ticks from the application.

But openraft has the RaftNetwork interface, and the raft core can directly send messages using this interface.

@drmingdrmer
Member

In etcd, the raft core and the application communicate with message arrays (sent messages, committed messages, etc.), so the application level can send messages in its own way. This decouples the network from the raft core, which is why etcd can be driven by ticks from the application.

I'm trying to extract the algorithm part out of the raft core,
leave everything else, such as RaftStorage, RaftNetwork, the timer, and the message channels, to a runtime,
and make the timer a customizable component.

@avantgardnerio

I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably

@drmingdrmer @schreter if I understand this correctly, that would mean there would be a leader that knows it's the leader for a given time interval? Could this be used to assign unique monotonic timestamps from the leader without appending them to the raft log?

I ask because we don't (yet) have this:

For our project, we'll likely use an independent external sequencer keeping "transaction time"

and the options as I see them are:

  1. create a proprietary timestamp oracle separate from raft (finicky, essentially requiring re-implementing something like raft for failover, split brain, etc)
  2. do what we do now and get unique timestamps from raft, but through the write path including log persistence to disk (slow)
  3. switch to tikv's raft (lose tokio/async goodness, and also have to implement the timestamp feature? I don't see that they have it yet, but with leader leases it should be doable)
  4. switch to logical clocks or HLCs, but then we have to deal with timestamps being "pushed" into the future, read-refreshes, etc

Unless somebody needs it. :)

I'm willing to be corrected on any of these points, but I think my company "needs" it.

@avantgardnerio

To be overly clear, here's pseudocode for what I'm proposing:

use std::sync::atomic::{AtomicU64, Ordering};

struct Raft {
   timestamp_gen: AtomicU64,
   // ...
}

impl Raft {
   pub fn get_unique_ts(&self) -> Result<(Term, u64), Error> {
      if !self.is_leader {
         return Err(Error::FwdToLeader);
      }
      if !self.has_lease {
         return Err(Error::WaitForLease);
      }
      // Bump the local atomic counter; no log append involved.
      Ok((self.term, self.timestamp_gen.fetch_add(1, Ordering::SeqCst) + 1))
   }
}

@drmingdrmer
Member

I was planning to suggest implementing lease read myself :-). This indeed can speed up the reads considerably

@drmingdrmer @schreter if I understand this correctly, that would mean there would be a leader that knows it's the leader for a given time interval?

You are correct! Openraft has a mechanism called leader lease. This mechanism ensures that a follower does not elect itself as a leader before a certain period of time, which is determined by timer_config.leader_lease + timer_config.election_timeout. The follower refreshes the lease by updating the last-updated-time of the vote. However, the current issue is that the leader does not update the lease yet.

To support reading consistent state, openraft needs to:

  • extend the leader's lease when a log is accepted by a quorum,
  • and add a public API such as Raft::read(sm: &mut impl RaftStateMachine) that atomically checks leader lease and reads some data from the state machine.

let current_vote = self.engine.state.vote_ref();
let utime = self.engine.state.vote_last_modified();
let timer_config = &self.engine.config.timer_config;
let mut election_timeout = if current_vote.is_committed() {
    timer_config.leader_lease + timer_config.election_timeout
} else {
    timer_config.election_timeout
};
if self.engine.is_there_greater_log() {
    election_timeout += timer_config.smaller_log_timeout;
}

self.state.vote.update(*self.timer.now(), *vote);

Could this be used to assign unique monotonic timestamps from the leader without appending them to the raft log?

Yes, internally a timestamp is updated for every tick:

let now = Instant::now();
// TODO: store server start time and use relative time
self.engine.timer.update_now(now);
tracing::debug!("received tick: {}, now: {:?}", i, now);

I ask because we don't (yet) have this:

For our project, we'll likely use an independent external sequencer keeping "transaction time"

and the options as I see them are:

  1. create a proprietary timestamp oracle separate from raft (finicky, essentially requiring re-implementing something like raft for failover, split brain, etc)
  2. do what we do now and get unique timestamps from raft, but through the write path including log persistence to disk (slow)
  3. switch to tikv's raft (lose tokio/async goodness, and also have to implement the timestamp feature? I don't see that they have it yet, but with leader leases it should be doable)
  4. switch to logical clocks or HLCs, but then we have to deal with timestamps being "pushed" into the future, read-refreshes, etc

One potential solution is to use the raft-log-id as a pseudo time. This solution involves:

  • Returning the log id i to the writer W when a log is applied.
  • A subsequent reader R (whether or not it is the same as W) forwards i to an Openraft node for reading; Openraft will not process this read request until it has applied i.

Since the committed log id ((term, index)) in raft is monotonic, it can be used as a measure of time. However, this solution requires the application to track the log id across the cluster.
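
A small client-side sketch of that tracking, with hypothetical application types (this is not openraft's API):

/// (term, index) of a raft log entry; committed log ids are monotonic,
/// so they can serve as a pseudo time.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogId {
    term: u64,
    index: u64,
}

/// The state the application must carry across the cluster:
/// the largest log id this client has observed.
struct ClientSession {
    last_seen: Option<LogId>,
}

impl ClientSession {
    /// After a write, remember the log id `i` returned when the log was applied.
    fn on_write_applied(&mut self, i: LogId) {
        if Some(i) > self.last_seen {
            self.last_seen = Some(i);
        }
    }

    /// Attach this to a subsequent read; the serving node must not answer
    /// before its applied log id is >= this value.
    fn read_barrier(&self) -> Option<LogId> {
        self.last_seen
    }
}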

Unless somebody needs it. :)

I'm willing to be corrected on any of these points, but I think my company "needs" it.

Roger it.

@avantgardnerio

@drmingdrmer , thank you for your quick response.

One potential solution is to use the raft-log-id as a pseudo time

I think this is what we are doing now, but by using writes to get a unique timestamp. I've seen in documentation that OpenRaft has been tested at 30k messages per second, but for our application we would need this volume (and possibly more) just for assigning timestamps, much less doing any "real" raft operations.

If everyone agrees who the leader is, I was hoping it could rapidly assign timestamps simply by bumping an atomic int and serving them as rapidly as possible.

and add a public API such as Raft::read(sm: &mut impl RaftStateMachine) that atomically checks leader lease and reads some data from the state machine.

So similar to the above, but instead of reading data it would increment-and-get.

Even typing this out, I realize it is hitting the limits of what is possible (Hyper says it can serve about 80k requests per second), so perhaps logical clocks are our only option, but I wanted to see if you thought the solution I proposed above was possible.

@drmingdrmer
Member

I think this is what we are doing now, but by using writes to get a unique timestamp. I've seen in documentation that OpenRaft has been tested at 30k messages per second, but for our application we would need this volume (and possibly more) just for assigning timestamps, much less doing any "real" raft operations.

Without network and storage overhead, it is about 1 million rps with 256 clients. But for a real-world application the TPS would be much lower.

Even typing this out, I realize it is hitting the limits of what is possible (Hyper says it can serve about 80k requests per second), so perhaps logical clocks are our only option, but I wanted to see if you thought the solution I proposed above was possible.

If you were building a timestamp-assigning service, it could be simpler:

    1. The raft leader generates monotonically increasing timestamps with a local AtomicU64, without any replication.
    2. A new leader guarantees it will never assign a timestamp lower than any assigned by a former leader (a leader with a lesser term).

The second can be done by binding timestamps to logs (sketched below):

  • let the leader commit a log every second,
  • and limit the leader to generating at most 1000 timestamps after every committed log. The generated pseudo timestamp would be of the form last_committed_log_index * 1000 + atomic_u64.

Because a new leader will propose and commit a blank log, the new leader will always generate greater timestamps.

And such a timestamp can easily be mapped to clock time by embedding a clock time in each raft log entry.
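
A minimal sketch of that scheme, with hypothetical names (not openraft's API); the once-per-second log and any concurrency control are left to the caller:

/// Pseudo-timestamp generator on the leader: at most 1000 timestamps per
/// committed log entry, so a new leader (which commits a blank log first and
/// therefore sees a higher index) always starts above any former leader.
struct TsGen {
    last_committed_log_index: u64,
    counter: u64, // reset on every committed log
}

impl TsGen {
    /// Call when the leader commits a log entry (e.g. the once-per-second log).
    fn on_commit(&mut self, log_index: u64) {
        self.last_committed_log_index = log_index;
        self.counter = 0;
    }

    /// Hand out `last_committed_log_index * 1000 + counter`; `None` means the
    /// per-log budget is exhausted and the caller must wait for the next commit.
    fn next(&mut self) -> Option<u64> {
        if self.counter >= 1000 {
            return None;
        }
        let ts = self.last_committed_log_index * 1000 + self.counter;
        self.counter += 1;
        Some(ts)
    }
}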

@avantgardnerio
Copy link

@drmingdrmer thank you! Fantastic response. Nice to see it's already possible without changes.
