Feature request: optimization for read-only requests #262
Comments
I'd like to contribute to the optimization. But I'm busy with my master's degree project, my thesis, and my internship job. If time is not a big issue, I'd like to pick up the challenge. |
Currently, the extensible and decoupled overall structure of the framework has the highest priority, to make some performance-related refactoring possible, as suggested by @schreter. Everything that is well defined in the Raft thesis has a relatively low priority unless somebody needs it. :) |
Well, I was planning to suggest implementing lease read myself :-). This can indeed speed up reads considerably. OTOH, if I understand it correctly, there is an additional cost in the takeover path, where the lease time must be awaited. Picking too short a lease time leads to frequent reassertions; picking too long a lease time leads to delayed elections. For our project, we'll likely use an independent external sequencer keeping "transaction time", which, as a bonus, will allow us to safely read from replicas as well. But that's nowhere near as general as read leases :-). In other words, for me, time is not a big issue, so feel free to experiment with it at your own pace. |
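To make the lease trade-off concrete, here is a minimal sketch in Rust. `LeaderLease` and all of its methods are hypothetical names for illustration, not openraft APIs, and the sketch assumes perfectly synchronized clocks (a real implementation has to budget for clock drift).

```rust
use std::time::{Duration, Instant};

/// Hypothetical leader-side lease tracker (illustration only, not an openraft type).
pub struct LeaderLease {
    /// When a quorum last acknowledged this leader.
    last_quorum_ack: Instant,
    /// How long followers promise not to start an election after an ack.
    lease: Duration,
}

impl LeaderLease {
    pub fn new(last_quorum_ack: Instant, lease: Duration) -> Self {
        Self { last_quorum_ack, lease }
    }

    /// Record a newer quorum acknowledgement, which extends the lease.
    pub fn renew(&mut self, acked_at: Instant) {
        if acked_at > self.last_quorum_ack {
            self.last_quorum_ack = acked_at;
        }
    }

    /// A local read needs no network round trip, but only while the lease holds.
    pub fn can_serve_local_read(&self, now: Instant) -> bool {
        now < self.last_quorum_ack + self.lease
    }

    /// The takeover cost: a would-be leader must wait out the remaining lease.
    pub fn takeover_delay(&self, now: Instant) -> Duration {
        (self.last_quorum_ack + self.lease).saturating_duration_since(now)
    }
}
```

A short `lease` forces frequent calls to `renew`; a long one increases the value returned by `takeover_delay` after the leader fails.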
Agree. Raft defines its own time with |
Hehe, I didn't intend to make a pun on "time" :-). The "transaction time" we use is a kind of MVCC, so as long as the follower replica knows that a certain time point from the sequencer was replicated, any request with an older "transaction time" than this time point can be answered directly. But again, that's relevant for our specific use case; it's not a general solution.
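A minimal sketch of that follower-side check, assuming the sequencer hands out `u64` logical timestamps. `FollowerReplica`, `ReadDecision`, and the method names are hypothetical and only illustrate the rule described above.

```rust
/// What a follower does with a read request (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ReadDecision {
    /// The data as of the request's transaction time is already local.
    ServeLocally,
    /// The replica has not caught up yet; wait or forward elsewhere.
    NotYetReplicated,
}

/// Hypothetical follower state for MVCC-style reads.
pub struct FollowerReplica {
    /// Highest sequencer time point known to be replicated to this follower.
    replicated_up_to: u64,
}

impl FollowerReplica {
    pub fn new() -> Self {
        Self { replicated_up_to: 0 }
    }

    /// Called when the follower learns that `time_point` has been replicated.
    /// The watermark only ever moves forward.
    pub fn advance(&mut self, time_point: u64) {
        self.replicated_up_to = self.replicated_up_to.max(time_point);
    }

    /// Requests strictly older than the replicated time point can be
    /// answered directly, as described in the comment above.
    pub fn decide(&self, txn_time: u64) -> ReadDecision {
        if txn_time < self.replicated_up_to {
            ReadDecision::ServeLocally
        } else {
            ReadDecision::NotYetReplicated
        }
    }
}
```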
+1. We also don't use wall-clock time. Of course, the wall-clock time is still used in Raft for timeouts... To get rid of that, we need something like this: https://web.stanford.edu/class/ee380/Abstracts/161116-slides.pdf. |
Maybe the election timeout can be replaced with some external event source that triggers re-election. This way the raft core becomes purely event-driven. |
That's indeed a good idea and it would help in our project too (we have a "liveness" indication between a pair of nodes, which can be used for any communication between this pair of nodes in multiple consensus domains which happen to have replicas on this pair). It would also help remove the non-determinism from tests. But it will require a fairly large refactoring of various tests, I suppose. |
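As a rough illustration of the event-driven idea, the sketch below replaces the wall-clock election timeout with externally supplied `Tick` events. `Core`, `Event`, and `Role` are hypothetical names and do not reflect how openraft is structured.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Follower,
    Candidate,
}

/// Events fed in from outside; the core never reads a clock itself.
#[derive(Debug, Clone, Copy)]
pub enum Event {
    /// An external source (timer, liveness detector, test harness) says
    /// one unit of time has passed.
    Tick,
    /// A valid message from the current leader arrived.
    LeaderContact,
}

/// Hypothetical purely event-driven core.
pub struct Core {
    role: Role,
    elapsed_ticks: u32,
    election_timeout_ticks: u32,
}

impl Core {
    pub fn new(election_timeout_ticks: u32) -> Self {
        Self { role: Role::Follower, elapsed_ticks: 0, election_timeout_ticks }
    }

    pub fn role(&self) -> Role {
        self.role
    }

    pub fn handle(&mut self, event: Event) {
        match event {
            Event::LeaderContact => {
                // Hearing from the leader resets the election countdown.
                self.elapsed_ticks = 0;
                self.role = Role::Follower;
            }
            Event::Tick => {
                self.elapsed_ticks = self.elapsed_ticks.saturating_add(1);
                if self.role == Role::Follower
                    && self.elapsed_ticks >= self.election_timeout_ticks
                {
                    // Too many ticks without leader contact: start an election.
                    self.role = Role::Candidate;
                    self.elapsed_ticks = 0;
                }
            }
        }
    }
}
```

Because every state change is caused by an explicit event, a test can replay a fixed event sequence and get the same result every time.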
In the etcd raft implementation, time events are triggered by the user, as a
In etcd raft, time events such as election and heartbeat timeouts are triggered entirely by the user. Maybe we could optimize openraft in this way; let me think about it. |
Hmm, after reading the source, I think refactoring the time-trigger policy the etcd way may be difficult in openraft. In etcd, the raft core and the application communicate with
But openraft has the RaftNetwork interface, and the raft core can send messages directly through it. |
I'm trying to extract the algorithm part out of raft core. |
@drmingdrmer @schreter if I understand this correctly, that would mean there would be a leader that knows it's the leader for a given time interval? Could this be used to assign unique monotonic timestamps from the leader without appending them to the raft log? I ask because we don't (yet) have this:
and the options as I see them are:
I'm willing to be corrected on any of these points, but I think my company "needs" it. |
To be overly clear, here's pseudocode for what I'm proposing:
|
You are correct! Openraft has a mechanism called leader lease. This mechanism ensures that a follower does not elect itself as a leader before a certain period of time, which is determined by adding the
To support reading consistent state, openraft needs to:
openraft/openraft/src/core/raft_core.rs Lines 1345 to 1357 in 5415420
Yes, internally a timestamp is updated for every tick: openraft/openraft/src/core/raft_core.rs Lines 1182 to 1185 in 5415420
One potential solution is to use the raft-log-id as a pseudo time. This solution involves:
Since the committed log id (
Roger it. |
@drmingdrmer , thank you for your quick response.
I think this is what we are doing now, but by using writes to get a unique timestamp. I've seen in documentation that OpenRaft has been tested at 30k messages per second, but for our application we would need this volume (and possibly more) just for assigning timestamps, much less doing any "real" raft operations. If everyone agrees who the leader is, I was hoping it could rapidly assign timestamps simply by bumping an atomic int and serving them as rapidly as possible.
So similar to the above, but instead of reading data it would increment-and-get. Even typing this out, I realize it is hitting the limits of what is possible (Hyper says it can serve about 80k requests per second), so perhaps logical clocks are our only option, but I wanted to see if you thought the solution I proposed above was possible. |
Without network and storage overhead, it is about 1 million rps with 256 clients. But for a real world application the TPS would be much less.
If you were building a timestamp assigning service, it could be simpler:
The second can be done by binding timestamps to logs:
Because a new leader will propose and commit a blank log, the new leader will always generate greater timestamps. And such a timestamp can be easily mapped to clock time, by embedding a clock-time in the raft-log. |
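One possible shape for such log-bound timestamps is sketched below. It assumes a timestamp is the pair (index of the leader's committed blank log, local counter); `Timestamp` and `TimestampOracle` are hypothetical names, not openraft types, and the sketch leaves out how a deposed leader learns to stop issuing timestamps.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Ordered first by the log index of the issuing leader's blank log,
/// then by a per-leader counter (derived `Ord` compares fields in order).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Timestamp {
    pub log_index: u64,
    pub seq: u64,
}

/// Hypothetical timestamp source owned by the current leader.
pub struct TimestampOracle {
    log_index: u64,
    next_seq: AtomicU64,
}

impl TimestampOracle {
    /// Create this only after the new leader's blank log has been committed;
    /// `committed_blank_log_index` is that log's index.
    pub fn new(committed_blank_log_index: u64) -> Self {
        Self { log_index: committed_blank_log_index, next_seq: AtomicU64::new(0) }
    }

    /// Issues a unique timestamp without appending anything to the raft log.
    pub fn next(&self) -> Timestamp {
        // `fetch_add` returns the previous value, so every caller gets a distinct `seq`.
        let seq = self.next_seq.fetch_add(1, Ordering::Relaxed);
        Timestamp { log_index: self.log_index, seq }
    }
}
```

Since a new leader's blank log has a higher index than any log of the previous leader, its first timestamp compares greater than every timestamp the previous leader issued, even though its counter restarts at zero.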
@drmingdrmer thank you! Fantastic response. Nice to see it's already possible without changes. |
Section 6.4 of the Raft thesis proposes a way to optimize read-only requests with a read index, which etcd/raft and tikv/raft-rs already support (they also support a further read-only optimization called lease read).
I haven't seen a plan for it in the roadmap. Will the optimization be supported?
The basic procedure to handle read index with openraft can be:
1. A follower that receives a read-only request sends a ReadIndex request to the current leader (if there is one). (To support follower read.)
2. A leader that receives a read-only request sends a ReadIndex request to itself.
3. If a newly stepped leader receives a ReadIndex request before it commits an empty entry, the leader saves the ReadIndex request to pending_read_indices. (To make sure the leader has the largest committed index of the group.)
4. When a leader receives a ReadIndex request, or a newly stepped leader has successfully committed an empty entry, the leader uses the current committed index as the ReadIndex.
5. If the ReadIndex request comes from a follower, the leader responds to the follower with the ReadIndex.
6. Once the applied index reaches the ReadIndex, the related read-only request can be served.
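The leader-side part of the steps above could look roughly like the following sketch. All names except `pending_read_indices` are hypothetical. It also leaves out the heartbeat round that the Raft thesis requires to confirm leadership before a read index is trusted.

```rust
use std::collections::VecDeque;

/// Identifier of a pending read-only request (hypothetical).
pub type ReadId = u64;

/// Hypothetical leader state relevant to ReadIndex handling.
pub struct Leader {
    committed: u64,
    /// True once the empty entry from the leader's own term is committed.
    own_term_committed: bool,
    /// Requests that arrived before the empty entry was committed.
    pending_read_indices: VecDeque<ReadId>,
}

impl Leader {
    pub fn new(committed: u64) -> Self {
        Self { committed, own_term_committed: false, pending_read_indices: VecDeque::new() }
    }

    /// Returns the read index right away, or parks the request if the
    /// leader cannot yet be sure it knows the group's largest committed index.
    pub fn on_read_index(&mut self, id: ReadId) -> Option<u64> {
        if self.own_term_committed {
            Some(self.committed)
        } else {
            self.pending_read_indices.push_back(id);
            None
        }
    }

    /// Called when the empty entry commits at `index`; releases every
    /// parked request with that index as its read index.
    pub fn on_empty_entry_committed(&mut self, index: u64) -> Vec<(ReadId, u64)> {
        self.committed = index;
        self.own_term_committed = true;
        self.pending_read_indices.drain(..).map(|id| (id, index)).collect()
    }
}

/// A read can be served once the state machine has applied up to the read index.
pub fn can_serve(applied: u64, read_index: u64) -> bool {
    applied >= read_index
}
```

A follower would forward the request, receive the read index from `on_read_index` (or later from `on_empty_entry_committed`), and then wait until `can_serve` holds for its own applied index.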