Leader blocked when apply_entry_to_state_machine takes too long #76
Hey @MarinPostma
A quick clarifying question: do you mean "appending to the log" or "applying to the state machine"? There are definitely some complexities with that statement, and there are a few different ways of finding a path forward. On one hand, I would normally say that applying a log entry should NOT take a long time. It should be quite fast. There are a few reasons for this:
All in all, I think this actually just boils down to the need to spawn a heartbeat task on the leader node. This should be quite simple.
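For what it's worth, a minimal sketch of that idea, assuming a tokio runtime (the `send_heartbeats` helper below is a hypothetical stand-in, not an async-raft API), could look something like this:

```rust
use std::time::Duration;

// Hypothetical sketch: the leader spawns a dedicated task so that heartbeats
// keep flowing even while other leader work is in progress.
fn spawn_heartbeat_task(heartbeat_interval_ms: u64) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_millis(heartbeat_interval_ms));
        loop {
            ticker.tick().await;
            // Stand-in for sending empty AppendEntries RPCs to all followers.
            send_heartbeats().await;
        }
    })
}

async fn send_heartbeats() {
    // Placeholder: issue the heartbeat RPCs here.
}
```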
Hey, yes, I mean applying to the state machine. Indeed, I am aware of that trade-off, but write throughput is not a problem for me.
This will not be enough: while the followers are replicating the entry to the state machine, they're not responding to heartbeats either. Actually, I was wrong, and the heartbeat on the leader side is already handled by the replication core, so this is not an issue, and that's why I'm receiving this error message repeatedly.
@MarinPostma ah, yea, I was going to say, I seemed to recall that the leader was already handling that in the replication stream (replication core).

@MarinPostma check out this section in the docs, which is actually just a quote of the spec: https://docs.rs/async-raft/0.5.1/async_raft/config/struct.Config.html (specifically the quote of the Raft spec). A key takeaway here is that the Raft protocol is time sensitive.

I would recommend that you update the architecture of your system to make the process of applying to the state machine faster. There are lots of possibilities here; however, to keep it simple, they all boil down to this: push the heavy work to an async model. By my analysis, ANYTHING which is intensive / time consuming should be done before or after the Raft component in an app. Take a few examples:

If none of those things are possible in your case (I'm not going to pretend to have all of the answers), then you should modify your runtime config to have the

All that said ... I can see an argument for more strictly separating the

Let me know what you think about the architecture updates which I suggested above. I'll think a bit more about the update I've suggested in the previous paragraph.
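To make the "push the heavy work to an async model" suggestion concrete, here is a rough, library-agnostic sketch; the `Document`, `StateMachine`, and `expensive_indexing` names are hypothetical and not part of async-raft. The apply path stays cheap and hands the slow work to a background task over a channel:

```rust
use tokio::sync::mpsc;

// Hypothetical document type; stands in for whatever your write operations carry.
struct Document {
    id: u64,
    body: String,
}

// The fast path: record the write in the state machine and hand the slow
// indexing work to a background task, so applying an entry returns quickly.
struct StateMachine {
    index_tx: mpsc::UnboundedSender<Document>,
}

impl StateMachine {
    fn new() -> Self {
        let (index_tx, mut index_rx) = mpsc::unbounded_channel::<Document>();
        // Background worker: performs the expensive indexing outside the Raft path.
        tokio::spawn(async move {
            while let Some(doc) = index_rx.recv().await {
                expensive_indexing(doc).await;
            }
        });
        Self { index_tx }
    }

    // Called when a committed entry is applied; this must stay fast.
    fn apply(&self, doc: Document) {
        // Persist the raw write here (cheap), then queue the heavy work.
        let _ = self.index_tx.send(doc);
    }
}

async fn expensive_indexing(_doc: Document) {
    // Placeholder for the slow document-indexing step.
}
```

The trade-off is that anything reading from the index may briefly lag behind the Raft commit point, which is usually acceptable when indexing is the slow step.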
Thanks for all this advice! As you suggested, I had initially thought of doing the heavy lifting after the Raft component, but it turned out to be an issue. Currently, I replicate write operations prior to indexing. Document indexing is the task that requires a lot of time. I have re-read the paper, and this is what I understand:

To me, the
@MarinPostma thanks for the background. That definitely helps. I do want to ask more about the indexing operation ... however, I think that you should be free to tackle this problem in whatever way you see fit, as long as it is within the bounds of Raft safety ... and this should be. So, that said, I was thinking about this a bit earlier, and I remembered that we do indeed make calls to

I think a good solution for this is to simply have followers spawn a task which monitors this state asynchronously and applies entries to the state machine on its own as new data becomes available and it is able to prove that doing so is safe according to the Raft protocol.

I opened #12 A LONG time ago, and it might be good to update that issue with an additional requirement to move the
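A minimal sketch of that follower-side idea, with hypothetical names (this is not the actual async-raft internals), using a watch channel to track the commit index:

```rust
use tokio::sync::watch;

// Hypothetical sketch: a dedicated task watches the commit index and applies
// entries to the state machine on its own, so the AppendEntries handler never
// blocks on slow applies.
async fn state_machine_apply_task(
    mut commit_index_rx: watch::Receiver<u64>,
    mut last_applied: u64,
) {
    while commit_index_rx.changed().await.is_ok() {
        let commit_index = *commit_index_rx.borrow();
        // Per Raft, every entry up to the commit index may be applied,
        // in order, exactly once.
        while last_applied < commit_index {
            last_applied += 1;
            apply_entry(last_applied).await;
        }
    }
}

async fn apply_entry(_index: u64) {
    // Placeholder: load the entry (from an in-memory cache or the log) and apply it.
}
```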
Many thanks @thedodd, let me know if you want help with the implementation.
@MarinPostma just wanted to let you know that I've started hacking on this now. I'm blocked until it's done, so it should be done quite soon.
With this change, we are also caching entries which come from the leader replication protocol. As entries come in, we append them to the log and then cache the entry. When it is safe to apply entries to the state machine, we will take them directly from the in-memory cache instead of going to disk. Moreover, and most importantly, we are no longer blocking the AppendEntries RPC handler with the logic of the state machine replication workflow. There is a small amount of async task juggling to ensure that we don't run into situations where we would have two writers attempting to write to the state machine at the same time. This is easily avoided in our algorithm. closes #12 closes #76
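As an illustration of the caching idea described above (a sketch with hypothetical types, not the crate's actual code), entries can be kept in an ordered map keyed by log index and drained once the commit index catches up to them:

```rust
use std::collections::BTreeMap;

// Sketch of the caching idea: entries arriving from the leader are appended to
// the log and also kept in an in-memory cache, so applying them later does not
// require re-reading the log from disk.
struct EntryCache<E> {
    cache: BTreeMap<u64, E>,
}

impl<E> EntryCache<E> {
    fn new() -> Self {
        Self { cache: BTreeMap::new() }
    }

    // Called from the AppendEntries path after the entry has been written to the log.
    fn insert(&mut self, index: u64, entry: E) {
        self.cache.insert(index, entry);
    }

    // Called by the apply task: drain every cached entry at or below the commit index.
    fn take_up_to(&mut self, commit_index: u64) -> Vec<(u64, E)> {
        let keep = self.cache.split_off(&(commit_index + 1));
        let drained = std::mem::replace(&mut self.cache, keep);
        drained.into_iter().collect()
    }
}
```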
In my use case, applying a log entry may take a while, and I found that this would make the leader unable to make progress, causing it to time out and trigger new elections. Here is what I have found so far: `LeaderState::run`'s `select!` blocks on `handle_replica_event` (after the log entry has been committed); `handle_replica_event` calls `handle_update_match_index`, which repeatedly calls `client_request_post_commit` for each entry to be applied. During all this time, the loop in `run` is blocked performing this task and can't perform its other tasks, such as sending heartbeats. This causes election timeouts, and for some reason that I haven't yet completely understood (probably in my implementation), the entry is not applied on the leader node, but is on the other nodes.

The log application process should not block the leader from operating normally. A possible fix would be to offload the work performed by `client_request_post_commit` to another thread, and instead of calling it directly, `handle_update_match_index` would enqueue entries to be applied. I am currently experimenting with this solution.
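For illustration, here is a sketch of that proposed fix with hypothetical wiring; `client_request_post_commit` below is only a stand-in for the existing function of that name. The leader's loop pushes committed entry indexes onto a queue, and a separate task drains the queue, so the `run` loop stays free to send heartbeats:

```rust
use tokio::sync::mpsc;

// Sketch of the proposed fix: instead of the leader's event loop calling
// `client_request_post_commit` inline, it pushes committed indexes onto a
// queue, and a separate task drains that queue and applies the entries.
fn spawn_apply_worker() -> mpsc::UnboundedSender<u64> {
    let (tx, mut rx) = mpsc::unbounded_channel::<u64>();
    tokio::spawn(async move {
        while let Some(index) = rx.recv().await {
            client_request_post_commit(index).await;
        }
    });
    tx
}

// Stand-in for the existing post-commit handling (applying the entry to the
// state machine, responding to the client, etc.).
async fn client_request_post_commit(_index: u64) {}
```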