
Leader blocked when apply_entry_to_state_machine takes too long #76

Closed
MarinPostma opened this issue Oct 13, 2020 · 8 comments · Fixed by #88
Labels
bug, replication

Comments

@MarinPostma
Contributor

In my use case, applying a log entry may take a while, and I found that this would make the leader unable to make progress, causing it to time out and trigger new elections. Here is what I have found so far:

  • in LeaderState::run the select! blocks on handle_replica_event (after the log entry has been committed)
  • handle_replica_event calls handle_update_match_index, which repeatedly calls client_request_post_commit for each entry to be applied.

During all this time, the loop in run is blocked performing this task and can't perform its other tasks, such as sending heartbeats. This causes election timeouts, and for some reason that I haven't yet completely understood (probably in my implementation) the entry is not applied on the leader node, but it is on the other nodes.

The log application process should not block the leader from operating normally. A possible fix would be to offload the work performed by client_request_post_commit to another task: instead of calling it directly, handle_update_match_index would enqueue entries to be applied. I am currently experimenting with this solution.
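
Roughly, the pattern I am experimenting with looks like this. It is only a sketch built on tokio; the ApplyMsg type and the channel wiring are illustrative, not actual async-raft internals:

```rust
use tokio::sync::mpsc;

/// Hypothetical message carrying an entry that is now safe to apply.
struct ApplyMsg {
    index: u64,
    data: Vec<u8>,
}

/// Spawn a dedicated applier task so the leader's main loop never blocks on
/// the state machine: handle_update_match_index would push onto the sending
/// half of this channel instead of awaiting client_request_post_commit.
fn spawn_applier(mut rx: mpsc::UnboundedReceiver<ApplyMsg>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        // Entries arrive in commit order; applying them sequentially here
        // preserves ordering without blocking the leader loop.
        while let Some(msg) = rx.recv().await {
            apply_entry(msg).await;
        }
    })
}

async fn apply_entry(msg: ApplyMsg) {
    // Placeholder for the slow, user-defined apply_entry_to_state_machine work.
    let _ = (msg.index, msg.data);
}
```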

@thedodd
Collaborator

thedodd commented Oct 13, 2020

Hey @MarinPostma

In my use case, applying a log entry may take a while

A quick clarifying question: do you mean "appending to the log" or "applying to the state machine"?

There are definitely some complexities with that statement, and there are a few different ways of finding a path forward. On one hand, I would normally say that applying a log entry should NOT take a long time. It should be quite fast. There are a few reasons for this:

  • in Raft, changes must be linearizable. The ID given to each entry must be properly ordered, and they must be applied in proper order as well. As such, the longer it takes for an app to apply changes to the state machine, the lower the throughput of the app will be overall (which is fine, that's a trade off, I suppose).
  • in the current implementation, this does boil down to impacting the sending of heartbeats, which is definitely something which we will need to update in the code. The previous implementation of this system (the 0.4.x line and earlier releases) did not have this issue.
    • the update which should be made here is to have the leader node spawn a task which only sends heartbeats.
    • this will ensure that heartbeats are not blocked by anything else.

All in all, I think this actually just boils down to the need to spawn a heartbeat task on the leader node. This should be quite simple.
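
Something along these lines, purely as a sketch (send_heartbeats here is a stand-in for whatever the leader's replication plumbing exposes, not an existing async-raft function):

```rust
use std::time::Duration;
use tokio::time::interval;

/// Sketch of a leader task whose only job is to send heartbeats, so that
/// they can never be starved by log application or other work.
fn spawn_heartbeat_task(heartbeat_interval_ms: u64) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut ticker = interval(Duration::from_millis(heartbeat_interval_ms));
        loop {
            ticker.tick().await;
            // Placeholder: fan out empty AppendEntries RPCs to every follower.
            send_heartbeats().await;
        }
    })
}

async fn send_heartbeats() {
    // Hypothetical stand-in for the replication streams' heartbeat path.
}
```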

@thedodd added the enhancement and replication labels Oct 13, 2020
@thedodd self-assigned this Oct 13, 2020
@MarinPostma
Contributor Author

Hey,

Yes, I mean applying to the state machine. Indeed, I am aware of that trade-off, but write throughput is not a problem for me.

@MarinPostma
Contributor Author

All in all, I think this actually just boils down to the need to spawn a heartbeat task on the leader node. This should be quite simple.

This will not be enough: while the followers are applying the entry to the state machine, they're not responding to heartbeats either. Actually, I was wrong, and the heartbeat on the leader side is already handled by the replication core, so this is not an issue, and that's why I'm receiving this error message repeatedly.

 [2020-10-13T17:27:34Z ERROR async_raft::replication] timeout while sending AppendEntries RPC to target error=deadline has elapsed

@thedodd
Collaborator

thedodd commented Oct 13, 2020

@MarinPostma ah, yea I was going to say, I seemed to recall that the leader was already handling that in the replication stream (replication core).

@MarinPostma check out this section in the docs, which is actually just a quote of the spec: https://docs.rs/async-raft/0.5.1/async_raft/config/struct.Config.html (specifically the quote of the Raft spec which mentions broadcastTime).

A key takeaway here is that the Raft protocol is time sensitive. I would recommend that you update the architecture of your system to make the process of applying to the state machine faster. There are lots of possibilities here, however to keep it simple they all boil down to this: push the heavy work to an async model. By my analysis, ANYTHING which is intensive / time consuming should be done before or after the Raft component in an app. Take a few examples:

  • in traditional SQL systems (and most other databases), you have MVCC. This makes it "trivial" (I exaggerate) to ensure that the process of modifying data, applying transactions and such is snappy.
  • indexes are used in many systems to make expensive sorting and searching (often required for write operations to uphold constraints) much faster.
  • a state-based approach can be used where the process of applying to the state machine could be made EXTREMELY fast by writing an authoritative update to your state machine. From there, your app can spawn asynchronous work to finish up the heavy lifting or whatever it is which is taking time.
  • in short, broadcastTime should be as small as possible. Applying to the state machine should NOT take much time, it should be very fast.

If none of those things are possible in your case (I'm not going to pretend to have all of the answers), then you should modify your runtime config to set heartbeat_interval to a higher value. There is a risk associated with doing this: it will take longer to detect a dead leader.
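
For illustration only, assuming the Config::build builder and these method names from the docs linked above (please double-check them against the exact version you are running):

```rust
use async_raft::Config;

/// Values are in milliseconds. Raising heartbeat_interval (and widening the
/// election timeout window with it) buys slack for slow applies, at the cost
/// of taking longer to detect a dead leader.
fn build_config() -> Config {
    Config::build("my-cluster".into())
        .heartbeat_interval(500)
        .election_timeout_min(1500)
        .election_timeout_max(3000)
        .validate()
        .expect("invalid Raft config")
}
```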


All that said ... I can see an argument for more strictly separating the heartbeat & the append entries paths. If we were to do this, then the replication streams would have a dedicated "pipe" for sending heartbeats with no entries. This would constitute a relatively simple set of updates to the leader & follower to ensure that the heartbeat path is never coupled to the append_entries process (which it is according to the spec ... they are literally the same thing ... heartbeats simply don't have entries to replicate).

Let me know what you think about the architecture updates which I suggested above. I'll think a bit more about the update I've suggested in the previous paragraph.

@MarinPostma
Contributor Author

MarinPostma commented Oct 13, 2020

Thanks for all this advice!

As you suggested, I had initially thought of doing the heavy lifting after the Raft component, but it turned out to be an issue. Currently, I replicate write operations prior to indexing; document indexing is the task that requires a lot of time.
Before, those tasks would just be enqueued, and as far as Raft was concerned this was instantaneous, so there was no problem. However, when the need for snapshots arose, I had to make sure that all the enqueued indexing operations were processed before I could create a snapshot. This adds the need for synchronization between the indexing and snapshot-creation tasks. With this in mind, the update queue now appears redundant, and synchronization mechanisms have to be found so I can reliably create snapshots, hence having Raft perform the indexing synchronously. I will try synchronizing during snapshotting, but this is not very satisfying.
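
For reference, the synchronization I am considering looks roughly like this: push a barrier message through the indexing queue and wait for its acknowledgement before snapshotting. All the types here are illustrative of my app, not of async-raft:

```rust
use tokio::sync::{mpsc, oneshot};

/// Jobs flowing through my (illustrative) indexing queue.
enum IndexJob {
    Index(Vec<u8>),
    /// Barrier: acknowledged once everything queued before it has been processed.
    Flush(oneshot::Sender<()>),
}

async fn indexing_worker(mut rx: mpsc::UnboundedReceiver<IndexJob>) {
    while let Some(job) = rx.recv().await {
        match job {
            IndexJob::Index(doc) => {
                // Slow document indexing happens here.
                let _ = doc;
            }
            IndexJob::Flush(ack) => {
                // Everything enqueued before this point has been indexed.
                let _ = ack.send(());
            }
        }
    }
}

/// Called before creating a snapshot: wait until the queue is drained.
async fn wait_for_indexing(tx: &mpsc::UnboundedSender<IndexJob>) {
    let (ack_tx, ack_rx) = oneshot::channel();
    let _ = tx.send(IndexJob::Flush(ack_tx));
    let _ = ack_rx.await;
}
```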


I have re-read the paper, and this is what I understand:
As you pointed out, broadcastTime is the average time it takes for a server to send an RPC and receive a response; this we both agree on. But looking at the implementation guidelines on page 4 of the paper, it says (for the AppendEntries RPC):

Receiver implementation:

  1. Reply false if term < currentTerm (§5.1)
  2. Reply false if log doesn’t contain an entry at prevLogIndex
    whose term matches prevLogTerm (§5.3)
  3. If an existing entry conflicts with a new one (same index
    but different terms), delete the existing entry and all that
    follow it (§5.3)
  4. Append any new entries not already in the log
  5. If leaderCommit > commitIndex, set commitIndex =
    min(leaderCommit, index of last new entry)

To me, the AppendEntries RPC is not in charge of actually applying entries to the state machine. According to this, broadcastTime is not correlated with the time it takes to perform the entry's operation, is that right?

@thedodd
Collaborator

thedodd commented Oct 14, 2020

@MarinPostma thanks for the background. That definitely helps. I do want to ask more about the indexing operation ... however, I think that you should be free to tackle this problem in whatever way you see fit as long as it is within the bounds of Raft safety ... and this should be.

So, that said, I was thinking about this a bit earlier, and I remembered that we do indeed make calls to replicate_to_state_machine_if_needed from within the handle_append_entries_request handler and we await the response. So, in this particular case, that is what is causing the delay in responding to heartbeats in a timely fashion.

I think a good solution for this is to simply have followers spawn a task which monitors this state asynchronously and applies entries to the state machine on its own, as new data becomes available and it can prove that doing so is safe according to the Raft protocol.
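
Roughly the shape I have in mind, with illustrative names only (this is not the actual async-raft code): the AppendEntries handler just publishes the new commit index, and a dedicated task applies entries as that signal advances.

```rust
use tokio::sync::watch;

/// Sketch: the AppendEntries handler only publishes the latest commit index;
/// this dedicated task does the (possibly slow) state machine work.
fn spawn_applier_task(mut commit_index: watch::Receiver<u64>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut last_applied: u64 = 0;
        // Wake up whenever the leader-advertised commit index advances.
        while commit_index.changed().await.is_ok() {
            let committed = *commit_index.borrow();
            while last_applied < committed {
                last_applied += 1;
                // Fetch the entry (from an in-memory cache or the log) and
                // apply it; only this task ever writes to the state machine.
                apply_to_state_machine(last_applied).await;
            }
        }
    })
}

async fn apply_to_state_machine(index: u64) {
    // Placeholder for the user's RaftStorage::apply_entry_to_state_machine.
    let _ = index;
}
```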

I opened #12 A LONG time ago, and it might be good to update that issue with an additional requirement to move the replicate_to_state_machine_if_needed calls off of the hot path of the append entries handler. I'll do that now and then prioritize those changes. Doing so should address the issue you are facing.

@MarinPostma
Contributor Author

Many thanks @thedodd, let me know if you want help with the implementation.

@thedodd added the bug label and removed the enhancement label Nov 5, 2020
@thedodd
Collaborator

thedodd commented Nov 18, 2020

@MarinPostma just wanted to let you know that I've started hacking on this now. I'm blocked until it's done, so it should be done quite soon :).

thedodd added a commit that referenced this issue Nov 19, 2020
With this change, we are also caching entries which come from the leader
replication protocol. As entries come in, we append them to the log and
then cache the entry. When it is safe to apply entries to the state
machine, we will take them directly from the in-memory cache instead of
going to disk.

Moreover, and most importantly, we are no longer blocking the
AppendEntries RPC handler with the logic of the state machine
replication workflow. There is a small amount of async task juggling to
ensure that we don't run into situations where we would have two writers
attempting to write to the state machine at the same time. This is
easily avoided in our algorithm.

closes #12
closes #76