
LineRateState won't ever handle shutdown as long as there's something in its outgoing buffer #136

Open
lilyball opened this issue Sep 23, 2021 · 4 comments

Comments

@lilyball
While a replication stream is in LineRateState, it always prioritizes draining its buffers first. Unfortunately, this means that if it never manages to drain its buffers, it will also never check for events, which in turn means it will never handle a terminate event.

The practical effect of this is if the remote node goes down, the buffers will never drain, and the replication stream will stay alive as a zombie even after the raft core switches out of the leader state, and even after raft itself shuts down. As long as the tokio executor is running, the replication stream will continue shouting into the void, trying to contact a node that doesn't respond.

This is also exacerbated by the fact that raft shutdown doesn't actually wait for its replication streams to shut down, which means raft will declare that it has fully shut down while the zombie replication streams stay alive.

Sample log:

Sep 23 16:52:15.636  INFO main{id=5 cluster=LeaderElectionCluster}: async_raft::core: node has shutdown 
Sep 23 16:52:15.654 ERROR main{id=5 target=2 cluster=LeaderElectionCluster}:run{state="line-rate"}:send_append_entries: async_raft::replication: error sending AppendEntries RPC to target error=the Append Entries RPC failed 
Sep 23 16:52:15.677 ERROR main{id=5 target=1 cluster=LeaderElectionCluster}:run{state="line-rate"}:send_append_entries: async_raft::replication: timeout while sending AppendEntries RPC to target error=deadline has elapsed 
Sep 23 16:52:15.678 ERROR main{id=5 target=1 cluster=LeaderElectionCluster}:run{state="line-rate"}:send_append_entries: async_raft::replication: error sending AppendEntries RPC to target error=the Append Entries RPC failed 
Sep 23 16:52:15.686 ERROR main{id=5 target=2 cluster=LeaderElectionCluster}:run{state="line-rate"}:send_append_entries: async_raft::replication: error sending AppendEntries RPC to target error=the Append Entries RPC failed 

The log continues like this forever, failing to send to targets 1 and 2 over and over even though raft itself has shut down.

@lilyball
Author

On a related note, being told to shut down doesn't interrupt the send_append_entries() calls. I feel like termination should probably be handled on a separate channel from the other events, such that send_append_entries() can watch that channel too (as could any other async operation that might take a while to complete).

@drmingdrmer
Contributor

On a related note, being told to shutdown doesn't interrupt the send_append_entries() calls. I feel like termination should probably be handled on a separate channel from the other events, such that send_append_entries() can watch that channel too (or any other async operation that might take a while to complete).

Since there is no loop in send_append_entries(), the only way, IMHO, is to wrap send_append_entries() in an Abortable, so that it can be stopped by sending a signal:

let (handle, reg) = AbortHandle::new_pair();
let (chan_tx, _) = broadcast::channel(1);
let tx_compaction = self.tx_compaction.clone();
self.snapshot_state = Some(SnapshotState::Snapshotting {
    handle,
    sender: chan_tx.clone(),
});
tokio::spawn(
    async move {
        let res = Abortable::new(storage.do_log_compaction(), reg).await;
        match res {
            Ok(res) => match res {
                Ok(snapshot) => {
                    let _ = tx_compaction.try_send(SnapshotUpdate::SnapshotComplete(snapshot.index));
                    let _ = chan_tx.send(snapshot.index); // This will always succeed.
                }
                Err(err) => {
                    tracing::error!({error=%err}, "error while generating snapshot");
                    let _ = tx_compaction.try_send(SnapshotUpdate::SnapshotFailed);
                }
            },
            Err(_aborted) => {
                let _ = tx_compaction.try_send(SnapshotUpdate::SnapshotFailed);
            }
        }
    }
);

@lilyball
Author

Isn’t an Abortable just a convenience wrapper around a channel and a select! call that watches both the channel and the original future? If you already have your own shutdown channel, you can just do the select! yourself. Heck, the state handler itself could move its body into an async move {} block and wrap it in a select! with the shutdown channel; that way every async operation it performs becomes interruptible. This assumes all of its async operations are safe to cancel that way, but that’s generally a good idea anyway.

@drmingdrmer
Contributor

I have not read the source code of Abortable, but I guess it is very likely implemented the way you described.

This assumes all of its async operations are safe to cancel that way, but that’s generally a good idea anyway.

IMHO, yes, every future is safe to cancel at every one of its await points.
Thanks to RAII, an object is cleaned up properly when it is dropped. :DDD
