In normal operation, all of the replicas for a range are close to the commit index, and raft log truncation deletes only state that no replica still needs. When one replica falls too far behind, the raft log queue will truncate the raft log for the range in a way that forces that replica to be "caught up" via a snapshot. Once this occurs, there is not much point in keeping that original replica as part of the range, as it no longer contributes to the Range quorum.
Raft-initiated snapshots involve a decent amount of complexity in the code (i.e. Replica.mu.{outSnap,outSnapDone}). As an alternative to that complexity, we could replace the Raft-related snapshot code with a small bit of code that enqueues the replica in the replicate queue when a Raft snapshot is requested. I think this would involve lying to Raft to indicate that the snapshot was generated, and then having the replicate queue remove any replica that is in state ProgressStateSnapshot. In addition to the code simplification, this approach would have the benefit of proactively rebalancing active ranges away from slow nodes. There is some additional overhead in going through a Raft group configuration change to remove a replica and add a new one, but I think that will be lost in the noise.
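A rough sketch of the replicate queue side of this, to make the idea concrete. Everything here is a local stand-in: `ProgressState` only mirrors the state names Raft uses, and `ReplicaID` and `replicasToRemove` are hypothetical names, not existing code. It also leaves out the "lying to Raft" half entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// ProgressState is a local stand-in for the per-follower state that the
// Raft leader tracks. It is not the real Raft type.
type ProgressState int

const (
	ProgressStateProbe ProgressState = iota
	ProgressStateReplicate
	ProgressStateSnapshot
)

// ReplicaID is a hypothetical identifier for a replica of a range.
type ReplicaID int

// replicasToRemove (hypothetical) returns the replicas that the leader
// believes need a snapshot. Under the proposal, the replicate queue would
// remove these and add replacements instead of sending a Raft snapshot.
// The result is sorted so the output is deterministic.
func replicasToRemove(progress map[ReplicaID]ProgressState) []ReplicaID {
	var out []ReplicaID
	for id, state := range progress {
		if state == ProgressStateSnapshot {
			out = append(out, id)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	// Replica 3 has fallen behind the truncated log.
	progress := map[ReplicaID]ProgressState{
		1: ProgressStateReplicate,
		2: ProgressStateReplicate,
		3: ProgressStateSnapshot,
	}
	fmt.Println(replicasToRemove(progress))
}
```

The real version would presumably run on the leaseholder and feed the result into the existing remove/add path of the replicate queue, which is where the rebalancing away from slow nodes would come from.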
There is some exploration to do here to make sure I'm not missing anything. @bdarnell please point out the flaws in this idea.
Cc @cockroachdb/stability