storage: revisit Raft log truncation and Raft snapshots #12238

@petermattis

In normal operation, all of the replicas for a range stay close to the commit index, and Raft log truncation deletes state that is no longer needed. When one replica falls too far behind, the raft log queue truncates the range's Raft log in a way that forces that replica to be "caught up" via a snapshot. Once this occurs, there is not much point in keeping that original replica as part of the range, since it no longer contributes to the range's quorum.

Raft-initiated snapshots involve a decent amount of complexity in the code (i.e. Replica.mu.{outSnap,outSnapDone}). As an alternative to that complexity, we could replace the Raft-related snapshot code with a small bit of code that enqueues the replica in the replicate queue when a Raft snapshot is requested. I think this would involve lying to Raft to indicate that the snapshot was generated, and then having the replicate queue remove any replica which is in state ProgressStateSnapshot. In addition to the code simplification, this approach would have the benefit of proactively rebalancing active ranges away from slow nodes. There is some additional overhead in going through a Raft group configuration change to remove a replica and add a new one, but I think that will be lost in the noise.

There is some exploration to do here to make sure I'm not missing anything. @bdarnell please point out the flaws in this idea.

Cc @cockroachdb/stability
