
storage: let preemptive snapshots go through Raft #7833

Merged
tbg merged 2 commits into cockroachdb:master from only-raft-snapshots on Jul 25, 2016

Conversation

tbg
Member

@tbg tbg commented Jul 14, 2016

Change the way in which preemptive snapshots are applied to use
a temporary Raft group. This lets Raft perform the verification
of the snapshot against its on-disk state and allows us to remove
some of the ad-hoc checks we have added recently.

Some fallout from this was making Raft config creation reusable,
trimming down (*Replica).applySnapshot and minor improvements
on the sending side.

See #6144.
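
To make the mechanism concrete, here is a rough sketch of the temporary-group idea. It is illustrative only: it assumes the coreos/etcd raft API of this era (where `NewRawNode` also takes a peer list), and `stepPreemptiveSnapshot` is a made-up name standing in for the actual Store/Replica plumbing.

```go
package storage

import (
	"math"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// stepPreemptiveSnapshot sketches the temporary-Raft-group idea: build a
// throwaway RawNode on top of the replica's storage and Step the snapshot
// message into it, so that Raft itself verifies the snapshot against the
// on-disk state. Handling of the resulting Ready is covered in the review
// discussion below.
func stepPreemptiveSnapshot(storage raft.Storage, msg raftpb.Message) (*raft.RawNode, error) {
	const preemptiveSnapRaftID = math.MaxUint64 // bogus ID; this group never ticks or campaigns
	rawNode, err := raft.NewRawNode(&raft.Config{
		ID:              preemptiveSnapRaftID,
		ElectionTick:    10, // dummy values: Tick() is never called on this group
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   math.MaxUint64,
		MaxInflightMsgs: 1,
	}, nil) // the trailing peer list is part of the etcd raft API of this era
	if err != nil {
		return nil, err
	}
	// Raft performs the snapshot validation for us here.
	if err := rawNode.Step(msg); err != nil {
		return nil, err
	}
	return rawNode, nil
}
```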

cc @tamird @bdarnell @petermattis @arjunravinarayan


This change is Reviewable

@tamird
Contributor

tamird commented Jul 14, 2016

Looks reasonable to me. Acquiring exclusive access to a portion of the key space should amount to holding a read lock on an interval tree and validating that no other replica overlaps with that portion of key space. cc @arjunravinarayan #7830.


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 4 of 4 files at r5.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2929 [r3] (raw file):

      // negate the benefits of pre-emptive snapshots, but that is a recoverable
      // degradation, not a catastrophic failure.
      log.Trace(ctx, "generated snapshot")

wrong place, obviously


storage/replica_command.go, line 2941 [r4] (raw file):

      if repDesc.ReplicaID != 0 {
          return errors.Errorf("must not specify a ReplicaID for new Replica")

needs to include the replica ID itself


storage/replica_raftstorage.go, line 590 [r5] (raw file):

  }
  // TODO(tschottdorf): interestingly, it looks that Raft will make use of

is that surprising? seems expected - an old snapshot is still useful if the remainder of the work can be done by replaying the log.


storage/store.go, line 2228 [r4] (raw file):

              return errors.New("already running")
          }
          const preemptiveSnapReplicaID = math.MaxUint64

this was confusing. it's not a replica ID, it's a raft ID.


storage/store.go, line 2227 [r5] (raw file):

          // TODO(tschottdorf): This should be synchronized better with the
          // rest of the game. We're not under any continuous lock here. It's
          // conceivable that checks become invalidated as we process with

proceed?


storage/store.go, line 2254 [r5] (raw file):

              defer func() {
                  if err != nil {
                      r.mu.internalRaftGroup = nil

looks like this can avoid checking the error and instead always:


            // TODO(tschottdorf): at this point, the Replica should simply
            // disappear. Since this is a WIP, we just nil out the group.
            r.mu.Lock()
            r.mu.internalRaftGroup = nil // force recreation on next op
            r.mu.Unlock()

storage/store.go, line 2258 [r5] (raw file):

              }()
              if r.mu.internalRaftGroup != nil {
                  // This can happen if a previous ChangeReplicas failed.

more words?


Comments from Reviewable

@bdarnell
Contributor

Reviewed 1 of 1 files at r1, 1 of 2 files at r4, 4 of 4 files at r5.
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 576 [r5] (raw file):

      // to apply, so that should never happen, but we check anyway.
      // TODO(tschottdorf): is this correct or did we conclude that Raft's
      // sanity checks assumed we weren't touching the acked log entries?

I believe raft's sanity checks here are sufficient and we don't need to do our own checking here now that we're passing through raft's validation.


storage/replica_raftstorage.go, line 590 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is that surprising? seems expected - an old snapshot is still useful if the remainder of the work can be done by replaying the log.

It's surprising and subtle because there are two terms involved: `raftpb.Message.Term`, which is the `HardState.Term` of the node that sent the snapshot, and `raftpb.Message.Snapshot.Metadata.Term`, which is the term at which the last entry in the snapshot was proposed.

It's fine for Metadata.Term to be older than the current term of the node sending it (although it is unusual for the two to differ for long). The checks implemented in this method are more conservative than those used in raft itself (which are somewhat subtle), so I think it's best to remove them and rely on raft's implementation.

One problem, though, is that we need to set Message.Term when sending the preemptive snapshot. Raft skips the term check for any message with Term 0 (on the assumption that it's a local message for which terms are irrelevant).
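
On the sending side, the fix amounts to something like the following sketch. The cockroach `RaftMessageRequest` envelope and the `To`/`From` fields are elided, and `snapMessage` is just an illustrative name.

```go
import "github.com/coreos/etcd/raft/raftpb"

// snapMessage shows only the Term handling discussed above.
func snapMessage(snap raftpb.Snapshot) raftpb.Message {
	return raftpb.Message{
		Type: raftpb.MsgSnap,
		// Term must be nonzero: raft treats Term == 0 as a local message and
		// skips its term checks. Using the snapshot's own term satisfies
		// Message.Term >= Snapshot.Metadata.Term.
		Term:     snap.Metadata.Term,
		Snapshot: snap,
	}
}
```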


storage/store.go, line 2228 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this was confusing. it's not a replica ID, it's a raft ID.

Raft's node IDs are cockroach replica IDs.

storage/store.go, line 2226 [r5] (raw file):

          //
          // TODO(tschottdorf): This should be synchronized better with the
          // rest of the game. We're not under any continuous lock here. It's

This is under processRaftMu, which provides pretty strong synchronization. This is one of the areas we'll need to be careful about as we increase concurrency and break up processRaftMu, but for now it's safe.


storage/store.go, line 2230 [r5] (raw file):

          // the preemptive snapshot and that we might clash into someone
          // else's keyspace.
          if !r.store.canApplySnapshot(req.RangeID, req.Message.Snapshot) {

We've already called canApplySnapshot above; no need to do it again.


storage/store.go, line 2259 [r5] (raw file):

              if r.mu.internalRaftGroup != nil {
                  // This can happen if a previous ChangeReplicas failed.
                  return errors.New("cannot apply preemptive snapshot, Raft group already active")

Why can't we just apply the preemptive snapshot on the normal code path for initialized replicas? (i.e. instead of going down this preemptive snapshot path for ReplicaID==0, only do it when internalRaftGroup is nil)


storage/store.go, line 2262 [r5] (raw file):

              }
              const preemptiveSnapReplicaID = math.MaxUint64
              if r.mu.internalRaftGroup, err = raft.NewRawNode(&raft.Config{

I don't think this needs to be assigned to r.mu.internalRaftGroup. We can just create a local RawNode and call Step() on it.


storage/store.go, line 2266 [r5] (raw file):

                  // TODO(tschottdorf): I don't think we have to be truthful
                  // about this, but if we did have on-disk state, this Raft
                  // instance might try to give us a large slab of entries

I don't think we need to set Applied here; raft will increase Applied for us when it applies the snapshot.


storage/store.go, line 2274 [r5] (raw file):

                  // TODO(tschottdorf): copied verbatim from `withRaftGroup`,
                  // should centralize this.
                  ElectionTick:    r.store.ctx.RaftElectionTimeoutTicks,

Some of these can also be dummy values since we'll never call Tick() or other methods on this raft group.


storage/store.go, line 2290 [r5] (raw file):

              // In the normal case, the group should have something to say.
              // If it doesn't, our snapshot was probably stale.
              if !r.mu.internalRaftGroup.HasReady() {

In some cases if the snapshot is discarded by raft we would get a Ready, but its Snapshot field would be empty.
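
Continuing the sketch from the PR description (imports elided), the stricter check might look like this; `raft.IsEmptySnap` is the upstream helper for exactly this case.

```go
// readyWithSnapshot is illustrative: HasReady alone is not conclusive, because
// a discarded snapshot still yields a Ready that carries only response messages.
func readyWithSnapshot(rawNode *raft.RawNode) (raft.Ready, error) {
	if !rawNode.HasReady() {
		return raft.Ready{}, errors.New("no Ready after stepping the preemptive snapshot")
	}
	rd := rawNode.Ready()
	if raft.IsEmptySnap(rd.Snapshot) {
		return raft.Ready{}, errors.New("preemptive snapshot was discarded by raft")
	}
	return rd, nil
}
```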


storage/store.go, line 2299 [r5] (raw file):

          // Now handle the reaction of Raft to receiving the snapshot
          // (usually applying the snapshot, along with a HardState update).
          if err := r.handleRaftReady(); err != nil {

We don't want to handle the whole Ready struct - in particular if it contains any Messages, we don't want to send them (they'll contain our bogus ReplicaID). We also don't want to accept a new SoftState because it assumes that the sender of a MsgSnap is a leader. We only want to process the snapshot and HardState.
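
In code, the narrower handling being asked for could look roughly like this (continuing the earlier sketch; `applySnapshot` is a stand-in for the Replica method, not its actual signature).

```go
// handlePreemptiveReady consumes only the parts of the Ready that a
// preemptive snapshot may legitimately produce.
func handlePreemptiveReady(rd raft.Ready, applySnapshot func(raftpb.Snapshot, raftpb.HardState) error) error {
	// rd.Messages is dropped: outgoing messages would carry the bogus replica
	// ID. rd.SoftState is dropped too: it would claim the snapshot sender is
	// the leader.
	return applySnapshot(rd.Snapshot, rd.HardState)
}
```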


storage/store.go, line 2303 [r5] (raw file):

          }

          // TODO(tschottdorf): at this point, the Replica should simply

I don't think we want the Replica to disappear - we've had problems before when there was data on disk with no corresponding in-memory Replica object.


Comments from Reviewable

@bdarnell
Contributor

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


storage/store.go, line 2303 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think we want the Replica to disappear - we've had problems before when there was data on disk with no corresponding in-memory Replica object.

Context: #4204

Comments from Reviewable

@tamird
Contributor

tamird commented Jul 15, 2016

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 590 [r5] (raw file):

One problem, though, is that we need to set Message.Term when sending the preemptive snapshot. Raft skips the term check for any message with Term 0 (on the assumption that it's a local message for which terms are irrelevant).

Is this documented? I think we should seriously consider adding preemptive snapshot support upstream, or we'll never stop depending on knowledge of upstream's implementation.


storage/store.go, line 2228 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Raft's node IDs are cockroach replica IDs.

Right, but this particular value is not a cockroach replica ID; it's a completely bogus value which is _only_ a raft ID.

Comments from Reviewable

@tbg
Member Author

tbg commented Jul 15, 2016

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2929 [r3] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

wrong place, obviously

Done

storage/replica_command.go, line 2941 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

needs to include the replica ID itself

Done.

storage/replica_raftstorage.go, line 576 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I believe raft's sanity checks here are sufficient and we don't need to do our own checking here now that we're passing through raft's validation.

Done.

storage/replica_raftstorage.go, line 590 [r5] (raw file):

One problem, though, is that we need to set Message.Term when sending the preemptive snapshot.

Done.

(on the assumption that it's a local message for which terms are irrelevant).

Shouldn't be hard to add an assertion against this somewhere down low enough?


storage/store.go, line 2228 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Right, but this particular value is not a cockroach replica ID; it's a completely bogus value which is only a raft ID.

Done.

storage/store.go, line 2226 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is under processRaftMu, which provides pretty strong synchronization. This is one of the areas we'll need to be careful about as we increase concurrency and break up processRaftMu, but for now it's safe.

Removed parts of this comment. I'm still not comfortable with the locking here, but I suppose I need to look at it more.

storage/store.go, line 2227 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

proceed?

Done.

storage/store.go, line 2230 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We've already called canApplySnapshot above; no need to do it again.

Oh, overlooked that. Done.

storage/store.go, line 2254 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

looks like this can avoid checking the error and instead always:


          // TODO(tschottdorf): at this point, the Replica should simply
          // disappear. Since this is a WIP, we just nil out the group.
          r.mu.Lock()
          r.mu.internalRaftGroup = nil // force recreation on next op
          r.mu.Unlock()
This code is gone.

storage/store.go, line 2258 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

more words?

Nope, less words. Code is gone.

storage/store.go, line 2259 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why can't we just apply the preemptive snapshot on the normal code path for initialized replicas? (i.e. instead of going down this preemptive snapshot path for ReplicaID==0, only do it when internalRaftGroup is nil)

I just didn't want to deal with this situation in the first version. Current version handles this (because it went back to never creating internalRaftGroup for a ReplicaIDReceiver).

storage/store.go, line 2262 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think this needs to be assigned to r.mu.internalRaftGroup. We can just create a local RawNode and call Step() on it.

This was necessary because I was calling `handleRaftReady` which uses the raft group on `*Replica` in the first iteration. Current one doesn't assign any more and calls `applySnapshot` directly using the `raft.Ready`.

storage/store.go, line 2266 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think we need to set Applied here; raft will increase Applied for us when it applies the snapshot.

See the comment - I'm worried about the case in which we have some older on-disk state (say applied index 10k) and then we make a Raft group with 0 here. Wouldn't Raft spit a lot of entries back at us which it wants to apply? That seems good to avoid.

storage/store.go, line 2274 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Some of these can also be dummy values since we'll never call Tick() or other methods on this raft group.

I refactored out the config creation, mostly to avoid having to deal with these settings.

storage/store.go, line 2290 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In some cases if the snapshot is discarded by raft we would get a Ready, but its Snapshot field would be empty.

Done.

storage/store.go, line 2299 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't want to handle the whole Ready struct - in particular if it contains any Messages, we don't want to send them (they'll contain our bogus ReplicaID). We also don't want to accept a new SoftState because it assumes that the sender of a MsgSnap is a leader. We only want to process the snapshot and HardState.

Done.

storage/store.go, line 2303 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Context: #4204

Comments updated to reflect that brainfart.

Comments from Reviewable

@tbg tbg changed the title from "[WIP] storage: let preemptive snapshots go through Raft" to "storage: let preemptive snapshots go through Raft" on Jul 15, 2016
@tamird
Contributor

tamird commented Jul 15, 2016

Reviewed 7 of 7 files at r6.
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


storage/replica.go, line 272 [r6] (raw file):

  if r.mu.internalRaftGroup == nil {
      raftGroup, err := raft.NewRawNode(newRaftConfig(
          raft.Storage(r),

don't think you need this raft.Storage()


storage/store.go, line 2304 [r6] (raw file):

          raftGroup, err := raft.NewRawNode(
              newRaftConfig(
                  raft.Storage(r),

ditto


storage/store.go, line 2330 [r6] (raw file):

              return err
          }
          // TODO(tschottdorf): at this point, the Replica has data but

what's the TODO here?


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


storage/client_raft_test.go, line 1361 [r5] (raw file):

  // TODO(peter): We're sometimes generating normal snapshots immediately after
  // the preemptive ones. Need to figure out why and fix.
  if false {

This test was disabled because it was flaky. Sometimes non-preemptive snapshots occur and I'm pretty sure it is due to the async sending of preemptive snapshots.


storage/replica_command.go, line 2943 [r5] (raw file):

          return errors.Errorf("must not specify a ReplicaID for new Replica")
      }
      r.raftSender.SendAsync(&RaftMessageRequest{

See my comment on the preemptive snapshot test. I think this should really be a synchronous RPC. Even better, we should make this a streaming RPC and send the snapshot in bits, waiting for the remote to write it to disk. Both these items can wait for subsequent PRs.
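
For illustration only, the streaming variant might have roughly this shape; none of these names exist in the codebase.

```go
// snapshotChunk and snapshotStream are hypothetical types sketching the
// suggestion above: send the snapshot in pieces and block on an ack that the
// remote has written each piece to disk, making ChangeReplicas effectively
// synchronous with the snapshot application.
type snapshotChunk struct {
	data  []byte // a slice of the snapshot data
	final bool   // marks the last chunk
}

type snapshotStream interface {
	Send(chunk snapshotChunk) error // push the next chunk to the remote store
	RecvAck() error                 // returns once the remote has persisted it
}
```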


storage/replica_raftstorage.go, line 631 [r6] (raw file):

  batch.Defer(func() {
      // As the last deferred action after committing the batch, update other

If you want to be guaranteed this is the last deferred action, you should put this defer up right after the creation of the batch. Otherwise it is easy for someone else to add a batch.Defer before this one.

Is there a reason you prefer using a batch.Defer for this? The reverse order execution of defers, while symmetric with Go's builtin defer is unintuitive to the casual reader.
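
The ordering concern mirrors Go's built-in `defer` semantics, as in this minimal standalone example:

```go
package main

import "fmt"

func main() {
	defer fmt.Println("registered first, runs last")
	defer fmt.Println("registered second, runs first")
	// Prints:
	//   registered second, runs first
	//   registered first, runs last
}
```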


storage/store.go, line 2327 [r6] (raw file):

          }
          // Apply the snapshot, as Raft told us to.
          if err := r.applySnapshot(ready.Snapshot, ready.HardState); err != nil {

Just noticed that applying a snapshot here will prevent other raft processing on the store. When we eventually make normal raft processing concurrent in Store.processRaft, this will be a bottleneck.


Comments from Reviewable

@bdarnell
Contributor

Reviewed 7 of 7 files at r6.
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 590 [r5] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

One problem, though, is that we need to set Message.Term when sending the preemptive snapshot.

Done.

(on the assumption that it's a local message for which terms are irrelevant).

Shouldn't be hard to add an assertion against this somewhere down low enough?

Nothing about raft.Message is currently documented; any time we construct one (here and for preemptive snapshots) we're relying on implementation details. Once we have a plan for streaming snapshots we should look at changes to the upstream interfaces to support both streaming and preemptive snapshots.

I think we could also tighten up some assertions internal to raft to guard against improperly hand-crafted Messages. Currently it makes a lot of assumptions, like assuming a message without a term is a message that doesn't need one (instead of a mistake).


storage/store.go, line 2266 [r5] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

See the comment - I'm worried about the case in which we have some older on-disk state (say applied index 10k) and then we make a Raft group with 0 here. Wouldn't Raft spit a lot of entries back at us which it wants to apply? That seems good to avoid.

No, because the snapshot updates the applied index before raft decides whether it has new entries for us to apply. After a snapshot `Applied == Commit == LastIndex` and so there are no new entries for it to give us.

storage/store.go, line 2265 [r6] (raw file):

          // at term zero for internal messages). The sending side uses the
          // term from the snapshot itself, so we check for equality here.
          if a, e := req.Message.Term, req.Message.Snapshot.Metadata.Term; a != e {

It's not required that Message.Term == Snapshot.Metadata.Term. I think it is required that Message.Term >= Snapshot.Metadata.Term, but why do this check yourself instead of just letting raft do it?


storage/store.go, line 2297 [r6] (raw file):

          r.mu.Unlock()
          if groupExists {
              log.Fatalf(

If you want to enforce this (and I still think it's better to go through the regular code path when a raft group exists), it shouldn't be a Fatalf. This is possible in the real world if a replica is removed and re-added before it can be GC'd.


storage/store.go, line 2327 [r6] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Just noticed that applying a snapshot here will prevent other raft processing on the store. When we eventually make normal raft processing concurrent in Store.processRaft, this will be a bottleneck.

The solution to #7830 should let us fix this. Once we can reserve our spot in `replicasByKey` before applying the snapshot, we no longer need to hold `processRaftMu` throughout the snapshot application process, and we can use a Replica-level lock instead.

Comments from Reviewable

@tbg
Member Author

tbg commented Jul 18, 2016

I started this branch as an experiment (to verify our assumption that we could avoid out-of-Raft snapshots), and that seems to have worked out well. Personally, I don't feel that we should re-enable preemptive snapshots due to the related-but-out-of-scope concurrency issues (#7830), so my suggestion is to land this but take out the temporary commit which re-enabled preemptive snapshots until we have those issues sorted out. See also #7875 about removing manual HardState updates (of which I removed one in this PR).


Review status: 1 of 7 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


storage/client_raft_test.go, line 1361 [r5] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This test was disabled because it was flaky. Sometimes non-preemptive snapshots occur and I'm pretty sure it is due to the async sending of preemptive snapshots.

This change is in a commit I have titled `[temp]` for that reason. We'll have to discuss whether we actually want to re-enable preemptive snapshots at this point, but at least during development I wanted to run this test.

storage/replica.go, line 272 [r6] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

don't think you need this raft.Storage()

I added it intentionally since it clarifies what we're passing the Replica as.

storage/replica_command.go, line 2943 [r5] (raw file):

Previously, petermattis (Peter Mattis) wrote…

See my comment on the preemptive snapshot test. I think this should really be a synchronous RPC. Even better, we should make this a streaming RPC and send the snapshot in bits, waiting for the remote to write it to disk. Both these items can wait for subsequent PRs.

Yep, I only enabled this for development. I'll see whether it's easy to turn this into a sync RPC (thus reenabling the test and, if we decide we want to, preemptive snapshots in general) when the rest of the change is approved.

storage/replica_raftstorage.go, line 631 [r6] (raw file):

Previously, petermattis (Peter Mattis) wrote…

If you want to be guaranteed this is the last deferred action, you should put this defer up right after the creation of the batch. Otherwise it is easy for someone else to add a batch.Defer before this one.

Is there a reason you prefer using a batch.Defer for this? The reverse order execution of defers, while symmetric with Go's builtin defer is unintuitive to the casual reader.

I have a recent tendency to group side effects into `Defer`s since I'm planning to remove `batch.Defer` (which means that, in the process, I'm going to see all the side effects that use one).

storage/store.go, line 2266 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

No, because the snapshot updates the applied index before raft decides whether it has new entries for us to apply. After a snapshot Applied == Commit == LastIndex and so there are no new entries for it to give us.

That's after a snapshot which is applied. The snapshot could be discarded. Well, I suppose it's rare enough to not worry about, even if it does happen every once in a blue moon.

storage/store.go, line 2265 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's not required that Message.Term == Snapshot.Metadata.Term. I think it is required that Message.Term >= Snapshot.Metadata.Term, but why do this check yourself instead of just letting raft do it?

This is really just guarding against messages at Term zero. Since we handcraft this message at the sender using the snapshot's term, it seems silly not to check for this exact condition here as well. I thought the comment above explained that well, did you just miss it?

storage/store.go, line 2297 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If you want to enforce this (and I still think it's better to go through the regular code path when a raft group exists), it shouldn't be a Fatalf. This is possible in the real world if a replica is removed and re-added before it can be GC'd.

You're right about the `Fatalf`, of course. What do you mean by "going through the regular code path"? You want `raftGroup := r.mu.internalRaftGroup`? That makes the locking more complicated since now I have to (or at least should) lock `*Replica.mu` in one of the branches (but I can't hold the lock throughout since `handleRaftReady` needs it). Or you want to leave the `ReplicaID==0` code path? That's also tricky since we don't want to apply all the parts of the `Ready`. The case which you describe above is rare enough to discard the occasional snapshot, so I'm inclined to keep rejecting the snapshot in that case just to keep things stupid and simple. But I'm curious what you were asking for.

storage/store.go, line 2330 [r6] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what's the TODO here?

Turned this into a comment.

Comments from Reviewable

@tamird
Contributor

tamird commented Jul 18, 2016

Reviewed 6 of 6 files at r8.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bdarnell
Contributor

Why is #7830 a reason to block preemptive snapshots? I don't see how this makes us any more vulnerable to overlapping ranges than we were before.


Reviewed 1 of 7 files at r9, 6 of 6 files at r10.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


storage/store.go, line 2259 [r5] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I just didn't want to deal with this situation in the first version. Current version handles this (because it went back to never creating internalRaftGroup for a ReplicaIDReceiver).

This is better, but you're still dealing with the consequences of a raft group that already exists (the `groupExists` check above). It seems simpler to me to handle that by passing the snapshot to Step

storage/store.go, line 2266 [r5] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's after a snapshot which is applied. The snapshot could be discarded.
Well, I suppose it's rare enough to not worry about, even if it does happen every once in a blue moon.

Ah, that's subtle. Yes, if the snapshot is discarded and we have `Applied: 0`, then raft will attempt to replay the whole log into `Ready.CommittedEntries`. It will fail (and panic), because the log doesn't start at zero. (our `Replica.Entries` method is kind of stupid and will load all the log entries into memory before realizing this and returning an error).

I think this TODO has gotten lost.


storage/store.go, line 2265 [r6] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is really just guarding against messages at Term zero. Since we handcraft this message at the sender using the snapshot's term, it seems silly not to check for this exact condition here as well. I thought the comment above explained that well, did you just miss it?

Yeah, I missed that part of the comment. I still think this is unnecessarily strict and would prefer to just check that `req.Message.Term != 0` and let raft do the rest.

storage/store.go, line 2297 [r6] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You're right about the Fatalf, of course.
What do you mean by "going through the regular code path"? You want raftGroup := r.mu.internalRaftGroup? That makes the locking more complicated since now I have to (or at least should) lock *Replica.mu in one of the branches (but I can't hold the lock throughout since handleRaftReady needs it).
Or you want to leave the ReplicaID==0 code path? That's also tricky since we don't want to apply all the parts of the Ready.
The case which you describe above is rare enough to discard the occasional snapshot, so I'm inclined to keep rejecting the snapshot in that case just to keep things stupid and simple. But I'm curious what you were asking for.

I mean that we should only enter the preemptive snapshot branch if `req.ToReplica.ReplicaID == 0` _and_ `r.mu.internalRaftGroup == nil`. If there is a non-nil raft group already, we should go on to call `raftGroup.Step` instead of following the path for preemptive snapshots.
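
As a sketch of the suggested branching (illustrative only; the real code would consult `req.ToReplica.ReplicaID` and `r.mu.internalRaftGroup` directly):

```go
// usePreemptivePath: only take the dedicated preemptive-snapshot path when
// there is no replica ID and no existing Raft group; otherwise fall through
// to the regular path, where raftGroup.Step validates the snapshot.
func usePreemptivePath(replicaID uint64, groupExists bool) bool {
	return replicaID == 0 && !groupExists
}
```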

Comments from Reviewable

@tamird
Contributor

tamird commented Jul 19, 2016

Reviewed 1 of 7 files at r9, 6 of 6 files at r10.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


Comments from Reviewable

@tbg
Member Author

tbg commented Jul 19, 2016

Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


storage/store.go, line 2259 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is better, but you're still dealing with the consequences of a raft group that already exists (the groupExists check above). It seems simpler to me to handle that by passing the snapshot to Step

Hmm, I didn't think you were suggesting feeding the preemptive snapshot message to `internalRaftGroup.Step`. That would make it pop out of `handleRaftMessage`, and we talked earlier about not wanting `handleRaftReady` to process that message (and it's definitely not simpler to have two possible full-fledged code paths that a preemptive snapshot could take). Perhaps it's just FUD, but feeding a hand-crafted message into code paths that have every right to assume that the origin of the message is a real Raft peer doesn't sit well with me.

I'd prefer to let the preemptive snapshot die here if it comes in at an unexpected time, but I could be convinced if we understand the situation really, really well and doing it seems to have a worthwhile impact.


storage/store.go, line 2266 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Ah, that's subtle. Yes, if the snapshot is discarded and we have Applied: 0, then raft will attempt to replay the whole log into Ready.CommittedEntries. It will fail (and panic), because the log doesn't start at zero. (our Replica.Entries method is kind of stupid and will load all the log entries into memory before realizing this and returning an error).

I think this TODO has gotten lost.

PTAL

storage/store.go, line 2265 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, I missed that part of the comment. I still think this is unnecessarily strict and would prefer to just check that req.Message.Term != 0 and let raft do the rest.

Done.

storage/store.go, line 2297 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I mean that we should only enter the preemptive snapshot branch if req.ToReplica.ReplicaID == 0 and r.mu.internalRaftGroup == nil. If there is a non-nil raft group already, we should go on to call raftGroup.Step instead of following the path for preemptive snapshots.

See below, I'll need a bit of convincing that this is a) safe b) worth it.

storage/store.go, line 2327 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The solution to #7830 should let us fix this. Once we can reserve our spot in replicasByKey before applying the snapshot, we no longer need to hold processRaftMu throughout the snapshot application process, and we can use a Replica-level lock instead.

Pinged #7830 about this.

Comments from Reviewable

@tbg
Member Author

tbg commented Jul 19, 2016

Couldn't the fact that a preemptive snapshot can arrive roughly along with a corresponding Raft snapshot cause issues? Perhaps not (due to them having the same range), but still. If you're more comfortable re-enabling preemptive snapshots, I'm going to defer to that judgement (in which case I should begin fixing up the corresponding tests).


Review status: 1 of 7 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


Comments from Reviewable

@tbg
Member Author

tbg commented Jul 19, 2016

I seem to have introduced a test failure:

make stress PKG=./storage TESTS=TestGossipFirstRange TESTFLAGS='-test.timeout 10s'

fails after <1k iters. Just a reminder to myself to sort out, may not be a big deal.

@petermattis
Collaborator

@tschottdorf I just added TestGossipFirstRange. I'm curious how/where it is failing with your PR?

@tbg
Member Author

tbg commented Jul 19, 2016

@petermattis
Collaborator

Hmm, the test is failing at gossip_test.go:101 and then hanging when calling the deferred tc.Stopper().Stop(). I don't have an explanation for either problem. I'll try reproducing locally without your PR.

@tbg
Member Author

tbg commented Jul 19, 2016

It doesn't repro without the PR (at least not within 20k iters on GCE when it failed in <1k with the PR).

@petermattis
Collaborator

@tschottdorf You're correct that it doesn't repro without this branch. Do you need help investigating the failure?

@tbg
Member Author

tbg commented Jul 20, 2016

I'm sidetracked looking at Jepsen, so if you have spare capacity I'd appreciate it.


@petermattis
Collaborator

Ok. I can reproduce the problem locally using your branch.

@petermattis
Collaborator

The TestGossipFirstRange failure looks like a race in the test. I'm not seeing why it doesn't occur outside of this PR, since it should. The specific problem is that the changeReplicasTrigger for the replica addition can run after the lease is transferred to node 2 even though it logically occurs before it in the test. When this happens, we gossip the first range once more than the test expects. I don't think there is any easy way to get rid of this extra unnecessary gossip. Simplest approach is to fix the test.

petermattis added a commit to petermattis/cockroach that referenced this pull request Jul 20, 2016
@tbg
Member Author

tbg commented Jul 21, 2016

While I wait for @bdarnell's feedback, I'll be fixing the preemptive snapshot tests (tomorrow) since the general consensus seems to be that we do want to enable them again.

@bdarnell
Contributor

:lgtm: except for the TestStoreRangeRebalance change. We're probably not going to be able to make this non-flaky until we introduce some sort of acknowledgement for snapshots into the protocol, so I'd say we merge this with that test disabled and fix it in a separate PR.


Reviewed 6 of 6 files at r12.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


storage/store.go, line 2259 [r5] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Hmm, didn't think you were suggesting feeding the preemptive snapshot message to internalRaftGroup.Step. That would make it pop out of handleRaftMessage and we talked earlier about not wanting to handleRaftReady to process that message (and it's definitely not simpler to have two possible full-fledged code paths that a preemptive snapshot could take). Perhaps it's just FUD, but feeding a hand-crafted message into code paths that have all the right to assume that the origin of the message is a real Raft peer doesn't sit well with me.

I'd prefer to let the preemptive snapshot die here if it comes in at an unexpected time, but I could be convinced if we understand the situation really, really well and it seems to have a worthwhile impact doing it.

My thinking is that we're already committed to hand-crafting `raftpb.Messages` that raft can handle correctly (for coalesced heartbeats); this part worries me less than constructing the raft group with a bogus ID. But you're right that this is unlikely to do much good so it's safer to drop it.

Comments from Reviewable

Change the way in which preemptive snapshots are applied to use
a temporary Raft group. This lets Raft perform the verification
of the snapshot against its on-disk state and allows us to remove
some of the ad-hoc checks we have added recently.

Some fallout from this was making Raft config creation reusable,
trimming down `(*Replica).applySnapshot` and minor improvements
on the sending side.

Re-enabled preemptive snapshots, though the corresponding test
is still flaky and remains disabled. It is already tracked
separately.

See cockroachdb#6144.

The test as written invited flakiness and seemed to have
successfully procured it with the preemptive snapshot changes
when combined with a lease acquisition in the replica change trigger.
@tbg tbg merged commit 6c5d603 into cockroachdb:master Jul 25, 2016
@tbg tbg deleted the only-raft-snapshots branch July 25, 2016 11:12
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Aug 17, 2016
Introduce a synchronous mode for RaftTransport.

Abort the ChangeReplicas operation if the snapshot cannot be sent. This
required relaxing some of the sanity checks introduced in cockroachdb#7833 in order
to get all the tests passing reliably. Notably, a replica with an active
raft group can now accept preemptive snapshots.

Fixes cockroachdb#7819
Closes cockroachdb#8604

May address cockroachdb#8007, cockroachdb#8056, and cockroachdb#8594