Raft replay protection through enforced commit index #6961

Merged
merged 2 commits on Jun 11, 2016

Conversation

2 participants
@tschottdorf
Member

tschottdorf commented May 30, 2016

Replaces #6722. See #6287. Needs some polish and testing but worth a look.

Introduce the concept of a lease index for replay protection. The lease index
is similar to the Raft log index in some regards, but behaves differently:

  • multiple Raft commands may share a lease index
  • the lease index may jump (i.e. two adjacent commands in the Raft log may
    increase the lease index by more than one)

The need for a lease index arises from complications encountered when trying
to use the Raft log index for replay protection. With that approach, the
basic idea would be as follows:

  • when proposing a command, the Raft index at which it must commit is
    determined and stored into the command.
  • when applying a command, compare the log index against the desired index.
  • on mismatch, do not apply the command but suitably "repropose" it.

That last part is complicated: it requires us to accurately predict the
future log indexes at which commands will apply. We keep only loose tabs on
Raft leadership, so the leader lease holder (who proposes the vast majority
of Raft commands) may not be colocated with the Raft leader. Thus, it's very
hard to predict log positions reliably, especially when thrown off by a
spurious command proposed from elsewhere.

The lease index mitigates both of these issues: predicting the exact lease
index is not necessary (any larger index is permitted), and misplaced
commands don't increase the index.
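
For concreteness, a minimal toy sketch of the proposal-time side (the type
and method here are hypothetical stand-ins, not the actual diff; the real
counters are fields on the Replica, protected by its mutex):

  package main

  import "fmt"

  // replica is a toy stand-in for storage.Replica, reduced to the two
  // counters involved in lease index assignment.
  type replica struct {
      leaseAppliedIndex      uint64 // highest lease index applied so far
      lastAssignedLeaseIndex uint64 // highest index handed to a proposal
  }

  // assignLeaseIndex hands out the MaxLeaseIndex for a new proposal: catch
  // up past anything already applied (e.g. commands proposed by another
  // replica), then take the next slot. Any index above the applied one is
  // permitted, so gaps are fine.
  func (r *replica) assignLeaseIndex() uint64 {
      if r.lastAssignedLeaseIndex < r.leaseAppliedIndex {
          r.lastAssignedLeaseIndex = r.leaseAppliedIndex
      }
      r.lastAssignedLeaseIndex++
      return r.lastAssignedLeaseIndex
  }

  func main() {
      r := &replica{leaseAppliedIndex: 41}
      fmt.Println(r.assignLeaseIndex()) // 42: strictly above the applied index
  }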

Add checks in processRaftCommand that handle and assign the lease index.
First, we check whether the command was proposed by the current lease holder;
if it wasn't, it receives a NotLeaderError.
Each proposed command contains a MaxLeaseIndex field, which must be strictly
larger than the applied lease index for the command to apply successfully (in
which case the new lease index after application is that of the command). On
a mismatch, the command is "refurbished" (so named to distinguish it from a
mere reproposal), i.e. an identical command is proposed at a new (higher)
index.

The difference between refurbishing and reproposing is important: a
reproposal is issued when we're not sure that the proposal made it into Raft
in the first place. Since we want to guarantee that the client eventually
receives a response, we err on the side of caution and repropose fairly
liberally, but without updating the index to which the command is tied (so
that it really will commit only once, no matter how many copies we create).

A refurbished command must only be issued when we are certain that the
original command can no longer commit successfully, because its MaxLeaseIndex
has been surpassed by the applied lease index.
In this case, we propose a new command which can commit at a new, higher
index, and route its response to the waiting client.
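
Continuing the toy sketch above, the apply-time decision reduces to a single
comparison (simplified; the real processRaftCommand additionally verifies
that the proposer holds the lease):

  // tryApply reports whether a command carrying maxLeaseIndex may apply at
  // this point in the log. On success, the applied lease index jumps to the
  // command's MaxLeaseIndex (possibly skipping slots).
  func (r *replica) tryApply(maxLeaseIndex uint64) bool {
      if r.leaseAppliedIndex < maxLeaseIndex {
          r.leaseAppliedIndex = maxLeaseIndex
          return true
      }
      // The command's slot has been surpassed: it can never apply, so it
      // is safe to refurbish it (propose an identical command at a fresh,
      // higher index) rather than merely repropose it.
      return false
  }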


This change is Reviewable

Member

bdarnell commented Jun 1, 2016

This generally looks OK, but I'm very worried that we're going astray by introducing this much complexity with no clear abstraction boundaries. This is really complicated stuff, and outside of the context of this PR, it will be difficult to know where all the pieces related to refurbishment (etc) live. Are we missing some opportunity to encapsulate parts of the replica/raft interaction in a black box?

Previously, tschottdorf (Tobias Schottdorf) wrote…

Raft replay protection through enforced commit index

[… full PR description quoted above …]


Reviewed 5 of 7 files at r2, 7 of 7 files at r4.
Review status: all files reviewed at latest revision, 28 unresolved discussions, all commit checks successful.


keys/constants.go, line 100 [r2] (raw file):

  LocalRaftAppliedIndexSuffix = []byte("rfta")
  // LocalLeaseAppliedIndexSuffix is the suffix for the applied lease index.
  LocalLeaseAppliedIndexSuffix = []byte("rlla")

Move this after rll-. The intended policy is that these constants are sorted by their value (within a group) to help make sure we don't have any collisions. There used to be a comment about this but it seems to have gone missing.


roachpb/api.proto, line 802 [r2] (raw file):

  optional BatchRequest cmd = 3 [(gogoproto.nullable) = false];
  // When the command is applied, its result is an error if the lease log
  // counter has already reached (or exceeded) max_lease_index.

The definition of the lease index needs to be documented somewhere in the code, not just the commit message. This looks like as good a place as any.


storage/replica.go, line 159 [r2] (raw file):

  ctx context.Context
  // TODO(tschottdorf): idKey is legacy at this point: We could easily key
  // commands by their RequiredIndex, and doing so should be ok with a stop-

s/RequiredIndex/MaxLeaseIndex/g in all comments.


storage/replica.go, line 160 [r2] (raw file):

  // TODO(tschottdorf): idKey is legacy at this point: We could easily key
  // commands by their RequiredIndex, and doing so should be ok with a stop-
  // the-world migration. However, requires adapting tryCancel.

Did you mean tryAbandon or is there a tryCancel somewhere I'm not seeing?


storage/replica.go, line 1327 [r4] (raw file):

  replica roachpb.ReplicaDescriptor,
  ba roachpb.BatchRequest,
  delta uint64, // delta to be added to last assigned lease index, usually 1

When is it not 1? Looks like it's just in one test that passes 0. Should this be a bool instead? Or is there some other way to satisfy that test without adding extra arguments in production code?


storage/replica.go, line 1329 [r4] (raw file):

  delta uint64, // delta to be added to last assigned lease index, usually 1
) (*pendingCmd, error) {
  if r.mu.lastAssignedLeaseIndex < r.mu.leaseAppliedIndex {

When is this true? Would it be better to update lastAssignedLeaseIndex when we update leaseAppliedIndex?


storage/replica.go, line 1332 [r4] (raw file):

      r.mu.lastAssignedLeaseIndex = r.mu.leaseAppliedIndex + 1
  }
  if _, ok := ba.GetArg(roachpb.LeaderLease); !ok {

You can use the newIsLease method here.


storage/replica.go, line 1368 [r4] (raw file):

// - a channel which receives a response or error upon application
// - a closure used to attempt to abandon the command. When called, it tries
//     to pending command from the internal commands map. This is possible until

s/to/to remove/


storage/replica.go, line 1588 [r4] (raw file):

}

// pendingCmdSlice sorts by increasing RequiredIndex.

Would it be better to change pendingCmds from a map to a sorted slice, since we're only appending at the end? We'd need to add a state field to track commands when they become abandoned or when they are in flight and can no longer be abandoned.


storage/replica.go, line 1602 [r4] (raw file):

  // Thus we'd risk reproposing a command that has been committed but not yet
  // applied.
  maxCanRefurbish := r.mu.leaseAppliedIndex // indexes <= can be refurbished

s/can be refurbished/must be refurbished instead of reproposed/

Since we're now drawing a distinction between refurbish and repropose, this method should probably be given a different name to reflect the fact that it does both. (and I think this method's comments would be a good place to document the differences)


storage/replica.go, line 1615 [r4] (raw file):

      // Send to returned channel in case we weren't able to refurbish (in
      // which case the waiting client is on the receiving end).
      ch <- roachpb.ResponseWithError{Err: pErr}

This is a very weird pattern. refurbishPendingCmdLocked gives us a channel and an error, and we write the error back out on that channel? And the other call site assigns the returned channel to cmd.done. I understand what it's doing now but I had to study it for a bit - is there nothing else we could do? Maybe pass an argument into refurbishPendingCmdLocked to tell it whether it should send the error to the client?


storage/replica.go, line 1681 [r4] (raw file):

// refurbishPendingCmdLocked takes a pendingCmd which was discovered to apply
// at an illegal log position. It creates and proposes a new command and

/an illegal log position/a log position other than the one at which it was originally proposed (this can happen when the range lease is held by a raft follower, who must forward MsgProp messages to the raft leader without guaranteed ordering)./


storage/replica.go, line 1687 [r4] (raw file):

// (in which case the newly proposed command will report to the client).
// The command passed must have been deleted from r.mu.pendingCmds.
func (r *Replica) refurbishPendingCmdLocked(cmd *pendingCmd, delta uint64) (chan roachpb.ResponseWithError, *roachpb.Error) {

It looks like every call to this method passes 1 for delta.


storage/replica.go, line 1714 [r4] (raw file):

  }

  // TODO(tschottdorf): will clients know to handle the errors potentially

Which errors is this referring to?


storage/replica.go, line 1721 [r4] (raw file):

  r.mu.Lock()
  cmd := r.mu.pendingCmds[idKey]
  if r.mu.leaderLease.Replica != raftCmd.OriginReplica && !raftCmd.Cmd.IsLease() {

We now check that raftCmd.OriginReplica holds the lease in two places: here and in applyRaftCommandInBatch. Can we replace the latter now that we have this?


storage/replica.go, line 1733 [r4] (raw file):

      ctx = cmd.ctx
  }
  if cmd != nil {

Combine this if with the preceding one?


storage/replica.go, line 1743 [r4] (raw file):

      // proposals are often replayed, so not making them update the counter
      // makes sense from a testing perspective. The alternative is to set
      // RequiredIndex to leaseIndex+1 but that will likely throw off the

I would just remove this alternative.


storage/replica.go, line 1747 [r4] (raw file):

  } else if r.mu.leaseAppliedIndex < raftCmd.MaxLeaseIndex {
      // The happy case: the command is applying at or ahead of the minimal
      // permissible index. It's ok if it skips a few slots (as can happen

Why is this OK? Because we're relying on the command queue to ensure that conflicting commands are never pending at the same time? That assumption should be documented explicitly for the future when we decide to do something about that.


storage/replica.go, line 1774 [r4] (raw file):

          // Note that we keep the context as to not hide these internal
          // cycles from traces.
          if cmd.raftCmd.MaxLeaseIndex <= raftCmd.MaxLeaseIndex {

The variable names make this line incomprehensible. I don't have suggestions for better names, but at least add a comment that cmd.raftCmd is our local in-memory copy (which may have had its MaxLeaseIndex increased) while raftCmd is the command we got from raft and are attempting to apply now.


storage/replica_raftstorage.go, line 751 [r2] (raw file):

      // the snapshot.
      //
      // TODO(tschottdorf): Why does this not simply use appliedIndex?

They should be the same. We used snap.Metadata.Index before because we have it without an extra call to loadAppliedIndex, but now that we have to make that read anyway we might as well use it.


storage/replica_test.go, line 5591 [r4] (raw file):

}

func runWrongIndexTest(t *testing.T, repropose bool, withRefurbishError bool, expProposals int32) {

s/withRefurbishError/withErr/, to match the constants below? Otherwise this needs more explanation of what's going on. It's also unclear to me what repropose as a bool means.


storage/replica_test.go, line 5639 [r4] (raw file):

  }

  wrongIndex := ai - 1 // will chose this as RequiredIndex

s/chose/choose/ s/RequiredIndex/MaxLeaseIndex/


storage/replica_test.go, line 5653 [r4] (raw file):

      cmd, err := tc.rng.prepareRaftCommandLocked(
          context.WithValue(context.Background(), magicKey{}, "foo"),
          makeIDKey(), replica, ba, 0 /* no increment */)

Why no increment?


storage/replica_test.go, line 5655 [r4] (raw file):

          makeIDKey(), replica, ba, 0 /* no increment */)
      if err != nil {
          fatalf(err.Error())

Nit: the first argument to f functions should be a constant (what if the error message contains a %?)


storage/replica_test.go, line 5714 [r4] (raw file):

}

func TestReplicaRefurbishOnWrongIndex(t *testing.T) {

Delete or implement this test.


storage/replica_test.go, line 5719 [r4] (raw file):

// TestReplicaBurstPendingCommandsAndRepropose verifies that a burst of
// proposed commands assigns roughly the correct sequence of required indexes,

What exactly is "roughly correct" here?


storage/replica_test.go, line 5748 [r4] (raw file):

  expIndexes := make([]int, 0, num)
  for _, ch := range func() []chan roachpb.ResponseWithError {

Split this line up - give either the function or its result a name instead of making a loop expression too big to fit on one screen (and a lot of code indented under a for statement that is only run once)


storage/replica_test.go, line 5807 [r4] (raw file):

      return errors.New("still pending commands")
  })
}

Do we need an integration test that uses a custom raft transport to reorder MsgProp messages and verify that they still apply in order?


Comments from Reviewable


Member

bdarnell commented Jun 1, 2016

Also, looking ahead to leader-proposed raft, where will the execution of the command take place?


Member

tschottdorf commented Jun 1, 2016

TFTR! I will get back to the remaining comments in a bit.

Previously, bdarnell (Ben Darnell) wrote…

Also, looking ahead to leader-proposed raft, where will the execution of the command take place?


Review status: all files reviewed at latest revision, 28 unresolved discussions, all commit checks successful.


keys/constants.go, line 100 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Move this after rll-. The intended policy is that these constants are sorted by their value (within a group) to help make sure we don't have any collisions. There used to be a comment about this but it seems to have gone missing.

Done and added comment.

roachpb/api.proto, line 802 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The definition of the lease index needs to be documented somewhere in the code, not just the commit message. This looks like as good a place as any.

Done.

storage/replica.go, line 159 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/RequiredIndex/MaxLeaseIndex/g in all comments.

Done.

storage/replica.go, line 160 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Did you mean tryAbandon or is there a tryCancel somewhere I'm not seeing?

Meant `tryAbandon`, fixed.

storage/replica.go, line 1327 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

When is it not 1? Looks like it's just in one test that passes 0. Should this be a bool instead? Or is there some other way to satisfy that test without adding extra arguments in production code?

There was a real usage of this code, but it's since gone (though I need to page back in why). That usage was that when refurbishing during application, the applied lease index would only be updated _after_ the refurbishment, so you wanted to add an extra unit for the new command. Ah, I remember: that was when failed applications would also bump the lease index, which isn't the case in this version. Removed the code, the test is a little leaky but looks ok.

storage/replica.go, line 1329 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

When is this true? Would it be better to update lastAssignedLeaseIndex when we update leaseAppliedIndex?

If some other node proposes a lot of commands and we don't, we will have a lower `lastAssignedLeaseIndex` and so we'll catch up here. This should be a pretty ironclad way of making sure we're assigning decent indexes, but updating during lease application is also an option, though it's not necessary and doing it here seems good because that's also where this variable matters.
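
(Illustrative, in terms of the toy sketch in the PR description: a replica
whose applied index raced ahead catches up on the next assignment.)

  r := &replica{leaseAppliedIndex: 100, lastAssignedLeaseIndex: 5}
  fmt.Println(r.assignLeaseIndex()) // 101: catches up past the applied index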

storage/replica.go, line 1332 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

You can use the newIsLease method here.

Done.

storage/replica.go, line 1368 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/to/to remove/

Done.

storage/replica.go, line 1588 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Would it be better to change pendingCmds from a map to a sorted slice, since we're only appending at the end? We'd need to add a state field to track commands when they become abandoned or when they are in flight and can no longer be abandoned.

I considered that, but it seemed like something to consider separately. We rely on the map lookup in a few places, notably in `processRaftCommand`. We could do a linear search there, but in the usual case reproposals are rare and not in the critical path, so I went with this for now. Adding a TODO though.

storage/replica.go, line 1602 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/can be refurbished/must be refurbished instead of reproposed/

Since we're now drawing a distinction between refurbish and repropose, this method should probably be given a different name to reflect the fact that it does both. (and I think this method's comments would be a good place to document the differences)

I'd like to avoid that wording - it doesn't _have_ to be refurbished. If you reproposed it instead sometimes, it simply wouldn't have a chance of applying. But it's not inherently incorrect - what's incorrect is reproposing a command which could lead to a replay. I did `s/can/will/` and added more commentary.

storage/replica.go, line 1681 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

/an illegal log position/a log position other than the one at which it was originally proposed (this can happen when the range lease held by a raft follower, who must forward MsgProp messages to the raft leader without guaranteed ordering)./

Done.

storage/replica.go, line 1687 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It looks like every call to this method passes 1 for delta.

Removed.

storage/replica.go, line 1714 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Which errors is this referring to?

None in particular, just noted that this may let some errors pop up in places they weren't expected before (I think). Looked around a bit, but don't see a reason for this to be non-kosher, so removed the TODO.

storage/replica.go, line 1733 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Combine this if with the preceding one?

Done.

storage/replica.go, line 1743 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I would just remove this alternative.

Done.

storage/replica.go, line 1747 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why is this OK? Because we're relying on the command queue to ensure that conflicting commands are never pending at the same time? That assumption should be documented explicitly for the future when we decide to do something about that.

Kind of the whole point of keeping the lease applied index separate is that it allows relaxing where a command applies. In the Raft log, commands are applied in strict order without gaps. The lease applied index tracks the maximum applied index ever. So we don't need to guess the correct slot, we just need one higher (and then nothing lower can apply after). In practice, of course in 99.9% of the cases, the new one is going to be the old one + 1. I have commentary on this in various places already, so probably I misunderstood your question - this is inside of Raft, so everything is serialized and the command queue doesn't matter. If command 9 comes before 7, then 7 will fail (the applied index will be 9 at the time 7 tries to apply).
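
(Illustrative, in terms of the toy tryApply sketch in the PR description,
with a hypothetical starting applied index of 6:)

  r := &replica{leaseAppliedIndex: 6}
  fmt.Println(r.tryApply(9)) // true: applies, index jumps 6 -> 9
  fmt.Println(r.tryApply(7)) // false: 7 <= 9, must be refurbished above 9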

storage/replica.go, line 1774 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The variable names make this line incomprehensible. I don't have suggestions for better names, but at least add a comment that cmd.raftCmd is our local in-memory copy (which may have had its MaxLeaseIndex increased) while raftCmd is the command we got from raft and are attempting to apply now.

Does this help?
if localMaxLeaseIndex := cmd.raftCmd.MaxLeaseIndex; localMaxLeaseIndex <= raftCmd.MaxLeaseIndex {

storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

They should be the same. We used snap.Metadata.Index before because we have it without an extra call to loadAppliedIndex, but now that we have to make that read anyway we might as well use it.

Ok, doing that and removed the TODO.

I also added r.mu.lastIndex = appliedIndex - that seems like it should have been there all along?


storage/replica_test.go, line 5591 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/withRefurbishError/withErr/, to match the constants below? Otherwise this needs more explanation of what's going on. It's also unclear to me what repropose as a bool means.

The test is unfortunately a little involved. I'll look into refactoring it a little bit, but it's digging around pretty close to internals, so not sure I can get that far. Until then, I added commentary.

storage/replica_test.go, line 5639 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/chose/choose/ s/RequiredIndex/MaxLeaseIndex/

Done.

storage/replica_test.go, line 5653 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why no increment?

Because we're manually setting a wrong index below, in analogy to a command from another replica showing up uninvited. It doesn't make a difference (though the test outcome may change).

storage/replica_test.go, line 5655 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Nit: the first argument to f functions should be a constant (what if the error message contains a %?)

Done.

storage/replica_test.go, line 5714 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Delete or implement this test.

Done (deleted).

storage/replica_test.go, line 5719 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What exactly is "roughly correct" here?

`roughly` refers to the guesswork (during lease changes etc, it may be off a tad). Removed the word since it's mostly just confusing.

storage/replica_test.go, line 5748 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Split this line up - give either the function or its result a name instead of making a loop expression too big to fit on one screen (and a lot of code indented under a for statement that is only run once)

Done.

storage/replica_test.go, line 5807 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Do we need an integration test that uses a custom raft transport to reorder MsgProp messages and verify that they still apply in order?

They wouldn't apply in order (if I understand you correctly). You're picturing a follower sending two proposals (say it picks lease indexes 10 and 11) and they arrive in opposite order at the leader? In that case, 11 would commit first, would bump the applied lease index to 11, and the command at 10 would fail and would require a refurbishment.

Comments from Reviewable


Member

tschottdorf commented Jun 1, 2016

Review status: 2 of 7 files reviewed at latest revision, 28 unresolved discussions, some commit checks pending.


storage/replica.go, line 1721 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We now check that raftCmd.OriginReplica holds the lease in two places: here and in applyRaftCommandInBatch. Can we replace the latter now that we have this?

Done.

Comments from Reviewable


Member

tschottdorf commented Jun 1, 2016

Review status: 1 of 7 files reviewed at latest revision, 28 unresolved discussions, all commit checks successful.


storage/replica.go, line 1615 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is a very weird pattern. refurbishPendingCmdLocked gives us a channel and an error, and we write the error back out on that channel? And the other call site assigns the returned channel to cmd.done. I understand what it's doing now but I had to study it for a bit - is there nothing else we could do? Maybe pass an argument into refurbishPendingCmdLocked to tell it whether it should send the error to the client?

Agreed and changed. How's that?

Comments from Reviewable


Member

bdarnell commented Jun 3, 2016

Reviewed 1 of 7 files at r6, 5 of 6 files at r7.
Review status: 6 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


keys/constants.go, line 100 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done and added comment.

They're supposed to be sorted by value, not by name, to make collisions easy to spot. So the lease constants `rl*` go after the raft constants `rf*`.

storage/replica.go, line 1329 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If some other node proposes a lot of commands and we don't, we will have a lower lastAssignedLeaseIndex and so we'll catch up here. This should be a pretty ironclad way of making sure we're assigning decent indexes, but updating during lease application is also an option, though it's not necessary and doing it here seems good because that's also where this variable matters.

Hmm. I think I'd rather update it when applying the lease so we can always maintain the invariant that `lastAssignedLeaseIndex >= leaseAppliedIndex`. Although I'm not sure I completely understand what's going on with the +1 here. I would expect either `<=` and `+1` or `<` and no `+1`.

storage/replica.go, line 1747 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Kind of the whole point of keeping the lease applied index separate is that it allows relaxing where a command applies. In the Raft log, commands are applied in strict order without gaps. The lease applied index tracks the maximum applied index ever. So we don't need to guess the correct slot, we just need one higher (and then nothing lower can apply after). In practice, of course in 99.9% of the cases, the new one is going to be the old one + 1.
I have commentary on this in various places already, so probably I misunderstood your question - this is inside of Raft, so everything is serialized and the command queue doesn't matter. If command 9 comes before 7, then 7 will fail (the applied index will be 9 at the time 7 tries to apply).

The big reason to keep the lease applied index separate is that the raft log index may be increased by commands that are out of our control (leases and raft's internal election operations). But all commands that affect KV data _are_ under our control, so we could be stricter about them.

It's OK to apply 9 before 7 (and let 7 get refurbished to eventually apply at 10+) if they are completely independent of each other, and that's what my question was about. Currently that independence is guaranteed by the command queue (I think), but if we ever do something to improve pipelining it would be important to not let the commands be reordered (so if 9 came in before 7 we would need to refurbish them both).

Basically, my theory here is that it's better to start out strict and require that commands apply in strict order with no gaps (if out-of-order commands are going to be as rare as we think they are). We can then decide in the future whether we want to relax that strictness or take advantage of it.


storage/replica.go, line 1719 [r7] (raw file):

  cmd := r.mu.pendingCmds[idKey]

  catchesLeaseError := func() bool {

"Catches" in an error context sounds like java style try/catch, which is the opposite of what is intended here. returnLeaseError?


storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ok, doing that and removed the TODO.

I also added r.mu.lastIndex = appliedIndex - that seems like it should have been there all along?

That shouldn't be necessary - we return the new last index and then our caller (handleRaftReady) updates r.mu.lastIndex for us. (we may want to clean that up, but I think I'd add a call to batch.Defer in Replica.append instead of setting lastIndex here)

Comments from Reviewable

Member

bdarnell commented Jun 3, 2016

Reviewed 1 of 7 files at r6, 5 of 6 files at r7.
Review status: 6 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


keys/constants.go, line 100 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done and added comment.

They're supposed to be sorted by value, not by name, to make collisions easy to spot. So the lease constants `rl*` go after the raft constants `rf*`.

storage/replica.go, line 1329 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If some other node proposes a lot of commands and we don't, we will have a lower lastAssignedLeaseIndex and so we'll catch up here. This should be a pretty ironclad way of making sure we're assigning decent indexes, but updating during lease application is also an option, though it's not necessary and doing it here seems good because that's also where this variable matters.

Hmm. I think I'd rather update it when applying the lease so we can always maintain the invariant that `lastAssignedLeaseIndex >= leaseAppliedIndex`. Although I'm not sure I completely understand what's going on with the +1 here. I would expect either `<=` and `+1` or `<` and no `+1`.

storage/replica.go, line 1747 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Kind of the whole point of keeping the lease applied index separate is that it allows relaxing where a command applies. In the Raft log, commands are applied in strict order without gaps. The lease applied index tracks the maximum applied lease index ever seen. So we don't need to guess the correct slot, we just need one higher (and then nothing lower can apply after). In practice, of course, in 99.9% of cases the new one is going to be the old one + 1.
I have commentary on this in various places already, so probably I misunderstood your question - this is inside of Raft, so everything is serialized and the command queue doesn't matter. If command 9 comes before 7, then 7 will fail (the applied index will be 9 at the time 7 tries to apply).

The big reason to keep the lease applied index separate is that the raft log index may be increased by commands that are out of our control (leases and raft's internal election operations). But all commands that affect KV data _are_ under our control, so we could be stricter about them.

It's OK to apply 9 before 7 (and let 7 get refurbished to eventually apply at 10+) if they are completely independent of each other, and that's what my question was about. Currently that independence is guaranteed by the command queue (I think), but if we ever do something to improve pipelining it would be important to not let the commands be reordered (so if 9 came in before 7 we would need to refurbish them both).

Basically, my theory here is that it's better to start out strict and require that commands apply in strict order with no gaps (if out-of-order commands are going to be as rare as we think they are). We can then decide in the future whether we want to relax that strictness or take advantage of it.
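
To make the two policies concrete, here is a minimal sketch (assumed names and shapes, not the replica code) contrasting the relaxed check this PR implements with the strict one proposed above:

    package main

    import "fmt"

    // appliesRelaxed: a command applies as long as its MaxLeaseIndex still
    // exceeds the applied lease index; gaps and jumps are permitted.
    func appliesRelaxed(maxLeaseIndex, leaseAppliedIndex uint64) bool {
        return maxLeaseIndex > leaseAppliedIndex
    }

    // appliesStrict: a command must land in exactly the next slot; anything
    // else is refurbished, so there are no gaps and no reordering.
    func appliesStrict(maxLeaseIndex, leaseAppliedIndex uint64) bool {
        return maxLeaseIndex == leaseAppliedIndex+1
    }

    func main() {
        // Command 9 applied before command 7, so the applied index is 9.
        fmt.Println(appliesRelaxed(7, 9))  // false: 7 must be refurbished
        fmt.Println(appliesRelaxed(12, 9)) // true: jumping to 12 is fine
        fmt.Println(appliesStrict(12, 9))  // false: only 10 may apply next
    }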


storage/replica.go, line 1719 [r7] (raw file):

  cmd := r.mu.pendingCmds[idKey]

  catchesLeaseError := func() bool {

"Catches" in an error context sounds like java style try/catch, which is the opposite of what is intended here. returnLeaseError?


storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ok, doing that and removed the TODO.

I also added r.mu.lastIndex = appliedIndex - that seems like it should have been there all along?

That shouldn't be necessary - we return the new last index and then our caller (handleRaftReady) updates r.mu.lastIndex for us. (we may want to clean that up, but I think I'd add a call to batch.Defer in Replica.append instead of setting lastIndex here)

Comments from Reviewable

Member

tschottdorf commented Jun 5, 2016

Re: your earlier comments, I agree that it's a bit subtle (though the complexity is fairly well-contained within the methods dealing with *pendingCmd - the abstraction boundary is proposeRaftCommand). *Replica has been pretty muddled up ever since we broke down all the Raft-related abstractions. Maybe we can gain something by moving the Raft-related methods and fields into a substruct of r.mu?

In leader-proposed Raft, commands would ideally be executed in (or just before) proposeRaftCommand (i.e. refurbishments wouldn't have to re-execute).

Previously, tschottdorf (Tobias Schottdorf) wrote…

TFTR! I will get back to the remaining comments in a bit.


Review status: 6 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


keys/constants.go, line 100 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

They're supposed to be sorted by value, not by name, to make collisions easy to spot. So the lease constants rl* go after the raft constants rf*.

Whoops, moved the wrong one. Better now?

storage/replica.go, line 1329 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm. I think I'd rather update it when applying the lease so we can always maintain the invariant that lastAssignedLeaseIndex >= leaseAppliedIndex. Although I'm not sure I completely understand what's going on with the +1 here. I would expect either <= and +1 or < and no +1.

I think you understand what's going on - there's an off-by-one (now fixed). Can you elaborate why you prefer it the other way? The variable is only used here, so I don't see the upshot in manipulating it somewhere else and enforcing an invariant that can be localized here. Your alternative seems more brittle. For example, forgetting to update it in `applySnapshot` could (in theory) create a long, painful refurbishing loop while our accidentally zeroed counter tries to catch up.

storage/replica.go, line 1747 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The big reason to keep the lease applied index separate is that the raft log index may be increased by commands that are out of our control (leases and raft's internal election operations). But all commands that affect KV data are under our control, so we could be stricter about them.

It's OK to apply 9 before 7 (and let 7 get refurbished to eventually apply at 10+) if they are completely independent of each other, and that's what my question was about. Currently that independence is guaranteed by the command queue (I think), but if we ever do something to improve pipelining it would be important to not let the commands be reordered (so if 9 came in before 7 we would need to refurbish them both).

Basically, my theory here is that it's better to start out strict and require that commands apply in strict order with no gaps (if out-of-order commands are going to be as rare as we think they are). We can then decide in the future whether we want to relax that strictness or take advantage of it.

Ah, I see - you want to maximize the use we can get out of this to relax ordering constraints higher up (command queue), whereas I'm mostly worried about introducing a new source of reproposals/complications. I like the idea of preserving command ordering in Raft, though it'll add a bunch more complication here, which I was looking to avoid.

For example, consider the case in which two commands are proposed on the leader at lease index 11 and 12 (respectively), but the first command is immediately abandoned by the client (i.e. removed from pendingCmds), and also doesn't actually get proposed (for example due to a networking issue). The second command makes it in fine, however.
That leads to the situation in which only that second command applies. The leader has already assigned index 11 and 12, so all following commands will get 13, 14, ...

But index 11 will never be filled, so this effectively halts progress indefinitely - the commands at index 12 and up will come in, and we'll have no choice but to repropose them over and over, waiting for 11 to land first (which it never will); or we could at some point propose no-ops at the missing indexes to fill them up, but that could lead to 11 being refurbished (if it did land) to a later slot, effectively reversing order again - which is what you wanted to avoid in the first place.

So we need to adapt cancellation somehow in a way that keeps the pending command but realizes the client is gone. That's not hard (instead of removing the pending command, just nix out its context - use-after-finish of the trace is the main complication in cancellation), but it shows that requiring the exact index introduces the hard problem: we have to guarantee that all indexes we assign actually get filled.

Misplaced commands must continue to leave the applied lease index untouched (or they will force the legitimate command for their slot into a new one, causing reordering). So after a lease changes hands, the new lease holder must be forced to assign exactly the applied lease index + 1, or havoc ensues. This again isn't hard to do, but it is spooky that any code path which tips over this delicate balance is pretty fatal for progress.
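
A minimal sketch of that cancellation shape (pendingCmd is pared down to one field; everything here is assumed, not the actual code): the entry stays in the map so the index slot still gets filled, and only the client-facing context is swapped out.

    package main

    import (
        "context"
        "fmt"
    )

    // pendingCmd is a stand-in with just the field that matters here.
    type pendingCmd struct {
        ctx context.Context
    }

    var pendingCmds = map[string]*pendingCmd{}

    // abandonCmd detaches the client instead of deleting the entry, so the
    // command's lease index slot still gets filled when it applies.
    func abandonCmd(idKey string) {
        if cmd, ok := pendingCmds[idKey]; ok {
            cmd.ctx = context.Background() // avoids use-after-finish of the trace
        }
    }

    func main() {
        pendingCmds["cmd11"] = &pendingCmd{ctx: context.TODO()}
        abandonCmd("cmd11")
        fmt.Println(len(pendingCmds)) // 1: still pending, but nobody is waiting
    }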


storage/replica.go, line 1719 [r7] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Catches" in an error context sounds like java style try/catch, which is the opposite of what is intended here. returnLeaseError?

Went for `isLeaseError`.

storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That shouldn't be necessary - we return the new last index and then our caller (handleRaftReady) updates r.mu.lastIndex for us. (we may want to clean that up, but I think I'd add a call to batch.Defer in Replica.append instead of setting lastIndex here)

I agree that it shouldn't be necessary, but a method named `applySnapshot` should actually apply a snapshot and leave the replica in a well-defined state, no? I'm not sure about the `batch.Defer` here, either. If a later call to `append` also used a `batch.Defer`, they would run in the wrong order: we would want FIFO here, but it's LIFO. This would be cleaner if applying the snapshot used a separate batch (as you suggested in one of Pete's recent PRs). It looks like we return the lastIndex from this method only to avoid another acquisition of `r.mu`, which we can afford when we're applying the snapshot. So the callsite could just as well take the form

    if !raft.IsEmptySnap(rd.Snapshot) {
        batch = r.store.Engine().NewBatch()
        if err := r.applySnapshot(batch, rd.Snapshot); err != nil {
            batch.Close()
            return err
        }
        if err := batch.Commit(); err != nil {
            return err
        }
        r.mu.Lock()
        lastIndex = r.mu.lastIndex
        r.mu.Unlock()
        // TODO(bdarnell): update coalesced heartbeat mapping with snapshot info.
    }

I don't want to get into the refactor in this PR, but my point is that this does seem to be where we want to update lastIndex.


Comments from Reviewable

Member

tschottdorf commented Jun 6, 2016

Pushed a commit which explores the complexity of requiring the exact index. See commit message.

Previously, tschottdorf (Tobias Schottdorf) wrote…

Re: your earlier comments, I agree that it's a bit subtle (though the complexity is fairly well-contained within the methods dealing with *pendingCmd - the abstraction boundary is proposeRaftCommand). *Replica has been pretty muddled up ever since we broke down all the Raft-related abstractions. Maybe we can gain something by moving the Raft-related methods and fields into a substruct of r.mu?

In leader-proposed Raft, commands would ideally be executed in (or just before) proposeRaftCommand (i.e. refurbishments wouldn't have to re-execute).


Review status: 0 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


Comments from Reviewable

Member

tschottdorf commented Jun 6, 2016

We should be able to address the reservations noted in the commit message in two ways when leader-proposed Raft is active:

  • we re-execute the write set when refurbishing (i.e. to simulate the effect of a renewed call to addWriteCommand, including tsCache etc)
  • we guarantee that commands never need to be refurbished. Sounds a bit insane.

The first option is likely the better one (and the main difficulty is making it look natural in code).

Let me know what you think about that latest commit before I start polishing up.

Member

bdarnell commented Jun 6, 2016

*Replica has been pretty muddled up ever since we broke down all the Raft-related abstractions. Maybe we can gain something by moving the Raft-related methods and fields into a substruct of r.mu?

Possibly, but we'd first need to figure out the right abstraction boundaries. The difficulty of doing so was one of the reasons we tore down all the old abstractions. I don't have any great ideas for cleaning things up right now.

The first option is likely the better one (and the main difficulty is making it look natural in code).

Yeah, I've been assuming that when/if we do something about command queue blocking we'll have to re-execute the commands when refurbishing them. That's why I was asking earlier about when exactly the commands get executed.


Reviewed 1 of 7 files at r9, 4 of 6 files at r10, 2 of 2 files at r12.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica.go, line 1329 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think you understand what's going on - there's an off-by-one (now fixed).
Can you elaborate why you prefer it the other way? The variable is only used here, so I don't see the upshot in manipulating somewhere else and enforcing an invariant that can be localized here. Your alternative seems more brittle. For example, forgetting to update it in applySnapshot could (in theory) create a long painful refurbishing loop that could while our accidentally zero counter tries to catch up.

I don't feel strongly about it, but my thinking is that if we're concerned about (say) `applySnapshot` updating `leaseAppliedIndex` without `lastAssignedLeaseIndex`, then I'd rather fix that with a `setLeaseAppliedIndex()` method that updates both, instead of hiding the omission by moving `lastAssignedLeaseIndex` here. The latest commit makes `lastAssignedLeaseIndex` a little less local, although not yet in a way that would require duplicating this advancement elsewhere.
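
A sketch of what that helper might look like (names assumed, not the actual code): every path that moves the applied index, including applySnapshot, funnels through it, so the invariant holds by construction.

    package main

    import "fmt"

    // replicaState stands in for the fields under r.mu; names assumed.
    type replicaState struct {
        leaseAppliedIndex      uint64
        lastAssignedLeaseIndex uint64
    }

    // setLeaseAppliedIndex updates both counters together, maintaining
    // lastAssignedLeaseIndex >= leaseAppliedIndex no matter who calls it.
    func (r *replicaState) setLeaseAppliedIndex(idx uint64) {
        r.leaseAppliedIndex = idx
        if r.lastAssignedLeaseIndex < idx {
            r.lastAssignedLeaseIndex = idx
        }
    }

    func main() {
        r := &replicaState{}
        r.setLeaseAppliedIndex(42) // e.g. from applySnapshot
        fmt.Println(r.lastAssignedLeaseIndex >= r.leaseAppliedIndex) // true
    }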

storage/replica.go, line 1747 [r4] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ah, I see - you want to maximize the use we can get out of this to relax ordering constraints higher up (command queue), whereas I'm mostly worried about introducing a new source of reproposals/complications. I like the idea of preserving command ordering in Raft, though it'll add a bunch more complication here, which I was looking to avoid.

For example, consider the case in which two commands are proposed on the leader at lease index 11 and 12 (respectively), but the first command is immediately abandoned by the client (i.e. removed from pendingCmds), and also doesn't actually get proposed (for example due to a networking issue). The second command makes it in fine, however.
That leads to the situation in which only that second command applies. The leader has already assigned index 11 and 12, so all following commands will get 13, 14, ...

But index 11 will never be filled, so this effectively halts progress indefinitely - the commands at index 12 and up will come in, and we'll have no choice but to repropose them over and over, waiting for 11 to land first (which it never will); or we could at some point propose no-ops at the missing indexes to fill them up, but that could lead to 11 being refurbished (if it did land) to a later slot, effectively reversing order again - which is what you wanted to avoid in the first place.

So we need to adapt cancellation somehow in a way that keeps the pending command but realizes the client is gone. That's not hard (instead of removing the pending command, just nix out its context - use-after-finish of the trace is the main complication in cancellation), but it shows that requiring the exact index introduces the hard problem: we have to guarantee that all indexes we assign actually get filled.

Misplaced commands must continue to leave the applied lease index untouched (or they will force the legitimate command for their slot into a new one, causing reordering). So after a lease changes hands, the new lease holder must be forced to assign exactly the applied lease index + 1, or havoc ensues. This again isn't hard to do, but it is spooky that any code path which tips over this delicate balance is pretty fatal for progress.

Yeah, this stuff is trickier than it looks. I'm not sure if it's worth it to require exact index matches if it means we have to add more subtle logic to difficult-to-test error/cancellation paths to make sure things still line up.

storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I agree that it shouldn't be necessary, but a method named applySnapshot should actually apply a snapshot and leave the replica in a well-defined state, no?
I'm not sure about the batch.Defer here, either. If a later call to append also used a batch.Defer, they would run in the wrong order: we would want FIFO here, but it's LIFO.
This would be cleaner if applying the snapshot used a separate batch (as you suggested in one of Pete's recent PRs). It looks like we return the lastIndex from this method only to avoid another acquisition of r.mu, which we can afford when we're applying the snapshot. So the callsite could just as well take the form

    if !raft.IsEmptySnap(rd.Snapshot) {
        batch = r.store.Engine().NewBatch()
        if err := r.applySnapshot(batch, rd.Snapshot); err != nil {
            batch.Close()
            return err
        }
        if err := batch.Commit(); err != nil {
            return err
        }
        r.mu.Lock()
        lastIndex = r.mu.lastIndex
        r.mu.Unlock()
        // TODO(bdarnell): update coalesced heartbeat mapping with snapshot info.
    }

I don't want to get into the refactor in this PR, but my point is that this does seem to be where we want to update lastIndex.

SGTM. I don't think we're returning lastIndex to avoid a lock acquisition; we're returning it because we use it before the batch is committed, so the defer hasn't fired yet. But we should just be using a separate batch here and committing it before moving on to `r.append()`.

Comments from Reviewable

Member

tschottdorf commented Jun 9, 2016

Review status: 0 of 7 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/replica.go, line 1329 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't feel strongly about it, but my thinking is that if we're concerned about (say) applySnapshot updating leaseAppliedIndex without lastAssignedLeaseIndex, then I'd rather fix that with a setLeaseAppliedIndex() method that updates both, instead of hiding the omission by moving lastAssignedLeaseIndex here. The latest commit makes lastAssignedLeaseIndex a little less local, although not yet in a way that would require duplicating this advancement elsewhere.

I'm going to leave it as is for now, since the current semantics are "pick any number higher than the last assigned lease index" and we're not too concerned about updating this counter continuously.

storage/replica.go, line 1747 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, this stuff is trickier than it looks. I'm not sure if it's worth it to require exact index matches if it means we have to add more subtle logic to difficult-to-test error/cancellation paths to make sure things still line up.

Reverted this part.

storage/replica_raftstorage.go, line 751 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

SGTM. I don't think we're returning lastIndex to avoid a lock acquisition, we're returning it because we're going to use it before the batch is committed and so the defer hasn't fired. But we should just be using a separate batch here and committing it before moving on to r.append().

Done in https://github.com/cockroachdb/cockroach/pull/7135.

Comments from Reviewable

tschottdorf added some commits May 16, 2016

storage: refactor Raft proposal into components
Separating out code movement from the next step, which enforces the
MustIndex field in RaftCommand.
storage: add lease index for replay protection
See #6287.

Introduce the concept of a lease index for replay protection. The lease index
is similar to the Raft log index in some regards, but behaves differently:

* multiple Raft commands may share a lease index
* the lease index may jump (i.e. two adjacent commands in the Raft log may
  increase the lease index by more than one)

The need for a lease index arises from complications encountered when trying
to use the Raft index for replay protection. Trying that, the basic idea would
be as follows:

* when proposing a command, the Raft index at which it must commit is
  determined and stored into the command.
* when applying a command, compare the log index against the desired index.
* on mismatch, do not apply the command but suitably "repropose" it.

That last part is complicated: it requires us to accurately predict the future log
indexes at which commands will apply. We keep only loose tabs on Raft
leadership, so the leader lease holder (who proposes the vast majority of Raft
commands) may not be colocated with the Raft leader. Thus, it's very hard to
reliably predict the log positions, especially when thrown off by a spurious
command proposed from elsewhere.

The lease index mitigates both of these issues, since predicting the exact
lease index is not necessary (any larger index is permitted), and misplaced
commands don't increase the index.

Add checks in `processRaftCommand` that handle and assign the lease index.
First, we check whether the command was proposed by the current lease holder;
if it wasn't, the command fails with a NotLeaderError.
Each proposed command contains a MaxLeaseIndex field (which must be strictly
larger than the lease index for it to apply successfully, in which case the
new lease index after application is that of the command). On a mismatch, the
command is "refurbished" (named to distinguish it from a mere reproposal), i.e.
an identical command is proposed at a new (higher) index.

The difference between refurbishing and reproposing is important: A reproposal
is issued when we're not sure that the proposal made it into Raft in the first
place, i.e. we want to guarantee that the client eventually receives a
response, and we err on the side of caution in doing so by reproposing fairly
liberally, however without updating the index to which the command is tied (so
that it really will commit only once, no matter how many copies we create).

A refurbished command must only be issued when we are certain that the original
command cannot commit successfully any more because we see that its
MaxLeaseIndex has been surpassed by the applied lease index.
In this case, we propose a new command which can commit at a new, higher,
index, and route its response to the waiting client.
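
As a rough illustration of the repropose/refurbish distinction in the commit message above (the types and names here are invented for the sketch, not the actual code):

    package main

    import "fmt"

    // proposal is an invented stand-in with just the field that matters.
    type proposal struct {
        maxLeaseIndex uint64
    }

    // repropose resubmits the identical command: safe to do liberally, since
    // the unchanged MaxLeaseIndex means it still applies at most once.
    func repropose(p *proposal) *proposal { return p }

    // refurbish issues a copy tied to a new, higher index; legal only once
    // the applied lease index has surpassed the command's MaxLeaseIndex.
    func refurbish(p *proposal, nextLeaseIndex uint64) *proposal {
        return &proposal{maxLeaseIndex: nextLeaseIndex}
    }

    func main() {
        p := &proposal{maxLeaseIndex: 7}
        leaseAppliedIndex := uint64(9) // 7 <= 9: the original can never apply
        if p.maxLeaseIndex <= leaseAppliedIndex {
            p = refurbish(p, leaseAppliedIndex+1)
        }
        fmt.Println(p.maxLeaseIndex) // 10
    }
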
Member

tschottdorf commented Jun 10, 2016

Waiting for your final pass, @bdarnell. I'm going to leave the associated issue open, since the review of existing replay mechanisms is still outstanding (I think the TL;DR will be that we need to keep them, because not all replays are due to Raft, but I'd like to go through that separately).

Previously, bdarnell (Ben Darnell) wrote…

*Replica has been pretty muddled up ever since we broke down all the Raft-related abstractions. Maybe we can gain something by moving the Raft-related methods and fields into a substruct of r.mu?

Possibly, but we'd first need to figure out the right abstraction boundaries. The difficulty of doing so was one of the reasons we tore down all the old abstractions. I don't have any great ideas for cleaning things up right now.

The first option is likely the better one (and the main difficulty is making it look natural in code).

Yeah, I've been assuming that when/if we do something about command queue blocking we'll have to re-execute the commands when refurbishing them. That's why I was asking earlier about when exactly the commands get executed.


Review status: 0 of 7 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

Member

bdarnell commented Jun 11, 2016

:lgtm:

Previously, tschottdorf (Tobias Schottdorf) wrote…

Waiting for your final pass, @bdarnell. I'm going to leave the associated issue open, since the review of existing replay mechanisms is still outstanding (I think the TL;DR will be that we need to keep them, because not all replays are due to Raft, but I'd like to go through that separately).


Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

@tschottdorf tschottdorf merged commit e1f8b0c into cockroachdb:master Jun 11, 2016

3 checks passed

ci/circleci: Your tests passed on CircleCI!
code-review/reviewable: Review complete: 1 of 0 LGTMs obtained
licence/cla: Contributor License Agreement is signed.

@tschottdorf tschottdorf deleted the tschottdorf:lease-enforce-index branch Jun 11, 2016
