
Range created by Split may conflict with Range created by multiraft #1644

Closed
es-chow opened this issue Jul 7, 2015 · 22 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

es-chow (Contributor) commented Jul 7, 2015

For a normal range split, the in-memory Range struct created by Split is created before any Raft message is received.
[diagram: 2015-07-07 10 20 35]
But in some corner cases, a Range created by Split may conflict with a Range created by multiraft on receipt of a Raft message.

1. Before the EndTransactionRequest(split_trigger) is applied, a Raft message from another node has already been received, so a Range struct is created by multiraft calling Store.GroupStorage, as on node 3 in the following picture.

[diagram: 2015-07-07 10 22 18]

2. Before the EndTransactionRequest(split_trigger) is applied, multiple Raft messages, including a MsgSnap, have already been received from another node; this Range then fails in Range.ApplySnapshot because its key space conflicts with an existing key space in the store, as on node 3 in the following picture.

[diagram: 2015-07-07 10 22 27]

3. Because applying a Raft command is asynchronous with respect to writing into storage, some Raft messages may be delayed in applying even on the Raft leader. This also causes key-space conflicts, as on node 3.

[diagram: 2015-07-07 10 22 38]

One way to resolve cases 1 and 2 is to intercept MsgVote and MsgApp messages when no multiraft group can be found in multiraft, and to delay creating the group and range until a MsgSnap is received whose RangeDescriptor does not conflict with the key space in the store; if it does conflict, no multiraft group or range is created, and we wait for the EndTransaction(split_trigger) to finish. But this does not work for case 3.
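To make the conflict test concrete, here is a minimal, hypothetical Go sketch of the check described above (the `rangeDescriptor` and `store` types and method names are illustrative stand-ins, not the actual cockroach APIs): a MsgSnap for an unknown group only creates the range when its descriptor does not overlap any existing range's key span.

```go
// Sketch: create a range for an incoming snapshot only if its key span does
// not overlap an existing range; otherwise drop it and wait for the split
// trigger. All names below are illustrative.
package main

import (
	"bytes"
	"fmt"
)

type rangeDescriptor struct {
	StartKey, EndKey []byte // half-open span [StartKey, EndKey)
}

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b rangeDescriptor) bool {
	return bytes.Compare(a.StartKey, b.EndKey) < 0 &&
		bytes.Compare(b.StartKey, a.EndKey) < 0
}

type store struct {
	ranges []rangeDescriptor // existing ranges on this store
}

// maybeCreateRangeForSnapshot applies the rule from the comment above: only
// create the range when the snapshot's descriptor conflicts with nothing.
func (s *store) maybeCreateRangeForSnapshot(desc rangeDescriptor) bool {
	for _, existing := range s.ranges {
		if overlaps(existing, desc) {
			return false // conflict: wait for EndTransaction(split_trigger)
		}
	}
	s.ranges = append(s.ranges, desc)
	return true
}

func main() {
	s := &store{ranges: []rangeDescriptor{{StartKey: []byte("a"), EndKey: []byte("z")}}}
	// Overlapping span (the Split case): dropped.
	fmt.Println(s.maybeCreateRangeForSnapshot(rangeDescriptor{StartKey: []byte("m"), EndKey: []byte("z")}))
	// Disjoint span (the ADD_REPLICA case): created.
	fmt.Println(s.maybeCreateRangeForSnapshot(rangeDescriptor{StartKey: []byte("z"), EndKey: []byte("zz")}))
}
```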

es-chow (Contributor, Author) commented Jul 7, 2015

"In this snapshot, Range1 still have key space [a,KeyMax)" in the right-bottom of the last picture should be "In this snapshot, Range1 still have key space [KeyMin, KeyMax)".

bdarnell (Contributor) commented Jul 7, 2015

What if we include the start and end keys of the range in a header of InternalRaftRequest, and ignore messages that refer to a range overlapping a range we already have? We might be able to limit this to just MsgSnap and MsgVote, but I think there's still a bit of an edge case with the MsgApp that precedes a MsgSnap.

tbg (Member) commented Jul 8, 2015

FYI, case 1 happens regularly in the Put acceptance test.

es-chow (Contributor, Author) commented Jul 8, 2015

We have implemented two alternatives.

1. Disable multiraft group creation on receipt of a Raft RPC message, so that Ranges and multiraft groups can only be created by the Store.
   A. The following code would be commented out:

```go
if _, ok := s.groups[req.GroupID]; !ok {
	log.Infof("node %v: got message for unknown group %d; creating it", s.nodeID, req.GroupID)
	if err := s.createGroup(req.GroupID); err != nil {
		log.Warningf("Error creating group %d: %s", req.GroupID, err)
		break
	}
}
```

   B. A new queue, range_creating_queue (analogous to replicate_queue), is used by the range's Raft leader to check whether any node's progress in Store.RaftStatus() for this range has Match equal to 0; if so, it sends an InternalAddRange message to that node/store (this message does not go through Raft, otherwise it would deadlock). The receiving node/store checks whether the RangeDescriptor in the InternalAddRange message conflicts with the key space in the store; if not, it creates the range and multiraft group, otherwise it simply discards the message.
   The ADD_REPLICA scenario is depicted in the following picture.
   [diagram: 2015-07-08 9 28 52]
   This method has the advantage that there is only one source that can trigger creation of a Range and multiraft group, so REMOVE_REPLICA can remove the range data and the multiraft group in Range.changeReplicasTrigger, and range_gc_queue is not required.
   The downside is that it is unclear whether the condition (Match == 0) for sending InternalAddRange is strong enough. Are there edge cases?

2. Intercept MsgVote/MsgApp/MsgSnap Raft messages in multiraft; if no multiraft group is found for such a message, special processing is required (see the sketch after this list).
   A. If no multiraft group is found for a MsgVote, send a valid MsgVoteResp back so a leader can be elected, and do not create a group.
   B. If no multiraft group is found for a MsgApp, send a rejecting MsgAppResp back so that a MsgSnap will be sent as the next message, and do not create a group.
   C. If no multiraft group is found for a MsgSnap, call Storage.CheckSnapshot to check whether the MsgSnap's RangeDescriptor overlaps the current store's key space. If there is no overlap, this may be the ADD_REPLICA case, so create a Range and multiraft group and step the message into Raft; if there is overlap, this may be the Split case, so discard it.
   ADD_REPLICA case:
   [diagram: 2015-07-08 9 29 14]

Split case:
[diagram: 2015-07-08 9 29 04]
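A rough Go sketch of option 2's dispatch follows, assuming hypothetical stand-in types (none of the method names below are multiraft's real API; Storage.CheckSnapshot is represented by the snapshotOverlapsStore stub):

```go
// Sketch of option 2: special-case MsgVote/MsgApp/MsgSnap when no group
// exists on this store. Types and helpers are illustrative stand-ins.
package main

type msgType int

const (
	msgVote msgType = iota
	msgApp
	msgSnap
)

type message struct {
	Type    msgType
	GroupID int64
	// For MsgSnap, the snapshot carries the proposed RangeDescriptor bounds.
	SnapStart, SnapEnd []byte
}

type store struct{} // stand-in for multiraft state (group map, range map, ...)

func (s *store) groupExists(id int64) bool            { return false } // stub
func (s *store) sendVoteResp(m message)               {}               // case A: reply so an election can complete
func (s *store) sendRejectAppResp(m message)          {}               // case B: forces the leader to send MsgSnap next
func (s *store) snapshotOverlapsStore(m message) bool { return false } // stand-in for Storage.CheckSnapshot
func (s *store) createGroupAndStep(m message)         {}               // case C: ADD_REPLICA path

// handleUnknownGroup applies cases A-C from the list above to a message
// addressed to a group that does not yet exist on this store.
func (s *store) handleUnknownGroup(m message) {
	if s.groupExists(m.GroupID) {
		return // group already exists; normal raft handling applies elsewhere
	}
	switch m.Type {
	case msgVote:
		s.sendVoteResp(m) // A: allow the election without creating a group
	case msgApp:
		s.sendRejectAppResp(m) // B: reject so the leader falls back to MsgSnap
	case msgSnap:
		if s.snapshotOverlapsStore(m) {
			return // C: likely the Split case; drop and wait for the split trigger
		}
		s.createGroupAndStep(m) // C: likely the ADD_REPLICA case
	}
}

func main() {}
```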

es-chow (Contributor, Author) commented Jul 8, 2015

@bdarnell, what do you mean by InternalRaftMessage?

bdarnell (Contributor) commented Jul 9, 2015

I meant proto.InternalRaftCommand, not InternalRaftMessage. My idea was similar to your second option, but by putting the keys in the command header we can validate the range in MsgVote in addition to MsgSnap.

Making InternalAddRange explicit might help with some of the issues around #768 too. But I don't think it will eliminate the need for range_gc_queue: we still need to be able to remove ranges from a node that was down when the replica was removed. I think that sending InternalAddRange whenever Match is 0 will be safe, but it might be wasteful: if a node is down we need to be able to retry the InternalAddRange when it comes up, and that duplicates the logic in raft itself for re-sending MsgSnap when it gets a MsgHeartbeatResp. The biggest risk I see with InternalAddRange is what happens when there is no leader: if MsgVote can never trigger group creation, then I'm afraid there may be cases where we can't complete an election.

es-chow (Contributor, Author) commented Jul 10, 2015

@bdarnell, you are right that InternalAddRange cannot proceed if there is no leader. For now we have to stick with the option of creating the group and range when a proper Raft message is received.
I don't follow the idea of "putting the keys in the command header": do you mean putting a common header such as a RangeDescriptor in proto.InternalRaftCommand? But proto.InternalRaftCommand is only encapsulated in MsgApp and does not exist in MsgVote.
Is there any issue if we just respond to a MsgVote with a MsgVoteResp unconditionally? Or we could add a RangeDescriptor member to RaftMessageRequest and let multiraft fill in the value before sending a MsgVote, so it can be checked when a MsgVote is received:

```go
// RaftMessageRequest wraps a raft message.
type RaftMessageRequest struct {
	GroupID         proto.RaftID
	RangeDescriptor *proto.RangeDescriptor
	Message         raftpb.Message
}
```

bdarnell (Contributor) commented:
Oops, yes, I meant RaftMessageRequest instead of InternalRaftCommand.

Unconditionally sending MsgVoteResp is risky. If we always send Reject=true, then we may not be able to elect a leader. If we send Reject=false, we must record this in our HardState for this range.
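For the Reject=false path, the important part is that the vote is durable before the response goes out. Below is a minimal sketch of that rule, with a locally defined hardState standing in for raft's persisted HardState (all names here are illustrative, not the actual multiraft or raftpb types):

```go
// Sketch: before granting a vote for a group we hold no in-memory state for,
// persist the (term, vote) pair so a restart cannot grant a conflicting vote.
package main

import "errors"

type hardState struct {
	Term uint64 // highest term seen
	Vote uint64 // node voted for in Term (0 = none)
}

type groupStorage interface {
	LoadHardState(groupID int64) (hardState, error)
	SaveHardState(groupID int64, hs hardState) error // must be durable before replying
}

// grantVote decides whether we may reply with Reject=false and, if so,
// records the vote durably first.
func grantVote(s groupStorage, groupID int64, term, candidate uint64) (bool, error) {
	hs, err := s.LoadHardState(groupID)
	if err != nil {
		return false, err
	}
	if term < hs.Term || (term == hs.Term && hs.Vote != 0 && hs.Vote != candidate) {
		return false, nil // stale term, or already voted for someone else: Reject=true
	}
	hs.Term, hs.Vote = term, candidate
	if err := s.SaveHardState(groupID, hs); err != nil {
		return false, errors.New("vote not persisted: " + err.Error())
	}
	return true, nil // safe to send MsgVoteResp with Reject=false
}

func main() {}
```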

es-chow (Contributor, Author) commented Jul 10, 2015

OK, we can do it like this.

spencerkimball (Member) commented:
@es-chow: @bdarnell and I have been discussing a more radical restructuring of things in order to eliminate the confusion we're currently dealing with in splitting and also in change replica updates. The idea is to create a new storage.StorageKey which would look like:

```go
type StorageKey struct {
	groupID         proto.RaftID
	replicaLogIndex uint64
	key             proto.Key
}
```

This composite key struct would be passed into the various storage/engine methods in place of proto.Key and would be used to provide absolute certainty about which data belongs to which range. The groupID and the replicaLogIndex would be prepended as part of proto.EncodedKey MVCC key values. It would mean that on splits and merges, we'd need to rewrite the data in the second half of the range to include the new groupID and replicaLogIndex. The replicaLogIndex for each replica would be stored in the RangeDescriptor; the replicaLogIndex is set to the Raft log index at which a replica was first added to the Raft group. This provides crucial support when avoiding situations where old data for a range which was rebalanced in the past conflicts with new data for a range rebalanced back to the same store at a future point in time.
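To illustrate the layout, here is a hypothetical sketch of how such a composite key could be encoded in front of the MVCC key (the fixed-width big-endian encoding below is purely illustrative, not the actual proto.EncodedKey format):

```go
// Sketch: prefix each MVCC key with the range's groupID and replicaLogIndex
// so that data written by different incarnations of a range can never collide.
package main

import (
	"encoding/binary"
	"fmt"
)

type storageKey struct {
	groupID         uint64 // raft group owning the key
	replicaLogIndex uint64 // log index at which this replica joined the group
	key             []byte // the logical MVCC key
}

// encode produces a byte slice that sorts first by group, then by replica
// incarnation, then by the logical key.
func (k storageKey) encode() []byte {
	buf := make([]byte, 16, 16+len(k.key))
	binary.BigEndian.PutUint64(buf[0:8], k.groupID)
	binary.BigEndian.PutUint64(buf[8:16], k.replicaLogIndex)
	return append(buf, k.key...)
}

func main() {
	old := storageKey{groupID: 2, replicaLogIndex: 10, key: []byte("m")}
	reAdded := storageKey{groupID: 2, replicaLogIndex: 57, key: []byte("m")}
	// Same logical key, but the two incarnations of range 2 encode differently.
	fmt.Printf("%x\n%x\n", old.encode(), reAdded.encode())
}
```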

Thoughts?

es-chow (Contributor, Author) commented Jul 14, 2015

This StorageKey solution will be helpful for #768, as it differentiates each incarnation of the range data in the store.
But is it possible to get the same result by, just before re-adding the same groupID back to the same store, cleaning up the range data of the old incarnation, using a stored LocalRaftRangeDescriptor to find the old RangeDescriptor from the groupID?
I am also still not clear how the StorageKey solution helps resolve the in-memory Range struct conflict, where one Range is created by a split and another may be created by a Raft message from a peer node. Would you please explain further?

bdarnell (Contributor) commented:
There are two parts to the proposal. One is the introduction of StorageKey, to guarantee that the snapshots arriving on range 2 don't affect data owned by range 1. The second is that splitTrigger must copy all the data from range 1 to range 2, and this copy is like handling an incoming snapshot (in fact, we may just have splitTrigger call ApplySnapshot). This copy will be skipped if the range has already been initialized.

tbg (Member) commented Jul 15, 2015

Could you briefly review how that solves all the racy problems in #768 (i.e. in short pseudo-code)? If we remove a replica and then receive a stale message, how do we prevent re-adding the group? How will receiving a Raft message before a split has been executed on a replica be handled in that case? Will new information be sent along with the Raft messages? You say about removal that

The key issue is that raft groups are created automatically on demand, but we cannot make the removal of the raft group atomic with the deletion of on-disk state, so there is a risk of the group being recreated as it is being deleted (and in fact for various reasons this happens fairly reliably in my tests).

It doesn't have to be atomic, though; isn't it enough to have the tombstone write precede the Raft removal, and couldn't we arrange for that?

I'm uncomfortable having ranges go from a logical slicing of the key space to something physically separated. For one, that makes the keyspace scattered (we'll never look at it without tooling, so that's probably OK, but finding a certain key just from the encoded keys would now mean a scan of everything), though that's mostly a matter of taste. I've also noticed that applying a Split can already take more than a second (I'm talking about applying the Raft command, nothing else; I haven't looked into why), and having to copy data around on top of that could be a price to pay.

If the StorageKey solution puts us on solid ground with all the issues we've been having and there's no simpler solution, I'm all for it. I don't understand it well enough to tell yet though.

es-chow (Contributor, Author) commented Jul 15, 2015

Another proposal for the removal-and-re-add issue: can we add a ConfChangeTerm concept?

1. ConfChangeTerm is specific to each node in a Raft group.
2. It is included in each Raft message.
3. Each time a ConfChange happens, the ConfChangeTerm for that node increases; the node list and the ConfChangeTerms of all nodes in that Raft group, including the one currently being removed, are included in the ConfChange Raft message.
4. When a Raft message is received from the network, the ConfChangeTerm in the message is compared with the current node's persisted ConfChangeTerm: if it is less, the message is outdated and is discarded; if it is equal, it is honored; if it is greater, all residual data is cleaned up and a snapshot is requested (see the sketch below).
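A minimal sketch of rule 4's comparison, with illustrative names only (ConfChangeTerm is a proposed concept, not an existing multiraft field):

```go
// Sketch of rule 4: compare the ConfChangeTerm carried by an incoming Raft
// message with the locally persisted one and act accordingly.
package main

type action int

const (
	discard          action = iota // message is from an outdated configuration
	honor                          // terms match: process normally
	resetAndSnapshot               // we are behind: drop residual data, request a snapshot
)

func classify(msgConfChangeTerm, localConfChangeTerm uint64) action {
	switch {
	case msgConfChangeTerm < localConfChangeTerm:
		return discard
	case msgConfChangeTerm == localConfChangeTerm:
		return honor
	default:
		return resetAndSnapshot
	}
}

func main() {}
```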

Please help check this.

es-chow (Contributor, Author) commented Jul 15, 2015

For the disruptive-node issue mentioned in #768 section 1, we could add a GroupConfChangeTerm which is incremented on each configuration change. Only a node with the latest GroupConfChangeTerm and log index can win a vote; the rejecting MsgVoteResp includes the latest GroupConfChangeTerm and node list, so the receiving node can remove itself from the Raft group based on that node list.

bdarnell (Contributor) commented:
@tschottdorf If we get a stale message we'll still re-add the group. The difference is that if a range is deleted and re-added, the new incarnation of the range will have a new StorageKey, so the GC of the old range will not race with the new one. It will no longer be possible for a deleted range incarnation to become alive again.

Regarding performance, one advantage of the new scheme would be that the copying could be asynchronous with respect to the original EndTransaction call (we just need to start the rocksdb snapshot during the split trigger).

Concretely, the plan is to change RangeDescriptor.Nodes to a list of (NodeID, Epoch) pairs, where Epoch is monotonically increasing. It will either be the raft log index, the database timestamp of the ChangeReplicas transaction, or a counter (with a NextEpoch field added to the RangeDescriptor). My first choice is to use the log index but I think it is difficult to get the index at the time that we need it. The Epoch will be sent with all raft messages; all messages addressed to a known range with an obsolete epoch will be dropped. Messages addressed to a known range with a higher epoch will be treated as a brand-new range; the lower-epoch copy of the range will be submitted for GC.
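A hypothetical sketch of that routing rule (the types below are illustrative stand-ins for the store's replica bookkeeping, not the actual implementation):

```go
// Sketch of the epoch rule described above: drop messages for a known range
// whose epoch is older than ours, and treat a higher epoch as a brand-new
// incarnation (queue the old copy for GC).
package main

type replicaState struct {
	epoch       uint64
	initialized bool
}

type disposition int

const (
	dropMessage  disposition = iota // stale epoch: sender refers to an old incarnation
	deliver                         // matching epoch: normal raft handling
	replaceAndGC                    // higher epoch: GC our copy, start a new incarnation
)

func routeByEpoch(local replicaState, msgEpoch uint64) disposition {
	switch {
	case !local.initialized:
		return deliver // no local incarnation yet; accept and initialize
	case msgEpoch < local.epoch:
		return dropMessage
	case msgEpoch == local.epoch:
		return deliver
	default:
		return replaceAndGC
	}
}

func main() {}
```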

bdarnell (Contributor) commented:
I've written up a design doc for this proposal and put it on the cockroach wiki.

One correction to my last comment: the copying cannot be asynchronous with respect to the EndTransaction call. The EndTransaction call will delete all of the data that no longer belongs in the left-hand range, so we cannot let that happen without also committing the copy to the new range.

tbg (Member) commented Oct 23, 2015

Have we made any progress on dealing with this?

bdarnell (Contributor) commented:
Yes, the entire ReplicaID/StorageKey series of RFCs was partially motivated by this issue. We now have all the major pieces in place so we should be able to fix this soon.

tbg (Member) commented Oct 23, 2015

I was really wondering what the fix would be, but the rejection notices in the storage key proposal were instructive:

This proposal was deemed too complex and expensive for the problem it
solves. Instead, we will drop snapshots whose application would create
a conflict in the rangesByKey map. This avoids the race conditions
in issue #1644, but leaves the range in an uninitialized and unusable
state. In the common case, this state will resolve quickly, and in the
uncommon case when it persists, we simply rely on the usual repair and
recovery process to move the replica to a new node.
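In other words, the adopted fix is a conflict check at snapshot-application time. A hypothetical sketch follows, with illustrative stand-ins for the store's rangesByKey lookup (none of these names are the actual Store implementation):

```go
// Sketch of the adopted approach: when a snapshot arrives for a not-yet-
// initialized replica and its descriptor would conflict with a range already
// registered in rangesByKey, drop it silently; the replica stays uninitialized
// and raft will re-send the snapshot after the local split has been applied.
package main

import "log"

type rangeDescriptor struct{ StartKey, EndKey []byte }

type replica struct {
	rangeID     int64
	initialized bool
}

type store struct{} // stand-in for the store's rangesByKey map, replicas map, ...

// conflictsWithExistingRange stands in for a lookup in rangesByKey; the real
// check is an interval-overlap test against existing ranges.
func (s *store) conflictsWithExistingRange(desc rangeDescriptor) bool { return false }

func (s *store) applySnapshotData(r *replica, desc rangeDescriptor) {} // stub

// maybeApplySnapshot drops conflicting snapshots instead of panicking with
// "range already exists"; non-conflicting ones initialize the replica.
func (s *store) maybeApplySnapshot(r *replica, desc rangeDescriptor) {
	if !r.initialized && s.conflictsWithExistingRange(desc) {
		log.Printf("range %d: dropping snapshot that would conflict with an existing range", r.rangeID)
		return // raft will retry once the split has executed locally
	}
	s.applySnapshotData(r, desc)
	r.initialized = true
}

func main() {}
```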

bdarnell added a commit to bdarnell/cockroach that referenced this issue Oct 29, 2015
When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.

The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.

Fixes cockroachdb#1644.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Oct 29, 2015
petermattis (Collaborator) commented:
FYI, I'm seeing the error `failed to update Store after split: ... range for Range ID X already exists on store` every time I run the TestPut acceptance test for a significant length of time. It sometimes takes several minutes for the error to appear, but it has appeared in all 5 successive runs:

```
cd ./acceptance
go test -tags acceptance -run 'TestPut$' -timeout 1210s -v -d 1200s -l .
```

tbg (Member) commented Oct 30, 2015

yeah, that's why this issue exists. #2944 should take care of it.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 3, 2015
Also check destination stopper in multiTestContext.rpcSend
bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 4, 2015