
After range splitting, the newly created range takes nearly 2 seconds to accept commands. #1384

Closed
es-chow opened this issue Jun 16, 2015 · 11 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@es-chow
Contributor

es-chow commented Jun 16, 2015

When running the following commands:

./cockroach range split a
./cockroach range split b

an error is printed:

split failed: range_command.go:793: range is already split at key "b" 

but the range split had actually succeeded.
In the log, the delay between the raft group's creation and its election is about 2 seconds (raftTick is 100ms and electionTimeoutTicks is 15, both defined in store.go, so the base timeout is 1.5s plus raft's randomization). Only after that delay will the newly created range accept commands like InternalResolveIntent, ConditionalPut, etc., while the AdminSplit retry fails with the aforementioned error.

I0616 21:22:26.578833   10431 raft.go:390] 100000001 became follower at term 5
I0616 21:22:26.578877   10431 raft.go:207] newRaft 100000001 [peers: [100000001,200000002,300000003], term: 5, commit: 10, applied: 10, lastindex: 10, lastterm: 5]
I0616 21:22:26.578934   10431 raft.go:620] 100000001 no leader at term 5; dropping proposal
I0616 21:22:28.151670   10431 txn_coord_sender.go:297] txn coordinator: 0.40 txn/sec, 100.00/0.00/0.00 %cmmt/abrt/abnd, 17ms/0/18ms avg/σ/max duration, 0.0/0.0/0.0 avg/σ/max restarts (2 samples)
I0616 21:22:28.549805   10431 raft.go:464] 100000001 is starting a new election at term 5
I0616 21:22:28.549870   10431 raft.go:403] 100000001 became candidate at term 6
I0616 21:22:28.549886   10431 raft.go:447] 100000001 received vote from 100000001 at term 6
I0616 21:22:28.549966   10431 raft.go:440] 100000001 [logterm: 5, index: 10] sent vote request to 200000002 at term 6
I0616 21:22:28.550048   10431 raft.go:440] 100000001 [logterm: 5, index: 10] sent vote request to 300000003 at term 6
I0616 21:22:28.551022   10431 raft.go:447] 100000001 received vote from 200000002 at term 6
I0616 21:22:28.551053   10431 raft.go:605] 100000001 [q:2] has received 2 votes and 0 vote rejections
I0616 21:22:28.551084   10431 raft.go:426] 100000001 became leader at term 6

Is there a way to shorten this time?

  1. Issue a leader election right after creating the raft group. When the server restarts, raft groups are created lazily in store.ProposeRaftCommand, so this would not cause a spike of elections on startup (see the sketch after the code below).
  2. Shorten the election timeout in raft, for example:
func (r *raft) isElectionTimeout() bool {
    // If there is no leader yet (e.g. a newly created range or an election
    // in progress), use a shorter randomized timeout.
    if r.lead == None {
        return r.elapsed > r.rand.Int()%r.electionTimeout
    }
    d := r.elapsed - r.electionTimeout
    if d < 0 {
        return false
    }
    return d > r.rand.Int()%r.electionTimeout
}
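
For option 1, here is a minimal sketch of campaigning right after creating a raft group, written against a recent etcd/raft Node API (the import path, config values, and peer IDs are illustrative, not the actual cockroach code, and the Ready-processing loop is omitted):

package main

import (
    "context"
    "log"

    "go.etcd.io/etcd/raft/v3"
)

func main() {
    storage := raft.NewMemoryStorage()
    cfg := &raft.Config{
        ID:              1,
        ElectionTick:    15, // matches electionTimeoutTicks in store.go
        HeartbeatTick:   1,
        Storage:         storage,
        MaxSizePerMsg:   4096,
        MaxInflightMsgs: 256,
    }
    n := raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
    defer n.Stop()

    // Force an election on this replica immediately instead of waiting for
    // the ~15-tick election timer to fire.
    if err := n.Campaign(context.TODO()); err != nil {
        log.Fatal(err)
    }
}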
@tbg
Member

tbg commented Jun 16, 2015

Yeah, we'll have to do something there. As you correctly noticed, the group takes a while to time out. We already have something in place so that a Raft group which starts up with a single replica elects a leader (itself) automatically, but for the range split that doesn't cut it.
We should investigate a way to force the replica on the node which carries out the AdminSplit (assuming it's one of the replicas) to campaign immediately. Currently this is difficult because etcd/raft panics when you tell a leader to campaign (because it's an invalid transition in vanilla Raft).

@tbg
Member

tbg commented Jun 16, 2015

The "is already split" error looks like a subsequent reinvocation of AdminSplit, so possibly you ran the same split command twice, or one of the scanner queues decided to split there as well (unlikely). If you can reproduce this, we should figure out what's happening there.

@bdarnell
Contributor

It should be pretty easy to force a campaign after a split, either on the node that ran AdminSplit or the current raft leader of the source range (probably at the end of Store.SplitRange). I can make the change to etcd/raft to remove the panic from Campaign().
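
For reference, a caller-side guard like the sketch below (maybeCampaign is a hypothetical helper, assuming etcd/raft's Node API) avoids the obvious panic, but since Campaign is asynchronous it cannot rule out a race with an election raft starts on its own, which is why changing etcd/raft itself is the cleaner fix:

package storage

import (
    "context"

    "go.etcd.io/etcd/raft/v3"
)

// maybeCampaign is a hypothetical helper, not cockroach's actual code: it
// only campaigns when this replica is not already the leader, mirroring the
// no-op behavior that would be added to etcd/raft's Campaign itself.
func maybeCampaign(ctx context.Context, n raft.Node) error {
    if n.Status().RaftState == raft.StateLeader {
        // A redundant Campaign currently panics inside etcd/raft.
        return nil
    }
    return n.Campaign(ctx)
}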

@tbg
Member

tbg commented Jun 17, 2015

That sounds like a good plan.

@es-chow
Contributor Author

es-chow commented Jun 17, 2015

@tschottdorf, the reason the "is already split" error appears is:

  1. Range 2, which was created by the split at "a", has not started an election, so no range leader is granted.
  2. send.go sends AdminSplit to the 3 replicas after SendNextTimeout.
  3. The raft election starts, and range leadership is granted to replica 1:1.
  4. Replica 1:1 executes AdminSplit and succeeds.
  5. Replica 2:2 replies to AdminSplit with NotLeaderError.
  6. dist_sender.go retries AdminSplit and sends it to the current leader, replica 1:1.
  7. The retried AdminSplit fails with "is already split".
    So the "is already split" error is harmless, but if the raft election started earlier, it could be avoided.
I0617 20:52:12.368865     749 send.go:171] Node.AdminSplit: sending request to local:8081: header:<timestamp:<wall_time:0 logical:0 > cmd_id:<wall_time:1434545532347514829 random:5721410891008325715 > key:"b" user:"root" replica:<node_id:3 store_id:3 > raft_id:2 read_consistency:CONSISTENT > split_key:"b"
I0617 20:52:12.874325     749 send.go:171] Node.AdminSplit: sending request to local:8082: header:<timestamp:<wall_time:0 logical:0 > cmd_id:<wall_time:1434545532347514829 random:5721410891008325715 > key:"b" user:"root" replica:<node_id:2 store_id:2 > raft_id:2 read_consistency:CONSISTENT > split_key:"b"
I0617 20:52:13.379796     749 send.go:171] Node.AdminSplit: sending request to local:8080: header:<timestamp:<wall_time:0 logical:0 > cmd_id:<wall_time:1434545532347514829 random:5721410891008325715 > key:"b" user:"root" replica:<node_id:1 store_id:1 > raft_id:2 read_consistency:CONSISTENT > split_key:"b"
I0617 20:52:14.139113     749 raft.go:464] 100000001 is starting a new election at term 5
I0617 20:52:14.139172     749 raft.go:403] 100000001 became candidate at term 6
I0617 20:52:14.139194     749 raft.go:447] 100000001 received vote from 100000001 at term 6
I0617 20:52:14.139284     749 raft.go:440] 100000001 [logterm: 5, index: 10] sent vote request to 300000003 at term 6
I0617 20:52:14.139470     749 raft.go:440] 100000001 [logterm: 5, index: 10] sent vote request to 200000002 at term 6
I0617 20:52:14.140435     749 raft.go:447] 100000001 received vote from 200000002 at term 6
I0617 20:52:14.140464     749 raft.go:605] 100000001 [q:2] has received 2 votes and 0 vote rejections
I0617 20:52:14.140499     749 raft.go:426] 100000001 became leader at term 6
I0617 20:52:14.143491     749 range_command.go:754] range 2: new leader lease replica 1:1 08:00:00.000 +1434545533.220s
I0617 20:52:14.143960     749 range_command.go:817] initiating a split of range=2 ["a" - "\xff\xff") at key "b"
I0617 20:52:14.146916     749 send.go:201] Node.AdminSplit: successful reply: header:<error:<message:"replica 2:2 not leader; leader is 1:1" retryable:false transaction_restart:ABORT detail:<not_leader:<replica:<node_id:2 store_id:2 > leader:<node_id:1 store_id:1 > > > > timestamp:<wall_time:0 logical:0 > >                                               
W0617 20:52:14.147011     749 dist_sender.go:523] failed to invoke AdminSplit: replica 2:2 not leader; leader is 1:1
I0617 20:52:14.147047     749 dist_sender.go:678] routing AdminSplit rpc failed an iteration: replica 2:2 not leader; leader is 1:1
I0617 20:52:14.147093     749 dist_sender.go:678] routing AdminSplit rpc failed; retrying immediately
I0617 20:52:14.147228     749 send.go:171] Node.AdminSplit: sending request to local:8080: header:<timestamp:<wall_time:0 logical:0 > cmd_id:<wall_time:1434545532347514829 random:5721410891008325715 > key:"b" user:"root" replica:<node_id:1 store_id:1 > raft_id:2 read_consistency:CONSISTENT > split_key:"b" 
I0617 20:52:14.174611     749 send.go:201] Node.AdminSplit: successful reply: header:<error:<message:"range_command.go:793: range is already split at key \"b\"" retryable:false transaction_restart:ABORT > timestamp:<wall_time:0 logical:0 > >   

@es-chow
Contributor Author

es-chow commented Jun 23, 2015

@bdarnell, currently the multiraft group is created lazily in Store.ProposeRaftCommand, so at the end of AdminSplit or Store.SplitRange the raft group of the newly created range does not exist yet; it may not be possible to call MultiRaft.Campaign.

@bdarnell
Contributor

There's no need to be lazy when we're splitting the range; we could just create the group immediately after the split (or let MultiRaft.Campaign do it for us).
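
A hypothetical sketch of that approach (the CreateGroup/Campaign signatures below are assumptions standing in for MultiRaft, not the actual API of the time):

package storage

// groupCampaigner stands in for the slice of MultiRaft we need; the real
// method signatures may differ.
type groupCampaigner interface {
    CreateGroup(raftID int64) error
    Campaign(raftID int64) error
}

// campaignAfterSplit would run at the end of Store.SplitRange: create the
// right-hand range's raft group eagerly and ask it to campaign, instead of
// waiting for the next proposal to create the group lazily.
func campaignAfterSplit(mr groupCampaigner, newRaftID int64) error {
    if err := mr.CreateGroup(newRaftID); err != nil {
        return err
    }
    return mr.Campaign(newRaftID)
}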

@es-chow
Contributor Author

es-chow commented Jul 21, 2015

@bdarnell, creating the raft group and campaigning immediately doesn't seem to work:

  1. If we only make the range leader campaign after the split:
    A. If the node of the range leader crashes after sending its vote, a leader will only be elected after the election timeout.
    B. If a receiving node hasn't finished the split, it will ignore the MsgVote, so the range leader has to wait for the election timeout (nearly 2 seconds) before issuing another vote.
    Case B may be resolved by your StorageKey solution, but case A still means the raft leader is elected about 2 seconds later.
  2. If we make all new replicas created by the split campaign, it also causes trouble.
    With 3 replicas, the replica that finishes the split first and campaigns will not succeed, since the other 2 haven't finished the split; the replica that finishes second will be elected leader with a vote from the first. But the replica that finishes last will send a vote that makes the elected leader step down, so the new leader will only be elected after another election timeout (nearly 2 seconds later).
    Another option is to issue a campaign when multiraft receives a proposal, there is no leader for that group, and it is not already campaigning. But in tests this sometimes does not work either: 2 of the 3 replicas had elected a leader, and the third replica kept sending MsgVote and stepping the leader down because there were too many proposals.
    So the proposed solution is to change isElectionTimeout in raft.go:
func (r *raft) isElectionTimeout() bool {
    // If there is no leader yet (e.g. a newly created range or an election
    // in progress), use a shorter randomized timeout.
    if r.lead == None {
        return r.elapsed > r.rand.Int()%r.electionTimeout
    }
    d := r.elapsed - r.electionTimeout
    if d < 0 {
        return false
    }
    return d > r.rand.Int()%r.electionTimeout
}

@tbg
Member

tbg commented Jul 21, 2015

I don't think we have to worry about the case in which a leader crashes just after the split. If that happens, it's about as bad as any crash.

In the regular case, the vote message, when received by a node which hasn't completed the split yet, has two possible outcomes:

  • the raft group is created: this is what we currently do and it breaks things since the splitTrigger fails. But that could be fixed. In this case, the campaigning node gets a vote back, so all is well.
  • the raft group is not created: in this case we would rely on the splitTrigger to create the Raft group for us. We could then enforce the leader as we please (with changes to Raft) for the initial term on each replica, which should be good enough to minimize delays.

@bdarnell is currently revisiting the Raft related races. I imagine he'll have more to say about this shortly.

@bdarnell
Contributor

Yeah, we don't need to worry about case 1A. A leader crashing always means we need to wait for election timeouts.

For 1B, currently a replica that hasn't processed a split can still vote, so campaigning immediately should work. Although once the election has finished and the first MsgApp is sent, it will be rejected and a snapshot will be sent instead, so campaigning too soon may cause problems with the split/snapshot conflict (once we fix that bug it will still be inefficient to send a snapshot if a split is in progress, since one or the other will be ignored).

We've talked in this thread about ignoring vote requests from unknown ranges, which would invalidate the previous paragraph, but I don't think we'd need to do that if we had storage keys. If we don't do storage keys and instead ignore vote requests from unknown ranges, then we will have to figure out how that interacts with early campaigns.

For case 2, we definitely don't want to cause every replica to campaign; that just increases the chances of split votes which require multiple election timeouts. If we can't pick one node to campaign then we may not want to change the election process at all.

Your modification to isElectionTimeout to choose a timeout from a smaller range during contended elections is similar to a proposal in Heidi Howard's Analysis of Raft Consensus paper (section 4.2), although she combined this with exponential backoff for losing nodes to counteract the increased frequency of split votes.
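
For illustration, a minimal sketch of that backoff scheme (made-up names, not etcd/raft code): a replica that keeps losing elections widens its randomized timeout range, which counteracts the extra split votes that shorter initial timeouts invite.

package raftutil

import "math/rand"

// electionTimer picks a randomized election timeout and backs off
// exponentially for replicas that keep losing elections.
type electionTimer struct {
    rnd       *rand.Rand
    baseTicks int // base election timeout, in raft ticks
    maxMult   int // cap on the backoff multiplier
    losses    int // consecutive lost elections
}

// nextTimeout returns a timeout in [baseTicks, baseTicks*(mult+1)) ticks,
// where mult doubles with every lost election up to maxMult.
func (t *electionTimer) nextTimeout() int {
    mult := 1 << t.losses
    if mult > t.maxMult {
        mult = t.maxMult
    }
    return t.baseTicks + t.rnd.Intn(t.baseTicks*mult)
}

func (t *electionTimer) electionLost() { t.losses++ }
func (t *electionTimer) electionWon()  { t.losses = 0 }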

@es-chow
Contributor Author

es-chow commented Jul 22, 2015

  1. In my tests, the delay with the modification to isElectionTimeout is larger than campaigning on the range leader after the split, provided the vote is accepted by the other replicas (done by adding a small delay, such as 10ms, before issuing the campaign).
  2. We may have to wait for the storage keys solution, since currently honoring the MsgVote causes a splitTrigger failure, as discussed in "Range created by Split may conflict with Range created by multiraft" #1644.

@tamird added the C-bug label Jul 22, 2015
@jess-edwards mentioned this issue Aug 18, 2015
@tbg added the C-enhancement label Oct 23, 2015
bdarnell added a commit to bdarnell/etcd that referenced this issue Nov 17, 2015
We need to be able to force an election (on one node) after creating a
new group (cockroachdb/cockroach#1384), but it is difficult to ensure
that our call to Campaign does not race with an election that may be
started by raft itself. A redundant call to Campaign should be a no-op
instead of a panic. (But the panic in becomeCandidate remains, because
we don't want to update the term or change the committed index in this
case)
bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 18, 2015
This minimizes the window of unavailability following a split.

Fixes cockroachdb#1384.