Skip to content

RFC: dismantle multiraft #3431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 16, 2015
Merged

Conversation

bdarnell
Copy link
Contributor

I've been thinking about how we can become confident that we've solved all the races involving multiraft, and have come to the conclusion that it may be best to just get rid of the abstraction and talk to etcd/raft directly from the storage package.

@BramGruneir @es-chow I'd like to get your feedback on this.

Review on Reviewable

@es-chow
Copy link
Contributor

es-chow commented Dec 14, 2015

Good idea, It should be possible and it may improve the concurrency significantly as we found the sigle goroutine model in multiraft is bottleneck under heavy load. As you mentioned this is a paradigm shift from channel-based design to mutex-based design, so few mutex locking and contention should be the focus.

For the thread model, I had the following doubts:

  1. when the Ready in raft will be checked, so it can be stablized to storage? It can be a single goroutine which processes the write task or check immediately.
  2. shall we stick to the single write goroutine model?
  3. shall we stick to the event chanel for pendingEvents? it may be possible to be removed.
  4. can we optimize the manipulation of the state.nodes in multiraft.go to few places, so we can avoid the scattered lock?

@BramGruneir
Copy link
Member

I agree it's a great idea. Some points.

  1. How many raft related gorountines are we talking about. 1 per store, 1 per replica? Or just the 1 per store that visits the replicas? Or did you have a different model in mind?
  2. How will the replica gc queue function? Perhaps it should be refactored onto the replica directly as well. After a replica is removed, it sticks around until after a some timeout and then is deleted and the replica struct is only at this point actually removed. Regardless, we will probably need a collection of states for each stage that a replica's at. Not just uninitialized/initialized. As relying on the different raft signals seems to be tricky.
  3. Can this guarantee the hard state and appliedIndex are updated in one place for each replica?
  4. I agree with @es-chow's point about removing the pendingEvents channel, but it might be tough.
  5. How much extra work will this be as we push for beta? Should we execute this change now, or after our beta release? (I'm leaning towards getting it done asap, but that decision will have to be defended.)

As a side concern, how tough will it be to try out different consensus algorithms? This is moving the raft logic deeper into our system, this might make this type of experimentation even harder in the future.

@tbg
Copy link
Member

tbg commented Dec 14, 2015

I'm also generally in favor. For moving complexity into ./storage, I would argue that the complexity is already there and only exacerbated by keeping tightly coupled items separate, as we do currently.
We should avoid goroutines on *Replica if we comfortably can. Otherwise we'll have to properly start and stop them, and I'd rather not have to do that.
I'm not worried about losing the ability to "try out" other consensus algorithms. I don't think we're losing it more than we already have.

@bdarnell
Copy link
Contributor Author

@es-chow:

  1. When do we check Ready? The same as in multiraft: we check RawNode.Ready() after each call to RawNode.Step() (since RawNode doesn't use channels we don't need to dedicate a goroutine for this)
  2. Single write goroutine? I think we'll probably do all the raft-related writes from the same goroutine(s) that process and apply the committed commands.
  3. pendingEvents channel? I'd like to get rid of this channel.
  4. Locking for state.nodes? state.nodes will be moved to the Store and protected by Store.mu. I think we will be able to centralize this and reduce the number of times we acquire the lock.

@BramGruneir:

  1. How many goroutines? We will definitely have one permanent goroutine for the store, and we will definitely not have a permanent goroutine per replica. At first, we'll probably do everything on the store's processRaft goroutine, but I'm tentatively thinking that we'll move to a temporary goroutine per replica which is created only when there is work for it to do.
  2. replicaGCQueue? I don't think we want to move all of this onto the Replica itself, but we should probably add some supporting methods to drain the replica. The biggest win is simply refactoring the locking so we can do everything atomically without deadlocks.
  3. HardState and AppliedIndex updated in one place? HardState and AppliedIndex will be written at different times, but this will bring them closer together and it should be easier to make sure that nothing else comes between them.
  4. pendingEvents? See above
  5. Now or later? That's a fair concern, but we've spent a lot of time dealing with concurrency bugs in the current code, and we're not done yet. I think this refactoring is our best chance of reliably resolving this class of bug.

Try out different algorithms? I think that the storage package is so tightly coupled to etcd/raft at this point that the MultiRaft abstraction doesn't even buy us the ability to experiment with different raft implementations, let alone other consensus algorithms entirely.

@BramGruneir
Copy link
Member

So this is LGTM, but I would like to see a little bit more depth to the design before we dive right in.

@tbg
Copy link
Member

tbg commented Dec 16, 2015

LGTM. I think the implementation details/discussion do{,es}n't need to be on the RFC.

@bdarnell
Copy link
Contributor Author

What kind of detail do you want? As for the process of getting from here to there, I don't think we're going to be able to avoid going in one big step. We can alleviate that a little bit by dropping some features during the move and re-adding them later (coalesced heartbeats, lazy group creation).

@BramGruneir
Copy link
Member

I was just thinking some details on the threading model is all. Specifically where each goroutine lives and its major function. I just think it could be fleshed out a little bit more is all.

@bdarnell
Copy link
Contributor Author

This RFC is just about removing a layer of abstraction without making too many changes to the model. There will still be one goroutine per store doing all the raft work (or two; I'm not sure yet whether it makes sense to fold everything into Store.processRaft in the initial phase or leave the two goroutines running in parallel). Once the move is done, we will consider changes to the threading model.

@bdarnell bdarnell force-pushed the dismantle-multiraft branch from 0bfcce6 to b367aa2 Compare December 16, 2015 19:39
@BramGruneir
Copy link
Member

SGTM

bdarnell added a commit that referenced this pull request Dec 16, 2015
@bdarnell bdarnell merged commit d825150 into cockroachdb:master Dec 16, 2015
@bdarnell bdarnell deleted the dismantle-multiraft branch December 16, 2015 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants