
RFC for segmented storage#1866

Merged
bdarnell merged 2 commits into cockroachdb:master from bdarnell:segmented-storage
Aug 4, 2015

Conversation

@bdarnell
Contributor

No description provided.

@tbg tbg added the PTAL label Jul 29, 2015
Member

s/comparaing/comparing/

@tbg
Member

tbg commented Jul 30, 2015

This RFC solves the correctness issue (at some cost) but it still leaves a performance issue open (kickstarting the new ranges without wasting resources on snapshots). I wonder whether this could be one of the rare occasions in which nailing the performance issue and the correctness issue amounts to the same fix.
The thought is the following:

  • having an uninitialized range (Raft group) around isn't an issue; it can be GC'ed and doesn't do anything (with proper code: currently that isn't true either - if the range gets created before the splitTrigger, there's an error).
  • it gets tricky if ownership over data is shared, and damage is done precisely when a snapshot is applied.
  • a snapshot contains the range descriptor, so the Store can check whether this swath of keyspace is currently in any active Range. If it is, then the snapshot must not be applied (maybe some other action taken; TBD).

So all the conflict resolution happens during the snapshotting process.
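The descriptor check could be sketched roughly like this (a minimal sketch: `RangeDescriptor` here is a simplified stand-in for the real roachpb type, and `canApplySnapshot` is a hypothetical helper, not an existing API):

```go
package main

import "fmt"

// RangeDescriptor is a simplified stand-in for the real descriptor:
// a half-open key span [StartKey, EndKey).
type RangeDescriptor struct {
	StartKey, EndKey string
}

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b RangeDescriptor) bool {
	return a.StartKey < b.EndKey && b.StartKey < a.EndKey
}

// canApplySnapshot checks an incoming snapshot's descriptor against the
// store's active ranges; a conflicting snapshot must not be applied.
func canApplySnapshot(snap RangeDescriptor, active []RangeDescriptor) bool {
	for _, r := range active {
		if overlaps(snap, r) {
			return false
		}
	}
	return true
}

func main() {
	active := []RangeDescriptor{{StartKey: "a", EndKey: "m"}}
	// Disjoint span: safe to apply.
	fmt.Println(canApplySnapshot(RangeDescriptor{StartKey: "m", EndKey: "z"}, active))
	// Overlapping span: must be dropped.
	fmt.Println(canApplySnapshot(RangeDescriptor{StartKey: "k", EndKey: "z"}, active))
}
```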

This is pretty inefficient, but segmented storage is even less efficient. If that deals with correctness, though, we've achieved the same thing already. And the key to performance is to make sure we don't actually send out snapshots too early, as @bdarnell suggested. If, during splitTrigger, we set the new group's leader to the old one's (this could be a little tricky to ensure nobody can vote again in the same term; we could also trigger votes, but that has its own problems) and additionally delay sending any snapshots for that group (maybe on the leader only, but it shouldn't matter) - something like transport.DelaySnapshot(raftID, time.Now().Add(time.Minute))? - we should get fairly close. The cherry on top could be queueing up a certain amount of Raft messages while the newly-split Range hasn't been splitTriggered yet, to make the transition truly seamless (this could obviate the need to even delay snapshots if it works reliably).

@bdarnell
Contributor Author

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

Discarding snapshots that conflict with an existing range would address the correctness issue, it is true. However, if this were our solution we could have availability problems: the replica receiving the snapshot would be a member of the range, but it would be unable to act as one until the conflict had been resolved. In most cases this just means waiting for the left-hand range to catch up and process the split. However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

Contributor

It would be valuable to define the notion of a Segment and what it is meant to represent. What follows in this discussion is the pure mechanism of how things work; explaining the notion of a segment would make it a more valuable read.

Contributor Author

A segment doesn't really represent anything; it's just a way of partitioning the keyspace. There is no conceptual backing here besides the mechanism.

@tbg
Member

tbg commented Jul 31, 2015

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

no, just that and having to rewrite on a merge (complexity aside).

However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

Hmm, good point. I'll have to think about that.

@bdarnell
Contributor Author

bdarnell commented Aug 3, 2015

So after thinking about this some more, I'm doubtful that segmented storage is worth it. The situation in which it improves availability by a significant amount is pretty rare: when a node has been down long enough for the left-hand range to be moved away, but not for the right-hand range. I think that the simpler solution suffices: when an incoming snapshot would conflict with an existing range, drop it. This will leave the range in an uninitialized state, in which it will vote in raft elections but not participate in committing log entries. We must make sure that this counts as "down" for the repair system, and when a snapshot is dropped we should send the left-hand range to the GC queue (to see if the local replica can be removed, resolving the conflict) and the right-hand range to the repair queue (to see if the local replica can be reassigned somewhere else).
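The drop-and-queue policy could be sketched like this (a minimal sketch: handleSnapshot, the span type, and the queue channels are illustrative stand-ins, not cockroach's actual queue APIs):

```go
package main

import "fmt"

// span is a simplified half-open key span [start, end); the real code
// would use a range descriptor.
type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// handleSnapshot applies non-conflicting snapshots. On conflict it drops
// the snapshot, queues the overlapping left-hand replica for GC (maybe the
// local replica can be removed, resolving the conflict), and queues the
// snapshot's range for repair (maybe the replica should move elsewhere).
func handleSnapshot(snap span, active []span, gcQueue, repairQueue chan<- span) string {
	for _, lhs := range active {
		if overlaps(snap, lhs) {
			gcQueue <- lhs
			repairQueue <- snap
			return "dropped"
		}
	}
	return "applied"
}

func main() {
	gc := make(chan span, 1)
	repair := make(chan span, 1)
	active := []span{{"a", "m"}}
	fmt.Println(handleSnapshot(span{"k", "z"}, active, gc, repair)) // overlaps: dropped
	fmt.Println(handleSnapshot(span{"m", "z"}, active, gc, repair)) // disjoint: applied
}
```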

If this proposal is dropped, #1867 also becomes obsolete.

@tbg
Member

tbg commented Aug 4, 2015

Sounds good. I had the same intuition, but I don't trust my estimate of the complexities in resolving the race. If you've reached that conclusion, I'm confident that this is worth going for (worst scenario, we're back to the above). Wanna discuss and formalize the race detection mechanisms and measures we can take to ensure a smooth handoff (avoiding elections by keeping leadership for the split groups, possibly queuing up Raft messages at the uninitialized Range until it receives the Split) in normal operation in a new RFC?

@vivekmenezes
Contributor

Still worth checking this in as a rejected RFC.

bdarnell added a commit that referenced this pull request Aug 4, 2015
@bdarnell bdarnell merged commit e8495c8 into cockroachdb:master Aug 4, 2015
@tbg tbg removed the PTAL label Aug 4, 2015
@bdarnell bdarnell deleted the segmented-storage branch August 4, 2015 20:12