
RFC: copysets #32816

Merged
merged 1 commit into cockroachdb:master from mvijaykarthik:rfc on Jan 22, 2019

Conversation

@mvijaykarthik
Collaborator

commented Dec 4, 2018

Copysets reduce the probability of data loss in the
presence of multi node failures in large clusters.

This RFC describes how copysets can be generated and
ranges be rebalanced to reside within copysets.

Release note: None

@mvijaykarthik mvijaykarthik requested a review from cockroachdb/rfc-prs as a code owner Dec 4, 2018

@cockroach-teamcity

Member

commented Dec 4, 2018

This change is Reviewable

@petermattis
Contributor

left a comment

I've only skimmed the RFC so far. I may have brought this up before, but did you investigate chainsets? The claim is that they improve on some of the deficiencies of copysets. If you have investigated chainsets and found them lacking, it is worth adding a small note to the Alternatives section about them so that I don't ask this question again in the future.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@mvijaykarthik

Collaborator Author

commented Dec 4, 2018

@petermattis I couldn't find much info on how chainsets handle locality diversity (which is important for us). I'll add a link to this in the alternatives section.
But I believe once we have the framework of this RFC set up, it should be easy to try out different heuristics for copyset allocation. The proposed copyset-store allocation in the RFC optimizes purely for diversity (locality fault tolerance).

We have some ideas on how to make incremental changes to copysets, though they would potentially have lower diversity than the optimal copyset allocation. It's a strategy we are still polishing and plan to add once we have the initial working version out.

@mvijaykarthik mvijaykarthik force-pushed the mvijaykarthik:rfc branch 2 times, most recently from dd7521c to d646ad2 Dec 4, 2018

@bdarnell
Member

left a comment

Is there a detailed description of how exactly chainsets work? That blog post doesn't have much detail, and I haven't been able to find anything better. From [this comment], it sounds like hyperdex (which introduced chainsets) has moved to use tiered replication instead, so maybe we should be looking at that instead of chainsets.

The key idea of tiered replication seems to be to designate certain replicas as the "backup tier" and ensure that every copyset has two primary replicas and one backup. This is a concept I've generally been skeptical of in the past, but even if we remove that asymmetry, I think the way tiered replication handles these constraints could be a guide to how to integrate locality diversity into the process. Tiered replication also claims to address the incremental change problem.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Dec 8, 2018
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
@mvijaykarthik

Collaborator Author

commented Dec 10, 2018

@bdarnell Yes, we could take some learnings from tiered replication, but in their case, having 1 node from the backup tier is a hard constraint.

In cockroach, locality awareness is more of a soft constraint. If all nodes in the cluster belong to the same locality, we still want to allow copysets. Then, if another set of nodes joins the cluster with different localities, we want to edit the existing copysets to increase their diversity, since that is now possible (while minimizing data movement). We have some ideas here which we are experimenting with in parallel.

The strategy for copyset allocation could be a cluster setting so that we can experiment with different strategies and allow users to choose based on their requirements.

We benchmarked the current proposed strategy using a roachtest on 12 m4.2xlarge AWS nodes with 34 GB total physical data.

| | Copysets disabled | Copysets enabled (disk thresh 0.15) | Copysets enabled (disk thresh 0.03) |
| --- | --- | --- | --- |
| Stabilize before populate | 5m | 9m 30s | 8m |
| Populate | 2h 7m | 1h 46m | 1h 28m |
| Stabilize after populate | 3m | 6s | 6s |
| Decommission | 8m 36s | 19m | 8m |
| Stabilize after decommission | 2m 55s | 25m | 33m |
| Stabilize after node add | 30m | 41m | 45m |
| Disk usage | 9-13% | 6-14% | 8-11% |

Notes:

  1. "Stabilize" means no replica movement for the last 3 minutes.
  2. "Disk thresh" is the maximum idle score difference allowed for ranges to migrate across copysets.
  3. The node with the highest number of replicas is decommissioned.

The proposed algorithm is something we can start with while we think in parallel about incremental changes to copyset allocation.

mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Dec 10, 2018
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
@bdarnell
Member

left a comment

I'm concerned about the incompatibility between the first implementation of copysets and other features. Ultimately we need to support copysets in combination with other features, including load-based rebalancing and zone constraints (and eventually to make copysets the only option). It's fine to have an MVP that doesn't do these things yet, but we need to make sure that we can iterate from it towards the full combination.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 73 at r1 (raw file):

The requirements for copysets of a replication factor are
1. There should be no overlap of nodes between copysets.

This is only true if you assume scatter width of 1, right? The copysets paper allows for overlap ("likely...at most one overlapping node").


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

CS3: S3 S6 S9

The store list considered for copyset allocation would be the current live

Using which definition of liveness? "Time until store dead"? See #32949 for a recent bug caused by the wrong definition of liveness.


docs/RFCS/20181204_copysets.md, line 105 at r1 (raw file):

proposal.

Copyset allocation can be presisted as a proto in the distributed KV layer.

s/presisted/persisted/

This is a significant departure from our current stateless decentralized rebalance decision-making. How important is that centralization? If we just had each node run this algorithm on its own, it seems like it would work fairly well. There would be a little thrashing as the nodes transition to a new set of copysets, but that would be small in comparison to the complete rebalancing that is required by any copyset change.

OTOH, I could see how centralizing the process would make it easier to minimize the movement required for a copyset change. I think it's worth sketching out at least a little of what that would require (does it need persistence or can something analogous to consistent hashing be used instead?)


docs/RFCS/20181204_copysets.md, line 150 at r1 (raw file):

For rebalancing ranges into copysets, a new "copyset score" will be added to
the allocator. Priority wise it will be between (2) and (3) above. Zone
constraints and disk fullness take a higher priority over copyset score.

Are zone constraints and copysets effectively mutually exclusive, then? There won't be enough nodes left in the copyset after throwing out the ones that don't match the zone constraint (or maybe with higher scatter width it could work out?)


docs/RFCS/20181204_copysets.md, line 237 at r1 (raw file):

## Drawbacks
1. Copysets increase recovery time since only nodes within the copyset of a 

I'm concerned about the fact that node failure simultaneously requires recovery with a low scatter width and movement of nearly every replica in the cluster. All that rebalancing will compete for resources with the recovery.

From our example above, if S5 dies, its copyset peers S2 and S8 will no longer be in the same copyset in the next iteration. That's a nice (if unintentional) boost in scatter width for some fraction of recovery events.

@mvijaykarthik mvijaykarthik force-pushed the mvijaykarthik:rfc branch from d646ad2 to 6ca6720 Dec 11, 2018

@mvijaykarthik
Collaborator Author

left a comment

Yes. I am thinking some of the other features like load-based rebalancing can be added to the idle-score component of copyset score.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 73 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is only true if you assume scatter width of 1, right? The copysets paper allows for overlap ("likely...at most one overlapping node").

Yes. In the above section I mentioned that for simplicity we'll consider scatter width of 1. Added it here too.


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Using which definition of liveness? "Time until store dead"? See #32949 for a recent bug caused by the wrong definition of liveness.

Yes. We'll get live nodes from the store pool.


docs/RFCS/20181204_copysets.md, line 105 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/presisted/persisted/

This is a significant departure from our current stateless decentralized rebalance decision-making. How important is that centralization? If we just had each node run this algorithm on its own, it seems like it would work fairly well. There would be a little thrashing as the nodes transition to a new set of copysets, but that would be small in comparison to the complete rebalancing that is required by any copyset change.

OTOH, I could see how centralizing the process would make it easier to minimize the movement required for a copyset change. I think it's worth sketching out at least a little of what that would require (does it need persistence or can something analogous to consistent hashing be used instead?)

We wanted to keep it centralized and update it using transactions so that no node computes incorrect copysets, though you are correct: it's not theoretically necessary (unless each node has a different view of the live nodes). We added this so that we could use it for incremental changes to copysets and minimize thrashing as much as possible.

We are working on a strategy which does incremental changes to copysets. This will require us to know the previous state (which could get lost across restarts if we don't persist copyset mapping). The current state of copyset mapping is dependent on the order in which stores were discovered, so there's no deterministic way to re-compute this without persisting. We should have a design for this fleshed out by next week.


docs/RFCS/20181204_copysets.md, line 150 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Are zone constraints and copysets effectively mutually exclusive, then? There won't be enough nodes left in the copyset after throwing out the ones that don't match the zone constraint (or maybe with higher scatter width it could work out?)

For rebalancing, zone constraint score trumps copyset scores. For the MVP we haven't considered zone constraints, but my guess is we have to just tweak copyset-store allocation based on zone constraints. Rebalancing would remain this way.


docs/RFCS/20181204_copysets.md, line 237 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm concerned about the fact that node failure simultaneously requires recovery with a low scatter width and movement of nearly every replica in the cluster. All that rebalancing will compete for resources with the recovery.

From our example above, if S5 dies, its copyset peers S2 and S8 will no longer be in the same copyset in the next iteration. That's a nice (if unintentional) boost in scatter width for some fraction of recovery events.

There are two things we will do to improve this:

  1. Incremental changes to copyset-store allocation. This would minimize the number of replicas in the cluster which have to move.
  2. Copyset allocations with a higher scatter width (this will come later; it is not part of the MVP). In our internal implementation we have support for a higher scatter width in rebalancing, but we haven't thought of a way to do copyset-store allocation with a higher scatter width yet. It's in our TODO bucket.

I posted some benchmarks earlier on the performance of current strategies. It doesn't look too bad and can work on clusters which are not too heavily loaded.

@a-robinson
Collaborator

left a comment

Sorry for the delayed review over the holidays. This is potentially a huge improvement, but it's really going to be quite difficult to get working with existing rebalancing functionality and to build confidence about how it'll work for less simple deployments.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 73 at r1 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Yes. In the above section I mentioned that for simplicity we'll consider scatter width of 1. Added it here too.

Why are we talking about a scatter width of 1? You can't have a scatter width of 1 if NumReplicas is set to 3 or more because it would imply that everything is under-replicated.

Should this say "there should be no overlap of nodes between copysets for scatter width of NumReplicas - 1"?


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Yes. We'll get live nodes from the store pool.

That doesn't really answer Ben's question. Is it only considering nodes that are live (meaning they've had a successful liveness heartbeat in the last 9 seconds) or is it also considering nodes that are not live but also not yet considered dead (because their last successful liveness heartbeat was greater than 9 seconds ago but less than 5 minutes ago)?

Does it differ when you're up-replicating vs down-replicating vs rebalancing?


docs/RFCS/20181204_copysets.md, line 150 at r1 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

For rebalancing, zone constraint score trumps copyset scores. For the MVP we haven't considered zone constraints, but my guess is we have to just tweak copyset-store allocation based on zone constraints. Rebalancing would remain this way.

The issue is that different parts of the keyspace (different databases/tables/partitions) can have different constraints, meaning there will potentially be so many different copysets needed that maintaining a low scatter width isn't possible. My impression is similar to Ben's -- in general, copysets and zone constraints don't work well together. There will be many practical cases that they do (no zone constraints or minimal zone constraints), but ensuring that nothing breaks in the other cases will require a lot of extra care.


docs/RFCS/20181204_copysets.md, line 82 at r2 (raw file):

replication factor can be done as follows:
  1. Compute num_copysets = floor(num_stores/replication_factor)

How does this handle different parts of the keyspace having different replication factors?


docs/RFCS/20181204_copysets.md, line 84 at r2 (raw file):

2. Sort stores based on increasing order of locality.
3. Assign copysets to stores in a round robin fashion.

Assuming I understand what you mean by "Sort stores based on increasing order of locality", this won't make for optimal diversity. For example, consider the localities:

region=central,zone=a
region=central,zone=b
region=east,zone=a
region=east,zone=b
region=west,zone=a
region=west,zone=b

These are sorted, but if you assign copysets to stores in a round robin fashion, you'll get one copyset containing region=central,zone=a,region=central,zone=b,region=east,zone=a and another containing the other three, when clearly it'd be preferable (and more in line with current allocator decisions) to include one store from each region in each copyset.


docs/RFCS/20181204_copysets.md, line 100 at r2 (raw file):

The store list considered for copyset allocation would be the current live
stores. Whenever the store list changes, copysets will be re-computed.

How much data movement will be involved in each store list change?


docs/RFCS/20181204_copysets.md, line 142 at r2 (raw file):

  ranges whose movement will cause the stats (like disk usage / writes per
  second) of a range to move away from the global mean.

These stats are no longer considered, they were removed from the code prior to the 2.1 release. The only stat currently considered by the replicate queue is the range count.


docs/RFCS/20181204_copysets.md, line 144 at r2 (raw file):

  second) of a range to move away from the global mean.
5. Balance score difference: Balance score is the normalized utilization of a 
  node. It considers number of ranges, disk usage and writes per second. Nodes 

Ditto


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

2. The copysets the range is in are under-utilized. We want each copyset to 
  be equally loaded. 
  If a range is completely contained in a copyset `x` we should move the range

Where is this decision going to be made? On the lone node that specifies the copysets? Or in a distributed manner within each node's replicate queue?


docs/RFCS/20181204_copysets.md, line 176 at r2 (raw file):

1. Homogeneity score: `Number of pairwise same copyset id / (r choose 2)`
2. Idle score: This score is proportional to how "idle" a store is. For 
starters we can consider this to be % disk free. We want ranges to migrate to 

Using disk fullness as such an important metric is quite divergent from how allocation decisions have ever worked in cockroach -- the previous disk fullness stat was never enabled by default. And with the way that a cockroach node can accumulate a bunch of old data on it and then clear it out via the compaction queue and rocksdb compactions, there's reason to believe that disk fullness isn't a very stable metric in any cluster that doesn't have large amounts of data in it.


docs/RFCS/20181204_copysets.md, line 193 at r2 (raw file):

idle score of `y` differs by more than `d` (configurable).
If `d` is too small, it could lead to thrashing of replicas, so we can use a
value like 15%.  

I suspect 15% is going to be too low in practice, especially due to the effect of compactions as mentioned above


docs/RFCS/20181204_copysets.md, line 198 at r2 (raw file):

scores of two copysets in the cluster.

For example, if idle score of `x` is `a` and `y` is `a + d`, we require:

Is the computation of the idle score ever defined anywhere? I don't see it


docs/RFCS/20181204_copysets.md, line 221 at r2 (raw file):

(x x x) -> (x x y) -> (x y y) -> (y y y)

The above migration will not happen if y has an idle score of 0.34 (since

What happens if y's idle score oscillates back and forth between 0.34 and 0.36? It seems that some amount of cushion is needed here.


docs/RFCS/20181204_copysets.md, line 231 at r2 (raw file):

When a range actually migrates from `(x x x)` to `(x x y)`, it goes
through an intermediate step `(x x x y)` after which one `x` is
removed, but similar math applies.

Similar math applies only as long as the node making the removal decision has the same stats on it as the node making the addition decision. Which isn't always the case if the original leaseholder is the one being removed (because it has to transfer its lease elsewhere before it can be removed). This is another reason for requiring an additional cushion around the decision.


docs/RFCS/20181204_copysets.md, line 279 at r2 (raw file):

## Testing scenarios
Apart from unit tests, roachtests can be added which verify copyset based

I'd add changes of constraints to this list as well. It's going to be incredibly hard to test this in all the different scenarios it'll end up getting used in. People can do all sorts of weird things to their clusters.

@mvijaykarthik mvijaykarthik force-pushed the mvijaykarthik:rfc branch from 6ca6720 to be8f951 Jan 3, 2019

@mvijaykarthik
Collaborator Author

left a comment

True. It may be a while before we can enable it by default, but for simple cluster configurations it's going to be a big win.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 73 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why are we talking about a scatter width of 1? You can't have a scatter width of 1 if NumReplicas is set to 3 or more because it would imply that everything is under-replicated.

Should this say "there should be no overlap of nodes between copysets for scatter width of NumReplicas - 1"?

Sorry, I meant scatter width of rf - 1


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That doesn't really answer Ben's question. Is it only considering nodes that are live (meaning they've had a successful liveness heartbeat in the last 9 seconds) or is it also considering nodes that are not live but also not yet considered dead (because their last successful liveness heartbeat was greater than 9 seconds ago but less than 5 minutes ago)?

Does it differ when you're up-replicating vs down-replicating vs rebalancing?

We consider nodes returned by storePool.getStoreList(roachpb.RangeID(0), storeFilterNone). Stores will be considered dead if there are no heartbeats for > 5 minutes.


docs/RFCS/20181204_copysets.md, line 150 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The issue is that different parts of the keyspace (different databases/tables/partitions) can have different constraints, meaning there will potentially be so many different copysets needed that maintaining a low scatter width isn't possible. My impression is similar to Ben's -- in general, copysets and zone constraints don't work well together. There will be many practical cases that they do (no zone constraints or minimal zone constraints), but ensuring that nothing breaks in the other cases will require a lot of extra care.

True. Right now zone constraints aren't handled.


docs/RFCS/20181204_copysets.md, line 82 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

How does this handle different parts of the keyspace having different replication factors?

Copysets are generated for each replication factor. A range chooses copysets based on its replication factor.


docs/RFCS/20181204_copysets.md, line 84 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
2. Sort stores based on increasing order of locality.
3. Assign copysets to stores in a round robin fashion.

Assuming I understand what you mean by "Sort stores based on increasing order of locality", this won't make for optimal diversity. For example, consider the localities:

region=central,zone=a
region=central,zone=b
region=east,zone=a
region=east,zone=b
region=west,zone=a
region=west,zone=b

These are sorted, but if you assign copysets to stores in a round robin fashion, you'll get one copyset containing region=central,zone=a,region=central,zone=b,region=east,zone=a and another containing the other three, when clearly it'd be preferable (and more in line with current allocator decisions) to include one store from each region in each copyset.

If copysets are assigned in a round robin fashion, we get

Zone                      Copyset
region=central,zone=a        1
region=central,zone=b        2 
region=east,zone=a           1
region=east,zone=b           2
region=west,zone=a           1
region=west,zone=b           2

Copyset1: region=central,zone=a, region=east,zone=a, region=west,zone=a
Copyset2: region=central,zone=b, region=east,zone=b, region=west,zone=b

This seems fine, no?

This strategy doesn't work that well if the number of stores in each locality varies widely. We have another strategy, still in progress, which does better in such scenarios. We'll post it once it's complete. But this strategy seems good to start with.
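To make this concrete, here is a minimal illustrative sketch of the generation steps quoted above (num_copysets = floor(num_stores / replication_factor), sort by locality, round-robin assignment). The `store` type and `assignCopysets` helper are hypothetical, not the actual CockroachDB implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// store is a simplified stand-in for a store descriptor (the real
// implementation would use roachpb.StoreDescriptor).
type store struct {
	id       int
	locality string // e.g. "region=east,zone=a"
}

// assignCopysets computes num_copysets = floor(num_stores / replication_factor),
// sorts stores by locality, and hands out copyset IDs round robin, so stores
// that share a locality prefix end up in different copysets.
func assignCopysets(stores []store, replicationFactor int) map[int]int {
	numCopysets := len(stores) / replicationFactor
	if numCopysets == 0 {
		numCopysets = 1
	}
	sorted := append([]store(nil), stores...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].locality < sorted[j].locality
	})
	assignment := make(map[int]int, len(sorted)) // store ID -> copyset ID
	for i, s := range sorted {
		assignment[s.id] = i%numCopysets + 1
	}
	return assignment
}

func main() {
	stores := []store{
		{1, "region=central,zone=a"}, {2, "region=central,zone=b"},
		{3, "region=east,zone=a"}, {4, "region=east,zone=b"},
		{5, "region=west,zone=a"}, {6, "region=west,zone=b"},
	}
	// With 6 stores and RF=3 this gives two copysets: copyset 1 gets the
	// zone=a store of each region and copyset 2 the zone=b stores, matching
	// the example above.
	fmt.Println(assignCopysets(stores, 3))
}
```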


docs/RFCS/20181204_copysets.md, line 100 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

How much data movement will be involved in each store list change?

With the current allocation strategy (round robin) a lot. But we are working on another strategy where data movement is minimized (at the cost of degraded diversity) and are seeing promising performance (of node decommission and node add).


docs/RFCS/20181204_copysets.md, line 142 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
  ranges whose movement will cause the stats (like disk usage / writes per
  second) of a range to move away from the global mean.

These stats are no longer considered, they were removed from the code prior to the 2.1 release. The only stat currently considered by the replicate queue is the range count.

Done.


docs/RFCS/20181204_copysets.md, line 144 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ditto

Done.


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Where is this decision going to be made? On the lone node that specifies the copysets? Or in a distributed manner within each node's replicate queue?

It'll be made in the replicate queue by the current leaseholder of the range. The scoring function facilitates this transfer.


docs/RFCS/20181204_copysets.md, line 176 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Using disk fullness as such an important metric is quite divergent from how allocation decisions have ever worked in cockroach -- the previous disk fullness stat was never enabled by default. And with the way that a cockroach node can accumulate a bunch of old data on it and then clear it out via the compaction queue and rocksdb compactions, there's reason to believe that disk fullness isn't a very stable metric in any cluster that doesn't have large amounts of data in it.

The threshold will be much larger than the size of a range, so compactions shouldn't have much impact. We've tried this out in our internal cluster (an 8 node cluster with ~3GB of metadata in each) and in a stress test with 48 nodes and a total of 240GB of metadata, and it seemed stable (no thrashing with a workload running).
We can tweak this with something like num ranges, qps, etc. later. This is just something to start with.


docs/RFCS/20181204_copysets.md, line 193 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I suspect 15% is going to be too low in practice, especially due to the effect of compactions as mentioned above

I see, this will be a tweakable cluster setting. Once we have a version of copysets out, we can tweak and improve the idle score by adding more parameters.
For example, if a range is configured to be 64MB and we are using a 64GB Cockroach partition, 15% gives a buffer of 9.6GB, which should be enough even with compactions (right?).


docs/RFCS/20181204_copysets.md, line 198 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is the computation of the idle score ever defined anywhere? I don't see it

Idle score is just % disk free in v1.


docs/RFCS/20181204_copysets.md, line 221 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

What happens if y's idle score oscillates back and forth between 0.34 and 0.36? It seems that some amount of cushion is needed here.

That'll happen when a range migrates to y. In that case the x x x -> x x y step for some range would have completed. The migration from x x y -> x y y does not need the idle scores to differ by 0.15 (even a smaller threshold will work now).
So as long as 0.15 * (cockroach data partition size) >> range size, there will be no thrashing.
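As a purely illustrative sketch of the v1 idle score (% disk free) and the threshold `d` discussed in this thread; the names and signatures below are made up for the sketch, not the real allocator code:

```go
package main

import "fmt"

// idleScore returns the fraction of a store's disk capacity that is free;
// in the v1 proposal this is the only input to the idle score.
func idleScore(capacityBytes, usedBytes int64) float64 {
	if capacityBytes <= 0 {
		return 0
	}
	return float64(capacityBytes-usedBytes) / float64(capacityBytes)
}

// shouldStartMigration reports whether a range fully contained in copyset x
// should start migrating towards copyset y: only if y's idle score exceeds
// x's by more than the configurable threshold d (e.g. 0.15). Per the reply
// above, once the first replica has moved (x x y), completing the migration
// does not require the full threshold again, which avoids thrashing as long
// as d * (data partition size) is much larger than a range.
func shouldStartMigration(idleX, idleY, d float64) bool {
	return idleY-idleX > d
}

func main() {
	// Hypothetical numbers: x is 80% full, y is 45% full on 64GB partitions.
	x := idleScore(64<<30, 51<<30) // ~0.20
	y := idleScore(64<<30, 29<<30) // ~0.55
	fmt.Println(shouldStartMigration(x, y, 0.15)) // true
}
```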


docs/RFCS/20181204_copysets.md, line 231 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Similar math applies only as long as the node making the removal decision has the same stats on it as the node making the addition decision. Which isn't always the case if the original leaseholder is the one being removed (because it has to transfer its lease elsewhere before it can be removed). This is another reason for requiring an additional cushion around the decision.

To start with, the only stat we are using is % disk free. Would all nodes have a similar enough view of this stat?


docs/RFCS/20181204_copysets.md, line 279 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd add changes of constraints to this list as well. It's going to be incredibly hard to test this in all the different scenarios it'll end up getting used in. People can do all sorts of weird things to their clusters.

Done.

@a-robinson
Collaborator

left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

We consider nodes returned by storePool.getStoreList(roachpb.RangeID(0), storeFilterNone). Stores will be considered dead if there are no heartbeats for > 5 minutes.

That means that it won't consider nodes whose liveness has expired but who aren't yet dead. I expect that's not what we want -- we don't want to re-generate the copysets that a node was a part of until the node is considered dead. A single liveness heartbeat timeout shouldn't cause copysets to change.


docs/RFCS/20181204_copysets.md, line 82 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Copysets are generated for each replication factor. A range chooses copysets based on its replication factor.

Do the copysets for different replication factors have to be aligned with each other in any way?


docs/RFCS/20181204_copysets.md, line 84 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

If copysets are assigned in a round robin fashion, we get

Zone                      Copyset
region=central,zone=a        1
region=central,zone=b        2 
region=east,zone=a           1
region=east,zone=b           2
region=west,zone=a           1
region=west,zone=b           2

Copyset1: region=central,zone=a, region=east,zone=a, region=west,zone=a
Copyset2: region=central,zone=b, region=east,zone=b, region=west,zone=b

This seems fine, no?

This strategy doesn't work that well if the number of stores in each locality varies widely. We have another strategy, still in progress, which does better in such scenarios. We'll post it once it's complete. But this strategy seems good to start with.

Sorry, I misunderstood. You're right.


docs/RFCS/20181204_copysets.md, line 100 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

With the current allocation strategy (round robin) a lot. But we are working on another strategy where data movement is minimized (at the cost of degraded diversity) and are seeing promising performance (of node decommission and node add).

Could you include a little detail about this strategy (or other potential strategies) in the relevant alternatives section below?


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

It'll be made in the replicate queue by the current leaseholder of the range. The scoring function facilitates this transfer.

Are copysets going to be gossiped, then? I'm wondering how it'll be ensured that each node knows the most recently assigned copysets in a timely fashion.


docs/RFCS/20181204_copysets.md, line 176 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

The threshold will be much larger than the size of a range, so compactions shouldn't have much impact. We've tried this out in our internal cluster (an 8 node cluster with ~3GB of metadata in each) and in a stress test with 48 nodes and a total of 240GB of metadata, and it seemed stable (no thrashing with a workload running).
We can tweak this with something like num ranges, qps, etc. later. This is just something to start with.

Ack, just letting you know that it'll probably need to be changed. The potentially long delays before compaction happens makes disk space really tough to go off of. For instance, even after rebalancing away many of its ranges a node doesn't actually reclaim any disk space for a while, meaning that there will be a tendency to rebalance too many ranges away because its disk usage doesn't improve quickly enough.


docs/RFCS/20181204_copysets.md, line 193 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

I see, this will be a tweakable cluster setting. Once we have a version of copysets out, we can tweak and improve the idle score by adding more parameters.
For example, if a range is configured to be 64MB and we are using a 64GB Cockroach partition, 15% gives a buffer of 9.6GB, which should be enough even with compactions (right?).

Ah, 15% of the total disk space may work. I interpreted the 15% number as being 15% of the mean disk space used (e.g. if 10GB of a 1TB disk was being used, I thought you meant 1.5GB, not 150GB), or of the mean disk used fraction.

You might want to clarify that in the doc unless I just missed it.


docs/RFCS/20181204_copysets.md, line 198 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Idle score is just % disk free in v1.

Ack, you might want to clarify that in the doc unless I just missed it.


docs/RFCS/20181204_copysets.md, line 221 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

That'll happen when a range migrates to y. In that case the x x x -> x x y step for some range would have completed. The migration from x x y -> x y y does not need the idle scores to differ by 0.15 (even a smaller threshold will work now).
So as long as 0.15 * (cockroach data partition size) >> range size, there will be no thrashing.

So you're saying that after the move to x y y that the range will strongly prefer moving to x y y rather than back to x x x? Even though x x x would have a much better homogeneity score?


docs/RFCS/20181204_copysets.md, line 231 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

To start with, the only stat we are using is % disk free. Would all nodes have a similar enough view of this stat?

Usually, yeah. It only causes non-ideal decisions in edge cases, but those should self-correct quickly.

@mvijaykarthik mvijaykarthik force-pushed the mvijaykarthik:rfc branch from be8f951 to 3d91bd9 Jan 8, 2019

@mvijaykarthik
Collaborator Author

left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That means that it won't consider nodes whose liveness has expired but who aren't yet dead. I expect that's not what we want -- we don't want to re-generate the copysets that a node was a part of until the node is considered dead. A single liveness heartbeat timeout shouldn't cause copysets to change.

Yes. Changing it to only re-generate copysets if store list is stable for 3 ticks.


docs/RFCS/20181204_copysets.md, line 82 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Do the copysets for different replication factors have to be aligned with each other in any way?

It is good to have, but not part of the MVP. In the current strategy this is not the case, but that shouldn't be too bad since higher RFs have more failure tolerance; RF=3 is the most vulnerable.
Mentioned it in the doc.


docs/RFCS/20181204_copysets.md, line 100 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Could you include a a little detail about this strategy (or other potential strategies) in the relevant alternatives section below?

Yes. We just finished implementing a strategy which minimizes data movement and it had promising results. Added the strategy.


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Are copysets going to be gossiped, then? I'm wondering how it'll be ensured that each node knows the most recently assigned copysets in a timely fashion.

Copysets will be persisted by the node with the lowest node ID and read by the other nodes periodically (every 10s). This is described in the section "Copyset re-generation".

Changes to copysets are rare; I think the 10s hit is something we can live with.


docs/RFCS/20181204_copysets.md, line 176 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ack, just letting you know that it'll probably need to be changed. The potentially long delays before compaction happens makes disk space really tough to go off of. For instance, even after rebalancing away many of its ranges a node doesn't actually reclaim any disk space for a while, meaning that there will be a tendency to rebalance too many ranges away because its disk usage doesn't improve quickly enough.

Yes. We can also add a strategy which looks at range count for the idle score. It's something we are going to look into next.


docs/RFCS/20181204_copysets.md, line 193 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ah, 15% of the total disk space may work. I interpreted the 15% number as being 15% of the mean disk space used (e.g. if 10GB of a 1TB disk was being used, I thought you meant 1.5GB, not 150GB), or of the mean disk used fraction.

You might want to clarify that in the doc unless I just missed it.

Added the equation for idle score.


docs/RFCS/20181204_copysets.md, line 198 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ack, you might want to clarify that in the doc unless I just missed it.

Done.


docs/RFCS/20181204_copysets.md, line 221 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

So you're saying that after the move to x y y that the range will strongly prefer moving to x y y rather than back to x x x? Even though x x x would have a much better homogeneity score?

Yes. The way the score works makes it go from x x y to x y y. From x x y it has two choices: x y y or x x x.
x y y has the same homogeneity score as x x y, but a better idle score (because y has a higher idle score and x y y has more ys), so the transition to x y y is good.
x x x has a much lower idle score compared to x y y, so the transition to x y y is preferred over x x x.

We've internally tested that there is no thrashing with UTs and a bunch of roachtests with node crashes.
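As a worked example of the homogeneity score defined in the RFC (`number of pairwise same copyset id / (r choose 2)`), here is a small hypothetical helper (not the allocator's actual code) applied to the states discussed above:

```go
package main

import "fmt"

// homogeneityScore returns the fraction of replica pairs whose stores share
// a copyset ID: pairwiseSame / C(r, 2), per the RFC's definition.
func homogeneityScore(copysetIDs []int) float64 {
	r := len(copysetIDs)
	if r < 2 {
		return 1
	}
	same := 0
	for i := 0; i < r; i++ {
		for j := i + 1; j < r; j++ {
			if copysetIDs[i] == copysetIDs[j] {
				same++
			}
		}
	}
	return float64(same) / float64(r*(r-1)/2)
}

func main() {
	x, y := 1, 2
	// Prints 1.0 for (x x x) and (y y y), and 1/3 for (x x y) and (x y y).
	for _, replicas := range [][]int{{x, x, x}, {x, x, y}, {x, y, y}, {y, y, y}} {
		fmt.Println(replicas, homogeneityScore(replicas))
	}
}
```

Since the two intermediate states (x x y) and (x y y) have the same homogeneity score, it is the idle-score component that pulls the range on towards the more idle copyset y rather than back to (x x x), which is the behavior described above.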


docs/RFCS/20181204_copysets.md, line 231 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Usually, yeah. It only causes non-ideal decisions in edge cases, but those should self-correct quickly.

ok.

@a-robinson
Collaborator

left a comment

This is stabilizing from my point of view, but @bdarnell's going to be the final approver here.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Yes. Changing it to only re-generate copysets if store list is stable for 3 ticks.

Why not just wait until a store is considered dead before removing it?


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

Previously, mvijaykarthik (Vijay Karthik) wrote…

Copysets will be persisted by the node with the lowest node ID and read by the other nodes periodically (every 10s). This is described in the section "Copyset re-generation".

Changes to copysets are rare; I think the 10s hit is something we can live with.

We've historically tried to avoid creating hotspots that receive a thundering herd of requests from all the nodes periodically, since it won't scale very well as the number of nodes grows. We typically gossip such information instead. This seems like a good candidate for gossip as well, most logically as part of the SystemConfig.

@mvijaykarthik
Collaborator Author

left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181204_copysets.md, line 98 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why not just wait until a store is considered dead before removing it?

We want to be sure that the store list is stable. There could be other issues causing store list instability. In such cases we don't want to keep re-balancing copysets.

In addition to this, we can also remove a store only once it is considered dead. One question here: the threshold used in getStoreListFromIDsRLocked for a store to be considered dead is TimeUntilStoreDead (5 minutes). Will a single liveness timeout still cause it to be considered dead? If so, how do we get a store list with stores that have been dead for > 5 min removed?

The reason we used storePool.getStoreList is that the allocator uses the same. We want to update copysets before the rebalancer starts moving replicas out of a store.


docs/RFCS/20181204_copysets.md, line 162 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We've historically tried to avoid creating hotspots that receive a thundering herd of requests from all the nodes periodically, since it won't scale very well as the number of nodes grows. We typically gossip such information instead. This seems like a good candidate for gossip as well, most logically as part of the SystemConfig.

We wanted to update it using transactions (especially for the min data movement strategy, which is heavily reliant on the previous state, so we want it to be consistent).

One more optimization we use is that if the store list does not change, we don't re-generate and re-persist copysets (added this to the doc). This way, in steady state there are no persists.
So in steady state all nodes will just be reading this key every 10s. That shouldn't cause any performance degradation, right?
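A rough sketch of the steady-state behavior described here, purely illustrative (the function signature below is made up, not a real CockroachDB interface): only the node with the lowest node ID writes the mapping, and only when the store list changes, while every node refreshes its cached copy on a 10-second ticker.

```go
package copysets

import (
	"context"
	"time"
)

// refreshCopysets periodically reloads the persisted copyset mapping
// (store ID -> copyset ID). In steady state the mapping is unchanged, so
// each tick is a single cheap key read; writes happen elsewhere, only on
// the lowest-node-ID node and only when the live store list changes.
func refreshCopysets(ctx context.Context,
	read func(ctx context.Context) (map[int]int, error),
	apply func(map[int]int)) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if mapping, err := read(ctx); err == nil {
				apply(mapping)
			}
		}
	}
}
```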

@mvijaykarthik mvijaykarthik force-pushed the mvijaykarthik:rfc branch from 3d91bd9 to a1869d5 Jan 9, 2019

mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Jan 22, 2019
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
@bdarnell

Member

commented Jan 22, 2019

bors r+

craig bot pushed a commit that referenced this pull request Jan 22, 2019
Merge #32816
32816: RFC: copysets r=bdarnell a=mvijaykarthik

Copysets reduce the probability of data loss in the
presence of multi node failures in large clusters.

This RFC describes how copysets can be generated and
ranges be rebalanced to reside within copysets.

Release note: None

Co-authored-by: Vijay Karthik <vijay.karthik@rubrik.com>
@craig


commented Jan 22, 2019

Build succeeded

@craig craig bot merged commit 8c2a022 into cockroachdb:master Jan 22, 2019

3 checks passed:
- GitHub CI (Cockroach): TeamCity build finished
- bors: Build succeeded
- license/cla: Contributor License Agreement is signed.
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Jan 22, 2019
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 2, 2019
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 5, 2019
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 6, 2019
storage: add interfaces for copysets
Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in cockroachdb#32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 6, 2019
Merge #32954
32954: storage: add interfaces for copysets r=tbg a=mvijaykarthik

Copysets significantly reduce the probability of data loss
in the presence of multi node failures. The RFC for copysets
is presented in #32816.

This commit adds interfaces and protos for copysets.
The implementation added later will have two independent tracks:
1. Assignment and persistance of store_id-copyset_id mapping.
2. Rebalancer changes to rebalance replicas into copysets.

Release note: None

Co-authored-by: Vijay Karthik <vijay.karthik@rubrik.com>
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 13, 2019
storage: add support for copyset based rebalancing
This PR adds support for copyset based rebalancing.
The allocator will try to prefer placing replicas of a
range within a copyset if copyset based rebalancing is
enabled.

The details on how copyset score works is given in
RFC cockroachdb#32816.

Copyset generation will be added in a separate PR.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 14, 2019
storage: add support for copyset based rebalancing
This PR adds support for copyset based rebalancing.
The allocator will try to prefer placing replicas of a
range within a copyset if copyset based rebalancing is
enabled.

The details on how copyset score works is given in
RFC cockroachdb#32816.

Copyset generation will be added in a separate PR.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 16, 2019
storage: add support for copyset based rebalancing
This PR adds support for copyset based rebalancing.
The allocator will try to prefer placing replicas of a
range within a copyset if copyset based rebalancing is
enabled.

The details on how copyset score works is given in
RFC cockroachdb#32816.

Copyset generation will be added in a separate PR.

Release note: None
mvijaykarthik added a commit to mvijaykarthik/cockroach that referenced this pull request Feb 22, 2019
storage: add support for copyset based rebalancing
This PR adds support for copyset based rebalancing.
The allocator will try to prefer placing replicas of a
range within a copyset if copyset based rebalancing is
enabled.

The details on how copyset score works is given in
RFC cockroachdb#32816.

Copyset generation will be added in a separate PR.

Release note: None