storage: Jitter the StoreRebalancer loop's timing #31227

Merged
merged 1 commit into from Oct 11, 2018

3 participants
@a-robinson
Member

a-robinson commented Oct 10, 2018

Release note: None

Just as a best practice. It may make failures like #31006 even less likely, although it's hard to say for sure.
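
For context on what "jitter the loop's timing" means in practice, here is a minimal Go sketch. The `jitteredInterval` helper, its [1x, 2x) range, and the one-minute base interval are assumptions for illustration, not necessarily the exact names or values the StoreRebalancer uses.

```go
// Illustrative sketch only, not the actual CockroachDB implementation:
// scale a fixed loop interval by a random factor so that the rebalancer
// loops on different stores drift apart instead of waking up in lockstep.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval returns a duration in [interval, 2*interval).
// The range is an assumption for this sketch.
func jitteredInterval(interval time.Duration) time.Duration {
	return time.Duration(float64(interval) * (1 + rand.Float64()))
}

func main() {
	const baseInterval = time.Minute

	// Print a few sample waits to show the spread.
	for i := 0; i < 5; i++ {
		fmt.Println(jitteredInterval(baseInterval))
	}

	// In the rebalancer's loop, the idea is roughly:
	//
	//	for {
	//		<-time.After(jitteredInterval(baseInterval))
	//		// run one round of store rebalancing
	//	}
}
```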

@a-robinson a-robinson requested a review from cockroachdb/core-prs as a code owner Oct 10, 2018

@a-robinson a-robinson requested a review from petermattis Oct 10, 2018

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 10, 2018

This change is Reviewable

@a-robinson
Member

a-robinson commented Oct 11, 2018

@tschottdorf any thoughts on backporting this? It's much more likely to avoid problems than cause them, but man has it gotten late in the cycle.

bors r+

craig bot pushed a commit that referenced this pull request Oct 11, 2018

Merge #31227
31227: storage: Jitter the StoreRebalancer loop's timing r=a-robinson a=a-robinson

Release note: None

Just as a best practice. It may make failures like #31006 even less likely, although it's hard to say for sure.

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
@craig

craig bot commented Oct 11, 2018

Build succeeded

@craig craig bot merged commit 82bef53 into cockroachdb:master Oct 11, 2018

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.
@tschottdorf
Member

tschottdorf commented Oct 11, 2018

Is there a specific situation in which you expect it to cause problems?
Generally agreed that the jittering, if anything, is going to help.

@a-robinson
Member

a-robinson commented Oct 11, 2018

I don't think it would ever get persistently stuck, given that the goal of the StoreRebalancer is to rebalance the store within one or two rounds of work. But because information about the number of replicas and the load on each store propagates with some delay, running store x's rebalancer right after store y's means store x probably doesn't know about any changes that store y may have just triggered. If both store x and store y need a few rounds of changes, all of x's decisions may be suboptimal for multiple minutes rather than just for a subset of those rounds.

tl;dr: It might reduce the flakiness of the rebalance-replicas-by-load roachtest (which only has 5 minutes to succeed), it might let load rebalance more quickly in edge cases where multiple stores in a real cluster are overloaded, and it's generally just good practice. It's not solving any big problems.
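
To make the timing argument above concrete, here is a small, purely illustrative Go simulation (not code from this PR) of two stores whose loops start at the same moment: without jitter their wake-ups stay in lockstep every round, so each store keeps acting on information that predates the other's latest changes; with jitter the wake-ups drift apart after the first round.

```go
// Illustrative simulation only: compare the wake-up times of two stores'
// rebalancer loops with and without a jittered interval.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// wakeUps returns the times (relative to a common start) at which a loop
// wakes up over the given number of rounds, optionally jittering each wait.
func wakeUps(base time.Duration, jitter bool, rounds int) []time.Duration {
	var t time.Duration
	out := make([]time.Duration, 0, rounds)
	for i := 0; i < rounds; i++ {
		interval := base
		if jitter {
			// Assumed jitter range of [1x, 2x); the real factor may differ.
			interval = time.Duration(float64(base) * (1 + rand.Float64()))
		}
		t += interval
		out = append(out, t)
	}
	return out
}

func main() {
	const base = time.Minute

	// Without jitter, both stores wake up at exactly the same times every
	// round, so neither ever sees the other's most recent changes.
	fmt.Println("store x, no jitter:", wakeUps(base, false, 3))
	fmt.Println("store y, no jitter:", wakeUps(base, false, 3))

	// With jitter, the wake-up times drift apart from round to round.
	fmt.Println("store x, jitter:   ", wakeUps(base, true, 3))
	fmt.Println("store y, jitter:   ", wakeUps(base, true, 3))
}
```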
