storage: Jitter the StoreRebalancer loop's timing #31227
Conversation
a-robinson requested a review from cockroachdb/core-prs as a code owner Oct 10, 2018
a-robinson requested a review from petermattis Oct 10, 2018
a-robinson (Member) commented Oct 11, 2018
@tschottdorf any thoughts on backporting this? It's much more likely to avoid problems than cause them, but man has it gotten late in the cycle.
bors r+
craig bot pushed a commit that referenced this pull request Oct 11, 2018
craig bot commented Oct 11, 2018
Build succeeded
tschottdorf (Member) commented Oct 11, 2018
Is there a specific situation in which you expect it to cause problems?
Generally agreed that the jittering, if anything, is going to help.
a-robinson (Member) commented Oct 11, 2018
I don't think it would ever get persistently stuck, given that the goal for the StoreRebalancer is to rebalance the store within one or two rounds of work. However, due to delays in propagating information about the number of replicas and the load on each store, running store x's rebalancer right after store y's means that store x probably doesn't know about any changes that store y just triggered. If both store x and store y need a few rounds of changes, all of x's decisions may be suboptimal for multiple minutes, rather than just for a subset of those rounds.
tl;dr It might reduce flakiness of the rebalance-replicas-by-load roachtest (since that only has 5 minutes to succeed), and it might allow load to rebalance more quickly in edge cases in a real cluster where multiple stores are overloaded, and it's generally just good practice. It's not solving any big problems.
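For readers skimming the thread: the change being discussed amounts to re-arming the rebalancer loop's timer with a randomly perturbed interval instead of a fixed one, so that the loops on different stores drift apart over time rather than firing in lock-step. The sketch below is only a minimal illustration of that idea; the one-second demo interval, the ±25% jitter factor, and the `jitteredInterval` helper name are assumptions for the example, not necessarily what this PR merged.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval returns the given interval randomly perturbed by
// roughly +/-25%, so that periodic loops started at the same moment on
// different stores gradually desynchronize. (Hypothetical helper for
// illustration.)
func jitteredInterval(interval time.Duration) time.Duration {
	return time.Duration(float64(interval) * (0.75 + 0.5*rand.Float64()))
}

func main() {
	// Assumed demo interval; the real rebalancer loop runs far less often.
	const baseInterval = time.Second

	// Re-arm the timer with a fresh jittered value on every pass,
	// rather than using a fixed period.
	timer := time.NewTimer(jitteredInterval(baseInterval))
	defer timer.Stop()
	for i := 0; i < 3; i++ {
		<-timer.C
		fmt.Println("run one round of store rebalancing")
		timer.Reset(jitteredInterval(baseInterval))
	}
}
```

Re-jittering on every pass (instead of jittering once at startup) keeps the loops from re-synchronizing later, which is what lets store x's rounds interleave with store y's instead of always running back-to-back.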
a-robinson commented Oct 10, 2018
Just as a best practice. It may make failures like #31006 even less likely, although it's hard to say for sure.
Release note: None