-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tunable knob for ASG Cooldown period #495
Tunable knob for ASG Cooldown period #495
Conversation
Thanks @prateek, can you tell us a bit more about your use case? Interested to hear a bit more about what problem you are trying to solve that needs the cooldown periods customized! |
tl;dr - the 300 hardcoded value was not in line with my expectation of elasticity of the stack. @lox sure thing! The use-case is standard CI - my team has one mega (github.com/m3db/m3) repo and some smaller repos (github.com/m3db/*) – we're in the process of consolidating in a monorepo right now. Developers in the team rely on the for quick PR feedback, so push to GH/CI frequently. The reason for the PR might completely be misconfiguration (please tell me if it is). Here are the knobs I had when I was testing:
Along with that, each of my CI builds has 6 jobs. So what would end up happening was I'd push some code, look at the build and go incorporate feedback. By the time I pushed again, the stack had hit the cooldown period once and had triggered an instance kill. At this point, it would wait for the cooldown period before spinning up more instances. |
I'm with prateek, a built-in five-minute cooldown on scaling events doesn't fit with some of my workloads. Five-minute spin-up forces a punctuated equilibrium on developers' wait-time expectations. (I shard to sixteen workers and take two minutes to run; all tests are run against all branch commits, so bursts of activity can cause long waits.) Certainly a value of 300 serves the 95% case quite well, but should anything benefit from more granular elasticity, I'd like to be able to provide it. |
The problem I see is that if you decrease the cooldown to below the time it takes to boot an instance and start an agent (around 3 minutes, depending) then you end up with run-away scaling. Might it just be better to use a bigger ScaleUpAdjustments @ecv? If the problem is "get folks more servers, quicker" that will do it. The current scaling mechanism is far from perfect, but it's also a pretty finely balanced system. In future I think we'll replace it with something like a lambda that can more intelligently scale up, but I'm reluctant to add too many knobs to the current setup as I think a lot of context is needed on tuning them and I think it would result in worse experiences for lots of folks than the defaults. |
Right, sure, of course you could shoot yourself in the foot by lowering the value below around 2.5 minutes. But currently the value is double two-and-a-half minutes. I'd just like to shave down that gap. I'm used to being able to blow my own leg off if I mis-handle a root shell or a firearm, this doesn't bother me. But if the support load caused by other users failing to comprehend this would cause an expensive time-drain, then yeah that's a hazard worth avoiding, and one I have no insight into – so I trust your judgment! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds sensible! I just updated the param description with the caveat about ensuring your instances have time to boot. Hopefully I interpreted the situation correctly?
@@ -185,6 +186,12 @@ Parameters: | |||
Default: -1 | |||
MaxValue: 0 | |||
|
|||
ScaleCooldownPeriod: | |||
Description: Minimum number of seconds to wait between any two scaling group changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description: Minimum number of seconds to wait between any two scaling group changes | |
Description: Minimum number of seconds to wait between scaling events. Must be greater than the time needed for an instance to boot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that works!
@lox, 27 days ago
Any reason this is not merged yet? |
A happy new year to you and your family too @dhalperi! We’ll get this out soon. |
Tunable knob for ASG Cooldown period
No description provided.