
Tunable knob for ASG Cooldown period #495

Merged: 2 commits into buildkite:master on Jan 18, 2019

Conversation

@prateek (Contributor) commented Sep 30, 2018

No description provided.

@lox (Contributor) commented Oct 1, 2018

Thanks @prateek, can you tell us a bit more about your use case? I'm interested to hear what problem you're trying to solve that needs the cooldown periods customized!

@prateek (Contributor, Author) commented Oct 1, 2018

tl;dr: the hardcoded value of 300 was not in line with my expectations of the stack's elasticity.

@lox sure thing! The use case is standard CI: my team has one mega repo (github.com/m3db/m3) and some smaller repos (github.com/m3db/*), and we're in the process of consolidating into a monorepo right now. Developers on the team rely on quick PR feedback, so they push to GH/CI frequently.

The reason for the PR might simply be misconfiguration on my part (please tell me if it is).

Here are the knobs I had when I was testing:

  • ScaleUpAdjustment: 6
  • ScaleDownAdjustment: -3
  • ScaleDownPeriod: 120

Along with that, each of my CI builds has 6 jobs. So what would end up happening was: I'd push some code, look at the build, and go incorporate feedback. By the time I pushed again, the stack had hit the cooldown period once and triggered an instance kill. At that point, it would wait out the cooldown period before spinning up more instances.
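
(For reference, a rough sketch of how those knobs fit together, written as plain YAML key/value pairs; the explanatory comments are a reading of this thread rather than anything taken from the stack's template or docs.)

# Sketch only: the scaling knobs from the comment above, as key/value pairs.
# The values 6 / -3 / 120 come from this thread; the comments are assumptions.
ScaleUpAdjustment: 6      # add up to 6 instances per scale-up event
ScaleDownAdjustment: -3   # remove up to 3 instances per scale-down event
ScaleDownPeriod: 120      # seconds of inactivity before a scale-down fires
# The ASG cooldown itself was hardcoded to 300 seconds at the time, which is
# the value this PR exposes as a new ScaleCooldownPeriod parameter.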

@ecv commented Dec 19, 2018

I'm with prateek: a built-in five-minute cooldown on scaling events doesn't fit some of my workloads.

A five-minute spin-up forces a punctuated equilibrium on developers' wait-time expectations. (My test suite shards across sixteen workers and takes two minutes to run; all tests are run against all branch commits, so bursts of activity can cause long waits.)

Certainly a value of 300 serves the 95% case quite well, but should anything benefit from more granular elasticity, I'd like to be able to provide it.

@lox (Contributor) commented Dec 19, 2018

The problem I see is that if you decrease the cooldown to below the time it takes to boot an instance and start an agent (around 3 minutes, depending), you end up with runaway scaling. Might it just be better to use a bigger ScaleUpAdjustment, @ecv? If the problem is "get folks more servers, quicker", that will do it.

The current scaling mechanism is far from perfect, but it's also a pretty finely balanced system. In future I think we'll replace it with something like a Lambda that can scale up more intelligently, but I'm reluctant to add too many knobs to the current setup: tuning them needs a lot of context, and I think it would leave many folks with a worse experience than the defaults.

@ecv commented Dec 19, 2018

Right, sure, of course you could shoot yourself in the foot by lowering the value below roughly 2.5 minutes. But the current hardcoded value is double that; I'd just like to shave down the gap.

I'm used to being able to blow my own leg off if I mishandle a root shell or a firearm, so this doesn't bother me. But if the support load from other users failing to comprehend this would be an expensive time-drain, then yes, that's a hazard worth avoiding, and one I have no insight into, so I trust your judgment!

@lox (Contributor) commented Dec 21, 2018

After some thought, if I have to choose between a config option with a warning on it and no control for folks that need it, I'll choose the config option. Let's get this merged in and documented and I'll do some testing. Thanks for the perspective @ecv and apologies it's taken me so long @prateek.

lox requested a review from @toolmantim on December 21, 2018
@toolmantim (Contributor) left a comment


Sounds sensible! I just updated the param description with the caveat about ensuring your instances have time to boot. Hopefully I interpreted the situation correctly?

@@ -185,6 +186,12 @@ Parameters:
    Default: -1
    MaxValue: 0

  ScaleCooldownPeriod:
    Description: Minimum number of seconds to wait between any two scaling group changes

Suggested change:
-    Description: Minimum number of seconds to wait between any two scaling group changes
+    Description: Minimum number of seconds to wait between scaling events. Must be greater than the time needed for an instance to boot.


Yup, that works!
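
(Putting the suggestion together with the diff context above, a sketch of how the new parameter might end up looking in the template's Parameters section; only the name and the Description wording come from this review, while Type, Default, and MinValue are assumptions modelled on the neighbouring parameters.)

  ScaleCooldownPeriod:
    Description: Minimum number of seconds to wait between scaling events. Must be greater than the time needed for an instance to boot.
    Type: Number        # assumed, matching the other numeric scaling parameters
    Default: 300        # assumed default, matching the previously hardcoded cooldown
    MinValue: 0         # assumed lower bound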

@dhalperi commented, quoting @lox's comment from 27 days earlier:

After some thought, if I have to choose between a config option with a warning on it and no control for folks that need it, I'll choose the config option. Let's get this merged in and documented and I'll do some testing. Thanks for the perspective @ecv and apologies it's taken me so long @prateek.

Any reason this is not merged yet?

@lox (Contributor) commented Jan 18, 2019

A happy new year to you and your family too @dhalperi! We’ll get this out soon.

lox self-assigned this on Jan 18, 2019
lox merged commit 2fe16a5 into buildkite:master on Jan 18, 2019
lox added a commit that referenced this pull request on Jan 18, 2019