
Tunable knob for ASG Cooldown period #495

Merged: 2 commits into buildkite:master on Jan 18, 2019

Conversation

@prateek (Contributor) commented Sep 30, 2018

No description provided.

@lox (Contributor) commented Oct 1, 2018

Thanks @prateek, can you tell us a bit more about your use case? I'm interested to hear what problem you're trying to solve that needs the cooldown periods customized!

@prateek (Contributor, Author) commented Oct 1, 2018

tl;dr: the hardcoded value of 300 was not in line with my expectations of the stack's elasticity.

@lox sure thing! The use case is standard CI: my team has one mega repo (github.com/m3db/m3) and some smaller repos (github.com/m3db/*), and we're in the process of consolidating into a monorepo right now. Developers on the team rely on quick PR feedback, so they push to GH/CI frequently.

The reason for the PR might simply be misconfiguration on my part (please tell me if it is).

Here are the knobs I had when I was testing:

  • ScaleUpAdjustment: 6
  • ScaleDownAdjustment: -3
  • ScaleDownPeriod: 120

Along with that, each of my CI builds has 6 jobs. So what would end up happening was: I'd push some code, look at the build, and go incorporate feedback. By the time I pushed again, the stack had hit the cooldown period once and triggered an instance kill. At that point, it would wait out the cooldown period before spinning up more instances.
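
(For reference, a rough sketch of how those knobs fit together, written as plain YAML key/value pairs; the explanatory comments are a reading of this thread rather than anything taken from the stack's template or docs.)

# Sketch only: the scaling knobs from the comment above, as key/value pairs.
# The values 6 / -3 / 120 come from this thread; the comments are assumptions.
ScaleUpAdjustment: 6      # add up to 6 instances per scale-up event
ScaleDownAdjustment: -3   # remove up to 3 instances per scale-down event
ScaleDownPeriod: 120      # seconds of inactivity before a scale-down fires
# The ASG cooldown itself was hardcoded to 300 seconds at the time, which is
# the value this PR exposes as a new ScaleCooldownPeriod parameter.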

@ecv commented Dec 19, 2018

I'm with prateek: a built-in five-minute cooldown on scaling events doesn't fit some of my workloads.

A five-minute spin-up forces a punctuated equilibrium on developers' wait-time expectations. (My test suite shards across sixteen workers and takes two minutes to run; all tests are run against all branch commits, so bursts of activity can cause long waits.)

Certainly a value of 300 serves the 95% case quite well, but should anything benefit from more granular elasticity, I'd like to be able to provide it.

@lox (Contributor) commented Dec 19, 2018

The problem I see is that if you decrease the cooldown to below the time it takes to boot an instance and start an agent (around 3 minutes, depending), you end up with runaway scaling. Might it just be better to use a bigger ScaleUpAdjustment, @ecv? If the problem is "get folks more servers, quicker", that will do it.

The current scaling mechanism is far from perfect, but it's also a pretty finely balanced system. In future I think we'll replace it with something like a Lambda that can scale up more intelligently, but I'm reluctant to add too many knobs to the current setup: tuning them needs a lot of context, and I think it would leave many folks with a worse experience than the defaults.

@ecv commented Dec 19, 2018

Right, sure, of course you could shoot yourself in the foot by lowering the value below roughly 2.5 minutes. But the current hardcoded value is double that; I'd just like to shave down the gap.

I'm used to being able to blow my own leg off if I mishandle a root shell or a firearm, so this doesn't bother me. But if the support load from other users failing to comprehend this would be an expensive time-drain, then yes, that's a hazard worth avoiding, and one I have no insight into, so I trust your judgment!

@lox (Contributor) commented Dec 21, 2018

After some thought, if I have to choose between a config option with a warning on it and no control for folks that need it, I'll choose the config option. Let's get this merged in and documented and I'll do some testing. Thanks for the perspective @ecv and apologies it's taken me so long @prateek.

lox requested a review from @toolmantim on December 21, 2018
@toolmantim (Contributor) left a comment


Sounds sensible! I just updated the param description with the caveat about ensuring your instances have time to boot. Hopefully I interpreted the situation correctly?

@@ -185,6 +186,12 @@ Parameters:
    Default: -1
    MaxValue: 0

  ScaleCooldownPeriod:
    Description: Minimum number of seconds to wait between any two scaling group changes

Suggested change:
-    Description: Minimum number of seconds to wait between any two scaling group changes
+    Description: Minimum number of seconds to wait between scaling events. Must be greater than the time needed for an instance to boot.


Yup, that works!
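
(Putting the suggestion together with the diff context above, a sketch of how the new parameter might end up looking in the template's Parameters section; only the name and the Description wording come from this review, while Type, Default, and MinValue are assumptions modelled on the neighbouring parameters.)

  ScaleCooldownPeriod:
    Description: Minimum number of seconds to wait between scaling events. Must be greater than the time needed for an instance to boot.
    Type: Number        # assumed, matching the other numeric scaling parameters
    Default: 300        # assumed default, matching the previously hardcoded cooldown
    MinValue: 0         # assumed lower bound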

@dhalperi commented, quoting @lox's comment from 27 days earlier:

After some thought, if I have to choose between a config option with a warning on it and no control for folks that need it, I'll choose the config option. Let's get this merged in and documented and I'll do some testing. Thanks for the perspective @ecv and apologies it's taken me so long @prateek.

Any reason this is not merged yet?

@lox (Contributor) commented Jan 18, 2019

A happy new year to you and your family too @dhalperi! We’ll get this out soon.

lox self-assigned this on Jan 18, 2019
lox merged commit 2fe16a5 into buildkite:master on Jan 18, 2019
lox added a commit that referenced this pull request on Jan 18, 2019