Add new "TerminateInstanceAfterJob" configuration #523

tduffield · 2019-01-25T21:00:02Z

This new setting will allow you to stop (and terminate) an instance after it has completed a job.

Signed-off-by: Tom Duffield <tom@chef.io>

lox · 2019-01-25T21:02:18Z

Interesting! What’s the use case for this? My primary concern would be the 4-minute boot time of instances.

Might ECS be a better fit?

tduffield · 2019-01-25T21:28:52Z

@lox the use case is we have jobs that a) require they be run on a full system (not a container), and b) once they run on the machine, it's pretty much useless (read: it's too much work to clean them up correctly).

The boot time is definitely a concern, and there are some things you can do with preemptive scaling via things like Lambda, but this is just a first pass. I would consider this a "use it only if you really REALLY need it and you're willing to eat the 4 min boot time" feature.

What we do right now is basically just keep the ASG over-provisioned so even though there's a huge delay in shutdown and reboot, there's always an instance available. We're working to make that more efficient though.

tduffield · 2019-01-25T21:39:26Z

FWIW this is a pattern we've been using successfully for about 6 months internally. We're just getting around to contributing it back upstream :)

lox · 2019-01-27T23:07:40Z

Makes sense! What would you think if, instead of an "instance agent mode", this were just a boolean option along the lines of "terminate instance after job", similar to buildkite-agent start --disconnect-after-job? I think that might avoid a lot of the cognitive overhead that modes bring.

tduffield · 2019-01-28T04:34:22Z

Sure. Works for me. I'll work up something tomorrow.

tduffield · 2019-01-28T15:12:48Z

@lox okay, updated the PR with your suggestions.

lox · 2019-01-29T06:36:11Z

This is looking good, and I really like the README description. One last request: could we use a drop-in unit override that adds just the ExecStopPost lines?

/etc/systemd/system/buildkite-agent@.service.d/10-power-off-stop.conf

[Service]
ExecStopPost=/usr/local/bin/mark-asg-unhealthy
ExecStopPost=/bin/sudo poweroff

Keeps merge conflicts to a minimum.
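For readers unfamiliar with drop-in units, the suggestion above could be sketched roughly as follows. This is only an illustration, not the code from this PR: the `write_dropin` helper and its directory argument are hypothetical, while the file name and the two ExecStopPost lines are copied from the comment.

```shell
#!/bin/sh
set -eu

# write_dropin [UNIT_DIR]
# Hypothetical helper: writes the drop-in from the comment above under UNIT_DIR
# (defaults to /etc/systemd/system). The packaged unit file itself is untouched.
write_dropin() {
  dropin_dir="${1:-/etc/systemd/system}/buildkite-agent@.service.d"
  mkdir -p "$dropin_dir"
  # Quoted heredoc delimiter: the contents are written literally, no expansion.
  cat > "$dropin_dir/10-power-off-stop.conf" <<'EOF'
[Service]
ExecStopPost=/usr/local/bin/mark-asg-unhealthy
ExecStopPost=/bin/sudo poweroff
EOF
}
```

On a real host this would presumably be followed by `systemctl daemon-reload` so systemd picks up the new file. Because the drop-in only appends ExecStopPost entries, the base unit can change upstream without conflicting with it.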

tduffield · 2019-01-29T14:13:26Z

Alrighty. With that suggestion I was able to keep most of the changes in the boot script, keeping the AMI setup mostly the same. That's a good pattern overall, I think. Thanks for that!

tduffield · 2019-01-29T14:42:17Z

I also made the disconnect timeout configurable, and documented it in the README.


tekumara · 2019-01-31T00:12:25Z

packer/conf/buildkite-agent/scripts/mark-asg-unhealthy

+instance_id=$(curl -fsSL http://169.254.169.254/latest/meta-data/instance-id)
+region=$(curl -fsSL http://169.254.169.254/latest/meta-data/placement/availability-zone | head -c -1)
+
+aws autoscaling set-instance-health --instance-id "$instance_id" --region "$region" --health-status "Unhealthy"
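The script in the diff derives the region by trimming the last byte of the availability zone with `head -c -1`, which relies on GNU head accepting a negative byte count. A possible portable alternative, shown only as a sketch, uses POSIX parameter expansion. The function name `region_from_az` is hypothetical, and the approach assumes zone names of the form region plus a single trailing letter.

```shell
#!/bin/sh
set -eu

# region_from_az AZ
# Hypothetical helper: strips the final character of the zone name,
# e.g. us-east-1a -> us-east-1. Assumes a single trailing zone letter.
region_from_az() {
  printf '%s\n' "${1%?}"
}
```

In the script it could be called as `region=$(region_from_az "$az")`, where `az` holds the value fetched from the metadata endpoint.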


Would love something very similar to this. Our use case is running some data crunching on very large instance types, and to minimise costs we would like the instances to terminate as soon as the job completes.

One difference in our use-case is that we don't want the terminated instance to be replaced immediately with a fresh one (unless there is a scheduled job).

Would you be willing to accommodate this in your PR?

I think the change would be, instead of marking the instance unhealthy, using terminate-instance-in-auto-scaling-group with the --should-decrement-desired-capacity flag, which could be configurable.

Alternatively happy to submit this change myself as a new PR.
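A rough sketch of what this proposal might look like in the shutdown script is below. It is not the implementation that later landed in a91caa1: the function names and the true/false argument are hypothetical, and the metadata lookups simply mirror the reviewed script. The `terminate-instance-in-auto-scaling-group` subcommand and its `--should-decrement-desired-capacity` / `--no-should-decrement-desired-capacity` flags are part of the AWS CLI.

```shell
#!/bin/sh
set -eu

# decrement_flag true|false
# Hypothetical helper: maps a boolean setting to the matching AWS CLI flag.
decrement_flag() {
  if [ "${1:-false}" = "true" ]; then
    echo "--should-decrement-desired-capacity"
  else
    echo "--no-should-decrement-desired-capacity"
  fi
}

# terminate_self true|false
# Hypothetical helper: terminates the current instance through its ASG.
# Requires curl, the AWS CLI, and IAM permission for the autoscaling call.
terminate_self() {
  instance_id=$(curl -fsSL http://169.254.169.254/latest/meta-data/instance-id)
  az=$(curl -fsSL http://169.254.169.254/latest/meta-data/placement/availability-zone)
  # "${az%?}" drops the trailing zone letter to obtain the region.
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$instance_id" \
    --region "${az%?}" \
    "$(decrement_flag "${1:-false}")"
}
```

Calling `terminate_self true` would terminate the instance and lower the desired capacity, so no replacement is launched; `terminate_self false` would leave the desired capacity alone and let the ASG replace the instance.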

tduffield · 2019-01-31

I actually really like this idea. I'll work it up in a new commit and we can discuss.

tduffield · 2019-01-31

@tekumara okay, I pushed up a potential implementation in a91caa1. Let me know if you think this would work for you.

tekumara · 2019-02-01

Fantastic, yes, I think that will do the trick! Thank you 🙏


tduffield force-pushed the instance-agent-mode branch from 32c0afe to 8d3ea8e Compare January 25, 2019 21:37

tduffield force-pushed the instance-agent-mode branch from 8d3ea8e to 480b1de Compare January 28, 2019 15:12

tduffield changed the title ~~Add new "Instance Agent Mode" configuration~~ Add new "TerminateInstanceAfterJob" configuration Jan 28, 2019

tduffield force-pushed the instance-agent-mode branch from 480b1de to b586f23 Compare January 28, 2019 17:57

tduffield force-pushed the instance-agent-mode branch from b586f23 to 4fb68d9 Compare January 29, 2019 14:12

tduffield force-pushed the instance-agent-mode branch from 4fb68d9 to ee33ffb Compare January 29, 2019 14:40

jeremiahsnapp reviewed Jan 29, 2019


tduffield force-pushed the instance-agent-mode branch 2 times, most recently from bbb1111 to b58fd50 Compare January 29, 2019 14:54

Add new "TerminateInstanceAfterJob" configuration

346ff34

This new setting will allow you to stop (and terminate) an instance after it has completed a job. Signed-off-by: Tom Duffield <tom@chef.io>

tduffield force-pushed the instance-agent-mode branch from b58fd50 to 346ff34 Compare January 29, 2019 16:07

tekumara reviewed Jan 31, 2019


Terminate the instance directly rather than mark it unhealthy

a91caa1

This will allow us to have more control over the timing and capacity of the ASG. Signed-off-by: Tom Duffield <tom@chef.io>

tduffield force-pushed the instance-agent-mode branch from 724d56c to a91caa1 Compare January 31, 2019 14:38

lox requested a review from toolmantim February 6, 2019 06:35

toolmantim approved these changes Feb 6, 2019


lox merged commit 6eb10a9 into buildkite:master Feb 6, 2019

JuanitoFatas mentioned this pull request Nov 2, 2020

Why is it not recommended to increase AgentsPerInstance ? #759

Closed
