New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new "TerminateInstanceAfterJob" configuration #523
Conversation
Interesting! What’s the use case for this? My primary concern would be the 4 minute boot time of instances. Might ECS be a better fit? |
@lox the use case is we have jobs that a) require they be run on a full system (not a container), and b) once they run on the machine, it's pretty much useless (read: it's too much work to clean them up correctly). The boot time is definitely a concern, and there are some things you can do with preemptive scaling via things like lambda, but this is just a first pass. I would consider this a "use it only if you really REALLY need it and you're willing to eat the 4 min boot time" What we do right now is basically just keep the ASG over-provisioned so even though there's a huge delay in shutdown and reboot, there's always an instance available. We're working to make that more efficient though. |
32c0afe
to
8d3ea8e
Compare
FWIW this is a pattern we've been using successfully for about 6 months internally. We're just getting around to contributing it back upstream :) |
Makes sense! What would you think of instead of an "instance agent mode" this was just a boolean option along the lines of "terminate instance after job", similar to |
Sure. Works for me. I'll work up something tomorrow. |
8d3ea8e
to
480b1de
Compare
@lox okay, updated the PR with your suggestions. |
480b1de
to
b586f23
Compare
This is looking good, I really like the README description. One last request, could we use a drop-in unit for overriding adding just the /etc/systemd/system/buildkite-agent@.service.d/10-power-off-stop.conf
Keeps merge conflicts to a minimum. |
b586f23
to
4fb68d9
Compare
Alrighty. With that suggestion I was able to keep most of the changes into the boot script, keeping the AMI setup mostly the same. That's a good pattern overall I think. Thanks for that! |
4fb68d9
to
ee33ffb
Compare
I also made the disconnect timeout configurable, and documented it in the README. |
bbb1111
to
b58fd50
Compare
This new setting will allow you to stop (and terminate) an instance after it has completed a job. Signed-off-by: Tom Duffield <tom@chef.io>
b58fd50
to
346ff34
Compare
instance_id=$(curl -fsSL http://169.254.169.254/latest/meta-data/instance-id) | ||
region=$(curl -fsSL http://169.254.169.254/latest/meta-data/placement/availability-zone | head -c -1) | ||
|
||
aws autoscaling set-instance-health --instance-id "$instance_id" --region "$region" --health-status "Unhealthy" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love something very similar to this. Our use-case is running some data crunching on very large instances types, and to minimise costs we would like the instances to terminate a soon as the job completes.
One difference in our use-case is that we don't want the terminated instance to be replaced immediately with a fresh one (unless there is a scheduled job).
Would you be willing to accommodate this in your PR?
I think the change would be, instead of marking the instance unhealthy, using terminate-instance-in-auto-scaling-group with the --should-decrement-desired-capacity
flag, which could be configurable.
Alternatively happy to submit this change myself as a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually really like this idea. I'll work it up in a new commit and we can discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic, yes I think that will do the trick! Thankyou 🙏
This will allow us to have more control over the timing and capacity of the ASG. Signed-off-by: Tom Duffield <tom@chef.io>
724d56c
to
a91caa1
Compare
This new setting will allow you to stop (and terminate) an instance after it has completed a job.
Signed-off-by: Tom Duffield tom@chef.io