Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new "TerminateInstanceAfterJob" configuration #523

Merged
merged 2 commits into from Feb 6, 2019

Conversation

tduffield
Copy link
Contributor

@tduffield tduffield commented Jan 25, 2019

This new setting will allow you to stop (and terminate) an instance after it has completed a job.

Signed-off-by: Tom Duffield tom@chef.io

@lox
Copy link
Contributor

lox commented Jan 25, 2019

Interesting! What’s the use case for this? My primary concern would be the 4 minute boot time of instances.

Might ECS be a better fit?

@tduffield
Copy link
Contributor Author

tduffield commented Jan 25, 2019

@lox the use case is we have jobs that a) require they be run on a full system (not a container), and b) once they run on the machine, it's pretty much useless (read: it's too much work to clean them up correctly).

The boot time is definitely a concern, and there are some things you can do with preemptive scaling via things like lambda, but this is just a first pass. I would consider this a "use it only if you really REALLY need it and you're willing to eat the 4 min boot time"

What we do right now is basically just keep the ASG over-provisioned so even though there's a huge delay in shutdown and reboot, there's always an instance available. We're working to make that more efficient though.

@tduffield
Copy link
Contributor Author

FWIW this is a pattern we've been using successfully for about 6 months internally. We're just getting around to contributing it back upstream :)

@lox
Copy link
Contributor

lox commented Jan 27, 2019

Makes sense! What would you think of instead of an "instance agent mode" this was just a boolean option along the lines of "terminate instance after job", similar to buildkite-agent start --disconnect-after-job. I think that might avoid a lot of the cognitive overhead that modes bring.

@tduffield
Copy link
Contributor Author

Sure. Works for me. I'll work up something tomorrow.

@tduffield tduffield changed the title Add new "Instance Agent Mode" configuration Add new "TerminateInstanceAfterJob" configuration Jan 28, 2019
@tduffield
Copy link
Contributor Author

@lox okay, updated the PR with your suggestions.

@lox
Copy link
Contributor

lox commented Jan 29, 2019

This is looking good, I really like the README description. One last request, could we use a drop-in unit for overriding adding just the ExecStopPost lines?

/etc/systemd/system/buildkite-agent@.service.d/10-power-off-stop.conf

[Service]
ExecStopPost=/usr/local/bin/mark-asg-unhealthy
ExecStopPost=/bin/sudo poweroff

Keeps merge conflicts to a minimum.

@tduffield
Copy link
Contributor Author

Alrighty. With that suggestion I was able to keep most of the changes into the boot script, keeping the AMI setup mostly the same. That's a good pattern overall I think. Thanks for that!

@tduffield
Copy link
Contributor Author

I also made the disconnect timeout configurable, and documented it in the README.

README.md Outdated Show resolved Hide resolved
@tduffield tduffield force-pushed the instance-agent-mode branch 2 times, most recently from bbb1111 to b58fd50 Compare January 29, 2019 14:54
This new setting will allow you to stop (and terminate) an instance
after it has completed a job.

Signed-off-by: Tom Duffield <tom@chef.io>
instance_id=$(curl -fsSL http://169.254.169.254/latest/meta-data/instance-id)
region=$(curl -fsSL http://169.254.169.254/latest/meta-data/placement/availability-zone | head -c -1)

aws autoscaling set-instance-health --instance-id "$instance_id" --region "$region" --health-status "Unhealthy"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love something very similar to this. Our use-case is running some data crunching on very large instances types, and to minimise costs we would like the instances to terminate a soon as the job completes.

One difference in our use-case is that we don't want the terminated instance to be replaced immediately with a fresh one (unless there is a scheduled job).

Would you be willing to accommodate this in your PR?

I think the change would be, instead of marking the instance unhealthy, using terminate-instance-in-auto-scaling-group with the --should-decrement-desired-capacity flag, which could be configurable.

Alternatively happy to submit this change myself as a new PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually really like this idea. I'll work it up in a new commit and we can discuss.

Copy link
Contributor Author

@tduffield tduffield Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tekumara okay, I pushed up a potential implementation in a91caa1. Let me know if you think this would work for you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, yes I think that will do the trick! Thankyou 🙏

This will allow us to have more control over the timing and capacity
of the ASG.

Signed-off-by: Tom Duffield <tom@chef.io>
@lox lox merged commit 6eb10a9 into buildkite:master Feb 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants