
Option to gracefully terminate runner #1029

Closed
long-wan-ep opened this issue Nov 6, 2023 · 13 comments · Fixed by #1117
Labels
enhancement (🆕 New feature or request), work-in-progress (Issue/PR is worked, should not become stale)

Comments

@long-wan-ep
Contributor

Describe the solution you'd like

When the terminate-agent-hook runs, workers are terminated and running jobs are interrupted. We would like an option to terminate runners gracefully, so that running jobs are given a chance to complete.

Describe alternatives you've considered

We previously disabled the creation of the terminate-agent-hook and used our own hook + Lambda to handle graceful termination, but the terminate-agent-hook was made mandatory, so we can no longer do this.

Suggest a solution

We suggest adding an option to gracefully terminate runners to the terminate-agent-hook Lambda. We can contribute our graceful-termination logic to terminate-agent-hook if it works for you. Here is a brief summary of our solution:

  1. Configure the gitlab-runner service to stop gracefully, e.g.:

     cat <<EOF > /etc/systemd/system/gitlab-runner.service.d/kill.conf
     [Service]
     # Time to wait before stopping the service, in seconds
     TimeoutStopSec=600
     KillSignal=SIGQUIT
     EOF

  2. Send a message to an SQS queue when a runner's terminate lifecycle hook triggers.
  3. The Lambda is triggered by the SQS message.
  4. The Lambda sends a command to the runner EC2 instance, via an AWS SSM document, to stop the gitlab-runner service.
  5. The Lambda waits for the SSM document to finish executing:
    a. If the gitlab-runner service stopped successfully, the Lambda completes the lifecycle hook.
    b. If the gitlab-runner service has not stopped, an error is thrown and the SQS message goes back to the queue to be retried in the next run.
  6. The Lambda terminates any workers still running (the Lambda side of steps 3–6 is sketched below).
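For context, here is a rough boto3 sketch of what the Lambda side of steps 3–6 could look like. This is not the module's actual code: the message shape follows the standard ASG lifecycle-hook notification, and the SSM document, waiter settings, and the omission of the final worker cleanup are assumptions for illustration only.

```python
# Hypothetical sketch of the graceful-terminate Lambda described above.
import json

import boto3

ssm = boto3.client("ssm")
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    for record in event["Records"]:  # SQS batch
        msg = json.loads(record["body"])  # ASG lifecycle-hook notification
        instance_id = msg["EC2InstanceId"]

        # Ask SSM to stop the gitlab-runner service on the runner instance.
        # systemd then waits up to TimeoutStopSec for running jobs to finish.
        cmd = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": ["systemctl stop gitlab-runner"]},
        )

        # Wait for the command to finish; a WaiterError here makes the Lambda
        # fail, so the SQS message returns to the queue and is retried later.
        ssm.get_waiter("command_executed").wait(
            CommandId=cmd["Command"]["CommandId"],
            InstanceId=instance_id,
            WaiterConfig={"Delay": 15, "MaxAttempts": 40},
        )

        # Service stopped cleanly: let the ASG proceed with termination.
        # (Terminating any workers that are still running is omitted here.)
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=msg["LifecycleHookName"],
            AutoScalingGroupName=msg["AutoScalingGroupName"],
            LifecycleActionToken=msg["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )
```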
@kayman-mk
Collaborator

Yes and no, I think. The Lambda is executed in case the GitLab Runner (which started the worker) dies. In this case the workers can continue with the current job, but they are not able to upload the logs, artifacts, etc. to GitLab, as that requires the GitLab Runner, which is no longer there.

As the job might access external resources, etc., it makes sense to wait until it is finished and only then kill the worker.

@kayman-mk
Collaborator

Some thoughts that popped up while checking the procedure you described above:

  • I think the shutdown timeout of 10 minutes doesn't change anything, because the GitLab Runner has already been shut down and can no longer be contacted by the workers.
  • There is no lifecycle hook for the Runner. But I guess you mean the worker instance, right? (Referring to step 2: "Send a message to an SQS queue when a runner's terminate lifecycle hook triggers.")
  • I have some doubts, as it complicates the whole setup and might introduce problems. But at the moment I can't think of an easier solution, to be honest.

If you could share your implementation, it would be wonderful.

@long-wan-ep
Contributor Author

You're right, this wouldn't help the situation where the runner dies. We were intending this for situations where the runner is modified and requires a refresh.

Here is a slimmed-down version of our implementation; I added it to the examples folder: https://github.com/long-wan-ep/terraform-aws-gitlab-runner/tree/graceful-terminate-example/examples/graceful-terminate.


github-actions bot commented Jan 9, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions bot added the stale label (Issue/PR is stale and closed automatically) on Jan 9, 2024

This issue was closed because it has been stalled for 15 days with no activity.

github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Jan 24, 2024
@long-wan-ep
Contributor Author

long-wan-ep commented Feb 16, 2024

Hi @kayman-mk, I noticed this issue was auto-closed; could we re-open it?

Does our solution look OK, or are there any other ideas we could try?

kayman-mk added the enhancement and work-in-progress labels and removed the stale label on Feb 22, 2024
@kayman-mk
Collaborator

kayman-mk commented Feb 22, 2024

Re-read everything ;-) Let's give it a try.

The terminate-agent-hook is used to kill the Workers in case the Runner (named "parent" in that module) dies. This should of course never happen. It is better to wait until all Executors are finished, then shut down the Runner.

Could you please propose a PR? It would be a good idea to make TimeoutStopSec configurable. GitLab uses 7200s; we typically use 3600s for the job timeout.

kayman-mk reopened this on Feb 22, 2024
@kayman-mk
Collaborator

It seems that the terminate-agent-hook is good for removing dangling SSH keys, spot requests, and so on, but not for stopping the Runner itself. I guess ec2_client.terminate_instance(...) simply kills the instance, which is unwanted, because the Executors are killed outright and we do not wait until they have finished processing their jobs.

@long-wan-ep
Contributor Author

Sounds good, we'll start working on a PR soon.

@long-wan-ep
Contributor Author

#1117 will resolve this issue.

@tmeijn
Contributor

tmeijn commented Apr 25, 2024

My bad, I meant to comment here, but it somehow got lost. Original intent:

Hey @kayman-mk, @long-wan-ep, I actually started implementing the proposal discussed in #1067: #1117. This MR still needs some polish, but based on my initial testing it seems to work. It basically makes the Runner Manager a bit smarter and aware of its own desired state, and it acts accordingly.

@long-wan-ep, I definitely didn't mean to steal your thunder, but I do think #1117, if it works, makes the implementation a bit simpler. I hope you don't mind! ❤️

@long-wan-ep
Contributor Author

@tmeijn I don't mind at all, your implementation looks great, thanks for opening the PR.

kayman-mk added a commit that referenced this issue May 29, 2024
## Description

Based on the discussion
#1067:

1. Moves the EventBridge rule that triggers the Lambda from `TERMINATING`
to `TERMINATE`. The Lambda now functions as an "after-the-fact" cleanup
instead of being responsible for cleanup _during_ termination.
2. Introduces a shell script, managed by systemd, that monitors the
target lifecycle of the instance and initiates a graceful GitLab Runner
shutdown (see the sketch after this list).
3. Makes the heartbeat timeout of the ASG terminating hook configurable,
with a default of the maximum job timeout + 5 minutes, capped at `7200`
(2 hours).
4. Introduces a launching lifecycle hook, allowing the new instance to
provision itself and GitLab Runner to provision its set capacity before
terminating the current instance.
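For illustration, here is a minimal Python sketch of the idea behind point 2. The actual change ships a shell script managed by systemd; the polling interval, the IMDSv2 calls, and the `systemctl stop gitlab-runner` command below are assumptions used to show the mechanism, not the script itself.

```python
# Hypothetical monitor for the instance's ASG target lifecycle state.
import subprocess
import time
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_token() -> str:
    # IMDSv2: fetch a session token before reading metadata.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def target_lifecycle_state() -> str:
    req = urllib.request.Request(
        f"{IMDS}/meta-data/autoscaling/target-lifecycle-state",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def main() -> None:
    # Poll until the ASG wants this instance gone.
    while target_lifecycle_state() != "Terminated":
        time.sleep(10)

    # Graceful shutdown: systemd sends SIGQUIT and waits up to TimeoutStopSec,
    # so gitlab-runner can finish the jobs it is currently running.
    subprocess.run(["systemctl", "stop", "gitlab-runner"], check=True)
    # Afterwards the terminating lifecycle hook can be completed (or left to
    # expire via the heartbeat timeout described in point 3).


if __name__ == "__main__":
    main()
```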

## Migrations required

No, except that if the previous default behavior of immediately terminating all
Workers + Manager is desired, the
`runner_worker_graceful_terminate_timeout_duration` variable should be
set to 30 (the minimum allowed).

## Verification

### Graceful terminate

1. Deploy this version of the module.
2. Start a long-running GitLab job.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the job keeps running and has output. Verify from the instance
logs that the GitLab Runner service is still running.
5. Once the remaining jobs have completed, observe that the GitLab Runner
service is terminated and the instance is put into `Terminating:Proceed`
status.

### Zero Downtime deployment

1. Deploy this version of the module.
2. Start multiple long-running GitLab jobs, twice the capacity of the
GitLab Runner.
3. Manually trigger an instance refresh in the runner ASG.
4. Verify the jobs keep running and have output. Verify from the
instance logs that the GitLab Runner service is still running.
5. Verify a new instance gets spun up, while the current instance stays
`InService`.
6. Verify the new instance is able to provision its set capacity.
7. Verify the new instance starts picking up GitLab jobs from the queue
before the current instance gets terminated.
8. Observe that there is zero downtime.
9. Once the remaining jobs have completed, observe that the GitLab Runner
service is terminated and the current instance is put into
`Terminating:Proceed` status.

Closes #1029

---------

Co-authored-by: Matthias Kay <matthias.kay@hlag.com>
Co-authored-by: Matthias Kay <github@matthiaskay.de>