Fix windows agent not restarting #1318

Merged 3 commits on May 3, 2024

@patrobinson patrobinson commented May 2, 2024

Description

Steps to reproduce:

  • Create a Windows stack
    • Set ScaleInIdlePeriod to some number greater than 0 (I used 6 to speed up the feedback)
    • Set TerminateInstanceAfterJob to false
    • Set the MinSize to 1 (this step is necessary to ensure a call to terminate-instance-in-auto-scaling-group fails)
    • Set the MaxSize to 1 (again, to prevent the terminate-instance call from succeeding)
    • Ensure no jobs get assigned to the agent
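
For anyone scripting the reproduction, the settings above can be collected into a CloudFormation-style parameter list. This is only a sketch: the keys are copied from the steps as written and may not match the template's exact parameter names, the parameter that selects a Windows stack is left out, and `repro_parameters` is a hypothetical helper.

```python
def repro_parameters(idle_period=6):
    """Build the stack parameters used in the reproduction steps.

    Keys are taken from the PR description as written; the real template
    may use different (for example prefixed) names. That is an assumption.
    """
    values = {
        "ScaleInIdlePeriod": str(idle_period),   # any number greater than 0
        "TerminateInstanceAfterJob": "false",
        "MinSize": "1",  # ASG cannot go below 1, so termination must fail
        "MaxSize": "1",
    }
    # CloudFormation expects a list of ParameterKey/ParameterValue pairs.
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


if __name__ == "__main__":
    for parameter in repro_parameters():
        print(parameter)
```

Pinning MinSize and MaxSize to the same value is what guarantees the scale-in call is rejected, which is the condition the bug depends on.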

Observed behaviour: After the ScaleInIdlePeriod the EC2 instance is running but the buildkite-agent is in the SERVICE_STOPPED state.

Expected behaviour: the EC2 instance is running and the buildkite-agent is in the SERVICE_RUNNING state.
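
One way to check which state the service ended up in is to read the STATE line of `sc query buildkite-agent` on the instance. The sketch below only parses that output; `service_state` is a hypothetical helper and the sample text is illustrative, not a captured log.

```python
import re


def service_state(sc_query_output):
    """Return the state from `sc query` output, e.g. 'SERVICE_STOPPED'.

    Returns None if no STATE line is found.
    """
    match = re.search(
        r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_query_output, re.MULTILINE
    )
    return f"SERVICE_{match.group(1)}" if match else None


# Illustrative sample shaped like `sc query` output (not a real capture).
SAMPLE = """
SERVICE_NAME: buildkite-agent
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
"""

if __name__ == "__main__":
    print(service_state(SAMPLE))
```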

Analysis

This bug was likely introduced in c3ebaa5

First, it's important to understand how the service is configured on Windows. We configure the default behaviour on exit to Restart with a 10s delay.
We also configure the terminate-instance script to run once the service stops.

Here's the sequence of events I've pieced together based on log outputs, the nssm source code and Windows documentation:

  • The agent exits with code 0 once the idle timeout is reached
  • The terminate-instance script is started asynchronously
  • The service enters the SERVICE_PENDING state
  • The terminate-instance script sends a STOP signal to the service, but it's blocked by the throttled restart
  • After 10 seconds the throttled restart expires
  • The service enters the SERVICE_START_PENDING state
  • The STOP signal returns "buildkite-agent: Unexpected status SERVICE_START_PENDING in response to STOP control." but queues the stop command
  • The terminate-instance-in-auto-scaling-group API call fails because the ASG is already at its MinSize
  • The terminate-instance script sends a START signal, but this fails with "START: An instance of the service is already running."
  • The service processes the queued stop command and the agent stops

Changes

Don't attempt to stop the agent before terminating the instance. The stop is asynchronous, so it doesn't complete before the start command is issued.
Because of the restart delay, the agent is unlikely to start and pick up a job before the ASG can scale it down, so the stop is not necessary.
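
To make the ordering problem concrete, here is a toy model of the sequence from the Analysis, run with and without the stop. It is a sketch under assumptions, not nssm's real behaviour: `ToyService`, the `RESTART_THROTTLED` label and `run_terminate_script` are all hypothetical, the terminate call is assumed to always fail (as in the reproduction), and the start attempt on failure is assumed to remain in both variants.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Toy stand-in for the service wrapper; not nssm's implementation."""
    # Hypothetical label: agent has exited, restart delay not yet elapsed.
    state: str = "RESTART_THROTTLED"
    stop_queued: bool = False
    log: list = field(default_factory=list)

    def stop(self):
        # A stop sent while the restart is pending is queued, not applied.
        if self.state in ("RESTART_THROTTLED", "SERVICE_START_PENDING"):
            self.stop_queued = True
            self.log.append("stop queued")
        else:
            self.state = "SERVICE_STOPPED"

    def restart_delay_elapsed(self):
        if self.state == "RESTART_THROTTLED":
            self.state = "SERVICE_START_PENDING"

    def start(self):
        # Starting fails unless the service has fully stopped.
        if self.state != "SERVICE_STOPPED":
            self.log.append("start failed: already running")
            return False
        self.state = "SERVICE_RUNNING"
        return True

    def settle(self):
        # The pending start completes, then any queued stop is processed.
        if self.state == "SERVICE_START_PENDING":
            self.state = "SERVICE_RUNNING"
        if self.stop_queued:
            self.stop_queued = False
            self.state = "SERVICE_STOPPED"


def run_terminate_script(stop_before_terminate):
    """Replay the script against the toy service; return the final state."""
    service = ToyService()
    if stop_before_terminate:
        service.stop()               # old behaviour: stop first
    service.restart_delay_elapsed()  # the 10s restart delay expires
    # The terminate call is assumed to fail here (ASG already at MinSize),
    # so the script tries to start the agent again.
    service.start()
    service.settle()
    return service.state


if __name__ == "__main__":
    print("with stop:   ", run_terminate_script(True))
    print("without stop:", run_terminate_script(False))
```

In this model the queued stop is the only thing that leaves the service stopped; without it the failed start is harmless because the pending restart brings the agent back on its own.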

Makes it easier to debug
This forces the agent to stop immediately after being restarted.

Because the stop is an async command, the later start fails: the agent hasn't stopped yet.
@patrobinson patrobinson merged commit 5b523a3 into main May 3, 2024
1 check passed
@patrobinson patrobinson deleted the fix-windows-agent-not-restarting branch May 3, 2024 00:02