Description
The "Add agent mode" commit 42cfd1b added stricter state transition validation in database/sql/instances.go:
// Line ~341
return runnerErrors.NewBadRequestError("invalid instance status transition from %s to %s", current, newStatus)
When GARM starts up and finds instances in intermediate states (deleting, creating), it attempts to transition them to pending_delete for cleanup. However, the new validation logic considers these invalid state transitions:
- deleting → pending_delete (first error we hit)
- creating → pending_delete (second error we hit later on)
This error is fatal during scaleset worker initialization:
- The updateInstanceStatusByProviderID() or similar function fails
- The scaleset worker's Create() operation fails
- The worker never starts
- Without a running worker, there's no listener connected to GitHub's longpoll API
- The scaleset appears "offline" in GitHub and jobs cannot be acquired
The Problem with the State Machine
The new code enforces a stricter state machine, but it doesn't account for recovery scenarios:
Scenario 1: Instances stuck in deleting state
- GARM initiates instance deletion, sets state to deleting
- GARM crashes/restarts before deletion completes
- On restart, GARM finds instances in deleting state
- GARM tries to clean these up by transitioning to pending_delete
- BUG: The new code rejects deleting → pending_delete as "invalid"
Scenario 2: Instances stuck in creating state
- GARM creates instances and sets them to creating state
- Before the provider returns a provider_id, GARM crashes/restarts
- On restart, GARM finds instances in creating with provider_id: null
- GARM correctly tries to clean these up by transitioning to pending_delete
- BUG: The new code rejects creating → pending_delete as "invalid"
Expected behavior:
- Instances in intermediate states should be allowed to transition to cleanup states (pending_delete, error)
- These are recovery paths, not normal state transitions
- The old code handled this gracefully
Evidence
Error message from GARM logs:
{
"time": "2026-02-11T01:01:36.326886898Z",
"level": "ERROR",
"msg": "failed to handle scale set create operation",
"error": "error starting scale set worker: updating runner : error updating instance: invalid instance status transition from creating to pending_delete",
"worker": "scaleset-controller-xxx-xxx-xxx-xxx-xxx",
"entity": "xxx",
"endpoint": "github.com"
}
Instances stuck in creating state:
sqlite> SELECT status, count(*) FROM instances GROUP BY status;
creating|49
All 49 instances had provider_id: null:
{
"id": "xxx-xxx-xxx-xxx-xxx",
"provider_id": null,
"name": "foo-ubuntu24-small-xxx",
"status": "creating",
"created_at": "2026-02-11T00:49:38.408702051Z"
}
Resolution
- We reverted the Docker image we built to the version of GARM we had before rebasing onto upstream changes 2 days ago, which does not include the "Add agent mode" commit
- Rebuilt and redeployed GARM
- System immediately recovered with 105 runners active
Recommendation
The state transition validation in database/sql/instances.go needs to allow recovery transitions:
- creating → pending_delete should be allowed when provider_id is null (cleanup of failed provisioning)
- creating → error should be allowed (mark failed instances)
- deleting → pending_delete should be allowed (cleanup of stuck deletions)
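The recovery transitions listed above could be expressed as a small lookup table that the validation consults. The following is only a minimal sketch, not GARM's actual code: the InstanceStatus type, the status constants, ErrInvalidTransition, recoveryTransitions and validateRecoveryTransition are all hypothetical names chosen for illustration, and the real check in database/sql/instances.go may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// InstanceStatus is a hypothetical stand-in for GARM's instance status type.
type InstanceStatus string

// Status values taken from the issue text, plus "running" as an assumed
// example of a status that has no recovery path in this sketch.
const (
	StatusCreating      InstanceStatus = "creating"
	StatusDeleting      InstanceStatus = "deleting"
	StatusPendingDelete InstanceStatus = "pending_delete"
	StatusError         InstanceStatus = "error"
	StatusRunning       InstanceStatus = "running"
)

// ErrInvalidTransition is a hypothetical sentinel error for this sketch.
var ErrInvalidTransition = errors.New("invalid instance status transition")

// recoveryTransitions lists the cleanup paths proposed in this issue.
var recoveryTransitions = map[InstanceStatus]map[InstanceStatus]bool{
	StatusCreating: {StatusPendingDelete: true, StatusError: true},
	StatusDeleting: {StatusPendingDelete: true},
}

// validateRecoveryTransition returns nil when current -> next is one of the
// proposed recovery paths. providerID is nil when the provider never
// returned an ID for the instance.
func validateRecoveryTransition(current, next InstanceStatus, providerID *string) error {
	if !recoveryTransitions[current][next] {
		return fmt.Errorf("%w from %s to %s", ErrInvalidTransition, current, next)
	}
	// creating -> pending_delete is only proposed for instances without a provider_id.
	if current == StatusCreating && next == StatusPendingDelete && providerID != nil {
		return fmt.Errorf("%w from %s to %s: provider_id is set", ErrInvalidTransition, current, next)
	}
	return nil
}

func main() {
	// A stuck deletion and a failed provisioning are both accepted.
	fmt.Println(validateRecoveryTransition(StatusDeleting, StatusPendingDelete, nil))
	fmt.Println(validateRecoveryTransition(StatusCreating, StatusPendingDelete, nil))
	// A transition outside the recovery table is still rejected.
	fmt.Println(validateRecoveryTransition(StatusRunning, StatusCreating, nil))
}
```

In a real fix, a check like this would presumably run alongside the existing normal-flow validation, so that only the recovery paths are loosened and every other transition stays as strict as it is today.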
Alternatively, the initialization code should handle these stuck instances differently, perhaps by setting them directly to error state before starting the state machine, rather than trying to transition them through the normal flow.
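The alternative above could look roughly like a one-off reconciliation pass that runs before the worker starts. This is a hedged sketch under stated assumptions: the Instance struct and reconcileStuckInstances are hypothetical, the pass works on an in-memory slice rather than GARM's database layer, and whether error is the right target state for instances stuck in deleting is a design question for the maintainers.

```go
package main

import "fmt"

// Instance is a hypothetical, simplified view of a runner instance record.
type Instance struct {
	Name       string
	ProviderID *string // nil when the provider never returned an ID
	Status     string
}

// reconcileStuckInstances marks instances left in an intermediate state
// ("creating" or "deleting") as "error" without going through transition
// validation. It returns the names of the instances it changed.
// In a real implementation this would be a database update, not an
// in-memory mutation.
func reconcileStuckInstances(instances []*Instance) []string {
	var changed []string
	for _, inst := range instances {
		switch inst.Status {
		case "creating", "deleting":
			inst.Status = "error"
			changed = append(changed, inst.Name)
		}
	}
	return changed
}

func main() {
	instances := []*Instance{
		{Name: "runner-a", Status: "creating"},
		{Name: "runner-b", Status: "running"},
		{Name: "runner-c", Status: "deleting"},
	}
	changed := reconcileStuckInstances(instances)
	fmt.Println("reconciled:", changed)
}
```

Running such a pass once at startup would keep the strict state machine unchanged for normal operation while still letting the worker start when leftover instances are present.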