New "Agent mode" code broke scaleset initialization #610

@benoit-nexthop

Description

The "Add agent mode" commit 42cfd1b added stricter state transition validation in database/sql/instances.go:

// Line ~341
return runnerErrors.NewBadRequestError("invalid instance status transition from %s to %s", current, newStatus)

When GARM starts up and finds instances in intermediate states (deleting, creating), it attempts to transition them to pending_delete for cleanup. However, the new validation logic considers these invalid state transitions:

  • deleting → pending_delete (first error we hit)
  • creating → pending_delete (second error we hit later on)

This error is fatal during scaleset worker initialization:

  1. The updateInstanceStatusByProviderID() or similar function fails
  2. The scaleset worker's Create() operation fails
  3. The worker never starts
  4. Without a running worker, there's no listener connected to GitHub's longpoll API
  5. The scaleset appears "offline" in GitHub and jobs cannot be acquired

The Problem with the State Machine

The new code enforces a stricter state machine, but it doesn't account for recovery scenarios:

Scenario 1: Instances stuck in deleting state

  1. GARM initiates instance deletion, sets state to deleting
  2. GARM crashes/restarts before deletion completes
  3. On restart, GARM finds instances in deleting state
  4. GARM tries to clean these up by transitioning to pending_delete
  5. BUG: The new code rejects deleting → pending_delete as "invalid"

Scenario 2: Instances stuck in creating state

  1. GARM creates instances and sets them to creating state
  2. Before the provider returns a provider_id, GARM crashes/restarts
  3. On restart, GARM finds instances in creating with provider_id: null
  4. GARM correctly tries to clean these up by transitioning to pending_delete
  5. BUG: The new code rejects creating → pending_delete as "invalid"

Expected behavior:

  • Instances in intermediate states should be allowed to transition to cleanup states (pending_delete, error)
  • These are recovery paths, not normal state transitions
  • The old code handled this gracefully

Evidence

Error message from GARM logs:

{
  "time": "2026-02-11T01:01:36.326886898Z",
  "level": "ERROR",
  "msg": "failed to handle scale set create operation",
  "error": "error starting scale set worker: updating runner : error updating instance: invalid instance status transition from creating to pending_delete",
  "worker": "scaleset-controller-xxx-xxx-xxx-xxx-xxx",
  "entity": "xxx",
  "endpoint": "github.com"
}

Instances stuck in creating state:

sqlite> SELECT status, count(*) FROM instances GROUP BY status;
creating|49

All 49 instances had provider_id: null:

{
  "id": "xxx-xxx-xxx-xxx-xxx",
  "provider_id": null,
  "name": "foo-ubuntu24-small-xxx",
  "status": "creating",
  "created_at": "2026-02-11T00:49:38.408702051Z"
}

Resolution

  1. We reverted the Docker image we built to the version of GARM we had before rebasing onto upstream changes 2 days ago, which does not include the "Add agent mode" commit
  2. Rebuilt and redeployed GARM
  3. System immediately recovered with 105 runners active

Recommendation

The state transition validation in database/sql/instances.go needs to allow recovery transitions:

  1. creating → pending_delete should be allowed when provider_id is null (cleanup of failed provisioning)
  2. creating → error should be allowed (mark failed instances)
  3. deleting → pending_delete should be allowed (cleanup of stuck deletions)
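
As a rough illustration of the three transitions above, here is a minimal sketch of a recovery allow-list that the validation could consult before rejecting a transition. The type, constants, and function names are hypothetical and are not taken from database/sql/instances.go, whose actual validation code is not shown in this issue.

```go
package main

import "fmt"

// InstanceStatus is a hypothetical stand-in for GARM's instance status type.
type InstanceStatus string

// Hypothetical constants; the string values mirror the statuses named in this issue.
const (
	StatusCreating      InstanceStatus = "creating"
	StatusDeleting      InstanceStatus = "deleting"
	StatusPendingDelete InstanceStatus = "pending_delete"
	StatusError         InstanceStatus = "error"
	StatusRunning       InstanceStatus = "running"
)

// recoveryTransitions lists the cleanup paths requested in this issue.
var recoveryTransitions = map[InstanceStatus][]InstanceStatus{
	StatusCreating: {StatusPendingDelete, StatusError},
	StatusDeleting: {StatusPendingDelete},
}

// isRecoveryTransition reports whether current -> next is one of the
// recovery paths. creating -> pending_delete is only accepted while the
// instance has no provider ID, i.e. provisioning never completed.
func isRecoveryTransition(current, next InstanceStatus, providerID *string) bool {
	for _, allowed := range recoveryTransitions[current] {
		if allowed != next {
			continue
		}
		if current == StatusCreating && next == StatusPendingDelete {
			return providerID == nil
		}
		return true
	}
	return false
}

func main() {
	id := "provider-id-123" // made-up provider ID for illustration
	fmt.Println(isRecoveryTransition(StatusCreating, StatusPendingDelete, nil)) // true
	fmt.Println(isRecoveryTransition(StatusCreating, StatusPendingDelete, &id)) // false
	fmt.Println(isRecoveryTransition(StatusDeleting, StatusPendingDelete, nil)) // true
	fmt.Println(isRecoveryTransition(StatusRunning, StatusCreating, nil))       // false
}
```

A check like this could run before the existing "invalid instance status transition" error is returned, so the stricter state machine stays in place for normal flows.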

Alternatively, the initialization code should handle these stuck instances differently, perhaps by setting them directly to error state before starting the state machine, rather than trying to transition them through the normal flow.

  • garm/database/sql/instances.go - State transition validation logic
  • garm/runner/scalesets.go - Scaleset worker initialization
  • Bug introduced in commit 42cfd1b ("Add agent mode")
  • Our reverted state: our changes on top of commit 3640235 (fixed the problem immediately)

cc @gabriel-samfira
