New "Agent mode" code broke scaleset initialization

The "Add agent mode" commit 42cfd1b3c6bd54e043170f774f7049e18832abb9 added stricter state transition validation in `database/sql/instances.go`:

```go
// Line ~341
return runnerErrors.NewBadRequestError("invalid instance status transition from %s to %s", current, newStatus)
```

When GARM starts up and finds instances in intermediate states (`deleting`, `creating`), it attempts to transition them to `pending_delete` for cleanup. However, the new validation logic considers these invalid state transitions:
- `deleting` → `pending_delete` (first error we hit)
- `creating` → `pending_delete` (second error we hit later on)

This error is fatal during scaleset worker initialization:
1. The `updateInstanceStatusByProviderID()` or similar function fails
2. The scaleset worker's `Create()` operation fails
3. The worker never starts
4. Without a running worker, there's no listener connected to GitHub's longpoll API
5. The scaleset appears "offline" in GitHub and jobs cannot be acquired

### The Problem with the State Machine

The new code enforces a stricter state machine, but it doesn't account for recovery scenarios:

**Scenario 1: Instances stuck in `deleting` state**
1. GARM initiates instance deletion, sets state to `deleting`
2. GARM crashes/restarts before deletion completes
3. On restart, GARM finds instances in `deleting` state
4. GARM tries to clean these up by transitioning to `pending_delete`
5. **BUG**: The new code rejects `deleting` → `pending_delete` as "invalid"

**Scenario 2: Instances stuck in `creating` state**
1. GARM creates instances and sets them to `creating` state
2. Before the provider returns a `provider_id`, GARM crashes/restarts
3. On restart, GARM finds instances in `creating` with `provider_id: null`
4. GARM correctly tries to clean these up by transitioning to `pending_delete`
5. **BUG**: The new code rejects `creating` → `pending_delete` as "invalid"

**Expected behavior:**
- Instances in intermediate states should be allowed to transition to cleanup states (`pending_delete`, `error`)
- These are recovery paths, not normal state transitions
- The old code handled this gracefully

### Evidence

**Error message from GARM logs:**
```json
{
  "time": "2026-02-11T01:01:36.326886898Z",
  "level": "ERROR",
  "msg": "failed to handle scale set create operation",
  "error": "error starting scale set worker: updating runner : error updating instance: invalid instance status transition from creating to pending_delete",
  "worker": "scaleset-controller-xxx-xxx-xxx-xxx-xxx",
  "entity": "xxx",
  "endpoint": "github.com"
}
```

**Instances stuck in creating state:**
```sql
sqlite> SELECT status, count(*) FROM instances GROUP BY status;
creating|49
```

**All 49 instances had `provider_id: null`:**
```json
{
  "id": "xxx-xxx-xxx-xxx-xxx",
  "provider_id": null,
  "name": "foo-ubuntu24-small-xxx",
  "status": "creating",
  "created_at": "2026-02-11T00:49:38.408702051Z"
}
```

### Resolution

1. We reverted the Docker image we built to the version of GARM we ran before rebasing onto upstream changes 2 days ago, which does not include the "Add agent mode" commit
2. Rebuilt and redeployed GARM
3. The system recovered immediately, with 105 runners active

### Recommendation

The state transition validation in `database/sql/instances.go` needs to allow recovery transitions:

1. `creating` → `pending_delete` should be allowed when `provider_id` is null (cleanup of failed provisioning)
2. `creating` → `error` should be allowed (mark failed instances)
3. `deleting` → `pending_delete` should be allowed (cleanup of stuck deletions)
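
A minimal sketch of what such a check could look like, assuming a table of normal transitions plus explicit recovery cases. The type, the function name `validateTransition`, the `normalTransitions` table, and the use of an empty string to model `provider_id: null` are all assumptions made for illustration. This is not the code in `database/sql/instances.go`.

```go
package main

import "fmt"

type InstanceStatus string

const (
	StatusCreating      InstanceStatus = "creating"
	StatusDeleting      InstanceStatus = "deleting"
	StatusPendingDelete InstanceStatus = "pending_delete"
	StatusError         InstanceStatus = "error"
)

// normalTransitions is an illustrative placeholder, not GARM's real table.
var normalTransitions = map[InstanceStatus][]InstanceStatus{
	StatusPendingDelete: {StatusDeleting},
}

// validateTransition accepts normal transitions and the three recovery
// transitions listed above. providerID == "" models provider_id: null.
func validateTransition(current, next InstanceStatus, providerID string) error {
	for _, allowed := range normalTransitions[current] {
		if allowed == next {
			return nil
		}
	}
	switch {
	case current == StatusCreating && next == StatusPendingDelete && providerID == "":
		return nil // cleanup of failed provisioning
	case current == StatusCreating && next == StatusError:
		return nil // mark failed instances
	case current == StatusDeleting && next == StatusPendingDelete:
		return nil // cleanup of stuck deletions
	}
	return fmt.Errorf("invalid instance status transition from %s to %s", current, next)
}

func main() {
	fmt.Println(validateTransition(StatusCreating, StatusPendingDelete, ""))
	fmt.Println(validateTransition(StatusCreating, StatusPendingDelete, "i-123"))
	fmt.Println(validateTransition(StatusDeleting, StatusPendingDelete, "i-123"))
}
```

Keeping the recovery cases separate from the normal table makes it obvious which transitions exist only for cleanup after a crash or restart.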

Alternatively, the initialization code should handle these stuck instances differently, perhaps by setting them directly to `error` state before starting the state machine, rather than trying to transition them through the normal flow.
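
The alternative could be sketched as a reconciliation pass that runs once at startup, before any validated transitions happen. The `Instance` struct and `reconcileStuck` function below are hypothetical and work on an in-memory slice. A real implementation would update the database rows directly.

```go
package main

import "fmt"

// Instance is a simplified, hypothetical view of a row in the instances table.
type Instance struct {
	Name       string
	Status     string
	ProviderID string // empty string models provider_id: null
}

// reconcileStuck sets instances left in an intermediate state to "error"
// without going through transition validation. It returns how many
// instances were changed.
func reconcileStuck(instances []Instance) int {
	changed := 0
	for i := range instances {
		switch instances[i].Status {
		case "creating", "deleting":
			instances[i].Status = "error"
			changed++
		}
	}
	return changed
}

func main() {
	instances := []Instance{
		{Name: "runner-a", Status: "creating"},
		{Name: "runner-b", Status: "deleting", ProviderID: "i-123"},
		{Name: "runner-c", Status: "pending_delete", ProviderID: "i-456"},
	}
	n := reconcileStuck(instances)
	fmt.Printf("reconciled %d stuck instance(s)\n", n)
	for _, inst := range instances {
		fmt.Println(inst.Name, inst.Status)
	}
}
```

This keeps the stricter state machine intact for normal operation and confines the special handling to startup.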

Related files and commits:

- `garm/database/sql/instances.go` - State transition validation logic
- `garm/runner/scalesets.go` - Scaleset worker initialization
- Bug introduced in commit 42cfd1b3c6bd54e043170f774f7049e18832abb9 ("Add agent mode")
- Our reverted state: our changes on top of commit 3640235eeb3abf6873448e7c6c85914ec17e4e23 (fixed the problem immediately)

cc @gabriel-samfira

New "Agent mode" code broke scaleset initialization #610