-
Notifications
You must be signed in to change notification settings - Fork 0
fix(scheduler): auto-cleanup stale running runs on service startup (SO-21) #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| # SO-80: Session-scoped API Keys | ||
|
|
||
| **Status:** Implemented (PR #3) | ||
| **Priority:** HIGH (P2) | ||
| **Replaces:** SO-72 (cancelled) | ||
|
|
||
| ## Problem | ||
|
|
||
| Prior to this change, the `api_keys` table had no run binding — `RevokeAPIKeys(agentID)` would invalidate **all** keys for an agent. This meant starting a second run for agent A would revoke the key used by any currently-running instance of agent A, causing: | ||
|
|
||
| - Mid-task auth failures (`401 Unauthorized`) | ||
| - Agents falling back to SQLite direct writes (fragile workaround) | ||
| - Incorrect status updates and orphaned run state | ||
|
|
||
| ## Solution | ||
|
|
||
| Bind each API key to a specific `run_id` instead of just `agent_id`. Key lifecycle: | ||
|
|
||
| ``` | ||
| spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID, ttl=2h) | ||
| ↓ | ||
| api_keys row: run_id = runID | ||
| expires_at = now+2h | ||
| revoked_at = NULL | ||
| ``` | ||
|
|
||
| When a run completes (normally or via timeout): `RevokeRunAPIKey(runID)` only revokes that specific run's key — **other runs are unaffected**. | ||
|
|
||
| ## Data Model | ||
|
|
||
| ```sql | ||
| -- Migration 018 adds to existing api_keys table: | ||
| ALTER TABLE api_keys ADD COLUMN run_id TEXT REFERENCES runs(id); | ||
| ALTER TABLE api_keys ADD COLUMN expires_at DATETIME; | ||
| CREATE INDEX idx_api_keys_run ON api_keys(run_id); | ||
| ``` | ||
|
|
||
| Full schema: | ||
| ```sql | ||
| CREATE TABLE api_keys ( | ||
| id TEXT PRIMARY KEY, | ||
| agent_id TEXT NOT NULL REFERENCES agents(id), | ||
| run_id TEXT REFERENCES runs(id), -- NULL for legacy keys | ||
| key_hash TEXT NOT NULL UNIQUE, | ||
| prefix TEXT NOT NULL, | ||
| revoked_at DATETIME, -- NULL = active | ||
| expires_at DATETIME, -- NULL = no expiry (legacy) | ||
| created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP | ||
| ); | ||
| ``` | ||
|
|
||
| ## DB Functions | ||
|
|
||
| | Function | Signature | Notes | | ||
| |---|---|---| | ||
| | `CreateAPIKey` | `(agentID, runID, keyHash, prefix string, ttl time.Duration) error` | Stores hash, sets `expires_at = now + ttl` | | ||
| | `GetAgentByAPIKey` | `(keyHash string) (*Agent, error)` | Checks `revoked_at IS NULL AND (expires_at IS NULL OR expires_at > now)` | | ||
| | `RevokeRunAPIKey` | `(runID string) error` | Sets `revoked_at = now` WHERE `run_id = runID` only | | ||
| | `ExpireStaleAPIKeys` | `() (int64, error)` | Batch-revoke keys past `expires_at` | | ||
|
|
||
| ## Auth Middleware | ||
|
|
||
| `internal/handlers/api.go — Auth()` calls `GetAgentByAPIKey(keyHash)` which already handles expiry: | ||
|
|
||
| - Key found, not revoked, not expired → authenticated as the key's agent ✅ | ||
| - Key not found / revoked / expired → `ErrNoRows` → `401 Unauthorized` ✅ | ||
| - Legacy key (no `expires_at`) → `expires_at IS NULL` treated as no expiry → still works ✅ | ||
|
|
||
| ## Scheduler Integration | ||
|
|
||
| `internal/scheduler/scheduler.go`: | ||
| - `provisionAPIKey(agentID, runID)` called at run start, injects `rawKey` into agent environment | ||
| - Key TTL: **2 hours** (satisfies AC#2: ≥2h) | ||
| - `StartAPIKeyExpiryLoop(interval)` runs a periodic sweep (recommended: 1 min) to expire stale keys | ||
|
|
||
| ## Acceptance Criteria | ||
|
|
||
| | AC | Description | How Verified | | ||
| |---|---|---| | ||
| | AC#1 | Starting run B for agent A does NOT revoke run A key | `TestSessionKeyNotRevokedOnNewRun`, `TestRevokeRunAPIKeyOnlyAffectsTargetRun` | | ||
| | AC#2 | Each run gets its own key, valid ≥2h | `provisionAPIKey` uses `2*time.Hour`; `TestSessionKeyNotRevokedOnNewRun` | | ||
| | AC#3 | Expired key returns 401 | `TestSessionKeyExpiresAfterIdleTimeout` | | ||
| | AC#4 | `ValidateAPIKey()` path unchanged | `TestLegacyKeyBackwardCompat` | | ||
| | AC#5 | `go build ./...` and `go test ./...` pass | CI / all 8 packages green | | ||
| | AC#6 | PR on castrojo/secondorder | PR #3 | | ||
|
|
||
| ## Files Changed | ||
|
|
||
| - `internal/db/migrations/018_session_scoped_api_keys.sql` — schema migration | ||
| - `internal/db/queries.go` — `CreateAPIKey`, `GetAgentByAPIKey`, `RevokeRunAPIKey`, `ExpireStaleAPIKeys` | ||
| - `internal/scheduler/scheduler.go` — `provisionAPIKey` (2h TTL), `StartAPIKeyExpiryLoop` | ||
| - `internal/db/db_test.go` — 5 new tests covering all ACs |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| # Stale Run Cleanup on Startup (SO-21) | ||
|
|
||
| ## Problem | ||
|
|
||
| When the secondorder service is killed mid-run, rows in the `runs` table remain | ||
| stuck at `status=running` indefinitely. This causes: | ||
|
|
||
| - Activity feed clutter (running jobs that will never complete) | ||
| - The stuck-issue recovery loop attempting to re-wake agents for runs that are | ||
| still marked "running" but have no live process | ||
|
|
||
| ## Solution | ||
|
|
||
| `internal/db/queries.go` — `CleanupStaleRuns(cutoff time.Duration)` | ||
|
|
||
| - Queries for all `runs` where `status='running'` AND `started_at < (now - cutoff)` | ||
| - Returns the slice of matching run IDs so callers can emit per-run log lines | ||
| - Updates those runs to `status='failed'` with `completed_at=datetime('now')` in one statement | ||
| - A cutoff of 0 matches all running runs regardless of age (used by the backward-compat wrapper) | ||
|
|
||
| `internal/scheduler/scheduler.go` — `RecoverStuckIssues()` | ||
|
|
||
| - Calls `CleanupStaleRuns(10 * time.Minute)` **before** heartbeat loops and HTTP listener start | ||
| - Logs `scheduler: cleaned up stale run <run_id>` for every affected run (AC2) | ||
| - Logs a summary `scheduler: stale run cleanup complete count=N` when N > 0 | ||
|
|
||
| ### Cutoff: 10 minutes | ||
|
|
||
| Chosen per CEO recommendation. Runs younger than 10 minutes may belong to a | ||
| legitimately running process (e.g. a recently restarted concurrent worker). | ||
| Runs older than 10 minutes from a crashed process are safe to fail. | ||
|
|
||
| ## Backward compatibility | ||
|
|
||
| `MarkStaleRunsFailed()` is preserved and now delegates to `CleanupStaleRuns(0)`, | ||
| maintaining the zero-cutoff ("fail everything") behaviour used in existing tests. | ||
|
|
||
| ## Startup order (main.go) | ||
|
|
||
| ``` | ||
| RecoverStuckIssues() ← CleanupStaleRuns runs here (line 305) | ||
| StartHeartbeatLoop() ← scheduler begins accepting work (line 310) | ||
| StartAPIKeyExpiryLoop() ← (line 313) | ||
| srv.ListenAndServe() ← HTTP open (line 325) | ||
| ``` | ||
|
|
||
| AC3 is satisfied: cleanup completes synchronously before any new work is accepted. | ||
|
|
||
| ## Tests added (`internal/db/db_test.go`) | ||
|
|
||
| | Test | What it covers | | ||
| |------|---------------| | ||
| | `TestCleanupStaleRunsAllRunning` | cutoff=0 fails all running runs, returns all IDs | | ||
| | `TestCleanupStaleRunsCutoffFiltersRecent` | 10-min cutoff spares recent run, fails old one | | ||
| | `TestCleanupStaleRunsNoneRunning` | no running rows → empty slice, no error | | ||
| | `TestMarkStaleRunsFailedBackwardCompat` | MarkStaleRunsFailed wrapper still works | | ||
|
|
||
| All pass: `go test ./...` — 0 failures. |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,169 @@ | ||||||||||||||||||
| # SO-72: Session-Scoped API Keys for Agent Runs | ||||||||||||||||||
|
|
||||||||||||||||||
| **Status:** Implemented | ||||||||||||||||||
| **Priority:** HIGH (P2) | ||||||||||||||||||
| **Parent:** SO-69 | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Problem | ||||||||||||||||||
|
|
||||||||||||||||||
| The API key embedded in each agent's system prompt was revoked when **any** new run started for | ||||||||||||||||||
| the same agent. A single `RevokeAPIKeys(agentID)` call wiped all keys regardless of which run | ||||||||||||||||||
| they belonged to. This caused: | ||||||||||||||||||
|
|
||||||||||||||||||
| - Mid-run 401 failures when a heartbeat or reviewer run spawned concurrently | ||||||||||||||||||
| - Agents falling back to SQLite direct writes (fragile, undocumented) | ||||||||||||||||||
| - Incorrect status updates and orphaned state | ||||||||||||||||||
| - CEO-reported: "The API key embedded in the system prompt gets revoked each session" | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Solution: Run-Scoped Key Lifecycle | ||||||||||||||||||
|
|
||||||||||||||||||
| Each agent **run** receives its own unique API key at spawn time. Keys are: | ||||||||||||||||||
|
|
||||||||||||||||||
| 1. **Bound to a run** (`api_keys.run_id`) — not to the agent globally | ||||||||||||||||||
| 2. **Agent-scoped** — the key resolves only to the agent that owns it | ||||||||||||||||||
| 3. **Time-limited** — idle timeout enforced via `api_keys.expires_at` | ||||||||||||||||||
| 4. **Non-interfering** — revoking run-A's key does not affect run-B (same agent) | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Key Format | ||||||||||||||||||
|
|
||||||||||||||||||
| ``` | ||||||||||||||||||
| so_<32-bytes-hex> (e.g. so_a3f2...d9e1) | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| Keys are stored as SHA-256 hashes in the database; only the raw key is transmitted to the agent. | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Database Schema (Migration 018) | ||||||||||||||||||
|
|
||||||||||||||||||
| ```sql | ||||||||||||||||||
| ALTER TABLE api_keys ADD COLUMN run_id TEXT REFERENCES runs(id); | ||||||||||||||||||
| ALTER TABLE api_keys ADD COLUMN expires_at DATETIME; | ||||||||||||||||||
| CREATE INDEX IF NOT EXISTS idx_api_keys_run ON api_keys(run_id); | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Full `api_keys` table structure | ||||||||||||||||||
|
|
||||||||||||||||||
| | Column | Type | Description | | ||||||||||||||||||
| |-------------|----------|------------------------------------------------------| | ||||||||||||||||||
| | id | TEXT PK | UUID | | ||||||||||||||||||
| | agent_id | TEXT FK | Agent that owns this key | | ||||||||||||||||||
| | run_id | TEXT FK | Run that this key was issued to (nullable: legacy) | | ||||||||||||||||||
| | key_hash | TEXT | SHA-256 of the raw key | | ||||||||||||||||||
| | prefix | TEXT | First 12 chars of raw key (display only) | | ||||||||||||||||||
| | created_at | DATETIME | | | ||||||||||||||||||
| | revoked_at | DATETIME | Set when key is explicitly revoked | | ||||||||||||||||||
| | expires_at | DATETIME | Set to `now + idle_timeout` at creation | | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Key Lifecycle | ||||||||||||||||||
|
|
||||||||||||||||||
| ``` | ||||||||||||||||||
| Run starts → provisionAPIKey(agentID, runID) | ||||||||||||||||||
| ├── RevokeRunAPIKey(runID) ← revoke any prior key for *this run* only | ||||||||||||||||||
| ├── generate 32-byte random key | ||||||||||||||||||
| └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min) | ||||||||||||||||||
| ↓ | ||||||||||||||||||
| Key injected into SECONDORDER_API_KEY env var for the agent process | ||||||||||||||||||
|
|
||||||||||||||||||
| Run completes → RevokeRunAPIKey(runID) ← explicit cleanup | ||||||||||||||||||
|
|
||||||||||||||||||
| Background cron (every 60s) → ExpireStaleAPIKeys() | ||||||||||||||||||
| └── UPDATE api_keys SET revoked_at=now WHERE expires_at <= now AND revoked_at IS NULL | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Validity check (Auth middleware) | ||||||||||||||||||
|
|
||||||||||||||||||
| ```sql | ||||||||||||||||||
| SELECT ... | ||||||||||||||||||
| FROM api_keys k JOIN agents a ON k.agent_id = a.id | ||||||||||||||||||
| WHERE k.key_hash = ? | ||||||||||||||||||
| AND k.revoked_at IS NULL | ||||||||||||||||||
| AND (k.expires_at IS NULL OR k.expires_at > datetime('now')) | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| The `expires_at IS NULL` arm preserves backward compatibility with legacy keys that have no expiry. | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Acceptance Criteria — Verification | ||||||||||||||||||
|
|
||||||||||||||||||
| | AC | Criterion | Status | Test | | ||||||||||||||||||
| |-----|-----------|--------|------| | ||||||||||||||||||
| | AC#1 | New run start for agent A does NOT revoke the API key of a prior still-running run by agent A | ✅ | `TestSessionKeyNotRevokedOnNewRun`, `TestRevokeRunAPIKeyOnlyAffectsTargetRun` | | ||||||||||||||||||
| | AC#2 | Keys are agent-scoped — a key for agent A cannot authenticate as agent B | ✅ | `TestSessionKeyNotRevokedOnNewRun` | | ||||||||||||||||||
| | AC#3 | Keys expire after run ends (idle timeout ≥ 30min) | ✅ | `TestSessionKeyExpiresAfterIdleTimeout`, `TestExpireStaleAPIKeys` (60min default) | | ||||||||||||||||||
| | AC#4 | Existing PATCH/POST/GET endpoints work with session-scoped keys (no API contract changes) | ✅ | All existing handler tests pass | | ||||||||||||||||||
| | AC#5 | Existing long-lived (legacy) keys still work | ✅ | `TestLegacyKeyBackwardCompat` | | ||||||||||||||||||
| | AC#6 | Unit tests for key lifecycle | ✅ | 5 new tests in `internal/db/db_test.go` | | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Files Changed | ||||||||||||||||||
|
|
||||||||||||||||||
| ### New | ||||||||||||||||||
| - `internal/db/migrations/018_session_scoped_api_keys.sql` — schema migration | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Modified | ||||||||||||||||||
| - `internal/db/queries.go` | ||||||||||||||||||
| - `CreateAPIKey(agentID, runID, keyHash, prefix string, idleTimeout time.Duration)` — new signature with run binding and expiry | ||||||||||||||||||
| - `GetAgentByAPIKey(keyHash)` — adds `expires_at` check | ||||||||||||||||||
| - `RevokeRunAPIKey(runID)` — replaces `RevokeAPIKeys(agentID)` (run-scoped, not agent-wide) | ||||||||||||||||||
| - `ExpireStaleAPIKeys()` — periodic cleanup, returns rows affected | ||||||||||||||||||
| - `internal/scheduler/scheduler.go` | ||||||||||||||||||
| - `provisionAPIKey(agentID, runID)` — calls `RevokeRunAPIKey` (not agent-wide) + `CreateAPIKey` | ||||||||||||||||||
| - Background ticker: `ExpireStaleAPIKeys()` every 60s | ||||||||||||||||||
| - `internal/db/db_test.go` — 5 new session-scoped key lifecycle tests | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Documentation | ||||||||||||||||||
| - `artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` (this file) | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Idle Timeout Configuration | ||||||||||||||||||
|
|
||||||||||||||||||
| The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the | ||||||||||||||||||
| `idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default. | ||||||||||||||||||
|
|
||||||||||||||||||
|
Comment on lines
+128
to
+132
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Update default idle timeout documentation to match code. Lines 130-131 state the default is 60 minutes, but the implementation uses 2 hours. The SO-80 document correctly states 2 hours, so this older SO-72 spec should be updated for consistency. 📝 Suggested fix ## Idle Timeout Configuration
-The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
-`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
+The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
+`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents |
||||||||||||||||||
| Future: expose via `settings` table so operators can adjust without recompile. | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Backward Compatibility | ||||||||||||||||||
|
|
||||||||||||||||||
| Legacy keys (inserted before migration 018, or without `run_id`/`expires_at`) continue to work. | ||||||||||||||||||
| The auth query includes `expires_at IS NULL OR expires_at > datetime('now')` — keys with no | ||||||||||||||||||
| expiry set are treated as valid indefinitely, preserving existing agent configurations. | ||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Testing | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| # Run all session-key lifecycle tests | ||||||||||||||||||
| go test ./internal/db/... -run "TestSession|TestLegacy|TestRevokeRun|TestExpireStale|TestAPIKey" -v | ||||||||||||||||||
|
|
||||||||||||||||||
| # Full suite | ||||||||||||||||||
| go test ./... | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| Expected output (all PASS): | ||||||||||||||||||
| ``` | ||||||||||||||||||
| === RUN TestAPIKeyLifecycle | ||||||||||||||||||
| --- PASS: TestAPIKeyLifecycle | ||||||||||||||||||
| === RUN TestSessionKeyNotRevokedOnNewRun | ||||||||||||||||||
| --- PASS: TestSessionKeyNotRevokedOnNewRun | ||||||||||||||||||
| === RUN TestSessionKeyExpiresAfterIdleTimeout | ||||||||||||||||||
| --- PASS: TestSessionKeyExpiresAfterIdleTimeout | ||||||||||||||||||
| === RUN TestLegacyKeyBackwardCompat | ||||||||||||||||||
| --- PASS: TestLegacyKeyBackwardCompat | ||||||||||||||||||
| === RUN TestRevokeRunAPIKeyOnlyAffectsTargetRun | ||||||||||||||||||
| --- PASS: TestRevokeRunAPIKeyOnlyAffectsTargetRun | ||||||||||||||||||
| === RUN TestExpireStaleAPIKeys | ||||||||||||||||||
| --- PASS: TestExpireStaleAPIKeys | ||||||||||||||||||
| ``` | ||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation inconsistent with implementation: TTL is 2 hours, not 60 minutes.
Line 70 shows
idleTimeout=60minbutscheduler.goline 586 passes2*time.HourtoCreateAPIKey. This creates confusion between the documented and actual behavior.📝 Suggested fix
📝 Committable suggestion
🤖 Prompt for AI Agents