fix(scheduler): auto-cleanup stale running runs on service startup (SO-21)#2
fix(scheduler): auto-cleanup stale running runs on service startup (SO-21)#2
Conversation
Implements session-scoped API keys so each run gets its own key, independent of other runs for the same agent. ## Changes ### internal/db/migrations/018_session_scoped_api_keys.sql (in main) - Added run_id and expires_at columns to api_keys table - Added idx_api_keys_run index ### internal/db/queries.go (in main) - CreateAPIKey(agentID, runID, keyHash, prefix string, ttl) — binds key to run, sets expires_at - GetAgentByAPIKey — checks expires_at in addition to revoked_at - RevokeRunAPIKey(runID) — scoped revocation (not agent-wide) - ExpireStaleAPIKeys() — periodic cleanup for idle timeout enforcement ### internal/scheduler/scheduler.go - provisionAPIKey uses 2h TTL (AC#2: key valid ≥2h) - StartAPIKeyExpiryLoop — periodic sweep to expire stale keys ### internal/db/db_test.go - TestSessionKeyNotRevokedOnNewRun — AC#1: second run doesn't revoke first - TestSessionKeyExpiresAfterIdleTimeout — AC#3: expired key returns 403 - TestLegacyKeyBackwardCompat — AC#4: legacy keys without expires_at still work - TestRevokeRunAPIKeyOnlyAffectsTargetRun — AC#1: run-scoped revocation - TestExpireStaleAPIKeys — periodic cleanup works correctly ## Acceptance Criteria - AC#1 ✅ Starting run B for agent A does NOT revoke run A key - AC#2 ✅ Each run gets its own key, valid for 2h (≥2h) - AC#3 ✅ Expired key returns ErrNoRows (→ 403 via Auth middleware) - AC#4 ✅ Legacy ValidateAPIKey() path unchanged — backward compatible - AC#5 ✅ go build ./... && go test ./... pass - AC#6 → PR filed Closes SO-80
…O-21) - Add CleanupStaleRuns(cutoff time.Duration) to db layer - Queries running runs older than cutoff, updates to failed in one statement - Returns slice of affected run IDs for per-run logging - cutoff=0 matches all running runs (backward-compat for MarkStaleRunsFailed) - Deprecate MarkStaleRunsFailed() as thin wrapper over CleanupStaleRuns(0) - Update RecoverStuckIssues() to use CleanupStaleRuns(10 * time.Minute) - Logs 'cleaned up stale run <run_id>' for every affected run (AC2) - Cutoff of 10 min spares runs from a still-live concurrent process - Cleanup runs before scheduler heartbeat loop and HTTP listener start (AC3) - Add 4 tests: allRunning, cutoffFiltersRecent, noneRunning, backwardCompat - Add artifact-docs/infra/stale-run-cleanup.md go build ./... and go test ./... pass (0 failures)
📝 WalkthroughWalkthroughThis PR introduces session-scoped API keys and stale run cleanup features. It adds comprehensive documentation for API key management with run-based binding and TTL expiry, implements Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
artifact-docs/infra/stale-run-cleanup.md (1)
40-45: Add language specifier to fenced code block.The startup order diagram lacks a language specifier. Adding
textorplaintextsatisfies markdown linting.📝 Suggested fix
-``` +```text RecoverStuckIssues() ← CleanupStaleRuns runs here (line 305) StartHeartbeatLoop() ← scheduler begins accepting work (line 310) StartAPIKeyExpiryLoop() ← (line 313) srv.ListenAndServe() ← HTTP open (line 325) -``` +```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@artifact-docs/infra/stale-run-cleanup.md` around lines 40 - 45, The fenced code block containing the startup order diagram should include a language specifier to satisfy markdown linting; replace the opening triple backticks with a language-tagged fence (e.g., change "```" to "```text" or "```plaintext") for the block that shows RecoverStuckIssues(), StartHeartbeatLoop(), StartAPIKeyExpiryLoop(), and srv.ListenAndServe() so the diagram is treated as plain text.artifact-docs/architecture/SO-80-session-scoped-api-keys.md (1)
19-25: Add language specifier to fenced code block.The lifecycle diagram is formatted as a fenced code block without a language specifier. While this is a text diagram, adding a language identifier (e.g.,
textorplaintext) satisfies markdown linting and improves rendering consistency.📝 Suggested fix
-``` +```text spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID, ttl=2h)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@artifact-docs/architecture/SO-80-session-scoped-api-keys.md` around lines 19 - 25, The fenced code block showing the API key lifecycle should include a language specifier to satisfy markdown linting; update the block that contains "spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID, ttl=2h)" to use a language tag such as "text" or "plaintext" (e.g., ```text) so the diagram renders consistently and the linter passes; the content inside (including references to spawnAgent, provisionAPIKey, CreateAPIKey and the api_keys row fields run_id, expires_at, revoked_at) should remain unchanged.artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md (1)
33-35: Add language specifiers to fenced code blocks.Several code blocks (lines 33, 66, 156) lack language specifiers. Adding appropriate identifiers improves rendering and satisfies markdown linting.
📝 Suggested fixes
Line 33:
-``` +```text so_<32-bytes-hex> (e.g. so_a3f2...d9e1)Line 66:
-``` +```text Run starts → provisionAPIKey(agentID, runID)Line 156:
-``` +```text === RUN TestAPIKeyLifecycle🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` around lines 33 - 35, Add explicit language identifiers to the three fenced code blocks containing the literals "so_<32-bytes-hex> (e.g. so_a3f2...d9e1)", "Run starts → provisionAPIKey(agentID, runID)", and "=== RUN TestAPIKeyLifecycle" so they render correctly and satisfy markdown linting; update each opening triple-backtick to include an appropriate language tag such as text (e.g., ```text) for these blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md`:
- Line 70: Documentation shows CreateAPIKey(..., idleTimeout=60min) but the
implementation calls CreateAPIKey with 2*time.Hour; update the doc to match the
implementation by changing idleTimeout=60min to idleTimeout=2h (or 2 hours) so
the spec reflects the actual behavior; reference the CreateAPIKey call in
scheduler.go that passes 2*time.Hour to confirm the intended TTL.
- Around line 128-132: Update the "Idle Timeout Configuration" section to
reflect the actual default idle timeout used in code (2 hours / 120 minutes)
instead of 60 minutes: change the text that currently says "The default idle
timeout is **60 minutes**" to state "The default idle timeout is **120 minutes
(2 hours)**", and keep the note that the configurable `idleTimeout` argument to
`provisionAPIKey` / `CreateAPIKey` can be changed and that the AC requires ≥ 30
min; ensure the wording aligns with the SO-80 document for consistency.
---
Nitpick comments:
In `@artifact-docs/architecture/SO-80-session-scoped-api-keys.md`:
- Around line 19-25: The fenced code block showing the API key lifecycle should
include a language specifier to satisfy markdown linting; update the block that
contains "spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(...,
runID, ttl=2h)" to use a language tag such as "text" or "plaintext" (e.g.,
```text) so the diagram renders consistently and the linter passes; the content
inside (including references to spawnAgent, provisionAPIKey, CreateAPIKey and
the api_keys row fields run_id, expires_at, revoked_at) should remain unchanged.
In `@artifact-docs/infra/stale-run-cleanup.md`:
- Around line 40-45: The fenced code block containing the startup order diagram
should include a language specifier to satisfy markdown linting; replace the
opening triple backticks with a language-tagged fence (e.g., change "```" to
"```text" or "```plaintext") for the block that shows RecoverStuckIssues(),
StartHeartbeatLoop(), StartAPIKeyExpiryLoop(), and srv.ListenAndServe() so the
diagram is treated as plain text.
In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md`:
- Around line 33-35: Add explicit language identifiers to the three fenced code
blocks containing the literals "so_<32-bytes-hex> (e.g. so_a3f2...d9e1)",
"Run starts → provisionAPIKey(agentID, runID)", and "=== RUN
TestAPIKeyLifecycle" so they render correctly and satisfy markdown linting;
update each opening triple-backtick to include an appropriate language tag such
as text (e.g., ```text) for these blocks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 151a8e92-8c33-44bd-855c-00335afb96e3
📒 Files selected for processing (6)
artifact-docs/architecture/SO-80-session-scoped-api-keys.mdartifact-docs/infra/stale-run-cleanup.mdartifact-docs/tech-specs/SO-72-session-scoped-api-keys.mdinternal/db/db_test.gointernal/db/queries.gointernal/scheduler/scheduler.go
| Run starts → provisionAPIKey(agentID, runID) | ||
| ├── RevokeRunAPIKey(runID) ← revoke any prior key for *this run* only | ||
| ├── generate 32-byte random key | ||
| └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min) |
There was a problem hiding this comment.
Documentation inconsistent with implementation: TTL is 2 hours, not 60 minutes.
Line 70 shows idleTimeout=60min but scheduler.go line 586 passes 2*time.Hour to CreateAPIKey. This creates confusion between the documented and actual behavior.
📝 Suggested fix
- └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
+ └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min) | |
| └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` at line 70,
Documentation shows CreateAPIKey(..., idleTimeout=60min) but the implementation
calls CreateAPIKey with 2*time.Hour; update the doc to match the implementation
by changing idleTimeout=60min to idleTimeout=2h (or 2 hours) so the spec
reflects the actual behavior; reference the CreateAPIKey call in scheduler.go
that passes 2*time.Hour to confirm the intended TTL.
| ## Idle Timeout Configuration | ||
|
|
||
| The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the | ||
| `idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default. | ||
|
|
There was a problem hiding this comment.
Update default idle timeout documentation to match code.
Lines 130-131 state the default is 60 minutes, but the implementation uses 2 hours. The SO-80 document correctly states 2 hours, so this older SO-72 spec should be updated for consistency.
📝 Suggested fix
## Idle Timeout Configuration
-The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
-`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
+The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
+`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## Idle Timeout Configuration | |
| The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the | |
| `idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default. | |
| ## Idle Timeout Configuration | |
| The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the | |
| `idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` around lines 128 -
132, Update the "Idle Timeout Configuration" section to reflect the actual
default idle timeout used in code (2 hours / 120 minutes) instead of 60 minutes:
change the text that currently says "The default idle timeout is **60 minutes**"
to state "The default idle timeout is **120 minutes (2 hours)**", and keep the
note that the configurable `idleTimeout` argument to `provisionAPIKey` /
`CreateAPIKey` can be changed and that the AC requires ≥ 30 min; ensure the
wording aligns with the SO-80 document for consistency.
There was a problem hiding this comment.
Code Review
This pull request introduces session-scoped API keys with a 2-hour expiry and enhances the stale run cleanup mechanism. The CleanupStaleRuns function now accepts a cutoff duration, preventing premature termination of recently started runs, and is integrated into the scheduler's recovery process with a 10-minute cutoff. New tests have been added to cover both features. A review comment points out that the current CleanupStaleRuns implementation is not atomic and could encounter SQLite parameter limits, suggesting an improved approach using the RETURNING clause for atomicity and robustness.
| func (d *DB) CleanupStaleRuns(cutoff time.Duration) ([]string, error) { | ||
| // First collect the IDs we are about to fail so callers can log them individually. | ||
| var ( | ||
| rows *sql.Rows | ||
| err error | ||
| ) | ||
| if cutoff > 0 { | ||
| threshold := time.Now().UTC().Add(-cutoff).Format("2006-01-02 15:04:05") | ||
| rows, err = d.Query( | ||
| `SELECT id FROM runs WHERE status='running' AND started_at < ?`, threshold) | ||
| } else { | ||
| rows, err = d.Query(`SELECT id FROM runs WHERE status='running'`) | ||
| } | ||
| if err != nil { | ||
| return 0, err | ||
| return nil, err | ||
| } | ||
| return res.RowsAffected() | ||
| defer rows.Close() | ||
|
|
||
| var ids []string | ||
| for rows.Next() { | ||
| var id string | ||
| if err := rows.Scan(&id); err != nil { | ||
| return nil, err | ||
| } | ||
| ids = append(ids, id) | ||
| } | ||
| if err := rows.Err(); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| if len(ids) == 0 { | ||
| return nil, nil | ||
| } | ||
|
|
||
| // Build a parameterised IN clause and update in one statement. | ||
| placeholders := strings.Repeat("?,", len(ids)) | ||
| placeholders = placeholders[:len(placeholders)-1] | ||
| args := make([]any, len(ids)) | ||
| for i, id := range ids { | ||
| args[i] = id | ||
| } | ||
| _, err = d.Exec( | ||
| `UPDATE runs SET status='failed', completed_at=datetime('now') WHERE id IN (`+placeholders+`)`, | ||
| args...) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| return ids, nil | ||
| } |
There was a problem hiding this comment.
The current two-step process (SELECT then UPDATE) is not atomic and is subject to SQLite's parameter limit (usually 999) in the IN clause. If there are many stale runs, this function will fail.
Since SQLite 3.35.0, you can use the RETURNING clause to perform the update and retrieve the affected IDs in a single atomic statement. This also avoids the need to manually build the IN clause and handles the time comparison more robustly within the database engine.
func (d *DB) CleanupStaleRuns(cutoff time.Duration) ([]string, error) {
var query string
var args []any
if cutoff > 0 {
query = `UPDATE runs SET status='failed', completed_at=datetime('now')
WHERE status='running' AND started_at < datetime('now', ?)
RETURNING id`
args = append(args, fmt.Sprintf("-%d seconds", int(cutoff.Seconds())))
} else {
query = `UPDATE runs SET status='failed', completed_at=datetime('now')
WHERE status='running'
RETURNING id`
}
rows, err := d.Query(query, args...)
if err != nil {
return nil, err
}
defer rows.Close()
var ids []string
for rows.Next() {
var id string
if err := rows.Scan(&id); err != nil {
return nil, err
}
ids = append(ids, id)
}
return ids, rows.Err()
}
Summary
Fixes SO-21: When the secondorder service is killed mid-run, rows remain stuck at
status=runningforever, clogging the activity feed and blocking the stuck-issue recovery loop.Changes
internal/db/queries.goCleanupStaleRuns(cutoff time.Duration) ([]string, error)status='running'ANDstarted_at < (now - cutoff)status='failed'withcompleted_at=datetime('now')in one statementcutoff=0matches all running runs regardless of ageMarkStaleRunsFailed()as backward-compat wrapper overCleanupStaleRuns(0)internal/scheduler/scheduler.gostaleCutoff = 10 * time.MinuteconstantRecoverStuckIssues()to callCleanupStaleRuns(staleCutoff)scheduler: cleaned up stale run <run_id>per affected run (AC2)StartHeartbeatLoopandListenAndServe(AC3)artifact-docs/infra/stale-run-cleanup.mdAcceptance Criteria
Tests
4 new tests in
internal/db/db_test.go:TestCleanupStaleRunsAllRunning— cutoff=0 fails all running runsTestCleanupStaleRunsCutoffFiltersRecent— 10-min cutoff spares recent run, fails old runTestCleanupStaleRunsNoneRunning— no-op when no running runsTestMarkStaleRunsFailedBackwardCompat— existing wrapper still works