Skip to content

fix(scheduler): auto-cleanup stale running runs on service startup (SO-21)#2

Open
castrojo wants to merge 3 commits intomainfrom
feat/SO-21-stale-run-cleanup-on-startup
Open

fix(scheduler): auto-cleanup stale running runs on service startup (SO-21)#2
castrojo wants to merge 3 commits intomainfrom
feat/SO-21-stale-run-cleanup-on-startup

Conversation

@castrojo
Copy link
Copy Markdown
Owner

@castrojo castrojo commented Apr 9, 2026

Summary

Fixes SO-21: When the secondorder service is killed mid-run, rows remain stuck at status=running forever, clogging the activity feed and blocking the stuck-issue recovery loop.

Changes

internal/db/queries.go

  • Add CleanupStaleRuns(cutoff time.Duration) ([]string, error)
    • Finds runs where status='running' AND started_at < (now - cutoff)
    • Returns slice of affected run IDs so callers can log each one individually
    • Updates matched rows to status='failed' with completed_at=datetime('now') in one statement
    • cutoff=0 matches all running runs regardless of age
  • Deprecate MarkStaleRunsFailed() as backward-compat wrapper over CleanupStaleRuns(0)

internal/scheduler/scheduler.go

  • Add staleCutoff = 10 * time.Minute constant
  • Update RecoverStuckIssues() to call CleanupStaleRuns(staleCutoff)
  • Emit scheduler: cleaned up stale run <run_id> per affected run (AC2)
  • Cleanup runs before StartHeartbeatLoop and ListenAndServe (AC3)

artifact-docs/infra/stale-run-cleanup.md

  • Documents the design, cutoff rationale, startup order, and test coverage

Acceptance Criteria

AC Status
AC1: Runs >10 min old at startup → failed ✅ CleanupStaleRuns(10*time.Minute)
AC2: Log line per stale run ✅ slog.Info cleaned up stale run, run_id
AC3: Cleanup before scheduler accepts new work ✅ Called before heartbeat and HTTP
AC4: Fail-safe mark as failed ✅ All orphaned runs transition to failed
AC5: go build and go test pass ✅ 0 failures

Tests

4 new tests in internal/db/db_test.go:

  • TestCleanupStaleRunsAllRunning — cutoff=0 fails all running runs
  • TestCleanupStaleRunsCutoffFiltersRecent — 10-min cutoff spares recent run, fails old run
  • TestCleanupStaleRunsNoneRunning — no-op when no running runs
  • TestMarkStaleRunsFailedBackwardCompat — existing wrapper still works

castrojo added 3 commits April 9, 2026 07:50
Implements session-scoped API keys so each run gets its own key,
independent of other runs for the same agent.

## Changes

### internal/db/migrations/018_session_scoped_api_keys.sql (in main)
- Added run_id and expires_at columns to api_keys table
- Added idx_api_keys_run index

### internal/db/queries.go (in main)
- CreateAPIKey(agentID, runID, keyHash, prefix string, ttl) — binds key to run, sets expires_at
- GetAgentByAPIKey — checks expires_at in addition to revoked_at
- RevokeRunAPIKey(runID) — scoped revocation (not agent-wide)
- ExpireStaleAPIKeys() — periodic cleanup for idle timeout enforcement

### internal/scheduler/scheduler.go
- provisionAPIKey uses 2h TTL (AC#2: key valid ≥2h)
- StartAPIKeyExpiryLoop — periodic sweep to expire stale keys

### internal/db/db_test.go
- TestSessionKeyNotRevokedOnNewRun — AC#1: second run doesn't revoke first
- TestSessionKeyExpiresAfterIdleTimeout — AC#3: expired key returns 403
- TestLegacyKeyBackwardCompat — AC#4: legacy keys without expires_at still work
- TestRevokeRunAPIKeyOnlyAffectsTargetRun — AC#1: run-scoped revocation
- TestExpireStaleAPIKeys — periodic cleanup works correctly

## Acceptance Criteria
- AC#1 ✅ Starting run B for agent A does NOT revoke run A key
- AC#2 ✅ Each run gets its own key, valid for 2h (≥2h)
- AC#3 ✅ Expired key returns ErrNoRows (→ 403 via Auth middleware)
- AC#4 ✅ Legacy ValidateAPIKey() path unchanged — backward compatible
- AC#5 ✅ go build ./... && go test ./... pass
- AC#6 → PR filed

Closes SO-80
…O-21)

- Add CleanupStaleRuns(cutoff time.Duration) to db layer
  - Queries running runs older than cutoff, updates to failed in one statement
  - Returns slice of affected run IDs for per-run logging
  - cutoff=0 matches all running runs (backward-compat for MarkStaleRunsFailed)
- Deprecate MarkStaleRunsFailed() as thin wrapper over CleanupStaleRuns(0)
- Update RecoverStuckIssues() to use CleanupStaleRuns(10 * time.Minute)
  - Logs 'cleaned up stale run <run_id>' for every affected run (AC2)
  - Cutoff of 10 min spares runs from a still-live concurrent process
- Cleanup runs before scheduler heartbeat loop and HTTP listener start (AC3)
- Add 4 tests: allRunning, cutoffFiltersRecent, noneRunning, backwardCompat
- Add artifact-docs/infra/stale-run-cleanup.md

go build ./... and go test ./... pass (0 failures)
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces session-scoped API keys and stale run cleanup features. It adds comprehensive documentation for API key management with run-based binding and TTL expiry, implements CleanupStaleRuns with cutoff-based filtering, increases API key TTL to 2 hours, and adds extensive test coverage for both features.

Changes

Cohort / File(s) Summary
Architecture & Specification Documentation
artifact-docs/architecture/SO-80-session-scoped-api-keys.md, artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md, artifact-docs/infra/stale-run-cleanup.md
Comprehensive documentation for session-scoped API keys (run-bound with TTL expiry, revocation, and backward compatibility) and stale run cleanup on startup with cutoff-based filtering. Defines schemas, database behaviors, and acceptance criteria.
Database Query Layer
internal/db/queries.go
Deprecated MarkStaleRunsFailed() to delegate to new CleanupStaleRuns(cutoff) method, which selects running runs optionally filtered by startup age, updates them to failed status with completion timestamp, and returns affected run IDs for logging.
Test Suite
internal/db/db_test.go
Added 278 lines of test coverage for API key lifecycle (provisioning, expiry, revocation, backward compatibility) and stale run cleanup behavior (cutoff filtering, no-op cases, cleanup confirmation).
Scheduler Integration
internal/scheduler/scheduler.go
Increased session-scoped API key TTL from 60 minutes to 2 hours; reworked startup recovery to call CleanupStaleRuns(10*time.Minute) instead of MarkStaleRunsFailed() with improved per-run and summary logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 Session keys now bound to runs,
Stale cleanup dawns with cutoff sums,
Two hours of trust, then keys expire,
TTL fresh, what we require! 🔑⏰

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: auto-cleanup of stale running runs on service startup (SO-21), matching the primary objective.
Description check ✅ Passed The description is well-structured and directly related to the changeset, covering the problem statement, all code changes, acceptance criteria, and tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
artifact-docs/infra/stale-run-cleanup.md (1)

40-45: Add language specifier to fenced code block.

The startup order diagram lacks a language specifier. Adding text or plaintext satisfies markdown linting.

📝 Suggested fix
-```
+```text
 RecoverStuckIssues()      ← CleanupStaleRuns runs here (line 305)
 StartHeartbeatLoop()      ← scheduler begins accepting work (line 310)
 StartAPIKeyExpiryLoop()   ← (line 313)
 srv.ListenAndServe()      ← HTTP open (line 325)
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/infra/stale-run-cleanup.md` around lines 40 - 45, The fenced
code block containing the startup order diagram should include a language
specifier to satisfy markdown linting; replace the opening triple backticks with
a language-tagged fence (e.g., change "```" to "```text" or "```plaintext") for
the block that shows RecoverStuckIssues(), StartHeartbeatLoop(),
StartAPIKeyExpiryLoop(), and srv.ListenAndServe() so the diagram is treated as
plain text.
artifact-docs/architecture/SO-80-session-scoped-api-keys.md (1)

19-25: Add language specifier to fenced code block.

The lifecycle diagram is formatted as a fenced code block without a language specifier. While this is a text diagram, adding a language identifier (e.g., text or plaintext) satisfies markdown linting and improves rendering consistency.

📝 Suggested fix
-```
+```text
 spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID, ttl=2h)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/architecture/SO-80-session-scoped-api-keys.md` around lines 19
- 25, The fenced code block showing the API key lifecycle should include a
language specifier to satisfy markdown linting; update the block that contains
"spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID,
ttl=2h)" to use a language tag such as "text" or "plaintext" (e.g., ```text) so
the diagram renders consistently and the linter passes; the content inside
(including references to spawnAgent, provisionAPIKey, CreateAPIKey and the
api_keys row fields run_id, expires_at, revoked_at) should remain unchanged.
artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md (1)

33-35: Add language specifiers to fenced code blocks.

Several code blocks (lines 33, 66, 156) lack language specifiers. Adding appropriate identifiers improves rendering and satisfies markdown linting.

📝 Suggested fixes

Line 33:

-```
+```text
 so_<32-bytes-hex>     (e.g. so_a3f2...d9e1)

Line 66:

-```
+```text
 Run starts → provisionAPIKey(agentID, runID)

Line 156:

-```
+```text
 === RUN   TestAPIKeyLifecycle
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` around lines 33 -
35, Add explicit language identifiers to the three fenced code blocks containing
the literals "so_<32-bytes-hex>     (e.g. so_a3f2...d9e1)", "Run starts →
provisionAPIKey(agentID, runID)", and "=== RUN   TestAPIKeyLifecycle" so they
render correctly and satisfy markdown linting; update each opening
triple-backtick to include an appropriate language tag such as text (e.g.,
```text) for these blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md`:
- Line 70: Documentation shows CreateAPIKey(..., idleTimeout=60min) but the
implementation calls CreateAPIKey with 2*time.Hour; update the doc to match the
implementation by changing idleTimeout=60min to idleTimeout=2h (or 2 hours) so
the spec reflects the actual behavior; reference the CreateAPIKey call in
scheduler.go that passes 2*time.Hour to confirm the intended TTL.
- Around line 128-132: Update the "Idle Timeout Configuration" section to
reflect the actual default idle timeout used in code (2 hours / 120 minutes)
instead of 60 minutes: change the text that currently says "The default idle
timeout is **60 minutes**" to state "The default idle timeout is **120 minutes
(2 hours)**", and keep the note that the configurable `idleTimeout` argument to
`provisionAPIKey` / `CreateAPIKey` can be changed and that the AC requires ≥ 30
min; ensure the wording aligns with the SO-80 document for consistency.

---

Nitpick comments:
In `@artifact-docs/architecture/SO-80-session-scoped-api-keys.md`:
- Around line 19-25: The fenced code block showing the API key lifecycle should
include a language specifier to satisfy markdown linting; update the block that
contains "spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(...,
runID, ttl=2h)" to use a language tag such as "text" or "plaintext" (e.g.,
```text) so the diagram renders consistently and the linter passes; the content
inside (including references to spawnAgent, provisionAPIKey, CreateAPIKey and
the api_keys row fields run_id, expires_at, revoked_at) should remain unchanged.

In `@artifact-docs/infra/stale-run-cleanup.md`:
- Around line 40-45: The fenced code block containing the startup order diagram
should include a language specifier to satisfy markdown linting; replace the
opening triple backticks with a language-tagged fence (e.g., change "```" to
"```text" or "```plaintext") for the block that shows RecoverStuckIssues(),
StartHeartbeatLoop(), StartAPIKeyExpiryLoop(), and srv.ListenAndServe() so the
diagram is treated as plain text.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md`:
- Around line 33-35: Add explicit language identifiers to the three fenced code
blocks containing the literals "so_<32-bytes-hex>     (e.g. so_a3f2...d9e1)",
"Run starts → provisionAPIKey(agentID, runID)", and "=== RUN  
TestAPIKeyLifecycle" so they render correctly and satisfy markdown linting;
update each opening triple-backtick to include an appropriate language tag such
as text (e.g., ```text) for these blocks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 151a8e92-8c33-44bd-855c-00335afb96e3

📥 Commits

Reviewing files that changed from the base of the PR and between 89c022c and dbec72d.

📒 Files selected for processing (6)
  • artifact-docs/architecture/SO-80-session-scoped-api-keys.md
  • artifact-docs/infra/stale-run-cleanup.md
  • artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md
  • internal/db/db_test.go
  • internal/db/queries.go
  • internal/scheduler/scheduler.go

Run starts → provisionAPIKey(agentID, runID)
├── RevokeRunAPIKey(runID) ← revoke any prior key for *this run* only
├── generate 32-byte random key
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Documentation inconsistent with implementation: TTL is 2 hours, not 60 minutes.

Line 70 shows idleTimeout=60min but scheduler.go line 586 passes 2*time.Hour to CreateAPIKey. This creates confusion between the documented and actual behavior.

📝 Suggested fix
-              └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
+              └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` at line 70,
Documentation shows CreateAPIKey(..., idleTimeout=60min) but the implementation
calls CreateAPIKey with 2*time.Hour; update the doc to match the implementation
by changing idleTimeout=60min to idleTimeout=2h (or 2 hours) so the spec
reflects the actual behavior; reference the CreateAPIKey call in scheduler.go
that passes 2*time.Hour to confirm the intended TTL.

Comment on lines +128 to +132
## Idle Timeout Configuration

The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update default idle timeout documentation to match code.

Lines 130-131 state the default is 60 minutes, but the implementation uses 2 hours. The SO-80 document correctly states 2 hours, so this older SO-72 spec should be updated for consistency.

📝 Suggested fix
 ## Idle Timeout Configuration

-The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
-`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
+The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
+`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Idle Timeout Configuration
The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
## Idle Timeout Configuration
The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` around lines 128 -
132, Update the "Idle Timeout Configuration" section to reflect the actual
default idle timeout used in code (2 hours / 120 minutes) instead of 60 minutes:
change the text that currently says "The default idle timeout is **60 minutes**"
to state "The default idle timeout is **120 minutes (2 hours)**", and keep the
note that the configurable `idleTimeout` argument to `provisionAPIKey` /
`CreateAPIKey` can be changed and that the AC requires ≥ 30 min; ensure the
wording aligns with the SO-80 document for consistency.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces session-scoped API keys with a 2-hour expiry and enhances the stale run cleanup mechanism. The CleanupStaleRuns function now accepts a cutoff duration, preventing premature termination of recently started runs, and is integrated into the scheduler's recovery process with a 10-minute cutoff. New tests have been added to cover both features. A review comment points out that the current CleanupStaleRuns implementation is not atomic and could encounter SQLite parameter limits, suggesting an improved approach using the RETURNING clause for atomicity and robustness.

Comment thread internal/db/queries.go
Comment on lines +519 to 567
func (d *DB) CleanupStaleRuns(cutoff time.Duration) ([]string, error) {
// First collect the IDs we are about to fail so callers can log them individually.
var (
rows *sql.Rows
err error
)
if cutoff > 0 {
threshold := time.Now().UTC().Add(-cutoff).Format("2006-01-02 15:04:05")
rows, err = d.Query(
`SELECT id FROM runs WHERE status='running' AND started_at < ?`, threshold)
} else {
rows, err = d.Query(`SELECT id FROM runs WHERE status='running'`)
}
if err != nil {
return 0, err
return nil, err
}
return res.RowsAffected()
defer rows.Close()

var ids []string
for rows.Next() {
var id string
if err := rows.Scan(&id); err != nil {
return nil, err
}
ids = append(ids, id)
}
if err := rows.Err(); err != nil {
return nil, err
}

if len(ids) == 0 {
return nil, nil
}

// Build a parameterised IN clause and update in one statement.
placeholders := strings.Repeat("?,", len(ids))
placeholders = placeholders[:len(placeholders)-1]
args := make([]any, len(ids))
for i, id := range ids {
args[i] = id
}
_, err = d.Exec(
`UPDATE runs SET status='failed', completed_at=datetime('now') WHERE id IN (`+placeholders+`)`,
args...)
if err != nil {
return nil, err
}
return ids, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current two-step process (SELECT then UPDATE) is not atomic and is subject to SQLite's parameter limit (usually 999) in the IN clause. If there are many stale runs, this function will fail.

Since SQLite 3.35.0, you can use the RETURNING clause to perform the update and retrieve the affected IDs in a single atomic statement. This also avoids the need to manually build the IN clause and handles the time comparison more robustly within the database engine.

func (d *DB) CleanupStaleRuns(cutoff time.Duration) ([]string, error) {
	var query string
	var args []any
	if cutoff > 0 {
		query = `UPDATE runs SET status='failed', completed_at=datetime('now') 
				 WHERE status='running' AND started_at < datetime('now', ?) 
				 RETURNING id`
		args = append(args, fmt.Sprintf("-%d seconds", int(cutoff.Seconds())))
	} else {
		query = `UPDATE runs SET status='failed', completed_at=datetime('now') 
				 WHERE status='running' 
				 RETURNING id`
	}

	rows, err := d.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant