Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 92 additions & 0 deletions artifact-docs/architecture/SO-80-session-scoped-api-keys.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# SO-80: Session-scoped API Keys

**Status:** Implemented (PR #3)
**Priority:** HIGH (P2)
**Replaces:** SO-72 (cancelled)

## Problem

Prior to this change, the `api_keys` table had no run binding — `RevokeAPIKeys(agentID)` would invalidate **all** keys for an agent. This meant starting a second run for agent A would revoke the key used by any currently-running instance of agent A, causing:

- Mid-task auth failures (`401 Unauthorized`)
- Agents falling back to SQLite direct writes (fragile workaround)
- Incorrect status updates and orphaned run state

## Solution

Bind each API key to a specific `run_id` instead of just `agent_id`. Key lifecycle:

```
spawnAgent() → provisionAPIKey(agentID, runID) → CreateAPIKey(..., runID, ttl=2h)
api_keys row: run_id = runID
expires_at = now+2h
revoked_at = NULL
```

When a run completes (normally or via timeout): `RevokeRunAPIKey(runID)` only revokes that specific run's key — **other runs are unaffected**.

## Data Model

```sql
-- Migration 018 adds to existing api_keys table:
ALTER TABLE api_keys ADD COLUMN run_id TEXT REFERENCES runs(id);
ALTER TABLE api_keys ADD COLUMN expires_at DATETIME;
CREATE INDEX idx_api_keys_run ON api_keys(run_id);
```

Full schema:
```sql
CREATE TABLE api_keys (
id TEXT PRIMARY KEY,
agent_id TEXT NOT NULL REFERENCES agents(id),
run_id TEXT REFERENCES runs(id), -- NULL for legacy keys
key_hash TEXT NOT NULL UNIQUE,
prefix TEXT NOT NULL,
revoked_at DATETIME, -- NULL = active
expires_at DATETIME, -- NULL = no expiry (legacy)
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```

## DB Functions

| Function | Signature | Notes |
|---|---|---|
| `CreateAPIKey` | `(agentID, runID, keyHash, prefix string, ttl time.Duration) error` | Stores hash, sets `expires_at = now + ttl` |
| `GetAgentByAPIKey` | `(keyHash string) (*Agent, error)` | Checks `revoked_at IS NULL AND (expires_at IS NULL OR expires_at > now)` |
| `RevokeRunAPIKey` | `(runID string) error` | Sets `revoked_at = now` WHERE `run_id = runID` only |
| `ExpireStaleAPIKeys` | `() (int64, error)` | Batch-revoke keys past `expires_at` |

## Auth Middleware

`internal/handlers/api.go — Auth()` calls `GetAgentByAPIKey(keyHash)` which already handles expiry:

- Key found, not revoked, not expired → authenticated as the key's agent ✅
- Key not found / revoked / expired → `ErrNoRows` → `401 Unauthorized` ✅
- Legacy key (no `expires_at`) → `expires_at IS NULL` treated as no expiry → still works ✅

## Scheduler Integration

`internal/scheduler/scheduler.go`:
- `provisionAPIKey(agentID, runID)` called at run start, injects `rawKey` into agent environment
- Key TTL: **2 hours** (satisfies AC#2: ≥2h)
- `StartAPIKeyExpiryLoop(interval)` runs a periodic sweep (recommended: 1 min) to expire stale keys

## Acceptance Criteria

| AC | Description | How Verified |
|---|---|---|
| AC#1 | Starting run B for agent A does NOT revoke run A key | `TestSessionKeyNotRevokedOnNewRun`, `TestRevokeRunAPIKeyOnlyAffectsTargetRun` |
| AC#2 | Each run gets its own key, valid ≥2h | `provisionAPIKey` uses `2*time.Hour`; `TestSessionKeyNotRevokedOnNewRun` |
| AC#3 | Expired key returns 401 | `TestSessionKeyExpiresAfterIdleTimeout` |
| AC#4 | `ValidateAPIKey()` path unchanged | `TestLegacyKeyBackwardCompat` |
| AC#5 | `go build ./...` and `go test ./...` pass | CI / all 8 packages green |
| AC#6 | PR on castrojo/secondorder | PR #3 |

## Files Changed

- `internal/db/migrations/018_session_scoped_api_keys.sql` — schema migration
- `internal/db/queries.go` — `CreateAPIKey`, `GetAgentByAPIKey`, `RevokeRunAPIKey`, `ExpireStaleAPIKeys`
- `internal/scheduler/scheduler.go` — `provisionAPIKey` (2h TTL), `StartAPIKeyExpiryLoop`
- `internal/db/db_test.go` — 5 new tests covering all ACs
58 changes: 58 additions & 0 deletions artifact-docs/infra/stale-run-cleanup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Stale Run Cleanup on Startup (SO-21)

## Problem

When the secondorder service is killed mid-run, rows in the `runs` table remain
stuck at `status=running` indefinitely. This causes:

- Activity feed clutter (running jobs that will never complete)
- The stuck-issue recovery loop attempting to re-wake agents for runs that are
still marked "running" but have no live process

## Solution

`internal/db/queries.go` — `CleanupStaleRuns(cutoff time.Duration)`

- Queries for all `runs` where `status='running'` AND `started_at < (now - cutoff)`
- Returns the slice of matching run IDs so callers can emit per-run log lines
- Updates those runs to `status='failed'` with `completed_at=datetime('now')` in one statement
- A cutoff of 0 matches all running runs regardless of age (used by the backward-compat wrapper)

`internal/scheduler/scheduler.go` — `RecoverStuckIssues()`

- Calls `CleanupStaleRuns(10 * time.Minute)` **before** heartbeat loops and HTTP listener start
- Logs `scheduler: cleaned up stale run <run_id>` for every affected run (AC2)
- Logs a summary `scheduler: stale run cleanup complete count=N` when N > 0

### Cutoff: 10 minutes

Chosen per CEO recommendation. Runs younger than 10 minutes may belong to a
legitimately running process (e.g. a recently restarted concurrent worker).
Runs older than 10 minutes from a crashed process are safe to fail.

## Backward compatibility

`MarkStaleRunsFailed()` is preserved and now delegates to `CleanupStaleRuns(0)`,
maintaining the zero-cutoff ("fail everything") behaviour used in existing tests.

## Startup order (main.go)

```
RecoverStuckIssues() ← CleanupStaleRuns runs here (line 305)
StartHeartbeatLoop() ← scheduler begins accepting work (line 310)
StartAPIKeyExpiryLoop() ← (line 313)
srv.ListenAndServe() ← HTTP open (line 325)
```

AC3 is satisfied: cleanup completes synchronously before any new work is accepted.

## Tests added (`internal/db/db_test.go`)

| Test | What it covers |
|------|---------------|
| `TestCleanupStaleRunsAllRunning` | cutoff=0 fails all running runs, returns all IDs |
| `TestCleanupStaleRunsCutoffFiltersRecent` | 10-min cutoff spares recent run, fails old one |
| `TestCleanupStaleRunsNoneRunning` | no running rows → empty slice, no error |
| `TestMarkStaleRunsFailedBackwardCompat` | MarkStaleRunsFailed wrapper still works |

All pass: `go test ./...` — 0 failures.
169 changes: 169 additions & 0 deletions artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
# SO-72: Session-Scoped API Keys for Agent Runs

**Status:** Implemented
**Priority:** HIGH (P2)
**Parent:** SO-69

---

## Problem

The API key embedded in each agent's system prompt was revoked when **any** new run started for
the same agent. A single `RevokeAPIKeys(agentID)` call wiped all keys regardless of which run
they belonged to. This caused:

- Mid-run 401 failures when a heartbeat or reviewer run spawned concurrently
- Agents falling back to SQLite direct writes (fragile, undocumented)
- Incorrect status updates and orphaned state
- CEO-reported: "The API key embedded in the system prompt gets revoked each session"

---

## Solution: Run-Scoped Key Lifecycle

Each agent **run** receives its own unique API key at spawn time. Keys are:

1. **Bound to a run** (`api_keys.run_id`) — not to the agent globally
2. **Agent-scoped** — the key resolves only to the agent that owns it
3. **Time-limited** — idle timeout enforced via `api_keys.expires_at`
4. **Non-interfering** — revoking run-A's key does not affect run-B (same agent)

### Key Format

```
so_<32-bytes-hex> (e.g. so_a3f2...d9e1)
```

Keys are stored as SHA-256 hashes in the database; only the raw key is transmitted to the agent.

---

## Database Schema (Migration 018)

```sql
ALTER TABLE api_keys ADD COLUMN run_id TEXT REFERENCES runs(id);
ALTER TABLE api_keys ADD COLUMN expires_at DATETIME;
CREATE INDEX IF NOT EXISTS idx_api_keys_run ON api_keys(run_id);
```

### Full `api_keys` table structure

| Column | Type | Description |
|-------------|----------|------------------------------------------------------|
| id | TEXT PK | UUID |
| agent_id | TEXT FK | Agent that owns this key |
| run_id | TEXT FK | Run that this key was issued to (nullable: legacy) |
| key_hash | TEXT | SHA-256 of the raw key |
| prefix | TEXT | First 12 chars of raw key (display only) |
| created_at | DATETIME | |
| revoked_at | DATETIME | Set when key is explicitly revoked |
| expires_at | DATETIME | Set to `now + idle_timeout` at creation |

---

## Key Lifecycle

```
Run starts → provisionAPIKey(agentID, runID)
├── RevokeRunAPIKey(runID) ← revoke any prior key for *this run* only
├── generate 32-byte random key
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Documentation inconsistent with implementation: TTL is 2 hours, not 60 minutes.

Line 70 shows idleTimeout=60min but scheduler.go line 586 passes 2*time.Hour to CreateAPIKey. This creates confusion between the documented and actual behavior.

📝 Suggested fix
-              └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
+              └── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=60min)
└── CreateAPIKey(agentID, runID, keyHash, prefix, idleTimeout=2h)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` at line 70,
Documentation shows CreateAPIKey(..., idleTimeout=60min) but the implementation
calls CreateAPIKey with 2*time.Hour; update the doc to match the implementation
by changing idleTimeout=60min to idleTimeout=2h (or 2 hours) so the spec
reflects the actual behavior; reference the CreateAPIKey call in scheduler.go
that passes 2*time.Hour to confirm the intended TTL.

Key injected into SECONDORDER_API_KEY env var for the agent process

Run completes → RevokeRunAPIKey(runID) ← explicit cleanup

Background cron (every 60s) → ExpireStaleAPIKeys()
└── UPDATE api_keys SET revoked_at=now WHERE expires_at <= now AND revoked_at IS NULL
```

### Validity check (Auth middleware)

```sql
SELECT ...
FROM api_keys k JOIN agents a ON k.agent_id = a.id
WHERE k.key_hash = ?
AND k.revoked_at IS NULL
AND (k.expires_at IS NULL OR k.expires_at > datetime('now'))
```

The `expires_at IS NULL` arm preserves backward compatibility with legacy keys that have no expiry.

---

## Acceptance Criteria — Verification

| AC | Criterion | Status | Test |
|-----|-----------|--------|------|
| AC#1 | New run start for agent A does NOT revoke the API key of a prior still-running run by agent A | ✅ | `TestSessionKeyNotRevokedOnNewRun`, `TestRevokeRunAPIKeyOnlyAffectsTargetRun` |
| AC#2 | Keys are agent-scoped — a key for agent A cannot authenticate as agent B | ✅ | `TestSessionKeyNotRevokedOnNewRun` |
| AC#3 | Keys expire after run ends (idle timeout ≥ 30min) | ✅ | `TestSessionKeyExpiresAfterIdleTimeout`, `TestExpireStaleAPIKeys` (60min default) |
| AC#4 | Existing PATCH/POST/GET endpoints work with session-scoped keys (no API contract changes) | ✅ | All existing handler tests pass |
| AC#5 | Existing long-lived (legacy) keys still work | ✅ | `TestLegacyKeyBackwardCompat` |
| AC#6 | Unit tests for key lifecycle | ✅ | 5 new tests in `internal/db/db_test.go` |

---

## Files Changed

### New
- `internal/db/migrations/018_session_scoped_api_keys.sql` — schema migration

### Modified
- `internal/db/queries.go`
- `CreateAPIKey(agentID, runID, keyHash, prefix string, idleTimeout time.Duration)` — new signature with run binding and expiry
- `GetAgentByAPIKey(keyHash)` — adds `expires_at` check
- `RevokeRunAPIKey(runID)` — replaces `RevokeAPIKeys(agentID)` (run-scoped, not agent-wide)
- `ExpireStaleAPIKeys()` — periodic cleanup, returns rows affected
- `internal/scheduler/scheduler.go`
- `provisionAPIKey(agentID, runID)` — calls `RevokeRunAPIKey` (not agent-wide) + `CreateAPIKey`
- Background ticker: `ExpireStaleAPIKeys()` every 60s
- `internal/db/db_test.go` — 5 new session-scoped key lifecycle tests

### Documentation
- `artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` (this file)

---

## Idle Timeout Configuration

The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.

Comment on lines +128 to +132
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update default idle timeout documentation to match code.

Lines 130-131 state the default is 60 minutes, but the implementation uses 2 hours. The SO-80 document correctly states 2 hours, so this older SO-72 spec should be updated for consistency.

📝 Suggested fix
 ## Idle Timeout Configuration

-The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
-`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
+The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
+`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Idle Timeout Configuration
The default idle timeout is **60 minutes**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 60 min is the current default.
## Idle Timeout Configuration
The default idle timeout is **2 hours**. This is configurable in `provisionAPIKey` — change the
`idleTimeout` argument to `CreateAPIKey`. The AC requires ≥ 30 min; 2 hours is the current default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@artifact-docs/tech-specs/SO-72-session-scoped-api-keys.md` around lines 128 -
132, Update the "Idle Timeout Configuration" section to reflect the actual
default idle timeout used in code (2 hours / 120 minutes) instead of 60 minutes:
change the text that currently says "The default idle timeout is **60 minutes**"
to state "The default idle timeout is **120 minutes (2 hours)**", and keep the
note that the configurable `idleTimeout` argument to `provisionAPIKey` /
`CreateAPIKey` can be changed and that the AC requires ≥ 30 min; ensure the
wording aligns with the SO-80 document for consistency.

Future: expose via `settings` table so operators can adjust without recompile.

---

## Backward Compatibility

Legacy keys (inserted before migration 018, or without `run_id`/`expires_at`) continue to work.
The auth query includes `expires_at IS NULL OR expires_at > datetime('now')` — keys with no
expiry set are treated as valid indefinitely, preserving existing agent configurations.

---

## Testing

```bash
# Run all session-key lifecycle tests
go test ./internal/db/... -run "TestSession|TestLegacy|TestRevokeRun|TestExpireStale|TestAPIKey" -v

# Full suite
go test ./...
```

Expected output (all PASS):
```
=== RUN TestAPIKeyLifecycle
--- PASS: TestAPIKeyLifecycle
=== RUN TestSessionKeyNotRevokedOnNewRun
--- PASS: TestSessionKeyNotRevokedOnNewRun
=== RUN TestSessionKeyExpiresAfterIdleTimeout
--- PASS: TestSessionKeyExpiresAfterIdleTimeout
=== RUN TestLegacyKeyBackwardCompat
--- PASS: TestLegacyKeyBackwardCompat
=== RUN TestRevokeRunAPIKeyOnlyAffectsTargetRun
--- PASS: TestRevokeRunAPIKeyOnlyAffectsTargetRun
=== RUN TestExpireStaleAPIKeys
--- PASS: TestExpireStaleAPIKeys
```
Loading