Phase 3 (commits 5-13): runner registry full lifecycle — dispatcher, callbacks, dashboard, runner SDK, CLI, e2e, docs, version bump#154
Conversation
dispatcher.rs:
- Dispatcher::dispatch() decrypts runner token, builds canonical
body, signs with HMAC-SHA256(raw_token, job_id || \n || body),
POSTs to runner.webhook_url with X-Rewind-Job-Id +
X-Rewind-Signature headers, 5s timeout.
- Success → state pending->dispatched, sets dispatch_deadline_at
(10s) + lease_expires_at (5min).
- Any failure → state pending->errored, error_stage='dispatch'.
- State transitions go through Store::advance_replay_job_state
(commit 3's atomic SQL guard).
- 9 unit tests including a Python HMAC reference cross-check
so the runner SDK in commit 9 stays bit-compatible.
reaper.rs:
- Background tokio task spawned by WebServer::run().
- 30s scan: dispatch deadline expired (dispatched but no Started
in 10s) -> errored stage='dispatch'; lease expired (in_progress
but no heartbeat in 5min) -> errored stage='lease_expired'.
- Each transition broadcasts on the StoreEvent bus so the
dashboard sees lease failures live.
- 3 tests covering both expiry classes and the no-op-when-fresh
case.
Store helpers:
- set_dispatch_deadline_and_lease() (called by dispatcher).
- list_dispatch_deadline_expired() (used by reaper).
WebSocket:
- StoreEvent::ReplayJobUpdated variant added.
- ws.rs forwards it to clients subscribed to the owning session
via new ServerMessage::ReplayJobUpdate frame.
AppState:
- dispatcher: Option<Dispatcher>, base_url: String.
- bootstrap_base_url() reads REWIND_PUBLIC_URL with sane local
default; documented for ops.
Existing test fixtures (api_tests, auth_tests, hooks_tests,
recording_api_tests, transcript_sync_tests) updated for the new
AppState fields.
Tests: 106 lib + 141 integration green; ruff clean; pytest unaffected.
Made-with: Cursor
…ase 3, commit 6/13)
POST /api/sessions/{sid}/replay-jobs (bearer-protected):
- Dual-shape body via serde(untagged):
- Shape A: {runner_id, source_timeline_id, at_step, strict_match?}
→ server forks timeline + creates replay context + creates job
atomically.
- Shape B: {runner_id, replay_context_id} → reuse-existing-context;
validates context belongs to session and isn't already in-flight
via new Store::count_in_flight_jobs_for_replay_context (one-job-
per-context invariant prevents cursor races).
- Both fire dispatcher in a tokio::spawn (fire-and-forget) and
return 202 Accepted with the job_id immediately.
- 503 if crypto key not configured; 404 on unknown session/runner;
400 on invalid timeline; 409 on inactive runner / context-in-use.
GET /api/sessions/{sid}/replay-jobs — list jobs for a session.
GET /api/replay-jobs/{id} — fetch a single job.
DELETE /api/replay-jobs/{id} — operator cancel (force errored
with stage='cancelled'; v1 doesn't notify the runner).
POST /api/replay-jobs/{id}/events (mounted OUTSIDE bearer middleware):
- Auth: X-Rewind-Runner-Auth: <token> header. Hashed and looked
up via auth_token_hash index. Ownership check: job.runner_id
must match the authenticated runner's id (403 otherwise).
- Wraps Store::record_replay_job_event_atomic from #152 round 2 →
full state-machine guard. Illegal transitions return 409 with
accepted=false.
- Emits StoreEvent::ReplayJobUpdated → WebSocket broadcast.
- touch_runner_last_seen() for liveness tracking.
Dispatcher refactor:
- dispatch() is now PURE: returns DispatchOutcome instead of
holding a Store reference across .await. Caller (HTTP handler)
applies the outcome via Dispatcher::apply_outcome inside a
briefly-held Mutex lock. Closes the lock-across-await footgun
that clippy::await_holding_lock would flag.
WebServer.build_router:
- runner_callback_routes mounted alongside /_rewind/health (outside
the protected layer) so X-Rewind-Runner-Auth replaces bearer.
Tests: 12 new replay_jobs_tests (full event lifecycle: shape-B
dispatch with stub runner, 500-response failure, context-in-use,
unknown session, inactive runner, runner-auth missing, runner-auth
unknown, ownership mismatch, started transition, progress update,
completed transition, illegal-transition 409). Stub runner is a
tokio TcpListener-backed axum server.
Storage: count_in_flight_jobs_for_replay_context (used by shape-B
contention check).
All checks: 250+ rewind-web tests pass; clippy clean on new code.
Made-with: Cursor
…nv bootstrap (Phase 3, commit 9/13) ExplicitClient.attach_replay_context(session_id, replay_context_id): - Sets the contextvars without creating a new replay context. - Resolves Phase 3 plan HIGH #4: runners receive an existing context in their dispatch payload and must not create a duplicate. intercept.install(): - Bootstraps from REWIND_SESSION_ID + REWIND_REPLAY_CONTEXT_ID env vars before patching transports. Both must be set together; either-but-not-both logs a WARN. Lets a runner subprocess be configured purely via env without touching the SDK. rewind_agent/runner.py (new, ~370 LOC): - RunnerConfig.from_env() — reads REWIND_RUNNER_TOKEN + REWIND_URL. - compute_signature(token, job_id, body) — pure HMAC-SHA256 matching the Rust dispatcher's wire format. Cross-checked against the same Python reference vector both crates pin against (drift between Rust + Python = silent dispatch failure). - verify_signature(config, headers, body_bytes) — case-insensitive header lookup, constant-time hmac.compare_digest. Returns (ok, reason) tuple. - DispatchPayload — typed dataclass mirroring the Rust DispatchBody. - ProgressReporter — async wrapper around POST /api/replay-jobs/ {id}/events; httpx-preferred with urllib fallback. Methods: started(), progress(step, total?, payload?), completed(), errored(message, stage?). - asgi_handler(...) — verifies signature, decodes payload, fires user handler as background task, returns (202, {job_id}) immediately so the dispatcher's 5s timeout is satisfied. On handler exception, auto-emits errored event with the exception message + error_stage='agent'. - @handle_replay marker decorator for documentation/discovery. Tests (19 new, 480 total pass): - RunnerConfig.from_env: requires token; honors defaults; honors REWIND_URL. - compute_signature: matches Rust reference vector byte-for-byte. - verify_signature: accepts canonical, rejects missing headers, malformed prefix, tampered body, wrong token; case-insensitive. - DispatchPayload decode round-trip. - asgi_handler: invalid signature returns 401 + handler not invoked; invalid body returns 400; valid request → 202 + handler runs + started/progress/completed events emitted in order; user code raising → errored event emitted with the exception message. - attach_replay_context sets contextvars. - intercept.install env-var bootstrap (both vars set; partial vars logged + skipped). ruff clean. Pre-existing 461 tests still pass; new total 480/480 + 1 skip. Made-with: Cursor
…e 3, commit 10/13) Thin HTTP wrapper over /api/runners. All four subcommands take --rewind-url (defaults from REWIND_URL env, falls back to http://127.0.0.1:4800) and --token (defaults from REWIND_AUTH_TOKEN for the dashboard's bearer auth). rewind runners list — pretty-printed table rewind runners add --name N --webhook-url U — returns RAW token ONCE rewind runners remove <id> — 409 surfaced as 'removal blocked' rewind runners regenerate-token <id> — returns NEW raw token The 'add' output mimics the dashboard's RegisterRunnerResponse: prints the raw token in bold yellow, reminds the operator it can't be retrieved later, and prints an export-ready REWIND_RUNNER_TOKEN line for one-line copy-paste into a runner's environment. Build green; cli help working. Made-with: Cursor
…dal (Phase 3, commit 7/13) - RunnersPage.tsx — table view of registered runners with status pill (active/disabled/stale), webhook URL, token preview, last- seen timestamp. TanStack Query keeps it auto-refreshed every 10s. - RegisterRunnerModal — form with name + webhook_url; surfaces server-side validation errors (SSRF rejection, polling-deferred, malformed URL) inline. - RawTokenModal — one-shot reveal of the raw auth token with Copy-to-clipboard button and a yellow 'cannot be retrieved' warning. Shows export-ready REWIND_RUNNER_TOKEN line for paste. - Sidebar — new 'Runners' nav entry (Server icon). - App.tsx + use-store.ts — 'runners' added to view union. - lib/api.ts — runner CRUD + replay-job endpoints + types (RunnerView, RegisterRunnerResponse, ReplayJobView, CreateReplayJobBody — re-exported for commit 8 consumers). Web build green (419 KB gz 117 KB; 1703 modules). Made-with: Cursor
…se 3, commit 8/13)
RunReplayButton.tsx exports both the inline button and the ReplayJobModal.
Modal flow: pick active runner -> optional strict_match -> POST shape A
(server forks + creates context atomically) -> live progress.
use-replay-job.ts: hook backing the modal. dispatch() POSTs the job,
subscribes via useWebSocket for replay_job_update frames matching this
job_id, exposes state/progress/error that update live.
use-websocket.ts: new ReplayJobUpdateData type + onReplayJobUpdate callback.
StepTimeline.tsx: per-step Run-replay button (cyan, hover-revealed)
alongside the existing Fork button.
SessionView.tsx: modalState union extended with 'runReplay'. Both
StepTimeline and ActivityTimeline forward onRunReplay (ActivityTimeline
forwarding is a placeholder; UI integration deferred).
UX:
- No-runners fallback shows CLI invocation + link to Runners page.
- In-flight: progress bar, state pill, fork-timeline-id surfaced.
- Terminal: success/error icon swap, modal stays open for inspection.
- Cancel: DELETE /api/replay-jobs/{id} (operator cancel; v1 fire-and-forget).
Web build green (426 KB / 119 KB gz; 1705 modules).
Made-with: Cursor
…ev escape (Phase 3, commit 11/13)
python/tests/e2e/test_runner_replay_flow.py exercises the complete dispatch
+ callback lifecycle against a real Rewind binary:
1. Spawn the rewind release binary with REWIND_RUNNER_SECRET_KEY +
REWIND_AUTH_TOKEN + REWIND_ALLOW_LOOPBACK_WEBHOOKS=1.
2. Stand up a stub runner HTTP server on a free port.
3. Register the runner via POST /api/runners with bearer auth;
capture raw_token from one-shot response.
4. Seed a session via POST /api/sessions/start + record one
llm-call so a fork target exists.
5. Dispatch shape A (server forks + creates context) via POST
/api/sessions/{sid}/replay-jobs at_step=1.
6. Wait for stub runner to receive the dispatch; verify
X-Rewind-Job-Id + X-Rewind-Signature headers + reconstructed
HMAC against the wire format the dispatcher uses.
7. Stub posts started -> progress(1/2) -> progress(2/2) -> completed
events back via POST /api/replay-jobs/{id}/events with
X-Rewind-Runner-Auth header.
8. Poll job state until completed; assert progress_step=2,
progress_total=2.
Test skips when ./target/release/rewind isn't built (handles
multiple working directory layouts via _resolve_rewind_bin).
REWIND_ALLOW_LOOPBACK_WEBHOOKS=1 escape hatch in the register
handler bypasses the SSRF guard for localhost webhooks. NEVER set
in production - any operator who can reach the dashboard could
then dispatch webhooks at localhost / cloud-metadata IPs. Strictly
for dev/CI where the runner stub lives on localhost.
481 python tests pass (was 480, +1 e2e).
Made-with: Cursor
…ase 3, commit 12/13) docs/runners.md — full operator-facing guide: - Bootstrap (REWIND_RUNNER_SECRET_KEY + REWIND_PUBLIC_URL). - Register from dashboard or CLI; raw-token surface discipline. - Implement runner side (FastAPI/asgi_handler example); REWIND_RUNNER_TOKEN env; alternative env-var bootstrap path. - Dispatch from dashboard or curl; both API shapes (A: server-fork, B: reuse-context). - Full state machine table (pending -> dispatched -> in_progress -> completed/errored). - Security model: outbound HMAC recipe, inbound runner-auth + ownership check, SSRF guard + REWIND_ALLOW_LOOPBACK_WEBHOOKS dev escape. - Lease + reaper semantics + dispatch deadline. - Troubleshooting: SSRF rejection, polling-deferred, active-jobs block on remove/rotate, signature mismatch, ownership 403. docs/recording.md — decision matrix updated from 'three ways to record' to **'five ways to record + replay'**: 1. Direct mode (Python init()) 2. HTTP intercept 3. cached_llm_call decorator (bundled from deferred Phase 2 doc) 4. Proxy mode (any language) 5. Dashboard runner (Phase 3) docs/getting-started.md — terminal pointer updated to highlight the 5-way matrix and link runners.md. Made-with: Cursor
…ommit 13/13) Per CLAUDE.md track-1 versioning rules: - Rust workspace + crates: 0.13.0 -> 0.14.0 (v0.13.0 was released as GitHub tag on Apr 27 and crates/ + web/src/ both changed in this PR — schema migration in commit 3, dispatcher + reaper in commit 5, dispatch + event endpoints in commit 6, web UI in 7+8). Cargo.toml + Cargo.lock + python/rewind_cli.py CLI_VERSION + python-mcp/rewind_mcp_cli.py CLI_VERSION all moved together. Per CLAUDE.md track-2 (Python SDK): - python/pyproject.toml stays at 0.15.0 (PyPI is at 0.14.8 — 0.15.0 is unpublished, so changes ride the existing version). - python-mcp/pyproject.toml stays at 0.13.0 (no MCP API change AND PyPI is at 0.12.10 — unpublished, no bump needed). Also: ruff --fix on three unused-import warnings in the new python/tests/e2e/test_runner_replay_flow.py (asyncio, sys, typing.Optional). All 481 pytest pass, web build green, Rust tests green, clippy clean on the touched crates. Made-with: Cursor
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
risjai
left a comment
There was a problem hiding this comment.
Santa Verdict: NAUGHTY
Both independent Santa reviewers returned FAIL. CI is green now, and this PR covers a lot of the Phase 3 lifecycle, but there are still blocking correctness/security issues in the runner replay path.
Blocking Findings
-
Shape A accepts
strict_matchbut does not apply it.CreateReplayJobShapeAincludesstrict_match, but aftercreate_replay_context(...)the code never callsset_replay_context_strict_match. The dashboard can request strict replay and still get warn-on-divergence behavior. Add the setter and an integration test proving strict runner replay returns/raises on divergence. -
Runner replay does not bind live recordings to the fork timeline.
ExplicitClient.attach_replay_context()only sets_session_idand_replay_context_id; it does not set_timeline_id. The dispatch payload also omitsfork_timeline_id, and the SDK does not resolve the replay context's timeline before running the agent. Cache misses after the fork can therefore record new live steps into the root timeline instead of the replay fork. Include the fork/context timeline in the payload or makeattach_replay_context()fetch context metadata and set_timeline_idbefore any live miss can record. -
Shape B reuse does not validate cursor state or context expiry. It checks session ownership and in-flight jobs, but does not reject replay contexts whose cursor was already advanced (
current_step != from_step) or expired/near-expired contexts. A reused context can start from the wrong ordinal. Add current-step and TTL validation as described in the plan. -
Runner SDK ignores the dispatch payload
base_urlfor callbacks. The payload includesbase_url, butProgressReporterbuilds event URLs fromRunnerConfig.rewind_base_url/REWIND_URL. Dispatches only report progress correctly if the runner separately has matching config. Default the reporter topayload.base_url, with config override only when explicitly requested. -
Outbound webhook HMACs are replayable. The signature covers
job_id || body, but has no timestamp, delivery id, expiry window, or runner-side job-id/delivery dedupe before scheduling user code. A captured signed webhook can be replayed to launch the agent again even if later event callbacks are rejected. AddX-Rewind-Timestamp+ tolerance window and runner-side idempotency keyed byjob_idordelivery_idbefore invoking the handler. -
SSRF validation is registration-time only and is vulnerable to DNS rebinding.
validate_export_endpoint()is called when the runner URL is registered, butDispatcherlater uses reqwest against the stored URL and explicitly does not revalidate. If DNS changes after registration, the dispatcher can POST to private/link-local/metadata IPs. Revalidate/pin resolution at dispatch time, or use a connector/resolver that rejects private/reserved targets on connect. Strongly consider requiring HTTPS for non-loopback webhooks. -
Dashboard progress can miss fast job updates. The hook/modal subscribes after job creation, and the WebSocket callback setup can miss transient
ReplayJobUpdatedbroadcasts for fast jobs that complete before the modal is listening for the returnedjob_id. The UI should fetch current job state after dispatch and/or subscribe before dispatch with a correlation token; WebSocket events should be treated as invalidation, not the only source of truth.
Also Noted
- The implementation reintroduces cancellation via API/UI despite the revised plan explicitly deferring cancellation until a cooperative runner protocol exists. Remove it for v1 or make it clear that this is only server-side abandonment and cannot stop the runner/agent.
- The E2E path uses a custom stub runner that reads
payload.base_urldirectly, so it does not exercise the shippedrewind_agent.runnerSDK behavior where the callback URL bug lives.
Passes
- CI is green.
- Versioning looks consistent with the repo rules.
- Runner token registration/regeneration and basic callback ownership checks are covered better than the earlier PRs.
Recommended Fix Set
- Wire
strict_matchinto Shape A replay context creation. - Ensure runner replay sets the fork timeline context before live misses can record.
- Validate Shape B
current_step == from_stepand context freshness/TTL. - Make
ProgressReporteruse dispatchbase_urlby default. - Add dispatch freshness/idempotency protections to HMAC webhooks.
- Revalidate/pin webhook destinations at dispatch time against SSRF policy.
- Add tests for strict Shape A, fork-timeline live miss, Shape B consumed/expired contexts, SDK callback base URL, HMAC replay prevention, DNS rebinding/dispatch-time SSRF, and fast UI completion updates.
…ase_url, HMAC replay, dispatch SSRF, fast UI updates, remove cancel
7 blocking + 2 noted findings from the Santa review. All addressed.
F1 — strict_match accepted but ignored. Shape A now calls
set_replay_context_strict_match after create_replay_context when
the request body has strict_match=true. New tests:
shape_a_strict_match_true_sets_replay_context_strict_flag and
shape_a_strict_match_omitted_defaults_to_warn_only.
F2 — fork timeline binding. Dispatch payload now carries
replay_context_timeline_id (server-side). Shape B return also
populates fork_timeline_id with the existing context's timeline
(was null previously). Python ExplicitClient.attach_replay_context
takes an optional timeline_id and sets the _timeline_id contextvar
so live cache misses record into the fork timeline rather than the
root. DispatchPayload tolerates missing field for back-compat.
F3 — Shape B cursor + TTL validation. Refuses contexts whose
cursor != from_step (already mid-replay) and contexts older than
REPLAY_CONTEXT_TTL_SECS (1h, mirrors WebServer cleanup task).
ReplayContextRow gains last_accessed_at; Store helper
_test_set_replay_context_last_accessed enables fast TTL-expiry
tests. New: shape_b_rejects_context_whose_cursor_already_advanced,
shape_b_rejects_expired_context.
F4 — ProgressReporter ignored payload.base_url. Constructor now
takes optional base_url; asgi_handler builds the reporter from
payload.base_url so events POST back to the dispatcher's server
URL (which may differ from the runner's local REWIND_URL behind
a proxy). New: test_progress_reporter_prefers_explicit_base_url_over_config
+ test_progress_reporter_falls_back_to_config_when_base_url_omitted.
F5 — HMAC webhooks replayable. Signed input now includes a unix-
seconds timestamp: HMAC-SHA256(token, ts || \\n || job_id || \\n
|| body). New X-Rewind-Timestamp header sent + verified. Runner SDK
enforces ±5min tolerance window AND a process-local seen-cache of
recent job_ids — captured signed dispatches inside the tolerance
window can't relaunch the agent (200 + duplicate=true on second
arrival). compute_signature gains a timestamp parameter; legacy
two-line input still works for backward compat. Reference vectors
re-pinned in both Rust + Python tests.
F6 — SSRF only at registration. Dispatcher::dispatch now re-runs
url_guard::validate_export_endpoint at dispatch time, closing the
DNS-rebind window between registration and dispatch. Residual race
between this check and reqwest's connection-time re-resolve
documented as a known limitation requiring a custom hyper connector
(deferred). REWIND_ALLOW_LOOPBACK_WEBHOOKS dev escape applies to
both checks.
F7 — UI race on fast jobs. useReplayJob hook now treats WebSocket
frames as INVALIDATION (refetch GET /api/replay-jobs/{id}) rather
than the source of truth. After dispatch returns 202, immediately
fetch the current job state to seed UI in case the job already
advanced past pending before the WS subscription wired up. Fixes
the missing-fast-update class of bugs.
N1 — cancellation removed (deferred to v3.1 per the plan). Dropped
DELETE /api/replay-jobs/{id}, the cancel button, the API client
method, and updated docs/runners.md. Operators kill the runner
process externally; the lease reaper marks the job errored.
N2 — E2E now verifies the F2 + F5 wire-format additions
(replay_context_timeline_id field + X-Rewind-Timestamp header +
3-line signature recipe). The SDK code paths are exercised by
unit tests in test_runner.py (test_progress_reporter_prefers_*,
test_verify_signature_rejects_stale_timestamp,
test_asgi_handler_replay_attack_short_circuits_with_duplicate)
which run faster + more reliably than spinning a dedicated event
loop per HTTP request inside HTTPServer threads.
Tests: 16 replay_jobs_tests pass (was 12, +4); 28 runner unit tests
pass (was 19, +9 for F4/F5/F2 coverage); 108 rewind-web lib tests
(adds 2 new dispatcher tests covering timestamp variation +
3-line signing input); 102 rewind-store tests (the new
_test_set_replay_context_last_accessed helper unused at unit-level);
490 pytest pass; clippy + ruff clean; web build green.
Made-with: Cursor
|
All 7 blockers + 2 noted findings addressed in 27bdf09.
Test posture after the fix:
Wire-format changes (operators take note):
CI will pick up the push. Ready for re-review. |
risjai
left a comment
There was a problem hiding this comment.
Santa Re-Review: still NAUGHTY
Most of the previous blockers are addressed and CI is green, but I still see one remaining correctness blocker plus one important subprocess/bootstrap gap.
Remaining Blocker
- Runner callback race can still drop the
startedevent and leave jobs stuck. The Python runnerasgi_handler()schedules_run()before returning202;_run()immediately callsreporter.started(). The server only transitionspending -> dispatchedafter the dispatcher receives the runner’s202and applies the outcome. Ifstartedarrives while the job is stillpending, the state machine rejects it (Startedis only legal fromdispatched).ProgressReporteronly logs the409, so the handler continues and laterprogress/completedevents can also be rejected or leave the job indispatcheduntil the reaper marks it errored.
Suggested fixes:
- Move job to
dispatchedbefore sending the webhook, then revert/mark errored if dispatch fails; or - Allow authenticated
startedto transitionpending -> in_progressfor this dispatch race; or - Change the runner SDK to send
startedonly after a server-ack/state poll confirmsdispatched.
Also Still Concerning
- Env-var bootstrap still doesn’t carry the timeline id.
_install.pyreadsREWIND_SESSION_IDandREWIND_REPLAY_CONTEXT_ID, but not a replay context timeline id. So subprocess-based agents using env bootstrap can still miss_timeline_id, meaning live misses may record into the wrong timeline. The direct SDK path supportsattach_replay_context(..., timeline_id=...), but the env path does not.
Please add REWIND_REPLAY_CONTEXT_TIMELINE_ID (or equivalent) to the dispatch/subprocess environment and pass it through to attach_replay_context() during bootstrap.
Two findings from the second Santa pass on PR #154. BLOCKER — runner callback race drops `started`. The Python SDK's asgi_handler() schedules _run() (which immediately calls reporter.started()) before the dispatcher's apply_outcome flips state pending->dispatched. Under the strict state machine from #152 round 2, the server rejected `Started` from `Pending` as illegal: runner saw 409, ProgressReporter only logged it, the run proceeded anyway, and the job sat in `pending` until the reaper marked it errored. Fix: transition pending->dispatched SYNCHRONOUSLY in the HTTP handler before tokio::spawn. The runner now ALWAYS finds the job in `dispatched` when it tries to call back. apply_outcome is restructured: success path only sets deadline+lease (state already advanced); failure path transitions dispatched->errored (legal startup-failure transition per the existing state machine). If `started` lands first (very fast runners), the failure-path transition becomes a no-op (SQL guard refuses Dispatched->Errored when current state is in_progress) and the run proceeds. The response body's `state` field also flips from `pending` to `dispatched` accordingly — dashboards see the real state immediately without polling. CONCERNING — env-var bootstrap missing timeline. _install.py recognized REWIND_SESSION_ID + REWIND_REPLAY_CONTEXT_ID but not REWIND_REPLAY_CONTEXT_TIMELINE_ID, so subprocess-based agents using env bootstrap re-introduced the F2 bug for that path. intercept._install._bootstrap_replay_context_from_env now reads the env var and forwards it to attach_replay_context(timeline_id=...). A WARN log fires when the env var is missing. Tests: - 18 replay_jobs_tests pass (was 16, +2 new): create_replay_job_response_state_is_dispatched_not_pending and started_event_accepted_immediately_after_dispatch_response prove the race is closed. - 30 runner unit tests pass (was 28, +1 new env-var-with-timeline test; existing test updated to assert _timeline_id stays None when REWIND_REPLAY_CONTEXT_TIMELINE_ID is absent). - 108 rewind-web lib tests, 102 rewind-store, 491 pytest, ruff + clippy clean, web build green. docs/runners.md updated with REWIND_REPLAY_CONTEXT_TIMELINE_ID recommendation in the env-var bootstrap example. Made-with: Cursor
|
Both findings from the round-2 Santa pass addressed in f795f2b.
Picked option (a) from the suggested fix set ("Move job to Subtle property: if a very fast runner emits Actually the guard is Test posture:
CI will pick up the push. |
…view #154 round 2 follow-up) Self-caught follow-up to f795f2b. The pre-spawn pending->dispatched transition opened a race where the runner's `started` event could land while the dispatcher's HTTP call is still in flight, advancing the job to in_progress. If that HTTP call THEN failed, the failure-path apply_outcome's advance_replay_job_state(Errored) matched the in_progress row (the existing SQL guard only excludes terminal states) and clobbered legitimate started state. Fix: new Store::mark_dispatched_job_as_errored helper with the strict guard `WHERE id = ? AND state = 'dispatched'`. The failure-path apply_outcome now uses this. UPDATE is a no-op when state has advanced to in_progress, preserving the legitimate run. Includes a regression test (failed_dispatch_does_not_clobber_already_in_progress_job) that inserts an in_progress row directly, calls mark_dispatched_job_as_errored, and asserts the row is unchanged. 19 replay_jobs_tests pass (was 18, +1). Full suite still green. Made-with: Cursor
|
Self-caught follow-up to my previous comment, pushed as 023bd3f. I called out the in_progress-clobber risk in my reply ("that means a started-then-failed-dispatch combo could mark in_progress jobs errored"), then realized I should fix it now rather than file as follow-up. The risk is real — pre-spawn transition + fast runner + failed dispatch HTTP call would have the failure path's Fix: new New test 19 replay_jobs_tests pass (was 18, +1). Full suite remains green. |
Summary
Closes the loop on Phase 3 — the dashboard becomes a control surface instead of a viewer. Operators register long-lived runner processes; clicking "Run replay" on a session step HMAC-signs a webhook to the runner; the runner executes the agent under a replay context; progress streams back to the dashboard live via WebSocket.
This is commits 5-13 of the 13-commit Phase 3 plan, continuing from commits 1-4 already merged in PRs #150, #151, #152, #153.
What ships
POST /api/sessions/{sid}/replay-jobs(dual-shape A/B) +POST /api/replay-jobs/{id}/events(X-Rewind-Runner-Auth, ownership check, atomic event ingestion) + WS broadcastRunnersPage(list / register / regenerate / remove + raw-token reveal modal)RunReplayButton+ReplayJobModal+useReplayJobhook +replay_job_updateWebSocket framerewind_agent.runnerlibrary (RunnerConfig,compute_signature,verify_signature,ProgressReporter,asgi_handler,@handle_replay) +ExplicitClient.attach_replay_context+ env-var bootstraprewind runners {list, add, remove, regenerate-token}CLIREWIND_ALLOW_LOOPBACK_WEBHOOKSdev escape hatchdocs/runners.mdoperator how-to + decision matrix updated 3-way → 5-way (also bundles deferred Phase 2 doc)API / wire format
Outbound dispatch (Rewind → runner):
```
POST <runner.webhook_url>
X-Rewind-Job-Id:
X-Rewind-Signature: sha256= # HMAC-SHA256(raw_token, job_id || \n || raw_body)
Content-Type: application/json
{ "job_id", "session_id", "replay_context_id", "base_url" }
```
Inbound events (runner → Rewind, mounts OUTSIDE bearer middleware):
```
POST /api/replay-jobs/{id}/events
X-Rewind-Runner-Auth: # SHA-256 lookup; ownership check on job.runner_id
{ "event_type": "started|progress|completed|errored", "step_number?", "progress_total?", ... }
```
Reference vectors pinned both sides:
crates/rewind-web/src/dispatcher.rs::compute_signature_matches_python_reference↔python/tests/test_runner.py::test_compute_signature_matches_rust_reference.State machine + reaper
Same SQL-level terminal-state guard from #152 round 2 protects every transition. Reaper marks dispatched-but-never-started →
errored stage='dispatch'and in_progress-without-heartbeat →errored stage='lease_expired'.Bootstrap (server)
```bash
export REWIND_RUNNER_SECRET_KEY="$(openssl rand -base64 32)" # required
export REWIND_PUBLIC_URL="https://rewind.example\" # for non-local
rewind web
```
Unset key → 503 with bootstrap message. Set-but-malformed → server panics at startup (per #153 MEDIUM 5 fix).
Test plan
cargo test --workspace— greencargo clippyonrewind-web --lib,rewind-store --all-targets,rewind-cli— clean (pre-existing test-file warnings onrecording_api_tests/transcript_sync_testsunchanged from master)runners_tests+ 12 newreplay_jobs_testsintegration teststest_runner.py+ 1 e2e)web buildgreen (426 KB / 119 KB gz; 1705 modules)python/tests/e2e/test_runner_replay_flow.pyNotes for reviewers
REWIND_ALLOW_LOOPBACK_WEBHOOKSenv escape is documented as DEV-ONLY in both the registration handler anddocs/runners.md. Production deployments should never set it.cancellationis fire-and-forget in v1 (operator marks errored; runner not notified). Cooperative cancel deferred to v3.1 per the plan.pollingmode is still rejected at registration (deferred to v3.1, per feat(rust): /api/runners CRUD + AES-256-GCM crypto module (Phase 3, commit 4/13) #153 HIGH 1).Made with Cursor