feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#140
feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#140
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 648c2394fe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| async with httpx.AsyncClient() as client: | ||
| response = await client.get( | ||
| f"{resolved_url}/api/sessions/{orig_session_id}/events" |
There was a problem hiding this comment.
Call an existing session endpoint for replay fetch
With replay_events=True, TraceContext.restore fetches from .../api/sessions/{orig_session_id}/events, but this commit’s API routes do not define that path (checked api/session_routes.py, api/trace_routes.py, and api/replay_routes.py; available paths are /trace, /traces, and /replay). In real runs this GET will 404, the exception is swallowed, and ctx.replayed_events stays empty, so auto-replay silently never happens.
Useful? React with 👍 / 👎.
| hook = RESTORE_HOOK_REGISTRY.get(framework) | ||
| if hook is not None: | ||
| try: | ||
| await hook(restored_state, object()) |
There was a problem hiding this comment.
Pass a mutable target to restore hooks
The restore hook is invoked with object() as the target, but built-in hooks (for example _langchain_hook) assign attributes like target.messages and target.intermediate_steps. A plain object instance does not allow dynamic attributes, so hooks raise AttributeError, get logged, and no restoration is applied. This makes hook-based state reconstruction effectively non-functional for normal hook implementations.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Implements L3 replay primitives (restore hooks, auto-replay plumbing) and L4 drift-detection primitives, plus API schema extensions to expose replay/drift options and status.
Changes:
- Added restore hook infrastructure (
RestoreHook, registry,apply_restore_hook) and a basicAutoReplayManager. - Added drift detection primitives (
DriftSeverity,DriftEvent,DriftDetector) and extendedTraceContext.restore()with replay/drift options. - Extended restore request/response schemas and updated demo GIF recording script + README GIF sizing.
Reviewed changes
Copilot reviewed 15 out of 43 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
agent_debugger_sdk/checkpoints/hooks.py |
Introduces restore hook protocol/registry, apply_restore_hook, and AutoReplayManager. |
agent_debugger_sdk/checkpoints/__init__.py |
Re-exports new hook-related symbols. |
agent_debugger_sdk/drift.py |
Adds drift severity/event types and drift comparison logic. |
agent_debugger_sdk/core/context/trace_context.py |
Extends TraceContext.restore() to optionally replay events and attach drift detector; attempts to invoke restore hooks. |
agent_debugger_sdk/core/context/session_manager.py |
Adds checkpoint_sequence into restored session config. |
api/schemas.py |
Adds replay_events / track_drift to RestoreRequest; adds replay/drift result fields to RestoreResponse. |
tests/sdk/core/test_session_manager.py |
Updates expectation to include checkpoint_sequence in restored session config. |
scripts/record_demo_gifs.js |
Updates Playwright automation flow, selectors, timing, and GIF encoding defaults. |
README.md |
Adjusts embedded demo GIF widths to match new output sizing. |
docs/assets/gifs/screenshots/* |
Removes older ad-hoc capture scripts and adds/updates screenshot assets used for docs/demo generation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Apply restore hook for the checkpoint's framework | ||
| if restored_state is not None: | ||
| framework = getattr(restored_state, "framework", "custom") | ||
| from agent_debugger_sdk.checkpoints import RESTORE_HOOK_REGISTRY | ||
|
|
||
| hook = RESTORE_HOOK_REGISTRY.get(framework) | ||
| if hook is not None: | ||
| try: | ||
| await hook(restored_state, object()) | ||
| except Exception as exc: | ||
| ctx._hook_errors.append(exc) | ||
| logger.warning("Restore hook for %r failed: %s", framework, exc) | ||
|
|
There was a problem hiding this comment.
TraceContext.restore() invokes the framework restore hook with object() as the target. Built-in hooks (e.g. LangChain) assign attributes on target, and object() is not attribute-settable, so the hook will always raise and become a no-op. Also, if no hook is registered for a framework, the generic fallback described in the PR isn’t applied here. Consider calling apply_restore_hook(framework, restored_state, target) with a mutable target (or accept a target/agent parameter) so hooks can actually restore state and unknown frameworks use the generic hook.
| # Attach drift detector if requested | ||
| if track_drift: | ||
| from agent_debugger_sdk.drift import DriftDetector | ||
|
|
||
| ctx._drift_detector = DriftDetector([]) | ||
|
|
There was a problem hiding this comment.
When track_drift=True, the drift detector is initialized with an empty original_events list (DriftDetector([])), which makes compare() a no-op for any index. To make drift detection meaningful, initialize it with the original post-checkpoint events (or fetch them when track_drift is enabled) so restored execution can be compared against the recorded baseline.
| async with httpx.AsyncClient() as client: | ||
| response = await client.get( | ||
| f"{resolved_url}/api/sessions/{orig_session_id}/events" |
There was a problem hiding this comment.
Auto-replay event fetching uses httpx.AsyncClient() without an explicit timeout and builds the URL via string concatenation. This can hang indefinitely on network issues and can produce double slashes if server_url ends with /. Consider using a bounded timeout (consistent with agent_debugger_sdk/transport.py) and normalizing resolved_url (e.g. rstrip('/')) or using base_url on the client.
| async with httpx.AsyncClient() as client: | |
| response = await client.get( | |
| f"{resolved_url}/api/sessions/{orig_session_id}/events" | |
| normalized_url = resolved_url.rstrip("/") | |
| timeout = httpx.Timeout(10.0) | |
| async with httpx.AsyncClient(timeout=timeout) as client: | |
| response = await client.get( | |
| f"{normalized_url}/api/sessions/{orig_session_id}/events" |
| # Check tool_name drift | ||
| if "tool_name" in orig_data and "tool_name" in new_data: | ||
| if orig_data["tool_name"] != new_data["tool_name"]: | ||
| return DriftEvent( | ||
| severity=DriftSeverity.WARNING, | ||
| description=( | ||
| f"Tool call drift at index {index}: " | ||
| f"expected tool {orig_data['tool_name']!r}, " | ||
| f"got {new_data['tool_name']!r}" | ||
| ), | ||
| original_value=orig_data["tool_name"], | ||
| restored_value=new_data["tool_name"], | ||
| event_type=event_type, | ||
| index=index, | ||
| ) |
There was a problem hiding this comment.
Tool-call drift detection only checks data['tool_name'], but some event payloads use data['tool'] (including cases in the L3 replay tests). This means tool drift will be silently missed for those events. Consider checking both keys (e.g. tool_name and tool) and comparing whichever is present in both original and restored events.
| # Check confidence drift | ||
| if "confidence" in orig_data and "confidence" in new_data: | ||
| delta = abs(float(orig_data["confidence"]) - float(new_data["confidence"])) | ||
| if delta >= _CONFIDENCE_DRIFT_THRESHOLD: | ||
| severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING |
There was a problem hiding this comment.
Confidence drift comparison unconditionally casts both values to float(). If either side is None or a non-numeric string (possible in loosely-typed JSON payloads), this will raise and break drift detection. Consider guarding the cast (TypeError/ValueError) and treating non-numeric confidence as “not comparable” (returning None).
|
|
||
|
|
||
| # Registry mapping framework name to restore hook callable | ||
| RESTORE_HOOK_REGISTRY: dict[str, Any] = { |
There was a problem hiding this comment.
RESTORE_HOOK_REGISTRY is typed as dict[str, Any], which loses the benefit of the RestoreHook protocol and makes misuse easier to miss. Consider typing it as dict[str, RestoreHook] (or MutableMapping[str, RestoreHook]) so static checkers can validate hook signatures.
| RESTORE_HOOK_REGISTRY: dict[str, Any] = { | |
| RESTORE_HOOK_REGISTRY: dict[str, RestoreHook] = { |
| class AutoReplayManager: | ||
| """Orchestrates automatic event replay after checkpoint restoration. | ||
|
|
||
| Manages the lifecycle of replaying recorded events from a session, | ||
| applying restore hooks and optionally tracking drift. | ||
|
|
||
| Args: | ||
| events: List of events to replay (already filtered by sequence/importance). | ||
| framework: Framework identifier for hook lookup. | ||
| on_event: Optional callback invoked per event; return False to stop. | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| events: list[dict[str, Any]], | ||
| framework: str = "custom", | ||
| on_event: Any | None = None, | ||
| ) -> None: | ||
| self.events = events | ||
| self.framework = framework | ||
| self.on_event = on_event | ||
| self.replayed: list[dict[str, Any]] = [] | ||
|
|
||
| async def run(self) -> list[dict[str, Any]]: | ||
| """Execute the replay sequence. | ||
|
|
||
| Returns: | ||
| List of events that were successfully replayed. | ||
| """ | ||
| for event in self.events: | ||
| if self.on_event is not None: | ||
| result = self.on_event(event) | ||
| if result is False: | ||
| break | ||
| self.replayed.append(event) | ||
| return self.replayed |
There was a problem hiding this comment.
AutoReplayManager’s docstring says it “applies restore hooks and optionally tracks drift”, but run() currently only invokes on_event and appends events to self.replayed. Either implement hook/drift application here or adjust the docstring to match actual behavior to avoid misleading API consumers.
| const CHROME = | ||
| process.env.CHROME_PATH || | ||
| "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome"; | ||
| const UI = "http://localhost:8000/ui/"; |
There was a problem hiding this comment.
The script defaults CHROME to a user-specific absolute path (/home/nistrator/...) and hardcodes the UI URL/port. This makes the script fail out-of-the-box for other developers/CI. Consider defaulting to Playwright’s managed browser (no executablePath) and making the UI base URL configurable via an env var (with a repo-appropriate default).
- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a4cb7ffef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| e for e in raw_events | ||
| if e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence |
There was a problem hiding this comment.
Filter replay events by real trace ordering fields
This filter reads a top-level sequence key that trace payloads from /api/sessions/{id}/traces do not provide (event ordering is carried in metadata), so the default checkpoint_sequence + 1 path is taken and nearly all fetched events are treated as post-checkpoint. In restores with replay_events=True, that replays pre-checkpoint history as well, which corrupts replay correctness and any downstream drift comparison.
Useful? React with 👍 / 👎.
| hook = RESTORE_HOOK_REGISTRY.get(framework) | ||
| if hook is not None: |
There was a problem hiding this comment.
Use generic restore hook when framework-specific hook is absent
Hook application is skipped entirely unless a framework key exists in RESTORE_HOOK_REGISTRY, so restores for custom/unknown frameworks never run the generic reconstruction path and leave ctx._restored_target unset. Since apply_restore_hook already defines a safe fallback, bypassing it here makes TraceContext.restore() silently drop restore-target reconstruction for non-registered frameworks.
Useful? React with 👍 / 👎.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Addressed all remaining review feedback in commit 4144606: P1 — Filter replay events by real trace ordering fields (chatgpt-codex-connector, @3037222732): P2 — Use generic restore hook when framework-specific hook is absent (chatgpt-codex-connector, @3037222734): All other comments (P1 mutable target, P1 session endpoint, httpx timeout, double-slash URL, tool_name/tool key, confidence guard, registry typing, AutoReplayManager docstring, demo script hardcoded path) were addressed in the previous commit Ruff: clean. Tests: 31 passed, 1 skipped (pre-existing). |
- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4144606 to
c8bcae0
Compare
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c8bcae0 to
3912e50
Compare
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
…hema extensions - Add agent_debugger_sdk/checkpoints/hooks.py: RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), and AutoReplayManager - Add agent_debugger_sdk/drift.py: DriftSeverity enum, DriftEvent dataclass, and DriftDetector for comparing restored vs original execution - Extend RestoreRequest with replay_events/track_drift fields and RestoreResponse with replayed_events_count/drift_detected fields - Extend TraceContext.restore() with replay_events, track_drift, importance_threshold, and on_replay_event parameters - Thread checkpoint_sequence through session config for post-checkpoint event filtering during auto-replay Closes #136 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tering - Use apply_restore_hook() in TraceContext.restore() so unknown frameworks fall back to _generic_hook instead of skipping restoration entirely (P2) - Store checkpoint_timestamp in session config during restore_from_checkpoint and filter replayed events by timestamp rather than a non-existent sequence field — TraceEventSchema has no top-level sequence key so the old filter was silently passing every event through (P1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3912e50 to
07e69b0
Compare
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Summary
Implements the three self-contained pieces described in #136:
agent_debugger_sdk/checkpoints/hooks.py—RestoreHookprotocol,RESTORE_HOOK_REGISTRYdict,apply_restore_hook()async function, andAutoReplayManagerclass. Built-in LangChain hook restoresmessagesandintermediate_steps; unknown frameworks fall back to a generic hook. Hook failures are caught and logged without crashing restore.agent_debugger_sdk/drift.py—DriftSeverityenum (WARNING/CRITICAL),DriftEventdataclass (withoriginal_value/restored_valueandexpected/actualaliases), andDriftDetectorthat detects action, tool-call, and confidence drift between original and restored execution.api/schemas.py—RestoreRequestgainsreplay_events: boolandtrack_drift: bool(both defaultFalse);RestoreResponsegainsreplayed_events_count: int | Noneanddrift_detected: bool | None.TraceContext.restore()extended withreplay_events,track_drift,original_session_id,importance_threshold, andon_replay_eventparameters. Auto-replay fetches post-checkpoint events, filters by sequence and importance, honours cancellation callbacks, and calls registered restore hooks.No existing behaviour changes; all additions are additive.
Test plan
tests/test_replay_depth_l3.py— 31 passed, 1 skipped (pre-existingevidencearg limitation inrecord_decision, not related to this PR)ruff check— cleanCloses #136
🤖 Generated with Claude Code