feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions by acailic · Pull Request #140 · acailic/agent_debugger

acailic · 2026-04-05T16:55:30Z

Summary

Implements the three self-contained pieces described in #136:

agent_debugger_sdk/checkpoints/hooks.py — RestoreHook protocol, RESTORE_HOOK_REGISTRY dict, apply_restore_hook() async function, and AutoReplayManager class. Built-in LangChain hook restores messages and intermediate_steps; unknown frameworks fall back to a generic hook. Hook failures are caught and logged without crashing restore.
agent_debugger_sdk/drift.py — DriftSeverity enum (WARNING/CRITICAL), DriftEvent dataclass (with original_value/restored_value and expected/actual aliases), and DriftDetector that detects action, tool-call, and confidence drift between original and restored execution.
api/schemas.py — RestoreRequest gains replay_events: bool and track_drift: bool (both default False); RestoreResponse gains replayed_events_count: int | None and drift_detected: bool | None.
TraceContext.restore() extended with replay_events, track_drift, original_session_id, importance_threshold, and on_replay_event parameters. Auto-replay fetches post-checkpoint events, filters by sequence and importance, honours cancellation callbacks, and calls registered restore hooks.

No existing behaviour changes; all additions are additive.

Test plan

tests/test_replay_depth_l3.py — 31 passed, 1 skipped (pre-existing evidence arg limitation in record_decision, not related to this PR)
Full suite — 2215 passed, 10 skipped (env-gated integration tests + isolation-only package test)
ruff check — clean

Closes #136

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 648c2394fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T16:59:11Z

+
+                async with httpx.AsyncClient() as client:
+                    response = await client.get(
+                        f"{resolved_url}/api/sessions/{orig_session_id}/events"


Call an existing session endpoint for replay fetch

With replay_events=True, TraceContext.restore fetches from .../api/sessions/{orig_session_id}/events, but this commit’s API routes do not define that path (checked api/session_routes.py, api/trace_routes.py, and api/replay_routes.py; available paths are /trace, /traces, and /replay). In real runs this GET will 404, the exception is swallowed, and ctx.replayed_events stays empty, so auto-replay silently never happens.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-05T16:59:11Z

+            hook = RESTORE_HOOK_REGISTRY.get(framework)
+            if hook is not None:
+                try:
+                    await hook(restored_state, object())


Pass a mutable target to restore hooks

The restore hook is invoked with object() as the target, but built-in hooks (for example _langchain_hook) assign attributes like target.messages and target.intermediate_steps. A plain object instance does not allow dynamic attributes, so hooks raise AttributeError, get logged, and no restoration is applied. This makes hook-based state reconstruction effectively non-functional for normal hook implementations.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Implements L3 replay primitives (restore hooks, auto-replay plumbing) and L4 drift-detection primitives, plus API schema extensions to expose replay/drift options and status.

Changes:

Added restore hook infrastructure (RestoreHook, registry, apply_restore_hook) and a basic AutoReplayManager.
Added drift detection primitives (DriftSeverity, DriftEvent, DriftDetector) and extended TraceContext.restore() with replay/drift options.
Extended restore request/response schemas and updated demo GIF recording script + README GIF sizing.

Reviewed changes

Copilot reviewed 15 out of 43 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`agent_debugger_sdk/checkpoints/hooks.py`	Introduces restore hook protocol/registry, `apply_restore_hook`, and `AutoReplayManager`.
`agent_debugger_sdk/checkpoints/__init__.py`	Re-exports new hook-related symbols.
`agent_debugger_sdk/drift.py`	Adds drift severity/event types and drift comparison logic.
`agent_debugger_sdk/core/context/trace_context.py`	Extends `TraceContext.restore()` to optionally replay events and attach drift detector; attempts to invoke restore hooks.
`agent_debugger_sdk/core/context/session_manager.py`	Adds `checkpoint_sequence` into restored session config.
`api/schemas.py`	Adds `replay_events` / `track_drift` to `RestoreRequest`; adds replay/drift result fields to `RestoreResponse`.
`tests/sdk/core/test_session_manager.py`	Updates expectation to include `checkpoint_sequence` in restored session config.
`scripts/record_demo_gifs.js`	Updates Playwright automation flow, selectors, timing, and GIF encoding defaults.
`README.md`	Adjusts embedded demo GIF widths to match new output sizing.
`docs/assets/gifs/screenshots/*`	Removes older ad-hoc capture scripts and adds/updates screenshot assets used for docs/demo generation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-05T17:01:17Z

+        # Apply restore hook for the checkpoint's framework
+        if restored_state is not None:
+            framework = getattr(restored_state, "framework", "custom")
+            from agent_debugger_sdk.checkpoints import RESTORE_HOOK_REGISTRY
+
+            hook = RESTORE_HOOK_REGISTRY.get(framework)
+            if hook is not None:
+                try:
+                    await hook(restored_state, object())
+                except Exception as exc:
+                    ctx._hook_errors.append(exc)
+                    logger.warning("Restore hook for %r failed: %s", framework, exc)
+


TraceContext.restore() invokes the framework restore hook with object() as the target. Built-in hooks (e.g. LangChain) assign attributes on target, and object() is not attribute-settable, so the hook will always raise and become a no-op. Also, if no hook is registered for a framework, the generic fallback described in the PR isn’t applied here. Consider calling apply_restore_hook(framework, restored_state, target) with a mutable target (or accept a target/agent parameter) so hooks can actually restore state and unknown frameworks use the generic hook.

Copilot · 2026-04-05T17:01:17Z

+        # Attach drift detector if requested
+        if track_drift:
+            from agent_debugger_sdk.drift import DriftDetector
+
+            ctx._drift_detector = DriftDetector([])
+


When track_drift=True, the drift detector is initialized with an empty original_events list (DriftDetector([])), which makes compare() a no-op for any index. To make drift detection meaningful, initialize it with the original post-checkpoint events (or fetch them when track_drift is enabled) so restored execution can be compared against the recorded baseline.

Copilot · 2026-04-05T17:01:17Z

+                async with httpx.AsyncClient() as client:
+                    response = await client.get(
+                        f"{resolved_url}/api/sessions/{orig_session_id}/events"


Auto-replay event fetching uses httpx.AsyncClient() without an explicit timeout and builds the URL via string concatenation. This can hang indefinitely on network issues and can produce double slashes if server_url ends with /. Consider using a bounded timeout (consistent with agent_debugger_sdk/transport.py) and normalizing resolved_url (e.g. rstrip('/')) or using base_url on the client.

Suggested change

async with httpx.AsyncClient() as client:

response = await client.get(

f"{resolved_url}/api/sessions/{orig_session_id}/events"

normalized_url = resolved_url.rstrip("/")

timeout = httpx.Timeout(10.0)

async with httpx.AsyncClient(timeout=timeout) as client:

response = await client.get(

f"{normalized_url}/api/sessions/{orig_session_id}/events"

Copilot · 2026-04-05T17:01:18Z

+        # Check tool_name drift
+        if "tool_name" in orig_data and "tool_name" in new_data:
+            if orig_data["tool_name"] != new_data["tool_name"]:
+                return DriftEvent(
+                    severity=DriftSeverity.WARNING,
+                    description=(
+                        f"Tool call drift at index {index}: "
+                        f"expected tool {orig_data['tool_name']!r}, "
+                        f"got {new_data['tool_name']!r}"
+                    ),
+                    original_value=orig_data["tool_name"],
+                    restored_value=new_data["tool_name"],
+                    event_type=event_type,
+                    index=index,
+                )


Tool-call drift detection only checks data['tool_name'], but some event payloads use data['tool'] (including cases in the L3 replay tests). This means tool drift will be silently missed for those events. Consider checking both keys (e.g. tool_name and tool) and comparing whichever is present in both original and restored events.

Copilot · 2026-04-05T17:01:18Z

+        # Check confidence drift
+        if "confidence" in orig_data and "confidence" in new_data:
+            delta = abs(float(orig_data["confidence"]) - float(new_data["confidence"]))
+            if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
+                severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING


Confidence drift comparison unconditionally casts both values to float(). If either side is None or a non-numeric string (possible in loosely-typed JSON payloads), this will raise and break drift detection. Consider guarding the cast (TypeError/ValueError) and treating non-numeric confidence as “not comparable” (returning None).

Copilot · 2026-04-05T17:01:18Z

+
+
+# Registry mapping framework name to restore hook callable
+RESTORE_HOOK_REGISTRY: dict[str, Any] = {


RESTORE_HOOK_REGISTRY is typed as dict[str, Any], which loses the benefit of the RestoreHook protocol and makes misuse easier to miss. Consider typing it as dict[str, RestoreHook] (or MutableMapping[str, RestoreHook]) so static checkers can validate hook signatures.

Suggested change

RESTORE_HOOK_REGISTRY: dict[str, Any] = {

RESTORE_HOOK_REGISTRY: dict[str, RestoreHook] = {

Copilot · 2026-04-05T17:01:18Z

+class AutoReplayManager:
+    """Orchestrates automatic event replay after checkpoint restoration.
+
+    Manages the lifecycle of replaying recorded events from a session,
+    applying restore hooks and optionally tracking drift.
+
+    Args:
+        events: List of events to replay (already filtered by sequence/importance).
+        framework: Framework identifier for hook lookup.
+        on_event: Optional callback invoked per event; return False to stop.
+    """
+
+    def __init__(
+        self,
+        events: list[dict[str, Any]],
+        framework: str = "custom",
+        on_event: Any | None = None,
+    ) -> None:
+        self.events = events
+        self.framework = framework
+        self.on_event = on_event
+        self.replayed: list[dict[str, Any]] = []
+
+    async def run(self) -> list[dict[str, Any]]:
+        """Execute the replay sequence.
+
+        Returns:
+            List of events that were successfully replayed.
+        """
+        for event in self.events:
+            if self.on_event is not None:
+                result = self.on_event(event)
+                if result is False:
+                    break
+            self.replayed.append(event)
+        return self.replayed


AutoReplayManager’s docstring says it “applies restore hooks and optionally tracks drift”, but run() currently only invokes on_event and appends events to self.replayed. Either implement hook/drift application here or adjust the docstring to match actual behavior to avoid misleading API consumers.

Copilot · 2026-04-05T17:01:18Z

+const CHROME =
+  process.env.CHROME_PATH ||
+  "/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
+const UI = "http://localhost:8000/ui/";


The script defaults CHROME to a user-specific absolute path (/home/nistrator/...) and hardcodes the UI URL/port. This makes the script fail out-of-the-box for other developers/CI. Consider defaulting to Playwright’s managed browser (no executablePath) and making the UI base URL configurable via an env var (with a repo-appropriate default).

- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a4cb7ffef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T18:54:44Z

+                e for e in raw_events
+                if e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence


Filter replay events by real trace ordering fields

This filter reads a top-level sequence key that trace payloads from /api/sessions/{id}/traces do not provide (event ordering is carried in metadata), so the default checkpoint_sequence + 1 path is taken and nearly all fetched events are treated as post-checkpoint. In restores with replay_events=True, that replays pre-checkpoint history as well, which corrupts replay correctness and any downstream drift comparison.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-05T18:54:44Z

+            hook = RESTORE_HOOK_REGISTRY.get(framework)
+            if hook is not None:


Use generic restore hook when framework-specific hook is absent

Hook application is skipped entirely unless a framework key exists in RESTORE_HOOK_REGISTRY, so restores for custom/unknown frameworks never run the generic reconstruction path and leave ctx._restored_target unset. Since apply_restore_hook already defines a safe fallback, bypassing it here makes TraceContext.restore() silently drop restore-target reconstruction for non-registered frameworks.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-05T20:23:50Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

acailic · 2026-04-05T20:23:54Z

Addressed all remaining review feedback in commit 4144606:

P1 — Filter replay events by real trace ordering fields (chatgpt-codex-connector, @3037222732):
TraceEventSchema has no top-level sequence field (that field belongs to CheckpointSchema), so the previous filter e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence was silently passing every event through. Fixed by storing checkpoint_timestamp in the session config during restore_from_checkpoint and filtering events by e.get("timestamp", "") > checkpoint_ts — TraceEventSchema does have a timestamp field, and ISO strings sort lexicographically.

P2 — Use generic restore hook when framework-specific hook is absent (chatgpt-codex-connector, @3037222734):
TraceContext.restore() was directly looking up RESTORE_HOOK_REGISTRY.get(framework) and skipping entirely if no hook was found. Replaced with apply_restore_hook() (already in hooks.py) which falls back to _generic_hook for unknown/custom frameworks.

All other comments (P1 mutable target, P1 session endpoint, httpx timeout, double-slash URL, tool_name/tool key, confidence guard, registry typing, AutoReplayManager docstring, demo script hardcoded path) were addressed in the previous commit fix: address all review comments on PR #140.

Ruff: clean. Tests: 31 passed, 1 skipped (pre-existing).

- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-04-06T21:52:58Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-04-06T21:56:53Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

…hema extensions - Add agent_debugger_sdk/checkpoints/hooks.py: RestoreHook protocol, RESTORE_HOOK_REGISTRY, apply_restore_hook(), and AutoReplayManager - Add agent_debugger_sdk/drift.py: DriftSeverity enum, DriftEvent dataclass, and DriftDetector for comparing restored vs original execution - Extend RestoreRequest with replay_events/track_drift fields and RestoreResponse with replayed_events_count/drift_detected fields - Extend TraceContext.restore() with replay_events, track_drift, importance_threshold, and on_replay_event parameters - Thread checkpoint_sequence through session config for post-checkpoint event filtering during auto-replay Closes #136 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- drift.py: check both tool_name and tool keys; guard float() conversion - hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix AutoReplayManager docstring to match actual behavior - trace_context.py: use /traces endpoint; use SimpleNamespace for restore target; seed DriftDetector with baseline; add httpx timeout and pagination; store restored_target - record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL configurable via env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tering - Use apply_restore_hook() in TraceContext.restore() so unknown frameworks fall back to _generic_hook instead of skipping restoration entirely (P2) - Store checkpoint_timestamp in session config during restore_from_checkpoint and filter replayed events by timestamp rather than a non-existent sequence field — TraceEventSchema has no top-level sequence key so the old filter was silently passing every event through (P1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…sion config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-04-06T22:00:56Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Copilot AI review requested due to automatic review settings April 5, 2026 16:55

Copilot started reviewing on behalf of acailic April 5, 2026 16:56 View session

chatgpt-codex-connector Bot reviewed Apr 5, 2026

View reviewed changes

Copilot AI reviewed Apr 5, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 5, 2026

View reviewed changes

acailic force-pushed the issue-136-l3-replay branch from 4144606 to c8bcae0 Compare April 6, 2026 21:52

acailic force-pushed the issue-136-l3-replay branch from c8bcae0 to 3912e50 Compare April 6, 2026 21:56

acailic and others added 4 commits April 7, 2026 00:00

fix: update test expectation for checkpoint_timestamp in restored ses…

07e69b0

…sion config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

acailic force-pushed the issue-136-l3-replay branch from 3912e50 to 07e69b0 Compare April 6, 2026 22:00

acailic merged commit ea64233 into main Apr 6, 2026
6 checks passed



		# Registry mapping framework name to restore hook callable
		RESTORE_HOOK_REGISTRY: dict[str, Any] = {

	RESTORE_HOOK_REGISTRY: dict[str, Any] = {
	RESTORE_HOOK_REGISTRY: dict[str, RestoreHook] = {

		e for e in raw_events
		if e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence

		hook = RESTORE_HOOK_REGISTRY.get(framework)
		if hook is not None:

Conversation

acailic commented Apr 5, 2026

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Apr 5, 2026

Uh oh!

acailic commented Apr 5, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants