Skip to content

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#140

Merged
acailic merged 4 commits intomainfrom
issue-136-l3-replay
Apr 6, 2026
Merged

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions#140
acailic merged 4 commits intomainfrom
issue-136-l3-replay

Conversation

@acailic
Copy link
Copy Markdown
Owner

@acailic acailic commented Apr 5, 2026

Summary

Implements the three self-contained pieces described in #136:

  • agent_debugger_sdk/checkpoints/hooks.pyRestoreHook protocol, RESTORE_HOOK_REGISTRY dict, apply_restore_hook() async function, and AutoReplayManager class. Built-in LangChain hook restores messages and intermediate_steps; unknown frameworks fall back to a generic hook. Hook failures are caught and logged without crashing restore.
  • agent_debugger_sdk/drift.pyDriftSeverity enum (WARNING/CRITICAL), DriftEvent dataclass (with original_value/restored_value and expected/actual aliases), and DriftDetector that detects action, tool-call, and confidence drift between original and restored execution.
  • api/schemas.pyRestoreRequest gains replay_events: bool and track_drift: bool (both default False); RestoreResponse gains replayed_events_count: int | None and drift_detected: bool | None.
  • TraceContext.restore() extended with replay_events, track_drift, original_session_id, importance_threshold, and on_replay_event parameters. Auto-replay fetches post-checkpoint events, filters by sequence and importance, honours cancellation callbacks, and calls registered restore hooks.

No existing behaviour changes; all additions are additive.

Test plan

  • tests/test_replay_depth_l3.py — 31 passed, 1 skipped (pre-existing evidence arg limitation in record_decision, not related to this PR)
  • Full suite — 2215 passed, 10 skipped (env-gated integration tests + isolation-only package test)
  • ruff check — clean

Closes #136

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 5, 2026 16:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 648c2394fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


async with httpx.AsyncClient() as client:
response = await client.get(
f"{resolved_url}/api/sessions/{orig_session_id}/events"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Call an existing session endpoint for replay fetch

With replay_events=True, TraceContext.restore fetches from .../api/sessions/{orig_session_id}/events, but this commit’s API routes do not define that path (checked api/session_routes.py, api/trace_routes.py, and api/replay_routes.py; available paths are /trace, /traces, and /replay). In real runs this GET will 404, the exception is swallowed, and ctx.replayed_events stays empty, so auto-replay silently never happens.

Useful? React with 👍 / 👎.

hook = RESTORE_HOOK_REGISTRY.get(framework)
if hook is not None:
try:
await hook(restored_state, object())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pass a mutable target to restore hooks

The restore hook is invoked with object() as the target, but built-in hooks (for example _langchain_hook) assign attributes like target.messages and target.intermediate_steps. A plain object instance does not allow dynamic attributes, so hooks raise AttributeError, get logged, and no restoration is applied. This makes hook-based state reconstruction effectively non-functional for normal hook implementations.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements L3 replay primitives (restore hooks, auto-replay plumbing) and L4 drift-detection primitives, plus API schema extensions to expose replay/drift options and status.

Changes:

  • Added restore hook infrastructure (RestoreHook, registry, apply_restore_hook) and a basic AutoReplayManager.
  • Added drift detection primitives (DriftSeverity, DriftEvent, DriftDetector) and extended TraceContext.restore() with replay/drift options.
  • Extended restore request/response schemas and updated demo GIF recording script + README GIF sizing.

Reviewed changes

Copilot reviewed 15 out of 43 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
agent_debugger_sdk/checkpoints/hooks.py Introduces restore hook protocol/registry, apply_restore_hook, and AutoReplayManager.
agent_debugger_sdk/checkpoints/__init__.py Re-exports new hook-related symbols.
agent_debugger_sdk/drift.py Adds drift severity/event types and drift comparison logic.
agent_debugger_sdk/core/context/trace_context.py Extends TraceContext.restore() to optionally replay events and attach drift detector; attempts to invoke restore hooks.
agent_debugger_sdk/core/context/session_manager.py Adds checkpoint_sequence into restored session config.
api/schemas.py Adds replay_events / track_drift to RestoreRequest; adds replay/drift result fields to RestoreResponse.
tests/sdk/core/test_session_manager.py Updates expectation to include checkpoint_sequence in restored session config.
scripts/record_demo_gifs.js Updates Playwright automation flow, selectors, timing, and GIF encoding defaults.
README.md Adjusts embedded demo GIF widths to match new output sizing.
docs/assets/gifs/screenshots/* Removes older ad-hoc capture scripts and adds/updates screenshot assets used for docs/demo generation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +193 to +205
# Apply restore hook for the checkpoint's framework
if restored_state is not None:
framework = getattr(restored_state, "framework", "custom")
from agent_debugger_sdk.checkpoints import RESTORE_HOOK_REGISTRY

hook = RESTORE_HOOK_REGISTRY.get(framework)
if hook is not None:
try:
await hook(restored_state, object())
except Exception as exc:
ctx._hook_errors.append(exc)
logger.warning("Restore hook for %r failed: %s", framework, exc)

Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TraceContext.restore() invokes the framework restore hook with object() as the target. Built-in hooks (e.g. LangChain) assign attributes on target, and object() is not attribute-settable, so the hook will always raise and become a no-op. Also, if no hook is registered for a framework, the generic fallback described in the PR isn’t applied here. Consider calling apply_restore_hook(framework, restored_state, target) with a mutable target (or accept a target/agent parameter) so hooks can actually restore state and unknown frameworks use the generic hook.

Copilot uses AI. Check for mistakes.
Comment on lines +206 to +211
# Attach drift detector if requested
if track_drift:
from agent_debugger_sdk.drift import DriftDetector

ctx._drift_detector = DriftDetector([])

Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When track_drift=True, the drift detector is initialized with an empty original_events list (DriftDetector([])), which makes compare() a no-op for any index. To make drift detection meaningful, initialize it with the original post-checkpoint events (or fetch them when track_drift is enabled) so restored execution can be compared against the recorded baseline.

Copilot uses AI. Check for mistakes.
Comment on lines +240 to +242
async with httpx.AsyncClient() as client:
response = await client.get(
f"{resolved_url}/api/sessions/{orig_session_id}/events"
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-replay event fetching uses httpx.AsyncClient() without an explicit timeout and builds the URL via string concatenation. This can hang indefinitely on network issues and can produce double slashes if server_url ends with /. Consider using a bounded timeout (consistent with agent_debugger_sdk/transport.py) and normalizing resolved_url (e.g. rstrip('/')) or using base_url on the client.

Suggested change
async with httpx.AsyncClient() as client:
response = await client.get(
f"{resolved_url}/api/sessions/{orig_session_id}/events"
normalized_url = resolved_url.rstrip("/")
timeout = httpx.Timeout(10.0)
async with httpx.AsyncClient(timeout=timeout) as client:
response = await client.get(
f"{normalized_url}/api/sessions/{orig_session_id}/events"

Copilot uses AI. Check for mistakes.
Comment thread agent_debugger_sdk/drift.py Outdated
Comment on lines +106 to +120
# Check tool_name drift
if "tool_name" in orig_data and "tool_name" in new_data:
if orig_data["tool_name"] != new_data["tool_name"]:
return DriftEvent(
severity=DriftSeverity.WARNING,
description=(
f"Tool call drift at index {index}: "
f"expected tool {orig_data['tool_name']!r}, "
f"got {new_data['tool_name']!r}"
),
original_value=orig_data["tool_name"],
restored_value=new_data["tool_name"],
event_type=event_type,
index=index,
)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tool-call drift detection only checks data['tool_name'], but some event payloads use data['tool'] (including cases in the L3 replay tests). This means tool drift will be silently missed for those events. Consider checking both keys (e.g. tool_name and tool) and comparing whichever is present in both original and restored events.

Copilot uses AI. Check for mistakes.
Comment thread agent_debugger_sdk/drift.py Outdated
Comment on lines +122 to +126
# Check confidence drift
if "confidence" in orig_data and "confidence" in new_data:
delta = abs(float(orig_data["confidence"]) - float(new_data["confidence"]))
if delta >= _CONFIDENCE_DRIFT_THRESHOLD:
severity = DriftSeverity.CRITICAL if delta >= 0.5 else DriftSeverity.WARNING
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confidence drift comparison unconditionally casts both values to float(). If either side is None or a non-numeric string (possible in loosely-typed JSON payloads), this will raise and break drift detection. Consider guarding the cast (TypeError/ValueError) and treating non-numeric confidence as “not comparable” (returning None).

Copilot uses AI. Check for mistakes.
Comment thread agent_debugger_sdk/checkpoints/hooks.py Outdated


# Registry mapping framework name to restore hook callable
RESTORE_HOOK_REGISTRY: dict[str, Any] = {
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESTORE_HOOK_REGISTRY is typed as dict[str, Any], which loses the benefit of the RestoreHook protocol and makes misuse easier to miss. Consider typing it as dict[str, RestoreHook] (or MutableMapping[str, RestoreHook]) so static checkers can validate hook signatures.

Suggested change
RESTORE_HOOK_REGISTRY: dict[str, Any] = {
RESTORE_HOOK_REGISTRY: dict[str, RestoreHook] = {

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +130
class AutoReplayManager:
"""Orchestrates automatic event replay after checkpoint restoration.

Manages the lifecycle of replaying recorded events from a session,
applying restore hooks and optionally tracking drift.

Args:
events: List of events to replay (already filtered by sequence/importance).
framework: Framework identifier for hook lookup.
on_event: Optional callback invoked per event; return False to stop.
"""

def __init__(
self,
events: list[dict[str, Any]],
framework: str = "custom",
on_event: Any | None = None,
) -> None:
self.events = events
self.framework = framework
self.on_event = on_event
self.replayed: list[dict[str, Any]] = []

async def run(self) -> list[dict[str, Any]]:
"""Execute the replay sequence.

Returns:
List of events that were successfully replayed.
"""
for event in self.events:
if self.on_event is not None:
result = self.on_event(event)
if result is False:
break
self.replayed.append(event)
return self.replayed
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoReplayManager’s docstring says it “applies restore hooks and optionally tracks drift”, but run() currently only invokes on_event and appends events to self.replayed. Either implement hook/drift application here or adjust the docstring to match actual behavior to avoid misleading API consumers.

Copilot uses AI. Check for mistakes.
Comment thread scripts/record_demo_gifs.js Outdated
Comment on lines +25 to +28
const CHROME =
process.env.CHROME_PATH ||
"/home/nistrator/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome";
const UI = "http://localhost:8000/ui/";
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script defaults CHROME to a user-specific absolute path (/home/nistrator/...) and hardcodes the UI URL/port. This makes the script fail out-of-the-box for other developers/CI. Consider defaulting to Playwright’s managed browser (no executablePath) and making the UI base URL configurable via an env var (with a repo-appropriate default).

Copilot uses AI. Check for mistakes.
acailic added a commit that referenced this pull request Apr 5, 2026
- drift.py: check both tool_name and tool keys; guard float() conversion
- hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix
  AutoReplayManager docstring to match actual behavior
- trace_context.py: use /traces endpoint; use SimpleNamespace for
  restore target; seed DriftDetector with baseline; add httpx timeout
  and pagination; store restored_target
- record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL
  configurable via env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a4cb7ffef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +267 to +268
e for e in raw_events
if e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Filter replay events by real trace ordering fields

This filter reads a top-level sequence key that trace payloads from /api/sessions/{id}/traces do not provide (event ordering is carried in metadata), so the default checkpoint_sequence + 1 path is taken and nearly all fetched events are treated as post-checkpoint. In restores with replay_events=True, that replays pre-checkpoint history as well, which corrupts replay correctness and any downstream drift comparison.

Useful? React with 👍 / 👎.

Comment on lines +199 to +200
hook = RESTORE_HOOK_REGISTRY.get(framework)
if hook is not None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use generic restore hook when framework-specific hook is absent

Hook application is skipped entirely unless a framework key exists in RESTORE_HOOK_REGISTRY, so restores for custom/unknown frameworks never run the generic reconstruction path and leave ctx._restored_target unset. Since apply_restore_hook already defines a safe fallback, bypassing it here makes TraceContext.restore() silently drop restore-target reconstruction for non-registered frameworks.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@acailic
Copy link
Copy Markdown
Owner Author

acailic commented Apr 5, 2026

Addressed all remaining review feedback in commit 4144606:

P1 — Filter replay events by real trace ordering fields (chatgpt-codex-connector, @3037222732):
TraceEventSchema has no top-level sequence field (that field belongs to CheckpointSchema), so the previous filter e.get("sequence", checkpoint_sequence + 1) > checkpoint_sequence was silently passing every event through. Fixed by storing checkpoint_timestamp in the session config during restore_from_checkpoint and filtering events by e.get("timestamp", "") > checkpoint_tsTraceEventSchema does have a timestamp field, and ISO strings sort lexicographically.

P2 — Use generic restore hook when framework-specific hook is absent (chatgpt-codex-connector, @3037222734):
TraceContext.restore() was directly looking up RESTORE_HOOK_REGISTRY.get(framework) and skipping entirely if no hook was found. Replaced with apply_restore_hook() (already in hooks.py) which falls back to _generic_hook for unknown/custom frameworks.

All other comments (P1 mutable target, P1 session endpoint, httpx timeout, double-slash URL, tool_name/tool key, confidence guard, registry typing, AutoReplayManager docstring, demo script hardcoded path) were addressed in the previous commit fix: address all review comments on PR #140.

Ruff: clean. Tests: 31 passed, 1 skipped (pre-existing).

acailic added a commit that referenced this pull request Apr 6, 2026
- drift.py: check both tool_name and tool keys; guard float() conversion
- hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix
  AutoReplayManager docstring to match actual behavior
- trace_context.py: use /traces endpoint; use SimpleNamespace for
  restore target; seed DriftDetector with baseline; add httpx timeout
  and pagination; store restored_target
- record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL
  configurable via env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@acailic acailic force-pushed the issue-136-l3-replay branch from 4144606 to c8bcae0 Compare April 6, 2026 21:52
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

acailic added a commit that referenced this pull request Apr 6, 2026
- drift.py: check both tool_name and tool keys; guard float() conversion
- hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix
  AutoReplayManager docstring to match actual behavior
- trace_context.py: use /traces endpoint; use SimpleNamespace for
  restore target; seed DriftDetector with baseline; add httpx timeout
  and pagination; store restored_target
- record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL
  configurable via env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@acailic acailic force-pushed the issue-136-l3-replay branch from c8bcae0 to 3912e50 Compare April 6, 2026 21:56
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

acailic and others added 4 commits April 7, 2026 00:00
…hema extensions

- Add agent_debugger_sdk/checkpoints/hooks.py: RestoreHook protocol,
  RESTORE_HOOK_REGISTRY, apply_restore_hook(), and AutoReplayManager
- Add agent_debugger_sdk/drift.py: DriftSeverity enum, DriftEvent dataclass,
  and DriftDetector for comparing restored vs original execution
- Extend RestoreRequest with replay_events/track_drift fields and
  RestoreResponse with replayed_events_count/drift_detected fields
- Extend TraceContext.restore() with replay_events, track_drift,
  importance_threshold, and on_replay_event parameters
- Thread checkpoint_sequence through session config for post-checkpoint
  event filtering during auto-replay

Closes #136

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- drift.py: check both tool_name and tool keys; guard float() conversion
- hooks.py: type RESTORE_HOOK_REGISTRY as dict[str, RestoreHook]; fix
  AutoReplayManager docstring to match actual behavior
- trace_context.py: use /traces endpoint; use SimpleNamespace for
  restore target; seed DriftDetector with baseline; add httpx timeout
  and pagination; store restored_target
- record_demo_gifs.js: remove hardcoded CHROME_PATH; make UI URL
  configurable via env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tering

- Use apply_restore_hook() in TraceContext.restore() so unknown frameworks
  fall back to _generic_hook instead of skipping restoration entirely (P2)
- Store checkpoint_timestamp in session config during restore_from_checkpoint
  and filter replayed events by timestamp rather than a non-existent sequence
  field — TraceEventSchema has no top-level sequence key so the old filter
  was silently passing every event through (P1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@acailic acailic force-pushed the issue-136-l3-replay branch from 3912e50 to 07e69b0 Compare April 6, 2026 22:00
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@acailic acailic merged commit ea64233 into main Apr 6, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: implement L3 replay primitives — RestoreHook, DriftDetector, schema extensions

2 participants