Developed at MATS Exploration Phase under Neel Nanda, for a research project with Greg Kocher.
A harness for running multi-session agent trajectories using the Claude Agent SDK, capturing them in ATIF (Agent Trajectory Interchange Format), and tracking file state changes across sessions.
Built for AI alignment and interpretability research — studying how LLM agents behave across multi-turn, multi-session, multi-agent interactions.
Note: AgentLens currently supports Claude Code via the Claude Agent SDK. Support for additional agents and frameworks is planned — see Roadmap. Some features (especially turn-level replay) are experimental. We welcome PRs and contributions — open an issue if you run into bugs.
The harness takes a YAML config describing a sequence of sessions (prompts to an agent), runs each session against a working directory via the Claude Agent SDK, and produces structured outputs:
- ATIF trajectories — standardized JSON capturing every agent step, tool call, observation, and thinking block
- Shadow git change tracking — automatic tracking of all file changes via an invisible git repo, with per-step write attribution and full unified diffs
- Session chaining — three modes for controlling how sessions relate to each other (isolated, chained, forked)
- Resampling & replay — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit assistant text, tool results, or system prompts and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
- Subagent capture — separate ATIF trajectories for each subagent invocation, linked to the parent via `SubagentTrajectoryRef`
Requires Python >= 3.12 and uv.
```bash
git clone <this-repo>
cd agentlens
uv sync
```

If you have a Claude Code subscription (Pro/Max), no API key is needed — the SDK uses your subscription credentials automatically. Otherwise, set an API key:
```bash
export ANTHROPIC_API_KEY=sk-ant-...    # Anthropic API key
# or
export OPENROUTER_API_KEY=sk-or-...    # OpenRouter (set provider: openrouter in config)
```

Run the smoke test:

```bash
harness run tests/smoke.yaml
```

Inspect results:

```bash
harness inspect runs/<run-name>
```

Browse in the web UI:

```bash
cd ui && npm install && npm run dev
# Open http://localhost:5173
```

The harness uses the Claude Agent SDK to run Claude Code sessions programmatically. Only Claude models are supported — the SDK speaks the Anthropic Messages API protocol and cannot run non-Claude models. Set the `provider` field in your config to choose how to route API calls.
| Provider | Config value | Env var | Notes |
|---|---|---|---|
| Anthropic | `anthropic` (default) | `ANTHROPIC_API_KEY` | Direct Anthropic API. If no key is set, falls back to Claude Code subscription credentials. |
| OpenRouter | `openrouter` | `OPENROUTER_API_KEY` | Routes through OpenRouter. The harness sets `ANTHROPIC_BASE_URL` automatically. |
| AWS Bedrock | `bedrock` | Standard AWS credentials (`AWS_ACCESS_KEY_ID`, etc.) | Sets `CLAUDE_CODE_USE_BEDROCK=1`. |
| GCP Vertex AI | `vertex` | Standard GCP credentials (`GOOGLE_APPLICATION_CREDENTIALS`, etc.) | Sets `CLAUDE_CODE_USE_VERTEX=1`. |
You can also set `base_url` in your config to point at a custom Anthropic-compatible endpoint.

With `provider: anthropic` (the default), if no `ANTHROPIC_API_KEY` is set, the SDK falls back to your Claude Code subscription credentials from `~/.claude/credentials.json` (requires Claude Pro/Max). Usage is covered by your subscription with rate limits rather than per-token billing. If `ANTHROPIC_API_KEY` is set in your environment, it takes precedence over subscription credentials.
Cost reporting caveat: Cost figures in `run_meta.json` and the web UI come from the SDK and are based on Anthropic's list pricing regardless of provider. They may not match your actual bill (especially on OpenRouter, Bedrock, or Vertex) and are purely informational when using a Claude Code subscription.
Example configs:
```yaml
# Anthropic (default) — uses API key or Claude Code subscription
model: "claude-sonnet-4-20250514"
provider: anthropic
```

```yaml
# OpenRouter
model: "claude-sonnet-4-20250514"
provider: openrouter
```

Experiments are defined as YAML config files. Here's a full example:
```yaml
model: "claude-sonnet-4-20250514"
provider: anthropic                 # anthropic | openrouter | bedrock | vertex
hypothesis: "The agent preserves hedging across sessions"  # what this experiment tests
work_dir: "./repos/my_project"      # working directory the agent operates in
session_mode: chained               # isolated | chained | forked
tags: ["experiment-1"]
system_prompt: |
  You are exploring a Python codebase. Use MEMORY.md to keep notes.
allowed_tools:                      # Claude Code tools the agent can use
  - Read
  - Grep
  - Glob
  - Bash
  - Write
  - Edit
max_turns: 30                       # max agent turns per session
permission_mode: bypassPermissions  # acceptEdits | bypassPermissions
max_budget_usd: 1.00                # optional spend cap per session
load_project_settings: false        # whether to load the repo's CLAUDE.md
memory_file: "MEMORY.md"            # auto-seeded file in working dir (default: MEMORY.md)
memory_seed: "# Project Notes\n"    # initial content if file doesn't exist
revert_work_dir: true               # reset working dir after run (default: false)
sessions:
  - session_index: 1
    prompt: "Explore the project structure. Take notes in MEMORY.md."
  - session_index: 2
    prompt: "Read the main module in detail. Update your notes."
  - session_index: 3
    prompt: "Summarize what you know about this project."
    max_turns: 10                   # per-session override
```

All file changes in the working directory are tracked automatically via a shadow git — a bare git repo stored in the run output directory (`.shadow_git/`). The agent never sees this repo; the harness uses the `GIT_DIR`/`GIT_WORK_TREE` environment variables, so no `.git/` ever appears inside the working directory.
This enables:
- Full diffs — every file change is captured automatically, no need to declare files upfront
- Turn-level replay — git worktrees provide isolated filesystem copies at any turn's state for parallel replay execution
- Per-step attribution — file writes are detected after each tool-using step and logged to `state_changelog.jsonl`
- Session diffs — unified patches showing what each session changed, saved as `session_diff.patch`
The working directory does not need to be a git repo. The shadow git works with any directory.
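The shadow-git trick boils down to pointing git at a hidden repo directory while treating the agent's directory as the work tree. A minimal sketch (function names here are illustrative; the real implementation lives in `shadow_git.py`):

```python
import os
import subprocess


def shadow_env(shadow_dir: str, work_dir: str) -> dict:
    """Environment that points git at the hidden repo.

    GIT_DIR names the hidden repo; GIT_WORK_TREE names the agent's
    working directory. Because no .git/ exists inside work_dir, the
    agent cannot discover the tracker.
    """
    return {**os.environ, "GIT_DIR": shadow_dir, "GIT_WORK_TREE": work_dir}


def shadow_git(args: list[str], shadow_dir: str, work_dir: str):
    # e.g. shadow_git(["status", "--porcelain"], "runs/x/.shadow_git", "repos/proj")
    return subprocess.run(
        ["git", *args],
        env=shadow_env(shadow_dir, work_dir),
        capture_output=True,
        text=True,
    )
```

Any git subcommand (`add -A`, `commit`, `diff`) run through such an environment records changes without leaving artifacts in the agent's view.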
- Memory file is auto-seeded. The harness creates `MEMORY.md` (or whatever `memory_file` is set to) with the `memory_seed` content if it doesn't already exist.
- Working directory path is injected into the system prompt. The harness appends the absolute path and memory file location to the system prompt so the agent knows where to read/write.
- The agent's cwd is the working directory. Set to the resolved `work_dir`.
| Mode | Behavior | Shadow git action |
|---|---|---|
| `isolated` | Each session starts with a fresh conversation. File changes persist. | No reset |
| `chained` | Each session resumes from the previous session's conversation. Full context preserved. | Changes accumulate (no reset) |
| `forked` | Sessions 2+ fork from session 1. Each sees session 1's context but not each other's. | Reset to session 1's end state |
For more control than `session_mode: forked` provides, use `fork_from` on individual sessions to fork from any prior session — not just session 1:

```yaml
session_mode: isolated   # fork_from overrides session_mode per-session
sessions:
  - session_index: 1
    prompt: "Explore the codebase and take notes in MEMORY.md"
  - session_index: 2
    prompt: "Write a security analysis based on your notes"
    fork_from: 1   # forks from session 1's conversation
  - session_index: 3
    prompt: "Write a performance analysis based on your notes"
    fork_from: 1   # also forks from session 1 (independent of session 2)
```

`fork_from` must reference a session with a lower index. It works with any `session_mode` — when set, it overrides the mode for that session.
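The two constraints (sequential indices, `fork_from` pointing backwards) are easy to check before starting a long run. A sketch operating on the parsed config dict (the function name is illustrative; the real validation lives in the Pydantic models in `config.py`):

```python
def check_sessions(cfg: dict) -> int:
    """Validate session ordering and fork_from constraints; return session count."""
    sessions = cfg["sessions"]
    for expected, s in enumerate(sessions, start=1):
        if s["session_index"] != expected:
            raise ValueError("session_index must be sequential starting at 1")
        fork = s.get("fork_from")
        if fork is not None and fork >= s["session_index"]:
            raise ValueError("fork_from must reference a lower session index")
    return len(sessions)
```

For example, `check_sessions({"sessions": [{"session_index": 1}, {"session_index": 2, "fork_from": 1}]})` passes, while a `fork_from: 3` on session 2 raises.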
To study behavioral variance, run the same forked session multiple times:
```yaml
sessions:
  - session_index: 1
    prompt: "Explore the codebase and take notes"
  - session_index: 2
    prompt: "Write a security analysis based on your notes"
    fork_from: 1
    count: 5   # run 5 replicates of this session
```

Replicates use a `_rNN` suffix on the session directory:
```
session_01/          # session 1 (count=1, no suffix)
session_02_r01/      # session 2, replicate 1 of 5
session_02_r02/      # session 2, replicate 2 of 5
...
session_02_r05/      # session 2, replicate 5 of 5
```
Sessions with `count: 1` (the default) use the normal `session_NN/` directory name. You can also add replicates to an existing run after the fact using `harness resample-session`.
A config can define subagents that the main agent delegates work to via the `Agent` tool. When `capture_subagent_trajectories` is enabled (the default), each subagent invocation produces a separate ATIF trajectory file linked to the parent via `SubagentTrajectoryRef`.
```yaml
agents:
  - name: "code-explorer"
    description: "Explores code structure, reads files, and reports findings."
    prompt: "You are a code exploration specialist. Read files and report structure."
    tools: ["Read", "Glob", "Grep"]   # tool restrictions (null = inherit all)
    model: "sonnet"                   # sonnet | opus | haiku | inherit
```

Each agent in `agents` has:
| Field | Required | Default | Description |
|---|---|---|---|
| `name` | yes | — | Agent name (used as key in SDK's `agents` dict) |
| `description` | yes | — | When to use this agent (shown to the parent) |
| `prompt` | yes | — | System prompt for the subagent |
| `tools` | no | inherit all | Tool restrictions for the subagent |
| `model` | no | inherit | Model override: `sonnet`, `opus`, `haiku`, or `inherit` |
The `Agent` tool is automatically added to `allowed_tools` when `agents` is non-empty.

Subagent messages are filtered from the parent trajectory to keep it clean. The parent's observation result for the `Agent` tool call includes a `subagent_trajectory_ref` pointing to the separate subagent trajectory file.
| Field | Required | Default | Description |
|---|---|---|---|
| `model` | yes | — | Claude model identifier (e.g. `claude-sonnet-4-20250514`). Use Anthropic model names, not OpenRouter-format names. |
| `provider` | no | `anthropic` | API provider: `anthropic`, `openrouter`, `bedrock`, `vertex` |
| `base_url` | no | — | Custom API base URL (overrides provider default) |
| `hypothesis` | no | — | One-sentence hypothesis this experiment tests. Shown in the web UI and saved to `run_meta.json`. |
| `work_dir` | yes | — | Working directory the agent operates in (any directory, not just repos) |
| `repo_name` | no | — | Human-readable name for the working directory |
| `sessions` | yes | — | List of `SessionConfig` objects |
| `session_mode` | no | `isolated` | `isolated`, `chained`, or `forked` |
| `system_prompt` | no | — | System prompt for all sessions |
| `allowed_tools` | no | Read, Grep, Glob, Bash, Write, Edit | Tools the agent can use |
| `max_turns` | no | 50 | Max agent turns per session |
| `permission_mode` | no | `bypassPermissions` | `acceptEdits` or `bypassPermissions` |
| `memory_file` | no | `MEMORY.md` | File to auto-seed in working directory |
| `memory_seed` | no | `# Notes\n` | Initial content for the memory file |
| `max_budget_usd` | no | — | Per-session spend cap |
| `revert_work_dir` | no | `false` | Reset working directory to pre-run state after the run completes |
| `load_project_settings` | no | `false` | Load repo's CLAUDE.md and `.claude/settings.json` |
| `agents` | no | `[]` | Subagent definitions (see Subagents) |
| `capture_subagent_trajectories` | no | `true` | Save separate ATIF trajectories for each subagent invocation |
| `capture_api_requests` | no | `true` | Capture raw API requests via proxy (enables resampling and intervention testing) |
| `run_name` | no | auto-generated | Custom name for the run directory |
| `tags` | no | `[]` | Metadata tags |
Each session in `sessions` has:

| Field | Required | Default | Description |
|---|---|---|---|
| `session_index` | yes | — | Sequential index starting at 1 |
| `prompt` | yes | — | The user prompt for this session |
| `system_prompt` | no | — | Per-session system prompt override |
| `max_turns` | no | — | Per-session max turns override |
| `fork_from` | no | — | Session index to fork from (must be lower). Overrides `session_mode` for this session. |
| `count` | no | 1 | Run this session N times as independent replicates. Directories get `_rNN` suffix. |
| Command | Description |
|---|---|
| `harness run <config.yaml>` | Run an experiment |
| `harness list [--json]` | List completed runs |
| `harness inspect <run_dir> [--json]` | Show run details |
| `harness resample <run_dir> --session N --request N --count N` | Resample an API turn |
| `harness resample-edit <run_dir> --session N --request N --dump/--input` | Edit & resample |
| `harness resample-session <run_dir> --session N --count N` | Re-run a session N times |
| `harness replay <run_dir> --session N --turn N --count N` | Replay from a turn |
```bash
harness run examples/isolated.yaml \
  --model anthropic/claude-sonnet-4 \
  --tag baseline \
  --session-mode chained \
  --run-name my-run-01 \
  --runs-dir ./output \
  --no-capture    # disable API capture (disables resampling)
```

```
$ harness inspect runs/smoke-test-01
Run: smoke-test-01
Model: anthropic/claude-sonnet-4 (openrouter)
Mode: isolated
Tags: smoke-test
Total: 15 steps, 5 tool calls
Cost: $0.0596
File writes: 1
Session 1: 15 steps, 5 tool calls  $0.0596
File changes:
  session 1, step 15: MEMORY.md (+9/-0)
```
Replay a specific API turn N times to study output variance:
```bash
# Discover available requests
harness resample runs/my-run --session 1 --list-requests

# Resample request 5 ten times
harness resample runs/my-run --session 1 --request 5 --count 10

# Resample from a replicate session
harness resample runs/my-run --session 2 --replicate 3 --request 5 --count 5
```

Resample results are saved to `session_NN/resamples/request_NNN/` and can be viewed in the web UI.
Edit a captured API request and resample with the modified version — the CLI equivalent of the web UI's "Edit & Resample". Designed for scriptable intervention testing.
```bash
# Step 1: Dump the request for editing
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json

# Step 2: Edit the JSON (assistant text, tool results, system prompt...)

# Step 3: Resample with the modified request
harness resample-edit runs/my-run --session 1 --request 5 \
  --input edit.json --label "removed hedging" --count 5
```

Pipe through jq for programmatic edits:
```bash
harness resample-edit runs/my-run --session 1 --request 5 --dump \
  | jq '.system = "You are a cautious engineer. Double-check everything."' \
  | harness resample-edit runs/my-run --session 1 --request 5 \
      --input - --label "cautious prompt" --count 10
```

Note: Thinking blocks cannot be edited — they carry cryptographic signatures validated by the API. See Thinking blocks for details.
Variants are saved alongside vanilla resamples and appear in the web UI.
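The same kind of programmatic edit can be done in Python instead of jq. A sketch, assuming the dumped request is a Messages API body with a top-level `system` field (as the jq example suggests):

```python
import json


def set_system(request_json: str, new_system: str) -> str:
    """Replace the system prompt in a dumped request body.

    Equivalent to: jq '.system = "..."' on the --dump output.
    """
    body = json.loads(request_json)
    body["system"] = new_system  # overwrite the top-level system field
    return json.dumps(body)
```

The result can be fed back via `--input` exactly as in the jq pipeline.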
Re-run a forked session N times to study behavioral variance across full trajectories:
```bash
harness resample-session runs/my-run --session 2 --count 5
```

This finds session 2's `fork_from` target, resolves the session ID to fork from, and runs 5 new replicates. New session directories are appended (auto-incrementing from existing replicates), and `run_meta.json` is updated.
Experimental. Turn-level replay with git worktree filesystem reset is new and likely has bugs. If you run into issues, please open an issue.
Limitation: Replay resets the filesystem to the target turn's state, but cannot undo side effects outside the working directory (e.g. network requests, shell commands, environment changes). It works best with file-focused workflows.
Replay a session from any API turn with full tool execution. Each replicate runs in an isolated git worktree, so multiple replicates execute in parallel. Each replay becomes a new independent run with full provenance back to the source.
```bash
# List available turns
harness replay runs/my-run --session 1 --list-turns

# Replay from turn 5, three times (only session 1 runs)
harness replay runs/my-run --session 1 --turn 5 --count 3

# Replay session 1 turn 5, then continue with sessions 2, 3, etc.
harness replay runs/my-run --session 1 --turn 5 --continue-sessions

# Replay with an additional prompt after tool results
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"
```

By default, replay only runs the targeted session. Use `--continue-sessions` to also run subsequent sessions from the original config.
Replay creates new run directories (e.g. replay_my-run_s1_t5_r01_<timestamp>/) with full artifacts. Each includes a replay_meta.json with provenance linking back to the source run, session, and turn. The source working directory is never modified.
A SvelteKit web UI for browsing runs, trajectories, memory diffs, and resamples:
```bash
cd ui
npm install
npm run dev
```

Open http://localhost:5173. The UI reads from the `runs/` directory and provides:
- Run list — searchable/filterable list of all runs with model, cost, session count
- Run overview — metrics, session list with fork relationships, hypothesis display
- Trajectory viewer — full chat view with thinking blocks, tool calls, and observations
- Memory diff — before/after diffs of the memory file per session
- API captures — request/response viewer with token usage, system prompts, tool definitions, compaction events
- Subagent viewer — separate trajectory view for each subagent, with task prompt and return value
- Resamples — compare N resample outputs for a given API turn
- Edit & Resample — interactive message editor for intervention testing: edit assistant text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior (thinking blocks are shown read-only — see why)
- Changelog — per-step file write log across all sessions with expandable diffs
- Config viewer — frozen YAML config from the run
- Analysis — rendered markdown from `analysis.md`
- Dark mode — toggle between light and dark themes
The UI expects `RUNS_DIR=../runs` (configured in `ui/.env`).
Each run produces a directory under runs/:
```
runs/<run_name>/
├── config.yaml                   # frozen copy of the run config
├── run_meta.json                 # run-level metadata and aggregates
├── full_diff.patch               # unified diff of all changes (baseline → final)
├── state_changelog.jsonl         # per-step write log across all sessions
├── analysis.md                   # experiment analysis (if created)
├── .shadow_git/                  # shadow git repo (invisible change tracker)
│
├── session_01/
│   ├── trajectory.json           # ATIF v1.6 trajectory (parent)
│   ├── transcript.jsonl          # Claude Code transcript (for replay)
│   ├── uuid_map.json             # turn correlation map (transcript ↔ ATIF ↔ raw dumps)
│   ├── session_diff.patch        # unified diff of this session's changes
│   ├── subagent_<name>_<id>.json # subagent ATIF trajectory (if any)
│   ├── api_captures.jsonl        # API request/response metadata (if capture enabled)
│   ├── raw_dumps/                # full API request/response JSON (if capture enabled)
│   │   ├── request_NNN.json
│   │   ├── request_NNN_headers.json
│   │   ├── response_NNN.txt
│   │   └── response_NNN_headers.json
│   └── resamples/                # resample outputs (created by UI or CLI)
│       ├── request_005/          # vanilla resamples for request 5
│       │   ├── sample_01.json
│       │   └── sample_02.json
│       └── request_005_v01/      # intervention variant
│           ├── variant.json      # edit metadata (label, find/replace pairs)
│           ├── request.json      # modified request body
│           └── sample_01.json
│
├── session_02/                   # session 2 (count=1)
│   └── ...
├── session_03_r01/               # session 3, replicate 1 (count=3)
├── session_03_r02/               # session 3, replicate 2
└── session_03_r03/               # session 3, replicate 3
```
Each session produces a `trajectory.json` in ATIF v1.6 format. Key fields:

- `steps[].source` — `"agent"`, `"user"`, or `"system"`
- `steps[].message` — the text content of the step
- `steps[].reasoning_content` — extended thinking / chain-of-thought (when available)
- `steps[].tool_calls[]` — tool invocations with function name and arguments
- `steps[].observation` — tool results, linked back to their tool call by `source_call_id`
- `final_metrics` — token counts, cost, step count
`state_changelog.jsonl` records every detected file write with step-level attribution:
```json
{
  "session_index": 1,
  "step_id": 15,
  "file_path": "MEMORY.md",
  "diff": "--- MEMORY.md\n+++ MEMORY.md\n@@ ...",
  "diff_stats": {"added": 9, "removed": 0}
}
```

When `capture_api_requests: true` is set (or `--no-capture` is not passed), the harness runs a local reverse proxy between the SDK and the API. This captures data not available in the SDK message stream:
- System prompt — the SDK's system prompt (a minimal agent prompt plus your `system_prompt` config)
- Tool definitions — JSON schemas for each tool (Read, Write, Bash, etc.)
- Context management — `applied_edits` from the API response when compaction occurs
- Per-request token usage — input/output tokens, cache creation/read breakdown
- Compaction detection — when message count drops between requests, captures the post-compaction messages
- Sampling parameters — model, temperature, max_tokens
- Agent context — classifies each request as `main`, `subagent`, or `sdk_internal`
The proxy logs to `api_captures.jsonl` in each session directory. System prompt and tools are logged in full on the first request and on change; otherwise only a hash is recorded to keep file sizes small.

Raw request/response bodies are saved to `raw_dumps/` for resampling and intervention testing.
```
src/harness/
├── config.py             # Pydantic config models, YAML loading
├── shadow_git.py         # Shadow git: invisible change tracking via GIT_DIR/GIT_WORK_TREE
├── state.py              # Per-step write detection via shadow git index
├── atif_adapter.py       # Claude SDK Message -> ATIF Step mapping
├── runner.py             # Single session execution
├── experiment.py         # Multi-session orchestration (fork_from, replicates, shadow git lifecycle)
├── proxy.py              # Reverse proxy for raw API request/response capture
├── resample.py           # Single-turn API resampling
├── resample_session.py   # Full session resampling (resample-session CLI)
├── transcript.py         # Transcript parser and truncation for turn-level replay
├── uuid_map.py           # UUID map builder — correlates transcript, ATIF, and raw API dumps
├── replay.py             # Turn-level replay orchestrator
└── cli.py                # Typer CLI
```
The core complexity lives in `atif_adapter.py`: the Claude Agent SDK streams messages (`AssistantMessage`, `UserMessage`, `SystemMessage`, `ResultMessage`) and the adapter maps them into ATIF steps with correct tool call / observation pairing, thinking block capture, and sequential step IDs.
- Multi-agent support — extend beyond Claude Code to support other agent frameworks and LLM providers (Codex, Devin, custom agents, etc.)
- Comparative analysis — side-by-side trajectory comparison across agents, models, and prompt variants
- Richer intervention toolkit — programmatic intervention pipelines for systematic counterfactual testing
- Scoring & evaluation — built-in trajectory scoring and automated evaluation metrics
We welcome PRs and contributions! Whether it's bug fixes, new features, documentation improvements, or support for additional agent frameworks — all contributions are appreciated.
- claude-agent-sdk — runs Claude Code sessions programmatically
- harbor — ATIF Pydantic models for trajectory validation
- typer — CLI framework
- pyyaml — config file loading
- pydantic — config validation



