feat: improve capture pipeline — opencode SQLite, codex prompt history, cursor model, MCP attribution fix #5
…emove duplicate codex capture
capture-codex.py: `git diff` misses brand-new untracked files. Add
`git ls-files --others --exclude-standard` pass so Codex attribution
works when it creates a file from scratch. Also updated `get_dirty_file_names`
to include untracked files in the pre-task snapshot for correct exclusion.
capture-claude.py: `get_model_and_prompt` was guessing the session file
path from the session ID, but Claude Code organises sessions by repo path
slug, not session ID. Switch to a recursive glob search across
~/.claude/projects/**/{session_id}.jsonl so the model name is always found.
codex.rs: `agentdiff configure` was writing both `notify` in config.toml
AND `UserPromptSubmit`/`Stop` in hooks.json. When codex_hooks=true, Codex
fires both for the same task — doubling every session.jsonl entry. Remove
the `notify` key when enabling codex_hooks so only hooks.json fires.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
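The session-file lookup described for capture-claude.py in the commit message above can be illustrated with a small sketch. This is not the code from the PR: the function name `find_session_file` and the `projects_root` parameter are hypothetical, and only the `~/.claude/projects/**/{session_id}.jsonl` pattern comes from the commit text.

```python
import glob
import os
from typing import Optional


def find_session_file(session_id: str, projects_root: str) -> Optional[str]:
    """Return the first <session_id>.jsonl found anywhere under projects_root.

    Hypothetical helper: the directory that holds the file is named after a
    repo-path slug, so it cannot be derived from the session ID alone.
    """
    pattern = os.path.join(
        glob.escape(projects_root),  # treat the root literally
        "**",                        # any depth, including zero directories
        f"{glob.escape(session_id)}.jsonl",
    )
    matches = sorted(glob.glob(pattern, recursive=True))
    return matches[0] if matches else None


if __name__ == "__main__":
    root = os.path.expanduser("~/.claude/projects")
    print(find_session_file("example-session-id", root))
```

Escaping both the root and the session ID keeps characters such as `[` or `*` in either value from being read as glob syntax.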
…nds, update CI section
- Fix incorrect claim that configure tracks all repos globally (init is required per repo)
- Remove commands that no longer exist: stats, log, remote-status, migrate, export
- Add install-ci to commands table
- Fix example flags: --out-md/--out-annotations → --out, agentdiff stats → agentdiff report
- Replace manual CI YAML with agentdiff install-ci workflow + correct manual example
- Fix install.sh URL: master → main
- Remove stale config.toml keys (data_dir, auto_amend_ledger)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y, no session-evidence files
- prepare-ledger: preserve agent="human" as semantic token; add git_author field separately so finalize-ledger can display the real git username without losing the human/AI distinction for type checks
- prepare-ledger: explicitly attribute files with no session.jsonl evidence to human rather than inheriting the dominant AI agent; fixes cases where AI and human edits are committed together and untracked files were incorrectly claimed by the AI
- finalize-ledger: read git_author from payload; use it for tool.name when agent=="human" so contributor.type=="human" traces show the committer name
- store: remove session.jsonl load from load_entries(), since only AgentTrace records belong in the committed view; add load_uncommitted_entries() for the --uncommitted path to avoid double-counting and copilot leakage
- list: use load_uncommitted_entries() for the uncommitted view
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to WSL+Windows paths
- Check ~/.cursor/ directory existence instead of hooks.json existence so the file is created when Cursor is installed but hooks.json is absent
- Extract configure_cursor_hooks_file() helper to apply the same hooks to multiple paths without duplication
- On WSL2, Cursor is a Windows app: scan /mnt/c/Users/*/.cursor and write hooks.json there alongside the WSL ~/.cursor/hooks.json so whichever path cursor-server resolves picks up the config
- Summary in print_configure_summary now checks presence_path (dir) separately from config_path (file) for all home-based tools, giving accurate output when the tool is installed but not yet configured
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… gotchas Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…QLite DB
- Added functions to retrieve the model ID and initial user prompt from the OpenCode SQLite database.
- Implemented fallback mechanisms to read from model.json if the database lookup fails.
- Updated the main capture logic to utilize the new retrieval functions for model and prompt.
- Introduced a comprehensive test script for the agentdiff pipeline, validating the entire capture, prepare, and finalize process with real and simulated agents.
- Improved cursor configuration in Rust to ensure versioning in hooks configuration.
… attribution fallback
- Drop always_log from capture-cursor.py and capture-codex.py: it was writing to log files on every agent event regardless of AGENTDIFF_DEBUG, silently filling ~/.agentdiff/logs/. All call sites replaced with debug_log (conditional on AGENTDIFF_DEBUG env var).
- Fix prepare-ledger.py: files with no session event but present in MCP files_read now correctly inherit the MCP agent/model instead of falling back to "human". Fixes CI mcp-smoke test failure: RuntimeError: expected model_id=mcp-smoke-model in trace entry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile Summary
This PR improves capture fidelity and attribution correctness across four agent integrations (OpenCode, Codex, Cursor, Claude Code) and fixes a misattribution bug in …
Key changes:
One notable concern: the new MCP attribution logic builds …

Confidence Score: 4/5
Safe to merge with one targeted fix: the basename-collision misattribution in prepare-ledger.py should be addressed before this hits repos with common filenames.

The majority of the PR is a clear improvement: logging cleanup, store separation, cursor configure fix, and git_author/agent separation are all correct and well-reasoned. The MCP attribution fix resolves the failing smoke test. The one concrete bug is in the new files_read_set lookup: because files_read contains absolute paths and files_touched contains repo-relative paths, all matching reduces to basename comparison, which risks misattributing unrelated files to the MCP agent whenever they share a filename. This is a real misattribution vector but limited to MCP-originated commits, so it does not break the primary single-agent path.

scripts/prepare-ledger.py: the files_read_set basename-union logic at lines 334–340 needs a path-normalisation fix to avoid false-positive MCP attribution.

Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Agent Hook Fires] --> B[Dispatch by agent type]
B --> C[capture-claude.py]
B --> D[capture-cursor.py]
B --> E[capture-codex.py]
B --> F[capture-opencode.py]
C --> C1[glob ~/.claude/projects for session JSONL - skip synthetic model values]
C --> C2[~/.claude/history.jsonl for prompt]
D --> D1[model from payload keys - fallback cursor-unknown]
D --> D2[cached prompt file or agent-transcript JSONL]
E --> E1[git diff + ls-files --others including untracked files]
E --> E2[~/.codex/history.jsonl for prompt]
F --> F1[SQLite opencode.db - modelID from latest assistant msg]
F --> F2[SQLite opencode.db - first user message text]
C1 & C2 & D1 & D2 & E1 & E2 & F1 & F2 --> G[Write entry to session.jsonl]
G --> H[pre-commit: prepare-ledger.py]
H --> H1[File has session event]
H --> H2[File has no session event]
H1 --> H3[Use session agent and model]
H2 --> H4[agent != human AND file in files_read]
H4 -->|Yes| H5[Use MCP agent and model]
H4 -->|No| H6[Attribute to human]
H3 & H5 & H6 --> I[pending_ledger.json with per-file attribution and git_author]
I --> J[post-commit: finalize-ledger.py]
J --> K[tool.name = git_author for human or agent name for AI]
K --> L[Signed AgentTrace appended to traces/branch.jsonl]
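
Node D1 in the flowchart above (model from payload keys, falling back to cursor-unknown) can be sketched as follows. The key names `model`, `model_name`, and `modelName` and the `cursor-unknown` default are taken from the PR description; the function name `cursor_model`, the lookup order, and the whitespace handling are assumptions made for illustration.

```python
from typing import Any, Mapping

# Candidate payload keys, tried in this (assumed) order.
MODEL_KEYS = ("model", "model_name", "modelName")


def cursor_model(payload: Mapping[str, Any], default: str = "cursor-unknown") -> str:
    """Return the first non-empty model string in the payload, else the default."""
    for key in MODEL_KEYS:
        value = payload.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return default


if __name__ == "__main__":
    print(cursor_model({"modelName": "example-model"}))
    print(cursor_model({}))
```

Returning a sentinel string instead of omitting the field keeps every session entry structurally uniform, which is the behaviour the PR description asks for.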
    files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
    for fp in files_touched:
        if fp not in events_by_file:
            if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):
                attribution[fp] = {"agent": agent, "model": model}
            else:
                attribution[fp] = {"agent": "human", "model": "human"}
Basename collision can cause false-positive MCP attribution
files_read_set is built by unioning full paths from files_read with their basenames. The check on line 337 then tests whether the committed file appears in that set by either its repo-relative path or its basename:
    files_read_set = {os.path.basename(f) for f in files_read} | set(files_read)
    ...
    if agent != "human" and (fp in files_read_set or os.path.basename(fp) in files_read_set):

Because files_read entries are absolute paths (e.g. /home/user/project/src/utils.py) while files_touched entries are repo-relative (e.g. src/utils.py), the full-path test fp in files_read_set will almost never match. In practice the check always falls through to the basename comparison: os.path.basename(fp) in files_read_set.
This means any committed file whose bare filename (e.g. utils.py, README.md, config.py) matches the basename of any file the MCP agent happened to read will be attributed to that MCP agent instead of to human, even if the touched file is completely unrelated. Common basenames make this a realistic misattribution risk.
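
A small worked example makes the collision concrete. The paths and the helper names `basename_union_hits` and `repo_relative_hits` are hypothetical; the first helper mirrors the check quoted above, and the second compares repo-relative paths exactly, assuming POSIX-style paths.

```python
import os
from typing import Iterable, List


def basename_union_hits(files_touched: Iterable[str], files_read: Iterable[str]) -> List[str]:
    """Matching as in the quoted check: full path OR bare filename."""
    files_read = list(files_read)
    read_set = {os.path.basename(f) for f in files_read} | set(files_read)
    return [fp for fp in files_touched
            if fp in read_set or os.path.basename(fp) in read_set]


def repo_relative_hits(files_touched: Iterable[str], files_read: Iterable[str],
                       repo_root: str) -> List[str]:
    """Matching on exact repo-relative paths only."""
    root = os.path.normpath(repo_root)
    read_rel = set()
    for f in files_read:
        # commonpath avoids the prefix trap where /repo would match /repo2/...
        if os.path.isabs(f) and os.path.commonpath([root, f]) == root:
            read_rel.add(os.path.relpath(f, root))
        else:
            read_rel.add(f)
    return [fp for fp in files_touched if fp in read_rel]


if __name__ == "__main__":
    root = "/home/user/project"                    # hypothetical repo root
    read = ["/home/user/project/src/utils.py"]     # absolute path read by the agent
    touched = ["src/utils.py", "tests/utils.py"]   # repo-relative committed files
    print(basename_union_hits(touched, read))
    print(repo_relative_hits(touched, read, root))
```

With these inputs the basename-based check claims both committed files, while the exact comparison claims only src/utils.py, the one file the agent actually read.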
Consider normalising files_read to repo-relative paths before comparing:
    files_read_rel = {
        os.path.relpath(f, repo_root) if f.startswith(repo_root) else f
        for f in files_read
    }
    for fp in files_touched:
        if fp not in events_by_file:
            if agent != "human" and fp in files_read_rel:
                attribution[fp] = {"agent": agent, "model": model}
            else:
                attribution[fp] = {"agent": "human", "model": "human"}

    try:
        conn = sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)
        # Get first user message for this session
        row = conn.execute(
            "SELECT id FROM message WHERE session_id=? "
            "AND json_extract(data,'$.role')='user' "
            "ORDER BY time_created ASC LIMIT 1",
            (session_id,),
        ).fetchone()
        if not row:
            conn.close()
            return "unknown"
        msg_id = row[0]
        # Get text parts for this message
        parts = conn.execute(
            "SELECT data FROM part WHERE message_id=? ORDER BY time_created ASC",
            (msg_id,),
        ).fetchall()
        conn.close()
        for part_row in parts:
            try:
                part = json.loads(part_row[0])
                if part.get("type") == "text" and part.get("text"):
                    text = str(part["text"]).strip()
                    debug_log(f"opencode prompt from DB: {text[:80]!r}")
                    return text[:500]
            except Exception:
                continue
    except Exception as exc:
        debug_log(f"opencode prompt DB lookup failed: {exc}")
    return "unknown"
SQLite connection leaked on exception before explicit conn.close()
In both get_opencode_model and get_opencode_prompt, the connection is closed with an explicit conn.close() only in the happy path. If conn.execute().fetchone() (or .fetchall()) raises an exception (e.g. schema mismatch, lock timeout, corrupt page), the except block logs and exits — but conn is never closed. Python's GC will eventually reclaim it, but the file lock can be held for up to timeout=2 seconds per invocation.
The standard fix is a context manager:
    import contextlib
    with contextlib.closing(sqlite3.connect(f"file:{_OPENCODE_DB}?mode=ro", uri=True, timeout=2)) as conn:
        row = conn.execute(...).fetchone()
        ...

This applies to both get_opencode_model (lines 60–68) and get_opencode_prompt (lines 101–118).
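
For reference, a self-contained version of the pattern suggested above might look like the following. The helper name `query_one` and its signature are hypothetical and are not taken from capture-opencode.py; the sketch only shows that `contextlib.closing` releases the connection on both the success and the error path.

```python
import contextlib
import sqlite3
from typing import Any, Optional, Sequence, Tuple


def query_one(db_path: str, sql: str, params: Sequence[Any] = ()) -> Optional[Tuple]:
    """Run one read-only query and return the first row, or None on any SQLite error.

    contextlib.closing() calls conn.close() when the block exits, whether it
    exits normally or because execute() raised.
    """
    uri = f"file:{db_path}?mode=ro"  # read-only open, as in the reviewed code
    try:
        with contextlib.closing(sqlite3.connect(uri, uri=True, timeout=2)) as conn:
            return conn.execute(sql, params).fetchone()
    except sqlite3.Error:
        return None


if __name__ == "__main__":
    print(query_one("/nonexistent/opencode.db", "SELECT 1"))
```

Note that using a `sqlite3.Connection` directly as a context manager only commits or rolls back the transaction; it does not close the connection, which is why `contextlib.closing` is used here.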
    _OPENCODE_DB = os.path.expanduser("~/.local/share/opencode/opencode.db")
    _OPENCODE_MODEL_JSON = os.path.expanduser("~/.local/state/opencode/model.json")
DB path in code differs from PR description
The PR description states the SQLite database is read from ~/.opencode/opencode.db, but the implementation uses ~/.local/share/opencode/opencode.db. If OpenCode stores its database at ~/.opencode/opencode.db on some platforms or installation methods, the DB lookup will silently fail and fall back to model.json / default "opencode".
Worth verifying the canonical DB path across all supported OpenCode installation methods (binary, npm, homebrew) and potentially probing both locations. Is ~/.local/share/opencode/opencode.db the correct path for all OpenCode installation methods, or does it sometimes reside at ~/.opencode/opencode.db as stated in the PR description?
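
Probing both locations could be as simple as the sketch below. The two candidate paths are the ones named in this thread; the helper name `first_existing` and the order of preference are assumptions, and which path a given OpenCode install actually uses is not confirmed here.

```python
import os
from typing import Iterable, Optional

# Candidate locations mentioned in the review; order of preference is assumed.
DB_CANDIDATES = (
    "~/.local/share/opencode/opencode.db",
    "~/.opencode/opencode.db",
)


def first_existing(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists as a regular file, else None."""
    for candidate in candidates:
        path = os.path.expanduser(candidate)
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(first_existing(DB_CANDIDATES))
```

A caller would treat a None result the same way as a failed DB lookup and continue with the existing model.json fallback.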
…cate debug logs
- prepare-ledger: replace basename-union files_read_set with repo-relative path normalisation; full-path match now fires correctly, eliminating false-positive MCP attribution on common filenames (e.g. utils.py)
- capture-opencode: guard both SQLite connections with contextlib.closing so the file lock is released on exception; probe both DB path candidates (~/.local/share/opencode and ~/.opencode) to cover all install methods
- capture-codex: remove 5 duplicate debug_log lines that were strict subsets of the preceding log call
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AgentDiff Report
Summary
Files Modified
Summary
- …`~/.opencode/opencode.db`) instead of relying on env vars that weren't being set reliably. Falls back to JSON log lookup, then env, then defaults.
- …`~/.codex/history/`) by session ID, enabling prompt attribution on completions where the prompt wasn't passed directly.
- …`model`/`model_name`/`modelName` from the hook payload; falls back to `cursor-unknown` rather than omitting the field.
- …`PostToolUse` events before the first tool call in a session.
- …`src/configure/cursor.rs`): Check directory existence, not file; create `hooks.json` if absent rather than erroring when the config file doesn't exist yet.

Bug fix
`prepare-ledger.py`: Files with no matching session.jsonl event but present in MCP `files_read` were unconditionally attributed to `"human"`. When an MCP server provides agent/model context via `pending.json`, those files should inherit the MCP agent, not fall back to human. This was causing the CI `mcp-smoke` test to fail.

Fix: before falling back to human, check if the file appears in `files_read` from the pending MCP context and the top-level agent is non-human. If so, use the MCP agent/model.

Logging cleanup
`always_log` in `capture-cursor.py` and `capture-codex.py` was writing to log files on every agent event unconditionally (regardless of `AGENTDIFF_DEBUG`), silently filling `~/.agentdiff/logs/`. Removed the function; all call sites replaced with `debug_log`, which is gated on the env var.

Test plan
- `agentdiff verify` on a repo with opencode traces shows correct model (not `unknown`)
- `agentdiff list` on a codex-traced repo shows prompt populated from history
- `scripts/tests/mcp-smoke.txt` trace has `model_id = mcp-smoke-model`
- `~/.agentdiff/logs/` does not accumulate `capture-cursor.log` / `capture-codex.log` entries during normal (non-debug) use
- `agentdiff configure cursor` on a machine without an existing `hooks.json` succeeds

🤖 Generated with Claude Code