Skip to content

Add --keep-sessions flag and rejudge command to skill-validator#149

Draft
lewing wants to merge 6 commits intomainfrom
incremental-results
Draft

Add --keep-sessions flag and rejudge command to skill-validator#149
lewing wants to merge 6 commits intomainfrom
incremental-results

Conversation

@lewing
Copy link
Member

@lewing lewing commented Feb 27, 2026

Summary

Adds two opt-in features to the skill-validator eval framework:

  1. --keep-sessions flag — Preserves agent session data (SDK events, config dirs) in a SQLite database (sessions.db) under the timestamped results directory instead of deleting them after evaluation.

  2. rejudge command — Re-runs judges on previously saved sessions without re-running the expensive agent LLM calls. Useful for iterating on judge prompts or trying different judge models.

What's in the sessions database

Each eval run pair (baseline + with-skill) is tracked with:

  • Session ID, skill name/path, scenario name, run index, role, model
  • Config and work directory paths
  • Prompt — the eval scenario prompt text (for browsing/replay)
  • Skill SHA — a truncated SHA-256 hash of the skill directory contents for change detection
  • Run metrics, judge results, and pairwise comparison results
  • Status tracking (running → completed/timed_out)

Design decisions

  • Fully opt-in — without --keep-sessions, behavior is unchanged. sessionDb stays null and all sessionDb?. calls are no-ops.
  • Thread-safe SQLite — WAL mode + SemaphoreSlim write lock for concurrent scenario execution. Each eval process gets its own timestamped results dir → separate sessions.db → no cross-process conflicts.
  • Work dirs always cleaned — only config dirs (containing events.jsonl) are preserved; temp work dirs are always deleted.
  • Skill SHA over full content — stores a 12-char hex hash of the skill directory rather than the full SKILL.md text, giving change detection without bloating the DB.

Files changed

File Change
SessionDatabase.cs New — thread-safe SQLite service with schema, CRUD, and ComputeDirectorySha
RejudgeCommand.cs New — subcommand to re-run judges on saved sessions
ValidateCommand.cs Wire --keep-sessions flag, session registration, prompt + skill SHA
AgentRunner.cs Separate config/work dir lifecycle, extended RunOptions
Models.cs Added KeepSessions to ValidatorConfig
Program.cs Register RejudgeCommand
SkillValidator.csproj Added Microsoft.Data.Sqlite dependency
SessionDatabaseTests.cs New — 11 tests covering CRUD, concurrency, isolation, SHA determinism

Testing

All 192 tests pass (pre-existing + 11 new session DB tests).

Supersedes #135 (reopened from non-forked branch per review feedback).

lewing and others added 3 commits February 27, 2026 11:19
- Add --keep-sessions flag to preserve agent session data in SQLite DB
- Create SessionDatabase service with WAL mode and thread-safe writes
- Update AgentRunner to separate config/work dir lifecycle
- Track session metadata, RunMetrics, judge and pairwise results
- Add rejudge command to re-run judges on saved sessions
- Add 8 tests for SessionDatabase (CRUD, concurrency, isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Store the eval scenario prompt and SKILL.md content in the sessions
table so they are available for browsing and replay without needing
the original skill directory.

Both columns are optional (nullable) for backward compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute a SHA-256 hash (truncated to 12 hex chars) over all files in
the skill directory, sorted by relative path for determinism. This
gives lightweight change detection without storing potentially large
skill content in the database.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

Skill Validation Results

Skill Scenario Baseline With Skill Δ Skills Loaded Overfit Verdict
microbenchmarking Investigate runtime upgrade performance impact 2.5/5 ⏰ timeout 4.0/5 +1.5 ✅ microbenchmarking; tools: skill, stop_bash, glob, read_bash ✅ 0.11
csharp-scripts Test a C# language feature with a script 2.5/5 4.5/5 +2.0 ✅ csharp-scripts; tools: skill, create, edit 🟡 0.32
thread-abort-migration Worker thread with abort-based cancellation 5.0/5 5.0/5 0.0 ✅ thread-abort-migration; tools: skill ✅ 0.13
thread-abort-migration Timeout enforcement via Thread.Abort 4.0/5 5.0/5 +1.0 ✅ thread-abort-migration; tools: report_intent, skill ✅ 0.13
thread-abort-migration Blocking WaitHandle with Thread.Interrupt 3.5/5 4.0/5 +0.5 ✅ thread-abort-migration; tools: skill ✅ 0.13
thread-abort-migration ASP.NET Response.End and Response.Redirect with Thread.Abort 4.0/5 4.5/5 +0.5 ✅ thread-abort-migration; tools: skill, report_intent, bash ✅ 0.13
thread-abort-migration Thread.Join and Thread.Sleep only — should not migrate 4.0/5 5.0/5 +1.0 ✅ thread-abort-migration; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 1.0/5 4.0/5 ⏰ timeout +3.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 1.0/5 5.0/5 ⏰ timeout +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 1.0/5 4.5/5 +3.5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 1.0/5 3.5/5 +2.5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 1.5/5 5.0/5 +3.5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 1.0/5 4.5/5 ⏰ timeout +3.5 ✅ analyzing-dotnet-performance; tools: skill, write_bash, stop_bash ✅ 0.14
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 4.5/5 5.0/5 +0.5 ✅ optimizing-ef-core-queries; tools: skill 🟡 0.35
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 3.5/5 5.0/5 +1.5 ✅ dotnet-pinvoke; tools: skill ✅ 0.11
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 4.5/5 5.0/5 +0.5 ✅ dotnet-pinvoke; tools: skill, view ✅ 0.11
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 5.0/5 5.0/5 0.0 ✅ dump-collect; tools: skill, report_intent, view, glob, bash ✅ 0.09
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 2.5/5 5.0/5 +2.5 ✅ dump-collect; tools: skill ✅ 0.09
dump-collect Recover crash dump from macOS NativeAOT without createdump 3.5/5 4.0/5 +0.5 ✅ dump-collect; tools: skill, report_intent, view, glob, bash ✅ 0.09
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 3.5/5 4.0/5 +0.5 ✅ dump-collect; tools: skill, bash ✅ 0.09
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 3.5/5 5.0/5 +1.5 ✅ dump-collect; tools: skill ✅ 0.09
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 3.0/5 5.0/5 +2.0 ✅ dump-collect; tools: skill ✅ 0.09
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 2.5/5 3.0/5 +0.5 ✅ dump-collect; tools: skill, bash ✅ 0.09
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 4.0/5 4.5/5 +0.5 ✅ dump-collect; tools: skill, glob ✅ 0.09
dump-collect Decline dump analysis request 2.5/5 4.5/5 +2.0 ℹ️ not activated (expected) ✅ 0.09
build-parallelism Analyze build parallelism bottlenecks 1.0/5 ⏰ timeout 3.0/5 ⏰ timeout +2.0 ✅ build-parallelism; binlog-generation; binlog-failure-analysis; tools: skill, task, glob ✅ 0.16
including-generated-files Diagnose generated file inclusion failure 3.0/5 5.0/5 +2.0 ✅ including-generated-files; tools: skill ✅ 0.17
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 5.0/5 5.0/5 0.0 ✅ msbuild-antipatterns; tools: skill, glob, edit ✅ 0.08
build-perf-baseline Establish build performance baseline and recommend optimizations 4.0/5 5.0/5 +1.0 ✅ build-perf-baseline; tools: skill 🟡 0.36
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 5.0/5 0.0 ✅ msbuild-modernization; tools: skill ✅ 0.06
directory-build-organization Organize build infrastructure for a multi-project repo 3.5/5 5.0/5 +1.5 ✅ directory-build-organization; msbuild-antipatterns; tools: skill ✅ 0.20
check-bin-obj-clash Diagnose bin/obj output path clashes 3.5/5 5.0/5 +1.5 ✅ check-bin-obj-clash; tools: skill, glob, edit ✅ 0.14
incremental-build Analyze incremental build issues 3.0/5 4.0/5 +1.0 ✅ incremental-build; tools: skill ✅ 0.14
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 4.5/5 +0.5 ✅ eval-performance; tools: skill 🟡 0.21
build-perf-diagnostics Analyze analyzer performance impact on builds 5.0/5 5.0/5 0.0 ✅ build-perf-diagnostics; binlog-generation; binlog-failure-analysis; tools: skill 🟡 0.23
binlog-generation Build project with /bl flag 1.0/5 5.0/5 +4.0 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build with /bl in PowerShell 2.5/5 5.0/5 +2.5 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build multiple configurations with unique binlogs 2.0/5 5.0/5 +3.0 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 4.0/5 4.0/5 0.0 ✅ binlog-failure-analysis; tools: skill, load_binlog, edit, view ✅ 0.11

timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

@steveisok
Copy link
Member

@ViktorHofer @JanKrivanek I think this will be immensly helpful for validation. We should prioritize getting this in.

I'm trying to help tweak #122 and validation cycles are loooong. I inquired about how this PR could help and this is what it said:

Yes, this would help enormously. Here's why:

The rejudge command directly addresses our biggest pain point. Right now, each eval run costs ~35 min (parallel) to ~105 min (sequential), and the vast majority of that time is the agent LLM calls — not the judging. When we tweak rubric wording (like adding
"suppressed warnings don't count as resolved"), we have to re-run the entire pipeline to see the effect. With rejudge, we could:

 - Re-score existing sessions with updated rubric text in seconds
 - Try different judge models (e.g., judge with Sonnet instead of Opus) without re-running agents
 - Iterate on rubric fairness without burning hours of Opus compute

Concrete examples from our experience where this would have saved hours:

 1. When we updated the rubric to penalize pragma suppressions — that was a rubric change, not a skill change. We had to run the full pipeline again to see the impact.
 2. The +8.2% run where baseline "won" on final build by cheating with pragmas — with rejudge we could have immediately tested whether clarified rubric language fixed the scoring without another 35-minute run.
 3. Comparing our two 3-run results (+46.9% vs +25.6%) — if we could rejudge both with identical rubric text, we'd know how much of the variance is judge calibration vs actual agent behavior.

The --keep-sessions flag is also valuable for our debugging workflow. We spent significant time figuring out why the agent timed out (analysis paralysis, sed breakage, ResponseError wall). Having events.jsonl preserved would make that forensics much faster.

In short: it turns our iteration loop from "change → wait 35+ min → analyze" into "change rubric → rejudge in seconds → analyze." That's a 100× speedup on the judging iteration cycle.

/cc @agocke @jeffschwMSFT

@lewing
Copy link
Member Author

lewing commented Feb 28, 2026

I've also had several runs die thanks to copilot exiting while they were 20 minutes in with multiple runs completed. My other reason for working on this it to make it easy to resume one of the eval sessions and ask the agent for feedback on why it made certain choices.

Copy link
Member Author

@lewing lewing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some review from a different agent:

Nice work — the schema design is clean and the thread-safety approach is solid. A few suggestions from the dotnet-replay integration perspective:

1. Store config_dir as relative to the DB directory

Currently config_dir is stored as an absolute path (e.g. /home/user/.../results-2026-02-28T.../sessions/<guid>). Since the results directory is self-contained (sessions.db + sessions/<guid>/events.jsonl are siblings), storing it relative to the DB directory (e.g. sessions/<guid>) would make the results dir portable — you can zip it, move it to another machine, and tools like dotnet-replay --db sessions.db just work without broken paths.

The validator only reads config_dir back in RejudgeCommand, which already knows resultsDir, so resolving via Path.Combine(resultsDir, relativeConfigDir) is straightforward. The change would be in RegisterSession — compute the relative path before storing:

var relativeConfigDir = Path.GetRelativePath(Path.GetDirectoryName(dbPath)!, configDir);

2. Add a schema version marker

The Copilot CLI session-store.db has a schema_version table that makes DB type detection trivial for consumers. Without one here, tools must probe column names via PRAGMA table_info(sessions). A small metadata table also future-proofs schema migrations:

CREATE TABLE IF NOT EXISTS schema_info (key TEXT PRIMARY KEY, value TEXT NOT NULL);
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('type', 'skill-validator');
INSERT OR IGNORE INTO schema_info (key, value) VALUES ('version', '1');

3. Consider a display_name column (minor / follow-up)

When browsing sessions in a TUI, Copilot CLI sessions have a summary field for display. For skill-validator sessions, a consumer has to synthesize something like "scenario_name (role) — skill_name". Pre-computing a display_name in the DB makes it self-describing. This could be a follow-up though.

- Use session ID as config dir folder name (sessions/<id>) so the
  DB record links directly to its files on disk
- Store config_dir as relative path for portable results directories
- Add schema_info table with type='skill-validator' and version='1'
  for easy DB detection by external tools like dotnet-replay
- Add GetSchemaInfo() method for consumers
- Extract GetSessions(whereClause) for reuse
- Add tests for schema_info and relative config_dir

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lewing added a commit to lewing/dotnet-replay that referenced this pull request Mar 1, 2026
Detect and load skill-validator databases (from dotnet/skills#149) alongside
existing Copilot CLI session-store.db files. The --db flag now accepts either
database type, auto-detected via schema_info vs schema_version tables.

Changes:
- Add SessionDbType enum and BrowserSession record to replace tuple
- Add DetectDbType() for schema-based DB type detection
- Add LoadSkillValidatorSessions() with config_dir relative path resolution
- Show eval metadata (metrics, judge, pairwise) in preview panel
- Adapt browser icons and display for skill-validator sessions
- Handle sessions without events.jsonl gracefully

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --keep-sessions is active, pass the already-created timestamped
directory to Reporter.ReportResults so sessions.db, session files, and
report outputs (results.json, summary.md) all live in the same folder
instead of creating two separate timestamped directories.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lewing added a commit to lewing/dotnet-replay that referenced this pull request Mar 2, 2026
* Fix test failures from .NET 10 terminal logger stdout noise

- Guard SafeGetString against undefined JsonElement (ValueKind check)
- Add -v q to dotnet run in test helpers to suppress build output
  leaking into stdout, which corrupted JSON parsing in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add skill-validator sessions.db support

Detect and load skill-validator databases (from dotnet/skills#149) alongside
existing Copilot CLI session-store.db files. The --db flag now accepts either
database type, auto-detected via schema_info vs schema_version tables.

Changes:
- Add SessionDbType enum and BrowserSession record to replace tuple
- Add DetectDbType() for schema-based DB type detection
- Add LoadSkillValidatorSessions() with config_dir relative path resolution
- Show eval metadata (metrics, judge, pairwise) in preview panel
- Adapt browser icons and display for skill-validator sessions
- Handle sessions without events.jsonl gracefully

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix cross-platform events.jsonl path resolution

Normalize Windows backslashes in config_dir for Linux/WSL compatibility.
Search nested session-state/<guid>/events.jsonl when direct path not found.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Detect copilot CLI install method for resume

Try standalone 'copilot' first, fall back to 'gh copilot' for users
who installed via the GitHub CLI extension.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants