Copilot reviewer - drop overfitted learnings, turn into generalized rules by T-Gro · Pull Request #19451 · dotnet/fsharp

T-Gro · 2026-03-18T15:17:00Z

(previous version was overfitted due to smaller training dataset)

github-actions · 2026-03-18T15:17:48Z

✅ No release notes required

…y extraction Generated from analysis of 695 review comments (2,102 effective weighted) across 4 expert reviewers spanning 10 years of dotnet/fsharp and fsharp/fslang-design review history. Artifacts: - .github/agents/expert-reviewer.md: 15-dimension review agent with 5-wave workflow and folder hotspot mapping - .github/skills/expert-review/SKILL.md: Slim invocation pointer with dimension selection guide and self-review checklist - .github/instructions/ExpertReview.instructions.md: Top 10 compiler review rules scoped to src/Compiler/** - .github/agents/extraction-pipeline.md: Reusable pipeline agent for future reviewer expertise extraction Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Skill name: gerund form (reviewing-compiler-prs) - Description: third person, keyword-rich, specific triggers - Removed redundant prose (Claude already knows) - Added cross-references to complementary skills (review-council, hypothesis-driven-debugging, ilverify-failure, vsintegration-ide-debugging) - Slim pointer pattern: skill is overview, agent is source of truth Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The review-council skill (generic multi-model review) and expert-review skill (F#-specific dimensions) had 0.55 trigger overlap. Merged into a single reviewing-compiler-prs skill that combines: - F#-specific 15 dimensions with file-based selection - Multi-model 3-phase methodology (briefing, parallel assessment, consolidation with dedup/false-positive filtering) - Self-review checklist Deleted .github/skills/review-council/ — its methodology is now embedded in the merged skill's Multi-Model Review Methodology section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Description: natural third-person prose, no keyword spam list - Removed redundant opening paragraph (restated description) - Trimmed Multi-Model section: removed briefing pack steps Claude already knows; kept only unique content (assessment gates, consolidation rules) - Removed redundant 'Before requesting...' preamble - 75 -> 54 lines (28% reduction, no content loss) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Agent (expert-reviewer.md): - Dim 1: test layer selection guide (Typecheck/SyntaxTree/EmittedIL/ compileAndRun/Service/Core), happy/negative/interaction paths, assertion quality - Dim 12: pipelines over nesting, active patterns for complex conditions, nesting depth limit (2 levels), wide over deep - Dim 13: ad-hoc if/then band-aid detection, higher-order pattern extraction, search for existing utilities before reimplementing, respect intentional deviations in isolated projects Skill (SKILL.md): - Claims coverage check (orphan claims/changes/partial impls) - 'Correct convention' false positive filter - 'Unexplained != wrong' — doc gap not defect - 'Agents love nitpicks' — deprioritize - Minimum viable council = 2 models Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… from instructions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…s 1,155 PRs Expert reviewer agent expanded from 15 to 20 dimensions based on deep analysis of 8,097 review comments (17x previous run's 469). New dimensions: - FSharp.Core Stability (84 blocking entries) - Optimization Correctness (tail-call, inlining philosophy) - Struct Type Awareness (value semantics, byref handling) - Overload Resolution Correctness (SRTP, method hiding) - Incremental Checking Correctness (stale results, invalidation) Added 14 explicit anti-patterns (reformatting+logic, catch-all handlers, raw TType matching, Unchecked.defaultof abuse, etc.) Skill updated with 13-row dimension selection table and 15-item self-review checklist ordered by blocking frequency. Instructions expanded to 7 sections covering backward compatibility, concurrency, and performance rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ferences, test running - Remove Dimension 19 (Parallel Compilation) - single-feature overfitting Determinism rules kept in overarching principles where they belong - Remove all 'reactor thread' references (0 matches in src/, obsolete concept) - Change 'Run tests in Release' to 'Ensure tests cover Debug/Release differences' CI runs tests, not developers - focus on having tests, not running them - 20 dimensions -> 19 dimensions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Audit all 19 dimensions for rules that only make sense for the specific historical PR they were derived from. Rules that reference specific functions, specific files, or specific implementation patterns that a future reviewer wouldn't be able to apply were either: - Generalized (e.g., 'Use AppTy active pattern' kept, but 'Cross-reference GenFieldInit in IlxGen.fs when modifying infos.fs' removed) - Removed when too narrow (e.g., 'Cache colorization data by document ID') 581 -> 508 lines. CHECK items reduced from verbose per-dimension to focused generalizable rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Agent: Replace 15-row folder hotspot table with pointer to skill (skill already has the dimension selection routing table) - Skill: Trim self-review checklist from 15 to 7 items + pointer to agent Remove 'ordered by blocking frequency' provenance leak - Instructions: Remove Test Discipline section (copilot-instructions.md already covers testing), remove Code Generation section (overfitted BindUnitVars/Expr.Lambda rules from single PRs) 45 -> 32 lines Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e rules Confusion audit found 68% A-grade (clear), 29% B (needs context), 3% C (overfitted). Applied fixes: - Add 'why' rationale to 15 items (stripTyEqns, InlineIfLambda, FCS dependency minimization, queue stress, method hiding) - Resolve Dim 16 contradiction: catch-all OK at API boundaries (log, don't silence), forbidden inside compiler - Generalize 3 C-grade items: Expr.Lambda→expression nodes, tuple-match →expression restructuring, anon records→any metadata emission - Rewrite 3 factoid items as actionable rules (CE semantics, performance comparison, type provider framework compat) - Expand jargon: FCS, stream B tag-byte, closure capturing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Key improvements to the pipeline methodology: Data collection: - MUST collect ALL activity, not sample (sampling produced 17x less data) - Include closed/rejected PRs (strongest review opinions) - Include own PRs (PR descriptions = design intent) - Search both commenter: AND author: queries - Batch size 15 items per agent (not 200 — agents fail on large batches) Anti-overfitting: - Normalize by PR count not comment count (1 PR = 1 vote) - Classify into generalizable principles, not PR-specific advice - Cross-check rules against CI config — don't reproduce what CI enforces - Validate technical terms against current codebase (grep for 0 matches) Quality gates: - Add Phase 5.6: Confusion audit (grade A/B/C/D on every CHECK item) - Target >=80% grade A, 0% grade C/D - Re-dispatch failed multi-batch agents (validate file sizes >500 bytes) 7 new lessons learned (#8-14) from this session's experience. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

A fresh-context agent won't read an appendix. All 14 lessons are now direct instructions in the phase where they apply: - Scale Warning: batch size 15, validate output >500 bytes, re-dispatch - Phase 1.1: four searches (commenter+author), all PR states, paginate - Phase 1.2: PR descriptions as first-class data, 15 items per agent - Phase 1.4: grep technical terms against codebase, drop obsolete - Phase 2.2: normalize by PR not comment, generalize don't transcribe, distinguish CI enforcement from review rules - Wave 1: don't explain away issues, no hypotheticals, verify against PR branch not main - Phase 3.1: agent is single source of truth, inline critical content - Phase 3.4: skill points to agent, description is discovery key - Phase 5.6: confusion audit catches overfitted CHECK items 431 -> 404 lines. Appendix removed entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ample - Test Coverage: 9 -> 5 items (merged test layer list, removed CI-like cross-TFM and Debug/Release rules) - FSharp.Core: 7 -> 6 items (merged XML doc + RFC into one bullet) - Extraction pipeline: remove 'reactor thread' from obsolete-term example - All dimensions now 3-6 CHECK items (sub-agent sized) Checklist verification: [x] reactor: zero mentions across all files [x] all technical terms verified against src/ (zero-match = removed) [x] no single-PR overfitting in CHECK items [x] parallel compilation generalized into Determinism principle [x] overfitting prevention inlined into extraction pipeline [x] all dimensions 3-6 items (sub-agent appropriate) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

New dedicated step where a fresh-context sub-agent reads each dimension's CHECK items against the actual codebase and CI config to: - grep every technical term (zero matches = obsolete, remove) - cross-check rules against CI (CI already does it = not a review rule) - verify generalizability (applies to future unknown PRs?) - verify rationale clarity (why does violating this rule break things?) Grades each item A/B/C/D with targets >=80% A, 0% C/D. Previous overfitting check moved to 5.7 as a final pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Critical fixes: 1. PR-normalization now mechanically enforced: new §2.2b creates pr_rule_votes table (DISTINCT activity_id per rule) before synthesis. Synthesis agent explicitly blocked from raw comment data. 2. Anti-overfitting rules now apply to artifact generation (§3.1), not just classification. Routing table placement clarified. 3. Sub-agent error handling: validate completeness (not just file size), retry up to 3 times, log and continue on persistent failure. 4. Feedback loop: §5.6 findings >20% C/D trigger re-run of Phase 2-3, not just artifact patching. Medium fixes: - CI analysis moved to §2.1 (input for all classifiers) - Bootstrap sample now stratified (by year, area, comment type) - Phase 4 renamed to clarify it's a spec, not a pipeline step - '~15 items' disambiguated to '~15 PRs' everywhere - Per-PR pagination explicitly required in §1.2 - Model specification made flexible (best available, not hardcoded) - Feature area table schema defined Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restructure pipeline for clarity without changing behavior: - Add ASCII flow diagram showing Phase 1→2→3→5 data flow with named intermediate artifacts at each stage - Add I/O block to every phase/step (45 annotations total): Sub-agents, Input, Output, Resume, Validation, Context constraints - Each step now explicitly states what tables/files it reads and writes - Resume guidance: skip phases whose output tables already exist - Context management: synthesis agent explicitly blocked from raw data - Phase 4 clearly marked as embedded spec, not pipeline step 439 -> 518 lines (80 lines of I/O metadata, 0 behavioral changes) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Remove all F#/compiler-specific examples (stripTyEqns, LanguageFeature, BindUnitVars, dotnet/fsharp, azure-pipelines) - Replace with generic equivalents that work for any repo/language - Consolidate Scale section: remove sub-agent list that duplicated per-phase batch sizes, organize into Context/Model/Reliability - CI config references now generic (GitHub Actions, Azure Pipelines, Jenkins, etc.) 518 -> 512 lines. Zero tech-specific references remaining. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

T-Gro requested a review from a team as a code owner March 18, 2026 15:17

github-project-automation bot added this to F# Compiler and Tooling Mar 18, 2026

github-project-automation bot moved this to New in F# Compiler and Tooling Mar 18, 2026

T-Gro and others added 18 commits March 18, 2026 22:53

Remove overfitted parallel compilation section and Debug/Release rule…

bd6a062

… from instructions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

T-Gro force-pushed the feat/expert-review branch from c026072 to c25cdd3 Compare March 18, 2026 21:55

T-Gro merged commit ff81858 into main Mar 18, 2026
4 checks passed

github-project-automation bot moved this from New to Done in F# Compiler and Tooling Mar 18, 2026

T-Gro deleted the feat/expert-review branch March 18, 2026 22:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copilot reviewer - drop overfitted learnings, turn into generalized rules#19451

Copilot reviewer - drop overfitted learnings, turn into generalized rules#19451
T-Gro merged 18 commits intomainfrom
feat/expert-review

T-Gro commented Mar 18, 2026

Uh oh!

github-actions bot commented Mar 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

T-Gro commented Mar 18, 2026

Uh oh!

github-actions bot commented Mar 18, 2026

✅ No release notes required

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant