Skip to content

Copilot reviewer - drop overfitted learnings, turn into generalized rules#19451

Merged
T-Gro merged 18 commits intomainfrom
feat/expert-review
Mar 18, 2026
Merged

Copilot reviewer - drop overfitted learnings, turn into generalized rules#19451
T-Gro merged 18 commits intomainfrom
feat/expert-review

Conversation

@T-Gro
Copy link
Member

@T-Gro T-Gro commented Mar 18, 2026

(previous version was overfitted due to smaller training dataset)

@github-actions
Copy link
Contributor

✅ No release notes required

T-Gro and others added 18 commits March 18, 2026 22:53
…y extraction

Generated from analysis of 695 review comments (2,102 effective weighted)
across 4 expert reviewers spanning 10 years of dotnet/fsharp and
fsharp/fslang-design review history.

Artifacts:
- .github/agents/expert-reviewer.md: 15-dimension review agent with
  5-wave workflow and folder hotspot mapping
- .github/skills/expert-review/SKILL.md: Slim invocation pointer with
  dimension selection guide and self-review checklist
- .github/instructions/ExpertReview.instructions.md: Top 10 compiler
  review rules scoped to src/Compiler/**
- .github/agents/extraction-pipeline.md: Reusable pipeline agent for
  future reviewer expertise extraction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Skill name: gerund form (reviewing-compiler-prs)
- Description: third person, keyword-rich, specific triggers
- Removed redundant prose (Claude already knows)
- Added cross-references to complementary skills (review-council,
  hypothesis-driven-debugging, ilverify-failure, vsintegration-ide-debugging)
- Slim pointer pattern: skill is overview, agent is source of truth

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The review-council skill (generic multi-model review) and expert-review
skill (F#-specific dimensions) had 0.55 trigger overlap. Merged into
a single reviewing-compiler-prs skill that combines:
- F#-specific 15 dimensions with file-based selection
- Multi-model 3-phase methodology (briefing, parallel assessment,
  consolidation with dedup/false-positive filtering)
- Self-review checklist

Deleted .github/skills/review-council/ — its methodology is now
embedded in the merged skill's Multi-Model Review Methodology section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Description: natural third-person prose, no keyword spam list
- Removed redundant opening paragraph (restated description)
- Trimmed Multi-Model section: removed briefing pack steps Claude
  already knows; kept only unique content (assessment gates,
  consolidation rules)
- Removed redundant 'Before requesting...' preamble
- 75 -> 54 lines (28% reduction, no content loss)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent (expert-reviewer.md):
- Dim 1: test layer selection guide (Typecheck/SyntaxTree/EmittedIL/
  compileAndRun/Service/Core), happy/negative/interaction paths,
  assertion quality
- Dim 12: pipelines over nesting, active patterns for complex
  conditions, nesting depth limit (2 levels), wide over deep
- Dim 13: ad-hoc if/then band-aid detection, higher-order pattern
  extraction, search for existing utilities before reimplementing,
  respect intentional deviations in isolated projects

Skill (SKILL.md):
- Claims coverage check (orphan claims/changes/partial impls)
- 'Correct convention' false positive filter
- 'Unexplained != wrong' — doc gap not defect
- 'Agents love nitpicks' — deprioritize
- Minimum viable council = 2 models

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… from instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s 1,155 PRs

Expert reviewer agent expanded from 15 to 20 dimensions based on deep
analysis of 8,097 review comments (17x previous run's 469). New dimensions:
- FSharp.Core Stability (84 blocking entries)
- Optimization Correctness (tail-call, inlining philosophy)
- Struct Type Awareness (value semantics, byref handling)
- Overload Resolution Correctness (SRTP, method hiding)
- Incremental Checking Correctness (stale results, invalidation)

Added 14 explicit anti-patterns (reformatting+logic, catch-all handlers,
raw TType matching, Unchecked.defaultof abuse, etc.)

Skill updated with 13-row dimension selection table and 15-item
self-review checklist ordered by blocking frequency.

Instructions expanded to 7 sections covering backward compatibility,
concurrency, and performance rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ferences, test running

- Remove Dimension 19 (Parallel Compilation) - single-feature overfitting
  Determinism rules kept in overarching principles where they belong
- Remove all 'reactor thread' references (0 matches in src/, obsolete concept)
- Change 'Run tests in Release' to 'Ensure tests cover Debug/Release differences'
  CI runs tests, not developers - focus on having tests, not running them
- 20 dimensions -> 19 dimensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Audit all 19 dimensions for rules that only make sense for the specific
historical PR they were derived from. Rules that reference specific
functions, specific files, or specific implementation patterns that
a future reviewer wouldn't be able to apply were either:
- Generalized (e.g., 'Use AppTy active pattern' kept, but
  'Cross-reference GenFieldInit in IlxGen.fs when modifying infos.fs' removed)
- Removed when too narrow (e.g., 'Cache colorization data by document ID')

581 -> 508 lines. CHECK items reduced from verbose per-dimension to
focused generalizable rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Agent: Replace 15-row folder hotspot table with pointer to skill
  (skill already has the dimension selection routing table)
- Skill: Trim self-review checklist from 15 to 7 items + pointer to agent
  Remove 'ordered by blocking frequency' provenance leak
- Instructions: Remove Test Discipline section (copilot-instructions.md
  already covers testing), remove Code Generation section (overfitted
  BindUnitVars/Expr.Lambda rules from single PRs)
  45 -> 32 lines

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e rules

Confusion audit found 68% A-grade (clear), 29% B (needs context), 3% C
(overfitted). Applied fixes:

- Add 'why' rationale to 15 items (stripTyEqns, InlineIfLambda, FCS
  dependency minimization, queue stress, method hiding)
- Resolve Dim 16 contradiction: catch-all OK at API boundaries (log,
  don't silence), forbidden inside compiler
- Generalize 3 C-grade items: Expr.Lambda→expression nodes, tuple-match
  →expression restructuring, anon records→any metadata emission
- Rewrite 3 factoid items as actionable rules (CE semantics, performance
  comparison, type provider framework compat)
- Expand jargon: FCS, stream B tag-byte, closure capturing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Key improvements to the pipeline methodology:

Data collection:
- MUST collect ALL activity, not sample (sampling produced 17x less data)
- Include closed/rejected PRs (strongest review opinions)
- Include own PRs (PR descriptions = design intent)
- Search both commenter: AND author: queries
- Batch size 15 items per agent (not 200 — agents fail on large batches)

Anti-overfitting:
- Normalize by PR count not comment count (1 PR = 1 vote)
- Classify into generalizable principles, not PR-specific advice
- Cross-check rules against CI config — don't reproduce what CI enforces
- Validate technical terms against current codebase (grep for 0 matches)

Quality gates:
- Add Phase 5.6: Confusion audit (grade A/B/C/D on every CHECK item)
- Target >=80% grade A, 0% grade C/D
- Re-dispatch failed multi-batch agents (validate file sizes >500 bytes)

7 new lessons learned (#8-14) from this session's experience.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A fresh-context agent won't read an appendix. All 14 lessons are now
direct instructions in the phase where they apply:

- Scale Warning: batch size 15, validate output >500 bytes, re-dispatch
- Phase 1.1: four searches (commenter+author), all PR states, paginate
- Phase 1.2: PR descriptions as first-class data, 15 items per agent
- Phase 1.4: grep technical terms against codebase, drop obsolete
- Phase 2.2: normalize by PR not comment, generalize don't transcribe,
  distinguish CI enforcement from review rules
- Wave 1: don't explain away issues, no hypotheticals, verify against
  PR branch not main
- Phase 3.1: agent is single source of truth, inline critical content
- Phase 3.4: skill points to agent, description is discovery key
- Phase 5.6: confusion audit catches overfitted CHECK items

431 -> 404 lines. Appendix removed entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ample

- Test Coverage: 9 -> 5 items (merged test layer list, removed CI-like
  cross-TFM and Debug/Release rules)
- FSharp.Core: 7 -> 6 items (merged XML doc + RFC into one bullet)
- Extraction pipeline: remove 'reactor thread' from obsolete-term example
- All dimensions now 3-6 CHECK items (sub-agent sized)

Checklist verification:
[x] reactor: zero mentions across all files
[x] all technical terms verified against src/ (zero-match = removed)
[x] no single-PR overfitting in CHECK items
[x] parallel compilation generalized into Determinism principle
[x] overfitting prevention inlined into extraction pipeline
[x] all dimensions 3-6 items (sub-agent appropriate)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New dedicated step where a fresh-context sub-agent reads each dimension's
CHECK items against the actual codebase and CI config to:
- grep every technical term (zero matches = obsolete, remove)
- cross-check rules against CI (CI already does it = not a review rule)
- verify generalizability (applies to future unknown PRs?)
- verify rationale clarity (why does violating this rule break things?)

Grades each item A/B/C/D with targets >=80% A, 0% C/D.
Previous overfitting check moved to 5.7 as a final pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
1. PR-normalization now mechanically enforced: new §2.2b creates
   pr_rule_votes table (DISTINCT activity_id per rule) before synthesis.
   Synthesis agent explicitly blocked from raw comment data.
2. Anti-overfitting rules now apply to artifact generation (§3.1),
   not just classification. Routing table placement clarified.
3. Sub-agent error handling: validate completeness (not just file size),
   retry up to 3 times, log and continue on persistent failure.
4. Feedback loop: §5.6 findings >20% C/D trigger re-run of Phase 2-3,
   not just artifact patching.

Medium fixes:
- CI analysis moved to §2.1 (input for all classifiers)
- Bootstrap sample now stratified (by year, area, comment type)
- Phase 4 renamed to clarify it's a spec, not a pipeline step
- '~15 items' disambiguated to '~15 PRs' everywhere
- Per-PR pagination explicitly required in §1.2
- Model specification made flexible (best available, not hardcoded)
- Feature area table schema defined

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restructure pipeline for clarity without changing behavior:

- Add ASCII flow diagram showing Phase 1→2→3→5 data flow with
  named intermediate artifacts at each stage
- Add I/O block to every phase/step (45 annotations total):
  Sub-agents, Input, Output, Resume, Validation, Context constraints
- Each step now explicitly states what tables/files it reads and writes
- Resume guidance: skip phases whose output tables already exist
- Context management: synthesis agent explicitly blocked from raw data
- Phase 4 clearly marked as embedded spec, not pipeline step

439 -> 518 lines (80 lines of I/O metadata, 0 behavioral changes)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove all F#/compiler-specific examples (stripTyEqns, LanguageFeature,
  BindUnitVars, dotnet/fsharp, azure-pipelines)
- Replace with generic equivalents that work for any repo/language
- Consolidate Scale section: remove sub-agent list that duplicated
  per-phase batch sizes, organize into Context/Model/Reliability
- CI config references now generic (GitHub Actions, Azure Pipelines,
  Jenkins, etc.)

518 -> 512 lines. Zero tech-specific references remaining.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@T-Gro T-Gro force-pushed the feat/expert-review branch from c026072 to c25cdd3 Compare March 18, 2026 21:55
@T-Gro T-Gro merged commit ff81858 into main Mar 18, 2026
4 checks passed
@T-Gro T-Gro deleted the feat/expert-review branch March 18, 2026 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

1 participant