chore(scripts): add test262_failing_diff.js for per-mode regression diffing#446
Conversation
…iffing Diffs two Test262 runner artifacts by per-mode (path, mode) failing SET, emitting regressions (baseline-pass -> candidate-fail) and fixed (baseline-fail -> candidate-pass). Set difference, not counts, so a run that trades one failure for another nets zero on counts but is surfaced. Validates each input against its own summary (summary.failed === count(status === "fail")) and aborts on mismatch rather than reporting a false "0 regressions" off a broken extractor. Matches the existing node-script convention (analyze_results.js / check_results.js) and the scripts/test262_result_contract.schema.json field names (path, status, mode). Permanently replaces ad-hoc per-session python. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 14 minutes and 47 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds ChangesTest262 Artifact Diff Script
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0691971f71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…diff Address two review findings against the contract (docs/decisions/tooling-migration-contracts.md): - Non-pass outcomes: a result is non-passing when status is fail, timeout, OR error (not just fail). Classify per (path,mode) by STATUS, not a fail-set difference: REGRESSION = baseline pass -> candidate non-pass; FIXED = baseline non-pass -> candidate pass. A pass that now times out / errors is now caught; a baseline failure that becomes skip/timeout/error is no longer mislabeled FIXED (it did not newly pass). Validation now checks failed AND timeout AND error counts against the summary. - Path normalization: the stable key is (normalized path, mode). `path` may be repo-relative (test262/test/x.js) or Test262-root-relative (x.js); normalize both spellings before keying so the same test across two runs no longer produces false regressions/fixes. Self-tested with positive controls: pass->timeout flagged as regression, timeout->skip NOT flagged as fixed, repo- vs root-relative spellings collapse to one key, and a corrupted summary.timeout aborts the diff. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee3203c2f3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (const [key, candStatus] of candidate) { | ||
| const baseStatus = baseline.get(key); | ||
| if (baseStatus === undefined) continue; | ||
| if (baseStatus === "pass" && NONPASS.has(candStatus)) { |
There was a problem hiding this comment.
When a candidate over-broadens skip metadata, a test that was observed and passing in the baseline appears as pass -> skip, but this condition only treats fail/timeout/error as regressions, so the script exits 0 and reports a clean diff. Since skipped tests still reduce Passed/Discovered and can hide lost coverage when both artifacts contain the key, pass-to-skip transitions should be reported as regressions or a separate warning rather than ignored.
Useful? React with 👍 / 👎.
A candidate that over-broadens skip metadata turns a baseline-passing test into a skip. That is not a correctness regression (the engine did not start failing), but it is not clean either: the test stops executing, lowering Passed/Discovered and hiding lost coverage. Previously only fail/timeout/error counted, so pass->skip exited 0 with a clean diff. Report pass->skip transitions in a dedicated LOST COVERAGE section (kept separate from REGRESSIONS so correctness vs coverage signals stay distinct) and include them in the non-zero exit so they cannot pass silently. The skip count is now also validated against summary.skipped. Self-tested: pass->skip flagged as lost coverage (exit 1); timeout->skip is NOT flagged (baseline was not passing); real no-change pair stays exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eaa4e619ef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/test262_failing_diff.js`:
- Line 44: The artifact loading in test262_failing_diff is using dynamic require
on a CLI path, which can execute JavaScript instead of only reading data. Update
the logic around the current require(require("path").resolve(path)) usage to
validate that the provided artifact path ends in .json, then read it with
fs.readFileSync and parse it with JSON.parse. Keep the change localized to the
artifact-loading code in scripts/test262_failing_diff.js so the CLI only accepts
JSON inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7b5f34e3-d68a-456f-bcb7-dbd0e54d4973
📒 Files selected for processing (1)
scripts/test262_failing_diff.js
|
|
||
| // Build (normalized path \t mode) -> status, validating non-pass counts. | ||
| function loadStatuses(path) { | ||
| const data = require(require("path").resolve(path)); |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify how artifact paths are sourced when this script is invoked.
rg -n -C2 --iglob '*.{yml,yaml,sh,js,mjs,cjs,ts,json,md}' 'test262_failing_diff\.js|node\s+scripts/test262_failing_diff\.js' .
# Confirm current dynamic require usage in this script.
rg -n -C2 'require\(require\("path"\)\.resolve\(path\)\)' scripts/test262_failing_diff.jsRepository: dowdiness/js_engine
Length of output: 1198
🏁 Script executed:
sed -n '1,140p' scripts/test262_failing_diff.jsRepository: dowdiness/js_engine
Length of output: 5147
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,140p' scripts/test262_failing_diff.jsRepository: dowdiness/js_engine
Length of output: 5147
🏁 Script executed:
sed -n '1,140p' scripts/test262_failing_diff.jsRepository: dowdiness/js_engine
Length of output: 5147
🏁 Script executed:
sed -n '1,140p' scripts/test262_failing_diff.jsRepository: dowdiness/js_engine
Length of output: 5147
Avoid dynamic require for artifact input.
scripts/test262_failing_diff.js:44 loads a CLI-supplied path via require(require("path").resolve(path)), which can execute a .js module instead of strictly parsing JSON. Use fs.readFileSync() + JSON.parse() and enforce a .json artifact path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/test262_failing_diff.js` at line 44, The artifact loading in
test262_failing_diff is using dynamic require on a CLI path, which can execute
JavaScript instead of only reading data. Update the logic around the current
require(require("path").resolve(path)) usage to validate that the provided
artifact path ends in .json, then read it with fs.readFileSync and parse it with
JSON.parse. Keep the change localized to the artifact-loading code in
scripts/test262_failing_diff.js so the CLI only accepts JSON inputs.
Source: Linters/SAST tools
The regression check relies on baseline pass rows as the only evidence that a candidate non-pass is a real pass->fail change. The validator reconciled fail/timeout/error/skip but not pass, so an artifact that dropped pass rows while keeping summary.passed would validate clean, leave those keys absent from the baseline map, and silently skip pass->fail regressions (exit 0). Add two guards before trusting the maps: - summary.total === results.length (global: catches dropped/duplicated rows) - count(pass) === summary.passed (per-status: catches a mislabel that keeps total) Verified the stricter guards do NOT false-abort on real runner artifacts (total/passed/skipped all reconcile). Positive controls: dropping a pass row aborts on the total guard; dropping it with total decremented aborts on the pass-count guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6781cee48f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ss->skip Lost coverage previously fired only on pass->skip, so a candidate that broadened skip metadata over a previously-executed failing test (fail/timeout/ error -> skip) fell through every case: skip is not in NONPASS so it was not a regression, the baseline was not pass so it was not lost coverage, and the candidate was not pass so it was not a fix. The failing test simply vanished from the diff and the run exited 0 — making hidden failures look clean. Broaden LOST COVERAGE to any executed (pass/fail/timeout/error) -> skip transition. The reported `from` status distinguishes a lost pass (pass->skip) from a hidden failure (fail->skip). Stable skips (skip->skip) are unaffected because the baseline must be an executed status. Self-tested: fail->skip and timeout->skip now flagged as lost coverage (exit 1); pass->skip and pass->timeout still behave as before; real artifact pairs whose skips are skip->skip in both runs stay exit 0 (no false positives). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f299a40b6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The total === results.length guard only catches a net row-count change. A duplicate normalized (path,mode) paired with a dropped row keeps results.length and every per-status count intact, so validation passed while Map.set silently overwrote the duplicate — the dropped key then went missing from the baseline map and was skipped as unclassified, letting a real pass->fail exit 0. The "catches duplicated rows" comment was overclaiming. Reject duplicate normalized keys while building the map, making byKey.size === results.length an invariant. Combined with the total check, no row can be silently dropped or overwritten. Self-tested: a plain duplicate aborts; the duplicate+dropped-row scenario whose counts fully reconcile now aborts too (was a false exit 0); real artifact pairs have no duplicates and stay exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds
scripts/test262_failing_diff.js— a small node script that diffs two Test262 runner artifacts by per-mode(path, mode)failing SET, emitting:Set difference, not counts: a run that trades one failure for another nets zero on counts but is surfaced here. Each input is validated against its own
summary.failed === count(status === "fail")and the script aborts on mismatch rather than reporting a false "0 regressions" off a broken extractor — the exact failure mode that has bitten ad-hoc per-session python.Matches the existing node-script convention (
analyze_results.js,check_results.js) and thetest262_result_contract.schema.jsonfield names (path,status,mode).Self-tested with a positive control: corrupting
summary.failedtriggers the abort; a real artifact pair runs clean and reports the correct diff.Usage:
🤖 Generated with Claude Code
Summary by CodeRabbit