chore(scripts): add test262_failing_diff.js for per-mode regression diffing by dowdiness · Pull Request #446 · dowdiness/js_engine

dowdiness · 2026-06-24T14:32:34Z

Adds scripts/test262_failing_diff.js — a small node script that diffs two Test262 runner artifacts by per-mode (path, mode) failing SET, emitting:

REGRESSIONS — baseline-pass → candidate-fail (exit 1 if any)
FIXED — baseline-fail → candidate-pass

Set difference, not counts: a run that trades one failure for another nets zero on counts but is surfaced here. Each input is validated against its own summary.failed === count(status === "fail") and the script aborts on mismatch rather than reporting a false "0 regressions" off a broken extractor — the exact failure mode that has bitten ad-hoc per-session python.

Matches the existing node-script convention (analyze_results.js, check_results.js) and the test262_result_contract.schema.json field names (path, status, mode).

Self-tested with a positive control: corrupting summary.failed triggers the abort; a real artifact pair runs clean and reports the correct diff.

Usage:

node scripts/test262_failing_diff.js <baseline.json> <candidate.json>

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a command-line report that compares two Test262 runs and summarizes test outcome changes.
- Highlights regressions, lost coverage, and fixed tests with clear status transitions.
- Validates run summaries before reporting and exits with a failing status when regressions or lost coverage are found.

…iffing Diffs two Test262 runner artifacts by per-mode (path, mode) failing SET, emitting regressions (baseline-pass -> candidate-fail) and fixed (baseline-fail -> candidate-pass). Set difference, not counts, so a run that trades one failure for another nets zero on counts but is surfaced. Validates each input against its own summary (summary.failed === count(status === "fail")) and aborts on mismatch rather than reporting a false "0 regressions" off a broken extractor. Matches the existing node-script convention (analyze_results.js / check_results.js) and the scripts/test262_result_contract.schema.json field names (path, status, mode). Permanently replaces ad-hoc per-session python. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-24T14:32:45Z

Warning

Review limit reached

@dowdiness, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 14 minutes and 47 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b4fe7ae-80db-422b-87e4-f64ae74562fb

📥 Commits

Reviewing files that changed from the base of the PR and between eaa4e61 and 5a1f175.

📒 Files selected for processing (1)

scripts/test262_failing_diff.js

📝 Walkthrough

Walkthrough

Adds scripts/test262_failing_diff.js, a new CLI Node.js script that accepts two Test262 runner artifact paths, normalizes test path keys, validates artifact summary counts, classifies intersecting keys into regressions, lost coverage, and fixed transitions, prints each category, and exits with code 1 if regressions or lost coverage are detected.

Changes

Test262 Artifact Diff Script

Layer / File(s)	Summary
Docs, NONPASS set, and path normalization `scripts/test262_failing_diff.js`	File-level documentation describes the artifact contract and classification semantics. `NONPASS` status set and `normalizePath()` define the canonical `(path, mode)` key used for all comparisons.
`loadStatuses`, `report`, and `main` entrypoint `scripts/test262_failing_diff.js`	`loadStatuses()` builds a normalized status map and validates per-status counts against artifact summary fields, aborting on mismatch. `report()` formats labeled transition rows. `main()` validates CLI args, loads both artifacts, classifies only keys present in both into regressions/lost-coverage/fixed lists, prints all three reports, and exits `0` or `1` accordingly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hop along the test result trail,
Comparing base and candidate — pass or fail?
Lost coverage spotted, regressions caught,
Each outcome transition carefully sought.
Exit zero when all is well and right,
Exit one when regressions come to light! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the new Test262 diffing script and its per-mode regression focus.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/test262-failing-diff-script

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0691971f71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…diff Address two review findings against the contract (docs/decisions/tooling-migration-contracts.md): - Non-pass outcomes: a result is non-passing when status is fail, timeout, OR error (not just fail). Classify per (path,mode) by STATUS, not a fail-set difference: REGRESSION = baseline pass -> candidate non-pass; FIXED = baseline non-pass -> candidate pass. A pass that now times out / errors is now caught; a baseline failure that becomes skip/timeout/error is no longer mislabeled FIXED (it did not newly pass). Validation now checks failed AND timeout AND error counts against the summary. - Path normalization: the stable key is (normalized path, mode). `path` may be repo-relative (test262/test/x.js) or Test262-root-relative (x.js); normalize both spellings before keying so the same test across two runs no longer produces false regressions/fixes. Self-tested with positive controls: pass->timeout flagged as regression, timeout->skip NOT flagged as fixed, repo- vs root-relative spellings collapse to one key, and a corrupted summary.timeout aborts the diff. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee3203c2f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-24T14:46:59Z

+  for (const [key, candStatus] of candidate) {
+    const baseStatus = baseline.get(key);
+    if (baseStatus === undefined) continue;
+    if (baseStatus === "pass" && NONPASS.has(candStatus)) {


Flag pass-to-skip transitions

When a candidate over-broadens skip metadata, a test that was observed and passing in the baseline appears as pass -> skip, but this condition only treats fail/timeout/error as regressions, so the script exits 0 and reports a clean diff. Since skipped tests still reduce Passed/Discovered and can hide lost coverage when both artifacts contain the key, pass-to-skip transitions should be reported as regressions or a separate warning rather than ignored.

Useful? React with 👍 / 👎.

A candidate that over-broadens skip metadata turns a baseline-passing test into a skip. That is not a correctness regression (the engine did not start failing), but it is not clean either: the test stops executing, lowering Passed/Discovered and hiding lost coverage. Previously only fail/timeout/error counted, so pass->skip exited 0 with a clean diff. Report pass->skip transitions in a dedicated LOST COVERAGE section (kept separate from REGRESSIONS so correctness vs coverage signals stay distinct) and include them in the non-zero exit so they cannot pass silently. The skip count is now also validated against summary.skipped. Self-tested: pass->skip flagged as lost coverage (exit 1); timeout->skip is NOT flagged (baseline was not passing); real no-change pair stays exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eaa4e619ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/test262_failing_diff.js`:
- Line 44: The artifact loading in test262_failing_diff is using dynamic require
on a CLI path, which can execute JavaScript instead of only reading data. Update
the logic around the current require(require("path").resolve(path)) usage to
validate that the provided artifact path ends in .json, then read it with
fs.readFileSync and parse it with JSON.parse. Keep the change localized to the
artifact-loading code in scripts/test262_failing_diff.js so the CLI only accepts
JSON inputs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b5f34e3-d68a-456f-bcb7-dbd0e54d4973

📥 Commits

Reviewing files that changed from the base of the PR and between 6f65fd3 and eaa4e61.

📒 Files selected for processing (1)

scripts/test262_failing_diff.js

coderabbitai · 2026-06-24T14:59:22Z

+
+// Build (normalized path \t mode) -> status, validating non-pass counts.
+function loadStatuses(path) {
+  const data = require(require("path").resolve(path));


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify how artifact paths are sourced when this script is invoked. rg -n -C2 --iglob '*.{yml,yaml,sh,js,mjs,cjs,ts,json,md}' 'test262_failing_diff\.js|node\s+scripts/test262_failing_diff\.js' . # Confirm current dynamic require usage in this script. rg -n -C2 'require$require\("path"$\.resolve$path$\)' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 1198

🏁 Script executed:

sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147

🏁 Script executed:

sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147

🏁 Script executed:

sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147

Avoid dynamic require for artifact input.

scripts/test262_failing_diff.js:44 loads a CLI-supplied path via require(require("path").resolve(path)), which can execute a .js module instead of strictly parsing JSON. Use fs.readFileSync() + JSON.parse() and enforce a .json artifact path.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test262_failing_diff.js` at line 44, The artifact loading in test262_failing_diff is using dynamic require on a CLI path, which can execute JavaScript instead of only reading data. Update the logic around the current require(require("path").resolve(path)) usage to validate that the provided artifact path ends in .json, then read it with fs.readFileSync and parse it with JSON.parse. Keep the change localized to the artifact-loading code in scripts/test262_failing_diff.js so the CLI only accepts JSON inputs.

Source: Linters/SAST tools

The regression check relies on baseline pass rows as the only evidence that a candidate non-pass is a real pass->fail change. The validator reconciled fail/timeout/error/skip but not pass, so an artifact that dropped pass rows while keeping summary.passed would validate clean, leave those keys absent from the baseline map, and silently skip pass->fail regressions (exit 0). Add two guards before trusting the maps: - summary.total === results.length (global: catches dropped/duplicated rows) - count(pass) === summary.passed (per-status: catches a mislabel that keeps total) Verified the stricter guards do NOT false-abort on real runner artifacts (total/passed/skipped all reconcile). Positive controls: dropping a pass row aborts on the total guard; dropping it with total decremented aborts on the pass-count guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6781cee48f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ss->skip Lost coverage previously fired only on pass->skip, so a candidate that broadened skip metadata over a previously-executed failing test (fail/timeout/ error -> skip) fell through every case: skip is not in NONPASS so it was not a regression, the baseline was not pass so it was not lost coverage, and the candidate was not pass so it was not a fix. The failing test simply vanished from the diff and the run exited 0 — making hidden failures look clean. Broaden LOST COVERAGE to any executed (pass/fail/timeout/error) -> skip transition. The reported `from` status distinguishes a lost pass (pass->skip) from a hidden failure (fail->skip). Stable skips (skip->skip) are unaffected because the baseline must be an executed status. Self-tested: fail->skip and timeout->skip now flagged as lost coverage (exit 1); pass->skip and pass->timeout still behave as before; real artifact pairs whose skips are skip->skip in both runs stay exit 0 (no false positives). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f299a40b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The total === results.length guard only catches a net row-count change. A duplicate normalized (path,mode) paired with a dropped row keeps results.length and every per-status count intact, so validation passed while Map.set silently overwrote the duplicate — the dropped key then went missing from the baseline map and was skipped as unclassified, letting a real pass->fail exit 0. The "catches duplicated rows" comment was overclaiming. Reject duplicate normalized keys while building the map, making byKey.size === results.length an invariant. Combined with the total check, no row can be silently dropped or overwritten. Self-tested: a plain duplicate aborts; the duplicate+dropped-row scenario whose counts fully reconcile now aborts too (was a false exit 0); real artifact pairs have no duplicates and stay exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread scripts/test262_failing_diff.js Outdated

Comment thread scripts/test262_failing_diff.js Outdated

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread scripts/test262_failing_diff.js Outdated

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread scripts/test262_failing_diff.js Outdated

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread scripts/test262_failing_diff.js Outdated

dowdiness merged commit f82e42b into main Jun 24, 2026
13 checks passed

dowdiness deleted the chore/test262-failing-diff-script branch June 24, 2026 16:30

Conversation

dowdiness commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dowdiness commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading