chore(scripts): add test262_failing_diff.js for per-mode regression diffing #446

Merged
dowdiness merged 6 commits into main from chore/test262-failing-diff-script
Jun 24, 2026

@dowdiness

@dowdiness dowdiness commented Jun 24, 2026

Owner

Adds scripts/test262_failing_diff.js — a small node script that diffs two Test262 runner artifacts by per-mode (path, mode) failing SET, emitting:

  • REGRESSIONS — baseline-pass → candidate-fail (exit 1 if any)
  • FIXED — baseline-fail → candidate-pass

Set difference, not counts: a run that trades one failure for another nets zero on counts but is surfaced here. Each input is validated against its own summary (summary.failed === count(status === "fail")), and the script aborts on mismatch rather than reporting a false "0 regressions" from a broken extractor, which is the exact failure mode that has bitten ad-hoc per-session Python scripts.

Matches the existing node-script convention (analyze_results.js, check_results.js) and the test262_result_contract.schema.json field names (path, status, mode).

Self-tested with a positive control: corrupting summary.failed triggers the abort; a real artifact pair runs clean and reports the correct diff.

Usage:

node scripts/test262_failing_diff.js <baseline.json> <candidate.json>
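
The set-difference idea described above might look roughly like the sketch below. It is an illustration rather than the script's actual code: the helper names `failingSet` and `diffFailing` are hypothetical, and the artifact shape (a `results` array of `{ path, mode, status }` rows plus `summary.failed`) is assumed from the field names mentioned in the description.

```javascript
// Hypothetical sketch: collect failing (path, mode) keys for one artifact and
// abort if the count disagrees with the artifact's own summary.failed.
function failingSet(artifact) {
  const failing = new Set();
  let failCount = 0;
  for (const r of artifact.results) {
    if (r.status !== "fail") continue;
    failCount += 1;
    failing.add(`${r.path}\t${r.mode}`);
  }
  if (failCount !== artifact.summary.failed) {
    throw new Error(
      `summary.failed is ${artifact.summary.failed} but ${failCount} fail rows were counted`
    );
  }
  return failing;
}

// Set difference in both directions, not a comparison of counts.
function diffFailing(baseline, candidate) {
  const base = failingSet(baseline);
  const cand = failingSet(candidate);
  return {
    regressions: [...cand].filter((k) => !base.has(k)).sort(),
    fixed: [...base].filter((k) => !cand.has(k)).sort(),
  };
}

// Toy artifacts (invented data): one failure is traded for another,
// so both runs report failed = 1 and a count comparison would see no change.
const baseline = {
  summary: { failed: 1 },
  results: [
    { path: "a.js", mode: "strict", status: "fail" },
    { path: "b.js", mode: "strict", status: "pass" },
  ],
};
const candidate = {
  summary: { failed: 1 },
  results: [
    { path: "a.js", mode: "strict", status: "pass" },
    { path: "b.js", mode: "strict", status: "fail" },
  ],
};

const diff = diffFailing(baseline, candidate);
console.log(diff.regressions.length, diff.fixed.length); // prints "1 1"
```

Keying on the pair rather than the path alone means a test that fails only in one mode is tracked separately from the same file in the other mode.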

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a command-line report that compares two Test262 runs and summarizes test outcome changes.
    • Highlights regressions, lost coverage, and fixed tests with clear status transitions.
    • Validates run summaries before reporting and exits with a failing status when regressions or lost coverage are found.

…iffing

Diffs two Test262 runner artifacts by per-mode (path, mode) failing SET,
emitting regressions (baseline-pass -> candidate-fail) and fixed
(baseline-fail -> candidate-pass). Set difference, not counts, so a run
that trades one failure for another nets zero on counts but is surfaced.

Validates each input against its own summary (summary.failed ===
count(status === "fail")) and aborts on mismatch rather than reporting a
false "0 regressions" off a broken extractor. Matches the existing
node-script convention (analyze_results.js / check_results.js) and the
scripts/test262_result_contract.schema.json field names (path, status,
mode). Permanently replaces ad-hoc per-session python.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b4fe7ae-80db-422b-87e4-f64ae74562fb

📥 Commits

Reviewing files that changed from the base of the PR and between eaa4e61 and 5a1f175.

📒 Files selected for processing (1)
  • scripts/test262_failing_diff.js
📝 Walkthrough

Walkthrough

Adds scripts/test262_failing_diff.js, a new CLI Node.js script that accepts two Test262 runner artifact paths, normalizes test path keys, validates artifact summary counts, classifies intersecting keys into regressions, lost coverage, and fixed transitions, prints each category, and exits with code 1 if regressions or lost coverage are detected.

Changes

Test262 Artifact Diff Script

| Layer / File(s) | Summary |
| --- | --- |
| Docs, NONPASS set, and path normalization (scripts/test262_failing_diff.js) | File-level documentation describes the artifact contract and classification semantics. NONPASS status set and normalizePath() define the canonical (path, mode) key used for all comparisons. |
| loadStatuses, report, and main entrypoint (scripts/test262_failing_diff.js) | loadStatuses() builds a normalized status map and validates per-status counts against artifact summary fields, aborting on mismatch. report() formats labeled transition rows. main() validates CLI args, loads both artifacts, classifies only keys present in both into regressions/lost-coverage/fixed lists, prints all three reports, and exits 0 or 1 accordingly. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hop along the test result trail,
Comparing base and candidate — pass or fail?
Lost coverage spotted, regressions caught,
Each outcome transition carefully sought.
Exit zero when all is well and right,
Exit one when regressions come to light! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the new Test262 diffing script and its per-mode regression focus. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0691971f71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/test262_failing_diff.js Outdated
Comment thread scripts/test262_failing_diff.js Outdated
…diff

Address two review findings against the contract
(docs/decisions/tooling-migration-contracts.md):

- Non-pass outcomes: a result is non-passing when status is fail, timeout,
  OR error (not just fail). Classify per (path,mode) by STATUS, not a
  fail-set difference: REGRESSION = baseline pass -> candidate non-pass;
  FIXED = baseline non-pass -> candidate pass. A pass that now times out /
  errors is now caught; a baseline failure that becomes skip/timeout/error
  is no longer mislabeled FIXED (it did not newly pass). Validation now
  checks failed AND timeout AND error counts against the summary.

- Path normalization: the stable key is (normalized path, mode). `path`
  may be repo-relative (test262/test/x.js) or Test262-root-relative
  (x.js); normalize both spellings before keying so the same test across
  two runs no longer produces false regressions/fixes.

Self-tested with positive controls: pass->timeout flagged as regression,
timeout->skip NOT flagged as fixed, repo- vs root-relative spellings
collapse to one key, and a corrupted summary.timeout aborts the diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
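
The two changes in this commit message (status-based classification and path normalization) could be sketched along these lines. `NONPASS` and `normalizePath` are names used in the PR, but their bodies here are guesses; `REPO_PREFIX`, `keyOf`, and `classify` are hypothetical helpers, and the prefix value is taken only from the `test262/test/x.js` example above.

```javascript
// Statuses that count as non-passing per the commit message.
const NONPASS = new Set(["fail", "timeout", "error"]);

// Assumed repo-relative prefix, based on the example spelling in the commit message.
const REPO_PREFIX = "test262/test/";

// Collapse repo-relative and Test262-root-relative spellings to one form.
function normalizePath(p) {
  return p.startsWith(REPO_PREFIX) ? p.slice(REPO_PREFIX.length) : p;
}

// Stable key: (normalized path, mode), joined with a tab.
function keyOf(result) {
  return `${normalizePath(result.path)}\t${result.mode}`;
}

// Classify one (path, mode) pair by status rather than by fail-set membership.
function classify(baseStatus, candStatus) {
  if (baseStatus === "pass" && NONPASS.has(candStatus)) return "regression";
  if (NONPASS.has(baseStatus) && candStatus === "pass") return "fixed";
  return null; // e.g. timeout -> skip did not newly pass, so it is not "fixed"
}

const sameKey =
  keyOf({ path: "test262/test/x.js", mode: "strict" }) ===
  keyOf({ path: "x.js", mode: "strict" });
console.log(sameKey); // prints "true"
console.log(classify("pass", "timeout")); // prints "regression"
console.log(classify("timeout", "skip")); // prints "null"
```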

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee3203c2f3


for (const [key, candStatus] of candidate) {
  const baseStatus = baseline.get(key);
  if (baseStatus === undefined) continue;
  if (baseStatus === "pass" && NONPASS.has(candStatus)) {


P2: Flag pass-to-skip transitions

When a candidate over-broadens skip metadata, a test that was observed and passing in the baseline appears as pass -> skip, but this condition only treats fail/timeout/error as regressions, so the script exits 0 and reports a clean diff. Since skipped tests still reduce Passed/Discovered and can hide lost coverage when both artifacts contain the key, pass-to-skip transitions should be reported as regressions or a separate warning rather than ignored.


A candidate that over-broadens skip metadata turns a baseline-passing test
into a skip. That is not a correctness regression (the engine did not start
failing), but it is not clean either: the test stops executing, lowering
Passed/Discovered and hiding lost coverage. Previously only fail/timeout/error
counted, so pass->skip exited 0 with a clean diff.

Report pass->skip transitions in a dedicated LOST COVERAGE section (kept
separate from REGRESSIONS so correctness vs coverage signals stay distinct)
and include them in the non-zero exit so they cannot pass silently. The
skip count is now also validated against summary.skipped.

Self-tested: pass->skip flagged as lost coverage (exit 1); timeout->skip is
NOT flagged (baseline was not passing); real no-change pair stays exit 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
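
A minimal sketch of how a separate LOST COVERAGE bucket could feed the exit code, assuming both inputs are already `Map`s from key to status. The function names `diffStatuses` and `exitCodeFor` and the row shape are hypothetical and are not taken from the script.

```javascript
const NONPASS = new Set(["fail", "timeout", "error"]);

// baseline and candidate are Maps of key -> status; only shared keys are classified.
function diffStatuses(baseline, candidate) {
  const out = { regressions: [], lostCoverage: [], fixed: [] };
  for (const [key, candStatus] of candidate) {
    const baseStatus = baseline.get(key);
    if (baseStatus === undefined) continue;
    const row = { key, from: baseStatus, to: candStatus };
    if (baseStatus === "pass" && NONPASS.has(candStatus)) {
      out.regressions.push(row); // correctness signal
    } else if (baseStatus === "pass" && candStatus === "skip") {
      out.lostCoverage.push(row); // coverage signal, kept separate
    } else if (NONPASS.has(baseStatus) && candStatus === "pass") {
      out.fixed.push(row);
    }
  }
  return out;
}

// Either signal makes the run exit non-zero so it cannot pass silently.
function exitCodeFor(diff) {
  return diff.regressions.length > 0 || diff.lostCoverage.length > 0 ? 1 : 0;
}

// Invented data: a.js goes pass -> skip, b.js goes timeout -> skip.
const baseline = new Map([["a.js\tstrict", "pass"], ["b.js\tstrict", "timeout"]]);
const candidate = new Map([["a.js\tstrict", "skip"], ["b.js\tstrict", "skip"]]);

const diff = diffStatuses(baseline, candidate);
console.log(diff.lostCoverage.length, exitCodeFor(diff)); // prints "1 1"
```

In this version timeout -> skip is left unclassified because the baseline was not passing, which matches the self-test described in the commit message.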

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eaa4e619ef


Comment thread scripts/test262_failing_diff.js Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/test262_failing_diff.js`:
- Line 44: The artifact loading in test262_failing_diff is using dynamic require
on a CLI path, which can execute JavaScript instead of only reading data. Update
the logic around the current require(require("path").resolve(path)) usage to
validate that the provided artifact path ends in .json, then read it with
fs.readFileSync and parse it with JSON.parse. Keep the change localized to the
artifact-loading code in scripts/test262_failing_diff.js so the CLI only accepts
JSON inputs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b5f34e3-d68a-456f-bcb7-dbd0e54d4973

📥 Commits

Reviewing files that changed from the base of the PR and between 6f65fd3 and eaa4e61.

📒 Files selected for processing (1)
  • scripts/test262_failing_diff.js


// Build (normalized path \t mode) -> status, validating non-pass counts.
function loadStatuses(path) {
  const data = require(require("path").resolve(path));


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify how artifact paths are sourced when this script is invoked.
rg -n -C2 --iglob '*.{yml,yaml,sh,js,mjs,cjs,ts,json,md}' 'test262_failing_diff\.js|node\s+scripts/test262_failing_diff\.js' .

# Confirm current dynamic require usage in this script.
rg -n -C2 'require\(require\("path"\)\.resolve\(path\)\)' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 1198


🏁 Script executed:

sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,140p' scripts/test262_failing_diff.js

Repository: dowdiness/js_engine

Length of output: 5147




Avoid dynamic require for artifact input.

scripts/test262_failing_diff.js:44 loads a CLI-supplied path via require(require("path").resolve(path)), which can execute a .js module instead of strictly parsing JSON. Use fs.readFileSync() + JSON.parse() and enforce a .json artifact path.
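
The suggested replacement might look something like the following. This is a sketch of the reviewer's recommendation, not the code that was merged; `loadArtifact` is a hypothetical name, and the exact error wording is invented.

```javascript
const fs = require("fs");
const path = require("path");

// Read an artifact strictly as JSON data. Unlike require(), this never
// executes the file, even if someone passes a path to a .js module.
function loadArtifact(artifactPath) {
  if (path.extname(artifactPath) !== ".json") {
    throw new Error(`expected a .json artifact, got: ${artifactPath}`);
  }
  const text = fs.readFileSync(path.resolve(artifactPath), "utf8");
  return JSON.parse(text);
}

// A non-JSON path is rejected before anything is read from disk.
try {
  loadArtifact("payload.js");
} catch (err) {
  console.log(err.message); // prints "expected a .json artifact, got: payload.js"
}
```

`JSON.parse` also throws on malformed input, so a truncated artifact fails loudly instead of producing a partial object.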


Source: Linters/SAST tools

The regression check relies on baseline pass rows as the only evidence that a
candidate non-pass is a real pass->fail change. The validator reconciled
fail/timeout/error/skip but not pass, so an artifact that dropped pass rows
while keeping summary.passed would validate clean, leave those keys absent from
the baseline map, and silently skip pass->fail regressions (exit 0).

Add two guards before trusting the maps:
- summary.total === results.length (global: catches dropped/duplicated rows)
- count(pass) === summary.passed (per-status: catches a mislabel that keeps total)

Verified the stricter guards do NOT false-abort on real runner artifacts
(total/passed/skipped all reconcile). Positive controls: dropping a pass row
aborts on the total guard; dropping it with total decremented aborts on the
pass-count guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
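
The two guards listed above could be expressed roughly as follows. `validateTotals` is a hypothetical helper and the error messages are invented; the sketch covers only the `total` and `passed` checks, not the other per-status counts the script also validates.

```javascript
// Abort (by throwing) unless the artifact's rows reconcile with its summary.
function validateTotals(artifact) {
  const { summary, results } = artifact;
  // Global guard: catches a net change in row count.
  if (summary.total !== results.length) {
    throw new Error(
      `summary.total is ${summary.total} but results has ${results.length} rows`
    );
  }
  // Per-status guard: catches a mislabel that keeps the total intact.
  const passCount = results.filter((r) => r.status === "pass").length;
  if (passCount !== summary.passed) {
    throw new Error(
      `summary.passed is ${summary.passed} but ${passCount} pass rows were counted`
    );
  }
}

// Invented data: a run that originally had 2 pass rows and 1 fail row,
// with one pass row dropped from results.
const rows = [
  { path: "a.js", mode: "strict", status: "pass" },
  { path: "b.js", mode: "strict", status: "fail" },
];

// Control 1: summary.total still says 3, so the total guard fires.
// Control 2: summary.total was decremented to 2, so the pass-count guard fires.
for (const summary of [{ total: 3, passed: 2 }, { total: 2, passed: 2 }]) {
  try {
    validateTotals({ summary, results: rows });
    console.log("ok");
  } catch (err) {
    console.log(`abort: ${err.message}`);
  }
}
```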

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6781cee48f


Comment thread scripts/test262_failing_diff.js Outdated
…ss->skip

Lost coverage previously fired only on pass->skip, so a candidate that
broadened skip metadata over a previously-executed failing test (fail/timeout/
error -> skip) fell through every case: skip is not in NONPASS so it was not a
regression, the baseline was not pass so it was not lost coverage, and the
candidate was not pass so it was not a fix. The failing test simply vanished
from the diff and the run exited 0 — making hidden failures look clean.

Broaden LOST COVERAGE to any executed (pass/fail/timeout/error) -> skip
transition. The reported `from` status distinguishes a lost pass (pass->skip)
from a hidden failure (fail->skip). Stable skips (skip->skip) are unaffected
because the baseline must be an executed status.

Self-tested: fail->skip and timeout->skip now flagged as lost coverage (exit 1);
pass->skip and pass->timeout still behave as before; real artifact pairs whose
skips are skip->skip in both runs stay exit 0 (no false positives).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
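
One way to picture the broadened rule is a single transition classifier. This is a sketch under assumptions: `EXECUTED` and `classifyTransition` are hypothetical names, and the order of the checks is a guess at how the cases might be arranged.

```javascript
const NONPASS = new Set(["fail", "timeout", "error"]);
// Statuses that mean the test actually ran in the baseline.
const EXECUTED = new Set(["pass", "fail", "timeout", "error"]);

function classifyTransition(from, to) {
  if (from === "pass" && NONPASS.has(to)) return "REGRESSION";
  if (EXECUTED.has(from) && to === "skip") return "LOST COVERAGE";
  if (NONPASS.has(from) && to === "pass") return "FIXED";
  return null; // stable or otherwise uninteresting, e.g. skip -> skip
}

const transitions = [
  ["fail", "skip"],
  ["timeout", "skip"],
  ["pass", "skip"],
  ["pass", "timeout"],
  ["skip", "skip"],
];
for (const [from, to] of transitions) {
  console.log(`${from} -> ${to}: ${classifyTransition(from, to) || "unclassified"}`);
}
```

Because the baseline must be an executed status, a test skipped in both runs never reaches the LOST COVERAGE branch.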

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f299a40b6


Comment thread scripts/test262_failing_diff.js Outdated
The total === results.length guard only catches a net row-count change. A
duplicate normalized (path,mode) paired with a dropped row keeps results.length
and every per-status count intact, so validation passed while Map.set silently
overwrote the duplicate — the dropped key then went missing from the baseline
map and was skipped as unclassified, letting a real pass->fail exit 0. The
"catches duplicated rows" comment was overclaiming.

Reject duplicate normalized keys while building the map, making
byKey.size === results.length an invariant. Combined with the total check, no
row can be silently dropped or overwritten.

Self-tested: a plain duplicate aborts; the duplicate+dropped-row scenario whose
counts fully reconcile now aborts too (was a false exit 0); real artifact pairs
have no duplicates and stay exit 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
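
The duplicate-key rejection might be sketched like this. `buildStatusMap` is a hypothetical name (the PR's walkthrough calls the real function `loadStatuses`), and path normalization is left out to keep the example short.

```javascript
// Build key -> status, refusing to overwrite an existing key.
// On success, byKey.size === results.length holds by construction.
function buildStatusMap(results) {
  const byKey = new Map();
  for (const r of results) {
    const key = `${r.path}\t${r.mode}`; // normalization omitted in this sketch
    if (byKey.has(key)) {
      throw new Error(`duplicate key: ${r.path} (${r.mode})`);
    }
    byKey.set(key, r.status);
  }
  return byKey;
}

// Invented data: the pass row for b.js was dropped and a.js was duplicated,
// so total = 3, passed = 2, failed = 1 would all still reconcile.
const results = [
  { path: "a.js", mode: "strict", status: "pass" },
  { path: "a.js", mode: "strict", status: "pass" },
  { path: "c.js", mode: "strict", status: "fail" },
];

try {
  buildStatusMap(results);
} catch (err) {
  console.log(`abort: ${err.message}`); // prints "abort: duplicate key: a.js (strict)"
}
```

Without the `has` check, `Map.set` would leave a two-entry map here and the count-based guards alone would not notice the missing b.js row.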
@dowdiness dowdiness merged commit f82e42b into main Jun 24, 2026
13 checks passed
@dowdiness dowdiness deleted the chore/test262-failing-diff-script branch June 24, 2026 16:30
1 participant