perf(resolver): pre-filter reduction pairs by `fam_requires` by doubleailes · Pull Request #68 · doubleailes/rer

doubleailes · 2026-05-15T14:24:42Z

Summary

Stacks on #67 (which stacks on #66). The phase loop builds a `pending_set` of `(scope_x, scope_y)` pairs and then calls `scopes[x].reduce_by(scopes[y].package_request)` for each. Profile data on rez's 188-case benchmark: 27 M reduce_by calls, but 98.5 % are immediate early-returns because scope[y]'s family doesn't appear in scope[x]'s `fam_requires`.

Doing the check at pair-generation time instead of pair-consumption time avoids tens of millions of function-call round trips, and — this turned out to be the bigger win in practice — keeps the pending set itself ~100× smaller. The downstream effect is across the board:

The `FxHashSet` doesn't rehash through 27 M inserts.
`pending.sort_unstable()` runs over ~400 K entries instead of 27 M.
The consume loop pops ~400 K entries instead of 27 M.
The `FxHashSet` growth allocations vanish.

What changed

New `PackageScope::fam_requires_contains(name)` that delegates to the slice (or returns `false` for conflict / ephemeral scopes, which carry no variants).
Each `pending_set.insert((x, y))` is replaced by a `maybe_push` helper that applies the filter at pair-generation time.
Same filter applied in the cascade inside the consume loop (when scope `x` narrows, we push `(j, x)` only for `j` that depend on `x`'s family).

The filter is sound: if scope[x]'s variants never mention some family, `reduce_by` with a request for that family would return `Unchanged` with no reductions. Skipping the call preserves both the result and the reduction list.

Benchmark (188 cases, release, same machine)

Stage	Total	Mean	vs rez
Baseline (main)	43.0 s	230 ms	8.8×
+ Shared cache (#66)	34.4 s	183 ms	11.1×
+ mimalloc (#67)	29.4 s	156 ms	13.0×
+ this filter, run 1	21.9 s	117 ms	17.5×
+ this filter, run 2	22.3 s	119 ms	17.1×

-25 % on top of #67, -49 % cumulative from main. Differential test (`cargo test`) also benefited — ~46 s → ~38 s.

The predicted gain was 1–3 % (just the per-call function overhead). The actual gain is 25 % because skipping at generation time also avoids growing/sorting/popping a 27 M-entry pending set. The wins are correlated but not what I expected — worth recording.

Correctness

`cargo build` — clean.
`cargo test` — passes.
`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored` — 188/188 still match rez 1:1.

Base

This PR targets `try-mimalloc` (#67). When #66 and #67 merge, GitHub will rebase this onto main; rebase should be clean.

🤖 Generated with Claude Code

qodo-code-review · 2026-05-15T14:24:46Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot paths are short-lived small allocations: `SmallVec` segments inside `Ranges<RerVersion>` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly this shape. Wire `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the global allocator in a PyO3 extension is more nuanced and is left for a separate change. ## Benchmark (188 cases, release, same machine, two consecutive runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (this branch), run 1 | 29.8 s | 158 ms | 12.8× | | + mimalloc (this branch), run 2 | 29.4 s | 156 ms | 13.0× | Cumulative -32 % from main. Stacks cleanly on top of #66. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of scope-pair indices and then iterates it, calling `scopes[x].reduce_by(scopes[y].package_request)` for each. Profile data on rez's 188-case benchmark: 27 M `reduce_by` calls, but **98.5 %** are immediate early-returns because `scopes[y]`'s family doesn't appear in `scopes[x]`'s `fam_requires` set. Doing that check at pair-generation time instead of pair-consumption time avoids tens of millions of function-call round trips, *and* — this is the bigger win in practice — keeps the pending set itself ~100× smaller. The downstream cost reduction is across the board: - the `FxHashSet` doesn't rehash through 27 M inserts - `pending.sort_unstable()` runs over ~400 K entries instead of 27 M - the consume loop pops ~400 K entries instead of 27 M - fewer allocations from the inner `pending_set` growth Implementation: - Add `PackageScope::fam_requires_contains(name)` that delegates to the slice (or returns `false` for conflict / ephemeral scopes). - Replace each `pending_set.insert((x, y))` with a `maybe_push` helper that applies the filter. - Apply the same filter in the cascade path inside the consume loop (when scope `x` narrows, we now only push `(j, x)` for `j` that actually depend on `x`'s family). ## Benchmark (188 cases, release, same machine) | Stage | Total | Mean | vs rez | |----------------------------------|--------:|-------:|--------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (PR #67) | 29.4 s | 156 ms | 13.0× | | **+ this filter, run 1** |**21.9 s**|**117 ms**|**17.5×**| | **+ this filter, run 2** |**22.3 s**|**119 ms**|**17.1×**| -25 % on top of #67 and **-49 % cumulative from main**. The differential test got the same lift — `cargo test … --release` over the 188 cases dropped from ~46 s to ~38 s. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The filter is sound: if `scopes[x].fam_requires_contains(name)` is `false`, then no variant in `scopes[x]` mentions that family, so `reduce_by` would have returned `Unchanged` with no reductions — skipping the call preserves both the result and the reduction list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Callgrind on the rez 188-case benchmark put `String::clone` at ~2.7 % of cycles and hashbrown ops at ~10 %, much of it for the package- family-name keys that flow through every `Requirement`, `PackageScope`, `PackageVariant`, `PackageVariantSlice` and the various `FxHashSet<String>` / `FxHashMap<String, V>` containers in the solver hot path. Replace `String` with a new `Name` type alias (`Rc<str>`) for those fields. The clones the solver makes constantly during a solve — `Requirement::clone()` for the per-call `extracted_request.clone()`, `req.clone()` and `package_request.clone()` flow, `PackageScope` clones on every narrowing, `extracted_fams` / `request_fams` / `fam_requires` / `common_fams` set ops — become refcount bumps instead of heap allocations. `Rc<str>` `Borrow<str>`s, so `HashMap<Name, V>::get(&str)` and `HashSet<Name>::contains(&str)` still work transparently. The repo input type stays as `HashMap<String, …>` (it is the JSON shape callers hand in); conversion to `Name` happens at the boundary in `PackageVariantList::new`, `PackageVariantCache`, and `PackageScope::new`. `PackageVariantCache` now uses `get_or_build` with a two-step lookup to avoid allocating a fresh `Name` on cache hits. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#68) | 22.1 s | 117 ms | 17.4× | | + interned names, run 1 | 18.6 s | 99 ms | 20.6× | | + interned names, run 2 | 18.6 s | 99 ms | 20.5× | **-15.5 % on top of #68, -57 % cumulative from main** (43.0 s → 18.6 s, 8.8× rez → 20.5× rez). ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The differential test itself runs faster — ~38 s → ~28 s — same effect on its workload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`PackageVariantSlice::extractable` was a `HashSet::is_subset` call that iterated `common_fams` and tested membership in `extracted_fams` on every probe. Fresh callgrind on the rez 188-case benchmark (after #66/#67/#68/#70) put `HashSet::is_subset` at **11.8 %** of inclusive cycles — nearly half of `PackageVariantSlice::extract`'s total 25.3 %. Every `extract()` call hits this guard; the vast majority return early with "nothing left to extract." The set operation is unnecessary. `extracted_fams` is only ever populated by inserting an element of `common_fams.difference( extracted_fams)` in `extract`, and `copy_with_entries` resets it to empty. So `extracted_fams ⊆ common_fams` always holds, and under that invariant the `is_subset` check is equivalent to a length compare: common_fams.is_subset(extracted_fams) ⟺ common_fams ⊆ extracted_fams ⟺ (since extracted ⊆ common) common_fams == extracted_fams ⟺ |common| == |extracted| Replace the body with `common_fams().len() > extracted_fams.len()`. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |--------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#70) | 18.6 s | 99 ms | 20.5× | | + this change, run 1 | 13.1 s | 70 ms | 29.1× | | + this change, run 2 | 13.9 s | 74 ms | 27.4× | **-30 % on top of #70**, **-69 % cumulative from main** (43.0 s → 13.1 s, 8.8× rez → 29.1× rez). Differential test got the same lift: 188/188 still match rez 1:1, in 20.77 s (down from 27.67 s). Predicted gain was 5–10 %. Like #68, hidden downstream costs (the `is_subset` iterator setup/teardown, the hash lookups it performed, and the now-unnecessary `common_fams_cache` first-time computation on slices that never need extraction) made the actual gain larger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Callgrind on rez's 188-case benchmark (post-#71/#72) showed `SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from `VersionRange::clone`. Every `Requirement::clone()` (in `extracted_request.clone()`, the per-pair `package_request.clone()` in `reduce_by`, the `req.clone()` and `package_request.clone()` in `Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of `(Bound, Bound)` segments. After the rest of the perf stack (#66/#67/#68/#70/#71/#72), this is the largest non-amortised allocation cost left. Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`. `Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the inner `T`, so the derived semantics on `VersionRange` are unchanged. Methods that build a new range (`intersection`, `union`, `complement`, `from_versions`, `span`, `split`, ...) still produce a fresh `Ranges` internally and wrap it with `Rc::new` — the win is on the read / clone path, not the construction path. `as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges` now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is shared, but is the consume-the-`VersionRange` API and rare in practice. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (post-#71/#72), median | ~12.7 s | 68 ms | ~30× | | + this change, run 1 | 11.2 s | 60 ms | 34.1× | | + this change, run 2 | 11.3 s | 60 ms | 33.7× | **~11 % on top of #72**, **~74 % cumulative from main** (43.0 s → 11.2 s, 8.8× rez → 34.1× rez). Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still match rez 1:1**. Predicted 3–5 %. The slightly bigger gain reflects that `VersionRange::clone` cascades into a lot more than just the `SmallVec::extend` it was attributed to in the callgrind exclusive view — it also drove allocator-side work and the matching `Drop`s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

doubleailes force-pushed the try-mimalloc branch from 224497b to 484162d Compare May 15, 2026 14:28

doubleailes force-pushed the try-mimalloc branch from 484162d to 7546273 Compare May 15, 2026 14:32

doubleailes force-pushed the filter-pending-pairs branch from 1943ee1 to f267f7f Compare May 15, 2026 14:33

Base automatically changed from try-mimalloc to main May 15, 2026 14:36

doubleailes merged commit 3afa266 into main May 15, 2026
24 checks passed

doubleailes deleted the filter-pending-pairs branch May 15, 2026 14:37

doubleailes mentioned this pull request May 15, 2026

perf(resolver): intern package family names as Rc<str> #70

Merged

3 tasks

doubleailes mentioned this pull request May 15, 2026

perf(resolver): O(1) extractable() via length compare #71

Merged

3 tasks

doubleailes mentioned this pull request May 15, 2026

perf(version): wrap Ranges in Rc inside VersionRange #73

Merged

3 tasks

This was referenced May 15, 2026

docs(readme): add local apples-to-apples rez 3.3.0 benchmark #74

Merged

bench(resolver): repurpose criterion at the rez_solver hot path #77

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(resolver): pre-filter reduction pairs by `fam_requires`#68

perf(resolver): pre-filter reduction pairs by `fam_requires`#68
doubleailes merged 2 commits into
mainfrom
filter-pending-pairs

doubleailes commented May 15, 2026

Uh oh!

qodo-code-review Bot commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

doubleailes commented May 15, 2026

Summary

What changed

Benchmark (188 cases, release, same machine)

Correctness

Base

Uh oh!

qodo-code-review Bot commented May 15, 2026

Qodo reviews are paused for this user.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant