perf(resolver): allow sharing the variant cache across solves by doubleailes · Pull Request #66 · doubleailes/rer

doubleailes · 2026-05-15T13:50:42Z

Summary

Profiling rer's solve on rez's 188-case benchmark showed that `PackageScope::new` accounts for roughly 22 % of total cycles — almost all of which is `PackageVariantList::new` and `get_intersection`, i.e. the one-time work of loading a family's variants and parsing every variant's requires.

That work was already cached within a `Solver`, but the cache was rebuilt for every `Solver::new`. Real-world rer usage — and rez's integration pattern — solves many requests against one already-loaded repository, so the cache should outlive the `Solver`.

What changed

`SolverContext.cache` is now `Rc<RefCell>` (was `RefCell`). Existing call sites are untouched — `RefCell::borrow_mut` works through the `Rc`.
New public type `SharedVariantCache = Rc<RefCell>` and a `make_shared_cache()` constructor.
New constructors `SolverContext::new_with_cache` and `Solver::new_with_cache` that take a pre-built cache. The existing `new` constructors are kept and just delegate (each call builds a fresh cache) — fully backward compatible.
The benchmark example `rez_benchmark_dataset` now builds one cache up-front and reuses it across all 188 cases. That mirrors the way rer is meant to be used from rez and exercises the new path.

The caller is responsible for pairing a shared cache with a single repository — the cache memoises both parsed `PackageVariantList`s and "family present?", both of which are wrong against a different repo. Documented in the doc comment on `SharedVariantCache`.

Verification

188-case benchmark (release, same machine, two consecutive runs)

	Total	Mean	vs rez
Baseline (main)	43.0 s	230 ms	8.8×
Shared cache, run 1	34.4 s	183 ms	11.1×
Shared cache, run 2	35.6 s	189 ms	10.7×

20 % reduction in total solve time, exactly in the range predicted by the callgrind profile.

Correctness

`cargo build` — clean.
`cargo test` — passes.
`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored` — 188/188 still match rez 1:1.

🤖 Generated with Claude Code

Profiling rer's solve on rez's 188-case benchmark showed that `PackageScope::new` accounts for roughly 22% of total cycles — almost all of which is `PackageVariantList::new` and `get_intersection`, i.e. the one-time work of loading a family's variants and parsing every variant's requires from the in-memory `PackageData`. That work was already cached *within* a `Solver`, but the cache was re-created for every `Solver::new`. Real-world rer usage — and rez's own integration pattern — solves many requests against one already- loaded repository, so the cache should outlive the `Solver`. This commit: - Wraps `SolverContext.cache` in `Rc<RefCell<…>>` so it can be shared. Existing call sites are unchanged — `RefCell::borrow_mut` works through the `Rc`. - Adds a `SharedVariantCache` type alias and a `make_shared_cache()` helper. - Adds `SolverContext::new_with_cache` and `Solver::new_with_cache` constructors that accept a pre-built cache. The existing `new` constructors are kept and delegate (creating a fresh cache each time, identical behaviour to before). - The caller is responsible for pairing the cache with the right repository — a cache built against repo A is invalid against repo B, since it memoises both parsed `PackageVariantList`s and "family present?". - Updates the `rez_benchmark_dataset` example to share one cache across all 188 cases — that mirrors the integration shape and exercises the new path. ## Benchmark (188 cases, release, same machine, two runs) | | Total | Mean | vs rez | |----------------|--------:|-------:|---------:| | Baseline | 43.0 s | 230 ms | 8.8× | | Shared cache 1 | 34.4 s | 183 ms | 11.1× | | Shared cache 2 | 35.6 s | 189 ms | 10.7× | 20 % reduction in total solve time. The rez 188-case differential test still passes 188/188 — correctness is preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

qodo-code-review · 2026-05-15T13:50:47Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot paths are short-lived small allocations: `SmallVec` segments inside `Ranges<RerVersion>` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly this shape. Wire `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the global allocator in a PyO3 extension is more nuanced and is left for a separate change. ## Benchmark (188 cases, release, same machine, two consecutive runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (this branch), run 1 | 29.8 s | 158 ms | 12.8× | | + mimalloc (this branch), run 2 | 29.4 s | 156 ms | 13.0× | Cumulative -32 % from main. Stacks cleanly on top of #66. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of scope-pair indices and then iterates it, calling `scopes[x].reduce_by(scopes[y].package_request)` for each. Profile data on rez's 188-case benchmark: 27 M `reduce_by` calls, but **98.5 %** are immediate early-returns because `scopes[y]`'s family doesn't appear in `scopes[x]`'s `fam_requires` set. Doing that check at pair-generation time instead of pair-consumption time avoids tens of millions of function-call round trips, *and* — this is the bigger win in practice — keeps the pending set itself ~100× smaller. The downstream cost reduction is across the board: - the `FxHashSet` doesn't rehash through 27 M inserts - `pending.sort_unstable()` runs over ~400 K entries instead of 27 M - the consume loop pops ~400 K entries instead of 27 M - fewer allocations from the inner `pending_set` growth Implementation: - Add `PackageScope::fam_requires_contains(name)` that delegates to the slice (or returns `false` for conflict / ephemeral scopes). - Replace each `pending_set.insert((x, y))` with a `maybe_push` helper that applies the filter. - Apply the same filter in the cascade path inside the consume loop (when scope `x` narrows, we now only push `(j, x)` for `j` that actually depend on `x`'s family). ## Benchmark (188 cases, release, same machine) | Stage | Total | Mean | vs rez | |----------------------------------|--------:|-------:|--------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (PR #67) | 29.4 s | 156 ms | 13.0× | | **+ this filter, run 1** |**21.9 s**|**117 ms**|**17.5×**| | **+ this filter, run 2** |**22.3 s**|**119 ms**|**17.1×**| -25 % on top of #67 and **-49 % cumulative from main**. The differential test got the same lift — `cargo test … --release` over the 188 cases dropped from ~46 s to ~38 s. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The filter is sound: if `scopes[x].fam_requires_contains(name)` is `false`, then no variant in `scopes[x]` mentions that family, so `reduce_by` would have returned `Unchanged` with no reductions — skipping the call preserves both the result and the reduction list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot paths are short-lived small allocations: `SmallVec` segments inside `Ranges<RerVersion>` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly this shape. Wire `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the global allocator in a PyO3 extension is more nuanced and is left for a separate change. ## Benchmark (188 cases, release, same machine, two consecutive runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (this branch), run 1 | 29.8 s | 158 ms | 12.8× | | + mimalloc (this branch), run 2 | 29.4 s | 156 ms | 13.0× | Cumulative -32 % from main. Stacks cleanly on top of #66. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of scope-pair indices and then iterates it, calling `scopes[x].reduce_by(scopes[y].package_request)` for each. Profile data on rez's 188-case benchmark: 27 M `reduce_by` calls, but **98.5 %** are immediate early-returns because `scopes[y]`'s family doesn't appear in `scopes[x]`'s `fam_requires` set. Doing that check at pair-generation time instead of pair-consumption time avoids tens of millions of function-call round trips, *and* — this is the bigger win in practice — keeps the pending set itself ~100× smaller. The downstream cost reduction is across the board: - the `FxHashSet` doesn't rehash through 27 M inserts - `pending.sort_unstable()` runs over ~400 K entries instead of 27 M - the consume loop pops ~400 K entries instead of 27 M - fewer allocations from the inner `pending_set` growth Implementation: - Add `PackageScope::fam_requires_contains(name)` that delegates to the slice (or returns `false` for conflict / ephemeral scopes). - Replace each `pending_set.insert((x, y))` with a `maybe_push` helper that applies the filter. - Apply the same filter in the cascade path inside the consume loop (when scope `x` narrows, we now only push `(j, x)` for `j` that actually depend on `x`'s family). ## Benchmark (188 cases, release, same machine) | Stage | Total | Mean | vs rez | |----------------------------------|--------:|-------:|--------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (PR #67) | 29.4 s | 156 ms | 13.0× | | **+ this filter, run 1** |**21.9 s**|**117 ms**|**17.5×**| | **+ this filter, run 2** |**22.3 s**|**119 ms**|**17.1×**| -25 % on top of #67 and **-49 % cumulative from main**. The differential test got the same lift — `cargo test … --release` over the 188 cases dropped from ~46 s to ~38 s. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The filter is sound: if `scopes[x].fam_requires_contains(name)` is `false`, then no variant in `scopes[x]` mentions that family, so `reduce_by` would have returned `Unchanged` with no reductions — skipping the call preserves both the result and the reduction list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`PackageVariantSlice::extractable` was a `HashSet::is_subset` call that iterated `common_fams` and tested membership in `extracted_fams` on every probe. Fresh callgrind on the rez 188-case benchmark (after #66/#67/#68/#70) put `HashSet::is_subset` at **11.8 %** of inclusive cycles — nearly half of `PackageVariantSlice::extract`'s total 25.3 %. Every `extract()` call hits this guard; the vast majority return early with "nothing left to extract." The set operation is unnecessary. `extracted_fams` is only ever populated by inserting an element of `common_fams.difference( extracted_fams)` in `extract`, and `copy_with_entries` resets it to empty. So `extracted_fams ⊆ common_fams` always holds, and under that invariant the `is_subset` check is equivalent to a length compare: common_fams.is_subset(extracted_fams) ⟺ common_fams ⊆ extracted_fams ⟺ (since extracted ⊆ common) common_fams == extracted_fams ⟺ |common| == |extracted| Replace the body with `common_fams().len() > extracted_fams.len()`. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |--------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#70) | 18.6 s | 99 ms | 20.5× | | + this change, run 1 | 13.1 s | 70 ms | 29.1× | | + this change, run 2 | 13.9 s | 74 ms | 27.4× | **-30 % on top of #70**, **-69 % cumulative from main** (43.0 s → 13.1 s, 8.8× rez → 29.1× rez). Differential test got the same lift: 188/188 still match rez 1:1, in 20.77 s (down from 27.67 s). Predicted gain was 5–10 %. Like #68, hidden downstream costs (the `is_subset` iterator setup/teardown, the hash lookups it performed, and the now-unnecessary `common_fams_cache` first-time computation on slices that never need extraction) made the actual gain larger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Callgrind on rez's 188-case benchmark (post-#71/#72) showed `SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from `VersionRange::clone`. Every `Requirement::clone()` (in `extracted_request.clone()`, the per-pair `package_request.clone()` in `reduce_by`, the `req.clone()` and `package_request.clone()` in `Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of `(Bound, Bound)` segments. After the rest of the perf stack (#66/#67/#68/#70/#71/#72), this is the largest non-amortised allocation cost left. Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`. `Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the inner `T`, so the derived semantics on `VersionRange` are unchanged. Methods that build a new range (`intersection`, `union`, `complement`, `from_versions`, `span`, `split`, ...) still produce a fresh `Ranges` internally and wrap it with `Rc::new` — the win is on the read / clone path, not the construction path. `as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges` now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is shared, but is the consume-the-`VersionRange` API and rare in practice. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (post-#71/#72), median | ~12.7 s | 68 ms | ~30× | | + this change, run 1 | 11.2 s | 60 ms | 34.1× | | + this change, run 2 | 11.3 s | 60 ms | 33.7× | **~11 % on top of #72**, **~74 % cumulative from main** (43.0 s → 11.2 s, 8.8× rez → 34.1× rez). Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still match rez 1:1**. Predicted 3–5 %. The slightly bigger gain reflects that `VersionRange::clone` cascades into a lot more than just the `SmallVec::extend` it was attributed to in the callgrind exclusive view — it also drove allocator-side work and the matching `Drop`s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced May 15, 2026

perf(examples): use mimalloc as global allocator in the bench binary #67

Merged

perf(resolver): pre-filter reduction pairs by fam_requires #68

Merged

doubleailes merged commit 241ee4f into main May 15, 2026
23 of 24 checks passed

doubleailes deleted the shared-variant-cache branch May 15, 2026 14:27

This was referenced May 15, 2026

perf(resolver): intern package family names as Rc<str> #70

Merged

perf(resolver): O(1) extractable() via length compare #71

Merged

doubleailes mentioned this pull request May 15, 2026

perf(version): wrap Ranges in Rc inside VersionRange #73

Merged

3 tasks

This was referenced May 15, 2026

docs(readme): add local apples-to-apples rez 3.3.0 benchmark #74

Merged

docs: add "Wiring pyrer into rez" integration guide #76

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(resolver): allow sharing the variant cache across solves#66

perf(resolver): allow sharing the variant cache across solves#66
doubleailes merged 1 commit into
mainfrom
shared-variant-cache

doubleailes commented May 15, 2026

Uh oh!

qodo-code-review Bot commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

doubleailes commented May 15, 2026

Summary

What changed

Verification

188-case benchmark (release, same machine, two consecutive runs)

Correctness

Uh oh!

qodo-code-review Bot commented May 15, 2026

Qodo reviews are paused for this user.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant