perf(resolver): intern package family names as Rc<str>#70
Merged
Conversation
Callgrind on the rez 188-case benchmark put `String::clone` at ~2.7 % of cycles and hashbrown ops at ~10 %, much of it for the package- family-name keys that flow through every `Requirement`, `PackageScope`, `PackageVariant`, `PackageVariantSlice` and the various `FxHashSet<String>` / `FxHashMap<String, V>` containers in the solver hot path. Replace `String` with a new `Name` type alias (`Rc<str>`) for those fields. The clones the solver makes constantly during a solve — `Requirement::clone()` for the per-call `extracted_request.clone()`, `req.clone()` and `package_request.clone()` flow, `PackageScope` clones on every narrowing, `extracted_fams` / `request_fams` / `fam_requires` / `common_fams` set ops — become refcount bumps instead of heap allocations. `Rc<str>` `Borrow<str>`s, so `HashMap<Name, V>::get(&str)` and `HashSet<Name>::contains(&str)` still work transparently. The repo input type stays as `HashMap<String, …>` (it is the JSON shape callers hand in); conversion to `Name` happens at the boundary in `PackageVariantList::new`, `PackageVariantCache`, and `PackageScope::new`. `PackageVariantCache` now uses `get_or_build` with a two-step lookup to avoid allocating a fresh `Name` on cache hits. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#68) | 22.1 s | 117 ms | 17.4× | | + interned names, run 1 | 18.6 s | 99 ms | 20.6× | | + interned names, run 2 | 18.6 s | 99 ms | 20.5× | **-15.5 % on top of #68, -57 % cumulative from main** (43.0 s → 18.6 s, 8.8× rez → 20.5× rez). ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The differential test itself runs faster — ~38 s → ~28 s — same effect on its workload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
3 tasks
doubleailes
added a commit
that referenced
this pull request
May 15, 2026
`PackageVariantSlice::extractable` was a `HashSet::is_subset` call that iterated `common_fams` and tested membership in `extracted_fams` on every probe. Fresh callgrind on the rez 188-case benchmark (after #66/#67/#68/#70) put `HashSet::is_subset` at **11.8 %** of inclusive cycles — nearly half of `PackageVariantSlice::extract`'s total 25.3 %. Every `extract()` call hits this guard; the vast majority return early with "nothing left to extract." The set operation is unnecessary. `extracted_fams` is only ever populated by inserting an element of `common_fams.difference( extracted_fams)` in `extract`, and `copy_with_entries` resets it to empty. So `extracted_fams ⊆ common_fams` always holds, and under that invariant the `is_subset` check is equivalent to a length compare: common_fams.is_subset(extracted_fams) ⟺ common_fams ⊆ extracted_fams ⟺ (since extracted ⊆ common) common_fams == extracted_fams ⟺ |common| == |extracted| Replace the body with `common_fams().len() > extracted_fams.len()`. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |--------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#70) | 18.6 s | 99 ms | 20.5× | | + this change, run 1 | 13.1 s | 70 ms | 29.1× | | + this change, run 2 | 13.9 s | 74 ms | 27.4× | **-30 % on top of #70**, **-69 % cumulative from main** (43.0 s → 13.1 s, 8.8× rez → 29.1× rez). Differential test got the same lift: 188/188 still match rez 1:1, in 20.77 s (down from 27.67 s). Predicted gain was 5–10 %. Like #68, hidden downstream costs (the `is_subset` iterator setup/teardown, the hash lookups it performed, and the now-unnecessary `common_fams_cache` first-time computation on slices that never need extraction) made the actual gain larger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
doubleailes
added a commit
that referenced
this pull request
May 15, 2026
Callgrind on rez's 188-case benchmark (post-#71/#72) showed `SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from `VersionRange::clone`. Every `Requirement::clone()` (in `extracted_request.clone()`, the per-pair `package_request.clone()` in `reduce_by`, the `req.clone()` and `package_request.clone()` in `Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of `(Bound, Bound)` segments. After the rest of the perf stack (#66/#67/#68/#70/#71/#72), this is the largest non-amortised allocation cost left. Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`. `Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the inner `T`, so the derived semantics on `VersionRange` are unchanged. Methods that build a new range (`intersection`, `union`, `complement`, `from_versions`, `span`, `split`, ...) still produce a fresh `Ranges` internally and wrap it with `Rc::new` — the win is on the read / clone path, not the construction path. `as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges` now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is shared, but is the consume-the-`VersionRange` API and rare in practice. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (post-#71/#72), median | ~12.7 s | 68 ms | ~30× | | + this change, run 1 | 11.2 s | 60 ms | 34.1× | | + this change, run 2 | 11.3 s | 60 ms | 33.7× | **~11 % on top of #72**, **~74 % cumulative from main** (43.0 s → 11.2 s, 8.8× rez → 34.1× rez). Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still match rez 1:1**. Predicted 3–5 %. The slightly bigger gain reflects that `VersionRange::clone` cascades into a lot more than just the `SmallVec::extend` it was attributed to in the callgrind exclusive view — it also drove allocator-side work and the matching `Drop`s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
doubleailes
added a commit
that referenced
this pull request
May 15, 2026
Callgrind on rez's 188-case benchmark (post-#71/#72) showed `SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from `VersionRange::clone`. Every `Requirement::clone()` (in `extracted_request.clone()`, the per-pair `package_request.clone()` in `reduce_by`, the `req.clone()` and `package_request.clone()` in `Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of `(Bound, Bound)` segments. After the rest of the perf stack (#66/#67/#68/#70/#71/#72), this is the largest non-amortised allocation cost left. Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`. `Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the inner `T`, so the derived semantics on `VersionRange` are unchanged. Methods that build a new range (`intersection`, `union`, `complement`, `from_versions`, `span`, `split`, ...) still produce a fresh `Ranges` internally and wrap it with `Rc::new` — the win is on the read / clone path, not the construction path. `as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges` now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is shared, but is the consume-the-`VersionRange` API and rare in practice. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (post-#71/#72), median | ~12.7 s | 68 ms | ~30× | | + this change, run 1 | 11.2 s | 60 ms | 34.1× | | + this change, run 2 | 11.3 s | 60 ms | 33.7× | **~11 % on top of #72**, **~74 % cumulative from main** (43.0 s → 11.2 s, 8.8× rez → 34.1× rez). Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still match rez 1:1**. Predicted 3–5 %. The slightly bigger gain reflects that `VersionRange::clone` cascades into a lot more than just the `SmallVec::extend` it was attributed to in the callgrind exclusive view — it also drove allocator-side work and the matching `Drop`s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Callgrind on the rez 188-case benchmark put `String::clone` at ~2.7 % of cycles and hashbrown ops at ~10 %, much of it for the package-family-name keys that flow through every `Requirement`, `PackageScope`, `PackageVariant`, `PackageVariantSlice` and the various `FxHashSet` / `FxHashMap<String, V>` containers on the solver's hot path.
This PR introduces a `Name` type alias (`Rc`) and uses it for those fields. The clones the solver makes constantly during a solve — `Requirement::clone()`, `PackageScope` clones on every narrowing, `extracted_fams` / `request_fams` / `fam_requires` / `common_fams` set ops — become refcount bumps instead of heap allocations.
What changed
`Rc` `Borrow`s, so `HashMap<Name, V>::get(&str)` and `HashSet::contains(&str)` still work transparently — no caller-side ergonomic regression.
Benchmark (188 cases, release, same machine)
-15.5 % additive, -57 % cumulative from main (43.0 s → 18.6 s, 8.8× → 20.5× rez).
Differential test got the same lift: ~38 s → ~28 s.
Correctness
🤖 Generated with Claude Code