perf(examples): use mimalloc as global allocator in the bench binary by doubleailes · Pull Request #67 · doubleailes/rer

doubleailes · 2026-05-15T14:15:51Z

Summary

Stacks on top of #66. Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot allocations are short-lived small ones: `SmallVec` segments inside `Ranges` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly that shape.

This wires `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the allocator in a PyO3 extension has Python-allocator interactions worth treating as a separate change.

Benchmark (188 cases, release, same machine)

Stage	Total	Mean	vs rez
Baseline (main)	43.0 s	230 ms	8.8×
+ Shared cache (#66)	34.4 s	183 ms	11.1×
+ mimalloc, run 1	29.8 s	158 ms	12.8×
+ mimalloc, run 2	29.4 s	156 ms	13.0×

Additive -13 % on top of #66; -32 % cumulative from main.

Correctness

`cargo build` — clean.
`cargo test` — passes.
`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored` — 188/188 still match rez 1:1. (The differential test runs through its own test binary, which doesn't enable mimalloc, so its timing didn't change — but it confirms correctness wasn't broken.)

Base

This PR targets `shared-variant-cache` (`#66`). Once #66 merges, GitHub will retarget this to `main` automatically; rebase will be a no-op.

🤖 Generated with Claude Code

qodo-code-review · 2026-05-15T14:15:55Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot paths are short-lived small allocations: `SmallVec` segments inside `Ranges<RerVersion>` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly this shape. Wire `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the global allocator in a PyO3 extension is more nuanced and is left for a separate change. ## Benchmark (188 cases, release, same machine, two consecutive runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (this branch), run 1 | 29.8 s | 158 ms | 12.8× | | + mimalloc (this branch), run 2 | 29.4 s | 156 ms | 13.0× | Cumulative -32 % from main. Stacks cleanly on top of #66. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of scope-pair indices and then iterates it, calling `scopes[x].reduce_by(scopes[y].package_request)` for each. Profile data on rez's 188-case benchmark: 27 M `reduce_by` calls, but **98.5 %** are immediate early-returns because `scopes[y]`'s family doesn't appear in `scopes[x]`'s `fam_requires` set. Doing that check at pair-generation time instead of pair-consumption time avoids tens of millions of function-call round trips, *and* — this is the bigger win in practice — keeps the pending set itself ~100× smaller. The downstream cost reduction is across the board: - the `FxHashSet` doesn't rehash through 27 M inserts - `pending.sort_unstable()` runs over ~400 K entries instead of 27 M - the consume loop pops ~400 K entries instead of 27 M - fewer allocations from the inner `pending_set` growth Implementation: - Add `PackageScope::fam_requires_contains(name)` that delegates to the slice (or returns `false` for conflict / ephemeral scopes). - Replace each `pending_set.insert((x, y))` with a `maybe_push` helper that applies the filter. - Apply the same filter in the cascade path inside the consume loop (when scope `x` narrows, we now only push `(j, x)` for `j` that actually depend on `x`'s family). ## Benchmark (188 cases, release, same machine) | Stage | Total | Mean | vs rez | |----------------------------------|--------:|-------:|--------:| | Baseline (main) | 43.0 s | 230 ms | 8.8× | | + Shared cache (PR #66) | 34.4 s | 183 ms | 11.1× | | + mimalloc (PR #67) | 29.4 s | 156 ms | 13.0× | | **+ this filter, run 1** |**21.9 s**|**117 ms**|**17.5×**| | **+ this filter, run 2** |**22.3 s**|**119 ms**|**17.1×**| -25 % on top of #67 and **-49 % cumulative from main**. The differential test got the same lift — `cargo test … --release` over the 188 cases dropped from ~46 s to ~38 s. ## Correctness 188/188 still match rez 1:1 (`cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored`). The filter is sound: if `scopes[x].fam_requires_contains(name)` is `false`, then no variant in `scopes[x]` mentions that family, so `reduce_by` would have returned `Unchanged` with no reductions — skipping the call preserves both the result and the reduction list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`PackageVariantSlice::extractable` was a `HashSet::is_subset` call that iterated `common_fams` and tested membership in `extracted_fams` on every probe. Fresh callgrind on the rez 188-case benchmark (after #66/#67/#68/#70) put `HashSet::is_subset` at **11.8 %** of inclusive cycles — nearly half of `PackageVariantSlice::extract`'s total 25.3 %. Every `extract()` call hits this guard; the vast majority return early with "nothing left to extract." The set operation is unnecessary. `extracted_fams` is only ever populated by inserting an element of `common_fams.difference( extracted_fams)` in `extract`, and `copy_with_entries` resets it to empty. So `extracted_fams ⊆ common_fams` always holds, and under that invariant the `is_subset` check is equivalent to a length compare: common_fams.is_subset(extracted_fams) ⟺ common_fams ⊆ extracted_fams ⟺ (since extracted ⊆ common) common_fams == extracted_fams ⟺ |common| == |extracted| Replace the body with `common_fams().len() > extracted_fams.len()`. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |--------------------------------------|--------:|-------:|-------:| | Baseline (main, post-#70) | 18.6 s | 99 ms | 20.5× | | + this change, run 1 | 13.1 s | 70 ms | 29.1× | | + this change, run 2 | 13.9 s | 74 ms | 27.4× | **-30 % on top of #70**, **-69 % cumulative from main** (43.0 s → 13.1 s, 8.8× rez → 29.1× rez). Differential test got the same lift: 188/188 still match rez 1:1, in 20.77 s (down from 27.67 s). Predicted gain was 5–10 %. Like #68, hidden downstream costs (the `is_subset` iterator setup/teardown, the hash lookups it performed, and the now-unnecessary `common_fams_cache` first-time computation on slices that never need extraction) made the actual gain larger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Callgrind on rez's 188-case benchmark (post-#71/#72) showed `SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from `VersionRange::clone`. Every `Requirement::clone()` (in `extracted_request.clone()`, the per-pair `package_request.clone()` in `reduce_by`, the `req.clone()` and `package_request.clone()` in `Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of `(Bound, Bound)` segments. After the rest of the perf stack (#66/#67/#68/#70/#71/#72), this is the largest non-amortised allocation cost left. Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`. `Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the inner `T`, so the derived semantics on `VersionRange` are unchanged. Methods that build a new range (`intersection`, `union`, `complement`, `from_versions`, `span`, `split`, ...) still produce a fresh `Ranges` internally and wrap it with `Rc::new` — the win is on the read / clone path, not the construction path. `as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges` now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is shared, but is the consume-the-`VersionRange` API and rare in practice. ## Benchmark (188 cases, release, same machine, two runs) | Stage | Total | Mean | vs rez | |------------------------------------|--------:|-------:|-------:| | Baseline (post-#71/#72), median | ~12.7 s | 68 ms | ~30× | | + this change, run 1 | 11.2 s | 60 ms | 34.1× | | + this change, run 2 | 11.3 s | 60 ms | 33.7× | **~11 % on top of #72**, **~74 % cumulative from main** (43.0 s → 11.2 s, 8.8× rez → 34.1× rez). Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still match rez 1:1**. Predicted 3–5 %. The slightly bigger gain reflects that `VersionRange::clone` cascades into a lot more than just the `SmallVec::extend` it was attributed to in the callgrind exclusive view — it also drove allocator-side work and the matching `Drop`s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

doubleailes mentioned this pull request May 15, 2026

perf(resolver): pre-filter reduction pairs by fam_requires #68

Merged

3 tasks

Base automatically changed from shared-variant-cache to main May 15, 2026 14:27

doubleailes force-pushed the try-mimalloc branch from 224497b to 484162d Compare May 15, 2026 14:28

doubleailes force-pushed the try-mimalloc branch from 484162d to 7546273 Compare May 15, 2026 14:32

doubleailes merged commit e6e5e01 into main May 15, 2026
24 checks passed

doubleailes deleted the try-mimalloc branch May 15, 2026 14:36

This was referenced May 15, 2026

perf(resolver): intern package family names as Rc<str> #70

Merged

perf(resolver): O(1) extractable() via length compare #71

Merged

doubleailes mentioned this pull request May 15, 2026

perf(version): wrap Ranges in Rc inside VersionRange #73

Merged

3 tasks

doubleailes mentioned this pull request May 15, 2026

docs(readme): add local apples-to-apples rez 3.3.0 benchmark #74

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(examples): use mimalloc as global allocator in the bench binary#67

perf(examples): use mimalloc as global allocator in the bench binary#67
doubleailes merged 1 commit into
mainfrom
try-mimalloc

doubleailes commented May 15, 2026

Uh oh!

qodo-code-review Bot commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

doubleailes commented May 15, 2026

Summary

Benchmark (188 cases, release, same machine)

Correctness

Base

Uh oh!

qodo-code-review Bot commented May 15, 2026

Qodo reviews are paused for this user.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant