Skip to content

perf(examples): use mimalloc as global allocator in the bench binary#67

Merged
doubleailes merged 1 commit into
mainfrom
try-mimalloc
May 15, 2026
Merged

perf(examples): use mimalloc as global allocator in the bench binary#67
doubleailes merged 1 commit into
mainfrom
try-mimalloc

Conversation

@doubleailes
Copy link
Copy Markdown
Owner

Summary

Stacks on top of #66. Callgrind on the bench binary showed roughly 33 % of cycles inside libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`, `free`, `malloc_consolidate`, `unlink_chunk`, …). The hot allocations are short-lived small ones: `SmallVec` segments inside `Ranges` on every range clone, the per-call `FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes, `String::clone`s for package names. mimalloc's small-object path is measurably cheaper than glibc's on exactly that shape.

This wires `mimalloc::MiMalloc` as `#[global_allocator]` in the bench binary only. Production wheels (`rer-python`) are untouched — switching the allocator in a PyO3 extension has Python-allocator interactions worth treating as a separate change.

Benchmark (188 cases, release, same machine)

Stage Total Mean vs rez
Baseline (main) 43.0 s 230 ms 8.8×
+ Shared cache (#66) 34.4 s 183 ms 11.1×
+ mimalloc, run 1 29.8 s 158 ms 12.8×
+ mimalloc, run 2 29.4 s 156 ms 13.0×

Additive -13 % on top of #66; -32 % cumulative from main.

Correctness

  • `cargo build` — clean.
  • `cargo test` — passes.
  • `cargo test --release -p rer-resolver --test test_rez_benchmark -- --ignored` — 188/188 still match rez 1:1. (The differential test runs through its own test binary, which doesn't enable mimalloc, so its timing didn't change — but it confirms correctness wasn't broken.)

Base

This PR targets `shared-variant-cache` (`#66`). Once #66 merges, GitHub will retarget this to `main` automatically; rebase will be a no-op.

🤖 Generated with Claude Code

@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Base automatically changed from shared-variant-cache to main May 15, 2026 14:27
Callgrind on the bench binary showed roughly 33 % of cycles inside
libc's malloc/free family (`_int_malloc`, `_int_free`, `malloc`,
`free`, `malloc_consolidate`, `unlink_chunk`, …). The hot paths are
short-lived small allocations: `SmallVec` segments inside
`Ranges<RerVersion>` on every range clone, the per-call
`FxHashMap<&Requirement, bool>` in `reduce_by`, hashbrown rehashes,
`String::clone`s for package names. mimalloc's small-object path is
measurably cheaper than glibc's on exactly this shape.

Wire `mimalloc::MiMalloc` as `#[global_allocator]` in the bench
binary only. Production wheels (`rer-python`) are untouched —
switching the global allocator in a PyO3 extension is more nuanced
and is left for a separate change.

## Benchmark (188 cases, release, same machine, two consecutive runs)

| Stage                              | Total   | Mean   | vs rez |
|------------------------------------|--------:|-------:|-------:|
| Baseline (main)                    |  43.0 s | 230 ms |   8.8× |
| + Shared cache (PR #66)            |  34.4 s | 183 ms |  11.1× |
| + mimalloc (this branch), run 1    |  29.8 s | 158 ms |  12.8× |
| + mimalloc (this branch), run 2    |  29.4 s | 156 ms |  13.0× |

Cumulative -32 % from main. Stacks cleanly on top of #66.

## Correctness

188/188 still match rez 1:1 (`cargo test --release -p rer-resolver
--test test_rez_benchmark -- --ignored`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
doubleailes added a commit that referenced this pull request May 15, 2026
The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of
scope-pair indices and then iterates it, calling
`scopes[x].reduce_by(scopes[y].package_request)` for each. Profile
data on rez's 188-case benchmark: 27 M `reduce_by` calls, but
**98.5 %** are immediate early-returns because `scopes[y]`'s family
doesn't appear in `scopes[x]`'s `fam_requires` set.

Doing that check at pair-generation time instead of pair-consumption
time avoids tens of millions of function-call round trips, *and* —
this is the bigger win in practice — keeps the pending set itself
~100× smaller. The downstream cost reduction is across the board:

- the `FxHashSet` doesn't rehash through 27 M inserts
- `pending.sort_unstable()` runs over ~400 K entries instead of 27 M
- the consume loop pops ~400 K entries instead of 27 M
- fewer allocations from the inner `pending_set` growth

Implementation:

- Add `PackageScope::fam_requires_contains(name)` that delegates to
  the slice (or returns `false` for conflict / ephemeral scopes).
- Replace each `pending_set.insert((x, y))` with a `maybe_push`
  helper that applies the filter.
- Apply the same filter in the cascade path inside the consume loop
  (when scope `x` narrows, we now only push `(j, x)` for `j` that
  actually depend on `x`'s family).

## Benchmark (188 cases, release, same machine)

| Stage                            | Total   | Mean   | vs rez  |
|----------------------------------|--------:|-------:|--------:|
| Baseline (main)                  |  43.0 s | 230 ms |    8.8× |
| + Shared cache (PR #66)          |  34.4 s | 183 ms |   11.1× |
| + mimalloc (PR #67)              |  29.4 s | 156 ms |   13.0× |
| **+ this filter, run 1**         |**21.9 s**|**117 ms**|**17.5×**|
| **+ this filter, run 2**         |**22.3 s**|**119 ms**|**17.1×**|

-25 % on top of #67 and **-49 % cumulative from main**. The
differential test got the same lift — `cargo test … --release` over
the 188 cases dropped from ~46 s to ~38 s.

## Correctness

188/188 still match rez 1:1 (`cargo test --release -p rer-resolver
--test test_rez_benchmark -- --ignored`). The filter is sound: if
`scopes[x].fam_requires_contains(name)` is `false`, then no variant
in `scopes[x]` mentions that family, so `reduce_by` would have
returned `Unchanged` with no reductions — skipping the call
preserves both the result and the reduction list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@doubleailes doubleailes merged commit e6e5e01 into main May 15, 2026
24 checks passed
@doubleailes doubleailes deleted the try-mimalloc branch May 15, 2026 14:36
doubleailes added a commit that referenced this pull request May 15, 2026
The phase loop builds a `pending_set: FxHashSet<(usize, usize)>` of
scope-pair indices and then iterates it, calling
`scopes[x].reduce_by(scopes[y].package_request)` for each. Profile
data on rez's 188-case benchmark: 27 M `reduce_by` calls, but
**98.5 %** are immediate early-returns because `scopes[y]`'s family
doesn't appear in `scopes[x]`'s `fam_requires` set.

Doing that check at pair-generation time instead of pair-consumption
time avoids tens of millions of function-call round trips, *and* —
this is the bigger win in practice — keeps the pending set itself
~100× smaller. The downstream cost reduction is across the board:

- the `FxHashSet` doesn't rehash through 27 M inserts
- `pending.sort_unstable()` runs over ~400 K entries instead of 27 M
- the consume loop pops ~400 K entries instead of 27 M
- fewer allocations from the inner `pending_set` growth

Implementation:

- Add `PackageScope::fam_requires_contains(name)` that delegates to
  the slice (or returns `false` for conflict / ephemeral scopes).
- Replace each `pending_set.insert((x, y))` with a `maybe_push`
  helper that applies the filter.
- Apply the same filter in the cascade path inside the consume loop
  (when scope `x` narrows, we now only push `(j, x)` for `j` that
  actually depend on `x`'s family).

## Benchmark (188 cases, release, same machine)

| Stage                            | Total   | Mean   | vs rez  |
|----------------------------------|--------:|-------:|--------:|
| Baseline (main)                  |  43.0 s | 230 ms |    8.8× |
| + Shared cache (PR #66)          |  34.4 s | 183 ms |   11.1× |
| + mimalloc (PR #67)              |  29.4 s | 156 ms |   13.0× |
| **+ this filter, run 1**         |**21.9 s**|**117 ms**|**17.5×**|
| **+ this filter, run 2**         |**22.3 s**|**119 ms**|**17.1×**|

-25 % on top of #67 and **-49 % cumulative from main**. The
differential test got the same lift — `cargo test … --release` over
the 188 cases dropped from ~46 s to ~38 s.

## Correctness

188/188 still match rez 1:1 (`cargo test --release -p rer-resolver
--test test_rez_benchmark -- --ignored`). The filter is sound: if
`scopes[x].fam_requires_contains(name)` is `false`, then no variant
in `scopes[x]` mentions that family, so `reduce_by` would have
returned `Unchanged` with no reductions — skipping the call
preserves both the result and the reduction list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
doubleailes added a commit that referenced this pull request May 15, 2026
`PackageVariantSlice::extractable` was a `HashSet::is_subset` call
that iterated `common_fams` and tested membership in `extracted_fams`
on every probe. Fresh callgrind on the rez 188-case benchmark (after
#66/#67/#68/#70) put `HashSet::is_subset` at **11.8 %** of inclusive
cycles — nearly half of `PackageVariantSlice::extract`'s total
25.3 %. Every `extract()` call hits this guard; the vast majority
return early with "nothing left to extract."

The set operation is unnecessary. `extracted_fams` is only ever
populated by inserting an element of `common_fams.difference(
extracted_fams)` in `extract`, and `copy_with_entries` resets it to
empty. So `extracted_fams ⊆ common_fams` always holds, and under
that invariant the `is_subset` check is equivalent to a length
compare:

    common_fams.is_subset(extracted_fams)
      ⟺ common_fams ⊆ extracted_fams ⟺ (since extracted ⊆ common)
        common_fams == extracted_fams ⟺ |common| == |extracted|

Replace the body with `common_fams().len() > extracted_fams.len()`.

## Benchmark (188 cases, release, same machine, two runs)

| Stage                                | Total   | Mean   | vs rez |
|--------------------------------------|--------:|-------:|-------:|
| Baseline (main, post-#70)            |  18.6 s |  99 ms |  20.5× |
| + this change, run 1                 |  13.1 s |  70 ms |  29.1× |
| + this change, run 2                 |  13.9 s |  74 ms |  27.4× |

**-30 % on top of #70**, **-69 % cumulative from main** (43.0 s →
13.1 s, 8.8× rez → 29.1× rez).

Differential test got the same lift: 188/188 still match rez 1:1,
in 20.77 s (down from 27.67 s).

Predicted gain was 5–10 %. Like #68, hidden downstream costs (the
`is_subset` iterator setup/teardown, the hash lookups it performed,
and the now-unnecessary `common_fams_cache` first-time computation
on slices that never need extraction) made the actual gain larger.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
doubleailes added a commit that referenced this pull request May 15, 2026
Callgrind on rez's 188-case benchmark (post-#71/#72) showed
`SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from
`VersionRange::clone`. Every `Requirement::clone()` (in
`extracted_request.clone()`, the per-pair `package_request.clone()`
in `reduce_by`, the `req.clone()` and `package_request.clone()` in
`Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of
`(Bound, Bound)` segments. After the rest of the perf stack
(#66/#67/#68/#70/#71/#72), this is the largest non-amortised
allocation cost left.

Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`.
`Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the
inner `T`, so the derived semantics on `VersionRange` are unchanged.

Methods that build a new range (`intersection`, `union`, `complement`,
`from_versions`, `span`, `split`, ...) still produce a fresh `Ranges`
internally and wrap it with `Rc::new` — the win is on the read /
clone path, not the construction path.

`as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges`
now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is
shared, but is the consume-the-`VersionRange` API and rare in
practice.

## Benchmark (188 cases, release, same machine, two runs)

| Stage                              | Total   | Mean   | vs rez |
|------------------------------------|--------:|-------:|-------:|
| Baseline (post-#71/#72), median    | ~12.7 s |  68 ms | ~30×   |
| + this change, run 1               |  11.2 s |  60 ms |  34.1× |
| + this change, run 2               |  11.3 s |  60 ms |  33.7× |

**~11 % on top of #72**, **~74 % cumulative from main** (43.0 s →
11.2 s, 8.8× rez → 34.1× rez).

Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still
match rez 1:1**.

Predicted 3–5 %. The slightly bigger gain reflects that
`VersionRange::clone` cascades into a lot more than just the
`SmallVec::extend` it was attributed to in the callgrind exclusive
view — it also drove allocator-side work and the matching `Drop`s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
doubleailes added a commit that referenced this pull request May 15, 2026
Callgrind on rez's 188-case benchmark (post-#71/#72) showed
`SmallVec::extend` + `Drop` at ~4 % of cycles, almost entirely from
`VersionRange::clone`. Every `Requirement::clone()` (in
`extracted_request.clone()`, the per-pair `package_request.clone()`
in `reduce_by`, the `req.clone()` and `package_request.clone()` in
`Reduction`, etc.) deep-copies the inner `Ranges`'s `SmallVec` of
`(Bound, Bound)` segments. After the rest of the perf stack
(#66/#67/#68/#70/#71/#72), this is the largest non-amortised
allocation cost left.

Switch the inner from `Ranges<RerVersion>` to `Rc<Ranges<RerVersion>>`.
`Rc<T>::clone` is a refcount bump; `Rc<T>::Hash`/`Eq` defer to the
inner `T`, so the derived semantics on `VersionRange` are unchanged.

Methods that build a new range (`intersection`, `union`, `complement`,
`from_versions`, `span`, `split`, ...) still produce a fresh `Ranges`
internally and wrap it with `Rc::new` — the win is on the read /
clone path, not the construction path.

`as_ranges()` still returns `&Ranges` (via `Rc::deref`). `into_ranges`
now uses `Rc::unwrap_or_clone` — falls back to a clone if the `Rc` is
shared, but is the consume-the-`VersionRange` API and rare in
practice.

## Benchmark (188 cases, release, same machine, two runs)

| Stage                              | Total   | Mean   | vs rez |
|------------------------------------|--------:|-------:|-------:|
| Baseline (post-#71/#72), median    | ~12.7 s |  68 ms | ~30×   |
| + this change, run 1               |  11.2 s |  60 ms |  34.1× |
| + this change, run 2               |  11.3 s |  60 ms |  33.7× |

**~11 % on top of #72**, **~74 % cumulative from main** (43.0 s →
11.2 s, 8.8× rez → 34.1× rez).

Differential test (`cargo test … --ignored`): 17.73 s, **188/188 still
match rez 1:1**.

Predicted 3–5 %. The slightly bigger gain reflects that
`VersionRange::clone` cascades into a lot more than just the
`SmallVec::extend` it was attributed to in the callgrind exclusive
view — it also drove allocator-side work and the matching `Drop`s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant