Modify a SmallVec inline size for UseList to be slightly larger. by cfallin · Pull Request #93 · bytecodealliance/regalloc2

cfallin · 2022-10-05T01:13:00Z

This PR updates the UseList type alias to a SmallVec with 4
Uses (which are 4 bytes each) rather than 2, because we get 16 bytes
of space "for free" in a SmallVec on a 64-bit machine.
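The "for free" space can be illustrated with a simplified layout model. This is a sketch under stated assumptions, not smallvec's actual definition: `FakeUse`, `HeapParts`, `Storage`, `ModelSmallVec`, and `model_size` are hypothetical names, the element is assumed to be 4 bytes as described above, and the container is assumed to be a capacity word plus a union of inline storage and a heap pointer/length pair.

```rust
use std::mem::{size_of, MaybeUninit};

/// Hypothetical 4-byte stand-in for a `Use` (an assumption for illustration).
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct FakeUse(u32);

/// Heap-mode payload: pointer plus length (16 bytes on a 64-bit target).
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct HeapParts<T> {
    ptr: *mut T,
    len: usize,
}

/// Inline elements and the heap payload occupy the same bytes.
#[allow(dead_code)]
#[repr(C)]
union Storage<T: Copy, const N: usize> {
    inline: [MaybeUninit<T>; N],
    heap: HeapParts<T>,
}

/// Simplified small-vector model; NOT the real smallvec type.
#[allow(dead_code)]
#[repr(C)]
struct ModelSmallVec<T: Copy, const N: usize> {
    capacity: usize,
    storage: Storage<T, N>,
}

/// Size in bytes of the model holding `N` inline `FakeUse` elements.
fn model_size<const N: usize>() -> usize {
    size_of::<ModelSmallVec<FakeUse, N>>()
}
```

In this model on a 64-bit target, `model_size::<2>()` and `model_size::<4>()` are both 24 bytes, because the union is already 16 bytes wide to hold the heap pointer and length; only a fifth inline element makes the struct grow.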

This PR improves the compilation performance of Cranelift by 1% on
SpiderMonkey.wasm (measured on a Linux desktop with pinned CPU
frequency, and pinned to one core).

It's also worth noting that before making these changes, I explored
whether it would be possible to put the lists of uses and liveranges
in single large backing Vecs; the basic reason why we can't do this
is that during liverange construction, we append to many lists
concurrently (that is, appends to different lists are interleaved).
One could use a linked-list arrangement, and in fact RA2 did this
early in its development, but the separate SmallVecs were better for
performance overall because cache locality wins when we traverse the
lists many times. It may still be worth investigating the use of an
arena to allocate the vecs rather than the default heap allocator.
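To make the trade-off concrete, here is a minimal sketch of the linked-list-in-one-backing-Vec arrangement. It is an illustration only, not RA2's early implementation; `SharedBacking`, `Node`, `ListHandle`, and `interleaved_demo` are hypothetical names. It shows that when appends to two lists are interleaved, each list's nodes end up scattered through the backing store, which is the locality cost mentioned above.

```rust
/// Sentinel index meaning "no node".
const NONE: usize = usize::MAX;

/// One list element stored in the shared backing Vec.
struct Node {
    value: u32,
    next: usize,
}

/// Handle to one logical list: indices of its first and last nodes.
struct ListHandle {
    first: usize,
    last: usize,
}

impl ListHandle {
    fn empty() -> Self {
        ListHandle { first: NONE, last: NONE }
    }
}

/// Single backing store shared by every list.
struct SharedBacking {
    nodes: Vec<Node>,
}

impl SharedBacking {
    fn new() -> Self {
        SharedBacking { nodes: Vec::new() }
    }

    /// Appends `value` to `list`; the node goes wherever the backing Vec ends.
    fn push(&mut self, list: &mut ListHandle, value: u32) {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, next: NONE });
        if list.first == NONE {
            list.first = idx;
        } else {
            self.nodes[list.last].next = idx;
        }
        list.last = idx;
    }

    /// Walks one list, returning (backing index, value) pairs in list order.
    fn walk(&self, list: &ListHandle) -> Vec<(usize, u32)> {
        let mut out = Vec::new();
        let mut cur = list.first;
        while cur != NONE {
            let node = &self.nodes[cur];
            out.push((cur, node.value));
            cur = node.next;
        }
        out
    }
}

/// Interleaves appends to two lists and reports where their nodes landed.
fn interleaved_demo() -> (Vec<(usize, u32)>, Vec<(usize, u32)>) {
    let mut backing = SharedBacking::new();
    let mut a = ListHandle::empty();
    let mut b = ListHandle::empty();
    for i in 0..3 {
        backing.push(&mut a, i);
        backing.push(&mut b, 100 + i);
    }
    (backing.walk(&a), backing.walk(&b))
}
```

In this sketch, list `a` occupies backing indices 0, 2, and 4 while list `b` occupies 1, 3, and 5, so traversing either list strides across the other's nodes rather than reading contiguous memory.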

cfallin · 2022-10-05T01:13:42Z

I'll note also that this PR adds the union feature to smallvec, but that didn't make any difference on its own (probably because Cranelift also enables it now so it was already on in the use-case of regalloc2 that I was testing).

jameysharp

I think there's something strange going on here.

The Extend implementation for SmallVec uses Iterator::size_hint to reserve an appropriate amount of space. And the FromIterator implementation, used in Iterator::collect, just allocates a new vector and then calls extend on it. So for any iterator where size_hint yields a good approximation of the length of the iterator, collect() should be equivalent to SmallVec::with_capacity followed by extend.

Iterators over slices have an exact implementation of size_hint, and the Skip and Cloned iterators preserve however much precision was in the preceding iterator chain.

So aside from the change to double the inline array size in UseList (which I strongly approve of), I think this patch should have had zero effect on performance. How confident are you in your measurements? I'd want to dig deeper if there's a measurable effect from this despite all the performance tuning that's gone into SmallVec and Iterator.
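The size-hint behavior described above can be checked with the standard library alone. The sketch below uses `Vec` as a stand-in for `SmallVec` so that it needs no external crate; the function names `tail_hint`, `collect_tail`, and `presized_tail` are hypothetical, and the comparison only demonstrates that the hint stays exact and that both construction styles yield the same contents.

```rust
/// Size hint of a slice iterator after `.skip(n).cloned()`.
fn tail_hint(items: &[u32], n: usize) -> (usize, Option<usize>) {
    items.iter().skip(n).cloned().size_hint()
}

/// Builds the tail by relying on `collect()` to consult the size hint.
fn collect_tail(items: &[u32], n: usize) -> Vec<u32> {
    items.iter().skip(n).cloned().collect()
}

/// Builds the tail with explicit presizing followed by `extend`.
fn presized_tail(items: &[u32], n: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(items.len().saturating_sub(n));
    out.extend(items.iter().skip(n).cloned());
    out
}
```

For a five-element slice with two elements skipped, `tail_hint` reports exactly three remaining elements as both the lower and upper bound, so an explicit `with_capacity` call has no extra information to offer.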

cfallin · 2022-10-05T01:44:14Z

Huh, that's really weird -- given that, I agree that there shouldn't be an effect. I wasn't aware that the size hinting was preserved even through .skip()! I will look at this again tomorrow and measure on another system (I was using hyperfine on my M1 system for these measurements, constraining to 1 thread to reduce variability; I'll try on my Linux desktop with frequency pinned).

cfallin · 2022-10-07T18:35:48Z

I did some more controlled measurements of the two parts of this PR (the explicit sizing vs. relying on size hints to .collect(), and the change to the number of Uses in the SmallVec inline portion of a UseList). In the results below, wasmtime.main is unmodified RA2, wasmtime.branch is this branch with both changes, and wasmtime.branch2 is this branch with just the smallvec-inline-size change:

[cfallin@xap]~/work/wasmtime% hyperfine 'taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm' 'taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm' 'taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.500 s ±  0.034 s    [User: 13.061 s, System: 0.347 s]
  Range (min … max):   13.461 s … 13.577 s    10 runs

Benchmark 2: taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.344 s ±  0.022 s    [User: 12.911 s, System: 0.344 s]
  Range (min … max):   13.316 s … 13.370 s    10 runs

Benchmark 3: taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.358 s ±  0.017 s    [User: 12.926 s, System: 0.343 s]
  Range (min … max):   13.336 s … 13.390 s    10 runs

Summary
  'taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm' ran
    1.00 ± 0.00 times faster than 'taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm'
    1.01 ± 0.00 times faster than 'taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm'

So in other words, a reliable 1% improvement but just from the smallvec inline size change. The iterator size hinting is indeed working as you describe, so the other half didn't have an effect. I'll update the PR to contain just the first part -- thanks!
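The 1% figure follows directly from the mean times in the hyperfine output above. A small sketch of the arithmetic (the helper names `speedup` and `percent_faster` are hypothetical; the inputs are the reported means in seconds):

```rust
/// Ratio of baseline time to candidate time (above 1.0 means the candidate is faster).
fn speedup(baseline_s: f64, candidate_s: f64) -> f64 {
    baseline_s / candidate_s
}

/// Wall-clock time saved by the candidate, as a percentage of the baseline.
fn percent_faster(baseline_s: f64, candidate_s: f64) -> f64 {
    (baseline_s - candidate_s) / baseline_s * 100.0
}

/// Mean times reported above, in seconds: (main, branch, branch2).
fn reported_means() -> (f64, f64, f64) {
    (13.500, 13.344, 13.358)
}
```

With these means, both branches come out at roughly 1.01 times faster than main (about 1.1 to 1.2 percent less time), while the two branches differ from each other by only about 0.1 percent, which is consistent with the summary lines hyperfine printed.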

elliottt

Thanks for the benchmarks and writeup!

cfallin requested review from elliottt and jameysharp October 5, 2022 01:13

jameysharp reviewed Oct 5, 2022

cfallin force-pushed the smallvec-prealloc branch from b05c254 to 7c9497d Compare October 7, 2022 18:37

cfallin changed the title ~~Avoid some smallvec-resize allocations.~~ Modify a SmallVec inline size for UseList to be slightly larger. Oct 7, 2022

elliottt approved these changes Oct 7, 2022

cfallin merged commit 1efaa73 into bytecodealliance:main Oct 7, 2022

cfallin deleted the smallvec-prealloc branch October 7, 2022 20:37
