Skip to content

Modify a SmallVec inline size for UseList to be slightly larger.#93

Merged
cfallin merged 1 commit intobytecodealliance:mainfrom
cfallin:smallvec-prealloc
Oct 7, 2022
Merged

Modify a SmallVec inline size for UseList to be slightly larger.#93
cfallin merged 1 commit intobytecodealliance:mainfrom
cfallin:smallvec-prealloc

Conversation

@cfallin
Copy link
Copy Markdown
Member

@cfallin cfallin commented Oct 5, 2022

This PR updates the UseList type alias to a SmallVec with 4
Uses (which are 4 bytes each) rather than 2, because we get 16 bytes
of space "for free" in a SmallVec on a 64-bit machine.

This PR improves the compilation performance of Cranelift by 1% on
SpiderMonkey.wasm (measured on a Linux desktop with pinned CPU
frequency, and pinned to one core).

It's worth noting also that before making these changes, I explored
whether it would be possible to put the lists of uses and liveranges
in single large backing Vecs; the basic reason why we can't do this
is that during liverange construction, we append to many lists
concurrently. One could use a linked-list arrangement, and in fact RA2
did this early in its development; the separate SmallVecs were
better for performance overall because the cache locality wins when we
traverse the lists many times. It may still be worth investigating use
of an arena to allocate the vecs rather than the default heap allocator.

@cfallin cfallin requested review from elliottt and jameysharp October 5, 2022 01:13
@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Oct 5, 2022

I'll note also that this PR adds the union feature to smallvec, but that didn't make any difference on its own (probably because Cranelift also enables it now so it was already on in the use-case of regalloc2 that I was testing).

Copy link
Copy Markdown
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's something strange going on here.

The Extend implementation for SmallVec uses Iterator::size_hint to reserve an appropriate amount of space. And the FromIterator implementation, used in Iterator::collect, just allocates a new vector and then calls extend on it. So for any iterator where size_hint yields a good approximation of the length of the iterator, collect() should be equivalent to SmallVec::with_capacity followed by extend.

Iterators over slices have an exact implementation of size_hint, and the Skip and Cloned iterators preserve however much precision was in the preceding iterator chain.

So aside from the change to double the inline array size in UseList (which I strongly approve of), I think this patch should have had zero effect on performance. How confident are you in your measurements? I'd want to dig deeper if there's a measurable effect from this despite all the performance tuning that's gone into SmallVec and Iterator.

@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Oct 5, 2022

Huh, that's really weird -- given that, I agree that there shouldn't be an effect. I wasn't aware that the size hinting was preserved even through .skip()! I will look at this again tomorrow and measure on another system (I was using hyperfine on my M1 system for these measurements, constraining to 1 thread to reduce variability; I'll try on my Linux desktop with frequency pinned).

@cfallin
Copy link
Copy Markdown
Member Author

cfallin commented Oct 7, 2022

I did some more controlled measurements of the two parts to this PR (the explicit sizing vs. relying on size hints to .collect(), and the change to the number of Uses in the SmallVec inline portion of a UseList). In the below, wasmtime.main is unmodified RA2, wasmtime.branch is this branch with both changes, and wasmtime.branch2 is this branch with just the smallvec-inline-size change:

[cfallin@xap]~/work/wasmtime% hyperfine 'taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm' 'taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm' 'taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.500 s ±  0.034 s    [User: 13.061 s, System: 0.347 s]
  Range (min … max):   13.461 s … 13.577 s    10 runs

Benchmark 2: taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.344 s ±  0.022 s    [User: 12.911 s, System: 0.344 s]
  Range (min … max):   13.316 s … 13.370 s    10 runs

Benchmark 3: taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):     13.358 s ±  0.017 s    [User: 12.926 s, System: 0.343 s]
  Range (min … max):   13.336 s … 13.390 s    10 runs

Summary
  'taskset 1 ./wasmtime.branch compile ../wasm-tests/spidermonkey.wasm' ran
    1.00 ± 0.00 times faster than 'taskset 1 ./wasmtime.branch2 compile ../wasm-tests/spidermonkey.wasm'
    1.01 ± 0.00 times faster than 'taskset 1 ./wasmtime.main compile ../wasm-tests/spidermonkey.wasm'

So in other words, a reliable 1% improvement but just from the smallvec inline size change. The iterator size hinting is indeed working as you describe, so the other half didn't have an effect. I'll update the PR to contain just the first part -- thanks!

This PR updates the `UseList` type alias to a `SmallVec` with 4
`Use`s (which are 4 bytes each) rather than 2, because we get 16 bytes
of space "for free" in a `SmallVec` on a 64-bit machine.

This PR improves the compilation performance of Cranelift by 1% on
SpiderMonkey.wasm (measured on a Linux desktop with pinned CPU
frequency, and pinned to one core).

It's worth noting also that before making these changes, I explored
whether it would be possible to put the lists of uses and liveranges
in single large backing `Vec`s; the basic reason why we can't do this
is that during liverange construction, we append to many lists
concurrently. One could use a linked-list arrangement, and in fact RA2
did this early in its development; the separate `SmallVec`s were
better for performance overall because the cache locality wins when we
traverse the lists many times. It may still be worth investigating use
of an arena to allocate the vecs rather than the default heap allocator.
@cfallin cfallin force-pushed the smallvec-prealloc branch from b05c254 to 7c9497d Compare October 7, 2022 18:37
@cfallin cfallin changed the title Avoid some smallvec-resize allocations. Modify a SmallVec inline size for UseList to be slightly larger. Oct 7, 2022
Copy link
Copy Markdown
Member

@elliottt elliottt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the benchmarks and writeup!

@cfallin cfallin merged commit 1efaa73 into bytecodealliance:main Oct 7, 2022
@cfallin cfallin deleted the smallvec-prealloc branch October 7, 2022 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants