
Optimize table.init instruction and instantiation #2847

Merged
@alexcrichton merged 2 commits into bytecodealliance:main from speed-up-table-init on Apr 19, 2021

Conversation

alexcrichton (Member)

This commit optimizes table initialization as part of instance
instantiation and also applies the same optimization to the `table.init`
instruction. One part of this commit removes some preexisting
duplication between instance instantiation and the `table.init`
instruction itself. With that done, the actual implementation of
`table.init` is optimized to perform fewer bounds checks and to use a
much tighter loop for instantiation.

A big fallout from this change is that memory/table initializer offsets
are now stored as `u32` instead of `usize`, which removes a number of
casts. This ended up requiring that some overflow checks previously
performed during parsing move later, into the runtime code itself,
because otherwise the wrong spec test errors are emitted during testing.
I've tried to trace where these can possibly overflow, and I believe
I've covered everything.

In a local synthetic test instantiating an empty module with a single
80,000-element initializer, this improves total instantiation time by
4x (562us => 141us).

@github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:wasm labels on Apr 19, 2021
@fitzgen fitzgen (Member) left a comment:

r=me with comments below addressed

Comment on lines +576 to +580

```rust
        unsafe { Some(&*self.vmctx_plus_offset(self.offsets.vmctx_anyfunc(index))) }
    }

    // Before:
    unsafe fn anyfunc_ptr(&self, index: FuncIndex) -> *mut VMCallerCheckedAnyfunc {
        self.vmctx_plus_offset(self.offsets.vmctx_anyfunc(index))
    }

    // After:
    unsafe fn anyfunc_base(&self) -> *mut VMCallerCheckedAnyfunc {
        self.vmctx_plus_offset(self.offsets.vmctx_anyfuncs_begin())
    }
```
Member:

Why this change? The anyfunc_ptr call should be inlined just fine, unless I'm missing something. This just makes things more open coded, which isn't the direction I'd generally move in.

Member Author:

This mostly changed because it was previously used in a loop, but the pattern it was using was a bit slower than necessary. The main slowdown is that `vmctx_anyfunc` performed a multiplication with the size of a pointer, where neither the size nor the index was constant, resulting in bad codegen. By using native types here, pointer arithmetic is much speedier, since the element size is known at compile time.

Once I removed one main use of the function in the table initialization bits I figured I might as well go ahead and remove the function and speed up other callsites as well.

I do agree it's a bit more open-coded, but then again everything about `VMContext` feels sort of inherently open-coded.

Member:

Huh, I would really have expected `anyfunc_ptr` to be inlined and then for instcombine to see that one of the operands of the multiplication was a constant and go to town from there.

Member Author:

Oh, it is inlined, but that's not the problem here. The called function has a multiplication by `3 * self.pointer_size`, but `self.pointer_size` is not a constant value. We could probably improve codegen by having something like a `NativeVMOffsets` for when you're specifically not cross-compiling, but for now LLVM has no way of seeing that the value originally stored there was `mem::size_of::<usize>()`.

Member:

Ah, I get it now. Thanks for your patience explaining it to me!

```rust
    };

    for (item, slot) in items.zip(elements) {
        Self::set_raw(ty, slot, item.into())
    }
```
Member:

Does this get boiled down into a memcpy by LLVM? If so, we can close #983. If not, would we be able to do that here?

Member:

It seems like this has just one call site, and I don't think anything is preventing us from doing a memcpy in theory here.

Member Author:

Oh, this should actually probably close that issue. We don't actually have anything to memcpy, though, because our element segments are stored as lists of function indices, but the tables themselves store `*mut VMCallerCheckedAnyfunc`. This means there's a bit of a mismatch: unless we allocate table segments per-instance to memcpy from (which I don't think we should do), there's actually nothing to memcpy.

Otherwise, though, I just used `set_raw` here to be the same as the other bits and pieces of this module.

Member Author:

I went ahead, though, and made it a bit more raw to streamline this a tiny bit more.

crates/runtime/src/table.rs (outdated review thread, resolved)
@alexcrichton alexcrichton merged commit 193551a into bytecodealliance:main Apr 19, 2021
@alexcrichton alexcrichton deleted the speed-up-table-init branch April 19, 2021 23:44
mchesser pushed a commit to mchesser/wasmtime that referenced this pull request May 24, 2021
* Optimize `table.init` instruction and instantiation
* Review comments
Labels: cranelift (Issues related to the Cranelift code generator), cranelift:wasm
2 participants