
Optimize Func::call and its C API #3319

Merged

Conversation

alexcrichton
Member

This commit is an alternative to #3298 which achieves effectively the
same goal: optimizing the `Func::call` API as well as its C API
sibling, `wasmtime_func_call`. The strategy taken here is different
from #3298, though: rather than creating a new API, a small tweak is
made to an existing one. Specifically, this commit addresses the major
sources of slowness in `Func::call`:

  • Looking up the type of a function, to typecheck the arguments and
    guide how the results should be loaded, no longer hits the rwlock
    in the `Engine`; instead each `Func` contains its own `FuncType`.
    This can be an unnecessary allocation for funcs not used with
    `Func::call`, so this is a downside of this implementation relative
    to #3298. A mitigating factor, though, is that instance exports are
    loaded lazily into the `Store`, so in theory not too many funcs are
    active in the store as `Func` objects.

  • Temporary storage is amortized with a long-lived `Vec` in the
    `Store` rather than allocating a new vector on each call. This is
    basically the same strategy as #3294, only applied to different
    types in different places. Specifically, `wasmtime::Store` now
    retains a `Vec<u128>` for `Func::call`, and the C API retains a
    `Vec<Val>` for calling `Func::call`.

  • Finally, an API-breaking change is made to `Func::call` and its
    type signature (as well as `Func::call_async`). Instead of
    returning `Box<[Val]>` as it did before, this function now takes a
    `results: &mut [Val]` parameter. This allows the caller to manage
    the allocation, and we can amortize-remove it in
    `wasmtime_func_call` by using space after the parameters in the
    `Vec<Val>` we're passing in. This is naturally a breaking change
    and we'll want to consider it carefully, but mitigating factors are
    that most embeddings are likely using `TypedFunc::call` instead,
    and this signature taking a mutable slice better aligns with
    `Func::new`, which receives a mutable slice for the results.
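To illustrate the last point, here's a sketch of the signature change with stub types (an illustration only, not the real wasmtime definitions); the key difference is that the caller now owns the results allocation:

```rust
// Stub value type standing in for wasmtime's `Val` (illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Val {
    I32(i32),
    I64(i64),
}

// Old shape: every call allocates a fresh boxed slice for the results.
fn call_old(_params: &[Val]) -> Box<[Val]> {
    vec![Val::I32(0)].into_boxed_slice()
}

// New shape: results are written into caller-provided storage, so the
// caller can reuse one allocation across many calls (or, in the C API,
// use spare capacity in the same `Vec<Val>` that holds the parameters).
fn call_new(_params: &[Val], results: &mut [Val]) {
    for r in results.iter_mut() {
        *r = Val::I32(0);
    }
}
```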

Overall, in the benchmark of "call a nop function from the C API",
this change is not quite as good as #3298: it's still a bit slower, on
the order of 15ns, because there are lots of capacity checks around
vectors and the type checks are slightly less optimized than before.
It's still significantly better than today, though, because the
allocations and the rwlock to acquire the type information are both
avoided. I personally feel this change is the best option because it
has less of an API impact than #3298.
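The temporary-storage amortization described above boils down to a take-and-restore pattern on a long-lived buffer. A minimal sketch, assuming a hypothetical `Store` with a `scratch` field (not the actual wasmtime internals):

```rust
// Hypothetical store holding a long-lived scratch buffer that is
// reused across calls instead of allocating a fresh vector each time.
struct Store {
    scratch: Vec<u128>,
}

impl Store {
    fn with_scratch<R>(&mut self, f: impl FnOnce(&mut Vec<u128>) -> R) -> R {
        // Move the buffer out so the closure can use it without
        // holding a borrow of `self`.
        let mut buf = std::mem::take(&mut self.scratch);
        buf.clear();
        let result = f(&mut buf);
        // Put it back so its capacity is retained for the next call.
        self.scratch = buf;
        result
    }
}
```

After the first few calls the buffer's capacity stabilizes and no further allocations occur on the call path.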


@alexcrichton
Member Author

To give an idea of the numbers I'm looking at, here's some data. I've got a C program that shows the following timings on various wasmtime versions, measured with both the wasm.h and wasmtime.h APIs. Times are all in nanoseconds.

What's being benchmarked:

nop

host calls a wasm function that does nothing

| version | wasm.h | wasmtime.h |
|---------|--------|------------|
| 0.26.1  | 45     |            |
| 0.27.0  | 104    |            |
| 0.28.0  | 115    | 113        |
| 0.29.0  | 117    | 117        |
| this PR | 43     | 51         |

i64

host calls a wasm function. Wasm function calls a host function returning i64. Wasm then returns that i64.

| version | wasm.h | wasmtime.h |
|---------|--------|------------|
| 0.26.1  | 181    |            |
| 0.27.0  | 247    |            |
| 0.28.0  | 258    | 240        |
| 0.29.0  | 258    | 245        |
| this PR | 130    | 104        |

many

host calls a wasm function. Wasm function calls a host function with 5 i32 params and one i32 return. Wasm discards result and returns.

| version | wasm.h | wasmtime.h |
|---------|--------|------------|
| 0.26.1  | 223    |            |
| 0.27.0  | 275    |            |
| 0.28.0  | 295    | 258        |
| 0.29.0  | 284    | 261        |
| this PR | 173    | 138        |

This is all still a far cry from what Rust can do, though: while this PR certainly improves the state of affairs, Rust far outstrips it. Rust has the option of using "typed" and "untyped" functions, effectively using `Func` or `TypedFunc` to call wasm functions and `Func::wrap`/`Func::new` to define host functions. The table below shows timings (in nanoseconds) of calling the functions in a typed/untyped fashion:

when imports are defined with Func::wrap

| func | typed | untyped |
|------|-------|---------|
| nop  | 20    | 34      |
| i64  | 21    | 41      |
| many | 23    | 37      |

when imports are defined with Func::new

| func | typed | untyped |
|------|-------|---------|
| i64  | 43    | 62      |
| many | 85    | 100     |

The basic summary here is that when we have "typed" versions of everything, basically any combination of arguments/returns costs roughly 20ns of call-time overhead; that's the minimum threshold for calling a wasm function. Once "untyped" paths are used, per-argument and per-result overhead starts to appear. This shows up both when calling wasm functions and when calling host functions, and can be seen in how "many" is slower than "i64", which is slower than "nop". Finally, the C API imposes about a 40ns blanket overhead on top of the wasmtime "untyped" Rust versions, mostly because `wasmtime_val_t` needs to be converted to `Val` and back.

Basically at this point I think that the "untyped" variants, which C is currently forced to use, are about as optimized as they're going to get. There's still inherent overhead in translating between `wasmtime_val_t` and `Val`, and that translation can only be made so fast. I think the only way forward to improve the C bindings is to somehow get a "typed" variant in there, which opens up opportunities to remove type translations and type checks entirely.
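For reference, the translation overhead described here comes from a per-value copy in each direction on every call. Roughly this shape, with stub enums standing in for the real `wasmtime_val_t` and `Val` types:

```rust
// Rust-side value (stand-in for wasmtime's `Val`; illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Val {
    I32(i32),
    I64(i64),
}

// C-side tagged union (stand-in for `wasmtime_val_t`).
#[derive(Clone, Copy)]
enum CVal {
    I32(i32),
    I64(i64),
}

// Every untyped call crosses this boundary twice (arguments in,
// results out), which is where both the blanket per-call overhead and
// the per-value overhead come from.
fn c_to_rust(vals: &[CVal]) -> Vec<Val> {
    vals.iter()
        .map(|v| match *v {
            CVal::I32(x) => Val::I32(x),
            CVal::I64(x) => Val::I64(x),
        })
        .collect()
}
```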

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 13, 2021
This commit adds a new API to the `wasmtime::Func` type for wrapping a
C-ABI function with a well-defined signature derived from a wasm type
signature. The purpose of this API is to add the-most-optimized-we-can
path for using the C API and having wasm call host functions. Previously
when wasm called a host function it would perform these steps:

1. Using a trampoline, place all arguments into a `u128*` array on the
   stack.
2. Call `Func::invoke` which uses the type of the function (dynamically)
   to read values from this `u128*`.
3. Values are placed within a `Vec<Val>` after being read.
4. The C API receives `&[Val]` and translates this to
   `&[wasmtime_val_t]`, iterating over each value and copying its
   contents into a new vector.
5. Then the host function is actually called.
6. The above argument-shuffling steps are all performed in reverse for
   the results, shipping everything through `wasmtime_val_t` and `Val`.

PRs such as bytecodealliance#3319 and related attempts have made this sequence faster,
but the numbers on bytecodealliance#3319 show that even after we get all the allocation
and such bits out of the way we're still spending quite a lot of time
shuffling arguments back-and-forth relative to the `Func::wrap` API that
Rust can use.

This commit fixes the issue by eliminating all steps except 1/5 above.
Although we still place all arguments on the stack and read them out
again to call the C-defined function with, it's much faster than pushing
this all through the `Val` and `wasmtime_val_t` machinery. This overall
gets the cost of a wasm->host call basically on-par with `Func::wrap`,
although it's still not as optimized. Benchmarking the overhead of
wasm->host calls, where `i64` returns one i64 value and `many` takes 5
`i32` parameters and returns one `i64` value, the numbers I get are:

| Import | Rust | C before | C after |
|--------|------|----------|---------|
| `i64`  | 1ns  | 40ns     | 7ns     |
| `many` | 1ns  | 91ns     | 10ns    |

This commit is a clear win over the previous implementation, but it's
still somewhat far away from Rust. That being said, I think I'm out of
ideas for how to make this better. Without open-coding much larger
portions of `wasmtime` I'm not sure how much better we can get here. The
time in C after this commit is almost entirely spent in trampolines
storing the arguments to the stack and loading them later, and at this
point I'm not sure how much more optimized we can get than that since
Rust needs to enter the picture here somehow to handle the Wasmtime
fiddly-bits of calling back into C. I'm hopeful, though, that this is
such a large improvement from before that the question of optimizing
this further in C is best left for another day.

The new `Func::wrap_cabi` method is unlikely to ever get used from Rust,
but a `wasmtime_func_wrap` API was added to the C API to mirror
`Func::wrap` where if the host function pointer has a specific ABI this
function can be called instead of `wasmtime_func_new`.
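The idea in this commit can be sketched as follows, with a hypothetical C-ABI callback shape (the real `Func::wrap_cabi`/`wasmtime_func_wrap` signatures may differ): the host function reads its arguments from, and writes its results into, the same raw `u128` storage the trampoline already fills in step 1.

```rust
use std::ffi::c_void;

// Hypothetical C-ABI callback shape: arguments and results share one
// raw `u128` array, so no `Val`/`wasmtime_val_t` conversion happens.
type CAbiCallback = unsafe extern "C" fn(env: *mut c_void, storage: *mut u128);

// Example host function: read an i64 argument, write the i64 result
// back into the same slot.
unsafe extern "C" fn add_one(_env: *mut c_void, storage: *mut u128) {
    // SAFETY: the caller guarantees `storage` points at least one u128.
    unsafe {
        let arg = *storage as i64;
        *storage = (arg + 1) as u128;
    }
}

// Trampoline side: invoke the callback directly on the raw storage,
// skipping steps 2-4 and their value conversions entirely.
fn invoke(cb: CAbiCallback, storage: &mut [u128]) {
    unsafe { cb(std::ptr::null_mut(), storage.as_mut_ptr()) }
}
```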
@alexcrichton
Member Author

I'll note that I think we still want this even in the face of #3345. I personally like the idea of taking `&mut [Val]` instead of returning `Box<[Val]>` for `Func::call`, and otherwise all the optimizations here are internal implementation details we can fiddle with over time.

Member

@peterhuene peterhuene left a comment


Looks great, just the one question!

crates/c-api/src/func.rs (resolved)
@alexcrichton alexcrichton merged commit bcf3544 into bytecodealliance:main Sep 21, 2021
@alexcrichton alexcrichton deleted the optimize-call-take-two branch September 21, 2021 19:07
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 21, 2021
This commit fixes a "merge conflict" with bytecodealliance#3319 being merged into
`main`, causing CI failures on merge.