Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Func#call by optimizing arg buffer initialization #169

Open
ianks opened this issue Apr 6, 2023 · 2 comments
Open

Speed up Func#call by optimizing arg buffer initialization #169

ianks opened this issue Apr 6, 2023 · 2 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed perf Performance related

Comments

@ianks
Copy link
Collaborator

ianks commented Apr 6, 2023

I was experimenting when making Func#call faster, and noticed there is a decent performance win by avoiding rb_ary_push. It requires a bit of unsafe code, but I believe the implemntation is sound:

diff --git i/bench/host_call.rb w/bench/host_call.rb
index ca4d3fc..b75fe40 100644
--- i/bench/host_call.rb
+++ w/bench/host_call.rb
@@ -1,14 +1,18 @@
 require_relative "bench"
 
-# Call host func (4 args)    337.181k (± 3.7%) i/s -      1.692M in   5.024584s
-# Call host func (16 args)   296.615k (± 5.2%) i/s -      1.498M in   5.064241s
-# Call host func (64 args)   217.487k (± 3.2%) i/s -      1.106M in   5.090547s
-# Call host func (128 args)  119.689k (± 3.8%) i/s -    605.136k in   5.063428s
+# New:
+# Call host func (4 results):     348836.6 i/s
+# Call host func (16 results):    320159.3 i/s - 1.09x  slower
+# Call host func (64 results):    224814.2 i/s - 1.55x  slower
+# Call host func (128 results):   126469.2 i/s - 2.76x  slower
+# Call host func (256 results):   50566.7 i/s  - 6.90x  slower
 
-# Call host func (4 args):   333800.6 i/s
-# Call host func (16 args):  291889.7 i/s - 1.14x  slower
-# Call host func (64 args):  185375.6 i/s - 1.80x  slower
-# Call host func (128 args): 97043.2 i/s - 3.44x  slower
+# Old:
+# Call host func (4 args):      331524.8 i/s
+# Call host func (16 args):     280723.9 i/s - 1.18x  slower
+# Call host func (64 args):     182873.7 i/s - 1.81x  slower
+# Call host func (128 args):    96891.1 i/s  - 3.42x  slower
+# Call host func (256 args):    40983.9 i/s  - 8.09x  slower
 
 Bench.ips do |x|
   engine = Wasmtime::Engine.new
diff --git i/ext/src/ruby_api/func.rs w/ext/src/ruby_api/func.rs
index 1ef4ba8..e42a494 100644
--- i/ext/src/ruby_api/func.rs
+++ w/ext/src/ruby_api/func.rs
@@ -1,3 +1,5 @@
+use std::mem::transmute;
+
 use super::{
     convert::{ToRubyValue, ToSym, ToValTypeVec, ToWasmVal},
     errors::result_error,
@@ -178,16 +180,57 @@ impl<'a> Func<'a> {
             [] => Ok(QNIL.into()),
             [result] => result.to_ruby_value(store),
             _ => {
-                let array = RArray::with_capacity(results.len());
-                for result in results {
-                    array.push(result.to_ruby_value(store)?)?;
+                // We want to initialized a sized ruby array so we can write to
+                // it without using `rb_ary_push`, which is slow. So we just
+                // pass in the results as a slice for the initial values. They
+                // will be overridden by the loop below. But be careful not to
+                // let Ruby try to use the values in the array before we've
+                // written to them.
+                //
+                // For result sizes of 128, this is about 20% faster than using
+                // rb_ary_push.
+                //
+                // # Safety
+                //
+                // Safety is guaranteed by the assertions below:
+                // - `rb_ary_new_from_values` will allocated `results.len()`
+                //   elements of usize (not wasmtime::Val)
+                // - Ruby never access the values in the array before we've
+                //   written to them, since doing so would point to the Rust vec
+                //   until we write to it
+                let fake_values = unsafe { transmute(results.as_slice()) };
+                let array = RArray::from_slice::<Value>(fake_values);
+                let array_slice = unsafe { rarray_as_mut_slice(array, results.len()) };
+                let results_iter = results.iter().enumerate();
+
+                assert!(array_slice.len() == results_iter.len()); // optimize out bounds check
+
+                for (i, result) in results.iter().enumerate() {
+                    array_slice[i] = result.to_ruby_value(store).map_err(|e| {
+                        // If we fail along the way, zero out the array just to be safe
+                        let _ = array.clear();
+                        e
+                    })?;
                 }
+
                 Ok(array.into())
             }
         }
     }
 }
 
+/// Converts an `RArray` into a mutable slice.
+///
+/// # Safety
+/// The capacity of the array must be known in advance for this to be safe, since we provide mutable access to the array's contents.
+unsafe fn rarray_as_mut_slice<'a>(array: RArray, capacity: usize) -> &'a mut [Value] {
+    let array_slice = unsafe { array.as_slice() };
+    let ptr = array_slice.as_ptr();
+    let array_slice = unsafe { std::slice::from_raw_parts_mut(ptr as *mut Value, capacity) };
+
+    array_slice
+}
+
 impl From<&Func<'_>> for wasmtime::Extern {
     fn from(func: &Func) -> Self {
         Self::Func(func.get())
@ianks ianks added good first issue Good for newcomers help wanted Extra attention is needed perf Performance related labels Apr 6, 2023
jbourassa added a commit that referenced this issue Apr 23, 2023
After:
```
$ rake compile:release && rake bench:host_call
[...]

Calculating -------------------------------------
Call host func (4 args)
                        519.910k (± 1.3%) i/s -      2.608M in   5.016526s
Call host func (16 args)
                        440.924k (± 0.6%) i/s -      2.243M in   5.086693s
Call host func (64 args)
                        277.404k (± 1.2%) i/s -      1.412M in   5.089372s
Call host func (128 args)
                        144.274k (± 0.7%) i/s -    734.910k in   5.094121s
Call host func (256 args)
                         55.141k (± 1.1%) i/s -    279.990k in   5.078321s
```
jbourassa added a commit that referenced this issue Apr 23, 2023
After:
```
$ rake compile:release && rake bench:host_call
[...]

Calculating -------------------------------------
Call host func (4 args)
                        519.910k (± 1.3%) i/s -      2.608M in   5.016526s
Call host func (16 args)
                        440.924k (± 0.6%) i/s -      2.243M in   5.086693s
Call host func (64 args)
                        277.404k (± 1.2%) i/s -      1.412M in   5.089372s
Call host func (128 args)
                        144.274k (± 0.7%) i/s -    734.910k in   5.094121s
Call host func (256 args)
                         55.141k (± 1.1%) i/s -    279.990k in   5.078321s
```
jbourassa added a commit that referenced this issue Apr 23, 2023
After:
```
$ rake compile:release && rake bench:host_call
[...]

Calculating -------------------------------------
Call host func (4 args)
                        519.910k (± 1.3%) i/s -      2.608M in   5.016526s
Call host func (16 args)
                        440.924k (± 0.6%) i/s -      2.243M in   5.086693s
Call host func (64 args)
                        277.404k (± 1.2%) i/s -      1.412M in   5.089372s
Call host func (128 args)
                        144.274k (± 0.7%) i/s -    734.910k in   5.094121s
Call host func (256 args)
                         55.141k (± 1.1%) i/s -    279.990k in   5.078321s
```
@jbourassa
Copy link
Collaborator

I tried 2 different approaches to this with similar performance (the commits have bench result):

  • exactly this diff: ee2a33f
  • a simpler approach: accumulate values in a Vec<Value> and use RArray::from_slice: cb6125b

After comparing them, I realized they both suffer from an issue though: converting a wasmtime::Val into a Ruby object may trigger GC and crash (see these tests for a repro case). My understanding is that with this diff, we have an RArray on the stack that contains invalid Values, thus crash. In the other approach, we have unrooted Values on the heap (Vec<Value>), so GC collects them and crash.

I guess we could use the Store to temporarily save Values and mark them. That'll eat some of the perf benefit though, and add complexity. Not sure it's worth it.

Thoughts?

@matsadler
Copy link
Contributor

I've updated Magnus so the FromIterator implementation appends to an array in batches which in my testing (in a totally different project) is quite a bit quicker than appending one by one.

So the for loop in invoke could become:

results
    .into_iter()
    .map(|result| result.to_ruby_value(store))
    .collect::<Result<RArray, _>>()
    .map(|ary| ary.as_value())

I don't think I see a benchmark for this in wasmtime-rb, so I don't know if it'd be faster for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed perf Performance related
Projects
None yet
Development

No branches or pull requests

3 participants