
Optimize JSON ops: minimize allocs by decoding v8::Value directly #9540

Closed
AaronO opened this issue Feb 18, 2021 · 12 comments
Labels
deno_core (changes in the "deno_core" crate are needed), suggestion (suggestions for new features, yet to be agreed)

Comments

@AaronO
Contributor

AaronO commented Feb 18, 2021

TLDR: implement a serde deserializer (& serializer) for v8::Value to minimize allocations/overhead of encoding to JSON

Problem & context

Deno ops are kind of like syscalls: all Deno programs must use them to do I/O and access runtime features.

Besides the v8 runtime, it's the only part of the stack that's not "user-optimizable" (users can easily optimize their own JS/TS code and JS/TS libraries, but touching runtime code will often be a stretch for many users).

As such I think it's a good principle to minimize syscall/op overhead, so the runtime is never "the bottleneck".

The current JSON op-ABI has some avoidable overhead, specifically heap-allocations and utf8 encoding.

(Note: I think most real-world programs will likely be I/O bound, or user-code CPU bound before being bottlenecked by the current op-ABI, but it's a good principle nonetheless and helps on microbenchmarks ...)

Current implementation

The current op-ABI passes JSON args as follows:

  1. JSON stringify args
  2. String-concat with promise ID (for async ops)
  3. Encode to utf8
  4. Parse JSON
  5. Allocate struct (on stack)
  6. Allocate fields & fill struct

Steps 1, 2 and 3 make avoidable heap allocations, and CPU time is "wasted" on JSON parsing/stringification.
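To make steps 4-6 concrete, here's a minimal sketch of the Rust-side decode as it works today, assuming a serde_json-based op with a hypothetical OpenArgs struct (the real args structs live in the op crates):

use serde::Deserialize;

// Hypothetical args struct, standing in for the per-op structs in the op crates.
#[derive(Deserialize)]
struct OpenArgs {
    path: String,
    write: bool,
}

// Steps 4-6: parse the utf8-encoded JSON bytes, allocate the struct on the
// stack, then allocate/fill its fields (e.g. the heap-allocated `path` String).
fn parse_args(json_bytes: &[u8]) -> Result<OpenArgs, serde_json::Error> {
    serde_json::from_slice(json_bytes)
}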

Alternative implementation

  1. Pass JS value (v8::Value) as is
  2. Allocate struct/receiving type
  3. Fill receiving fields (with utf8 encoding for strings)

Basically, pass JS objects untouched and then decode them using serde_v8 (to be implemented).
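For contrast, a minimal sketch of the proposed decode path, assuming the to-be-implemented serde_v8 exposes a from_v8 function and an Error type (names here are illustrative):

use rusty_v8 as v8;
use serde::Deserialize;

// Same hypothetical args struct as in the JSON sketch above.
#[derive(Deserialize)]
struct OpenArgs {
    path: String,
    write: bool,
}

// Steps 2-3: decode the untouched v8::Value straight into the receiving type;
// only string fields pay the utf8 encoding cost.
fn parse_args<'s>(
    scope: &mut v8::HandleScope<'s>,
    args: v8::Local<'s, v8::Value>,
) -> Result<OpenArgs, serde_v8::Error> {
    serde_v8::from_v8(scope, args)
}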

Benefits

  • Better performance (less allocs) on op-heavy code
  • Remove overhead of utf8 encoding/decoding (only for string fields instead of the entire JSON string)
  • Reduced conceptual overhead of fiddling with buffers (potentially less error-prone)
    • Ops could receive & return "natural" rust types (if we implement serde_v8 serialization)

Notes

  • Needs to be benchmarked with op-heavy code
  • I focused on the JS->rust context-switch, but similarly the rust->JS return can avoid the decodeJson() step (utf8 decode + json parse) by serializing directly to a v8::Value (see the sketch after this list)
  • bin_ops will probably still be faster than these improved json_ops, since shared memory with no decoding is hard to beat, but this should help close the gap whilst preserving the benefits & convenience of structured args
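A sketch of that return path, again assuming a to-be-implemented serde_v8::to_v8 mirroring from_v8 (the result type is illustrative):

use rusty_v8 as v8;
use serde::Serialize;

// Hypothetical op result, standing in for the per-op return types.
#[derive(Serialize)]
struct StatResult {
    is_file: bool,
    size: u64,
}

// Serialize the Rust value directly into a v8::Value, skipping the
// JSON stringify -> utf8 decode -> JSON.parse round-trip on returns.
fn encode_result<'s>(
    scope: &mut v8::HandleScope<'s>,
    result: StatResult,
) -> Result<v8::Local<'s, v8::Value>, serde_v8::Error> {
    serde_v8::to_v8(scope, result)
}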

Further changes

Implementing this optimization would require either changing the op-ABI or implementing an alternative opt-in "fast-path" (e.g: Deno.core.sendFast or sendJSON).

Currently JSON ops are a special case of "buffer ops"; I think this would be a good opportunity to "reverse" that and consolidate all ops under a common explicit signature (rather than having implicit signatures packed through buffers):

fn(op_state, promise_id, args, buffer) -> Result

(args and Result could be any serde / serde_v8 Serializable types)
(promise_id, args and buffer should probably all be Option<>)
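To illustrate the shape, a concrete op written against that signature might look roughly like this (OpState, ZeroCopyBuf, PromiseId and AnyError are placeholders for whatever deno_core would end up using):

use serde::{Deserialize, Serialize};
use std::cell::RefCell;
use std::rc::Rc;

// Placeholder types; the real equivalents would live in deno_core.
struct OpState;
struct ZeroCopyBuf(Vec<u8>);
type PromiseId = u64;
type AnyError = Box<dyn std::error::Error>;

#[derive(Deserialize)]
struct AddArgs {
    a: u32,
    b: u32,
}

#[derive(Serialize)]
struct AddResult {
    sum: u32,
}

// Structured args in, structured result out; promise_id and buffer are optional.
fn op_add(
    _state: Rc<RefCell<OpState>>,
    _promise_id: Option<PromiseId>,
    args: Option<AddArgs>,
    _buffer: Option<ZeroCopyBuf>,
) -> Result<AddResult, AnyError> {
    let args = args.ok_or("missing args")?;
    Ok(AddResult { sum: args.a + args.b })
}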

Next steps

  1. Proof of concept (evaluate perf gain / measure current overhead)
  2. Implement serde_v8
  3. Discuss op-ABI changes
  4. Implement changes to core, runtime & op_crates

Questions

  • Do any of the ops use 2 buffers (from BufVec)? Searching through the code, the pattern seems to be 2 buffers max: the first is promise + JSON, the second is a buffer to read/write. If so, we can remove BufVec from the new ABI (as suggested above)
@kitsonk added the deno_core and suggestion labels on Feb 18, 2021
@AaronO
Contributor Author

AaronO commented Feb 18, 2021

If successful, this could simplify a lot of internal ops code (e.g. remove minimal ops, since they would be similar to the default)

@ry
Member

ry commented Feb 19, 2021

@AaronO It's an interesting idea but we are skeptical it can be made faster than the status quo. It will also be quite a lot of work just to be able to perf test it. If you decide to play around with it and get some results, we will be eager to see them, but this is not something the core team will pursue.

@AaronO
Contributor Author

AaronO commented Feb 19, 2021

@ry Completely understandable; I wasn't expecting anyone on the core team to put in time on it, especially without a POC. I simply shared the idea to get broad feedback and ensure that I wasn't missing some rationale backing the JSON approach.
(Or whether any bijection between JS & op-space would be sufficient)

I hacked together a very quick POC (modifying core and http_bench_json_ops). I only optimized the JS->Rust deserialization (not Rust->JS yet), but there's roughly a 20% gain in throughput (reqs/s).

Here are the benchmarks, run locally on my M1 Air:

Master

Running 20s test @ http://127.0.0.1:4544/
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   134.63us   35.13us   2.01ms   85.55%
    Req/Sec    35.38k     0.88k   36.33k    97.76%
  Latency Distribution
     50%  130.00us
     75%  145.00us
     90%  165.00us
     99%  274.00us
  1415139 requests in 20.10s, 68.83MB read
Requests/sec:  70405.75
Transfer/sec:      3.42MB

Optimized Hack

Running 20s test @ http://127.0.0.1:4544/
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   120.83us  221.60us  11.69ms   99.76%
    Req/Sec    41.66k     1.47k   43.35k    95.77%
  Latency Distribution
     50%  112.00us
     75%  127.00us
     90%  144.00us
     99%  185.00us
  1666653 requests in 20.10s, 81.06MB read
Requests/sec:  82919.29
Transfer/sec:      4.03MB

@lucacasonato
Member

@AaronO Do you have some source code you can share? Would be interested to look at this.

@AaronO
Contributor Author

AaronO commented Feb 19, 2021

@lucacasonato The POC I just put together is extremely hackish and not worth sharing (the goal was to test the perf hypothesis), but I have a relatively clear idea of how to implement serde_v8 and the new op-ABI I'm suggesting.

I might take a first stab at it this weekend.

@AaronO
Contributor Author

AaronO commented Feb 27, 2021

A quick update to let you know I've started implementing serde_v8. Deserialization (v8 -> rust) is working; I still need to implement serialization (rust -> v8) and then some optimizations, general polish and benchmarks (on a mock ABI vs JSON).

I hope to share a first implementation this weekend, the benchmarks should closely model the perf improvements of changing the op-ABI. Once/if the perf improvements are demonstrated, I'll open a PR implementing my suggested op-ABI.

(Note: We could in theory design the op-ABI to be encoding independent, so JSON/v8 arg-encodings could be easily switched with minimal changes)

Here's a quick example running locally to give you a preview:

Output

~/git/serde_v8 main*
❯ cargo run --example basic 2>/dev/null
x32 = 32
mop = MathOp { a: 1, b: 3, operator: None }
arr = [1, 2, 3, 4, 5]
hi = ["hello", "world"]
x = 12345

Example code

use rusty_v8 as v8;
use serde_v8;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct MathOp {
    pub a: u64,
    pub b: u64,
    pub operator: Option<String>,
}

fn main() {
    let platform = v8::new_default_platform().unwrap();
    v8::V8::initialize_platform(platform);
    v8::V8::initialize();

    {
        let isolate = &mut v8::Isolate::new(v8::CreateParams::default());
        let handle_scope = &mut v8::HandleScope::new(isolate);
        let context = v8::Context::new(handle_scope);
        let scope = &mut v8::ContextScope::new(handle_scope, context);

        fn exec<'s>(scope: &mut v8::HandleScope<'s>, src: &str) -> v8::Local<'s, v8::Value> {
            let code = v8::String::new(scope, src).unwrap();
            let script = v8::Script::compile(scope, code, None).unwrap();
            script.run(scope).unwrap()
        }

        let v = exec(scope, "32");
        let x32: u64 = serde_v8::from_v8(scope, v).unwrap();
        println!("x32 = {}", x32);

        let v = exec(scope, "({a: 1, b: 3, c: 'ignored'})");
        let mop: MathOp = serde_v8::from_v8(scope, v).unwrap();
        println!("mop = {:?}", mop);
        
        let v = exec(scope, "[1,2,3,4,5]");
        let arr: Vec<u64> = serde_v8::from_v8(scope, v).unwrap();
        println!("arr = {:?}", arr);
        
        let v = exec(scope, "['hello', 'world']");
        let hi: Vec<String> = serde_v8::from_v8(scope, v).unwrap();
        println!("hi = {:?}", hi);

        let v: v8::Local<v8::Value> = v8::Number::new(scope, 12345.0).into();
        let x: f64 = serde_v8::from_v8(scope, v).unwrap();
        println!("x = {}", x);
    }

    unsafe {
        v8::V8::dispose();
    }
    v8::V8::shutdown_platform();
}

@AaronO
Contributor Author

AaronO commented Feb 28, 2021

TLDR: Early benchmarks on a new ABI (with serde_v8) point to a ~7x improvement in ABI-overhead
(the add op is extremely fast itself, so I'm assuming the measured time is virtually entirely overhead)

Here's a first iteration of benchmarks comparing simplified versions (only sync ops, ...) of Deno's current op-ABI and a new one using serde_v8 to handle encoding/decoding.

test abi_add_json     ... bench:     142,684 ns/iter (+/- 25,624)
test abi_add_v8       ... bench:      21,787 ns/iter (+/- 478)
test abi_promote_json ... bench:     181,272 ns/iter (+/- 9,638)
test abi_promote_v8   ... bench:      69,080 ns/iter (+/- 3,366)
test abi_sum_json     ... bench:     152,387 ns/iter (+/- 15,316)
test abi_sum_v8       ... bench:      27,430 ns/iter (+/- 540)

(Note: each iter here is 100 iterations of a JS for-loop)
(Note 2: I wanted to get ABI benchmarks before spending time implementing serialization, so I just swallowed return values, without any encoding work, for both ABIs. So this benchmark is half of the story, but I'm confident serializing returns won't substantially change the conclusion)

Op functions

I implemented 3 basic ops, called by each ABI, to represent a subset of payload types.

pub fn sum(args: Vec<u64>) -> u64 {
    args.into_iter().sum()
}

#[derive(Deserialize)]
pub struct AddArgs {
    a: u32,
    b: u32,
}
pub fn add(args: AddArgs) -> u32 {
    args.a + args.b
} 

#[derive(Deserialize, Serialize)]
pub struct Person {
    first_name: String,
    last_name: String,
    age: u8,
}
pub fn promote(args: Person) -> Person {
    Person {
        first_name: args.first_name.to_uppercase(),
        last_name: args.last_name.to_ascii_uppercase(),
        age: args.age + 1,
    }
}

Detailed decoding benchmarks

Here are more granular benchmarks purely on decoding:

test de_array_json        ... bench:         258 ns/iter (+/- 4)
test de_array_v8          ... bench:         512 ns/iter (+/- 25)
test de_bool_json         ... bench:           6 ns/iter (+/- 0)
test de_bool_v8           ... bench:           5 ns/iter (+/- 0)
test de_int_json          ... bench:           8 ns/iter (+/- 0)
test de_int_v8            ... bench:           4 ns/iter (+/- 0)
test de_str_json          ... bench:          37 ns/iter (+/- 0)
test de_str_v8            ... bench:          63 ns/iter (+/- 1)
test de_struct_json       ... bench:          31 ns/iter (+/- 1)
test de_struct_json_deopt ... bench:         308 ns/iter (+/- 5)
test de_struct_v8         ... bench:       2,675 ns/iter (+/- 1,081)
test de_struct_v8_opt     ... bench:          78 ns/iter (+/- 6)
test de_tuple_json        ... bench:          15 ns/iter (+/- 0)
test de_tuple_v8          ... bench:          61 ns/iter (+/- 1)
test de_tuple_v8_opt      ... bench:          57 ns/iter (+/- 1)

(Note: *_json benches are on existing rust strings, so we're skipping the stringify overhead of the JSON op-ABI)

The main takeaway here is that de_struct_v8 is slow, mainly due to allocating v8::Strings to .get() values.

I plan to implement a KeyCache that should avoid those allocations for frequent keys, bringing it in line with de_struct_v8_opt.
Basically, a KeyCache would cache all Rust-struct-field-name to v8::String mappings; in Deno we could put that KeyCache in the runtime state (it's unfortunately per-isolate, so there would be some memory overhead, but it could be opt-in and is likely a good tradeoff).
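A minimal sketch of what such a KeyCache could look like (API details are illustrative, not the final design):

use rusty_v8 as v8;
use std::collections::HashMap;

// Caches struct-field-name -> v8::String mappings so repeated object-key
// lookups don't re-allocate v8::Strings on every decode.
#[derive(Default)]
pub struct KeyCache {
    keys: HashMap<&'static str, v8::Global<v8::String>>,
}

impl KeyCache {
    // Return a v8::String for `name`, creating and caching it on first use.
    pub fn get<'s>(
        &mut self,
        scope: &mut v8::HandleScope<'s>,
        name: &'static str,
    ) -> v8::Local<'s, v8::String> {
        if let Some(global) = self.keys.get(name) {
            return v8::Local::new(scope, global);
        }
        let local = v8::String::new(scope, name).unwrap();
        self.keys.insert(name, v8::Global::new(scope, local));
        local
    }
}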

I suspect with KeyCache and encoding return values in the ABI, we might see a ~10x improvement in ABI-overhead overall.

Note: it would be hard to optimize v8::Array unpacking further. Looking at v8's json-stringifier.cc, it has fast paths for arrays; we're basically using their slow path. Optimizing further would require changing v8's public API to expose a faster means of iterating over arrays than successive .get_index() calls, so it's possible but a lot of work and unlikely.
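For reference, the slow path described above boils down to element-by-element lookups through the public API, something like this sketch:

use rusty_v8 as v8;

// Unpack a v8::Array one element at a time; each .get_index() call goes
// through v8's generic property machinery, which is what makes this slow
// compared to the internal fast paths in json-stringifier.cc.
fn array_to_vec<'s>(
    scope: &mut v8::HandleScope<'s>,
    arr: v8::Local<'s, v8::Array>,
) -> Vec<f64> {
    let len = arr.length();
    let mut out = Vec::with_capacity(len as usize);
    for i in 0..len {
        let elem = arr.get_index(scope, i).unwrap();
        out.push(elem.number_value(scope).unwrap_or(f64::NAN));
    }
    out
}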


@lucacasonato I'll share the repo soon. It would be great to have your review to ensure that these benchmarks are fair and representative of the improvement (I also haven't used Rust a ton, so feedback on style & best practices is welcome too)

@AaronO
Contributor Author

AaronO commented Feb 28, 2021

@lucacasonato Voilà, you can check out a first version here: https://github.com/AaronO/serde_v8

Main files of interest:

  • benches/abi.rs: the benchmarks shared above (showing a ~7x improvement in overhead)
  • src/utils/*: implements a lightweight version of deno_core (used by the ABI benches); it should look familiar

@AaronO
Contributor Author

AaronO commented Mar 1, 2021

If the core team is open to this new ABI, I could probably submit a PR later this week or next weekend.

@AaronO
Contributor Author

AaronO commented Mar 4, 2021

A quick update to share a decent performance improvement. Going by the abi_add_json/abi_add_v8 ratio, the encoding overhead is now ~10x lower than Deno's current implementation (I previously shared ~7x, so up ~50% since then).

Key highlights (of benchmarks improved):

Before:

test abi_add_v8       ... bench:      21,787 ns/iter (+/- 478)
test de_struct_v8         ... bench:       2,675 ns/iter (+/- 1,081)

After:

test abi_add_v8       ... bench:      14,635 ns/iter (+/- 307)
test de_struct_v8         ... bench:         199 ns/iter (+/- 3)

The main improvement is in de_struct_v8 and decoding struct keys; there's a huge decrease in variability due to decreased allocations.

New benches

running 6 tests
test abi_add_json     ... bench:     148,937 ns/iter (+/- 14,882)
test abi_add_v8       ... bench:      14,635 ns/iter (+/- 307)
test abi_promote_json ... bench:     182,003 ns/iter (+/- 17,076)
test abi_promote_v8   ... bench:      55,284 ns/iter (+/- 848)
test abi_sum_json     ... bench:     153,089 ns/iter (+/- 19,175)
test abi_sum_v8       ... bench:      28,924 ns/iter (+/- 1,043)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 0 filtered out; finished in 14.23s

     Running unittests (target/release/deps/de-3851838fdb07fe65)

running 15 tests
test de_array_json        ... bench:         214 ns/iter (+/- 2)
test de_array_v8          ... bench:         511 ns/iter (+/- 8)
test de_bool_json         ... bench:           6 ns/iter (+/- 0)
test de_bool_v8           ... bench:           5 ns/iter (+/- 0)
test de_int_json          ... bench:           8 ns/iter (+/- 0)
test de_int_v8            ... bench:           4 ns/iter (+/- 0)
test de_str_json          ... bench:          37 ns/iter (+/- 1)
test de_str_v8            ... bench:          63 ns/iter (+/- 0)
test de_struct_json       ... bench:          32 ns/iter (+/- 0)
test de_struct_json_deopt ... bench:         326 ns/iter (+/- 6)
test de_struct_v8         ... bench:         199 ns/iter (+/- 3)
test de_struct_v8_opt     ... bench:          77 ns/iter (+/- 1)
test de_tuple_json        ... bench:          15 ns/iter (+/- 0)
test de_tuple_v8          ... bench:          60 ns/iter (+/- 1)
test de_tuple_v8_opt      ... bench:          56 ns/iter (+/- 1)

@AaronO mentioned this issue on Mar 8, 2021
@AaronO
Contributor Author

AaronO commented Mar 9, 2021

Full benches

These benches are now fully representative of op-encoding overhead in the ABI: they measure both decoding inputs and encoding outputs/returns.

test abi_add_json     ... bench:     279,703 ns/iter (+/- 46,786)
test abi_add_v8       ... bench:      19,290 ns/iter (+/- 540)
test abi_add_void     ... bench:       6,551 ns/iter (+/- 1,777)
test abi_promote_json ... bench:     338,637 ns/iter (+/- 48,866)
test abi_promote_v8   ... bench:     100,903 ns/iter (+/- 3,525)
test abi_promote_void ... bench:      24,769 ns/iter (+/- 1,164)
test abi_sum_json     ... bench:     281,636 ns/iter (+/- 39,827)
test abi_sum_v8       ... bench:      36,453 ns/iter (+/- 2,492)
test abi_sum_void     ... bench:       5,323 ns/iter (+/- 1,148)

Note: *_void benches exist purely to measure the upper bound of a hypothetical zero-overhead op-ABI (they don't look at args at all, so they're not useful in themselves but make a good baseline)

Conclusions

The serde_v8 op-ABI is consistently faster on all benches than the JSON version, ranging from ~3x to ~14x faster depending on the structure of inputs & returns (structs are generally fast now, but string values are relatively slow ...)

@AaronO
Contributor Author

AaronO commented Apr 18, 2021

These improvements and many others shipped in 1.9.0 and were mostly landed by #9843 and its follow-ups.

@AaronO closed this as completed on Apr 18, 2021