update the Rust crate to v0.2 and enable the "c" feature #14

Merged (2 commits) on Mar 7, 2020

Conversation

oconnor663 (Contributor) commented Mar 6, 2020

This should be a measurable performance improvement in the native build. That said, I'm not quite sure how to test or benchmark this project properly :)

One of the things changed from v0.1 to v0.2 is that multithreading can now be enabled on a per-caller basis. This library could consider exposing a boolean multithreading flag or something like that, as I've done in https://github.com/oconnor663/blake3-py. I don't know what the multithreading story is for Wasm (I assume there isn't one yet?), but fingers crossed with the native build it could Just Work.

This enables optimized assembly implementations, and also AVX-512
support. This requires GCC or similar, but callers building native code
are pretty likely to have that installed.
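
For reference, enabling the feature from a consumer's Cargo.toml looks something like this (a sketch; the exact dependency line used in this repo may differ):

```toml
# Sketch of opting in to the "c" feature of the blake3 crate. The "c"
# feature builds the optimized assembly/C implementations (including
# AVX-512 support), which is why a C compiler such as GCC is required.
[dependencies]
blake3 = { version = "0.2", features = ["c"] }
```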
connor4312 (Owner) commented

They're working on it, but WebAssembly doesn't support threading yet. Some toolchains can shim threads onto web workers, but the setup is finicky, and without SharedArrayBuffer the overhead of copying memory would probably make it slower than doing everything in the main thread or a single worker, which runs at about 400 MB/s on my machine.

The update to 0.2 speeds up Node by about 2x for large data, nice. WebAssembly is about the same speed.

native node:

            792,000 ops/sec > 64B#v1
            833,000 ops/sec > 64B#v2 
             49,400 ops/sec > 64KB#v1
             91,100 ops/sec > 64KB#v2
                549 ops/sec > 6MB#v1
              1,050 ops/sec > 6MB#v2

node wasm:

          1,230,000 ops/sec > 64B#v1
          1,150,000 ops/sec > 64B#v2
              7,120 ops/sec > 64KB#v1
              6,980 ops/sec > 64KB#v2
               75.4 ops/sec > 6MB#v1
               75.6 ops/sec > 6MB#v2

It looks like swapping in the rayon-based update_with_join has a neutral or negative effect for smaller inputs but improves speed on larger ones a bit, which makes sense, since dispatching work to threads has some amount of overhead. It's a good thing to put behind an option, but probably not an appropriate default; even when hashing large amounts of data, it will usually be streamed through and added to the hash in relatively small chunks.

            771,000 ops/sec > 64B#v1
            784,000 ops/sec > 64B#v2
             47,800 ops/sec > 64KB#v1
             46,900 ops/sec > 64KB#v2
                538 ops/sec > 6MB#v1
              1,990 ops/sec > 6MB#v2
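
For context, the per-caller choice being benchmarked here looks roughly like this in blake3 v0.2 (a sketch assuming the crate's "rayon" feature is enabled; this is not this repo's actual binding code):

```rust
// Single-threaded vs. rayon-backed hashing with the blake3 v0.2 API.
fn main() {
    let input = vec![0u8; 6 * 1024 * 1024]; // e.g. a 6 MB input

    // Single-threaded update, the default path:
    let mut single = blake3::Hasher::new();
    single.update(&input);

    // Multithreaded update via the rayon-backed Join implementation.
    // Per the numbers above, only worth it for large inputs.
    let mut multi = blake3::Hasher::new();
    multi.update_with_join::<blake3::join::RayonJoin>(&input);

    // Both paths produce the same hash; only the throughput differs.
    assert_eq!(single.finalize(), multi.finalize());
}
```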

I'd be happy to accept a PR adding such an option, or I can do it later this week/weekend. One note: I would add it as an option in the create* factory functions rather than as a separate method on the Hasher. In Node a common practice is to use the stream interface, doing something like fs.createReadStream('input.bin').pipe(createHash())..., which doesn't call the update method explicitly, so an analogous updateWithJoin wouldn't work as well.

connor4312 merged commit 32f0f89 into connor4312:master on Mar 7, 2020
oconnor663 (Contributor, Author) commented Mar 7, 2020

speeds up Node by about 2x for large data

That's a little surprising to me. Assuming you're running on an AVX2 machine, the new assembly implementation should be maybe 10-20% faster than the previous version. But there are a lot of potential factors, depending on exactly how you're benchmarking.

It looks like swapping in rayon-based update_with_join has a neutral or negative effect for smaller inputs, but improves speed on larger ones a bit

Yep, that's typical. Usually I find the break-even point is around 128 KiB, and as a rule of thumb I tell people not to bother with multithreading below 1 MiB.

in Node a common practice is to use the stream interface, doing something like fs.createReadStream('input.bin').pipe(createHash())...

Hmm, everything would depend on the buffer size they're using inside of pipe. Usually the implementation for something like that is to allocate, say, a 64 KiB buffer once, and then repeatedly read into that. But as you noted above, hashing 64 KiB inputs with multithreading actually hurts performance. It might be that in that pipelined setting, we're just never going to do well.

When I've experimented with multithreaded streaming before (as opposed to hashing a large input all-at-once, which makes recursive fork-join concurrency a good fit), it's required some fairly complicated background buffering. The idea is that you use a couple of really big buffers, for example 1 MiB each, and you have a background thread filling one buffer, while all the worker threads hash the other one. That avoids sleeping all the worker threads every time you need to refill the buffer, which kills your parallelism. You can get pretty close to all-at-once performance if you go through all that effort, but I don't expect many people to do it in practice. Realistically speaking, in a pipelined/streaming setting, you're much more likely to be bottlenecked by the network or something. And programs that are reading the filesystem, and that really care about this, should be using memory mapping instead of buffered reads.
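
To make the shape of that concrete, here's a rough Rust sketch of the two-buffer scheme (the buffer size, channel plumbing, and use of the v0.2 rayon API are illustrative assumptions, not code from this repo):

```rust
// A background I/O thread fills one big buffer while the rayon worker
// threads hash the other; drained buffers are sent back for reuse.
// Assumes blake3 = { version = "0.2", features = ["rayon"] }.
use std::io::Read;
use std::sync::mpsc;
use std::thread;

const BUF_SIZE: usize = 1 << 20; // 1 MiB per buffer, two buffers in flight

fn hash_reader<R: Read + Send + 'static>(mut reader: R) -> std::io::Result<blake3::Hash> {
    // Filled buffers flow one way; drained buffers flow back for reuse,
    // so the two allocations are made once up front.
    let (full_tx, full_rx) = mpsc::sync_channel::<(Vec<u8>, usize)>(1);
    let (empty_tx, empty_rx) = mpsc::channel::<Vec<u8>>();
    empty_tx.send(vec![0u8; BUF_SIZE]).unwrap();
    empty_tx.send(vec![0u8; BUF_SIZE]).unwrap();

    let io_thread = thread::spawn(move || -> std::io::Result<()> {
        while let Ok(mut buf) = empty_rx.recv() {
            // read() may return short; that's fine for correctness, though
            // a real version might loop to fill each buffer completely.
            let n = reader.read(&mut buf)?;
            if n == 0 || full_tx.send((buf, n)).is_err() {
                break; // EOF, or the hashing side hung up
            }
        }
        Ok(()) // dropping full_tx lets the hashing loop below finish
    });

    let mut hasher = blake3::Hasher::new();
    while let Ok((buf, n)) = full_rx.recv() {
        // The rayon pool hashes this buffer while the I/O thread is
        // already filling the other one.
        hasher.update_with_join::<blake3::join::RayonJoin>(&buf[..n]);
        let _ = empty_tx.send(buf); // recycle the buffer
    }
    io_thread.join().unwrap()?;
    Ok(hasher.finalize())
}

fn main() -> std::io::Result<()> {
    // Hypothetical usage on a file path.
    let file = std::fs::File::open("input.bin")?;
    println!("{}", hash_reader(file)?.to_hex());
    Ok(())
}
```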

connor4312 (Owner) commented Mar 7, 2020

Usually the implementation for something like that is to allocate, say, a 64 KiB buffer once, and then repeatedly read into that

This is an unfortunate aspect of how streams work in JavaScript. Buffers are passed to subsequent operators in the stream rather than shared with them, which means no reuse: there's a fresh allocation for every chunk of incoming data. I much prefer Go's simple io.Reader/io.Writer approach. Node does have low-level, C-ish read/write functions, but they are not idiomatic and are very rarely used.

When I've experimented with multithreaded streaming before (as opposed to hashing a large input all-at-once, which makes recursive fork-join concurrency a good fit), it's required some fairly complicated background buffering...

I see. I think we could do this in Neon: return control to user code and invoke a callback into Node when the hash completes. This would allow the Node consumer to implement appropriate buffering for whatever use case they have. A prepackaged file hashing implementation might also make sense; I would do that on the Rust side of things, though, since Node.js doesn't support memory mapping and we would probably bottleneck transiting the data through V8 anyhow.

But as you pointed out, few consumers are likely to be bottlenecked around the hash function. BLAKE3 is just too fast already 😛
