Threads + memory growth + JS #1271

Closed
kripken opened this issue Mar 25, 2019 · 30 comments

Comments

@kripken
Member

kripken commented Mar 25, 2019

In the last meeting we briefly discussed this, and it was suggested that I write up more details of the problem and some possible solutions. I did that here:

https://gist.github.com/kripken/949eab99b7bc34f67c12140814d2b595

Feedback welcome!

@binji
Member

binji commented Mar 25, 2019

@lukewagner @littledan you expressed some interest in this.

I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

As for view8_s(start, end), etc. This would be identical to new Int8Array(memory.buffer, start, end - start), right? Seems like these functions aren't as useful as the load/store ones.
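For concreteness, a sketch of that equivalence (`view8_s` is only the proposed method from the gist and does not exist; the typed-array construction below is standard JS):

```javascript
// A signed 8-bit view over bytes [start, end) of a wasm memory, built with
// the existing three-argument Int8Array constructor -- what the proposed
// memory.view8_s(start, end) would return.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 65536 bytes
const start = 8;
const end = 24;
const view = new Int8Array(memory.buffer, start, end - start);
// view covers 16 bytes beginning at offset 8
```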

@jgravelle-google

Not knowing enough about the total design space, I really like having typed loads on the memory object itself.
A question I have is, would it be reasonable to put these load methods on ArrayBuffer directly? Are there semantic differences between an ArrayBuffer and a WebAssembly.Memory? It feels like an API inconsistency to me to have these scoped to only wasm.

@binji
Member

binji commented Mar 25, 2019

@jgravelle-google Seems unlikely that ES would want that, since this functionality is already provided by DataView.

@littledan

This proposal sounds good to me, at a high level.

This would be the first time that WebAssembly text format instruction names would show up in JavaScript AFAIK. Bikeshedding, for the load and store API, I'm wondering, what if we gave WebAssembly.Memory methods of the same name and signature as a DataView? Or, maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.

I don't understand the view API. Is this about making a TypedArray over the WebAssembly.Memory? Would this TypedArray be expected to remain "live" as memory grows?

@binji
Member

binji commented Mar 25, 2019

Or, maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.

Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays are platform byte order. So using normal views is already going to be incorrect behavior.

Would this TypedArray be expected to remain "live" as memory grows?

I think they're meant to be normal TypedArrays, so their size won't change.
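A minimal illustration of that staleness for non-shared memory (standard behavior, no proposed APIs involved):

```javascript
// A plain TypedArray view does not track growth: for non-shared memory,
// grow() detaches the old ArrayBuffer, so existing views go stale.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 2 });
const view = new Uint8Array(memory.buffer);
const lengthBefore = view.length;            // 65536 (one page)
memory.grow(1);                              // old buffer is detached
const lengthAfter = view.length;             // 0: the view is now stale
const fresh = new Uint8Array(memory.buffer); // must recreate to see 2 pages
```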

@kripken
Member Author

kripken commented Mar 25, 2019

As for view8_s(start, end), etc. This would be identical to new Int8Array(memory.buffer, start, end - start), right? Seems like these functions aren't as useful as the load/store ones.

Yeah, good point, the view API here is less useful. I forgot about the typed array constructor with 3 parameters - and even without that it's not a huge win, I guess. It does seem kind of nice for completeness though.

Would this TypedArray be expected to remain "live" as memory grows?

Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).

I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access

Most overhead would probably come from the GC cost of creating a new object on each load and store. It's hard to measure that directly since the overhead ends up in GC pauses, is affected by GC optimizations, etc. But this is the exact issue that led to the creation of GC-free entry points for WebGL in WebGL2, which are very useful. (It's potentially even more significant here, since there it saves creating a view per GL call, while here it could be a view per load and store.)

@guybedford

It is an unfortunate problem to work around. A couple of naive questions:

  1. Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.
  2. Surely JS engines could optimize small WASM functions that act directly on memory that are effectively just load calls, to avoid the full context switching costs here in future?

@littledan

Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays are platform byte order. So using normal views is already going to be incorrect behavior.

Oh, I didn't realize this. That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).

Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).

OK, in this case, I don't really understand the purpose of this part of the proposal--this seems like a way to get around a single access to the ArrayBuffer, but it still forces you to allocate a new TypedArray. It's easier to understand why you would want the load/store methods (since that's a bit of indirection and allocation avoided, and totally solves the growth synchronization problem).

@binji
Member

binji commented Mar 25, 2019

Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.

You could do this, but it would only help if you make sure you return to the event loop.

Surely JS engines could optimize small WASM functions that act directly on memory that are effectively just load calls, to avoid the full context switching costs here in future?

Yes, that seems possible. But it's not trivial -- ultimately you want these to optimize down to a single (perhaps bounds-checked) load, but at least for v8 there's a lot of stuff the optimizer would have to remove to make that happen.

That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).

Ha, I see! The default for DataView is big-endian. Might be a bit confusing if we reuse the naming convention there, but it seems like a small risk to me.
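A small demonstration of the endianness mismatch (this is standard DataView behavior, nothing wasm-specific):

```javascript
// DataView defaults to big-endian, while wasm memory is always little-endian,
// so a DataView-style API on Memory would have to flip the default.
const buf = new ArrayBuffer(4);
const dv = new DataView(buf);
dv.setUint32(0, 0x11223344);            // no flag: big-endian by default
const asLittle = dv.getUint32(0, true); // read the same bytes little-endian
// asLittle === 0x44332211: the bytes come back in swapped order
```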

@lukewagner
Member

I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

It would also be interesting to see data comparing a non-growable-memory app (that just uses unchanging cached views) vs. a growable-memory version of the same app that polyfills the abovementioned Memory methods by caching the views/buffer and thus guards the load/store methods with if (cachedBuffer === this.buffer).

In the common case, the buffer doesn't change, so the overhead is just that of calling the Memory.prototype.buffer getter, which could itself be inlined (and that would actually be simpler than inlining a family of new load/store methods).
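A minimal sketch of that guarded polyfill (the GROWABLE_HEAP_LOAD_I8 name follows the Emscripten convention discussed below; this is an illustration, not Emscripten's actual implementation):

```javascript
// Cache the buffer and a view; refresh both only when memory.buffer has
// changed identity (i.e. the memory grew and the old buffer was detached).
const memory = new WebAssembly.Memory({ initial: 1, maximum: 2 });
let cachedBuffer = memory.buffer;
let HEAP8 = new Int8Array(cachedBuffer);

function GROWABLE_HEAP_LOAD_I8(ptr) {
  if (cachedBuffer !== memory.buffer) { // common case: identity check passes
    cachedBuffer = memory.buffer;       // memory grew: refresh the cache
    HEAP8 = new Int8Array(cachedBuffer);
  }
  return HEAP8[ptr];
}

HEAP8[10] = 42;
const before = GROWABLE_HEAP_LOAD_I8(10); // 42
memory.grow(1);                           // detaches the old buffer
const after = GROWABLE_HEAP_LOAD_I8(10);  // still 42: contents survive grow
```

The guard is the `if (cachedBuffer !== memory.buffer)` identity check; the measurable cost in the no-growth case is that check plus the `buffer` getter call on every access.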

@kripken
Member Author

kripken commented Mar 27, 2019

Sure, here are some numbers. The polyfill and benchmark files are in this gist. The benchmark calls emscripten's AsciiToString function repeatedly, which is a function that does many memory reads, so in practice we replace the line

  var ch = HEAP8[ptr++ >> 0];

with

  var ch = GROWABLE_HEAP_LOAD_I8(ptr++ >> 0 | 0) | 0;

The latter is 2.9x slower on v8.

As Luke requested this app doesn't actually grow memory, so a new view is never created - the overhead comes from the function call and the check.

@lukewagner
Member

Thanks Alon, that's really helpful. So I think the next step would be to see what the performance was like if the buffer accessor was inlined (which is of course much more work, but so is spec'ing and inlining N new Memory methods...).

@kripken
Member Author

kripken commented Mar 27, 2019

By inlined, do you mean something like

  var ch = HEAP8[ptr++ >> 0];
==>
  var ch = (wasmMemory.buffer != buffer ?  buffer = wasmMemory.buffer, updateGlobalBufferViews() : 0), HEAP8[ptr++ >> 0];

?

@kripken
Member Author

kripken commented Mar 28, 2019

With

  var ch = ((wasmMemory.buffer != buffer ? (buffer = wasmMemory.buffer, updateGlobalBufferViews()) : 0), HEAP8[ptr++ >> 0]);

it's about the same speed, so, still 2.9x slower. (Which I think makes sense as I assume it inlines eventually in such a hot loop.)

@lukewagner
Member

Oh, no, I mean the engine-implemented C++ accessor function for buffer that is ultimately getting called by the JS engine when you call .buffer. It's probably not hard to inline, but I expect no engine has done so yet.

@dcodeIO

dcodeIO commented Mar 29, 2019

Related: #1210 which proposes an event handler.

@Macil

Macil commented Mar 29, 2019

Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.

You could do this, but it would only help if you make sure you return to the event loop.

Couldn't the event be dispatched synchronously?

@lukewagner
Member

Unfortunately, in the shared-memory case, the grow can happen completely racily (due to grow in another thread).

kripken added a commit to emscripten-core/emscripten that referenced this issue Apr 8, 2019
(in wasm - no support in asm.js)

Background: WebAssembly/design#1271

* Add GROWABLE_HEAP_* methods that replace each load and store in JS. These safely check if memory grew, and update the buffer and views if so. The JS optimizer is used to instrument the JS in this way, so it works for user JS as well.
* Show a warning about the slowness during compilation.
* Make sbrk etc. threadsafe.
 * Rewrite brk using sbrk, to avoid writing two pieces of threadsafe code.
@cosinusoidally

Oh, no, I mean the engine-implemented C++ accessor function for buffer that is ultimately getting called by the JS engine when you call .buffer. It's probably not hard to inline, but I expect no engine has done so yet.

Do TypedArray .length accessors get inlined? I get much better performance if I build benchmark.cpp from https://gist.github.com/kripken/a9d90e7cfaed9d210aff4fd0be98217d with GROWABLE_HEAP_LOAD_U8 replaced with this:

function GROWABLE_HEAP_LOAD_U8(ptr) {
  if (ptr>=HEAPU8.length) {
    if (wasmMemory.buffer != buffer) {
      buffer = wasmMemory.buffer;
      updateGlobalBufferViews();
    }
  } 
  return HEAPU8[ptr];
}

i.e., only check whether the buffer has grown when the program actually attempts to access memory beyond the current view's bounds.

I tested on Emscripten 1.38.31 building with:

em++ -O2 --std=c++11 -s USE_PTHREADS=1 -s ALLOW_MEMORY_GROWTH=1 -s WASM_MEM_MAX=64MB benchmark.cpp -o benchmark.htm

On my system (with Chrome 74) it took about 8 seconds with the existing implementation of GROWABLE_HEAP_LOAD_U8 and about 2 seconds with my implementation.

VirtualTim pushed a commit to VirtualTim/emscripten that referenced this issue May 21, 2019
…ore#8365)

@JsBlueCat

JsBlueCat commented Jun 8, 2019

emcc:WARNING: ignoring unsupported linker flag: --gc-sections
shared:ERROR: invalid forced library: libc++abi

when compiling opencv.js

emaxx-google added a commit to GoogleChromeLabs/chromeos_smart_card_connector that referenced this issue Apr 7, 2021
Suppress the link-time warning emitted by Emscripten due to our usage of
multi-threading together with dynamic memory growth:

  warning: USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code
  slowly, see WebAssembly/design#1271

We took this warning into consideration and decided to trade the
performance in favor of lower memory consumption. (In reality, the
performance loss is likely small, since a big chunk of the run time is
spent on interprocess communication.)

So reduce the build log spam by silencing this warning, which was
emitted many times - once for every target binary or static library.
This change is a small improvement to the WebAssembly build
support that is tracked by #177.
@arsnyder16

@kripken I was playing around with your benchmark, and I think I came up with an option to reduce the overhead behind warning: USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly

Instead of always relying on the array views, have a native function to call into which can cast the memory address; since this is on the wasm side, there is no need to check whether the underlying buffer has changed.

Here is the new benchmark.cpp
#include <iostream>
#include <string>
#include <chrono>
#include <cstdint>
#include <emscripten.h>

extern "C" {
EMSCRIPTEN_KEEPALIVE uint8_t view_uint8(void* address) { return reinterpret_cast<uint8_t*>(address)[0]; }
}

int main() {
  auto start = std::chrono::steady_clock::now();
  size_t total = 0;
  for (const char* str : { "hello", "world", "!", "test", "ing", "strings", "many", "of", "them" }) {
    total += EM_ASM_INT({
      var total = 0;
      var value;
      for (var i = 0; i < 1000000; ++i) {
        value = AsciiToString($0);
        total += value.length;
      }
      console.log(value);
      return total;
    }, str);
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count();
  std::cout << total << " in " << ms << " ms " << total / ms << " /ms \n";
}
NO_ALLOW_MEMORY_GROWTH
~/emscripten/benchmark ((3.1.7))# em++ -Os -pthread -sNO_ALLOW_MEMORY_GROWTH -o benchmark.js benchmark.cpp
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 293 ms 119453 /ms 
ALLOW_MEMORY_GROWTH
~/emscripten/benchmark ((3.1.7))# em++ -Os -pthread -sALLOW_MEMORY_GROWTH -o benchmark.js benchmark.cpp 
em++: warning: USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly, see https://github.com/WebAssembly/design/issues/1271 [-Wpthreads-mem-growth]
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 1328 ms 26355 /ms 

Swapping out AsciiToString with:

function AsciiToString(ptr) {
  var str = "";
  while (1) {
    var ch = Module['_view_uint8'](ptr++ >> 0) >>> 0;
    if (!ch) return str;
    str += String.fromCharCode(ch);
  }
}
Results!
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 408 ms 85784 /ms 

@kripken
Member Author

kripken commented Apr 6, 2022

@arsnyder16 Interesting, thanks, I'm impressed and surprised that calling into wasm is faster here. I would have guessed otherwise. But I guess optimizations on the JS side are hard to predict...

qtprojectorg pushed a commit to qt/qtbase that referenced this issue Sep 1, 2022
Unify the settings for single-threaded and multi-threaded builds;
Qt now always enables heap growth by default.

This means we don't have to reserve a large (1GB) fixed memory
size, but can instead set the smaller (50 MB) initial memory size,
like the single-threaded build does.

Enabling threads + memory growth can potentially cause
a performance regression when accessing heap memory from
JavaScript (WebAssembly/design#1271).
We leave it for the application to decide if this applies,
and if the switch to fixed memory should be made.

Change-Id: I96988b072506456685086e55aca4007a146bd70f
Reviewed-by: Alexandru Croitor <alexandru.croitor@qt.io>
Reviewed-by: Mikołaj Boc <Mikolaj.Boc@qt.io>
Reviewed-by: Lorn Potter <lorn.potter@gmail.com>
@arsnyder16

@kripken Any progress on this issue? It seems like it might not be addressed in the short term.

I am wondering if my AsciiToString suggestion would make sense to integrate into Emscripten as an improvement: instead of the GROWABLE_HEAP functions, make a call into a wasm C function?

@kripken
Member Author

kripken commented Jan 24, 2023

@arsnyder16 Let's move that topic to an issue on the emscripten tracker? (as this is the wasm design repo, and your idea is an implementation detail) But in general it sounds interesting, if we see clear performance benefits.

@arsnyder16

@kripken Will do.

@ghnp5

ghnp5 commented Oct 28, 2024

Hey! Is there any update on this? :)
Thanks!

@JCash

JCash commented Apr 3, 2025

I hope it's ok to ask for current status every 6 months or so :)
Any status updates?
/thanks

@dschuff
Member

dschuff commented May 8, 2025

I think this is resolved by RAB/GSAB (https://github.com/tc39/proposal-resizablearraybuffer)
There are a couple of open questions (see discussion in the recent CG meeting) but those can be separate design issues or further CG discussions. I also filed emscripten-core/emscripten#24287 to implement this in Emscripten.
So, I think this issue can be closed.

@dschuff dschuff closed this as completed May 8, 2025