
C++20 coroutines + Native WebAssembly promise integration #20413

Closed
unit-404 opened this issue Oct 8, 2023 · 10 comments · Fixed by #20420

Comments

@unit-404

unit-404 commented Oct 8, 2023

On the C++20 side we have coroutines; on the WebAssembly side (at least in Chrome) we have native promise integration. Why not unify these features into a new and better async/await API? Perhaps as some sort of wrapper or emulation, or a coroutine header with a low-level implementation?

@tlively
Member

tlively commented Oct 8, 2023

Yes, this is something I'd like to do. See emscripten/promise.h for the C API we have for interfacing with JS promises, including via JSPI. The next steps would be to create a C++11 wrapper API to take care of the resource management, then a C++20 coroutine API on top of that.

Is this something you would be interested in working on? I'd be happy to review patches if so.

@RReverser
Collaborator

RReverser commented Oct 8, 2023

Ha, I actually literally implemented this last week for Embind and was going to send a PR next week. Just randomly saw this issue.

@RReverser
Collaborator

RReverser commented Oct 8, 2023

> See emscripten/promise.h for the C API we have for interfacing with JS promises, including via JSPI.

Btw, I saw that experimental API too, but it's pretty low-level, and it's a shame that it keeps its own list of promise handles, since that makes it harder to integrate with Embind's emscripten::val.

I decided to go with the latter as it's more useful for complex JS interactions. I'd be happy to chat more about it e.g. in Discord.

@RReverser
Collaborator

Note that JSPI is orthogonal / unnecessary for coroutines.

JSPI, like Asyncify, is useful for pausing the entire program, whereas in the case of coroutines all the transformation magic happens at compile time and only the local coroutine itself is paused, so the Wasm engine doesn't need to know about promises and pausing.

@RReverser
Collaborator

> Ha, I actually literally implemented this last week for Embind and was going to send a PR next week. Just randomly saw this issue.

Yeah that's why I implemented mine via Embind instead - it supports passing promises from and to JS. I'll try to submit a PR soon.

RReverser added a commit to RReverser/emscripten that referenced this issue Oct 9, 2023
This adds support for `co_await`-ing Promises represented by `emscripten::val`.

The surrounding coroutine should also return `emscripten::val`, which will
be a promise representing the whole coroutine's return value.

Note that this feature uses LLVM coroutines and so doesn't depend on
either Asyncify or JSPI. It doesn't pause the entire program, but only
the coroutine itself, so it serves somewhat different use cases even though
all those features operate on promises.

Nevertheless, if you are not implementing a syscall that must behave as if
it were synchronous, but instead simply want to await some async operations
and return a new promise to the user, this feature will be much more efficient.

Here's a simple benchmark measuring runtime overhead from awaiting on a no-op Promise
repeatedly in a deep call stack:

```cpp
#include <emscripten/bind.h>
#include <emscripten/em_js.h>
#include <emscripten/val.h>

using namespace emscripten;

// clang-format off
EM_JS(EM_VAL, wait_impl, (), {
  return Emval.toHandle(Promise.resolve());
});
// clang-format on

val wait() { return val::take_ownership(wait_impl()); }

val coro_co_await(int depth) {
  co_await wait();
  if (depth > 0) {
    co_await coro_co_await(depth - 1);
  }
  co_return val();
}

val asyncify_val_await(int depth) {
  wait().await();
  if (depth > 0) {
    asyncify_val_await(depth - 1);
  }
  return val();
}

EMSCRIPTEN_BINDINGS(bench) {
  function("coro_co_await", coro_co_await);
  function("asyncify_val_await", asyncify_val_await, async());
}
```

And the JS runner also comparing with pure-JS implementation:

```js
import Benchmark from 'benchmark';
import initModule from './async-bench.mjs';

let Module = await initModule();
let suite = new Benchmark.Suite();

function addAsyncBench(name, func) {
  suite.add(name, {
    defer: true,
    fn: (deferred) => func(1000).then(() => deferred.resolve()),
  });
}

for (const name of ['coro_co_await', 'asyncify_val_await']) {
  addAsyncBench(name, Module[name]);
}

addAsyncBench('pure_js', async function pure_js(depth) {
  await Promise.resolve();
  if (depth > 0) {
    await pure_js(depth - 1);
  }
});

suite
  .on('cycle', function (event) {
    console.log(String(event.target));
  })
  .run({async: true});
```

Results with regular Asyncify (I had to bump up `ASYNCIFY_STACK_SIZE` to accommodate said deep stack):

```bash
> ./emcc async-bench.cpp -std=c++20 -O3 -o async-bench.mjs --bind -s ASYNCIFY -s ASYNCIFY_STACK_SIZE=1000000
> node --no-liftoff --no-wasm-tier-up --no-wasm-lazy-compilation --no-sparkplug async-bench-runner.mjs

coro_co_await x 727 ops/sec ±10.59% (47 runs sampled)
asyncify_val_await x 58.05 ops/sec ±6.91% (53 runs sampled)
pure_js x 3,022 ops/sec ±8.06% (52 runs sampled)
```

Results with JSPI (I had to disable `DYNAMIC_EXECUTION` because I was getting "RuntimeError: table index is out of bounds" in random places depending on optimisation mode - JSPI miscompilation?):

```bash
> ./emcc async-bench.cpp -std=c++20 -O3 -o async-bench.mjs --bind -s ASYNCIFY=2 -s DYNAMIC_EXECUTION=0
> node --no-liftoff --no-wasm-tier-up --no-wasm-lazy-compilation --no-sparkplug --experimental-wasm-stack-switching async-bench-runner.mjs

coro_co_await x 955 ops/sec ±9.25% (62 runs sampled)
asyncify_val_await x 924 ops/sec ±8.27% (62 runs sampled)
pure_js x 3,258 ops/sec ±8.98% (53 runs sampled)
```

So this is much faster than regular Asyncify, and on par with JSPI.

Fixes emscripten-core#20413.
@RReverser
Collaborator

See #20420.

@RReverser
Collaborator

Ah right, I guess those are two slightly different issues - adding coroutine support for em_promise_t and for JavaScript values in Embind.

RReverser added a commit to RReverser/emscripten that referenced this issue Oct 9, 2023
@unit-404
Author

unit-404 commented Oct 10, 2023

Sometimes I think about an extension or plugin API: for Embind, for the JSPI API, etc. Also, I'd prefer to make my own extension package.

@RReverser
Collaborator

> for JSPI API

FWIW, I know I mentioned it before, but if you're referring to em_promise_*, it's not a JSPI API. It's just a JavaScript/C library, just like Embind: https://github.com/emscripten-core/emscripten/blob/main/src/library_promise.js

Both of those libraries can use JSPI to await promises when compiled with -s ASYNCIFY=2, but otherwise the only difference between them is the provided API and value representation.

That is, I don't want to discourage you; I just wanted to clarify, because it sounds like you might think it's a low-level API for JSPI. They are both high-level libraries implemented in JavaScript that expose C/C++ bindings: one works on JS promises created from C/C++, and the other works on JS values (including promises) received from JS.

@unit-404
Author

unit-404 commented Oct 10, 2023

Meanwhile... I get the following error when trying to use the promise C API with -sASYNCIFY=2; with any other setting it works.

```
C:\***\EMX\test\js>call node --wasm-stack-switching-stack-size=1000 --experimental-wasm-modules --experimental-wasm-memory64 --experimental-modules --experimental-wasi-unstable-preview1 test.mjs --input-type=module
Aborted(Assertion failed: Missing __sig for invoke_djjj)
file:///C:/***/EMX/test/cxx/test.js:154
      throw ex;
      ^

RuntimeError: Aborted(Assertion failed: Missing __sig for invoke_djjj)
    at abort (file:///C:/***/EMX/test/cxx/test.js:684:11)
    at assert (file:///C:/***/EMX/test/cxx/test.js:396:5)
    at file:///C:/***/EMX/test/cxx/test.js:5078:17
    at Object.instrumentWasmImports (file:///C:/***/EMX/test/cxx/test.js:5090:13)
    at file:///C:/***/EMX/test/cxx/test.js:5621:10
    at async file:///C:/***/EMX/test/js/test.mjs:5:15

Node.js v20.7.0
```

My makefile:

```make
____:
    EMCC_DEBUG=1 $(CC) -I./include \
        -I ../../src/cxx/ \
        -c ./test.cpp \
        -std=c++23 -sASYNCIFY=2 \
        -DHALF_ENABLE_CPP11_CFENV=false \
        -sNO_DISABLE_EXCEPTION_CATCHING \
        -sDEMANGLE_SUPPORT=1 -sASSERTIONS -frtti \
        -Wno-limited-postlink-optimizations \
        -sALLOW_TABLE_GROWTH=1 \
        -O0 -msimd128 --no-entry -sRESERVED_FUNCTION_POINTERS=1 --target=wasm64

    EMCC_DEBUG=1 $(CC) -g ./test.o -o ./test.js \
        -I ../../src/cxx/ \
        -std=c++23 -sASYNCIFY=2 \
        -Wno-limited-postlink-optimizations \
        -O0 -msimd128 --no-entry -sRESERVED_FUNCTION_POINTERS=1 --target=wasm64 \
        -sALLOW_MEMORY_GROWTH=1 \
        -sSINGLE_FILE -sTOTAL_MEMORY=4MB \
        -sEXPORT_ES6=1 \
        -sNODERAWFS=0 \
        -sUSE_ES6_IMPORT_META=1 \
        -sNO_DISABLE_EXCEPTION_CATCHING \
        -sDEMANGLE_SUPPORT=1 -sASSERTIONS -frtti \
        -sEXPORTED_RUNTIME_METHODS="['addFunction']" \
        -sALLOW_TABLE_GROWTH=1 \
        -sEXPORTED_FUNCTIONS="[\
            '_malloc', '_free', '_calloc', \
            '_testPromise', \
            '_emx_promise_then', '_emx_promise_resolve', '_emx_promise_create'\
        ]"
```

RReverser added a commit to RReverser/emscripten that referenced this issue Oct 16, 2023
RReverser added a commit to RReverser/emscripten that referenced this issue Nov 3, 2023
RReverser added a commit to RReverser/emscripten that referenced this issue Nov 3, 2023
RReverser added a commit to RReverser/emscripten that referenced this issue Nov 3, 2023