Skip to content

Make Node.js pthreads behavior in line with native #19073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 22, 2023

Conversation

kleisauke
Copy link
Collaborator

@kleisauke kleisauke commented Mar 26, 2023

By marking workers as weakly referenced. This corresponds to the ideal scenario as mentioned in commit a9cbf47.

(originally based on commit e6f86a8 and 4b4faf9)


This passes the test suite locally, except for test_pthread_abort_interrupt. Fixed with commit ca5f97c.

/cc @RReverser

@RReverser
Copy link
Collaborator

FWIW I'm definitely in favour of this change, but, to be clear, one concern was/is that it's potentially a breaking one: someone who isn't even using PROXY_TO_PTHREAD, might rely on current Emscripten behaviour of app staying alive as long as some pthread is and using some out-of-band communication between the two.

Hopefully no one is really doing this in the wild, and there isn't really that many Node.js users of Emscripten anyway, but, unlike with previous linked iterations, there is some small risk involved.

kleisauke added a commit to kleisauke/wasm-vips that referenced this pull request Mar 28, 2023
@@ -2884,7 +2884,7 @@ def test_pthread_abort(self):

@node_pthreads
def test_pthread_abort_interrupt(self):
self.set_setting('PTHREAD_POOL_SIZE', 1)
self.set_setting('PROXY_TO_PTHREAD')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would want to investigate why this is needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a bit curious that this is needed. Linking with -sPTHREAD_POOL_SIZE=1 and -sEXIT_RUNTIME doesn't work either.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular test case is calling abort() in its pthread entry point, which causes a WebAssembly.RuntimeError to be thrown that is supposed to be propagated back to the main thread via the 'error' event. However, the worker is calling _emscripten_thread_crashed() just before throwing this exception:

emscripten/src/worker.js

Lines 278 to 281 in 574f4f4

if (Module['__emscripten_thread_crashed']) {
Module['__emscripten_thread_crashed']();
}
throw ex;

As a result, the main thread immediately exits while the runtime is kept alive:

if (thread_crashed) {
// Return the event loop so we can handle the message from the crashed
// thread.
emscripten_exit_with_live_runtime();
}

Since the worker is weakly referenced, the program exits too early, causing the test to fail.


A possible solution to this is to make the thread strongly referenced during pthread_join():

// Mark the thread as strongly referenced so that Node.js doesn't exit
// while the pthread is running.
_emscripten_thread_set_strongref(t, 1);

This ensures that the program does not exit too early when a joinable thread crashes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit ca5f97c improves this.

@@ -49,5 +49,8 @@ EMSCRIPTEN_KEEPALIVE int _emscripten_proxy_main(int argc, char** argv) {
pthread_t thread;
int rc = pthread_create(&thread, &attr, _main_thread, NULL);
pthread_attr_destroy(&attr);
if (rc == 0) {
emscripten_thread_set_strongref(thread, 1);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a call to runtimeKeepalivePop in exitOnMainThread that is supposed to take are of bringing down the app once the proxy'd main ends.

I have an stalled change that tried to recfactor that: #18770

The corresponding push is in callMain:

#if PROXY_TO_PTHREAD                                                             
  // With PROXY_TO_PTHREAD make sure we keep the runtime alive until the         
  // proxied main calls exit (see exitOnMainThread() for where Pop is called).   
  {{{ runtimeKeepalivePush() }}}                                                 
#endif   

On other words aren't already forcing the runtime to stay alive as long as the proxied main is running?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On other words aren't already forcing the runtime to stay alive as long as the proxied main is running?

IIUC, the runtimeKeepalivePush() / runtimeKeepalivePop() mechanism would only keep the Emscripten runtime alive, but still allows Node.js to exit (i.e. it doesn't keep the event loop of Node.js alive). For example:

$ cat <<EOT > test.c
#include <emscripten/emscripten.h>

int main() {
  emscripten_exit_with_live_runtime();
  // Should never get here
  __builtin_trap();
}
EOT
$ emcc -O3 test.c -o test.js -sEXIT_RUNTIME
$ node test.js
$ echo $?
0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in scope of this PR, but I think one should imply the other. I was pretty surprised when I ran into that behaviour for the first time and noticed that runtimeKeepAlivePush didn't keep Node from exiting.

worker.ref();
// Once we start executing a pthread, we need to mark the worker as weakly
// referenced, this ensures that Node.js may exit while a pthread is running.
worker.unref();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just set them to unreferenced on first creation then? Or are you saying that an unused worker is strongly referenced and used one only weakly? That seems a odd to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just set them to unreferenced on first creation then?

I considered making the worker weakly referenced during the 'loaded' event, i.e.:

Patch
--- a/src/library_pthread.js
+++ b/src/library_pthread.js
@@ -287,11 +287,9 @@ var LibraryPThread = {
         } else if (cmd === 'loaded') {
           worker.loaded = true;
 #if ENVIRONMENT_MAY_BE_NODE
-          // Check that this worker doesn't have an associated pthread.
-          if (ENVIRONMENT_IS_NODE && !worker.pthread_ptr) {
-            // Once worker is loaded & idle, mark it as weakly referenced,
-            // so that mere existence of a Worker in the pool does not prevent
-            // Node.js from exiting the app.
+          if (ENVIRONMENT_IS_NODE) {
+            // Once the worker is loaded, mark it as weakly referenced, so
+            // that its existence does not prevent Node.js from exiting.
             worker.unref();
           }
 #endif
@@ -686,13 +684,6 @@ var LibraryPThread = {
     // in this file, and not from the external worker.js.
     msg.moduleCanvasId = threadParams.moduleCanvasId;
     msg.offscreenCanvases = threadParams.offscreenCanvases;
-#endif
-#if ENVIRONMENT_MAY_BE_NODE
-    if (ENVIRONMENT_IS_NODE) {
-      // Mark worker as weakly referenced once we start executing a pthread,
-      // so that its existence does not prevent Node.js from exiting.
-      worker.unref();
-    }
 #endif
     // Ask the worker to start executing its pthread entry point function.
     worker.postMessage(msg, threadParams.transferList);

But this 'loaded' event could occur after pthread_create() returns, which would mean it could break such constructions:

pthread_create(&thread, NULL, worker_thread, NULL);
_emscripten_thread_set_strongref(thread, 1);

(i.e. the same construction that PROXY_TO_PTHREAD currently uses).

Or are you saying that an unused worker is strongly referenced and used one only weakly?

I'm not sure if I understand this question, but idle/unused workers (i.e. workers not hosting a pthread) are already weakly referenced within the pool.

@@ -18,6 +18,9 @@ static int __pthread_timedjoin_np(pthread_t t, void **res, const struct timespec
// TODO: The detached check here is just to satisfy the
// `other.test_{proxy,main}_pthread_join_detach` tests.
if (t->detach_state != DT_DETACHED && __pthread_self() == t) return EDEADLK;
// Mark the thread as strongly referenced so that Node.js doesn't exit
// while the pthread is running.
_emscripten_thread_set_strongref(t, 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does one thread joining with another thread effect whether we want to keep node alive or not?

Sure if the main thread returns in brings down all threads even if they are waiting on each other?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a workaround for the failing test_pthread_abort_interrupt testcase. Commit ca5f97c improves this and removes the need to do this in pthread_join().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context: #19073 (comment)

@@ -575,6 +575,15 @@ var LibraryPThread = {
else postMessage({ 'cmd': 'cleanupThread', 'thread': thread });
},

_emscripten_thread_set_strongref: function(thread, strongRef) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be marked as __proxy: sync since I don't see how it could work on anything but the main runtime thread?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, marked as __proxy: 'sync' with commit d123a49. Though, we can be sure that the current call sites are only executed on the main runtime thread after commit ca5f97c.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this __proxy: 'sync' annotation with commit 059448a as this function is now only called on the main runtime thread.

// Mark worker as strongly referenced once we start executing a pthread,
// so that Node.js doesn't exit while the pthread is running.
worker.ref();
// Mark worker as weakly referenced once we start executing a pthread,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I don't understand about this comment "Mark worker as weakly referenced once we start executing a pthread" is that workers are already weakly referenced when they are in the pool, so why does anything need to change once we start executing a pthread. They are weakly referenced both before and after they start executing a thread, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, only idle/unused workers are weakly referenced within the pool (i.e. when -sPTHREAD_POOL_SIZE=x is specified).

// Once worker is loaded & idle, mark it as weakly referenced,
// so that mere existence of a Worker in the pool does not prevent
// Node.js from exiting the app.
worker.unref();

// Once a pthread has finished and the worker becomes idle, mark it
// as weakly referenced so that its existence does not prevent Node.js
// from exiting.
worker.unref();

But when -sPTHREAD_POOL_SIZE=x is omitted (which should be safe on Node.js after commit 73eaf81) or when an application spawns more threads than specified within that setting (i.e. thread pool exhaustion), workers are strongly referenced by default, so this .unref() is still necessary for those cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instread fix the workers are strongly referenced by default thing?

IRC the intent prior this change is that they should be weakly referenced until they are running a pthread, and then strongly referenced one a pthread is running.

After this change they should never be strongly referenced so should be .unref() as soon as we create them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to simplify this with commit 059448a.

After this change they should never be strongly referenced so should be .unref() as soon as we create them?

AFAIK it's only safe to .unref() the worker during the 'loaded' event or just before executing the pthread entry point function (i.e. only after the .on('message') listeners are attached), it will not work if you .unref() too early. See for example this program:

Details
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  const worker = new Worker(__filename);
  worker.on('error', (err) => { throw err; });
  worker.on('exit', () => console.log('exit'));
  worker.on('message', (msg) => console.log(msg))
  worker.postMessage('From main thread!')
  /* setTimeout(() => */worker.unref()/* , 100); */
} else {
  parentPort.on('message', (msg) => console.log(msg));
  parentPort.postMessage('From worker!');
}

As a possible enhancement, perhaps we could make PROXY_TO_PTHREAD no-op on Node.js (i.e. just call the original main)? This would be safe after commit 73eaf81, and would allow further simplification.


void _emscripten_thread_crashed() {
thread_crashed = true;
crashed_thread_id = pthread_self();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not just call _emscripten_thread_set_strongref(pthread_self()) here and avoid the change to _emscripten_yield below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but then I noticed this comment:

// When a secondary thread crashes, we need to be able to interrupt the main
// thread even if it's in a blocking/looping on a mutex. We want to avoid
// using the normal proxying mechanism to send this message since it can
// allocate (or otherwise itself crash) so use a low level atomic primitive
// for this signal.

So since this function is called from a crashed thread, I don't think we can use the normal proxying mechanism here (i.e. __proxy: 'sync') .

// as weakly referenced so that its existence does not prevent Node.js
// from exiting.
#if ENVIRONMENT_MAY_BE_NODE && PROXY_TO_PTHREAD
if (ENVIRONMENT_IS_NODE && worker.proxied_main) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if this read more like:

// Even threads that are strongly referenced, such as the proxied main thread
// or threads that are being kept alive long enough to proxy their error state
// should be unref() once that are returned to the poll
if (worker.strongly_referenced) {
   worker.unref();
   worker.is_strongly_referenced = false;
}

Then _emscripten_thread_set_strongref could set is_strongly_referenced to true and PROZY_TO_PTHREAD could explictly call _emscripten_thread_set_strongref on main's pthread instead of adding the magic extra argument to __pthread_create_js. ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better I think we can just unconditionally call worker.unref(); here! Since according the docs "If the worker is already unref()ed calling unref() again has no effect." (https://nodejs.org/api/worker_threads.html#workerunref)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that extra argument added to __pthread_create_js was a bit of an weird way to identify the proxied main. I reverted those commits and instead added a comment to clarify that .ref()/.unref() is backed by a boolean.

By marking workers as weakly referenced. This corresponds to the
ideal scenario as mentioned in commit a9cbf47.

(originally based on commit e6f86a8 and 4b4faf9)
This change was generated using:
$ ./test/runner other.*code_size* other.*metadce* --rebase
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've very excited to see this change land!

_emscripten_thread_set_strongref: function(thread) {
#if ENVIRONMENT_MAY_BE_NODE
if (!ENVIRONMENT_IS_NODE) return;
PThread.pthreads[thread].ref();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a two line function like this I think maybe the positive check is more readable:

    if (ENVIRONMENT_IS_NODE) {
       PThread.pthreads[thread].ref();
    }

(and maybe fewer bytes of code?)

Copy link
Collaborator Author

@kleisauke kleisauke Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with 8f2bc84, there were no changes in the code size tests.

if (rc == 0) {
// Mark the thread as strongly referenced, so that Node.js doesn't exit
// while the pthread is running.
_emscripten_thread_set_strongref(thread);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one alternative to adding this new API would be to add a custom pthread attr to signal this? However, since this is an internal API we can always experiment with changing that later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with a custom pthread attr (similar to emscripten_pthread_attr_settransferredcanvases()), but that was more complicated than doing this. _emscripten_thread_set_strongref() is also used for crashed threads, so it can't be removed in favor of a custom pthread attr.

// Once a pthread has finished and the worker becomes idle, mark it
// as weakly referenced so that its existence does not prevent Node.js
// from exiting.
// Once the proxied main thread has finished, mark it as weakly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here is specific to the proxied main but the _emscripten_thread_set_strongref seems like a generic function.

Perhaps the comment could say

// In case a thread was marked as strongly referenced using
// _emscripten_thread_set_strongref (for example, the proxied main thread)
// we need to ensure that its un-referenced when we the thread exits and
// the works is returned to the pool

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This .unref() is conditional guarded by PROXY_TO_PTHREAD, so I think the comment still makes sense.

In other words, we can be sure that all threads are already weakly referenced when the worker is returned to the pool, expect for the "main" thread in PROXY_TO_PTHREAD mode; since that one is strongly referenced.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I hadn't noticed that this block was now only PROXY_TO_PTHREAD.

@@ -571,6 +572,13 @@ var LibraryPThread = {
else postMessage({ 'cmd': 'cleanupThread', 'thread': thread });
},

_emscripten_thread_set_strongref: function(thread) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a comment here about how the sole current use case for this function is for keeping the "main" thread alive in PROXY_TO_PTHREAD mode?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with 8f2bc84.

// Mark the crashed thread as strongly referenced so that Node.js doesn't
// exit while the pthread is propagating the uncaught exception back to
// the main thread.
_emscripten_thread_set_strongref(crashed_thread_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this change might help with the race conditions we've been seeing in certain pthread tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is possible that by the time we get here, the crashed_thread_id has already be cleaned up and that crashed_thread_id is no longer a valid thread pointer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this change might help with the race conditions we've been seeing in certain pthread tests.

Did you mean issue #15014? I think that race condition is still there, I just responded to that issue.

Is is possible that by the time we get here, the crashed_thread_id has already be cleaned up and that crashed_thread_id is no longer a valid thread pointer?

Given that crashed threads are never cleaned up (i.e. they do not call _emscripten_thread_exit), I think we can be sure that the pthread and corresponding worker are still alive up once it arrives here.

There could be a possibility that the uncaught exception will be re-thrown on the main thread before it ever arrives here, but since the program will be finished by then, I don't think that would be an issue.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth adding a ChangeLog entry for?

Feel free to merge either way, we can always add one later. I'm excited to see this land.

@kleisauke kleisauke enabled auto-merge (squash) April 22, 2023 09:37
@kleisauke kleisauke merged commit 9757ac1 into emscripten-core:main Apr 22, 2023
@kleisauke kleisauke deleted the node-weak-ref branch April 22, 2023 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants