
Usage of pthread in a Node.js app never lets it exit #12801

Closed
RReverser opened this issue Nov 17, 2020 · 35 comments · Fixed by #18227

@RReverser
Collaborator

RReverser commented Nov 17, 2020

There is something wrong with the way Emscripten uses threads on Node.js. It looks like any usage prevents the Node app from ever exiting.

Just tried the simplest example:

#include <thread>
#include <iostream>

int main() {
	std::thread t([] {
		std::cout << "Hello from another thread\n";
	});
	t.join();
	return 0;
}

Compiled with:

> emcc temp.cc -o temp.js -pthread -s PTHREAD_POOL_SIZE=4

This results in:

> node --experimental-wasm-threads --experimental-wasm-bulk-memory temp
Hello from another thread
Pthread 0x705d10 exited.
[stuck here]

Looking at https://nodejs.org/api/worker_threads.html, I suspect Emscripten needs to call .unref() to indicate that it's okay to exit as soon as the reference is unreachable, but it probably doesn't do that yet?
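
For context, a minimal standalone sketch of the .unref() behaviour being suggested here (illustrative only, not Emscripten output; the inline worker body is just for the demo):

// unref-sketch.js - illustrative only, not what Emscripten emits.
const { Worker } = require('worker_threads');

// `eval: true` lets us inline the worker body for this demo.
const w = new Worker(`
  const { parentPort } = require('worker_threads');
  parentPort.on('message', (msg) => console.log('worker got', msg));
`, { eval: true });

// Without this call, the idle Worker keeps the Node event loop alive forever.
// With it, the process exits naturally once the main script finishes.
w.unref();

console.log('main script done');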

@emaxx-google

Is this supposed to work without PROXY_TO_PTHREAD? See https://emscripten.org/docs/porting/pthreads.html#additional-flags

@RReverser
Collaborator Author

@emaxx-google Yes. PROXY_TO_PTHREAD is a separate feature that is built on top of PThread emulation and allows moving main to a separate thread. This bug is about regular PThread emulation.

@emaxx-google

emaxx-google commented Nov 18, 2020

I'm not an Emscripten expert, but the same page describes in a bit more detail:

The Emscripten implementation for the pthreads API should follow the POSIX standard closely, but some behavioral differences do exist:

When pthread_create() is called, if we need to create a new Web Worker, then that requires returning the main event loop. That is, you cannot call pthread_create and then keep running code synchronously that expects the worker to start running - it will only run after you return to the event loop. This is a violation of POSIX behavior and will break common code which creates a thread and immediately joins it or otherwise synchronously waits to observe an effect such as a memory write.

@RReverser
Collaborator Author

@emaxx-google That's not the issue here. I already know how pthreads work in Emscripten :) What that paragraph describes is the reason you need PTHREAD_POOL_SIZE, which I'm already passing.

@juj
Collaborator

juj commented Nov 18, 2020

Try building with -s PTHREAD_POOL_SIZE=1 -s EXIT_RUNTIME=1 linker flags. Does that help?

@RReverser
Collaborator Author

@juj Hmm, it does, but that's not necessary in non-pthread builds. Why does adding pthreads make a difference?

@RReverser
Collaborator Author

[semi-answering my own question] IIRC the difference is that EXIT_RUNTIME forcibly does process.exit() or something like that, which is not exactly what we want here - what we really want is for Node to run the event loop to completion and exit naturally, but right now it seems it can't due to hanging Workers.

@juj
Collaborator

juj commented Nov 18, 2020

The behavior is likely caused by this line:

noExitRuntime = true;

If you remove that line, it should also cause the code to exit, even if you don't specify -s EXIT_RUNTIME=1. Does that also do a forcible process.exit()?

You can also try running the PThread.terminateAllThreads() JS function if you think the issue is caused by hanging Workers. That will tear down all Workers.
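
For illustration, calling that from Node might look roughly like this (a sketch only; createModule is the assumed -sMODULARIZE factory name, and depending on the Emscripten version PThread may need to be explicitly exported onto the Module object):

// Assumes the build was linked with -sMODULARIZE.
const createModule = require('./temp.js');

createModule().then((Module) => {
  // ... call into the module as needed ...
  // Tear down all pthread Workers so Node's event loop can drain and exit.
  Module.PThread.terminateAllThreads();
});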

@RReverser
Collaborator Author

If you remove that line, it should also cause the code to exit, even if you don't specify -s EXIT_RUNTIME=1. Does that also do a forcible process.exit()?

Yeah, they're equivalent AFAIK:

exit(ret, /* implicit = */ true);

@juj
Collaborator

juj commented Nov 18, 2020

That should be the only difference with respect to process teardown in single-threaded and multithreaded builds. If in that mode there is a forcible shutdown with process.exit() happening, then it should also be happening in non-multithreaded builds. I'm not sure where such a forcible shutdown is happening, though Emscripten at least does not emit a call to the process.exit() API into the generated build.

@RReverser
Collaborator Author

RReverser commented Nov 18, 2020

I actually now wonder if this is "expected behaviour"... if the issue is caused by hanging Workers, then there are only two choices:

  1. Emscripten tears them down at the end of main. Node.js exits gracefully when the JS is used as a process, but if a user imports this JS as a module and tries to call other functions, they can no longer [easily] create threads because the Worker pool is empty.
  2. Emscripten doesn't tear them down, the Workers remain hanging and can be used by further calls into the module, but when used as a process, Node.js can't shut down - this is what happens now.

I do think it's possible to get a middle ground by looking into that .unref() - Node.js should be smart enough to figure out that, when the main JS has finished execution, Workers can no longer receive messages and don't need to wait anymore. Right now it seems like they hang around in case something sends them a message and they need to do more work.

@RReverser
Collaborator Author

So yeah, I can reproduce by only loading the .worker.js part from a custom JS instead of Emscripten glue:

const {Worker} = require('worker_threads');
let w = new Worker('./temp.worker.js');

This also hangs forever, probably because Worker still waits for messages.

@RReverser
Collaborator Author

RReverser commented Nov 18, 2020

@addaleax I admit I don't fully understand the .ref() / .unref() mechanics of worker_threads, maybe you can help?

Basically, the question is - is there a way to make sure that Workers are torn down once the Worker object is GC'd?

Right now it looks like there's a reference cycle keeping both the main thread and the Worker alive.

@addaleax

Basically, the question is - is there a way to make sure that Workers are torn down once the Worker object is GC'd?

No, that never happens. Worker objects cannot be GC'ed before they are terminated, because they are GC roots: they receive events from the event loop.

I admit I don't fully understand the .ref() / .unref() mechanics of worker_threads, maybe you can help?

If you call .unref() on a Worker, it will not keep the event loop alive on its own, i.e. this works the same as .ref()/.unref() on network sockets, timers, etc.

This also hangs forever, probably because Worker still waits for messages.

That sounds very likely, yes, but I wouldn’t know what we could do about this on the Node.js side.
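
As an aside, a tiny illustrative sketch of the .ref()/.unref() semantics described above, using a timer for comparison:

// ref-unref-sketch.js - illustrative only.
const timer = setInterval(() => console.log('tick'), 1000);

// A referenced handle keeps the event loop alive; unref'ing it means the
// process may exit even though the handle is still active.
timer.unref();

// Calling timer.ref() again would restore the default behaviour.
console.log('process exits immediately despite the pending interval');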

@RReverser
Collaborator Author

Worker objects cannot be GC’ed before they are terminated because they are GC roots because they receive events from the event loop.

Huh. But that's not how they work in browsers? Can't Workers be tied to the parentPort instead, so that when that side is collected, the Worker stops waiting for messages too?

@addaleax

@RReverser What makes you think that browsers behave differently? I think the behavior you want would require cross-thread heap reference tracking, which doesn’t exist.

Can't Workers be tied to the parentPort instead so that, when that side is collected, Worker stops waiting for messages too?

parentPort also can't be GC'ed because it's also on the event loop: the parent thread might send a message at any point.

@RReverser
Collaborator Author

What makes you think that browsers behave differently?

I think I saw it somewhere, but I'll need to dig to find the reference.

the parent thread might send a message at any point

Right, but if the parent thread has exited (which it can, e.g. in my last example without an onmessage handler on the main thread), surely the Worker can be notified about that and get unreferenced?

Or, more generally, when the Worker object is collected, it could send a notification to the actual worker thread to stop listening?

@addaleax

@RReverser If the parent thread exits, that also forcibly stops all child threads, there’s no unreferencing possible or necessary.

Or, more generally, when Worker object is collected, it can send a notification to the actual worker thread to stop listening?

Just to reiterate, Worker objects do not get collected until they are actually stopped because they can generate events at any point, just like network sockets and other event loop objects.

@RReverser
Collaborator Author

If the parent thread exits, that also forcibly stops all child threads, there’s no unreferencing possible or necessary.

Hmm, then why is Node still hanging in the example above?

@addaleax

@RReverser Because the parent doesn’t exit, because the Worker is still there and can emit events.

@RReverser
Collaborator Author

But the parent is not subscribed to Worker events in this case? To clarify, I'm referring to:

So yeah, I can reproduce by only loading the .worker.js part from a custom JS instead of Emscripten glue:

const {Worker} = require('worker_threads');
let w = new Worker('./temp.worker.js');

This also hangs forever, probably because Worker still waits for messages.

@addaleax

Ahh, I see. I guess that’s a fair point then … in Node.js, what currently happens when there’s an uncaught exception inside a Worker is that an 'error' event is emitted, which is always visible behavior in the parent thread, so we’re always subscribed to that by default.

Would it be acceptable to explicitly use .unref() here?

@RReverser
Collaborator Author

Would it be acceptable to explicitly use .unref() here?

I don't know... I mean, it's certainly possible in some simple cases like above to try and find places where explicit .unref() is necessary, but it feels like manual memory management and kinda un-JavaScript-y.

In cases like Emscripten's it's a lot harder, because there is an actual reference cycle (main thread and a Worker waiting for each other) which traditionally can only be fixed with a GC (or hard termination).

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Apr 17, 2022
@wheresthecode

I'm pretty new to Emscripten and JavaScript altogether, but I also ran into this problem. I am compiling a library with no main function, and then I have a JavaScript file with some Jest tests in it. Once all my tests have run and I am certain all my threads have executed, I call PThread.terminateAllThreads, which allows Node to exit. The code looks like this:
// afterAll is a Jest function that is called after all tests have run
afterAll(() => {
createModule().then((m) => m.PThread.terminateAllThreads());
});

I'm not sure about the safety of doing this, but it has worked for me so far.

@stale stale bot removed the wontfix label Oct 15, 2022
@sbc100
Collaborator

sbc100 commented Oct 17, 2022

Does adding -sEXIT_RUNTIME to the command line also work for you @wheresthecode ?

@RReverser
Collaborator Author

I'm working on yet another project where this bites me and also had to apply a Mocha workaround like @wheresthecode's. @sbc100 Unfortunately, in a library case like @wheresthecode's, EXIT_RUNTIME doesn't help, because rather than using a main function, we want to stop the pthreads when the library is gone.

However, I no longer think Emscripten can fix this on its side, since it has no way of knowing whether the library will be called into again, and, thus, whether it will need to use threads from the pthread pool again :(

@sbc100
Collaborator

sbc100 commented Oct 27, 2022

I'm working on yet another project where this bites me and also had to apply a Mocha workaround like @wheresthecode's. @sbc100 Unfortunately, in a library case like @wheresthecode's, EXIT_RUNTIME doesn't help, because rather than using a main function, we want to stop the pthreads when the library is gone.

In that case perhaps the library could require exit() to be called explicitly, at which point all the threads would be killed. The normal C exit function could be exported for this purpose.
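
For example, a hedged sketch of that approach (the file and function names are hypothetical, the exact flags may vary by Emscripten version, and -sEXIT_RUNTIME=1 may also be needed so exit() actually runs the runtime teardown):

// Link step (assumed): export the C exit() so JS can call it explicitly, e.g.
//   emcc lib.cc -o lib.js -pthread -sPTHREAD_POOL_SIZE=4 -sMODULARIZE -sEXPORTED_FUNCTIONS=_exit,_my_library_entry
const createModule = require('./lib.js');

createModule().then((Module) => {
  // ... use the library ...
  // Explicit shutdown: exit() tears the runtime down, killing the pthread
  // Workers so Node can exit.
  Module._exit(0);
});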

However, I no longer think Emscripten can fix this on its side, since it has no way of knowing whether the library will be called into again, and, thus, whether it will need to use threads from the pthread pool again :(

@RReverser
Collaborator Author

However, I no longer think Emscripten can fix this on its side

I think I might have been convinced otherwise, and this is actually something that can be fixed on the Emscripten side; I'm not 100% sure yet, but I'm going to experiment over the next couple of days.

@kleisauke
Collaborator

While working on PR #18201, I noticed that using Wasm Workers (the -sWASM_WORKERS compile/link flag) didn't have this issue.

Somehow (perhaps due to the use of Atomics.waitAsync()?), it can also spawn two workers on main() without causing a deadlock (AFAIK, -pthread needs to link with either -sPTHREAD_POOL_SIZE=2 or -sPROXY_TO_PTHREAD for that).

worker[0] = emscripten_malloc_wasm_worker(/*stack size: */1024);
worker[1] = emscripten_malloc_wasm_worker(/*stack size: */1024);
emscripten_wasm_worker_post_function_v(worker[0], (void (*)(void))thread_main);
emscripten_wasm_worker_post_function_v(worker[1], (void (*)(void))thread_main);
// Terminate both workers after a small delay
emscripten_set_timeout(terminate_worker, 1000, 0);

So, an alternative route is to implement the pthread API on top of Wasm Workers as discussed in #12833 (comment).

@RReverser
Collaborator Author

RReverser commented Nov 17, 2022

Somehow (perhaps due to the use of Atomics.waitAsync()?), it can also spawn two workers on main() without causing a deadlock

Right, it doesn't deadlock because the API requirements are different from pthreads. Pthreads depend on being able to spawn threads synchronously and block on shared memory, so the Emscripten implementation has to make that possible too, but Wasm Workers don't have that compatibility restriction: it's a brand-new API and can require spawning to be async. #9910 would have a similar effect if implemented.

Anyway, I've been working on this and am going to send a PR fixing this issue for the most common use cases today, with some others either being fixed later or left to end users for now.

@kleisauke
Collaborator

Great, thanks for doing this!

It would indeed be great if issue #9910 is also addressed in the future. There's now a -sASYNCIFY=2 experimental mode which depends upon the https://github.com/WebAssembly/stack-switching proposal, so perhaps the Asyncify overhead would be minimal in the future. :)

@sbc100
Collaborator

sbc100 commented Nov 17, 2022

@brendandahl, another use case of -sASYNCIFY=2 here: the ability to start new threads/workers without returning to the event loop.

@RReverser
Collaborator Author

To be precise, #9910 is; this issue itself is not relevant to Asyncify.

RReverser added a commit that referenced this issue Nov 18, 2022
Fixes #12801 for the majority of cases.

This is a relatively simple change, but it took embarrassingly many attempts to get it in the right places for all the obscure tests to pass, to figure out which tests can make use of it instead of doing a manual exit, and, as a bonus, to debug some apparent differences in Node Worker GC behaviour between Windows and Linux.

I tried two approaches in parallel, a conservative one in this PR and one that brings Emscripten behaviour closer to native in a separate branch.

In the ideal scenario, I wanted to make Node.js apps behave just like native ones, where background threads themselves don't keep an app open, and instead the app lives as long as it explicitly blocks on `pthread_join` or other blocking APIs. However, that's a more disruptive change that still requires more work and testing, as some Emscripten use-cases implicitly depend on the app running despite not having any more foreground work to do - one notable example is `PROXY_TO_PTHREAD`, which spawns a detached thread but obviously wants the app to continue running. All those cases are fixable but, as said above, require more work, so I'm setting that aside for now.

Instead, in this PR I'm adding a .ref/.unref "dance" (h/t @d3lm for the original idea) that keeps the app alive as long as _any_ pthreads are running, whether joinable or detached, and whether you explicitly block on them or not. It works as follows:

- Upon creation, all pool workers are strongly referenced, as we need to wait for them to be properly loaded.
- Once a worker is loaded, it's marked as weakly referenced, as we don't want idle workers to prevent the app from exiting.
- Once a worker is associated with, and starts running, a pthread, it's marked as strongly referenced so that the app stays alive as long as it's doing some work.
- Once a worker is done and returned to the idle worker pool, it's weakly referenced again.

This ensures maximum compatibility while fixing the majority of common cases.

One use case it doesn't fix is when a C/C++ app itself has an internal singleton threadpool (like one created by glib) - in this case there's no way for Emscripten to know that those "running" threads are actually semantically idle. This would be fixed by the more rigorous alternative implementation mentioned above but, for now, such use cases can be relatively easily worked around with a bit of custom `--pre-js` that goes over all `PThread.runningWorkers` and marks them as `.unref()`'d. That's what I did in an app I'm currently working on, and it works pretty well. To avoid reaching into JS internals, we might consider adding an `emscripten_`-prefixed API to allow referencing/unreferencing a Worker via a `pthread_t` instance from C code, but for now I'm leaving it out of scope of this PR.
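
For reference, that `--pre-js` workaround might look roughly like this (a sketch only: `unrefPthreadWorkers` is a hypothetical helper name, and `PThread.runningWorkers` is an internal detail, not a stable public API):

// pre.js, passed via `emcc ... --pre-js pre.js`.
// Call this after the app has spawned its internal thread pool, so that
// idle-but-"running" Workers stop keeping the Node.js event loop alive.
Module['unrefPthreadWorkers'] = function () {
  PThread.runningWorkers.forEach((worker) => worker.unref());
};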