Prevent idle Workers from keeping Node.js app alive #18227

RReverser · 2022-11-17T17:08:26Z

Fixes #12801 for the majority of cases.

This is a relatively simple change, but it took embarrassingly many attempts to get it in the right places for all obscure tests to pass + to figure out which tests can make use of it instead of doing manual exit + to debug some apparent differences in Node Worker GC behaviour between Windows/Linux as a bonus.

I tried two approaches in parallel, a conservative one in this PR and one that brings Emscripten behaviour closer to native in a separate branch.

In an ideal scenario, I wanted to make Node.js apps behave just like native, where background threads themselves don't keep an app open, and instead the app lives as long as it explicitly blocks on pthread_join or other blocking APIs. However, it's a more disruptive change that still requires more work and testing, as some Emscripten use-cases implicitly depend on the app running despite not having any more foreground work to do - one notable example is PROXY_TO_PTHREAD, which spawns a detached thread but obviously wants the app to continue running. All those cases are fixable, but, as said above, they require more work, so I'm keeping it aside for now.

Instead, in this PR I'm adding a .ref/.unref "dance" (h/t @d3lm for the original idea) that keeps the app alive as long as any pthreads are running, whether joinable or detached, and whether you have explicit blocking on them or not. It works as follows:

1. Upon creation, all pool workers are strongly referenced, as we need to wait for them to be properly loaded.
2. Once a worker is loaded, it's marked as weakly referenced, as we don't want idle workers to prevent the app from exiting.
3. Once a worker is associated with a pthread and starts running it, it's marked as strongly referenced so that the app stays alive as long as it's doing some work.
4. Once a worker is done and returned to the idle worker pool, it's weakly referenced again.
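
The steps above can be sketched with a small stand-in pool. This is a minimal illustration, not the code from `src/library_pthread.js`: the class name `SketchPool`, its methods, the `unusedWorkers` field, and the `makeFakeWorker` helper are all hypothetical. The only real API assumed is that Node.js `worker_threads` Workers expose `.ref()` and `.unref()` and are ref'd by default.

```javascript
// Hypothetical sketch of the .ref/.unref lifecycle described above.
// Any object with ref()/unref() works here; in Node.js that would be a
// worker_threads Worker.
class SketchPool {
  constructor() {
    this.unusedWorkers = [];
    this.runningWorkers = [];
  }

  // Step 1: a new worker keeps its default strong reference while it loads.
  // Step 2: once loaded, it is idle, so it is unref'd and parked in the pool.
  onWorkerLoaded(worker) {
    worker.unref();
    this.unusedWorkers.push(worker);
  }

  // Step 3: a worker that starts running a pthread is strongly referenced again.
  startThread() {
    const worker = this.unusedWorkers.pop();
    if (!worker) throw new Error('sketch pool is exhausted');
    worker.ref();
    this.runningWorkers.push(worker);
    return worker;
  }

  // Step 4: when the pthread finishes, the worker returns to the idle pool
  // and is unref'd so it no longer keeps the process alive.
  returnToPool(worker) {
    const index = this.runningWorkers.indexOf(worker);
    if (index !== -1) this.runningWorkers.splice(index, 1);
    worker.unref();
    this.unusedWorkers.push(worker);
  }
}

// Stand-in worker that only records whether it would keep the process alive.
function makeFakeWorker() {
  return {
    keepsProcessAlive: true, // mirrors a fresh Worker being ref'd by default
    ref() { this.keepsProcessAlive = true; },
    unref() { this.keepsProcessAlive = false; },
  };
}
```

With real Workers the same calls would apply; the fake worker just makes the state transitions visible without spawning threads.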

This ensures maximum compatibility, while fixing the majority of common cases.

One usecase it doesn't fix is when a C/C++ app itself has an internal singleton threadpool (like one created by glib) - in this case there's no way for Emscripten to know that those "running" threads are actually semantically idle. This would be fixed by the more rigorous alternative implementation mentioned above, but, for now, such usecases can be relatively easily worked around with a bit of custom --pre-js that goes over all PThread.runningWorkers and marks them as .unrefd. That's what I did in an app I'm currently working on, and it works pretty well. To avoid reaching into JS internals, we might consider adding an emscripten_-prefixed API to allow referencing/unreferencing a Worker via a pthread_t instance from the C code, but for now I'm leaving it out of scope of this PR.
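
The `--pre-js` workaround mentioned above could look roughly like the following. This is a sketch under assumptions: `unrefRunningWorkers` is a hypothetical helper name, the shape of `PThread.runningWorkers` (an iterable of Node.js Workers) is inferred from the description rather than checked against Emscripten's source, and the right moment to call the helper (for example, once the app's internal thread pool has been created) is left to the application.

```javascript
// Hypothetical helper for a custom --pre-js file.
// `pthreadState` is assumed to be an object like Emscripten's PThread,
// with an iterable `runningWorkers` collection.
function unrefRunningWorkers(pthreadState) {
  let count = 0;
  for (const worker of pthreadState.runningWorkers) {
    // Browser Workers have no unref(); only Node.js Workers do.
    if (typeof worker.unref === 'function') {
      worker.unref();
      count++;
    }
  }
  // Number of workers that were marked as weakly referenced.
  return count;
}
```

In an Emscripten build this would presumably be invoked as `unrefRunningWorkers(PThread)`, assuming `PThread` is in scope where the pre-js code runs.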

Let me know if you have any questions.

sbc100 · 2022-11-17T17:40:26Z

I haven't had a chance to look at the code but thank you for working on this! I'm excited to get it fixed.

One question that popped into my mind: Would simply removing the thread pooling on node also fix the issue? Do we need thread pooling on node? If our node thread implementation is mostly for testing, perhaps we don't need to care about the startup cost of new threads and we can create a new worker each time? I guess the downside of doing that is that we have less parity with the browser tests, so we might catch fewer bugs in node tests?

Also, for the case of glib-based apps that use thread pools, wouldn't the easiest thing for them be to build with -sEXIT_RUNTIME, which I think also fixes the issue by terminating all threads on exit()?

sbc100

Nice work! Surprisingly simple fix in the end

test/test_core.py

src/library_pthread.js

RReverser · 2022-11-17T17:51:07Z

> Also, for the case of glib-based apps that use thread pools, wouldn't the easiest thing for them be to build with -sEXIT_RUNTIME, which I think also fixes the issue by terminating all threads on exit()?

For apps - yes - but as mentioned in the original issue, the biggest problem is porting libraries, where you can't just exit the runtime because there is no "main" function and, therefore, no "end" point of execution, but rather a bunch of exports that the user might call at any time.

> One question that popped into my mind: Would simply removing the thread pooling on node also fix the issue? Do we need thread pooling on node?

Well, yes, because they exhibit the same issue as in browsers when you block the event loop but a Worker is not created yet. E.g. if I run my example code in Node.js without worker pool, I'll see the typical

Before the thread
Tried to spawn a new thread, but the thread pool is exhausted.
This might result in a deadlock unless some threads eventually exit or the code explicitly breaks out to the event loop.
If you want to increase the pool size, use setting `-sPTHREAD_POOL_SIZE=...`.
If you want to throw an explicit error instead of the risk of deadlocking in those cases, use setting `-sPTHREAD_POOL_SIZE_STRICT=2`.

and the app will deadlock.

> If our node thread implementation is mostly for testing perhaps we don't need to care about the startup cost of new threads

I'm not sure what you mean by saying "mostly for testing". There are pretty real usecases for Wasm in Node.js environment, including for multithreaded Wasm.

sbc100 · 2022-11-17T17:55:12Z

> I'm not sure what you mean by saying "mostly for testing". There are pretty real usecases for Wasm in Node.js environment, including for multithreaded Wasm.

I guess I meant to ask that as a question. I'm not aware of any folks using emscripten-built modules under node in production, but that might simply be because they don't tend to file bugs here. I would love to support this use case, I just don't know how common it is today. Do you have some specific examples?

sbc100 · 2022-11-17T18:01:46Z

> Well, yes, because they exhibit the same issue as in browsers when you block the event loop but a Worker is not created yet. E.g. if I run my example code in Node.js without worker pool, I'll see the typical

I guess there are two reasons for having the worker pool:

1. My application is structured such that I don't run the event loop before needing access to a new thread.
2. The cost of worker creation is high enough that I don't want to pay for it on each and every pthread creation.

It's normally pretty obvious when (1) is the issue, but it's less clear when (2) is the issue.

Assuming we some day fix (1) in some other way (e.g. via -sASYNCIFY=2 or via a dedicated "worker-creation-worker") then that would only leave (2) as the reason to ever re-use workers, rather than just letting them die.

If we remove reason (1) do we still want to do pooling for reason (2)? I would guess the answer might be different for node vs browser but I don't know.

The other downside to never removing workers is that applications that don't use threads except for certain tasks will have those resources locked up for the lifetime of the application (i.e. the number of OS threads can go up, but never come down).

RReverser · 2022-11-17T18:12:25Z

> Do you have some specific examples?

The Squoosh Node.js usecase would definitely be one, and there were a couple of others I encountered over time. The one I'm working with right now - StackBlitz - might be a bit unusual yet pretty popular. It provides a full Node.js environment in the browser, so you can use arbitrary Node.js APIs, but you don't have access to browser APIs and you can't run native code, so that's where the Wasm Node.js target steps in and fills the gap.

> If we remove reason (1) do we still want to do pooling for reason (2)? I would guess the answer might be different for node vs browser but I don't know.

I saw that expressed in some issue before, but I'm sceptical it would be very different tbh. In both cases the cost is not negligible, because whether Node.js or browser, they both need to first load JS from an external source, create a new context, evaluate & potentially JIT compile the JS code etc. Sure, in the browser, if the JS is not cached (first visit), you might need to do the more expensive HTTP call too, but besides that they both need to do the ~same amount of extra work on top of the native pthread_create.

But, without having an alternative and doing measurements it's all just guesswork. Once we do have a working alternative and it proves fast enough, I'm as happy to get rid of the pthread pool as you are - it caused way too many problems over time :)

This is the workaround I mentioned in emscripten-core/emscripten#18227. Since we know all the threads in the app are part of the threadpool, they can be just weakly referenced, so that the existence of the Worker alone doesn't prevent Node.js from exiting, and instead it's the blocking that waits for results of specific ops that keeps the event loop alive. This allows getting rid of the non-JS-esque shutdown helper.

kripken

Nice!

RReverser · 2022-11-17T22:29:50Z

Why did test_run_wasi_sdk_output suddenly start failing, and pretty consistently at that? It wasn't before 😭

RReverser · 2022-11-17T22:30:51Z

Ok at least it's failing on main too: https://app.circleci.com/pipelines/github/emscripten-core/emscripten/24525/workflows/fb61d497-664c-4ad0-8f87-a6b919d1d91e/jobs/581744

kleisauke · 2022-11-17T23:20:45Z

> Why did test_run_wasi_sdk_output suddenly start failing, and pretty consistently at that?

wasm-ld: error: cannot open /root/emsdk/upstream/lib/clang/16/lib/wasi/libclang_rt.builtins-wasm32.a: No such file or directory
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

I think(?) it's caused by commit llvm/llvm-project@e1b88c8 which was auto-rolled with https://chromium.googlesource.com/emscripten-releases/+/4e2ffe94b04dbadfbca1687ab458d306b3414d13.

kleisauke · 2022-11-17T23:39:22Z

... #18231 :)

d3lm · 2022-11-18T10:31:15Z

Good job @RReverser, and thanks for the shoutout here. Really glad that we can fix this in Emscripten directly 🙌

d3lm

LGTM 👏

RReverser requested review from kripken and sbc100 November 17, 2022 17:08

sbc100 approved these changes Nov 17, 2022

sbc100 reviewed Nov 17, 2022

RReverser mentioned this pull request Nov 17, 2022

Mark all threadpool Workers as weakly referenced kleisauke/wasm-vips#29

Merged

sbc100 approved these changes Nov 17, 2022

kripken approved these changes Nov 17, 2022

RReverser changed the title ~~Add .ref/.unref dance to prevent idle Workers from keeping Node.js app alive~~ Prevent idle Workers from keeping Node.js app alive Nov 17, 2022

RReverser enabled auto-merge (squash) November 17, 2022 22:24

d3lm approved these changes Nov 18, 2022

RReverser and others added 8 commits November 18, 2022 12:01

Add .ref/.unref dance for Node.js workers

14e9ec1

Better placement for worker.unref()

2010a48

This fixes couple more tests.

Don't .unref if already need to run

1ae57db

Try to revert some tests

5d61105

One more option

a4047bb

.unref under Node only

1255f71

Try to revert one more test

6327c6b

Revert

3e532d7

RReverser added 4 commits November 18, 2022 12:01

One more attempt

6c7ac06

Add closure externs

89cf13d

Add comments to .ref/.unref

763c60b

Revert print -> logger.debug change

07fd976

RReverser force-pushed the node-worker-auto-exit-conservative branch from dabe2e8 to 07fd976 Compare November 18, 2022 12:01

RReverser merged commit a9cbf47 into main Nov 18, 2022

RReverser deleted the node-worker-auto-exit-conservative branch November 18, 2022 13:20

RReverser added a commit that referenced this pull request Nov 25, 2022

Add changelog for #18227

2ace1c5

Probably worth a highlight, but forgot to add it in the original PR.

sbc100 pushed a commit that referenced this pull request Nov 27, 2022

Add changelog for #18227 (#18260)

94c5984

Probably worth a highlight, but forgot to add it in the original PR.
