Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent idle Workers from keeping Node.js app alive #18227

Merged
merged 12 commits into from Nov 18, 2022

Conversation

RReverser
Copy link
Collaborator

Fixes #12801 for majority of cases.

This is a relatively simple change, but it took embarrassingly many attempts to get it in the right places for all obscure tests to pass + to figure out which tests can make use of it instead of doing manual exit + to debug some apparent differences in Node Worker GC behaviour between Windows/Linux as a bonus.

I tried two approaches in parallel, a conservative one in this PR and one that brings Emscripten behaviour closer to native in a separate branch.

In ideal scenario, I wanted to make Node.js apps behave just like native, where background threads themselves don't keep an app open, and instead app lives as long as it explicitly blocks on pthread_join or other blocking APIs. However, it's a more disruptive change that still requires more work and testing, as some Emscripten use-cases implicitly depend on the app running despite not having any more foreground work to do - one notable example is PROXY_TO_PTHREAD that spawns a detached thread, but obviously wants the app to continue running. All those cases are fixable, but, as said above, requires more work so I'm keeping it aside for now.

Instead, in this PR I'm adding a .ref/.unref "dance" (h/t @d3lm for the original idea) that keeps the app alive as long as any pthreads are running, whether joinable or detached, and whether you have explicit blocking on them or not. It works as following:

  • Upon creation, all pool workers are strongly referenced as we need to wait for them to be properly loaded.
  • Once worker is loaded, it's marked as weakly referenced, as we don't want idle workers to prevent app from exiting.
  • Once worker is associated with & starts running a pthread, it's marked as strongly referenced so that the app stays alive as long as it's doing some work.
  • Once worker is done and returned to the idle worker pool, it's weakly referenced again.

This ensures maximum compatibility, while fixing majority of common cases.

One usecase it doesn't fix is when a C/C++ app itself has an internal singletone threadpool (like one created by glib) - in this case there's no way for Emscripten to know that those "running" threads are actually semantically idle. This would be fixed by the more rigorous alternative implementation mentioned above, but, for now, such usecases can be relatively easily worked around with a bit of custom --pre-js that goes over all PThread.runningWorkers and marks them as .unrefd. That's what I did in an app I'm currently working on, and it works pretty well. To avoid reaching into JS internals, we might consider adding an emscripten_-prefixed API to allow referencing/unreferencing Worker via a pthread_t instance from the C code, but for now I'm leaving it out of scope of this PR.

Let me know if you have any questions.

@sbc100
Copy link
Collaborator

sbc100 commented Nov 17, 2022

I haven't had a chance to look at the code but thank you for working on this! I'm excited to get it fixed.

One question that popped into my mind: Would simply removing the thread pooling on node also fix the issue? Do we need thread pooling on node? If out node thread implementation is mostly for testing perhaps we don't need to care about the startup code of new threads and we can create a new worker each time? I guess the downside of doing that is that we have less parity with the browser tests so might catch fewer bugs in node tests?

Also, for the case of glib-based apps that use thread pools, wouldn't easiest thing do for them be to build with -sEXIT_RUNTIME, which I think also fixes the issue by terminating all threads on exit()?

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Surprisingly simple fix in the end

test/test_core.py Outdated Show resolved Hide resolved
@RReverser
Copy link
Collaborator Author

RReverser commented Nov 17, 2022

Also, for the case of glib-based apps that use thread pools, wouldn't easiest thing do for them be to build with -sEXIT_RUNTIME, which I think also fixes the issue by terminating all threads on exit()?

For apps - yes - but as mentioned in the original issue, the biggest problem is porting libraries, where you can't just exit runtime because there is no "main" function and, therefore, no "end" point of execution, but rather a bunch of exports that user might call at any time.

One question that popped into my mind: Would simply removing the thread pooling on node also fix the issue? Do we need thread pooling on node?

Well, yes, because they exhibit the same issue as in browsers when you block the event loop but a Worker is not created yet. E.g. if I run my example code in Node.js without worker pool, I'll see the typical

Before the thread
Tried to spawn a new thread, but the thread pool is exhausted.
This might result in a deadlock unless some threads eventually exit or the code explicitly breaks out to the event loop.
If you want to increase the pool size, use setting `-sPTHREAD_POOL_SIZE=...`.
If you want to throw an explicit error instead of the risk of deadlocking in those cases, use setting `-sPTHREAD_POOL_SIZE_STRICT=2`.

and the app will deadlock.

If out node thread implementation is mostly for testing perhaps we don't need to care about the startup code of new threads

I'm not sure what you mean by saying "mostly for testing". There are pretty real usecases for Wasm in Node.js environment, including for multithreaded Wasm.

@sbc100
Copy link
Collaborator

sbc100 commented Nov 17, 2022

I'm not sure what you mean by saying "mostly for testing". There are pretty real usecases for Wasm in Node.js environment, including for multithreaded Wasm.

I guess I meant to ask that as a question. I'm not aware of any folks using emscripten-built module under node in production, but that might simply be because they don't tend to file bug here. I would love to support this use case I just don't know how common it is today. Do have some specific examples?

@sbc100
Copy link
Collaborator

sbc100 commented Nov 17, 2022

Well, yes, because they exhibit the same issue as in browsers when you block the event loop but a Worker is not created yet. E.g. if I run my example code in Node.js without worker pool, I'll see the typical

I guess there are two reasons for having the worker pool:

  1. My application is structured such that I don't run the even loop before needing access to a new thread
  2. The cost of worker creation is high enough that I don't want to pay for it on each and every pthread creation.

Its is normally pretty obvious when (1) is the issue, but its less clear when (2) is the issue.

Assuming we some day fix (1) in some other way (e.g. via -sASYNCIFY=2 or via a dedicated "worker-creation-worker") then that would only leave (2) as the reason to ever re-use workers, rather than just letting them die.

If we remove reason (1) do we still want to do pooling for reason (2)? I would guess the answer might be different for node vs browser but I don't know.

The other downside to never removing workers is that applications that don't use threads excepts for certain tasks will have those resources locked up for the lifetime of the applications (i.e. the number of OS threads can go up, but never come down).

@RReverser
Copy link
Collaborator Author

RReverser commented Nov 17, 2022

Do have some specific examples?

Squoosh Node.js usecase would be definitely one, and there were couple of others I encountered over time. The one I'm working with right now - StackBlitz - might be a bit unusual yet pretty popular. It provides a full Node.js environment in browser, so you can use arbitrary the Node.js APIs, but you don't have access to browser APIs and you can't run native code, so that's where Wasm Node.js target steps in and fills the gap.

If we remove reason (1) do we still want to do pooling for reason (2)? I would guess the answer might be different for node vs browser but I don't know.

I saw that expressed in some issue before, but I'm sceptical it would be very different tbh. In both cases the cost is not negligible, because whether Node.js or browser, they both need to first load JS from external source, create new context, evaluate & potentially JIT compile the JS code etc. Sure, in browser if the JS is not cached (first visit), you might need to do the more expensive HTTP call too, but besides that they both need to do the ~same amount of extra work on top of the native pthread_create.

But, without having an alternative and doing measurements it's all just guesswork. Once we do have a working alternative and it proves fast enough, I'm as happy to get rid of the pthread pool as you are - it caused way too many problems over time :)

RReverser added a commit to RReverser/wasm-vips that referenced this pull request Nov 17, 2022
This is the workaround I mentioned in emscripten-core/emscripten#18227. Since we know all the threads in the app are part of the threadpool, they can be just weakly referenced, so that the existence of the Worker alone doesn't prevent Node.js from exiting, and instead it's the blocking that waits for results of specific ops that keeps the event loop alive.

This allows to get rid of non-JS-esque shutdown helper.
kleisauke pushed a commit to kleisauke/wasm-vips that referenced this pull request Nov 17, 2022
This is the workaround I mentioned in emscripten-core/emscripten#18227. Since we know all the threads in the app are part of the threadpool, they can be just weakly referenced, so that the existence of the Worker alone doesn't prevent Node.js from exiting, and instead it's the blocking that waits for results of specific ops that keeps the event loop alive.

This allows to get rid of non-JS-esque shutdown helper.
Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@RReverser RReverser changed the title Add .ref/.unref dance to prevent idle Workers from keeping Node.js app alive Prevent idle Workers from keeping Node.js app alive Nov 17, 2022
@RReverser RReverser enabled auto-merge (squash) November 17, 2022 22:24
@RReverser
Copy link
Collaborator Author

Why did test_run_wasi_sdk_output suddenly start falling, and pretty consistently at that? It wasn't before 😭

@RReverser
Copy link
Collaborator Author

@kleisauke
Copy link
Collaborator

Why did test_run_wasi_sdk_output suddenly start falling, and pretty consistently at that?

wasm-ld: error: cannot open /root/emsdk/upstream/lib/clang/16/lib/wasi/libclang_rt.builtins-wasm32.a: No such file or directory
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

I think(?) it's caused by commit llvm/llvm-project@e1b88c8 which was auto-rolled with https://chromium.googlesource.com/emscripten-releases/+/4e2ffe94b04dbadfbca1687ab458d306b3414d13.

@kleisauke
Copy link
Collaborator

... #18231 :)

@d3lm
Copy link

d3lm commented Nov 18, 2022

Good job @RReverser, and thanks for the shoutout here. Really glad that we can fix this in Emscripten directly 🙌

Copy link

@d3lm d3lm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👏

@RReverser RReverser force-pushed the node-worker-auto-exit-conservative branch from dabe2e8 to 07fd976 Compare November 18, 2022 12:01
@RReverser RReverser merged commit a9cbf47 into main Nov 18, 2022
@RReverser RReverser deleted the node-worker-auto-exit-conservative branch November 18, 2022 13:20
RReverser added a commit that referenced this pull request Nov 25, 2022
Probably worth a highlight, but forgot to add it in the original PR.
sbc100 pushed a commit that referenced this pull request Nov 27, 2022
Probably worth a highlight, but forgot to add it in the original PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Usage of pthread in a Node.js app never lets it exit
5 participants