Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle thread crash/abort even when main thread is blocking (busy-waiting). NFC #16174

Merged
merged 1 commit into from
Feb 2, 2022

Conversation

sbc100
Copy link
Collaborator

@sbc100 sbc100 commented Feb 2, 2022

Previously if a thread crashed when the main thread was blocked on
something (such as pthread_join) it would never get the onerror
message because it would never return the event loop.

Without this change the main thread will simple loop forever after the
secondary thread crashes.

@sbc100 sbc100 force-pushed the better_thread_crash_handling branch from f4dc6e8 to ceeb252 Compare February 2, 2022 00:38
@sbc100
Copy link
Collaborator Author

sbc100 commented Feb 2, 2022

Still working a on test ..

@sbc100 sbc100 requested a review from tlively February 2, 2022 00:38
// mechanism to send this message since it can allocate (or otherwise itself
// crash) so use a low level atomic primitive for this signal.
if (emscripten_is_main_runtime_thread() && thread_crashed)
abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a thread crashing causing the main thread to abort the desired behavior? Is this equivalent to a thread doing something like a segfault and bringing down the entire process?

What if the main thread never calls emscripten_futex_wait?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is like one of the threads segfaulting.

If the main thread is returning to the even loop there is no issue. This patch is only for the case when the main thread is blocking/looping and not returning (e.g. pthread_join).

Copy link
Member

@tlively tlively left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that makes sense now. Checking for crashes in futex_wait is sufficient because we're assuming that the main thread will call pthread_join at some point.

system/lib/pthread/emscripten_futex_wait.c Outdated Show resolved Hide resolved
@sbc100 sbc100 force-pushed the better_thread_crash_handling branch from b910c50 to fcbe96e Compare February 2, 2022 01:04
@sbc100
Copy link
Collaborator Author

sbc100 commented Feb 2, 2022

Aha, that makes sense now. Checking for crashes in futex_wait is sufficient because we're assuming that the main thread will call pthread_join at some point.

All our blocking primitives (not just pthread_join) bottom out here in futex_wait. If the thread is busy in a compute loop we can't do anything.. but such a program has worse problems to deal with.

@sbc100 sbc100 force-pushed the better_thread_crash_handling branch from fcbe96e to 22f0d22 Compare February 2, 2022 01:10
…ting). NFC

Previously if a thread crashed when the main thread was blocked on
something (such as pthread_join) it would never get the `onerror`
message because it would never return the event loop.
@sbc100 sbc100 force-pushed the better_thread_crash_handling branch from 22f0d22 to d072e29 Compare February 2, 2022 15:48
@sbc100
Copy link
Collaborator Author

sbc100 commented Feb 2, 2022

@sbc100 sbc100 merged commit d37a121 into main Feb 2, 2022
@sbc100 sbc100 deleted the better_thread_crash_handling branch February 2, 2022 17:32
// Return the event loop so we can handle the message from the crashed
// thread.
emscripten_unwind_to_js_event_loop();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the call to futex_wait_busy below? should we not check this in the loop there as well? (it is in JS so it's harder - not sure if worth it)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It recently got moved to native code just above!

That reason I didn't bother with that is that all the call sites that we care about already time slice calls to emscripten_futex_wait anyway.

Its a little odd that we have the time slice happening both outside of emscripten_futex_wait and also inside of emscripten_futex_wait (but we only do that for if we choose the futex_wait_busy path). I've been thinking about trying to eliminate one or the other since two it at two levels seems wrong.. but its not obvious which place is that right place to to it.

Separely, I think there is a different way we might want to factor this stuff. Create a new API called _emscripten_thread_tick which calls both emscripten_main_thread_process_queued_calls and runs checks like this. There are possibly other things we might want to do here too. Then change basically all the emscripten_main_thread_process_queued_calls calls to _emscripten_thread_tick.. that would automatically mean doing this new check in all the right places (i.e. whenever a thread is idle).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, great! Very nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants