Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Ffi calls to be marked as potentially blocking / exiting the isolate. #51261

Open
mkustermann opened this issue Feb 6, 2023 · 5 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, FFI, and the AOT and JIT backends. library-ffi

Comments

@mkustermann
Copy link
Member

Some users are running into an issue where many isolates are calling out to C code that will then block. This can cause the dart app to no longer work due to our limitation on maximum number of threads that can be active in an isolate group at a given point in time.

The limitation is there to avoid too many threads executing Dart code at same time. This can lead to situations where X threads all have TLABs which may contain unallocated memory, but the X+1 thread tries to obtain TLAB and fails, which will cause it to trigger GC (despite other thread's TLAB still having unallocated memory)
=> Allowing unbounded number of threads to enter an isolate group can lead to excessive triggering of GCs (despite free memory in other thread's TLAB)

See runtime/vm/heap/scavenger.h for the current calculation of the limit:

  // The maximum number of Dart mutator threads we allow to execute at the same
  // time.
  static intptr_t MaxMutatorThreadCount() {
    // With a max new-space of 16 MB and 512kb TLABs we would allow up to 8
    // mutator threads to run at the same time.
    const intptr_t max_parallel_tlab_usage =
        (FLAG_new_gen_semi_max_size * MB) / Scavenger::kTLABSize;
    const intptr_t max_pool_size = max_parallel_tlab_usage / 4;
    return max_pool_size > 0 ? max_pool_size : 1;
  }

We may consider adding a boolean flag to specify that a FFI call may be blocking / should exit the isolate.

// Static binding
@Native("sleep", exitIsolate: true)
external void sleep(int seconds);

// Dynamic binding
dylib.lookup().asFunction<...>(exitIsolate: true);

to automatically exit and re-enter the isolate to avoid custom C code like this:

auto isolate = Dart_CurrentIsolate();
Dart_ExitIsolate();
<... run blocking C Code, e.g. sleep() ...>
Dart_EnterIsolate(isolate);

See motivating use case: #51254

@mkustermann mkustermann added area-vm Use area-vm for VM related issues, including code coverage, FFI, and the AOT and JIT backends. library-ffi labels Feb 6, 2023
@mkustermann
Copy link
Member Author

The underlying issue is that new space (which we require for bump-allocation) doesn't scale with number of threads. The fact that this limitation would surface to the FFI API does seem a bit iffy.

We could device a scheme where FFI calls will give up their TLAB on transitions to C and re-acquire on the way back and limit the number of outstanding TLABs instead of number of active isolates. Though that would make transitions more heavyweight, would make returning from C (as well as Dart C API calls) possibly blocking for arbitrary amount of time. Seems less than ideal.

/cc @rmacnak-google

@rmacnak-google
Copy link
Contributor

I think this can be for free for the uncontented case: What we could do is when a new mutator wants to enter the isolate and the limit has been reached, we can check if any existing mutators are in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out, causing the safepoint transition on the ffi-return to hit the slow path, and take its TLAB away. The safepoint transition slow path then has a new check if it needs to wait on the mutator count to re-enter as a mutator.

@dcharkes
Copy link
Contributor

Though that would make returning from C (as well as Dart C API calls) possibly blocking for arbitrary amount of time.

It would still be compatible with Dart's semantics of synchronous code on an isolate running to completion before any other code is run on that isolate.

However, it would change the scheduling which isolate runs when we have exhausted the max number of mutators in an isolate group, that might be surprising. Do we have some kind of scheduling logic for that? @mkustermann

@mkustermann
Copy link
Member Author

mkustermann commented Apr 3, 2023

I think this can be for free for the uncontented case: What we could do is when a new mutator wants to enter the isolate and the limit has been reached, we can check if any existing mutators are in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out, causing the safepoint transition on the ffi-return to hit the slow path, and take its TLAB away. The safepoint transition slow path then has a new check if it needs to wait on the mutator count to re-enter as a mutator.

That's an interesting idea.

I'm a little worried that doing this blindly can lead to situations where e.g. Flutter UI isolate does a FFI call, then another thread kicking the UI isolate out. When the FFI call on UI isolate returns it will take the slow path and block (which could freeze flutter UI).

This can also happen to some extent today as well - but only at event loop boundary (e.g. Flutter UI isolate is idle, N threads enter isolate group and then flutter UI isolate cannot enter anymore but has to wait).

If one mutator has been kicked out and returns from ffi call then in the slow path it should be allowed to kick out another thread if it's in a ffi call. That would mean the system would work flawlessly irrespective of number of threads - as long as there are not more than N threads executing Dart code concurrently (which may be an ok restriction as all dart code being executed will either go back to event loop or do ffi call eventually which are yield poitns). Though it will require some synchronization on both sides:

  • Calling native via ffi: If there's another thread waiting to execute Dart we need to notify it (could use similar mechanism as our existing "gc-safepoint-requested" bit which forces GeneratedToNative to slowpath)
  • Return from ffi call: Safepoint may have been stolen from the side, so we have to take slowpath and wait for mutator count (or kick another thread out if there's any in ffi-exited state)

@Piero512
Copy link

Piero512 commented Apr 3, 2023

Hi. I have a complaint about this. If we're going to expose to FFI developers to those kinds of Isolate details, why we can't have some way for native code to at least return (synchronously) Dart objects that can be created through the Dart_CObject struct?

copybara-service bot pushed a commit that referenced this issue Apr 12, 2023
For applications that want to have arbitrary number of isolates call
into native code that may be blocking, we expose the API functions that
allows those native threads to exit an isolate before running
long/blocking code.

Without the ability to exit/re-enter isolate, one may experience
deadlocks as we have a fixed limit on the number of concurrently
executing isolates atm.

In the longer term we may find a way to do this automatically
with low overhead, see [0]. But since those API functions are quite
stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
see a reason not to expose `Dart_{Enter,Exit}Isolate`.

[0] Issue #51261

Issue #51254

TEST=ffi{,_2}/dl_api_exit_enter_isolate_test

Change-Id: I91c772ca962fddb87919663fea07939a498fa205
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/292722
Commit-Queue: Martin Kustermann <kustermann@google.com>
Reviewed-by: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Ryan Macnak <rmacnak@google.com>
copybara-service bot pushed a commit that referenced this issue Apr 12, 2023
This reverts commit a251281.

Reason for revert: FFI tests fail to link on Windows, fail to load on product-mode Android

Original change's description:
> [vm] Expose Dart_{CurrentIsolate,ExitIsolate,EnterIsolate}
>
> For applications that want to have arbitrary number of isolates call
> into native code that may be blocking, we expose the API functions that
> allows those native threads to exit an isolate before running
> long/blocking code.
>
> Without the ability to exit/re-enter isolate, one may experience
> deadlocks as we have a fixed limit on the number of concurrently
> executing isolates atm.
>
> In the longer term we may find a way to do this automatically
> with low overhead, see [0]. But since those API functions are quite
> stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
> see a reason not to expose `Dart_{Enter,Exit}Isolate`.
>
> [0] Issue #51261
>
> Issue #51254
>
> TEST=ffi{,_2}/dl_api_exit_enter_isolate_test
>
> Change-Id: I91c772ca962fddb87919663fea07939a498fa205
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/292722
> Commit-Queue: Martin Kustermann <kustermann@google.com>
> Reviewed-by: Daco Harkes <dacoharkes@google.com>
> Reviewed-by: Ryan Macnak <rmacnak@google.com>

Change-Id: I05ad5b9ce24754a68693160e470f8eb71a812c75
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/294860
Auto-Submit: Ryan Macnak <rmacnak@google.com>
Commit-Queue: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
copybara-service bot pushed a commit that referenced this issue Apr 13, 2023
For applications that want to have arbitrary number of isolates call
into native code that may be blocking, we expose the API functions that
allows those native threads to exit an isolate before running
long/blocking code.

Without the ability to exit/re-enter isolate, one may experience
deadlocks as we have a fixed limit on the number of concurrently
executing isolates atm.

In the longer term we may find a way to do this automatically
with low overhead, see [0]. But since those API functions are quite
stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
see a reason not to expose `Dart_{Enter,Exit}Isolate`.

Difference to original CL:

  Do use STL synchronization primitives (as the ones in runtime/bin
  are not always available in shared libraries)


[0] Issue #51261

Issue #51254

TEST=ffi{,_2}/dl_api_exit_enter_isolate_test

Change-Id: Id817e8d4edb3db35f029248d62388cbd0682001d
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/294980
Reviewed-by: Daco Harkes <dacoharkes@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, FFI, and the AOT and JIT backends. library-ffi
Projects
None yet
Development

No branches or pull requests

4 participants