[Impeller] Encode render passes concurrently on iOS. #42028

Merged
merged 23 commits into flutter:main on May 23, 2023

Conversation

jonahwilliams (Member)

Allows pushing encoding of command buffers to a worker thread, relying on the fact that these buffers are always scheduled in the order that they are enqueued. This follows the guidelines from https://developer.apple.com/documentation/metal/mtlcommandbuffer?language=objc
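A minimal sketch of the pattern; command_queue and worker_task_runner are illustrative names, not the PR's actual members. The buffer is enqueued on the calling thread to fix its position in the queue, and encoding plus commit then happen on a worker:

    // Sketch only. Metal executes enqueued buffers in enqueue order,
    // regardless of which thread later commits them.
    id<MTLCommandBuffer> buffer = [command_queue commandBuffer];
    [buffer enqueue];  // Reserve this buffer's slot, on the calling thread.

    worker_task_runner->PostTask([buffer]() {
      // Encoding work for this buffer happens here, off the raster thread,
      // before the commit.
      [buffer commit];
    });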

@@ -63,16 +63,18 @@
 PlaygroundImplMTL::PlaygroundImplMTL(PlaygroundSwitches switches)
     : PlaygroundImpl(switches),
       handle_(nullptr, &DestroyWindowHandle),
-      data_(std::make_unique<Data>()) {
+      data_(std::make_unique<Data>()),
+      concurrent_loop_(fml::ConcurrentMessageLoop::Create()) {
Member Author

This is more or less duplicating the problem that the VKContext has, wherein we should only be creating a single concurrent message loop per engine.

Member

FWIW, we reworked ContextVK so that it supports sharing a concurrent message loop. There's no known problem; this came from Chinmay's design goal of wanting to share these. I'm not sure what that means for your PR; I'm just clarifying that point.

Member Author

I'll look at ContextVK then; when I started this PR, ContextVK was still creating its own message loop.

Member Author

ContextVK is still creating its own concurrent message loop?

workers_(fml::ConcurrentMessageLoop::Create()) {

Member

ContextVK takes in a shared_ptr to a task queue which can be shared across multiple engines, but currently is not:

std::shared_ptr<fml::ConcurrentTaskRunner> worker_task_runner;

I had a PR out that made ContextVK own a single concurrent message loop, but we decided we didn't want to shut the door on sharing concurrent message loops between engines. I haven't read this PR through; I'm just responding to the comment "the problem that the VKContext has, wherein we should only be creating a single concurrent message loop per engine". There is no problem that I know of with Vulkan, and we made sure of that with our recent work. Let me know if you want me to review this PR if you think it would be helpful.

   if (!context) {
     return false;
   }
+  [buffer_ enqueue];
Member Author

Enqueuing on the main thread ensures the order.

@@ -179,6 +179,7 @@
   }
 }

+#if ((FML_OS_MACOSX && !FML_OS_IOS) || FML_OS_IOS_SIMULATOR)
Member Author

I observed using Wonderous that the waitUntilScheduled call was only necessary on the simulator and (speculatively) on macOS.

Contributor

Is that with this patch or without?

If you've observed this without the changes in this patch, can we move this to a separate patch?

Member Author

Yes, I can split this out, though we need both of these changes to see much of an improvement; otherwise the final scheduling just takes longer.

Member Author

removed

Member Author

moved here: #42160

@@ -171,6 +176,53 @@ static bool LogMTLCommandBufferErrorIfPresent(id<MTLCommandBuffer> buffer) {
   return true;
 }

+bool CommandBufferMTL::SubmitCommandsAsync(
+    std::shared_ptr<RenderPass> render_pass) {
Member Author

Currently this only works for the render pass, but it could optionally take the blit and compute passes too. I moved this functionality to the command buffer to make ownership of the render pass data more obvious: the closure keeps the shared_ptr for the render pass alive, while the command buffer itself only requires the buffer.
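
A minimal sketch of that ownership pattern; worker_task_runner_ is an illustrative member name and the encoding call itself is elided:

    bool CommandBufferMTL::SubmitCommandsAsync(
        std::shared_ptr<RenderPass> render_pass) {
      // The closure owns the shared_ptr, keeping the pass (and the data it
      // references) alive until encoding completes; the command buffer only
      // contributes the Metal buffer handle.
      auto task = fml::MakeCopyable([render_pass, buffer = buffer_]() {
        // Encode render_pass into buffer here, then commit.
        [buffer commit];
      });
      worker_task_runner_->PostTask(task);
      return true;
    }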

+  // Render command encoder creation has been observed to exceed the stack size
+  // limit for worker threads, and therefore is intentionally constructed on the
+  // raster thread.
+  auto render_command_encoder =
Member Author

This was a frustrating discovery because it prevents us from doing the obvious thing and just calling RenderPass::Encode in the closure.

@chinmaygarde (Member)

Exciting stuff. Looking 👀

+  if (!IsValid() || !render_pass->IsValid()) {
+    return false;
+  }
+  auto context = context_.lock();
Contributor

I think this is more or less what ended up being the fix for problems around ownership in the Vulkan backend.


+  // Render command encoder creation has been observed to exceed the stack size
+  // limit for worker threads, and therefore is intentionally constructed on the
+  // raster thread.
Contributor

Can we not bump the stack size limit for worker threads?

Member Author

How can I try that?

Contributor

It's platform-specific code; we'd need a pthreads and a Win32 implementation at the least.
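
For reference, a minimal sketch of the pthreads half; the 4 MiB size and WorkerEntry function are illustrative assumptions, not an existing fml API:

    #include <pthread.h>

    static void* WorkerEntry(void* arg) {
      // The worker's task loop would run here.
      return nullptr;
    }

    void SpawnWorkerWithLargeStack() {
      pthread_attr_t attr;
      pthread_attr_init(&attr);
      // Request a larger stack than the platform default before spawning.
      pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);
      pthread_t thread;
      pthread_create(&thread, &attr, WorkerEntry, nullptr);
      pthread_attr_destroy(&attr);
    }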

+  }
+
+  auto task = fml::MakeCopyable(
+      [render_pass, buffer, render_command_encoder, context]() {
Contributor

Should we instead capture the weak_ptr to context and make sure it's still lockable in the callback here?

The advantage being that if the context has otherwise gone away, we avoid doing work.

Otherwise, I think we need to consult the GPU sync switch in here to make sure we're not doing encoding work when the GPU is unavailable.

Member Author

Yeah, capturing the context seems reasonable. I'm not sure what the buffer itself will do; I guess we should probably assume it won't handle anything gracefully.

Member Author

Done

@@ -42,7 +45,8 @@ @implementation FlutterDarwinContextMetalImpeller
 - (instancetype)init {
   self = [super init];
   if (self != nil) {
     _context = CreateImpellerContext();
+    _workers = fml::ConcurrentMessageLoop::Create();
Contributor

We shouldn't create a message loop here. Instead, we should be getting the concurrent loop the engine creates already and passing it in as a parameter here.

That means we should mark the default initializer as unavailable and make a new one like initWithMessageLoop, or alternatively just initWithTaskRunner since that's all we actually need here.
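
A sketch of what the suggested interface might look like; initWithTaskRunner: is the hypothetical name proposed above, not an existing API:

    // Hypothetical replacement initializer: the default init is disallowed so
    // callers must supply the engine's existing task runner.
    - (instancetype)init NS_UNAVAILABLE;
    - (instancetype)initWithTaskRunner:
        (std::shared_ptr<fml::ConcurrentTaskRunner>)taskRunner
        NS_DESIGNATED_INITIALIZER;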

Member Author

I'm not sure exactly how to do this. It looks like ImpellerContext is set up without any references to the current engine. Is this something we can/should change, or should I look at providing the message loop some other way?

Contributor

I think it'd flow through PlatformViewIOS (ctor) -> IOSContext::Create -> IOSContextMetalImpeller (ctor)

Looks like there's something left to figure out there, though, and we're not doing this on Android right now (we create a special concurrent loop for the context to use there in android_context_vulkan_impeller.cc).

Maybe for now we can file a bug for this. We shouldn't really be creating multiple concurrent message loops. But there might need to be some refactoring in the shell/platformview/engine setup to make sure that ownership and sharing of the loop is handled properly.

Contributor

Filed flutter/flutter#127160. I'm ambivalent about whether resolving that blocks this or not, but if it turns out not to be too hard to resolve, we should do it before creating another violation of it here.

Member Author

I think we should figure out the concurrent loop stuff first.

@dnfield (Contributor) left a comment

Things that seem dangerous to me:

  • Doing encoding work on another thread without checking whether the GPU is available. I think if we don't keep the context alive, the context will do that job for us, but I might be wrong.
  • Creating another message loop. We should avoid that and just use the existing one.

@jonahwilliams (Member Author)

@dnfield PTAL, I followed your example and updated this to use the existing concurrent message loop.

+      [render_pass, buffer, render_command_encoder, weak_context = context_]() {
+        auto context = weak_context.lock();
+        if (!context) {
+          return;
Contributor

I'm realizing this will not protect us from potentially encoding commands in the background.

We need the GPU sync switch in here so we can fizzle if GPU access is disabled. That's available via Shell::GetIsGpuDisabledSyncSwitch. I think it would make sense to expose that on ContextMTL.
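
A sketch of how the closure might fizzle, assuming the switch is exposed on ContextMTL as suggested (GetIsGpuDisabledSyncSwitch on ContextMTL is the proposal here, not an existing method):

    auto task = fml::MakeCopyable(
        [render_pass, buffer, weak_context = context_]() {
          auto context = weak_context.lock();
          if (!context) {
            return;
          }
          // Only encode and commit while GPU access is allowed; if the app
          // is backgrounded, the task fizzles without touching the buffer.
          context->GetIsGpuDisabledSyncSwitch()->Execute(
              fml::SyncSwitch::Handlers().SetIfFalse([&] {
                // Encoding of render_pass into buffer would happen here.
                [buffer commit];
              }));
        });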

Member Author

Which operations are forbidden in the background? Is it just submitting the command buffer? I'm curious what stops us from doing that now on the single raster thread.

Member Author

(that is, do we need to use the sync switch elsewhere too?)

Contributor

Committing encoding work is forbidden: https://developer.apple.com/documentation/metal/gpu_devices_and_work_submission/preparing_your_metal_app_to_run_in_the_background?language=objc

Before this change, the work was all guarded by the rasterizer. But now that we're spawning new threads, it's harder to predict how scheduling will occur: the rasterizer may reach the end of the function it's calling while the task spawned to encode new work runs afterward.

Contributor

Specifically, [buffer commit] below is definitely illegal in the background, but I think EncodeCommands may make other calls that are probably not allowed.

At any rate, in this closure we can't trust that we're on the raster thread and that the rasterizer has already checked whether GPU access is allowed.

@jonahwilliams jonahwilliams changed the title [Impeller] Encode render passes concurrently on iOS, only block final submission on macOS and simulator. [Impeller] Encode render passes concurrently on iOS. May 22, 2023
@dnfield (Contributor) left a comment

LGTM once CI is happy

@jonahwilliams jonahwilliams added the autosubmit Merge PR when tree becomes green via auto submit App label May 23, 2023
@auto-submit auto-submit bot merged commit fc6df95 into flutter:main May 23, 2023
28 checks passed
@jonahwilliams jonahwilliams deleted the encode_and_enqueue branch May 23, 2023 20:20
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 23, 2023
auto-submit bot pushed a commit to flutter/flutter that referenced this pull request May 23, 2023
…127440)

flutter/engine@3535e0d...fc6df95

2023-05-23 jonahwilliams@google.com [Impeller] Encode render passes concurrently on iOS. (flutter/engine#42028)

If this roll has caused a breakage, revert this CL and stop the roller
using the controls here:
https://autoroll.skia.org/r/flutter-engine-flutter-autoroll
Please CC bdero@google.com,rmistry@google.com,zra@google.com on the revert to ensure that a human
is aware of the problem.

To file a bug in Flutter: https://github.com/flutter/flutter/issues/new/choose

To report a problem with the AutoRoller itself, please file a bug:
https://bugs.chromium.org/p/skia/issues/entry?template=Autoroller+Bug

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+doc/main/autoroll/README.md
CaseyHillers pushed a commit to CaseyHillers/flutter that referenced this pull request May 24, 2023
@zanderso (Member)

SkiaPerf link for next set of release notes: https://flutter-flutter-perf.skia.org/e/?begin=1684868908&end=1684953846&keys=Xa06e17cd029c62b99ebc7c4da0a2c039&num_commits=50&request_type=1&xbaroffset=34979

auto-submit bot pushed a commit that referenced this pull request Nov 15, 2023
This is a requirement for the "Remove Drawable Acquisition Latency" Change, but is otherwise a harmless improvement.

---

Submitting a command buffer causes the backend-specific encoding logic to run. Metal is unique in that it is fairly easy to move this work onto a background thread, allowing the engine to move on to creating the next command buffer. This improves the throughput of the engine, at the cost of needing two slightly different APIs. Currently the GLES and Vulkan versions of this method still submit synchronously; for now that is out of scope, as doing background work with those APIs has proved more challenging.

See also:
   * #42028
   * flutter/flutter#131698
   
Separately, as a requirement for the design in "Remove Drawable Acquisition Latency", we need to be able to defer drawable acquisition to this background thread. While this almost already works for render passes, it does not work for blit passes today. If the engine renders a backdrop filter, then the final command buffer submitted will be a blit pass that copies an offscreen onto the drawable. Therefore we need to add an async version of blit submission, so that we have a hook to move drawable acquisition onto a background thread for Metal.

This hadn't been done until now because most blit command buffers have one or two commands on them, so the benefit of moving them to a background thread is minimal.

Part of flutter/flutter#138490
Labels: autosubmit (Merge PR when tree becomes green via auto submit App), e: impeller, platform-ios