
Made Picture::toImage happen on the IO thread with no need for an onscreen surface. #9813

Merged · gaaclarke merged 10 commits into flutter:master on Jul 16, 2019

Conversation

@gaaclarke (Contributor) commented on Jul 12, 2019

This fixes ShaderWarmUp for iOS (and possibly Android too).
References: flutter/flutter#35923

I suspect a few other open issues are related to this bug, since it was probably affecting Android as well. Here are a few extra benefits of this change:

  1. ui.Picture.toImage no longer interferes with regular rendering.
  2. In the fullscreen case we should get a few more milliseconds of shader warm-up.
  3. It simplifies the API by removing the SnapshotDelegate.

I have a few concerns about this PR:

  1. The thread safety of PersistentCache is suspect; I'll have to audit it further. Chinmay said it should be thread-safe.
  2. There is a race condition: rendering the first frame to the onscreen surface does not wait for the shader warm-up to complete. In the worst case (assuming PersistentCache is thread-safe), we have already compiled a shader in the warm-up but the first frame won't have access to it. To fix that we'd have to give the ShaderWarmUp some ability to block the GPU thread.
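A hypothetical sketch of what "some ability to block the GPU thread" could look like, using fml's waitable event; the function names and call sites here are illustrative, not the engine's actual API:

```cpp
#include "flutter/fml/synchronization/waitable_event.h"

// Hypothetical latch: the GPU thread blocks on shader warm-up finishing on
// the IO thread before it rasterizes the first frame.
fml::AutoResetWaitableEvent warmup_latch;

// IO thread: called once the warm-up draws have been flushed and the
// compiled shaders are in the persistent cache.
void OnWarmUpComplete() {
  warmup_latch.Signal();
}

// GPU thread: called before rasterizing the first frame.
void WaitForWarmUp() {
  warmup_latch.Wait();
}
```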

I'm open to any suggestions for automated testing.

googlebot added the cla: yes label on Jul 12, 2019
gaaclarke requested review from chinmaygarde and liyuqian on Jul 12, 2019
@liyuqian (Contributor) left a comment

Thank you Aaron!

This has been something that I have wanted to do for a long time. I'd say that doing warm-up/snapshot on the IO thread without blocking is better than doing the work on the GPU thread, or blocking the GPU thread in some other way. By blocking, we give users no frame at all during the warm-up; by not blocking, users can at least see some frames as soon as possible. Of course, we can always give developers the flexibility to choose whether to block.

For unit tests, I think you can add something to testing/dart/canvas_test.dart: record a Picture, call Picture.toImage, and then check that the resulting pixels are as expected. The only challenge I know of is how to run the Dart test locally; so far, the only way I know is to run testing/run_tests.sh. You can modify the script to run only the test you care about.

Once this lands, please feel free to add some comments about the performance and UX differences between this new IO-thread approach and the old GPU-thread approach in the design doc (https://docs.google.com/document/d/1dTcjrK5Nn7roa3WftaKcoOk7-2u_2HItCahlNc73Xt8/edit#heading=h.ar902g5pm5jo) for our future reference.

}
}

sk_sp<SkImage> MakeRasterSnapshot(sk_sp<SkPicture> picture,

@liyuqian (Contributor) commented on Jul 12, 2019

Here are some points that Chinmay and I discussed offline and think would be nice to add to the code comment:

  1. MakeRasterSnapshot makes a CPU-backed image (instead of a GPU-backed image). This enables peeking at pixels from Dart code, or encoding the image for storage, but it introduces extra overhead for converting GPU images to CPU images. (Raster means CPU in Skia's jargon.)

  2. It might be nice to also have a MakeSnapshot API that generates a GPU-backed image without the GPU-to-CPU overhead. It doesn't have to be added in this PR; maybe just leave a comment here and add it in a future PR, potentially with some benchmarks to verify the performance gain of MakeSnapshot over MakeRasterSnapshot.

  3. Shader warm-up only needs MakeSnapshot (and, as the Skia team pointed out, we can further modify the GrContext to skip all GPU operations other than shader compilation). We can still preserve the behavior of using MakeRasterSnapshot in this PR, but eventually we'll want to switch to the cheaper version. Maybe leave a comment in the code so we don't forget this.

  4. In some cases, the developer may just want a GPU-backed snapshot to composite directly back into a GPU surface. MakeRasterSnapshot will be quite expensive for this case, as it triggers a GPU-to-CPU copy and then a CPU-to-GPU copy. Maybe leave a comment here suggesting the future MakeSnapshot API for this case.
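A sketch of what such a comment block might look like, folding the four points above into the code; the wording and signature are illustrative, not the text that actually landed:

```cpp
#include "third_party/skia/include/core/SkImage.h"
#include "third_party/skia/include/core/SkPicture.h"
#include "third_party/skia/include/core/SkSurface.h"

// Renders |picture| into |surface| and returns a CPU-backed ("raster", in
// Skia jargon) image. A CPU-backed image lets Dart code peek at pixels or
// encode the image for storage, but reading a GPU-rendered picture back to
// the CPU costs a device-to-host transfer.
//
// TODO: add a GPU-backed MakeSnapshot variant. Shader warm-up only needs
// the shader-compilation side effects, and callers that composite the
// snapshot straight back into a GPU surface would otherwise pay a
// GPU-to-CPU copy followed by a CPU-to-GPU copy.
sk_sp<SkImage> MakeRasterSnapshot(sk_sp<SkPicture> picture,
                                  sk_sp<SkSurface> surface);
```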

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

Done. I'll add an issue once this lands.

@cmkweber commented on Jul 15, 2019

I'd like to add that #2 was discussed at length in #8835.

Would it be possible to work #2 in, since this section of code is related? Having a GPU-backed image and avoiding the GPU-to-CPU overhead would be beneficial.

Adding @Hixie - I will soon address the concerns brought up there about increased documentation and tests in this area.

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

> I'd like to add that #2 was discussed at length in #8835.
>
> Would it be possible to work #2 in, since this section of code is related? Having a GPU-backed image and avoiding the GPU-to-CPU overhead would be beneficial.
>
> Adding @Hixie - I will soon address the concerns brought up there about increased documentation and tests in this area.

Noted. I think it's something we should do as well, but it should be prioritized against other issues and not serve as an impediment to this change.

}

{
TRACE_EVENT0("flutter", "DeviceHostTransfer");

@liyuqian (Contributor) commented on Jul 12, 2019

Maybe leave a comment here like: "Here, device means GPU and host means CPU. This is different from use cases like Flutter driver tests, where device means a mobile device and host means a laptop/desktop."

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

Done.

gaaclarke added 2 commits Jul 15, 2019
@chinmaygarde (Member) left a comment

The ui_unittests target is now OpenGL ES capable on the IO thread. Please add a unit test for this. See image_decoder_unittests.cc for a harness you can use as a reference.

sk_sp<SkSurface> surface = MakeSnapshotSurface(picture_bounds, resource_context);
sk_sp<SkImage> raster_image = MakeRasterSnapshot(picture, surface);

fml::TaskRunner::RunNowOrPostTask(

@chinmaygarde (Member) commented on Jul 15, 2019

RunNowOrPostTask is only used if there is some form of latching involved, and there isn't any here. Please post the task directly to the UI task runner. I realize this may be because the previous code did the same, but there really is no reason to do it here.
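A minimal sketch of the suggested change, assuming the engine's fml task runner types; ui_task_runner and ui_task are illustrative names for this code path:

```cpp
#include <utility>

#include "flutter/fml/task_runner.h"

// Sketch: schedule |ui_task| (e.g. the callback that completes the Dart
// side of Picture.toImage) on the UI thread's task runner.
void ScheduleUiTask(fml::RefPtr<fml::TaskRunner> ui_task_runner,
                    fml::closure ui_task) {
  // Before: RunNowOrPostTask executes the task inline when the caller is
  // already on the target thread, which only matters when a latch or other
  // ordering guarantee depends on immediate execution:
  //   fml::TaskRunner::RunNowOrPostTask(ui_task_runner, ui_task);

  // After: with no latching involved, simply post the task; it runs on the
  // UI thread in normal task order.
  ui_task_runner->PostTask(std::move(ui_task));
}
```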

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

Done.

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

I had to revert this too; it was causing a deadlock for some tests.

/// @param[in] picture The picture that will get converted to an image.
/// @param[in] surface The surface that will be used to render the picture. This
/// will be CPU or GPU based.
/// @todo Currently this creates a RAM backed image regardless of what type of

@chinmaygarde (Member) commented on Jul 15, 2019

"RAM backed" is odd terminology. Even device memory is in RAM right? Can we stick to just the device/host terminology like you have done for the traces?

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

IMO device/host is very confusing. Even yuqian asked me to add a comment to his code to clarify the terms. Images in RAM vs. VRAM is the model I've used when dealing with discrete GPUs. Can you tell me exactly what you want here and I'll plop it in.

sk_sp<SkSurface> MakeSnapshotSurface(const SkISize& picture_size,
                                     fml::WeakPtr<GrContext> resource_context) {
  SkImageInfo image_info = SkImageInfo::MakeN32Premul(
      picture_size.width(), picture_size.height(), SkColorSpace::MakeSRGB());

@chinmaygarde (Member) commented on Jul 15, 2019

If the color space is nullptr, it defaults to sRGB. We have stuck to this default behavior elsewhere; please do the same here as well.
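A sketch of the suggested simplification; SkImageInfo::MakeN32Premul's color-space parameter defaults to nullptr, so omitting it picks up the default behavior described above (the helper name is illustrative):

```cpp
#include "third_party/skia/include/core/SkImageInfo.h"

// Sketch: build the snapshot's image info without naming a color space.
// Leaving the color space nullptr matches the default used elsewhere.
SkImageInfo MakeSnapshotImageInfo(const SkISize& picture_size) {
  return SkImageInfo::MakeN32Premul(picture_size.width(),
                                    picture_size.height());
}
```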

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

Done.

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

I had to remove this change; it started causing tests to fail.

gaaclarke added 3 commits Jul 15, 2019
@gaaclarke (Author, Contributor) commented on Jul 15, 2019

> The ui_unittests target is now OpenGL ES capable on the IO thread. Please add a unit test for this. See image_decoder_unittests.cc for a harness you can use as a reference.

I've tested the unit test I wrote for yuqian; it actually is running on the GPU. Maybe we can add an optional third argument that specifies that you want it to run on the GPU if available. I'm a bit worried this could mess with golden image tests as-is, since it might move them over to the GPU from the CPU.

I attempted this, Chinmay, but it is non-trivial for the following reasons:
1) The function I would be testing takes Dart_Handles as input and output. I'm sure there is a way to introspect them; I'm not sure it's worth doing from the C++ side.
2) The functions rely on a global UIDartState. We could factor that out into an argument and rework it as the subset my function needs.
3) I'd have to rewrite the golden image testing I made for Dart in C++.

I think the best way to test this would be to run the test on the Dart side, like I did at yuqian's request. Do you have any ideas how we could take what you did in ui_unittests and get Dart running on top of it so we can test it similarly?

@chinmaygarde (Member) commented on Jul 15, 2019

> It actually is running on the GPU.

Flutter tester does not run the GPU backend. Are you sure? The ui_unittests can run with both the GPU and CPU backends on a per-test basis.

@chinmaygarde (Member) commented on Jul 15, 2019

For example, see image_decoder_unittests.cc for ValidImageResultsInSuccess and CanDecodeWithoutAGPUContext. That target also sets up Dart code to run in AOT as well as JIT modes.

This reverts commit 3e3e711.
@gaaclarke (Author, Contributor) commented on Jul 15, 2019

> > It actually is running on the GPU.
>
> Flutter tester does not run the GPU backend. Are you sure? The ui_unittests can run with both the GPU and CPU backends on a per-test basis.

I was mistaken; no, they run on the CPU.

@gaaclarke (Author, Contributor) commented on Jul 15, 2019

@chinmaygarde We talked about this offline: the automated test exercises the CPU render path. The GPU render path was only tested manually, and there is future work to get Dart running (with its dependencies) on the GPU backend.

@chinmaygarde (Member) commented on Jul 15, 2019

Filed flutter/flutter#36224 and flutter/flutter#36225 to remind us to continue this yak shave.

This reverts commit ca8776e.
gaaclarke merged commit 78a8ca0 into flutter:master on Jul 16, 2019

23 checks passed
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request on Jul 16, 2019
@jason-simmons (Contributor) commented on Jul 17, 2019

This affected the memory usage metrics in the performance benchmarks.

Picture.toImage is called during app startup by the ShaderWarmUp class in the framework. After this change, the total memory usage of Gallery at startup is reported as several MB higher than before in the Android dumpsys meminfo report. The difference is due to a combination of increased native heap and graphics memory consumption.

I'm not sure why moving picture rasterization to the IO thread has this effect. It may be related to implementation details of how Android's malloc handles requests from multiple threads (similar to what we were looking at in flutter/flutter#36079).

@chinmaygarde

@gaaclarke (Author, Contributor) commented on Jul 17, 2019

> This affected the memory usage metrics in the performance benchmarks.
>
> Picture.toImage is called during app startup by the ShaderWarmUp class in the framework. After this change, the total memory usage of Gallery at startup is reported as several MB higher than before in the Android dumpsys meminfo report. The difference is due to a combination of increased native heap and graphics memory consumption.

Are the performance benchmarks running with the OpenGL backend? If so, a memory difference is expected, because previously ShaderWarmUp could incorrectly run on the CPU renderer, which meant no shaders were actually warmed up. The added memory should be the compiled shaders. I wouldn't expect those to be several MB in size, though; even as raw text they are maybe 1 MB.

@gaaclarke (Author, Contributor) commented on Jul 17, 2019

> This affected the memory usage metrics in the performance benchmarks.

Another thing to consider is that, depending on how you measure memory usage, this could report a higher number because the code is more parallelized. For example, if it measures peak allocations: we now potentially run the GPU thread at the same time as the IO thread, so peak usage could be higher, whereas previously everything was put on the GPU thread and peak usage was gated on what that one thread was doing.

@liyuqian (Contributor) commented on Jul 17, 2019

It seems that most memory regressions are on Android, and I believe the shader warm-up was previously using the GPU on Android (it was only using the CPU on iOS).

Having resources on the IO thread in addition to those on the GPU thread might be a good explanation. In that case, maybe we should find some way to confirm the conjecture. Finally, if the 10 MB cost is unavoidable on the IO thread, then we need to figure out the trade-offs to decide whether to use the GPU or IO thread by default (the decision could be different for Android and iOS), and probably provide options for developers to choose which thread to use.

@jason-simmons (Contributor) commented on Jul 17, 2019

You can see the effect on memory usage by starting a simple app created with flutter create and running adb shell dumpsys meminfo.
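For reference, a minimal command transcript for reproducing this; the app and package names are the flutter create defaults, chosen here for illustration:

```sh
# Create and run a minimal Flutter app.
flutter create memtest
cd memtest
flutter run --release

# In another shell, sample the app's memory usage while it runs.
adb shell dumpsys meminfo com.example.memtest
```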

I took these samples on a Nexus 5X.

dumpsys meminfo before this PR vs. after (values in KB):

                 Before     After
Java Heap:         4124      4172
Native Heap:      13340     23224
Code:             12560     13556
Stack:              556       568
Graphics:         50792     61824
Private Other:    15660     15692
System:            5161      7232

The additional memory usage appears to start when surface->getCanvas()->flush() is called on the IO thread in MakeRasterSnapshot.

@gaaclarke (Author, Contributor) commented on Jul 18, 2019

Here are traces of the first frame from before and after this patch:

post.json.gz
pre.json.gz

@gaaclarke (Author, Contributor) commented on Jul 18, 2019

Here is one more trace showing the GPU no longer being blocked by the warm-up. This results in a first frame that arrives roughly 75 ms sooner.

concurrent.json.gz

@liyuqian (Contributor) commented on Aug 9, 2019

Several notes from what @xster and I recently discussed:

  • The iOS persistent cache only stores GLSL, not binaries. Hence, doing warm-up on the IO thread is going to hurt iOS performance, as the shaders covered in DefaultShaderWarmUp still need to be compiled from GLSL to binary.
  • Warm-up on the IO thread also breaks some GrContext-related resource warm-up, since the IO thread and the GPU thread have different GrContexts. (I remember one particular example is the ProgramBuilder, which takes about 30 ms to initialize in @xster's trace.)
  • Before this patch, our GPU-thread warm-up probably didn't work for the iOS add-to-app scenario, but it should have been working for standalone iOS apps, as the traces in #9813 (comment) show.

Hence, we might want to reconsider how to do shader warm-up on iOS. It seems better to use the GPU thread for standalone iOS apps.

liyuqian added a commit to liyuqian/flutter that referenced this pull request on Aug 9, 2019

The update is copied from an update we made to a Google-internal client: cl/260202900.

The update will save one shader compilation. This should help solve our regression: flutter#31203.

More regressions on iOS might be introduced later by flutter/engine#9813 (comment). Unfortunately, we didn't rebase our benchmarks, so such regressions were not detected.

Hence, to fully solve flutter#31203, we might need to revert some changes in flutter/engine#9813 to make iOS shader warm-up happen on the GPU thread again.
liyuqian added a commit to flutter/flutter that referenced this pull request on Aug 15, 2019 (same commit message as above)