Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate wiring up Metal Binary Archives on iOS. #60267

Closed
chinmaygarde opened this issue Jun 25, 2020 · 134 comments
Closed

Investigate wiring up Metal Binary Archives on iOS. #60267

chinmaygarde opened this issue Jun 25, 2020 · 134 comments
Assignees
Labels
c: new feature Nothing broken; request for a new capability c: performance Relates to speed or footprint issues (see "perf:" labels) customer: crowd Affects or could affect many people, though not necessarily a specific customer. customer: product dependency: skia Skia team may need to help us engine flutter/engine repository. See also e: labels. P0 Critical issues such as a build break or regression perf: speed Performance issues related to (mostly rendering) speed platform-ios iOS applications specifically

Comments

@chinmaygarde
Copy link
Member

chinmaygarde commented Jun 25, 2020

Introduced at WWDC 2020, Metal Binary Archives speed up creation of pipeline state objects. In Flutter, these objects are created as necessary within a frame workload on the raster thread. This can cause jank. Investigate persisting these archives (shipping them if possible in the application bundle) and loading them (possibly via the Skia persistent cache interface) at runtime to reduce jank during startup.

Tracking issue in Skia: https://bugs.chromium.org/p/skia/issues/detail?id=10804

@chinmaygarde chinmaygarde added c: new feature Nothing broken; request for a new capability engine flutter/engine repository. See also e: labels. dependency: skia Skia team may need to help us P3 Issues that are less important to the Flutter project labels Jun 25, 2020
@dnfield dnfield added perf: speed Performance issues related to (mostly rendering) speed platform-ios iOS applications specifically c: performance Relates to speed or footprint issues (see "perf:" labels) labels Aug 11, 2020
@petro-i
Copy link

petro-i commented Sep 3, 2020

Issue containing phrase "This can cause jank," in case of Flutter, should have highest possible priority by default, IMHO,

@lukepighetti
Copy link

lukepighetti commented Sep 3, 2020

We've noticed new sources of jank (these are documented already in other issues) after Flutter switched to Metal and we've had to invest in some very expensive (time consuming) workarounds to be able to continue to ship high fidelity apps on iOS. I agree that these Metal jank issues should be high priority tasks. P5 is too low for these issues imho.

@oleh-kolinko
Copy link

oleh-kolinko commented Sep 5, 2020

Please investigate to possibly add this to flutter, I am experiencing terrible jank animations on iOS on startup.

This potentially can solve:
#61450

@kbokarius
Copy link

What could the timing be for a fix to these startup jank issues? We're completely invested in Flutter at this point, and it's frustrating to say the least that it's currently not possible for us to have smooth non-janky animations at startup, especially considering how important the startup experience is for onboarding and engaging new users.

@chinmaygarde
Copy link
Member Author

Elevated the priority of this issue with the Skia team. Adjusting here. Will elevate the priority here further once a milestone can be set.

@chinmaygarde chinmaygarde added P1 High-priority issues at the top of the work list and removed P3 Issues that are less important to the Flutter project labels Sep 21, 2020
@kbokarius
Copy link

In case anyone comes here looking for a workaround to startup jank issues, there are OpenGL compiled frameworks available here: https://github.com/acoutts/flutter-engines-no-metal

A more detailed report on our experience with these OpenGL frameworks: #61045 (comment)

@doppio
Copy link

doppio commented Sep 29, 2020

Thank you @kbokarius! That does significantly improve performance of first-time animations. Unfortunately, disabling Metal seems to result in several visual regressions in other areas of my app, but maybe it's worth the tradeoff. Thankfully this article drastically improved performance on Android without sacrificing visual quality.

It's really a shame, since Flutter is supposed to support 60 FPS out of the box, but there doesn't seem to be any way to run reasonably simple animations on iOS without significant performance hitches on the first run. My app's onboarding animations are currently an awful first impression on iOS, far from the smooth experience the framework promises. :(

@kbokarius
Copy link

kbokarius commented Sep 29, 2020

@doppio we're going to test and launch on Android after we launch iOS next month, so we haven't gotten to Android yet. All of our testing was done on iOS so far. Reading your message, it isn't clear if you've tried to disable Metal on iOS? I assume you did, and you ran into other visual regressions? In our case, simplifying our animations and reducing their overall visual attractiveness was well worth alleviating the startup animation jank. First time impressions are absolutely critical for consumer applications, so honestly I'm not sure what we're going to do if our comprehensive QA testing tonight with the OpenGL iOS build doesn't prove successful. I just can't imagine having to launch with that startup jank. I'd almost consider removing those animations altogether, as bad as that sounds.

Edit: the thought is that if the animation jank is really that bad, it might be worth investing time into exploring ways to address any visual regressions that result from disabling Metal. Hopefully we get a startup jank solution for Metal soon, but reverting to OpenGL does seem to be a workable solution in the interim albeit with additional development to work around the side effects.

@doppio
Copy link

doppio commented Sep 29, 2020

@kbokarius Yes, I tried disabling Metal on iOS and ran into other visual regressions. The first run performance of animations was significantly improved though, like you experienced. I definitely cannot launch with the startup jank either, so I'll either need to wait for a fix or find a workaround for my visual regression issues as you suggest.

Even with workarounds for the regressions, it will add a non-trivial amount of wasted dev time to build/download new "no-Metal" versions of the framework, get it set up with our CI/CD system etc. I think for now, I'll keep my fingers crossed and hope for a fix, since I'm still several months from a possible release candidate.

@erf
Copy link

erf commented Sep 29, 2020

@kbokarius i tried the method with the compiled frameworks you supplied, also doing the "SkSL warm up". I think i experienced some performance improvment, but there is still some yank when i open the first page (push a fullscreen MaterialPageRoute), which has simple UI and no animations. I experience the yank on iOS (iPhone 7), Android seem fine.

@doppio
Copy link

doppio commented Sep 29, 2020

That sounds like a different issue, @erf. Try exploring the Flutter DevTools to see if you can pin down what's causing the jank in your case.

@kbokarius
Copy link

Quick update on our situation: unfortunately, despite the improvement to startup jank, QA testing did show that our app has OOM crashes intermittently on both 12.X and 13.X devices with the OpenGL frameworks. It looks like we'll have to launch on iOS next month with startup jank. Hopefully we'll have a solution available for Metal soon.

@acoutts
Copy link

acoutts commented Nov 9, 2020

Quick update on our situation: unfortunately, despite the improvement to startup jank, QA testing did show that our app has OOM crashes intermittently on both 12.X and 13.X devices with the OpenGL frameworks. It looks like we'll have to launch on iOS next month with startup jank. Hopefully we'll have a solution available for Metal soon.

I recently found the same issue in my app. I think once you have more than a few basic screens / animations it quickly becomes unstable. I'm no longer recommending anyone runs OpenGL because of the potential for stability issues. I understand OpenGL is no longer supported either, so fingers crossed this will be resolved soon.

@alectogeek
Copy link

Quick update on our situation: unfortunately, despite the improvement to startup jank, QA testing did show that our app has OOM crashes intermittently on both 12.X and 13.X devices with the OpenGL frameworks. It looks like we'll have to launch on iOS next month with startup jank. Hopefully we'll have a solution available for Metal soon.

I recently found the same issue in my app. I think once you have more than a few basic screens / animations it quickly becomes unstable. I'm no longer recommending anyone runs OpenGL because of the potential for stability issues. I understand OpenGL is no longer supported either, so fingers crossed this will be resolved soon.

That is a pity. I released my app with OpenGL, there are much fewer janks, but users complain that the app is unstable.
We really hope that the Flutter team will solve all the problems with metal soon 🤞🤞🤞

@Superkramar
Copy link

We had same issue for iOS devices =( If I understand right, no way to fix this problem right now ?

@meowofficial
Copy link

Could anyone from Flutter team say something about probable terms of solution for ios? As far as I know the issue became critical on ios after release of flutter 1.17.0 on stable channel 6 months ago. I think almost all of flutter developers who know about the cause of their app's lags need this information to choose 1 of 4 available options:

  1. Downgrade to flutter 1.12
    I don't know much about consequences of this option because I didn't try it yet. I can't remember critical issues for me from that time, probably firebase_crashlytics won't work, may be obfuscation. If there wasn't a problem with OOM crashes it's the best of the worst I think.
  2. Release with no-metal flutter engine
    I don't know about others, but I don't have testing team or variety of iPhones. I would test on couple iPhones I have and it seems to me I can possibly face OOM crashes on production and won't do anything about it.
  3. Release with metal flutter engine
    As for me the janks are so terrible and app is so unusable that the only result I would achieve is 1-star rating, angry users and ruined launch.
  4. Wait for fix and add new features while issue is resolving. The question is how long?

@Hixie

This comment has been minimized.

@lukepighetti

This comment has been minimized.

@escamoteur

This comment has been minimized.

@dnfield

This comment has been minimized.

@lukepighetti

This comment has been minimized.

@machinescream

This comment has been minimized.

@dnfield
Copy link
Contributor

dnfield commented Dec 29, 2020

I've created https://stackoverflow.com/q/65485696/3293914 - please feel free to add new answers there or comments/edits as appropriate. That is a better area for collaboration on community support for this issue, since here good comments will likely just get burried (this issue already has about 100 comments, most of which get hidden when first coming to this page).

In the interest of not spamming the many subscribers of this issue, and keeping things on topic, I'm locking this issue to collaborators. I started hiding some of the off-topic conversation, but there's just a lot of it and I'm going to mainly leave it as is :)

If you'd like to discuss this further, please feel free to hop on the Discord.

@flutter flutter locked and limited conversation to collaborators Dec 29, 2020
@escamoteur
Copy link
Contributor

Do I understand this correct, that so far we aren't using compiled shaders?

@dnfield
Copy link
Contributor

dnfield commented Jan 1, 2021

It's that the binary output of compilation is not getting cached and reused

@escamoteur
Copy link
Contributor

But it would be possible to link the binary versions of into the executable, right?
What does SwiftUI make different to not get into this problems? And why isn't it a problem on Android?

@dnfield
Copy link
Contributor

dnfield commented Jan 2, 2021

These are not the kind of binary you would link together with your program, since they can be highly device/gpu/driver specific.

On Android you only get this problem on the very first time a new shader is compiled, after which they get cached.

SwiftUI benefits from boiling down to UIKit, which doesn't need shaders in the same way that Metal/OpenGL based applications do.

@escamoteur
Copy link
Contributor

While investigating how to realize a WebGPU plugin for Flutter I met two of the developers of Google's WebGPU implementation @SenorBlanco and @kainino0x and they told me that the topic of Shader problems with Metal is one that they are very familiar with.

Maybe it could make sense to discuss this with them to see if they have some additional ideas.

They also told me that there is an experimental WebGPU backend for Skia. Could that possibly be another approach to Flutter Web rendering?

cc: @timsneath I hope its ok that I encourage this communucation here.

@zanderso
Copy link
Member

zanderso commented Jan 5, 2021

Promoting to P2 and marking 'customer:product' to help ensure adequate progress is tracked ahead of the next release.

@zanderso zanderso added customer: product P2 and removed P1 High-priority issues at the top of the work list labels Jan 5, 2021
@kf6gpe kf6gpe added this to the January Beta Release (1.26) milestone Jan 7, 2021
@chinmaygarde
Copy link
Member Author

Update: A prototype with the caches in place should be out for review this week.

@escamoteur
Copy link
Contributor

Awesome!

@chinmaygarde
Copy link
Member Author

Findings From The Prototype

Some concerns were raised in the original Skia issue regarding the way in which the archives will be wired up. I haven't been able to successfully address all these concerns. Specifically, the concerns are mentioned in the header for MTLBinaryArchive instead of the documentation on Apple's developer portal (which is rather sparse). I am pasting the relevant lines from the header here so its easier to reference:

Updating a MTLBinaryArchive at runtime in a shipping app configuration is not
recommended; such a scenario requires corruption resiliency, careful storage
space management and may cache hard-to-reproduce errors. These kind of issues
are handled transparently by the Metal maintained cache, therefore we recommend
that MTLBinaryArchive is populated during development time and shipped as an
asset.

This recommendation does not work for Flutter. Flutter cannot recommend developers generate an archive at development time and ship as an asset. It requires platform (iOS/Mac) and device (per GPU family) specific training runs, increases binary size, and is of limited use unless the entire set of pipeline state objects used in the application at runtime are captured in the training run (not to mention the archive becomes stale as soon as the application code is updated). In that regard, the answer to my original call to "Investigate persisting these archives ... shipping them if possible in the application bundle" is a definite no. Apple's actual recommendation in the same header is as follows:

Once all desired pipeline state descriptors have been added, use
serializeToURL:error: to write the contents for the current device to disk.
MTLBinaryArchive files generated for multiple different devices can be combined
using the "lipo" tool into a single archive, which can then be shipped with the
application. It is possible to maintain different archive files for different
contexts; for example each level in a game may use a different cache object.

This cache will automatically accelerate pipeline state creation after a
pipeline is created for the first time. Use MTLBinaryArchive to augment that
cache by accelerating pipeline state creation even on the first run of an app.

Instead of that, the use-case addressed in the prototype is to maintain an archive of the last run of the application. This will not affect first run performance but will help with subsequent launches. Since this use-case is different from the Apple's recommendations, I have been trying to reason about the aforementioned "corruption resiliency, careful storage space management and possibility of caching hard-to-reproduce errors".

  • On corruption resiliency related to disk I/O: I believe we are fine on this front with careful error handling. We already have the GrPersistentCache in place and the issues related to disk I/O are already well understood and handled.
  • On corruption resiliency related to thread safety: Flutter does work with Metal on multiple threads. Neither the docs nor the header mentions threading at all. Assuming the worst, I only gave the onscreen GrContext an archive since we only access that on the raster thread. The IO thread does no rendering and should not have any render pipeline functions to cache. In the case of multiple engines, where there will be multiple raster threads, my approach was to give each engine its own separate archive. In the case of multiple engine having previous runs, give an array of these archives to each engine instance. This API does not exist in Skia today but we can request the team to add such an affordance. In the meantime, secondary engine launches will get just the one archive.
  • On storage space management: Disk usage should not grow indefinitely as archives for runs other than the last run will be collected.
  • On caching hard-to-reproduce errors: I suppose this is still a concern theoretically but it is not something I have explored in detail as pipeline state object management is internal to Skia. I can followup.

Clarifications on OpenGL

In my original call to investigate binary archives, I did not mention OpenGL at all. Metal binary archives are unrelated to any feature of the OpenGL backend on iOS. The same causes of jank I described in this issue exist in the OpenGL backend as well. For developers using training runs already to cache SKPs and seed them at runtime, it is likely that the increase in the number of unique pipeline state objects in the Metal backend gives more benefit to an OpenGL application than to one using Metal (the SKP cache is wired up for both OpenGL and Metal). But, just switching to OpenGL without SKP training runs is not expected to change much. However, I do realize that the discussions in this thread reference multiple real issues we need to make progress on and my attempt to deconflate the various potential issues is as follows:

  • Without training runs, the Metal backend is alleged to be constructing more pipelines state objects than OpenGL.
  • Without training runs, the Metal backend is alleged to be taking more time to construct a pipeline state object than OpenGL.
  • After training runs, the Metal backend is alleged to demonstrate less benefits over the OpenGL backend.
  • Pipeline state objects are constructed in a frame workload.

Besides the last point, which I believe is the long term solution to jank due to shader compilation, we can work in parallel to answer the other narrowly scoped questions and deploy improvements as necessary. However, unlike as suggested by some in this thread, I highly recommend against switching to the OpenGL backend via custom engine builds. There are stability, performance and correctness issues in that backend that the team can no longer fix.

Next Steps

  • Specific issues are filed for the investigating work items I described in the last section and we go chase each one down.
  • Given the more qualified scope of the prototype, I get guidance from release management on how to prioritize this issue and the timelines. If we do decide to productionize the archive usage as described, the limitations must be made clear upfront.
  • As a team, we need to figure out how to better communicate with the community on these sorts of performance issues. I know I was really overwhelmed with the numerous threads that (to me) were tangential to the original issue describing the need to investigate wiring up the binary archives. And I am pretty sure most folks here in centithread were dissatisfied with the answers they got (or didn't).
  • We fast track the design work to move creation of pipeline state objects outside the frame workload. That is the only way to fix jank of this sort without resorting to training runs and such.

@zanderso
Copy link
Member

Following @chinmaygarde’s detailed analysis above, I’d like to close the loop on this issue. The issue as stated was to “Investigate wiring up Metal Binary Archives on iOS”, and that work has been completed. Unfortunately, the results of that investigation were not what we had hoped for.

As described above, we encountered two main problems. First, integrating this feature of Metal into the Flutter Engine entails off-label uses of Apple’s APIs, with no guarantees that the off-label uses will continue to work going forward. Second, the performance improvements were extremely modest as Metal shader library construction and not pipeline state object creation was found to be the bottleneck. Taken together, these two problems would mean accepting a large risk in exchange for a limited benefit. For that reason, we have decided not to do this.

However, though it covered several different topics, the discussion in this thread has shed light on a few different areas for improvement being tracked here. We will continue to be hard at work on solving first-run shader compilation jank on both iOS and Android, as we currently view it as one of the most serious performance issues facing Flutter. We are focusing our efforts on a long-term, maintainable solution to this problem outlined by the linked issues.

Jank has many causes outside of shader compilation and pipeline setup, so in the meantime, please see the good general advice from @stx above about optimizing animations here, stackoverflow for queries about other strategies for working around this issue, starting with @dnfield’s thread here, as well as other documents, videos, and tools for performance optimization generally. Thanks!

@escamoteur
Copy link
Contributor

@zanderso I read die analysis in the PR and added this question flutter/engine#23914 (comment)

@Hixie Hixie added this to Shader warmup in Early-onset jank Feb 28, 2021
@Hixie Hixie moved this from Shader warmup to Closed in Early-onset jank Mar 29, 2021
@flutter-triage-bot flutter-triage-bot bot added P0 Critical issues such as a build break or regression and removed P2 labels Jun 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
c: new feature Nothing broken; request for a new capability c: performance Relates to speed or footprint issues (see "perf:" labels) customer: crowd Affects or could affect many people, though not necessarily a specific customer. customer: product dependency: skia Skia team may need to help us engine flutter/engine repository. See also e: labels. P0 Critical issues such as a build break or regression perf: speed Performance issues related to (mostly rendering) speed platform-ios iOS applications specifically
Projects
Development

No branches or pull requests