Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Impeller] Why did Impeller opt for "triangulation then draw" over "stencil then cover"? #123671

Closed
Tracked by #144398
ColdPaleLight opened this issue Mar 29, 2023 · 19 comments
Assignees
Labels
e: impeller Impeller rendering backend issues and features requests engine flutter/engine repository. See also e: labels. P3 Issues that are less important to the Flutter project team-engine Owned by Engine team triaged-engine Triaged by Engine team

Comments

@ColdPaleLight
Copy link
Member

ColdPaleLight commented Mar 29, 2023

Recently, while reading the code of Skia's new GPU backend "Graphite", I noticed that they did not use a technology like "libtess2" to triangulate paths, but instead used "stencil then cover" or its variants. Details can be found in their code (https://github.com/google/skia/blob/5d8b4bc3c7133e38dcae7b76cd708ed4649c13d1/src/gpu/graphite/Device.cpp#L1103-L1121) and in their implementation of the TessellateWedgesRenderStep (https://github.com/google/skia/blob/main/src/gpu/graphite/render/TessellateWedgesRenderStep.cpp).

In my opinion, "stencil then cover" has many benefits, such as not requiring triangulation techniques like libtess2, thus saving CPU time. Additionally, it can use a single render pass for convex paths without triangulation, and it's easier to design methods for parallel drawing of curves using GPUs without pre-computing vertices through CPU or compute shaders.

Of course, "stencil then cover" also has drawbacks, such as the need for 2-pass or 3-pass when drawing a path. However, since there is no switching of the RenderTarget during the process, it doesn't seem to have a significant impact on performance.

I'm curious as to why Impeller didn't choose the technical solution of "stencil then cover"? @chinmaygarde @bdero, would you mind enlightening me? Thank you!

@ColdPaleLight ColdPaleLight added the e: impeller Impeller rendering backend issues and features requests label Mar 29, 2023
@danagbemava-nc danagbemava-nc added the engine flutter/engine repository. See also e: labels. label Mar 30, 2023
@chinmaygarde chinmaygarde added the P3 Issues that are less important to the Flutter project label Apr 1, 2023
@jonahwilliams
Copy link
Member

The initial reason for using libtess2 is that it is easy to start with an engine that can render everything that Flutter supports without needing to add special cases. I don't think we're necessarily married to it as an implementation strategy.

Certainly, libtess2 can be very expensive in terms of CPU utilization. I think we have a good path forward for strokes and convex fills, we're in the process of switching those over to simpler routines and leaning more on the GPU to determine geometry.

For concave / complex polygons, in my experience this occur rarely but are very expensive to tessellate. We're certainly open to solutions, provided these solutions are fairly general purpose and don't require a combinatorial explosion of shaders

@dnfield
Copy link
Contributor

dnfield commented May 9, 2023

For the linked code, it seems like it requires some analytic AA solution. If that's not the case we should explore it more. If it is the case we'd have to figure out some efficient and usable way of doing the AA.

@bdero
Copy link
Member

bdero commented May 10, 2023

In terms of concrete implementation, my understanding is that stencil-then-cover means using a derivative of this trick. Impeller defers antialiasing to pass resolve time, and so AA shouldn't be a concern.

As Jonah mentioned, for all convex fills, we're adding specialized paths that don't involve running a generic tessellation algorithm.

Concave paths is where "stencil then cover" shines. In the ideal case, we get to trade off running a poorly scaling tessellation algorithm on the CPU for an extra draw call and a bit of stencil overdraw. Clearly a good deal!

What's not clear to me is how we can use a technique like this to efficiently implement the nonZero winding mode, which is Flutter's default and thus the common case:

evenOdd (not Flutter's default) is trivial to implement with this trick. It would even work for self-intersecting and multi-contour paths. Performing the stencil inversion is key to erasing all of the geometry that lies outside the path. But for the other winding modes (including nonZero, which is the only other winding mode supported by Flutter's API), even if we dedicated multiple bits in the stencil for counting the overlap, a workable solution to cancel out parts of the geometry fan that lie outside the path isn't obvious to me.

One potential way to eke out gains for nonZero concave paths would be to "stencil then cover" only for cases where we can determine that a nonZero path would draw the same if it was evenOdd, and continue falling back to libtess2 for rarer cases. For example, if the path has one contour and doesn't self-intersect (an expensive check), we can be sure it'll behave the same as evenOdd -- but then we're back in the business of computing path properties that scale poorly, diminishing some of the benefit.

@dnfield
Copy link
Contributor

dnfield commented May 10, 2023

AA shouldn't be a concern

The problem is that if the geometry supplied is a quad, MSAA will only see the edges of the quad - the stencil and cover techniques would be in a fragment stage

@bdero
Copy link
Member

bdero commented May 10, 2023

The problem is that if the geometry supplied is a quad, MSAA will only see the edges of the quad - the stencil and cover techniques would be in a fragment stage

The particular variation I'm describing doesn't have this problem.

We first "stencil" by carving out the path on the stencil buffer (with no color attachment) in a way that requires no tessellation, and then "cover" by drawing a quad over the path coverage with a color attachment, culling fragments that lie outside the path just like a regular Impeller clip does. Both the stencil and color are multisample in tile memory.

@dnfield
Copy link
Contributor

dnfield commented May 10, 2023

I think we could do that in compute, and I think there's probably a way to adapt this similarly for nonZero but need to mull this over.

@ColdPaleLight
Copy link
Member Author

What's not clear to me is how we can use a technique like this to efficiently implement the nonZero winding mode, which is Flutter's default and thus the common case

1. Prepare

Split the polygon into triangles, just like mentioned in the article
ki2LAsWg4T

2. Stencil

Before each stencil operation, ensure that the value on the stencil buffer is 0.

We can observe that some of these triangles are counterclockwise while others are clockwise. Suppose we set the counterclockwise to be the front and the clockwise to be the back, then we can set the front stencil pass op to IncreaseWrap, and the back stencil pass op to DecreaseWrap. Then draw all the triangles. In the end, all values ​​not equal to 0 in the stencil buffer are the area of ​​the entire polygon

3. Cover

Draw a rectangle whose size is the size of the RenderTarget, when drawing, we need to set the compare op of the stencil test to NotEqual, and the stencil ref value to 0, so that the polygon can be drawn.

What is the biggest difficulty for Impeller to implement this method?

One reason I think Impeller doesn't work this way right now is Impeller's Clip implementation. Now Impeller's Clip implementation requires a stencil buffer, and all bits of the stencil buffer need to be used. The stencil then cover will also use all the bits of the stencil buffer, so in the current implementation of Impeller, it is difficult to introduce the algorithm of stencil then cover

Does Skia use stencil then cover, how does it solve the conflict with clip implementation?

Skia's gpu acceleration has two implementations, ganesh and graphite.

In Ganesh, if the stenci buffer is enabled and the stencil buffer participates in the clip, only the highest bit in the stencil buffer value will be used to represent the clip, while the rest of the lower bits can be used for the stencil then cover. For the specific algorithm, please see section 3.4 of this article: https://developer.download.nvidia.cn/devzone/devcenter/gamegraphics/files/opengl/gpupathrender.pdf

In Graphite, clip is implemented using depth test and depth buffer, so there is no conflict anymore.

@bdero
Copy link
Member

bdero commented May 11, 2023

We can observe that some of these triangles are counterclockwise while others are clockwise.

Oh derp, this was the insight I was missing while visualizing the triangle fan in my mind. Yes, it's easy to implement nonZero with this method as well, assuming we can dedicate enough stencil bits to it. Impeller doesn't support separate stencil operations for backfaces and frontfaces at the moment, but we can and should add this, as all graphics backends we'll be targeting indeed support this.

Now Impeller's Clip implementation requires a stencil buffer, and all bits of the stencil buffer need to be used. The stencil then cover will also use all the bits of the stencil buffer, so in the current implementation of Impeller, it is difficult to introduce the algorithm of stencil then cover

I more or less agree with this. A 255 clip stack limit is reasonable, but 8 bits isn't going to be enough if we get into the business of segmenting the stencil buffer to play multiple counting roles. And we can't reach for larger 16 bit stencils.

In Graphite, clip is implemented using depth test and depth buffer, so there is no conflict anymore.

Yes, this makes sense. It's quite likely we'll need a depth buffer anyhow for the purpose of draw order optimization -- even if we didn't reverse opaque draw calls, some GPUs don't perform HSR if depth testing is disabled. Clips are hard barriers for reversing the draw order, and so cramming both jobs into the depth buffer would be possible without loss of effectiveness.

I'll append a depth-based design to #114402 that includes clipping in the depth buffer (right now it only has a stencil-based proposal, but again, segmenting the stencil buffer isn't really practical for this).

@dnfield
Copy link
Contributor

dnfield commented May 11, 2023

What if we just used a separate render pass for this?

@bdero
Copy link
Member

bdero commented May 11, 2023

What if we just used a separate render pass for this?

This would let us repurpose the stencil buffer, but we'd have to store an extra texture and add an additional texture sampling draw call on the original pass.

@dnfield
Copy link
Contributor

dnfield commented May 11, 2023

I think utilizing the depth buffer for clipping and repurposing the stencil buffer for filling will be as good or better than whatever we could come up with in compute for tessellating fills.

@bdero
Copy link
Member

bdero commented May 11, 2023

One major concern with adding a depth buffer is that Impeller sometimes needs to segment rendering of a single SaveLayer into multiple passes (this happens when a backdrop filter or advanced blend is present). Today, when we encounter these scenarios, we take a hit of allocating/storing the multisample stencil buffer on the device heap -- the stencil is 8 bit, and so this is the same storage price as the RGBA resolve texture. If we add a depth buffer to the mix, this cost will 3x if the depth buffer is 16 bits and 5x if the depth buffer is 32 bits.

Given our use of MSAA, if we can't find a way around this without extra render passes, I think it's going to be hard to find a short term solution that's not substantially heavier than just tessellating triangle meshes like we're doing now (whether that be on the CPU or on the GPU with a compute shader).

Multi-pass is kinda not the common case but also isn't rare. It might be worth having a compromise solution where we only track a depth buffer and perform stencil-then-cover when we know that multi-pass isn't needed, for example.

@dnfield
Copy link
Contributor

dnfield commented May 11, 2023

perform stencil-then-cover when we know that multi-pass isn't needed, for example.

The sad thing about this is that apps suddenly mysteriously drop off in performance because of a one line code change. We could still do it, but it makes users really frustrated.

@chinmaygarde
Copy link
Member

Since this is more of a discussion than an issue or proposal. I'm closing this.

@chinmaygarde chinmaygarde closed this as not planned Won't fix, can't repro, duplicate, stale Sep 8, 2023
@jonahwilliams
Copy link
Member

I'm reopening this issue because I think that the current tessellation performance is not good enough and there is more to discuss here. For complex animated geometry, we're currently bottlenecked on libtess2. The generic tessellation algorithm is not parallelizable so we cannot use compute. We could attempt to use worker tasks, but given the limited fast threads on Android and the fact that tessellation can easily account for > 50% of CPU time, I think there is little to gain here.

In contrast, triangle generation for STC is trivially parallelizable. We just have the issue of storage of depth buffers with backdrop filters. So dumb question, rather than storing the entire depth buffer, can we instead store the depth affecting clip commands and then replay them with a fresh depth buffer? FYI @bdero @dnfield

@jonahwilliams jonahwilliams reopened this Sep 17, 2023
@bdero
Copy link
Member

bdero commented Sep 18, 2023

So dumb question, rather than storing the entire depth buffer, can we instead store the depth affecting clip commands and then replay them with a fresh depth buffer?

I don't think this is such a bad idea to try. So long as we continue not tracking draw order and feel HSR coverage good enough (wonderful on iOS, but I haven't confirmed HSR is actually working on any Android devices), dedicating the depth buffer to clips and the stencil buffer to StC seems feasible. EntityPass already internally tracks a clip coverage stack, and so keeping a running record of the entities needed to restore the clip state in the depth buffer isn't a big deal.
I suspect that In the common case (with perhaps a handful of large/mid sized coverage area clips), having to redraw the clip collection would require order(s) of magnitude less energy on the GPU than having to store and restore the absolute ocean liner that is a 32bit x4 multisample depth buffer.

We should also double-check/document the availability of different depth/stencil formats across Flutter's target devices. Last time I checked, we'll need to support at least S8D32 and S8D24.

@jonahwilliams
Copy link
Member

We have landed a change that removes stencil storage and replays clips as I described earlier in this issue: #137448

This technically unblocks the foundational work to begin using StC

auto-submit bot pushed a commit to flutter/engine that referenced this issue Jan 17, 2024
#47987)

Part of flutter/flutter#138460.

In preparation for [draw order optimization](flutter/flutter#114402) and [StC](flutter/flutter#123671).

Use a transient depth+stencil texture instead of a stencil-only texture. Doing this in isolation to detect/weed out any HAL bugs with handling the attachment.
auto-submit bot added a commit to flutter/engine that referenced this issue Jan 17, 2024
…il buffer." (#49832)

Reverts #47987
Initiated by: jonahwilliams
This change reverts the following previous change:
Original Description:
Part of flutter/flutter#138460.

In preparation for [draw order optimization](flutter/flutter#114402) and [StC](flutter/flutter#123671).

Use a transient depth+stencil texture instead of a stencil-only texture. Doing this in isolation to detect/weed out any HAL bugs with handling the attachment.
bdero added a commit to flutter/engine that referenced this issue Jan 22, 2024
…l buffer. (#49838)

Part of flutter/flutter#138460.

In preparation for [draw order
optimization](flutter/flutter#114402) and
[StC](flutter/flutter#123671).

Use a transient depth+stencil texture instead of a stencil-only texture.
Doing this in isolation to detect/weed out any bugs with handling the
attachment.
auto-submit bot pushed a commit to flutter/engine that referenced this issue Mar 5, 2024
bdero added a commit to bdero/flutter-engine that referenced this issue Mar 5, 2024
auto-submit bot pushed a commit to flutter/engine that referenced this issue Mar 5, 2024
auto-submit bot added a commit to flutter/engine that referenced this issue Mar 5, 2024
… path rendering. (#51209)" (#51217)

Reverts #51209
Initiated by: bdero
Reason for reverting: Golden breakages on [engine roll](flutter/flutter#144647) https://flutter-gold.skia.org/search?issue=144647&crs=github&patchsets=5&corpus=flutter
Original PR Author: bdero

Reviewed By: {jonahwilliams, chinmaygarde}

This change reverts the following previous change:
Original Description:
Turn the page, wash your hands.

Addresses the following issues:
* flutter/flutter#143077
* flutter/flutter#137714
* flutter/flutter#138460
* flutter/flutter#123671
* flutter/flutter#141961
* flutter/flutter#134432

Previous attempt:
- #50856
    - reverted with #51191
    - fixed with #51198
bdero added a commit to bdero/flutter-engine that referenced this issue Mar 6, 2024
auto-submit bot pushed a commit to flutter/engine that referenced this issue Mar 6, 2024
…dering. (#51219)

Turn the page, wash your hands.

Addresses the following issues:
* flutter/flutter#143077
* flutter/flutter#137714
* flutter/flutter#138460
* flutter/flutter#123671
* flutter/flutter#141961
* flutter/flutter#134432

Previous attempts:
1. #50856
    - reverted with #51191
    - fixed with #51198
2. #51209
    - reverted with #51217
    - fixed with #51218
@bdero bdero self-assigned this Mar 12, 2024
@bdero
Copy link
Member

bdero commented Mar 12, 2024

StC was enabled with flutter/engine#51219

@bdero bdero closed this as completed Mar 12, 2024
Copy link

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
e: impeller Impeller rendering backend issues and features requests engine flutter/engine repository. See also e: labels. P3 Issues that are less important to the Flutter project team-engine Owned by Engine team triaged-engine Triaged by Engine team
Projects
No open projects
Status: ⚡ Performance
Development

No branches or pull requests

6 participants