
Add new option: cpu_cache #120

Closed

Conversation

@AlexCharlton commented Sep 14, 2020

Hi Alex, thanks for the super useful library!

I've been using glyph-brush (and previously rusttype::gpu_cache) with wgpu-rs. I was doing so by calling the CommandEncoder method copy_buffer_to_texture for each value returned by cache_queued. When I updated wgpu to version 0.6.0, I hit a snag, as it started enforcing 256-byte alignment on buffer copies.

I had also noticed that this strategy came with performance problems, as it required one copy to the GPU per changed glyph. When rendering 100 glyphs for the first time, the result was 14.1ms spent caching glyphs, the majority of which went to these GPU operations:
[Profiler screenshot: rusttype::gpu_cache]
(sorry, it's a little hard to see where the time is going, but all of the wgpu operations under update_font_cache::{{closure}} are the buffer-to-texture copies)
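For context, the uploader I describe above looked roughly like the sketch below. Field names assume wgpu ~0.6 and an R8 cache texture; `staging_buffer`, `cache_texture`, `encoder`, and `queue` are illustrative names from my setup, not anything this library provides.

```rust
// Roughly the old per-glyph upload path: one buffer-to-texture copy per
// changed glyph, queued by the `cache_queued` uploader callback.
draw_cache
    .cache_queued(&fonts, |rect, tex_data| {
        // Simplified: a real version needs a distinct, aligned staging
        // offset per glyph rather than always writing at offset 0.
        queue.write_buffer(&staging_buffer, 0, tex_data);
        encoder.copy_buffer_to_texture(
            wgpu::BufferCopyView {
                buffer: &staging_buffer,
                layout: wgpu::TextureDataLayout {
                    offset: 0,
                    // The snag: wgpu 0.6 requires this to be a multiple of 256,
                    // which a small glyph rect almost never is.
                    bytes_per_row: rect.width(),
                    rows_per_image: rect.height(),
                },
            },
            wgpu::TextureCopyView {
                texture: &cache_texture,
                mip_level: 0,
                origin: wgpu::Origin3d {
                    x: rect.min[0],
                    y: rect.min[1],
                    z: 0,
                },
            },
            wgpu::Extent3d {
                width: rect.width(),
                height: rect.height(),
                depth: 1,
            },
        );
    })
    .expect("cache_queued");
```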

To help solve both of these problems (after switching to glyph_brush_draw_cache), I added a new option: cpu_cache. From the rust doc comment on the option:

> Cache the rasterized glyphs in CPU-side memory. When enabled, `cache_queued` will return any rows of glyphs that have been updated (as opposed to individual glyphs). When multiple contiguous rows are updated, they are sent to the `cache_queued` user-supplied `uploader` function as a single chunk. The result is a minimization of calls to `uploader`, which in turn can minimize the number of write operations to one's GPU cache.
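On the wgpu side, consuming those row chunks can then be a single `write_texture` per callback, something like this sketch (again wgpu ~0.6 names, an R8 cache texture, and illustrative `queue`/`cache_texture`):

```rust
// With cpu_cache enabled the callback fires once per chunk of contiguous
// updated rows rather than once per glyph, so this runs only a handful of
// times per frame.
draw_cache
    .cache_queued(&fonts, |rect, tex_data| {
        queue.write_texture(
            wgpu::TextureCopyView {
                texture: &cache_texture,
                mip_level: 0,
                origin: wgpu::Origin3d {
                    x: rect.min[0],
                    y: rect.min[1],
                    z: 0,
                },
            },
            tex_data,
            wgpu::TextureDataLayout {
                offset: 0,
                bytes_per_row: rect.width(), // R8: 1 byte per texel
                rows_per_image: rect.height(),
            },
            wgpu::Extent3d {
                width: rect.width(),
                height: rect.height(),
                depth: 1,
            },
        );
    })
    .expect("cache_queued");
```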

The result is this draft PR. Now rendering 100 glyphs for the first time takes 5.6ms, with my profiler not even registering the copy-to-texture operation:
[Profiler screenshot: glyph_brush_draw_cache with cpu_cache]

If you think this approach has legs, then I'll do the work to get this working with the multithread option and add some tests. Thanks!

@alexheretic (Owner)

Hey, thanks for the PR! Grouping contiguous uploads together is an interesting idea.

Having to upload a whole row for a single glyph change doesn't seem ideal though. If we go further with this, I wonder if we could have a separate optional layer that provides an uploader to the current code, saves into an array and minimises uploads & rounds alignments. This should then integrate fine with the multithreading code, as it'll be on a different level. We could maybe remove the align_4x4 option too, as this layer could perform more flexible alignment rounding?
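Very roughly, something like this sketch of such a layer (purely illustrative, not an API proposal; it assumes an R8 cache texture and texel min/max coordinates passed through from the existing uploader callback):

```rust
/// Illustrative wrapper layer: it keeps a CPU copy of the cache texture,
/// records glyph uploads into it, and flushes the dirty region as one
/// alignment-rounded upload per frame. Names and layout are hypothetical.
struct UploadBatcher {
    width: u32,
    pixels: Vec<u8>,                     // CPU copy of the cache texture (R8)
    dirty: Option<(u32, u32, u32, u32)>, // (min_x, min_y, max_x, max_y)
}

impl UploadBatcher {
    fn new(width: u32, height: u32) -> Self {
        Self { width, pixels: vec![0; (width * height) as usize], dirty: None }
    }

    /// Used as the `cache_queued` uploader: copy the glyph into the CPU
    /// texture and grow the dirty rect, without touching the GPU.
    fn record(&mut self, min: [u32; 2], max: [u32; 2], data: &[u8]) {
        let w = (max[0] - min[0]) as usize;
        for (i, row) in data.chunks_exact(w).enumerate() {
            let start = (min[1] as usize + i) * self.width as usize + min[0] as usize;
            self.pixels[start..start + w].copy_from_slice(row);
        }
        self.dirty = Some(match self.dirty {
            None => (min[0], min[1], max[0], max[1]),
            Some((x0, y0, x1, y1)) => {
                (x0.min(min[0]), y0.min(min[1]), x1.max(max[0]), y1.max(max[1]))
            }
        });
    }

    /// Once per frame: hand the dirty region, rounded out to `align` texels,
    /// to a single upload call.
    fn flush(&mut self, align: u32, mut upload: impl FnMut([u32; 2], [u32; 2], &[u8])) {
        if let Some((x0, y0, x1, y1)) = self.dirty.take() {
            let x0 = x0 / align * align;
            let x1 = ((x1 + align - 1) / align * align).min(self.width);
            let row_len = (x1 - x0) as usize;
            let mut chunk = Vec::with_capacity(row_len * (y1 - y0) as usize);
            for y in y0..y1 {
                let start = (y * self.width + x0) as usize;
                chunk.extend_from_slice(&self.pixels[start..start + row_len]);
            }
            upload([x0, y0], [x1, y1], &chunk);
        }
    }
}
```

At the call site it would slot in between the existing uploader callback and the GPU write, e.g. `cache_queued(&fonts, |r, d| batcher.record(r.min, r.max, d))` followed by one `batcher.flush(256, ...)` per frame (for an R8 texture, texels and bytes coincide, so 256 texels also satisfies the 256-byte copy alignment).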

Regarding the alignment issues it might be worth looking at (or maybe working with) https://github.com/hecrj/wgpu_glyph to see how that was addressed there.

@alexheretic (Owner) commented Sep 15, 2020

I also had an idea for a "lite" version of this: simply check whether the draw-cache is empty and, if so, skip all the individual uploads and perform a single rect update covering all rows. This has the advantage of not requiring texture information to be held in memory (at least not long term).

If we break the uploading use cases down to:

  1. First time population of the draw-cache.
  2. Small incremental updates.
  3. Large partial updates.
  4. Draw-cache reordering to fit all glyphs (clears the cache and reworks the queue).

The lite optimisation will help 1 & 4 directly. 2 is already fairly optimal, at least when it comes to uploads. That leaves just 3 to be handled less efficiently with multiple uploads.
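Viewed from outside the library, the coalescing part of that lite idea amounts to something like this sketch (illustrative types only, not the internals; the buffer lives just for the frame, which is the point):

```rust
// Conceptual sketch of the "lite" coalescing: given per-glyph updates from a
// freshly emptied cache, build one covering rect and make a single upload.
// `Rect` here is just (min, max) texel coordinates, not the library's type.
type Rect = ([u32; 2], [u32; 2]);

fn coalesce_initial_population(
    glyph_updates: &[(Rect, Vec<u8>)],
    mut upload_once: impl FnMut(Rect, &[u8]),
) {
    if glyph_updates.is_empty() {
        return;
    }
    // Bounding rect of every updated glyph.
    let (mut min, mut max) = ([u32::MAX; 2], [0u32; 2]);
    for ((lo, hi), _) in glyph_updates {
        min = [min[0].min(lo[0]), min[1].min(lo[1])];
        max = [max[0].max(hi[0]), max[1].max(hi[1])];
    }
    let (w, h) = ((max[0] - min[0]) as usize, (max[1] - min[1]) as usize);

    // Frame-local scratch only; nothing is kept after the upload.
    let mut scratch = vec![0u8; w * h];
    for ((lo, hi), pixels) in glyph_updates {
        let gw = (hi[0] - lo[0]) as usize;
        for (i, row) in pixels.chunks_exact(gw).enumerate() {
            let y = (lo[1] - min[1]) as usize + i;
            let x = (lo[0] - min[0]) as usize;
            scratch[y * w + x..y * w + x + gw].copy_from_slice(row);
        }
    }
    upload_once((min, max), &scratch);
}
```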

On the face of it this idea seems attractive and should similarly improve your "first time" benchmark. What do you think?

@AlexCharlton (Author)

I'm glad this stimulated some interesting thoughts :)

> Having to upload a whole row for a single glyph change doesn't seem ideal though. If we go further with this, I wonder if we could have a separate optional layer that provides an uploader to the current code, saves into an array and minimises uploads & rounds alignments

I had considered the idea of more "optimally" batching these uploads, but consciously decided against it. My intuition here is that, in this sort of situation where you're trying to maintain a GPU cache, the synchronization cost of telling the GPU, "I want you to copy this data," far outweighs the cost of performing the actual transfer for pretty much any row size you'd have. In other words, I think that in practice the cost of sending a whole row for a single (or small number of) glyph change(s) will work out to be less than the cost of either 1) sending the small changes individually (with some alignment logic thrown in), or 2) figuring out how to group the small changes together, and potentially sending a larger number of total updates.

> The lite optimisation will help 1 & 4 directly. 2 is already fairly optimal, at least when it comes to uploads.

With this premise, I'd argue that 2 is actually non-optimal (and also does not address alignment issues). The thought here is that keeping the CPU-side cache both lets you limit GPU calls and fixes alignment issues with almost no additional overhead other than the RAM usage.

However, I could very well be wrong! I'm basing this all on my intuition formed around some tangential benchmarks. I'm happy to work to prove this intuition out, though, by creating some benchmarks that would be representative of the different use-cases we can hit with different approaches.


Separately, thanks for this recommendation:

> Regarding the alignment issues it might be worth looking at (or maybe working with) https://github.com/hecrj/wgpu_glyph to see how that was addressed there.

This https://github.com/hecrj/wgpu_glyph/blob/master/src/pipeline/cache.rs#L46 has made me realize that the important part of this restriction is actually the buffer size and not the size of the texture region being copied into. This could, in theory, be addressed in draw-cache by having a buffer-alignment option that forces the copied region into alignment while otherwise keeping everything else the same (or else, perhaps less efficiently, by copying the returned region into your own buffer first).
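For what it's worth, the "copy into your own buffer first" route looks something like this sketch on the caller's side (the 256 constant mirrors wgpu's COPY_BYTES_PER_ROW_ALIGNMENT, and the R8 one-byte-per-texel assumption is mine):

```rust
// Pad each row of the returned region so the copy's `bytes_per_row`
// satisfies wgpu's 256-byte alignment. Assumes an R8 cache texture.
const COPY_ALIGN: usize = 256; // wgpu::COPY_BYTES_PER_ROW_ALIGNMENT

fn pad_rows_to_alignment(tex_data: &[u8], width: usize, height: usize) -> (Vec<u8>, usize) {
    let padded_width = (width + COPY_ALIGN - 1) / COPY_ALIGN * COPY_ALIGN;
    let mut padded = vec![0u8; padded_width * height];
    for (row, padded_row) in tex_data
        .chunks_exact(width)
        .zip(padded.chunks_exact_mut(padded_width))
    {
        padded_row[..width].copy_from_slice(row);
    }
    // `padded_width` is the `bytes_per_row` to use for the buffer-to-texture copy.
    (padded, padded_width)
}
```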

@alexheretic (Owner)

I agree that more complex partial update cases require real world benchmarks to guide optimisation.

But doing a single rect update on initial population & full-reorder should be a win in any case. It should also be an "easy" optimisation on top of what already exists.

@AlexCharlton (Author)

Hi Alex, I spent the week running a few benchmarks on this branch and on master (including #121). I wanted to make the scenarios being tested at least somewhat "realistic" but a consequence of this is that I created the tests in an ad-hoc fashion and did not make them readily reproducible. Nonetheless I think the results are illuminating.

Methodology

There were three benchmarks which I'm calling "Resize", "Jitter" and "Large jitter". I tested them on this branch and your master with wgpu handling the rendering. I ran each test with the Vulkan and DirectX 12 backends and instrumented them with Superliminal.

I measured the time that it took to "update the texture cache" - which included calling glyph_cache.queue_glyph for all glyphs then glyph_cache.cache_queued. For each benchmark, I averaged how many glyph updates occurred per frame and how many calls to cache_queued were made (i.e. how many GPU operations). The operation within cache_queued was also timed. Each benchmark was run for 100 frames and the results are the averages across those frames. Draw-cache multithreading was turned off.
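Concretely, the timed region was roughly the following (sketch only; `positioned_glyphs`, `fonts`, and `upload_to_gpu` are stand-ins from my test harness):

```rust
use std::time::Instant;

// "Time to update texture cache" covers queueing every glyph plus the
// cache_queued call; the upload inside the callback was timed separately.
let start = Instant::now();
for (font_id, glyph) in &positioned_glyphs {
    glyph_cache.queue_glyph(*font_id, glyph.clone());
}
glyph_cache
    .cache_queued(&fonts, |rect, tex_data| upload_to_gpu(rect, tex_data))
    .expect("cache_queued");
let texture_cache_update = start.elapsed();
```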

I performed a write_texture operation on every call to cache_queued. This is a workable way of doing things for this branch, and as far as I can tell it's a best-case scenario (though not particularly practical) for the main branch. Further study of https://github.com/hecrj/wgpu_glyph shows that it's actually doing two GPU operations per glyph update: a copy_buffer_to_buffer followed by a copy_buffer_to_texture, the two combined being significantly slower than a single write_texture.

The three benchmarks were as follows:

Resize

This benchmark asks the question: What's the performance when you do something fairly minor, like resize your window, thus forcing a new text layout? In most cases I tested the answer was: performance on the whole is good because after a bit of resizing you wind up with all of the possible glyphs you need in the cache. To make it more interesting, I experimented with a set of glyphs + cache-size such that you could never fit all possible glyphs for all text layouts into the cache.

The result was a body of text, 625 characters, repeated 3 times in three different sizes, in a 512x512 cache. A random sized text boundary was chosen every frame, which caused an average of 29.3 glyph updates per frame.

Jitter

This benchmark asks the question: What's the performance when the cache experiences a reasonably large change? This could be caused by e.g. someone resizing their text or someone pasting some new text into place.

This was simulated with a body of text, 625 characters, in a 256x256 cache. The text had a random jitter between 0.0 and 1.0 applied to its x and y positions, which caused an average of 97.2 glyph updates per frame.
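The jitter itself was nothing more than the following (illustrative sketch; `positioned_glyphs` are the laid-out ab_glyph glyphs from the previous step):

```rust
// Nudge every glyph by a random sub-pixel amount each frame, changing its
// sub-pixel position so the cached rasterizations generally can't be reused.
for (_, glyph) in &mut positioned_glyphs {
    glyph.position.x += rand::random::<f32>(); // uniform in [0.0, 1.0)
    glyph.position.y += rand::random::<f32>();
}
```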

Large jitter

Much like the last, but with ~10x the number of changes.

This was done with a body of text, 625 characters, repeated 9 times in different sizes, in a 1024x1024 cache with jitter applied as before. This caused an average of 1236.1 glyph updates per frame.

Results

| Bench | Branch | Backend | Avg glyphs updated | Time to update texture cache (ms) | Avg texture copies | Time to copy a single texture (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Resize | this | Vulkan | 29.3 | 4.6 | 1.1 | 0.12 |
| Resize | this | DX12 | 29.3 | 4.8 | 1.1 | 0.17 |
| Resize | master | Vulkan | 29.3 | 5.2 | 15.3 | 0.05 |
| Resize | master | DX12 | 29.3 | 5.3 | 15.3 | 0.11 |
| Jitter | this | Vulkan | 97.2 | 7.2 | 3.9 | 0.06 |
| Jitter | this | DX12 | 97.2 | 8.7 | 3.9 | 0.16 |
| Jitter | master | Vulkan | 97.2 | 14.0 | 96.0 | 0.05 |
| Jitter | master | DX12 | 97.2 | 20.0 | 96.0 | 0.11 |
| Large jitter | this | Vulkan | 1236.1 | 70.1 | 2.1 | 0.13 |
| Large jitter | this | DX12 | 1236.1 | 82.1 | 2.1 | 0.21 |
| Large jitter | master | Vulkan | 1236.1 | 128.9 | 713.5 | 0.05 |
| Large jitter | master | DX12 | 1236.1 | 164.7 | 713.5 | 0.10 |

Discussion

The clear takeaway here is that the penalty for performing a GPU operation is large and the additional cost of copying more data is small. Even when ~100x the data is being copied, the operations only took about 2x as long. Clearly this makes doing 10-1000x fewer operations worthwhile. Even when only a small number of glyphs were updated, this branch still outperformed master, but as soon as you have more than a very small set of changes the performance advantage of this approach is fairly evident.

One question that remains is, are these numbers remotely meaningful? In other words, while using a CPU-side cache gives clearly better performance, is it actually worth the trouble? While you aren't likely to run into a scenario where you have to update this number of glyphs frame-after-frame-after-frame, I'd argue that it's entirely possible that you could still hit these situations on occasion (e.g. when resizing your text or pasting a chunk of previously unseen characters) and the penalty for doing so - skipping quite a number of frames - is severe enough that it should be minimized as much as possible.

The other question is, of course, does this mean that this is the right approach for this draw-cache library? And to that I say: I don't know! It's still entirely possible to apply this methodology if you are calling into draw-cache, without having to add it to the library itself, so it's not a matter of necessity. On one hand, I wouldn't be hugely bothered if I had to add a CPU-side cache onto my program. On the other hand, this really feels like it's the core use-case of this library, and the changes required to support this aren't particularly large. In the end, I'd say it comes down to your vision for what you think draw-cache should be doing for its users, @alexheretic. Either way, I'll personally be happy with the functionality it provides :)

@alexheretic (Owner) commented Oct 4, 2020

Thanks for doing the investigation! I think you make a good case for having the draw cache in RAM to reduce writes per frame.

Another little detail is that, with multithreading enabled, uploads can actually be done concurrently with rasterization... But I'm more convinced by your numbers overall.

Also, after thinking about it some more, storing the texture in RAM would be a fairly unsurprising thing for this lib to do. So I don't have anything against it; the current design just didn't need it.

If minimising uploads is the aim, I wonder if we could go further and guarantee that there is at most one upload per frame. So we take the smallest rect that covers all changed glyphs and upload that.

Following on from that, it makes me think a different API would make sense, as the current callback would only ever be executed zero or one times. So instead it could simply return a result that contains the texture updates:

fn cache_queued<F, U>(&mut self, fonts: &[F]) -> Result<TextureUpdate, CacheWriteErr>
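To make that concrete, here's one hypothetical shape it could take; none of this is settled, it's just an illustration of the at-most-one-upload idea:

```rust
/// Hypothetical return value: at most one texture write per cache_queued call.
pub enum TextureUpdate {
    /// Nothing changed this frame.
    None,
    /// The smallest rect covering every changed glyph, plus its pixel data.
    Rect {
        min: [u32; 2], // inclusive, in cache-texture texels
        max: [u32; 2], // exclusive
        data: Vec<u8>, // tightly packed, 1 byte per texel for an R8 cache
    },
}

// Hypothetical call site: zero or one GPU writes per frame.
// match draw_cache.cache_queued(&fonts)? {
//     TextureUpdate::Rect { min, max, data } => upload_to_gpu(min, max, &data),
//     TextureUpdate::None => {}
// }
```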

I was also thinking of optimisations this could allow when we reorder the draw cache. Currently in that case we clear and re-process, which reorders the glyphs tall-to-short. It also means we must rasterize all the glyphs again. With the texture in RAM we could instead move the existing glyph data into its reordered position in the new texture (rasterizing only anything we haven't already got).
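Sketching that reorder idea: with the texture in RAM, relocating a glyph is just a row-by-row copy from its old rect into its newly packed position (illustrative only; it assumes row-major R8 textures of the same width before and after):

```rust
// Move an already-rasterized glyph from its old rect to its reordered rect
// instead of rasterizing it again.
fn move_glyph(
    old_pixels: &[u8],
    new_pixels: &mut [u8],
    tex_width: usize,
    old_min: [usize; 2],
    new_min: [usize; 2],
    size: [usize; 2], // glyph [width, height] in texels
) {
    let [w, h] = size;
    for y in 0..h {
        let src = (old_min[1] + y) * tex_width + old_min[0];
        let dst = (new_min[1] + y) * tex_width + new_min[0];
        new_pixels[dst..dst + w].copy_from_slice(&old_pixels[src..src + w]);
    }
}
```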

@AlexCharlton (Author)

> If minimising uploads is the aim, I wonder if we could go further and guarantee that there is at most one upload per frame. So we take the smallest rect that covers all changed glyphs and upload that.

I think this approach would work quite well, and it makes the API quite attractive too. And yes, given the cost of rasterization, I think the additional opportunities for optimization are pretty interesting as well.

It feels like we've hit on something here, but that something looks like a pretty major change. Unless you're looking for help, I'm happy to let you take the lead here as I'm sure you're better equipped than I to drive this. Either way, I think this branch is far enough away from this vision that I'll close the PR. If you'd like any help moving this forward, just say the word!
