
Add new option: cpu_cache #120

Closed

Conversation

@AlexCharlton commented Sep 14, 2020

Hi Alex, thanks for the super useful library!

I've been using glyph-brush (and previously rusttype::gpu_cache) with wgpu-rs. I was doing so by calling the CommandEncoder method copy_buffer_to_texture for each value returned by cache_queued. When I updated wgpu to version 0.6.0, I hit a snag, as it started enforcing 256-byte alignment on buffer copies.

I had also noticed that this strategy came with performance problems, as it required one copy to the GPU per changed glyph. When rendering 100 glyphs for the first time, the result was 14.1ms spent caching glyphs, the majority of which went to these GPU operations:
[Profiler screenshot: rusttype::gpu_cache]
(sorry, it's a little hard to see where the time is going, but all of the wgpu operations under update_font_cache::{{closure}} are the buffer-to-texture copies)
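For context, the uploader I describe above looked roughly like the sketch below. Field names assume wgpu ~0.6 and an R8 cache texture; `staging_buffer`, `cache_texture`, `encoder`, and `queue` are illustrative names from my setup, not anything this library provides.

```rust
// Roughly the old per-glyph upload path: one buffer-to-texture copy per
// changed glyph, queued by the `cache_queued` uploader callback.
draw_cache
    .cache_queued(&fonts, |rect, tex_data| {
        // Simplified: a real version needs a distinct, aligned staging
        // offset per glyph rather than always writing at offset 0.
        queue.write_buffer(&staging_buffer, 0, tex_data);
        encoder.copy_buffer_to_texture(
            wgpu::BufferCopyView {
                buffer: &staging_buffer,
                layout: wgpu::TextureDataLayout {
                    offset: 0,
                    // The snag: wgpu 0.6 requires this to be a multiple of 256,
                    // which a small glyph rect almost never is.
                    bytes_per_row: rect.width(),
                    rows_per_image: rect.height(),
                },
            },
            wgpu::TextureCopyView {
                texture: &cache_texture,
                mip_level: 0,
                origin: wgpu::Origin3d {
                    x: rect.min[0],
                    y: rect.min[1],
                    z: 0,
                },
            },
            wgpu::Extent3d {
                width: rect.width(),
                height: rect.height(),
                depth: 1,
            },
        );
    })
    .expect("cache_queued");
```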

To help solve both of these problems (after switching to glyph_brush_draw_cache), I added a new option: cpu_cache. From the rust doc comment on the option:

> Cache the rasterized glyphs in CPU-side memory. When enabled, `cache_queued` will return any rows of glyphs that have been updated (as opposed to individual glyphs). When multiple contiguous rows are updated, they are sent to the `cache_queued` user-supplied `uploader` function as a single chunk. The result is a minimization of calls to `uploader`, which in turn can minimize the number of write operations to one's GPU cache.
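On the wgpu side, consuming those row chunks can then be a single `write_texture` per callback, something like this sketch (again wgpu ~0.6 names, an R8 cache texture, and illustrative `queue`/`cache_texture`):

```rust
// With cpu_cache enabled the callback fires once per chunk of contiguous
// updated rows rather than once per glyph, so this runs only a handful of
// times per frame.
draw_cache
    .cache_queued(&fonts, |rect, tex_data| {
        queue.write_texture(
            wgpu::TextureCopyView {
                texture: &cache_texture,
                mip_level: 0,
                origin: wgpu::Origin3d {
                    x: rect.min[0],
                    y: rect.min[1],
                    z: 0,
                },
            },
            tex_data,
            wgpu::TextureDataLayout {
                offset: 0,
                bytes_per_row: rect.width(), // R8: 1 byte per texel
                rows_per_image: rect.height(),
            },
            wgpu::Extent3d {
                width: rect.width(),
                height: rect.height(),
                depth: 1,
            },
        );
    })
    .expect("cache_queued");
```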

The result is this draft PR. Now rendering 100 glyphs for the first time takes 5.6ms, with my profiler not even registering the copy-to-texture operation:
[Profiler screenshot: glyph_brush_draw_cache with cpu_cache]

If you think this approach has legs, then I'll do the work to get this working with the multithread option and add some tests. Thanks!

@alexheretic (Owner)

Hey, thanks for the PR! Grouping contiguous uploads together is an interesting idea.

Having to upload a whole row for a single glyph change doesn't seem ideal though. If we go further with this, I wonder if we could have a separate optional layer that provides an uploader to the current code, saves into an array and minimises uploads & rounds alignments. This should then integrate fine with the multithreading code, as it'll be on a different level. We could maybe remove the align_4x4 option too, as this layer could perform more flexible alignment rounding?
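Very roughly, something like this sketch of such a layer (purely illustrative, not an API proposal; it assumes an R8 cache texture and texel min/max coordinates passed through from the existing uploader callback):

```rust
/// Illustrative wrapper layer: it keeps a CPU copy of the cache texture,
/// records glyph uploads into it, and flushes the dirty region as one
/// alignment-rounded upload per frame. Names and layout are hypothetical.
struct UploadBatcher {
    width: u32,
    pixels: Vec<u8>,                     // CPU copy of the cache texture (R8)
    dirty: Option<(u32, u32, u32, u32)>, // (min_x, min_y, max_x, max_y)
}

impl UploadBatcher {
    fn new(width: u32, height: u32) -> Self {
        Self { width, pixels: vec![0; (width * height) as usize], dirty: None }
    }

    /// Used as the `cache_queued` uploader: copy the glyph into the CPU
    /// texture and grow the dirty rect, without touching the GPU.
    fn record(&mut self, min: [u32; 2], max: [u32; 2], data: &[u8]) {
        let w = (max[0] - min[0]) as usize;
        for (i, row) in data.chunks_exact(w).enumerate() {
            let start = (min[1] as usize + i) * self.width as usize + min[0] as usize;
            self.pixels[start..start + w].copy_from_slice(row);
        }
        self.dirty = Some(match self.dirty {
            None => (min[0], min[1], max[0], max[1]),
            Some((x0, y0, x1, y1)) => {
                (x0.min(min[0]), y0.min(min[1]), x1.max(max[0]), y1.max(max[1]))
            }
        });
    }

    /// Once per frame: hand the dirty region, rounded out to `align` texels,
    /// to a single upload call.
    fn flush(&mut self, align: u32, mut upload: impl FnMut([u32; 2], [u32; 2], &[u8])) {
        if let Some((x0, y0, x1, y1)) = self.dirty.take() {
            let x0 = x0 / align * align;
            let x1 = ((x1 + align - 1) / align * align).min(self.width);
            let row_len = (x1 - x0) as usize;
            let mut chunk = Vec::with_capacity(row_len * (y1 - y0) as usize);
            for y in y0..y1 {
                let start = (y * self.width + x0) as usize;
                chunk.extend_from_slice(&self.pixels[start..start + row_len]);
            }
            upload([x0, y0], [x1, y1], &chunk);
        }
    }
}
```

At the call site it would slot in between the existing uploader callback and the GPU write, e.g. `cache_queued(&fonts, |r, d| batcher.record(r.min, r.max, d))` followed by one `batcher.flush(256, ...)` per frame (for an R8 texture, texels and bytes coincide, so 256 texels also satisfies the 256-byte copy alignment).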

Regarding the alignment issues it might be worth looking at (or maybe working with) https://github.com/hecrj/wgpu_glyph to see how that was addressed there.

@alexheretic (Owner) commented Sep 15, 2020

I also had an idea for a "lite" version of this: simply check whether the draw-cache is empty and, if so, skip all the individual uploads and perform a single rect update covering all rows. This has the advantage of not requiring texture information to be held in memory (at least not long term).

If we break the uploading use cases down to:

  1. First time population of the draw-cache.
  2. Small incremental updates.
  3. Large partial updates.
  4. Draw-cache reordering to fit all glyphs (clears the cache and reworks the queue).

The lite optimisation will help 1 & 4 directly. 2 is already fairly optimal, at least when it comes to uploads. That leaves just 3 to be handled less efficiently with multiple uploads.
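Viewed from outside the library, the coalescing part of that lite idea amounts to something like this sketch (illustrative types only, not the internals; the buffer lives just for the frame, which is the point):

```rust
// Conceptual sketch of the "lite" coalescing: given per-glyph updates from a
// freshly emptied cache, build one covering rect and make a single upload.
// `Rect` here is just (min, max) texel coordinates, not the library's type.
type Rect = ([u32; 2], [u32; 2]);

fn coalesce_initial_population(
    glyph_updates: &[(Rect, Vec<u8>)],
    mut upload_once: impl FnMut(Rect, &[u8]),
) {
    if glyph_updates.is_empty() {
        return;
    }
    // Bounding rect of every updated glyph.
    let (mut min, mut max) = ([u32::MAX; 2], [0u32; 2]);
    for ((lo, hi), _) in glyph_updates {
        min = [min[0].min(lo[0]), min[1].min(lo[1])];
        max = [max[0].max(hi[0]), max[1].max(hi[1])];
    }
    let (w, h) = ((max[0] - min[0]) as usize, (max[1] - min[1]) as usize);

    // Frame-local scratch only; nothing is kept after the upload.
    let mut scratch = vec![0u8; w * h];
    for ((lo, hi), pixels) in glyph_updates {
        let gw = (hi[0] - lo[0]) as usize;
        for (i, row) in pixels.chunks_exact(gw).enumerate() {
            let y = (lo[1] - min[1]) as usize + i;
            let x = (lo[0] - min[0]) as usize;
            scratch[y * w + x..y * w + x + gw].copy_from_slice(row);
        }
    }
    upload_once((min, max), &scratch);
}
```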

On the face of it this idea seems attractive and should similarly improve your "first time" benchmark. What do you think?

@AlexCharlton (Author)

I'm glad this stimulated some interesting thoughts :)

> Having to upload a whole row for a single glyph change doesn't seem ideal though. If we go further with this, I wonder if we could have a separate optional layer that provides an uploader to the current code, saves into an array and minimises uploads & rounds alignments

I had considered the idea of more "optimally" batching these uploads, but consciously decided against it. My intuition here is that, in this sort of situation where you're trying to maintain a GPU cache, the synchronization cost of telling the GPU, "I want you to copy this data," far outweighs the cost of performing the actual transfer for pretty much any row size you'd have. In other words, I think that in practice the cost of sending a whole row for a single (or small number of) glyph change(s) will work out to be less than the cost of either 1) sending the small changes individually (with some alignment logic thrown in), or 2) figuring out how to group the small changes together, and potentially sending a larger number of total updates.

> The lite optimisation will help 1 & 4 directly. 2 is already fairly optimal, at least when it comes to uploads.

With this premise, I'd argue that 2 is actually non-optimal (and also does not address alignment issues). The thought here is that keeping the CPU-side cache both lets you limit GPU calls and fixes alignment issues with almost no additional overhead other than the RAM usage.

However, I could very well be wrong! I'm basing this all on my intuition formed around some tangential benchmarks. I'm happy to work to prove this intuition out, though, by creating some benchmarks that would be representative of the different use-cases we can hit with different approaches.


Separately, thanks for this recommendation:

> Regarding the alignment issues it might be worth looking at (or maybe working with) https://github.com/hecrj/wgpu_glyph to see how that was addressed there.

This https://github.com/hecrj/wgpu_glyph/blob/master/src/pipeline/cache.rs#L46 has made me realize that the important part of this restriction is actually the buffer size and not the size of the texture region being copied into. This could, in theory, be addressed in draw-cache by having a buffer-alignment option that forces the copied region into alignment while otherwise keeping everything else the same (or else, perhaps less efficiently, by copying the returned region into your own buffer first).
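For what it's worth, the "copy into your own buffer first" route looks something like this sketch on the caller's side (the 256 constant mirrors wgpu's COPY_BYTES_PER_ROW_ALIGNMENT, and the R8 one-byte-per-texel assumption is mine):

```rust
// Pad each row of the returned region so the copy's `bytes_per_row`
// satisfies wgpu's 256-byte alignment. Assumes an R8 cache texture.
const COPY_ALIGN: usize = 256; // wgpu::COPY_BYTES_PER_ROW_ALIGNMENT

fn pad_rows_to_alignment(tex_data: &[u8], width: usize, height: usize) -> (Vec<u8>, usize) {
    let padded_width = (width + COPY_ALIGN - 1) / COPY_ALIGN * COPY_ALIGN;
    let mut padded = vec![0u8; padded_width * height];
    for (row, padded_row) in tex_data
        .chunks_exact(width)
        .zip(padded.chunks_exact_mut(padded_width))
    {
        padded_row[..width].copy_from_slice(row);
    }
    // `padded_width` is the `bytes_per_row` to use for the buffer-to-texture copy.
    (padded, padded_width)
}
```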

@alexheretic (Owner)

I agree that more complex partial update cases require real world benchmarks to guide optimisation.

But doing a single rect update on initial population & full-reorder should be a win in any case. It should also be an "easy" optimisation on top of what already exists.

@AlexCharlton (Author)

Hi Alex, I spent the week running a few benchmarks on this branch and on master (including #121). I wanted to make the scenarios being tested at least somewhat "realistic" but a consequence of this is that I created the tests in an ad-hoc fashion and did not make them readily reproducible. Nonetheless I think the results are illuminating.

Methodology

There were three benchmarks which I'm calling "Resize", "Jitter" and "Large jitter". I tested them on this branch and your master with wgpu handling the rendering. I ran each test with the Vulkan and DirectX 12 backends and instrumented them with Superliminal.

I measured the time that it took to "update the texture cache" - which included calling glyph_cache.queue_glyph for all glyphs then glyph_cache.cache_queued. For each benchmark, I averaged how many glyph updates occurred per frame and how many calls to cache_queued were made (i.e. how many GPU operations). The operation within cache_queued was also timed. Each benchmark was run for 100 frames and the results are the averages across those frames. Draw-cache multithreading was turned off.
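Concretely, the timed region was roughly the following (sketch only; `positioned_glyphs`, `fonts`, and `upload_to_gpu` are stand-ins from my test harness):

```rust
use std::time::Instant;

// "Time to update texture cache" covers queueing every glyph plus the
// cache_queued call; the upload inside the callback was timed separately.
let start = Instant::now();
for (font_id, glyph) in &positioned_glyphs {
    glyph_cache.queue_glyph(*font_id, glyph.clone());
}
glyph_cache
    .cache_queued(&fonts, |rect, tex_data| upload_to_gpu(rect, tex_data))
    .expect("cache_queued");
let texture_cache_update = start.elapsed();
```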

I performed a write_texture operation on every call to cache_queued. This is a workable way of doing things for this branch, and as far as I can tell it's a best-case scenario (though not particularly practical) for the main branch. Further study of https://github.com/hecrj/wgpu_glyph shows that it's actually doing two GPU operations per glyph update: a copy_buffer_to_buffer followed by a copy_buffer_to_texture, the two combined being significantly slower than a single write_texture.

The three benchmarks were as follows:

Resize

This benchmark asks the question: What's the performance when you do something fairly minor, like resize your window, thus forcing a new text layout? In most cases I tested the answer was: performance on the whole is good because after a bit of resizing you wind up with all of the possible glyphs you need in the cache. To make it more interesting, I experimented with a set of glyphs + cache-size such that you could never fit all possible glyphs for all text layouts into the cache.

The result was a body of text, 625 characters, repeated 3 times in three different sizes, in a 512x512 cache. A random sized text boundary was chosen every frame, which caused an average of 29.3 glyph updates per frame.

Jitter

This benchmark asks the question: What's the performance when the cache experiences a reasonably large change? This could be caused by e.g. someone resizing their text or someone pasting some new text into place.

This was simulated with a body of text, 625 characters, in a 256x256 cache. The text had a random jitter between 0.0 and 1.0 applied to its x and y positions, which caused an average of 97.2 glyph updates per frame.
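The jitter itself was nothing more than the following (illustrative sketch; `positioned_glyphs` are the laid-out ab_glyph glyphs from the previous step):

```rust
// Nudge every glyph by a random sub-pixel amount each frame, changing its
// sub-pixel position so the cached rasterizations generally can't be reused.
for (_, glyph) in &mut positioned_glyphs {
    glyph.position.x += rand::random::<f32>(); // uniform in [0.0, 1.0)
    glyph.position.y += rand::random::<f32>();
}
```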

Large jitter

Much like the last, but with ~10x the number of changes.

This was done with a body of text, 625 characters, repeated 9 times in different sizes, in a 1024x1024 cache with jitter applied as before. This caused an average of 1236.1 glyph updates per frame.

Results

| Bench | Branch | Backend | Avg glyphs updated | Time to update texture cache (ms) | Avg texture copies | Time to copy a single texture (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Resize | this | Vulkan | 29.3 | 4.6 | 1.1 | 0.12 |
| Resize | this | DX12 | 29.3 | 4.8 | 1.1 | 0.17 |
| Resize | master | Vulkan | 29.3 | 5.2 | 15.3 | 0.05 |
| Resize | master | DX12 | 29.3 | 5.3 | 15.3 | 0.11 |
| Jitter | this | Vulkan | 97.2 | 7.2 | 3.9 | 0.06 |
| Jitter | this | DX12 | 97.2 | 8.7 | 3.9 | 0.16 |
| Jitter | master | Vulkan | 97.2 | 14.0 | 96.0 | 0.05 |
| Jitter | master | DX12 | 97.2 | 20.0 | 96.0 | 0.11 |
| Large jitter | this | Vulkan | 1236.1 | 70.1 | 2.1 | 0.13 |
| Large jitter | this | DX12 | 1236.1 | 82.1 | 2.1 | 0.21 |
| Large jitter | master | Vulkan | 1236.1 | 128.9 | 713.5 | 0.05 |
| Large jitter | master | DX12 | 1236.1 | 164.7 | 713.5 | 0.10 |

Discussion

The clear takeaway here is that the penalty for performing a GPU operation is large and the additional cost of copying more data is small. Even when ~100x the data is being copied, the operations only took about 2x as long. Clearly this makes doing 10-1000x fewer operations worthwhile. Even when only a small number of glyphs were updated, this branch still outperformed master, but as soon as you have more than a very small set of changes the performance advantage of this approach is fairly evident.

One question that remains is, are these numbers remotely meaningful? In other words, while using a CPU-side cache gives clearly better performance, is it actually worth the trouble? While you aren't likely to run into a scenario where you have to update this number of glyphs frame-after-frame-after-frame, I'd argue that it's entirely possible that you could still hit these situations on occasion (e.g. when resizing your text or pasting a chunk of previously unseen characters) and the penalty for doing so - skipping quite a number of frames - is severe enough that it should be minimized as much as possible.

The other question is, of course, does this mean that this is the right approach for this draw-cache library? And to that I say: I don't know! It's still entirely possible to apply this methodology if you are calling into draw-cache, without having to add it to the library itself, so it's not a matter of necessity. On one hand, I wouldn't be hugely bothered if I had to add a CPU-side cache onto my program. On the other hand, this really feels like it's the core use-case of this library, and the changes required to support this aren't particularly large. In the end, I'd say it comes down to your vision for what you think draw-cache should be doing for its users, @alexheretic. Either way, I'll personally be happy with the functionality it provides :)

@alexheretic (Owner) commented Oct 4, 2020

Thanks for doing the investigation! I think you make a good case for having the draw cache in RAM to reduce writes per frame.

Another little detail is that, with multithreading enabled, uploads can actually be done concurrently with rasterization... But I'm more convinced by your numbers overall.

Also, after thinking about it some more, storing the texture in RAM would be a fairly unsurprising thing for this lib to do. So I don't have anything against it; the current design just didn't need it.

If minimising uploads is the aim, I wonder if we could go further and guarantee that there is at most one upload per frame. So we take the smallest rect that covers all changed glyphs and upload that.

Following on from that, it makes me think a different API would make sense, as the current callback would only ever be executed zero or one times. So instead it could simply return a result that contains the texture updates:

fn cache_queued<F, U>(&mut self, fonts: &[F]) -> Result<TextureUpdate, CacheWriteErr>
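To make that concrete, here's one hypothetical shape it could take; none of this is settled, it's just an illustration of the at-most-one-upload idea:

```rust
/// Hypothetical return value: at most one texture write per cache_queued call.
pub enum TextureUpdate {
    /// Nothing changed this frame.
    None,
    /// The smallest rect covering every changed glyph, plus its pixel data.
    Rect {
        min: [u32; 2], // inclusive, in cache-texture texels
        max: [u32; 2], // exclusive
        data: Vec<u8>, // tightly packed, 1 byte per texel for an R8 cache
    },
}

// Hypothetical call site: zero or one GPU writes per frame.
// match draw_cache.cache_queued(&fonts)? {
//     TextureUpdate::Rect { min, max, data } => upload_to_gpu(min, max, &data),
//     TextureUpdate::None => {}
// }
```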

I was also thinking of optimisations this could allow when we reorder the draw cache. Currently in that case we clear and re-process, which reorders the glyphs tall-to-short. It also means we must rasterize all the glyphs again. With the texture in RAM we could instead move the existing glyph data into its reordered position in the new texture (rasterizing only anything we haven't already got).
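Sketching that reorder idea: with the texture in RAM, relocating a glyph is just a row-by-row copy from its old rect into its newly packed position (illustrative only; it assumes row-major R8 textures of the same width before and after):

```rust
// Move an already-rasterized glyph from its old rect to its reordered rect
// instead of rasterizing it again.
fn move_glyph(
    old_pixels: &[u8],
    new_pixels: &mut [u8],
    tex_width: usize,
    old_min: [usize; 2],
    new_min: [usize; 2],
    size: [usize; 2], // glyph [width, height] in texels
) {
    let [w, h] = size;
    for y in 0..h {
        let src = (old_min[1] + y) * tex_width + old_min[0];
        let dst = (new_min[1] + y) * tex_width + new_min[0];
        new_pixels[dst..dst + w].copy_from_slice(&old_pixels[src..src + w]);
    }
}
```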

@AlexCharlton (Author)

> If minimising uploads is the aim, I wonder if we could go further and guarantee that there is at most one upload per frame. So we take the smallest rect that covers all changed glyphs and upload that.

I think this approach would work quite well, and it makes the API quite attractive too. And yes, given the cost of rasterization, I think the additional opportunities for optimization are pretty interesting as well.

It feels like we've hit on something here, but that something looks like a pretty major change. Unless you're looking for help, I'm happy to let you take the lead here as I'm sure you're better equipped than I to drive this. Either way, I think this branch is far enough away from this vision that I'll close the PR. If you'd like any help moving this forward, just say the word!
