Every 676 frames, Queue::write_buffer takes ~25ms #1242

Closed
Aeledfyr opened this issue Mar 2, 2021 · 28 comments
Labels: area: performance, help required, type: bug
Milestone: Version 0.8

Comments

@Aeledfyr

Aeledfyr commented Mar 2, 2021

Description
Every 676 frames (varies by vsync / workload, but is exact within one set of settings), one Queue::write_buffer or Device::create_buffer call takes ~25ms. The submit call on the next frame also takes longer than usual, generally about 16ms.

See also: gfx-rs/wgpu-rs#363, since it appears to be a similar issue with a previous allocator.

Repro steps
I don't have a minimal example and the code that I am experiencing this in is not public, but here's the general overview for reproduction.

My application allocates most memory at the start, and only rarely creates new buffers. There are about 10 write_buffer calls per frame, each with a reasonably small buffer. A random one of these calls takes 25ms on the spike frame. My application allocates a large amount of memory, which could be causing this: it allocates one or two 128 MiB vertex buffers and sub-allocates within those buffers to reduce vertex buffer swapping. There are not many individual buffers, so that shouldn't be the cause of the issue.

To find the spikes, I have been using tracing and tracing-tracy; this integrates with wgpu's tracing setup, so it can show some more detail.

Expected vs observed behavior
I would expect the frame time to be consistent, with no random or regular spikes. Instead, once every n frames, a single Queue::write_buffer call takes 25ms, and submitting the frame after that one (not the submit immediately after the write, but the second submit after it) takes ~10ms.

The n here varies with whether vsync is enabled and with the general frame time, but is very consistent: the number of frames between spikes is always exactly the same, even though the spike durations may vary. Running a more minimal render and increasing the framerate as much as possible makes this more visible.

Extra materials
These images use the Tracy profiler, watching tracing's output at the trace level for all crates. I added a patch to gpu-alloc to add more tracing information; most of the time is spent in the backend's allocate memory function, which I believe is provided by gfx-hal in this case.

Tracy inspection of a two-frame spike:
[Image: Tracy view of the frame spike]
Comparison to nearby normal frames:
[Image: Tracy view of the frame spike with nearby frames shown for comparison]

Platform
GPU: GTX 970
OS: Manjaro Linux
Backend used: Vulkan
wgpu version: 0.7.0

@Aeledfyr
Author

Aeledfyr commented Mar 2, 2021

I have also tested this on a Linux machine with an AMD GPU with similar results (the allocations take 9ms rather than 25ms), so this is most likely not a driver/GPU issue.

I have also created an example case from the wgpu-rs cube example: https://github.com/Aeledfyr/wgpu-example. The spikes are rarer, but it consistently has spikes about ~20-30s apart on my machine. (I disabled vsync to make the spikes visible; otherwise finding them would be a pain).

The modifications to the example are:
- Allocate a large (128 MiB) buffer marked as VERTEX and COPY_DST.
- In the render function, call queue.write_buffer to overwrite the vertex and index buffers 10 times.

The spikes do not appear to occur if the large allocation is not performed. (A rough sketch of the modification is shown below.)
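For reference, here is a minimal sketch of the kind of modification described above, assuming wgpu 0.7's API; the buffer names, data, and call sites are illustrative rather than copied from the linked example:

```rust
// Sketch only: one large VERTEX | COPY_DST allocation plus repeated
// per-frame write_buffer calls, mirroring the modified cube example.
fn create_large_buffer(device: &wgpu::Device) -> wgpu::Buffer {
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("large vertex buffer"),
        size: 128 * 1024 * 1024, // 128 MiB
        usage: wgpu::BufferUsage::VERTEX | wgpu::BufferUsage::COPY_DST,
        mapped_at_creation: false,
    })
}

fn overwrite_buffers(
    queue: &wgpu::Queue,
    vertex_buf: &wgpu::Buffer,
    index_buf: &wgpu::Buffer,
    vertex_data: &[u8],
    index_data: &[u8],
) {
    // Ten small uploads per frame; each one goes through a transient
    // staging allocation inside wgpu, which is where the spikes show up.
    for _ in 0..10 {
        queue.write_buffer(vertex_buf, 0, vertex_data);
        queue.write_buffer(index_buf, 0, index_data);
    }
}
```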

@kvark added the area: performance, help required, and type: bug labels on Mar 2, 2021
@kvark
Member

kvark commented Mar 2, 2021

Thank you for filing this beautiful issue!
You found that allocate_memory is the problem. The good news is that we don't expect this to be called at all in a use case where uploads are done regularly, unless the amount of uploads exceeds some threshold. So we'll need to debug (gpu-alloc in particular) and see exactly why we end up allocating new memory in these spikes.

@John-Nagle

I may have this problem, but I am not sure yet. I'm getting brief stalls, as long as 160ms, from a Rust program atop Rend3 atop wgpu. One thread runs the refresh loop and does little else. Another thread is loading content, allocating GPU memory, and adding textures, materials, and objects via Rend3. On a complex scene, the normal frame rate is around 200 FPS, but every 1-2 seconds there's a stutter, with one frame taking far too long. This only happens during content loading; once all content is loaded, there is no more stuttering. Loading larger vertex buffers (64K vertices) seems to make it worse.

So the symptom is the same, but I have not done any profiling to confirm the cause.

(6 CPUs, 12 hardware threads, AMD Ryzen 5, NVidia 3070 8GB, 32 GB of RAM, Ubuntu 20.04 LTS)

@VincentJousse
Contributor

I can also reproduce this with the Vulkan and WebGL backends.

@VincentJousse
Contributor

So we now have only two allocate_memory calls, but the second one is still annoying. It is due to the current LinearAllocator algorithm.
I suggested allocating two chunks directly, but that was rejected by @zakarumych.
@kvark, do you have suggestions?

@zakarumych

I left that issue open so I don't forget to think about it.
Maybe treat one memory object as a chunk pair and reuse the first half, if it's free, when the second one is exhausted.

@VincentJousse
Contributor

Do you think it would be too complex to keep track of deallocated regions inside a chunk in order to reuse them directly?

@zakarumych

Maybe I could keep a sorted list of free regions, find a suitable region and cut it on allocation, and merge regions on deallocation.
It would fragment easily, but the user promises to deallocate all blocks shortly, so fragmentation should not be an issue.
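For illustration, here is a minimal sketch of that idea (not gpu-alloc's actual implementation, and all names and sizes are hypothetical): a sorted list of free regions that is cut on allocation and merged with its neighbours on deallocation.

```rust
struct Region {
    offset: u64,
    size: u64,
}

struct FreeList {
    // Sorted by offset; touching regions are merged on deallocation.
    free: Vec<Region>,
}

impl FreeList {
    fn new(size: u64) -> Self {
        Self { free: vec![Region { offset: 0, size }] }
    }

    /// First-fit allocation: cut the requested size off the front of the
    /// first free region that is large enough.
    fn alloc(&mut self, size: u64) -> Option<u64> {
        let idx = self.free.iter().position(|r| r.size >= size)?;
        let offset = self.free[idx].offset;
        if self.free[idx].size == size {
            self.free.remove(idx);
        } else {
            self.free[idx].offset += size;
            self.free[idx].size -= size;
        }
        Some(offset)
    }

    /// Return a region to the list, merging with neighbours when they touch.
    fn dealloc(&mut self, offset: u64, size: u64) {
        let idx = self.free.partition_point(|r| r.offset < offset);
        self.free.insert(idx, Region { offset, size });
        // Merge with the next region if contiguous.
        if idx + 1 < self.free.len()
            && self.free[idx].offset + self.free[idx].size == self.free[idx + 1].offset
        {
            self.free[idx].size += self.free[idx + 1].size;
            self.free.remove(idx + 1);
        }
        // Merge with the previous region if contiguous.
        if idx > 0 && self.free[idx - 1].offset + self.free[idx - 1].size == self.free[idx].offset {
            self.free[idx - 1].size += self.free[idx].size;
            self.free.remove(idx);
        }
    }
}

fn main() {
    let mut list = FreeList::new(1024);
    let a = list.alloc(256).unwrap();
    let b = list.alloc(256).unwrap();
    list.dealloc(a, 256);
    list.dealloc(b, 256); // merges back into a single 1024-byte region
    assert_eq!(list.free.len(), 1);
}
```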

@VincentJousse
Contributor

VincentJousse commented Mar 30, 2021

the user promises to deallocate all blocks shortly

What do you mean?
Is it a requirement?

@zakarumych

zakarumych commented Mar 30, 2021

There's a gpu_alloc::UsageFlags::TRANSIENT flag. It can be set as a hint that an allocation is short-lived.
wgpu uses it for a particular type of allocation, for example the staging buffers for uploads, which is exactly the case for Queue::write_buffer.
gpu-alloc uses the LinearAllocator only if the allocation request contains this flag.
For long-lived allocations, another allocator is used, which avoids fragmentation and can reuse individual allocated blocks, but has a bit of memory overhead.
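Roughly, such a transient request looks like this on the gpu-alloc side; this is an illustration based on gpu-alloc 0.4's documented Request type, not wgpu's internal code, and the size and alignment values are arbitrary:

```rust
use gpu_alloc::{Request, UsageFlags};

// Illustrative: a staging-style allocation request. The TRANSIENT flag is
// the hint that routes the request to the linear/free-list allocator
// instead of the general-purpose one.
fn staging_request(size: u64) -> Request {
    Request {
        size,
        align_mask: 0xFF,                                  // 256-byte alignment, arbitrary here
        usage: UsageFlags::UPLOAD | UsageFlags::TRANSIENT, // short-lived upload memory
        memory_types: !0,                                  // don't restrict memory types
    }
}
```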

@VincentJousse
Contributor

Good, so there's no need to track freed regions; your proposal of splitting into a chunk pair seems great!

@John-Nagle

As a user of Rend3, I'm one level up from this problem, but I definitely see it, and it has an impact. I'm writing a viewer for a virtual world, and load gigabytes of content into the GPU. One thread is just a refresh loop. Another thread is loading content. Frame rates look like this:

00217 frames over 01.00s. Min: 03.16ms; Average: 04.63ms; 95%: 06.15ms; 99%: 09.88ms; Max: 105.19ms; StdDev: 06.91ms
00189 frames over 01.00s. Min: 03.42ms; Average: 05.30ms; 95%: 06.06ms; 99%: 105.38ms; Max: 105.38ms; StdDev: 10.41ms
00231 frames over 01.00s. Min: 03.38ms; Average: 04.33ms; 95%: 05.06ms; 99%: 06.32ms; Max: 108.59ms; StdDev: 06.90ms
00152 frames over 01.01s. Min: 03.41ms; Average: 06.66ms; 95%: 18.86ms; 99%: 119.01ms; Max: 119.01ms; StdDev: 12.81ms
00135 frames over 01.00s. Min: 03.39ms; Average: 07.43ms; 95%: 18.30ms; 99%: 107.28ms; Max: 107.28ms; StdDev: 10.04ms
00235 frames over 01.00s. Min: 03.38ms; Average: 04.27ms; 95%: 04.94ms; 99%: 07.67ms; Max: 107.22ms; StdDev: 06.76ms
00187 frames over 01.00s. Min: 03.04ms; Average: 05.36ms; 95%: 06.80ms; 99%: 111.21ms; Max: 111.21ms; StdDev: 10.92ms
00226 frames over 01.00s. Min: 03.33ms; Average: 04.43ms; 95%: 05.55ms; 99%: 11.14ms; Max: 108.14ms; StdDev: 06.97ms
00235 frames over 01.00s. Min: 03.35ms; Average: 04.27ms; 95%: 05.42ms; 99%: 07.64ms; Max: 110.49ms; StdDev: 06.97ms
00225 frames over 01.00s. Min: 03.43ms; Average: 04.46ms; 95%: 05.37ms; 99%: 12.77ms; Max: 111.15ms; StdDev: 07.19ms
00195 frames over 01.00s. Min: 03.18ms; Average: 05.14ms; 95%: 05.54ms; 99%: 113.05ms; Max: 113.05ms; StdDev: 10.73ms
00217 frames over 01.00s. Min: 03.29ms; Average: 04.62ms; 95%: 05.21ms; 99%: 17.83ms; Max: 104.99ms; StdDev: 06.91ms

Notice the stalls. Average around 5ms, 95% of frames around 5ms, max around 100ms! Those huge stalls are a big drag on the user experience.

(Plenty of CPU time available; 6 cores and under 25% total CPU utilization.)

@zakarumych

zakarumych commented Mar 30, 2021

I added an experimental allocation strategy that can be enabled with the "freelist" feature on version 0.4.2.
Currently it replaces the LinearAllocator with a FreeListAllocator, which can reuse individual memory regions and merge them.
Without adding anything to the config, it will keep at least 2*linear_chunk of memory preallocated.
If memory consumption is low, only one chunk of size linear_chunk will be allocated.

@zakarumych

Regarding loading gigabytes of data to the GPU: allocator configuration is required to keep more memory preallocated, or some sophisticated guessing about what memory could be required again soon.
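As a hypothetical example of that kind of configuration, assuming gpu-alloc 0.4's Config type: linear_chunk is the field mentioned above, and Config::i_am_prototyping() is assumed here to be the crate's permissive preset; treat the values as illustrative, not recommended.

```rust
use gpu_alloc::Config;

// Hypothetical tuning sketch: keep larger transient chunks preallocated so
// that steady uploads hit the backend allocate_memory call less often,
// trading extra resident memory for fewer hitches.
fn tuned_config() -> Config {
    Config {
        linear_chunk: 256 * 1024 * 1024, // 256 MiB per transient chunk (illustrative)
        ..Config::i_am_prototyping()
    }
}
```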

@VincentJousse
Contributor

@kvark will wgpu use this freelist feature?

@kvark
Member

kvark commented Mar 31, 2021

I'm still expecting this to be fully abstracted away by gpu-alloc. If we were to start manually keeping memory chunks, we'd basically be re-implementing gpu-alloc internally.

@VincentJousse
Contributor

If I understand what @zakarumych said correctly, it's just a feature to enable, nothing more.

@kvark
Member

kvark commented Mar 31, 2021

Oh, ok. We'd use whatever gpu-alloc provides, of course.

@zakarumych

zakarumych commented Mar 31, 2021

@VincentFTS as this is a feature, you can enable it in your crate without changes in wgpu. Just add gpu-alloc to your dependencies with the feature enabled.
Once we confirm that the FreeListAllocator works fine, I'll make it on by default and add fields to the config to control when it should be used.
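Concretely, that would look something like this in a downstream crate's Cargo.toml; the version numbers follow the ones discussed in this thread, and since Cargo unifies features for a given crate version, enabling the feature here also applies to wgpu's internal use of gpu-alloc:

```toml
[dependencies]
wgpu = "0.7"
# Pull in gpu-alloc directly just to switch on its experimental feature;
# the version must match the one wgpu resolves to.
gpu-alloc = { version = "0.4.2", features = ["freelist"] }
```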

@VincentJousse
Contributor

@zakarumych sorry for the late answer…
I tried to activate the freelist feature, and I get a crash:
thread 'main' panicked at 'attempt to subtract with overflow' in gpu-alloc/src/freelist.rs:140:29

@zakarumych

Try using the latest commit from git and see if it helps.

@VincentJousse
Contributor

I'm already using it.

@kvark added this to the Version 0.8 milestone on Apr 26, 2021
@kvark
Member

kvark commented Apr 26, 2021

Marked this for the 0.8 release. If the changes are in gpu-alloc, they'd naturally be picked up, because we'll require gpu-alloc to be published.

@VincentJousse
Contributor

@zakarumych here is my proposal: zakarumych/gpu-alloc#44

@VincentJousse
Contributor

@kvark is it OK for you to update the gpu-alloc dependency and activate the freelist feature?

@zakarumych

The feature is active by default now.

@kvark
Member

kvark commented Apr 28, 2021

Thanks to both of you!
I'm a bit worried about making a wgpu release now, given that issues may be found with this new code, but hopefully we'll be able to patch it.

@kvark
Member

kvark commented Apr 29, 2021

We just updated to gpu-alloc 0.4.4.

@kvark closed this as completed on Apr 29, 2021