Every 676 frames, Queue::write_buffer takes ~25ms #1242

Closed
Aeledfyr opened this issue Mar 2, 2021 · 28 comments
Labels: area: performance, help required, type: bug
Milestone: Version 0.8

Comments

@Aeledfyr

Aeledfyr commented Mar 2, 2021

Description
Every 676 frames (varies by vsync / workload, but is exact within one set of settings), one Queue::write_buffer or Device::create_buffer call takes ~25ms. The submit call on the next frame also takes longer than usual, generally about 16ms.

See also: gfx-rs/wgpu-rs#363, since it appears to be a similar issue with a previous allocator.

Repro steps
I don't have a minimal example and the code that I am experiencing this in is not public, but here's the general overview for reproduction.

My application allocates most memory at the start, and only rarely creates new buffers. There are about 10 write_buffer calls per frame, each with a reasonably small buffer. A random one of these calls takes 25ms on the spike frame. My application allocates a large amount of memory, which could be causing this: it allocates one or two 128 MiB vertex buffers and sub-allocates within those buffers to reduce vertex buffer swapping. There are not many individual buffers, so that shouldn't be the cause of the issue.

To find the spikes, I have been using tracing and tracing-tracy; this integrates with wgpu's tracing setup, so it can show some more detail.

Expected vs observed behavior
I would expect the frame time to be consistent, with no random or regular spikes. Instead, once every n frames, a single Queue::write_buffer call takes 25ms, and submitting the frame after that one (not the submit immediately after the write, but the second submit after it) takes ~10ms.

The n here varies with whether vsync is enabled and with the general frame time, but is very consistent: the number of frames between spikes is always exactly the same, even though the spike durations may vary. Running a more minimal render and increasing the framerate as much as possible makes this more visible.

Extra materials
These images use the Tracy profiler, watching tracing's output at the trace level for all crates. I added a patch to gpu-alloc to add more tracing information; most of the time is spent in the backend's allocate memory function, which I believe is provided by gfx-hal in this case.

Tracy inspection of a two-frame spike:
[Image: Tracy view of the frame spike]
Comparison to nearby normal frames:
[Image: Tracy view of the frame spike with nearby frames shown for comparison]

Platform
GPU: GTX 970
OS: Manjaro Linux
Backend used: Vulkan
wgpu version: 0.7.0

@Aeledfyr
Author

Aeledfyr commented Mar 2, 2021

I have also tested this on a Linux machine with an AMD GPU with similar results (the allocations take 9ms rather than 25ms), so this is most likely not a driver/GPU issue.

I have also created an example case from the wgpu-rs cube example: https://github.com/Aeledfyr/wgpu-example. The spikes are rarer, but it consistently has spikes about ~20-30s apart on my machine. (I disabled vsync to make the spikes visible; otherwise finding them would be a pain).

The modifications to the example are:
- Allocate a large (128 MiB) buffer marked as VERTEX and COPY_DST.
- In the render function, call queue.write_buffer to overwrite the vertex and index buffers 10 times.

The spikes do not appear to occur if the large allocation is not performed. (A rough sketch of the modification is shown below.)
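For reference, here is a minimal sketch of the kind of modification described above, assuming wgpu 0.7's API; the buffer names, data, and call sites are illustrative rather than copied from the linked example:

```rust
// Sketch only: one large VERTEX | COPY_DST allocation plus repeated
// per-frame write_buffer calls, mirroring the modified cube example.
fn create_large_buffer(device: &wgpu::Device) -> wgpu::Buffer {
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("large vertex buffer"),
        size: 128 * 1024 * 1024, // 128 MiB
        usage: wgpu::BufferUsage::VERTEX | wgpu::BufferUsage::COPY_DST,
        mapped_at_creation: false,
    })
}

fn overwrite_buffers(
    queue: &wgpu::Queue,
    vertex_buf: &wgpu::Buffer,
    index_buf: &wgpu::Buffer,
    vertex_data: &[u8],
    index_data: &[u8],
) {
    // Ten small uploads per frame; each one goes through a transient
    // staging allocation inside wgpu, which is where the spikes show up.
    for _ in 0..10 {
        queue.write_buffer(vertex_buf, 0, vertex_data);
        queue.write_buffer(index_buf, 0, index_data);
    }
}
```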

@kvark added the area: performance, help required, and type: bug labels on Mar 2, 2021
@kvark
Member

kvark commented Mar 2, 2021

Thank you for filing this beautiful issue!
You found that allocate_memory is the problem. The good news is that we don't expect this to be called at all in a use case where uploads are done regularly, unless the amount of uploads exceeds some threshold. So we'll need to debug (gpu-alloc in particular) and see exactly why we end up allocating new memory in these spikes.

@John-Nagle

I may have this problem, but I am not sure yet. I'm getting brief stalls, as long as 160ms, from a Rust program atop Rend3 atop wgpu. One thread runs the refresh loop and does little else. Another thread is loading content, allocating GPU memory, and adding textures, materials, and objects via Rend3. On a complex scene, the normal frame rate is around 200 FPS, but every 1-2 seconds there's a stutter, with one frame taking far too long. This only happens during content loading; once all content is loaded, there is no more stuttering. Loading larger vertex buffers (64K vertices) seems to make it worse.

So the symptom is the same, but I have not done any profiling to confirm the cause.

(6 CPUs, 12 hardware threads, AMD Ryzen 5, NVidia 3070 8GB, 32 GB of RAM, Ubuntu 20.04 LTS)

@VincentJousse
Contributor

I can also reproduce this with the Vulkan and WebGL backends.

@VincentJousse
Contributor

So we now have only two allocate_memory calls, but the second one is still annoying. It is due to the current LinearAllocator algorithm.
I suggested allocating two chunks directly, but that was rejected by @zakarumych.
@kvark, do you have suggestions?

@zakarumych

I left that issue open so I don't forget to think about it.
Maybe treat one memory object as a chunk pair and reuse the first half, if it's free, when the second one is exhausted.

@VincentJousse
Contributor

Do you think it would be too complex to keep track of deallocated regions inside a chunk in order to reuse them directly?

@zakarumych

Maybe I could keep a sorted list of free regions, find a suitable region and cut it on allocation, and merge regions on deallocation.
It would fragment easily, but the user promises to deallocate all blocks shortly, so fragmentation should not be an issue.
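For illustration, here is a minimal sketch of that idea (not gpu-alloc's actual implementation, and all names and sizes are hypothetical): a sorted list of free regions that is cut on allocation and merged with its neighbours on deallocation.

```rust
struct Region {
    offset: u64,
    size: u64,
}

struct FreeList {
    // Sorted by offset; touching regions are merged on deallocation.
    free: Vec<Region>,
}

impl FreeList {
    fn new(size: u64) -> Self {
        Self { free: vec![Region { offset: 0, size }] }
    }

    /// First-fit allocation: cut the requested size off the front of the
    /// first free region that is large enough.
    fn alloc(&mut self, size: u64) -> Option<u64> {
        let idx = self.free.iter().position(|r| r.size >= size)?;
        let offset = self.free[idx].offset;
        if self.free[idx].size == size {
            self.free.remove(idx);
        } else {
            self.free[idx].offset += size;
            self.free[idx].size -= size;
        }
        Some(offset)
    }

    /// Return a region to the list, merging with neighbours when they touch.
    fn dealloc(&mut self, offset: u64, size: u64) {
        let idx = self.free.partition_point(|r| r.offset < offset);
        self.free.insert(idx, Region { offset, size });
        // Merge with the next region if contiguous.
        if idx + 1 < self.free.len()
            && self.free[idx].offset + self.free[idx].size == self.free[idx + 1].offset
        {
            self.free[idx].size += self.free[idx + 1].size;
            self.free.remove(idx + 1);
        }
        // Merge with the previous region if contiguous.
        if idx > 0 && self.free[idx - 1].offset + self.free[idx - 1].size == self.free[idx].offset {
            self.free[idx - 1].size += self.free[idx].size;
            self.free.remove(idx);
        }
    }
}

fn main() {
    let mut list = FreeList::new(1024);
    let a = list.alloc(256).unwrap();
    let b = list.alloc(256).unwrap();
    list.dealloc(a, 256);
    list.dealloc(b, 256); // merges back into a single 1024-byte region
    assert_eq!(list.free.len(), 1);
}
```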

@VincentJousse
Contributor

VincentJousse commented Mar 30, 2021

the user promises to deallocate all blocks shortly

What do you mean?
Is it a requirement?

@zakarumych

zakarumych commented Mar 30, 2021

There's a gpu_alloc::UsageFlags::TRANSIENT flag. It can be set as a hint that an allocation is short-lived.
wgpu uses it for a particular type of allocation, for example the staging buffers for uploads, which is exactly the case for Queue::write_buffer.
gpu-alloc uses the LinearAllocator only if the allocation request contains this flag.
For long-lived allocations, another allocator is used, which avoids fragmentation and can reuse individual allocated blocks, but has a bit of memory overhead.
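Roughly, such a transient request looks like this on the gpu-alloc side; this is an illustration based on gpu-alloc 0.4's documented Request type, not wgpu's internal code, and the size and alignment values are arbitrary:

```rust
use gpu_alloc::{Request, UsageFlags};

// Illustrative: a staging-style allocation request. The TRANSIENT flag is
// the hint that routes the request to the linear/free-list allocator
// instead of the general-purpose one.
fn staging_request(size: u64) -> Request {
    Request {
        size,
        align_mask: 0xFF,                                  // 256-byte alignment, arbitrary here
        usage: UsageFlags::UPLOAD | UsageFlags::TRANSIENT, // short-lived upload memory
        memory_types: !0,                                  // don't restrict memory types
    }
}
```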

@VincentJousse
Contributor

Good, so there's no need to track freed regions; your proposal of splitting into a chunk pair seems great!

@John-Nagle

As a user of Rend3, I'm one level up from this problem, but I definitely see it, and it has an impact. I'm writing a viewer for a virtual world, and load gigabytes of content into the GPU. One thread is just a refresh loop. Another thread is loading content. Frame rates look like this:

00217 frames over 01.00s. Min: 03.16ms; Average: 04.63ms; 95%: 06.15ms; 99%: 09.88ms; Max: 105.19ms; StdDev: 06.91ms
00189 frames over 01.00s. Min: 03.42ms; Average: 05.30ms; 95%: 06.06ms; 99%: 105.38ms; Max: 105.38ms; StdDev: 10.41ms
00231 frames over 01.00s. Min: 03.38ms; Average: 04.33ms; 95%: 05.06ms; 99%: 06.32ms; Max: 108.59ms; StdDev: 06.90ms
00152 frames over 01.01s. Min: 03.41ms; Average: 06.66ms; 95%: 18.86ms; 99%: 119.01ms; Max: 119.01ms; StdDev: 12.81ms
00135 frames over 01.00s. Min: 03.39ms; Average: 07.43ms; 95%: 18.30ms; 99%: 107.28ms; Max: 107.28ms; StdDev: 10.04ms
00235 frames over 01.00s. Min: 03.38ms; Average: 04.27ms; 95%: 04.94ms; 99%: 07.67ms; Max: 107.22ms; StdDev: 06.76ms
00187 frames over 01.00s. Min: 03.04ms; Average: 05.36ms; 95%: 06.80ms; 99%: 111.21ms; Max: 111.21ms; StdDev: 10.92ms
00226 frames over 01.00s. Min: 03.33ms; Average: 04.43ms; 95%: 05.55ms; 99%: 11.14ms; Max: 108.14ms; StdDev: 06.97ms
00235 frames over 01.00s. Min: 03.35ms; Average: 04.27ms; 95%: 05.42ms; 99%: 07.64ms; Max: 110.49ms; StdDev: 06.97ms
00225 frames over 01.00s. Min: 03.43ms; Average: 04.46ms; 95%: 05.37ms; 99%: 12.77ms; Max: 111.15ms; StdDev: 07.19ms
00195 frames over 01.00s. Min: 03.18ms; Average: 05.14ms; 95%: 05.54ms; 99%: 113.05ms; Max: 113.05ms; StdDev: 10.73ms
00217 frames over 01.00s. Min: 03.29ms; Average: 04.62ms; 95%: 05.21ms; 99%: 17.83ms; Max: 104.99ms; StdDev: 06.91ms

Notice the stalls. Average around 5ms, 95% of frames around 5ms, max around 100ms! Those huge stalls are a big drag on the user experience.

(Plenty of CPU time available; 6 cores and under 25% total CPU utilization.)

@zakarumych

zakarumych commented Mar 30, 2021

I added an experimental allocation strategy that can be enabled with the "freelist" feature on version 0.4.2.
Currently it replaces the LinearAllocator with a FreeListAllocator, which can reuse individual memory regions and merge them.
Without adding anything to the config, it will keep at least 2*linear_chunk of memory preallocated.
If memory consumption is low, only one chunk of size linear_chunk will be allocated.

@zakarumych

Regarding loading gigabytes of data to the GPU: allocator configuration is required to keep more memory preallocated, or some sophisticated guessing about what memory could be required again soon.
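As a hypothetical example of that kind of configuration, assuming gpu-alloc 0.4's Config type: linear_chunk is the field mentioned above, and Config::i_am_prototyping() is assumed here to be the crate's permissive preset; treat the values as illustrative, not recommended.

```rust
use gpu_alloc::Config;

// Hypothetical tuning sketch: keep larger transient chunks preallocated so
// that steady uploads hit the backend allocate_memory call less often,
// trading extra resident memory for fewer hitches.
fn tuned_config() -> Config {
    Config {
        linear_chunk: 256 * 1024 * 1024, // 256 MiB per transient chunk (illustrative)
        ..Config::i_am_prototyping()
    }
}
```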

@VincentJousse
Contributor

@kvark will wgpu use this freelist feature?

@kvark
Member

kvark commented Mar 31, 2021

I'm still expecting this to be fully abstracted away by gpu-alloc. If we were to start manually keeping memory chunks, we'd basically be re-implementing gpu-alloc internally.

@VincentJousse
Contributor

If I understand what @zakarumych said correctly, it's just a feature to enable, nothing more.

@kvark
Member

kvark commented Mar 31, 2021

Oh, ok. We'd use whatever gpu-alloc provides, of course.

@zakarumych

zakarumych commented Mar 31, 2021

@VincentFTS as this is a feature, you can enable it in your crate without changes in wgpu. Just add gpu-alloc to your dependencies with the feature enabled.
Once we confirm that the FreeListAllocator works fine, I'll make it on by default and add fields to the config to control when it should be used.
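Concretely, that would look something like this in a downstream crate's Cargo.toml; the version numbers follow the ones discussed in this thread, and since Cargo unifies features for a given crate version, enabling the feature here also applies to wgpu's internal use of gpu-alloc:

```toml
[dependencies]
wgpu = "0.7"
# Pull in gpu-alloc directly just to switch on its experimental feature;
# the version must match the one wgpu resolves to.
gpu-alloc = { version = "0.4.2", features = ["freelist"] }
```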

@VincentJousse
Contributor

@zakarumych sorry for the late answer…
I tried to activate the freelist feature, and I get a crash:
thread 'main' panicked at 'attempt to subtract with overflow' in gpu-alloc/src/freelist.rs:140:29

@zakarumych

Try using the latest commit from git and see if it helps.

@VincentJousse
Contributor

I'm already using it.

@kvark added this to the Version 0.8 milestone on Apr 26, 2021
@kvark
Member

kvark commented Apr 26, 2021

Marked this for the 0.8 release. If the changes are in gpu-alloc, they'd naturally be picked up, because we'll require gpu-alloc to be published.

@VincentJousse
Contributor

@zakarumych here is my proposal: zakarumych/gpu-alloc#44

@VincentJousse
Contributor

@kvark is it OK for you to update the gpu-alloc dependency and activate the freelist feature?

@zakarumych

The feature is active by default now.

@kvark
Member

kvark commented Apr 28, 2021

Thanks to both of you!
I'm a bit worried about making a wgpu release now, given that issues may be found with this new code, but hopefully we'll be able to patch it.

@kvark
Member

kvark commented Apr 29, 2021

We just updated to gpu-alloc 0.4.4.

@kvark closed this as completed on Apr 29, 2021