Every 676 frames, Queue::write_buffer takes ~25ms #1242
Comments
I have also tested this on a Linux machine with an AMD GPU with similar results (the allocations take 9ms rather than 25ms), so this is most likely not a driver/GPU issue. I have also created an example case from one of the wgpu examples; the modifications to the example are: […]
Thank you for filing this beautiful issue!
I may have this problem, but am not sure yet. I'm getting brief stalls, as long as 160ms, from a Rust program atop Rend3 atop wgpu. One thread is running the refresh loop, which does little else. Another thread is loading content, allocating GPU memory, and adding textures, materials, and objects via Rend3. All this is in Rust. On a complex scene, the normal frame rate is around 200 FPS, but every 1-2 seconds there's a stutter, with one frame taking far too long. This only happens during content loading; once all content is loaded, there is no more stuttering. Loading larger vertex buffers (64K vertices) seems to make it worse. So the symptom is the same, but I have not done any profiling to confirm the cause. (6 CPUs, 12 hardware threads, AMD Ryzen 5, NVIDIA 3070 8GB, 32 GB of RAM, Ubuntu 20.04 LTS)
I can also reproduce this with the Vulkan and WebGL backends.
So we now have only two `allocate_memory` calls, but the second is still annoying. It is due to the current linear allocator algorithm.
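For context, a linear allocator is essentially a bump allocator per chunk: freed bytes are only reclaimed once the whole chunk drains, so steady allocation churn eventually exhausts the chunk and forces a fresh, slow `allocate_memory` call. A minimal sketch of the idea (hypothetical; not gpu-alloc's actual code):

```rust
// Hypothetical linear (bump) chunk allocator. The cursor only moves
// forward; freed bytes are reclaimed only when the whole chunk is
// empty, so continuous churn eventually forces a brand-new chunk
// (the slow driver-level allocate_memory call seen in the traces).
struct LinearChunk {
    capacity: u64,
    cursor: u64, // next free offset
    live: u64,   // bytes currently in use
}

impl LinearChunk {
    fn allocate(&mut self, size: u64) -> Option<u64> {
        if self.cursor + size > self.capacity {
            return None; // exhausted: caller must create a new chunk
        }
        let offset = self.cursor;
        self.cursor += size;
        self.live += size;
        Some(offset)
    }

    fn deallocate(&mut self, size: u64) {
        self.live -= size;
        if self.live == 0 {
            self.cursor = 0; // only a fully drained chunk is reusable
        }
    }
}
```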
I left that issue open so as not to forget to think about it.
Do you think it would be too complex to keep track of deallocated regions inside a chunk so they can be reused directly?
Maybe I could keep a sorted list of free regions, find a suitable region and cut it on allocation, and merge neighbors on deallocation.
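A minimal sketch of that scheme, assuming nothing about gpu-alloc's internals (the type and method names here are hypothetical): free regions are kept sorted by offset, allocation cuts the first region that fits, and deallocation merges contiguous neighbors.

```rust
// Hypothetical free-list suballocator: regions are (offset, size),
// kept sorted by offset so neighbors can be merged on deallocation.
struct FreeList {
    free: Vec<(u64, u64)>, // (offset, size), sorted by offset
}

impl FreeList {
    fn new(capacity: u64) -> Self {
        Self { free: vec![(0, capacity)] }
    }

    /// First-fit allocation: find a region large enough and cut it.
    fn allocate(&mut self, size: u64) -> Option<u64> {
        let i = self.free.iter().position(|&(_, s)| s >= size)?;
        let (offset, region_size) = self.free[i];
        if region_size == size {
            self.free.remove(i); // exact fit: consume the whole region
        } else {
            self.free[i] = (offset + size, region_size - size); // cut the front
        }
        Some(offset)
    }

    /// Insert the freed region, merging with adjacent free regions.
    fn deallocate(&mut self, offset: u64, size: u64) {
        let i = self.free.partition_point(|&(o, _)| o < offset);
        self.free.insert(i, (offset, size));
        // Merge with the next region if contiguous.
        if i + 1 < self.free.len() && self.free[i].0 + self.free[i].1 == self.free[i + 1].0 {
            self.free[i].1 += self.free[i + 1].1;
            self.free.remove(i + 1);
        }
        // Merge with the previous region if contiguous.
        if i > 0 && self.free[i - 1].0 + self.free[i - 1].1 == self.free[i].0 {
            self.free[i - 1].1 += self.free[i].1;
            self.free.remove(i);
        }
    }
}
```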
What do you mean?
There's […]
Good, so no need to track freed regions; your proposal of splitting into a chunk pair seems great!
As a user of Rend3, I'm one level up from this problem, but I definitely see it, and it has an impact. I'm writing a viewer for a virtual world, and load gigabytes of content into the GPU. One thread is just a refresh loop. Another thread is loading content. Frame rates look like this:

[frame-time data omitted]
Notice the stalls: average around 5ms, 95% of frames around 5ms, max around 100ms! Those huge stalls are a big drag on the user experience. (Plenty of CPU time is available: 6 cores and under 25% total CPU utilization.)
I added an experimental allocation strategy that can be enabled with the "freelist" feature in version 0.4.2.
Regarding loading gigabytes of data to the GPU: allocator configuration is required to keep more memory preallocated, or some sophisticated guessing about what memory could be required again soon.
@kvark will wgpu use this freelist feature?
I'm still expecting this to be fully abstracted away by gpu-alloc.
If I understand what @zakarumych said correctly, it's just a feature to enable, nothing more.
Oh, OK. We'd use whatever gpu-alloc provides, of course.
@VincentFTS as this is a feature, you can enable it in your crate, without changes in wgpu.
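For example, in the downstream crate's Cargo.toml (assuming a direct dependency on gpu-alloc; the version and feature name are the ones mentioned above):

```toml
[dependencies]
gpu-alloc = { version = "0.4.2", features = ["freelist"] }
```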
@zakarumych sorry for this late answer…
Try using the latest commit from git and see if it helps.
I already use it.
Marked this for the 0.8 release. If the changes are in gpu-alloc, they'd naturally be picked up because we'll require gpu-alloc to be published.
@zakarumych here is my proposal: zakarumych/gpu-alloc#44
@kvark is it OK for you to update the gpu-alloc dependency and activate the freelist feature?
The feature is active by default now.
Thanks to both of you!
We just updated to use gpu-alloc 0.4.4.
Description
Every 676 frames (varies by vsync/workload, but is exact within one set of settings), one `Queue::write_buffer` or `Device::create_buffer` call takes ~25ms. The submit call on the next frame also takes longer than usual, generally about 16ms.

See also: gfx-rs/wgpu-rs#363, since it appears to be a similar issue with a previous allocator.
Repro steps
I don't have a minimal example and the code that I am experiencing this in is not public, but here's the general overview for reproduction.
My application allocates most memory at the start, and only rarely creates new buffers. There are about 10 `write_buffer` calls per frame, with reasonably small buffers for each. A random one of these calls takes 25ms on the spike frame. My application allocates a large amount of memory, which could be causing this. It allocates 1 or 2 128MiB vertex buffers, and sub-allocates within those buffers to reduce vertex buffer swapping. There are not many individual buffers, so that shouldn't be the cause of the issue.
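For illustration, the per-frame update pattern described above looks roughly like this (a sketch with hypothetical names; only `Queue::write_buffer` is real wgpu API):

```rust
// Sketch of the repro's per-frame work: ~10 small writes into
// long-lived, pre-created buffers. On the spike frame, a random one
// of these write_buffer calls blocks for ~25ms while the staging
// allocator asks the driver for a new memory block.
fn write_per_frame_data(
    queue: &wgpu::Queue,
    buffers: &[wgpu::Buffer], // created once at startup
    frame_data: &[Vec<u8>],   // small CPU-side payloads, one per buffer
) {
    for (buffer, data) in buffers.iter().zip(frame_data) {
        queue.write_buffer(buffer, 0, data);
    }
}
```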
To find the spikes, I have been using `tracing` and `tracing-tracy`; this integrates with `wgpu`'s tracing setup, so it can show some more detail.
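For reference, a minimal version of that profiling setup might look like this (a sketch; the exact `tracing-tracy` API differs between versions):

```rust
// Hypothetical setup: route `tracing` spans into the Tracy profiler.
use tracing_subscriber::layer::SubscriberExt;

fn init_profiling() {
    let subscriber = tracing_subscriber::registry()
        .with(tracing_tracy::TracyLayer::new());
    tracing::subscriber::set_global_default(subscriber)
        .expect("failed to install tracing subscriber");
    // wgpu's internal spans (and any spans the app emits) now show up
    // on the Tracy timeline.
}
```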
Expected vs observed behavior

I would expect the frame time to be consistent, with no random or regular spikes. Instead, once every `n` frames, a single `Queue::write_buffer` call takes 25ms, and submitting the next frame after that frame (not the submit after the write, but the second submit after the write) takes ~10ms.

The `n` here varies based on whether vsync is enabled and on the general frame time, but is very consistent: the number of frames between spikes is exactly the same, while the spike duration may vary. Running a more minimal render and increasing the framerate as much as possible makes this more visible.
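One way to read that exact periodicity, assuming a fixed-size allocator chunk behind the staging writes: if each frame sub-allocates roughly b bytes from chunks of size C, a fresh chunk (and therefore a slow allocate_memory call) is needed about every C / b frames, which is exact for a fixed workload even though the spike duration varies. With purely illustrative numbers, a 2 MiB chunk drained at ~3 KiB per frame would give a spike roughly every 676 frames.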
Extra materials

These images use the Tracy profiler, watching `tracing`'s output at the `trace` level for all crates. I added a patch to `gpu-alloc` to add more tracing information; most of the time is spent in the backend allocate-memory function, which I believe is provided by `gfx-hal` in this case.

Tracy inspection of a two-frame spike:
Comparison to nearby normal frames:
Platform
GPU: GTX 970
OS: Manjaro Linux
Backend used: Vulkan
wgpu version: 0.7.0