Vulkan: Use VMA for memory allocations #11132
Conversation
Good to see this being attempted again. But this needs to be tested to make sure we actually have some gains or at least don't lose anything ( see #10380 ). The clang / windows checks might be needed from that review (?). Haven't looked at the buildbot failures. Additionally, I think the general practice is to now use a submodule for the external code. That'll reduce the LOC. |
@iwubcode I'm open to using a submodule. Btw, I've tried it on my PC and phone, and performance is identical. I can't see at first glance why your attempt was slower. |
I guess I'll also steal the pragmas from your branch to make the compilers happy. |
@K0bin -
As mentioned in my comment. Just personally, I don't want to cater to Mali (or any vendor / driver). I want to write graphics code with best practices. If we need workarounds for specific targets then we can do that like we did with your linear workaround (I'd personally leave workarounds out but sometimes they are inevitable).
Compared with master? That's good. I don't recall why it was slower at first but after some tweaks it became on-par. We just didn't have much benefit to merge it then. If we have a good reason now (I think uniformity between emulators should be a good enough reason tbh) then I'm on board.
Please do! Thanks again for working on this. |
I switched to using the VMA flags instead of picking memory flags directly, and it actually seems quite a bit faster now. I'm not sure about that though; I haven't measured it, just did a quick test. My best guess is that it now uses host-visible VRAM for staging buffers and the streaming buffer, which may improve GPU performance (and thus reduce the time spent waiting for the GPU to finish because of EFB peeks).
The main reason is that we seem to hit the Vulkan 4096 allocation limit on Windows in some cases. |
As mentioned in a comment, the VMA allocator isn't part of the Vulkan release, so it does not belong in the Vulkan external. It needs to be in its own directory (possibly with a CMake file for completeness?), preferably with a submodule.
I did a bit of testing on a variety of games. Memory utilization is higher on this branch than on master, but overall the games run fine. As somewhat of an aside, the command buffer enhancement PR caused a regression: if I try to play the opening of Opoona, my card runs out of memory. On master it returns DEVICE OUT OF MEMORY, while on this branch textures start to glitch out. I can get it to return DEVICE OUT OF MEMORY if I try to take a screenshot (or move too much). Not particularly related to this branch, but I bring it up because I'm not sure if this branch truly fixes the root cause of 13064. Then again, maybe this is a different issue. |
The higher memory usage can probably be addressed by using a smaller block size. Could you point me at the games where this is the case? Assuming this is a problem and we're not just talking about a few dozen megabytes here. As for the other issue, did you also use my descriptor set PR? It likely doesn't really run out of memory, just out of descriptors and/or descriptor sets, which that PR fixes. |
Yes, I tried all your PRs, none of them fix the issue. I watch as the VRAM usage goes higher and higher until there's no more VRAM left. I thought it was something related to how you changed the idle wait but that didn't resolve the issue either. I will go ahead and try out this PR again. Haven't looked at the code yet but the changes sound very good.
I did general testing on a handful of games but only looked at the memory usage in one game (besides the crashing one). That was The Last Story with a texture pack. I believe this PR was at 2.8 GB of VRAM usage whereas master was at 2.4 GB. I will try the PR changes and report back. |
Memory is now within an acceptable difference, I think. Using the same Last Story test, I see 2.3 GB for this PR and 2.2 GB for master. I also noticed that without the frame limit in place, master hits a high of 221% while this PR hits around 218%. That could be explained by slight differences between master and this PR. Opoona still runs out of memory, but overall I'm pretty happy with this implementation. I will finish my review. Great work @K0bin |
I guess we could try setting PREFER_HOST for staging buffers too and see if that makes a difference. Beyond that, we're pretty much doing the same things as far as the CPU is concerned.
I can reproduce that and am working on a fix. It's unrelated to this PR though. |
Fix for Opoona: #11143 |
While there might be some tweaks that can be made to the flags, overall this LGTM.
Might want to test this on ubershaders. On Metal (which doesn't have host-visible VRAM), I saw a noticeable improvement (45 → 60 fps on prime2tri-mapslowdown.dff.zip on my 5600M at 4x, run with VS expansion) from allocating separate host and device buffers and issuing copy commands to upload stream buffer data (but only when running exclusive ubershaders). |
I think that would be something to look into after this PR is merged. Master right now doesn't use VRAM either and in my testing (without uber shaders) in Mario Galaxy, copying straight to VRAM on the CPU is 25% slower. |
Testing? |
@s3731997 What do you mean? If you want to test it, go ahead. |
I addressed the review comments. |
Co-authored-by: iwubcode <iwubcode@users.noreply.github.com>
Looks good to me apart from one possible naming thing.
Untested, but looks good to me.
I did test this and while I didn't see any major difference, I do think this is good for other users/configurations. |
When trying to build this: don't forget to update the submodules.
The error posted in https://bugs.dolphin-emu.org/issues/13064#note-30 indicates that Dolphin hits the implementation-specific Vulkan memory allocation limit (allocation limit as in number of allocations, rather than amount of allocated memory).
To fix this, we can use a Vulkan memory allocator.
So with this PR, Dolphin joins the long list of Vulkan applications using AMD's VMA. PCSX2 and DuckStation, for example, also use VMA.
In theory, this should speed up memory allocations a bit, since we don't hit the system allocator for every call, but I don't think Dolphin allocates frequently enough for this to be noticeable at all.
Another nice side effect is that it should automatically use dedicated allocations for resources where that's beneficial (see VK_KHR_dedicated_allocation for more info). I know that this can have a decent impact on GPU performance on Nvidia hardware. That said, desktop GPUs are ridiculously powerful for the Dolphin workload anyway, so I don't think this is noticeable either.
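For readers unfamiliar with why an allocator library helps with the allocation-count limit: Vulkan implementations cap the number of live `vkAllocateMemory` allocations (`maxMemoryAllocationCount`, guaranteed to be at least 4096), and VMA sidesteps this by making a few large device allocations and sub-allocating resources out of them. This is a minimal self-contained sketch of that idea only; it is not VMA's actual API, and the class and member names are made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustration of sub-allocation: instead of one device allocation per
// resource (each of which counts against maxMemoryAllocationCount), carve
// many resources out of a small number of large blocks.
struct Block {
  std::size_t size;
  std::size_t offset = 0;  // simple bump pointer; VMA is far more sophisticated
};

class SubAllocator {
 public:
  explicit SubAllocator(std::size_t block_size) : block_size_(block_size) {}

  // Returns {block index, offset within that block} for the new resource.
  std::pair<std::size_t, std::size_t> Allocate(std::size_t size,
                                               std::size_t alignment) {
    // Try to fit the request into an existing block first.
    for (std::size_t i = 0; i < blocks_.size(); ++i) {
      const std::size_t aligned = Align(blocks_[i].offset, alignment);
      if (aligned + size <= blocks_[i].size) {
        blocks_[i].offset = aligned + size;
        return {i, aligned};
      }
    }
    // No room: this is the only path that would call vkAllocateMemory.
    blocks_.push_back({std::max(block_size_, size)});
    blocks_.back().offset = size;
    return {blocks_.size() - 1, 0};
  }

  std::size_t DeviceAllocationCount() const { return blocks_.size(); }

 private:
  static std::size_t Align(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
  }

  std::size_t block_size_;
  std::vector<Block> blocks_;
};
```

With, say, 1 MiB blocks, thousands of small buffer and image allocations consume only a handful of device allocations, which is why the 4096 limit stops being a problem.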