Vulkan: Use VMA for memory allocations #11132

Merged
merged 6 commits into dolphin-emu:master from K0bin:vma on Oct 23, 2022

Conversation

K0bin
Contributor

K0bin commented Oct 7, 2022

When trying to build this: don't forget to update the submodules.

The error posted in https://bugs.dolphin-emu.org/issues/13064#note-30 indicates that Dolphin hits the implementation-specific Vulkan memory allocation limit. (Allocation limit as in the number of allocations, rather than the amount of allocated memory.)

To fix this, we can use a Vulkan memory allocator.
So with this PR, Dolphin joins the long list of Vulkan applications using AMD's VMA (Vulkan Memory Allocator). PCSX2 and DuckStation, for example, also use it.

In theory, this should speed up memory allocations a bit, since we no longer hit the system allocator for every call, but I don't think Dolphin allocates frequently enough for that to be noticeable at all.
Another nice side effect is that it should automatically use dedicated allocations for resources where it's beneficial (see VK_KHR_dedicated_allocation for more info). I know that this can have a decent impact on GPU performance on Nvidia hardware. That said, desktop GPUs are ridiculously powerful for the Dolphin workload anyway, so I don't think this is noticeable either.
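
For illustration, the integration boils down to creating a single VmaAllocator alongside the existing device objects. A minimal sketch (not the actual diff; the handles are passed in from the backend, and error handling is omitted):

```cpp
// Define in exactly one .cpp file before including the header.
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"

// Minimal sketch: one allocator for the lifetime of the device.
VmaAllocator CreateAllocator(VkInstance instance, VkPhysicalDevice physical_device,
                             VkDevice device)
{
  VmaAllocatorCreateInfo allocator_info = {};
  allocator_info.physicalDevice = physical_device;
  allocator_info.device = device;
  allocator_info.instance = instance;
  allocator_info.vulkanApiVersion = VK_API_VERSION_1_1;
  // Let VMA use VK_KHR_dedicated_allocation where the driver reports a benefit.
  allocator_info.flags = VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT;

  VmaAllocator allocator = VK_NULL_HANDLE;
  vmaCreateAllocator(&allocator_info, &allocator);  // error handling omitted
  return allocator;
}
```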

@iwubcode
Contributor

iwubcode commented Oct 7, 2022

Good to see this being attempted again. But this needs to be tested to make sure we actually gain something, or at least don't lose anything (see #10380). The clang / Windows checks from that review might be needed (?). I haven't looked at the buildbot failures.

Additionally, I think the general practice now is to use a submodule for the external code. That'll also reduce the LOC.

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

@iwubcode I'm open to using the VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT flags, but that creates a problem with the Mali slow-coherent-memory workaround, where we prefer a coherent memory type even if that ends up being uncached.

Btw, I've tried it on my PC and phone, and performance is identical. I can't see at first glance why your attempt was slower.
I might also get rid of the manual coherency checking, because vmaFlushAllocation and vmaInvalidateAllocation check that internally.
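
Roughly, the flag-based path would look like this (an illustrative sketch; the size and usage are made up, and `allocator` is the VmaAllocator created at init):

```cpp
// Sketch: let VMA pick the memory type from access-pattern flags instead of
// hand-picking VkMemoryPropertyFlags.
VkBufferCreateInfo buffer_info = {};
buffer_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
buffer_info.size = 1024 * 1024;  // made-up size
buffer_info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo alloc_info = {};
alloc_info.usage = VMA_MEMORY_USAGE_AUTO;
// SEQUENTIAL_WRITE for upload-style buffers the CPU only writes linearly;
// use ..._RANDOM_BIT instead for readback buffers the CPU also reads.
alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &buffer_info, &alloc_info, &buffer, &allocation, nullptr);

// vmaFlushAllocation checks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT internally
// and is a no-op on coherent memory, so no manual coherency bookkeeping.
vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);
```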

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

I guess I'll also steal the pragmas from your branch to make the compilers happy.
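
(For reference, such pragmas usually take this shape around the VMA header; a sketch only — the exact warning list is whatever that branch suppressed:)

```cpp
// Sketch: silence third-party warnings for the VMA header only.
#ifdef __clang__
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-variable"
#pragma clang diagnostic ignored "-Wnullability-completeness"
#endif
#ifdef _MSC_VER
#pragma warning(push, 0)  // MSVC: quiet all warnings for this include
#endif

#include "vk_mem_alloc.h"

#ifdef _MSC_VER
#pragma warning(pop)
#endif
#ifdef __clang__
#pragma clang diagnostic pop
#endif
```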

@iwubcode
Contributor

iwubcode commented Oct 7, 2022

@K0bin -

I'm open to using the VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT flags, but that creates a problem with the Mali slow-coherent-memory workaround, where we prefer a coherent memory type even if that ends up being uncached.

As mentioned in my comment: just personally, I don't want to cater to Mali (or any vendor/driver). I want to write graphics code with best practices. If we need workarounds for specific targets, then we can do that like we did with your linear workaround (I'd personally leave workarounds out, but sometimes they are inevitable).

Btw, I've tried it on my PC and phone, and performance is identical. I can't see at first glance why your attempt was slower.

Compared with master? That's good. I don't recall why it was slower at first, but after some tweaks it became on par. We just didn't have much benefit to merging it then. If we have a good reason now (I think uniformity between emulators should be a good enough reason, tbh), then I'm on board.

I guess I'll also steal the pragmas from your branch to make the compilers happy.

Please do! Thanks again for working on this.

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

I switched to using the VMA flags instead of picking memory flags directly, and it actually seems quite a bit faster now. I'm not sure about that; I haven't measured it, just done a quick test. My best guess is that it now uses host-visible VRAM for staging buffers and the streaming buffer, which may improve GPU performance (and thus reduce the time spent waiting for the GPU to finish because of EFB peeks).

If we have a good reason now

The main reason is that we seem to hit the Vulkan 4096-allocation limit (maxMemoryAllocationCount) on Windows in some cases.

@K0bin force-pushed the vma branch 2 times, most recently from 8e143f7 to 7efffb4 on October 7, 2022
Contributor

iwubcode left a comment


As mentioned in a comment, the VMA allocator isn't part of the Vulkan release, so it does not belong in the Vulkan external. It needs to be in its own directory (possibly with a CMake file for completeness?), preferably as a submodule.

@iwubcode
Contributor

iwubcode commented Oct 8, 2022

I did a bit of testing on a variety of games. Memory utilization is higher on this branch than it is on master, but overall the games run fine.

As somewhat of an aside, the command buffer enhancement PR caused a regression: if I try to play the opening of Opoona, my card runs out of memory. On master it returns DEVICE OUT OF MEMORY, while on this branch textures start to glitch out. I can get it to return DEVICE OUT OF MEMORY if I try to take a screenshot (or move too much). Not particularly related to this branch, but I bring it up because I'm not sure if this branch truly fixes the root cause of 13064. Then again, maybe this is a different issue.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

The higher memory usage can probably be addressed by using a smaller block size. Could you point me at games where this is the case? Assuming this is a problem and we're not just talking about a few dozen megabytes here.

As for the other issue, did you also use my descriptor set PR? It likely doesn't really run out of memory, just out of descriptors and/or descriptor sets, which that PR fixes.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

  • Reduced the block size from the default 256 MiB to 64 MiB to reduce memory overhead.
  • Pulled in VMA as a git submodule (someone help with the buildbot, pls).
  • Enabled VK_EXT_memory_budget when supported and set the flag for allocations to stay within the budget.
  • Removed the coherency bools, because VMA checks that internally anyway.
  • Removed manual mapping and unmapping and changed it to use the CREATE_MAPPED flag instead.
  • Used PREFER_HOST for the stream buffer, because otherwise VMA picks a VRAM memory type, which hurts performance. (A sketch of how these flags fit together follows below.)
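
Roughly how those pieces map onto the VMA API (an illustrative sketch, not the exact diff; `supports_memory_budget` is a hypothetical flag, and `buffer_info` is an ordinary VkBufferCreateInfo as in the earlier sketch):

```cpp
// Allocator side: smaller blocks, plus the memory-budget extension when present.
VmaAllocatorCreateInfo allocator_info = {};
allocator_info.preferredLargeHeapBlockSize = 64 * 1024 * 1024;  // 64 MiB (default: 256 MiB)
if (supports_memory_budget)  // hypothetical: VK_EXT_memory_budget is available
  allocator_info.flags |= VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT;

// Stream buffer side: persistently mapped, host-preferred, kept within budget.
VmaAllocationCreateInfo alloc_info = {};
alloc_info.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;
alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                   VMA_ALLOCATION_CREATE_MAPPED_BIT |        // replaces manual vmaMapMemory
                   VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT;  // fail rather than exceed budget

VkBuffer buffer;
VmaAllocation allocation;
VmaAllocationInfo info;
vmaCreateBuffer(allocator, &buffer_info, &alloc_info, &buffer, &allocation, &info);
void* mapped = info.pMappedData;  // stays mapped for the allocation's lifetime
```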

@K0bin force-pushed the vma branch 3 times, most recently from ea37f26 to fb46965 on October 8, 2022
@iwubcode
Contributor

iwubcode commented Oct 8, 2022

As for the other issue, did you also use my descriptor set PR? It likely doesn't really run out of memory, just out of descriptors and/or descriptor sets, which that PR fixes.

Yes, I tried all your PRs; none of them fix the issue. I watch as the VRAM usage goes higher and higher until there's no more VRAM left. I thought it was something related to how you changed the idle wait, but that didn't resolve the issue either.

I will go ahead and try out this PR again. Haven't looked at the code yet but the changes sound very good.

The higher memory usage can probably be addressed by using a smaller block size. Could you point me at games where this is the case? Assuming this is a problem and we're not just talking about a few dozen megabytes here.

I did general testing on a handful of games but only looked at the memory usage of one game (besides the crashing one). That was The Last Story with a texture pack. I believe this PR was at 2.8 GB of VRAM usage whereas master was at 2.4 GB. I will try the PR changes and report back.

@iwubcode
Contributor

iwubcode commented Oct 8, 2022

Memory is now within an acceptable difference, I think. Using the same Last Story test, I see 2.3 GB for this PR and 2.2 GB for master. I also noticed that without the frame limit in place, master hits a high of 221% while this PR hits around 218%. That could be explained by slight differences between master and this PR.

Opoona still runs out of memory, but overall I'm pretty happy with this implementation. I will finish my review.

Great work @K0bin

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

I also noticed that without the frame limit in place, master hits a high of 221% while this PR hits around 218%. That could be explained by slight differences between master and this PR.

I guess we could try setting PREFER_HOST for staging buffers too and see if that makes a difference. Beyond that, we're doing pretty much the same things as far as the CPU is concerned.

Opoona still runs out of memory

I can reproduce that and am working on a fix. It's unrelated to this PR though.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

Fix for Opoona: #11143

Contributor

iwubcode left a comment


While there might be some tweaks that can be made to the flags, overall this LGTM.

@TellowKrinkle
Contributor

Used PREFER_HOST for the stream buffer, because otherwise VMA picks a VRAM memory type, which hurts performance.

Might want to test this on ubershaders. On Metal (which doesn't have host-visible VRAM), I saw a noticeable improvement (45 → 60 fps on prime2tri-mapslowdown.dff.zip on my 5600M at 4x, run with VS expansion) from allocating separate host and device buffers and issuing copy commands to upload stream buffer data (but only when running exclusive ubershaders).

@K0bin
Contributor Author

K0bin commented Oct 10, 2022

from allocating separate host and device buffers and issuing copy commands to upload stream buffer data

I think that would be something to look into after this PR is merged. Master right now doesn't use VRAM either, and in my testing (without ubershaders) in Mario Galaxy, copying straight to VRAM on the CPU is 25% slower.
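
For reference, the upload scheme described above looks roughly like this in Vulkan (a sketch with placeholder handles and offsets, not code from this PR):

```cpp
// Sketch: the CPU writes into a host-visible staging buffer, then the GPU
// copies the dirty range into a DEVICE_LOCAL buffer.
const VkBufferCopy region = {upload_offset, upload_offset, upload_size};
vkCmdCopyBuffer(command_buffer, host_stream_buffer, device_stream_buffer, 1, &region);

// Make the copy visible to subsequent vertex-shader reads.
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier(command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0, 1, &barrier,
                     0, nullptr, 0, nullptr);
```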

@testsr-student

Testing?

@K0bin
Contributor Author

K0bin commented Oct 21, 2022

@s3731997

What do you mean? If you want to test it, go ahead.

@K0bin
Contributor Author

K0bin commented Oct 23, 2022

I addressed the review comments.

Co-authored-by: iwubcode <iwubcode@users.noreply.github.com>
Contributor

Pokechu22 left a comment


Looks good to me apart from one possible naming thing.

Source/Core/VideoBackends/Vulkan/StagingBuffer.h (review comment, resolved)
Contributor

Pokechu22 left a comment


Untested, but looks good to me.

@JMC47
Copy link
Contributor

JMC47 commented Oct 23, 2022

I did test this, and while I didn't see any major difference, I do think this is good for other users/configurations.

@JMC47 JMC47 merged commit 06bd0a9 into dolphin-emu:master Oct 23, 2022
@K0bin K0bin deleted the vma branch October 23, 2022 10:42