Vulkan: Use VMA for memory allocations #11132

Merged
merged 6 commits into dolphin-emu:master from K0bin:vma on Oct 23, 2022

Conversation

K0bin
Contributor

K0bin commented Oct 7, 2022

When trying to build this: don't forget to update the submodules.

The error posted in https://bugs.dolphin-emu.org/issues/13064#note-30 indicates that Dolphin hits the implementation-specific Vulkan memory allocation limit. (Allocation limit as in the number of allocations, rather than the amount of allocated memory.)

To fix this, we can use a Vulkan memory allocator.
So with this PR, Dolphin joins the long list of Vulkan applications using AMD's VMA (Vulkan Memory Allocator). PCSX2 and DuckStation, for example, also use it.

In theory, this should speed up memory allocations a bit, since we no longer hit the system allocator for every call, but I don't think Dolphin allocates frequently enough for that to be noticeable at all.
Another nice side effect is that it should automatically use dedicated allocations for resources where it's beneficial (see VK_KHR_dedicated_allocation for more info). I know that this can have a decent impact on GPU performance on Nvidia hardware. That said, desktop GPUs are ridiculously powerful for the Dolphin workload anyway, so I don't think this is noticeable either.
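
For illustration, the integration boils down to creating a single VmaAllocator alongside the existing device objects. A minimal sketch (not the actual diff; the handles are passed in from the backend, and error handling is omitted):

```cpp
// Define in exactly one .cpp file before including the header.
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"

// Minimal sketch: one allocator for the lifetime of the device.
VmaAllocator CreateAllocator(VkInstance instance, VkPhysicalDevice physical_device,
                             VkDevice device)
{
  VmaAllocatorCreateInfo allocator_info = {};
  allocator_info.physicalDevice = physical_device;
  allocator_info.device = device;
  allocator_info.instance = instance;
  allocator_info.vulkanApiVersion = VK_API_VERSION_1_1;
  // Let VMA use VK_KHR_dedicated_allocation where the driver reports a benefit.
  allocator_info.flags = VMA_ALLOCATOR_CREATE_KHR_DEDICATED_ALLOCATION_BIT;

  VmaAllocator allocator = VK_NULL_HANDLE;
  vmaCreateAllocator(&allocator_info, &allocator);  // error handling omitted
  return allocator;
}
```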

@iwubcode
Contributor

iwubcode commented Oct 7, 2022

Good to see this being attempted again. But this needs to be tested to make sure we actually gain something, or at least don't lose anything (see #10380). The clang / Windows checks from that review might be needed (?). I haven't looked at the buildbot failures.

Additionally, I think the general practice now is to use a submodule for the external code. That'll also reduce the LOC.

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

@iwubcode I'm open to using the VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT flags, but that creates a problem with the Mali slow-coherent-memory workaround, where we prefer a coherent memory type even if that ends up being uncached.

Btw, I've tried it on my PC and phone, and performance is identical. I can't see at first glance why your attempt was slower.
I might also get rid of the manual coherency checking, because vmaFlushAllocation and vmaInvalidateAllocation check that internally.
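
Roughly, the flag-based path would look like this (an illustrative sketch; the size and usage are made up, and `allocator` is the VmaAllocator created at init):

```cpp
// Sketch: let VMA pick the memory type from access-pattern flags instead of
// hand-picking VkMemoryPropertyFlags.
VkBufferCreateInfo buffer_info = {};
buffer_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
buffer_info.size = 1024 * 1024;  // made-up size
buffer_info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo alloc_info = {};
alloc_info.usage = VMA_MEMORY_USAGE_AUTO;
// SEQUENTIAL_WRITE for upload-style buffers the CPU only writes linearly;
// use ..._RANDOM_BIT instead for readback buffers the CPU also reads.
alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &buffer_info, &alloc_info, &buffer, &allocation, nullptr);

// vmaFlushAllocation checks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT internally
// and is a no-op on coherent memory, so no manual coherency bookkeeping.
vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);
```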

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

I guess I'll also steal the pragmas from your branch to make the compilers happy.
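
(For reference, such pragmas usually take this shape around the VMA header; a sketch only — the exact warning list is whatever that branch suppressed:)

```cpp
// Sketch: silence third-party warnings for the VMA header only.
#ifdef __clang__
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-variable"
#pragma clang diagnostic ignored "-Wnullability-completeness"
#endif
#ifdef _MSC_VER
#pragma warning(push, 0)  // MSVC: quiet all warnings for this include
#endif

#include "vk_mem_alloc.h"

#ifdef _MSC_VER
#pragma warning(pop)
#endif
#ifdef __clang__
#pragma clang diagnostic pop
#endif
```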

@iwubcode
Contributor

iwubcode commented Oct 7, 2022

@K0bin -

I'm open to using the VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT and VMA_ALLOCATION_CREATE_HOST_ACCESS_RANDOM_BIT flags, but that creates a problem with the Mali slow-coherent-memory workaround, where we prefer a coherent memory type even if that ends up being uncached.

As mentioned in my comment: just personally, I don't want to cater to Mali (or any vendor/driver). I want to write graphics code with best practices. If we need workarounds for specific targets, then we can do that like we did with your linear workaround (I'd personally leave workarounds out, but sometimes they are inevitable).

Btw, I've tried it on my PC and phone, and performance is identical. I can't see at first glance why your attempt was slower.

Compared with master? That's good. I don't recall why it was slower at first, but after some tweaks it became on par. We just didn't have much benefit to merging it then. If we have a good reason now (I think uniformity between emulators should be a good enough reason, tbh), then I'm on board.

I guess I'll also steal the pragmas from your branch to make the compilers happy.

Please do! Thanks again for working on this.

@K0bin
Contributor Author

K0bin commented Oct 7, 2022

I switched to using the VMA flags instead of picking memory flags directly, and it actually seems quite a bit faster now. I'm not sure about that; I haven't measured it, just done a quick test. My best guess is that it now uses host-visible VRAM for staging buffers and the streaming buffer, which may improve GPU performance (and thus reduce the time spent waiting for the GPU to finish because of EFB peeks).

If we have a good reason now

The main reason is that we seem to hit the Vulkan 4096-allocation limit (maxMemoryAllocationCount) on Windows in some cases.

@K0bin force-pushed the vma branch 2 times, most recently from 8e143f7 to 7efffb4 on October 7, 2022
Contributor

iwubcode left a comment


As mentioned in a comment, the VMA allocator isn't part of the Vulkan release, so it does not belong in the Vulkan external. It needs to be in its own directory (possibly with a CMake file for completeness?), preferably as a submodule.

@iwubcode
Contributor

iwubcode commented Oct 8, 2022

I did a bit of testing on a variety of games. Memory utilization is higher on this branch than it is on master, but overall the games run fine.

As somewhat of an aside, the command buffer enhancement PR caused a regression: if I try to play the opening of Opoona, my card runs out of memory. On master it returns DEVICE OUT OF MEMORY, while on this branch textures start to glitch out. I can get it to return DEVICE OUT OF MEMORY if I try to take a screenshot (or move too much). Not particularly related to this branch, but I bring it up because I'm not sure if this branch truly fixes the root cause of 13064. Then again, maybe this is a different issue.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

The higher memory usage can probably be addressed by using a smaller block size. Could you point me at games where this is the case? Assuming this is a problem and we're not just talking about a few dozen megabytes here.

As for the other issue, did you also use my descriptor set PR? It likely doesn't really run out of memory, just out of descriptors and/or descriptor sets, which that PR fixes.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

  • Reduced the block size from the default 256 MiB to 64 MiB to reduce memory overhead.
  • Pulled in VMA as a git submodule (someone help with the buildbot, pls).
  • Enabled VK_EXT_memory_budget when supported and set the flag for allocations to stay within the budget.
  • Removed the coherency bools, because VMA checks that internally anyway.
  • Removed manual mapping and unmapping and changed it to use the CREATE_MAPPED flag instead.
  • Used PREFER_HOST for the stream buffer, because otherwise VMA picks a VRAM memory type, which hurts performance. (A sketch of how these flags fit together follows below.)
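
Roughly how those pieces map onto the VMA API (an illustrative sketch, not the exact diff; `supports_memory_budget` is a hypothetical flag, and `buffer_info` is an ordinary VkBufferCreateInfo as in the earlier sketch):

```cpp
// Allocator side: smaller blocks, plus the memory-budget extension when present.
VmaAllocatorCreateInfo allocator_info = {};
allocator_info.preferredLargeHeapBlockSize = 64 * 1024 * 1024;  // 64 MiB (default: 256 MiB)
if (supports_memory_budget)  // hypothetical: VK_EXT_memory_budget is available
  allocator_info.flags |= VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT;

// Stream buffer side: persistently mapped, host-preferred, kept within budget.
VmaAllocationCreateInfo alloc_info = {};
alloc_info.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;
alloc_info.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                   VMA_ALLOCATION_CREATE_MAPPED_BIT |        // replaces manual vmaMapMemory
                   VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT;  // fail rather than exceed budget

VkBuffer buffer;
VmaAllocation allocation;
VmaAllocationInfo info;
vmaCreateBuffer(allocator, &buffer_info, &alloc_info, &buffer, &allocation, &info);
void* mapped = info.pMappedData;  // stays mapped for the allocation's lifetime
```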

@K0bin force-pushed the vma branch 3 times, most recently from ea37f26 to fb46965 on October 8, 2022
@iwubcode
Contributor

iwubcode commented Oct 8, 2022

As for the other issue, did you also use my descriptor set PR? It likely doesn't really run out of memory, just out of descriptors and/or descriptor sets, which that PR fixes.

Yes, I tried all your PRs; none of them fix the issue. I watch as the VRAM usage goes higher and higher until there's no more VRAM left. I thought it was something related to how you changed the idle wait, but that didn't resolve the issue either.

I will go ahead and try out this PR again. Haven't looked at the code yet but the changes sound very good.

The higher memory usage can probably be addressed by using a smaller block size. Could you point me at games where this is the case? Assuming this is a problem and we're not just talking about a few dozen megabytes here.

I did general testing on a handful of games but only looked at the memory usage of one game (besides the crashing one). That was The Last Story with a texture pack. I believe this PR was at 2.8 GB of VRAM usage whereas master was at 2.4 GB. I will try the PR changes and report back.

@iwubcode
Contributor

iwubcode commented Oct 8, 2022

Memory is now within an acceptable difference, I think. Using the same Last Story test, I see 2.3 GB for this PR and 2.2 GB for master. I also noticed that without the frame limit in place, master hits a high of 221% while this PR hits around 218%. That could be explained by slight differences between master and this PR.

Opoona still runs out of memory, but overall I'm pretty happy with this implementation. I will finish my review.

Great work @K0bin

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

I also noticed that without the frame limit in place, master hits a high of 221% while this PR hits around 218%. That could be explained by slight differences between master and this PR.

I guess we could try setting PREFER_HOST for staging buffers too and see if that makes a difference. Beyond that, we're doing pretty much the same things as far as the CPU is concerned.

Opoona still runs out of memory

I can reproduce that and am working on a fix. It's unrelated to this PR though.

@K0bin
Contributor Author

K0bin commented Oct 8, 2022

Fix for Opoona: #11143

Contributor

iwubcode left a comment


While there might be some tweaks that can be made to the flags, overall this LGTM.

@TellowKrinkle
Contributor

Used PREFER_HOST for the stream buffer, because otherwise VMA picks a VRAM memory type, which hurts performance.

Might want to test this on ubershaders. On Metal (which doesn't have host-visible VRAM), I saw a noticeable improvement (45 → 60 fps on prime2tri-mapslowdown.dff.zip on my 5600M at 4x, run with VS expansion) from allocating separate host and device buffers and issuing copy commands to upload stream buffer data (but only when running exclusive ubershaders).

@K0bin
Contributor Author

K0bin commented Oct 10, 2022

from allocating separate host and device buffers and issuing copy commands to upload stream buffer data

I think that would be something to look into after this PR is merged. Master right now doesn't use VRAM either, and in my testing (without ubershaders) in Mario Galaxy, copying straight to VRAM on the CPU is 25% slower.
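
For reference, the upload scheme described above looks roughly like this in Vulkan (a sketch with placeholder handles and offsets, not code from this PR):

```cpp
// Sketch: the CPU writes into a host-visible staging buffer, then the GPU
// copies the dirty range into a DEVICE_LOCAL buffer.
const VkBufferCopy region = {upload_offset, upload_offset, upload_size};
vkCmdCopyBuffer(command_buffer, host_stream_buffer, device_stream_buffer, 1, &region);

// Make the copy visible to subsequent vertex-shader reads.
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier(command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0, 1, &barrier,
                     0, nullptr, 0, nullptr);
```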

@testsr-student

Testing?

@K0bin
Contributor Author

K0bin commented Oct 21, 2022

@s3731997

What do you mean? If you want to test it, go ahead.

@K0bin
Contributor Author

K0bin commented Oct 23, 2022

I addressed the review comments.

Co-authored-by: iwubcode <iwubcode@users.noreply.github.com>
Contributor

Pokechu22 left a comment


Looks good to me apart from one possible naming thing.

Source/Core/VideoBackends/Vulkan/StagingBuffer.h (review comment, resolved)
Contributor

Pokechu22 left a comment


Untested, but looks good to me.

@JMC47
Copy link
Contributor

JMC47 commented Oct 23, 2022

I did test this, and while I didn't see any major difference, I do think this is good for other users/configurations.

@JMC47 JMC47 merged commit 06bd0a9 into dolphin-emu:master Oct 23, 2022
@K0bin K0bin deleted the vma branch October 23, 2022 10:42