Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VideoCommon: Add dynamic vertex loader for ubershaders to reduce pipeline count #10781

Merged
merged 10 commits into from Sep 20, 2022

Conversation

TellowKrinkle
Copy link
Contributor

@TellowKrinkle TellowKrinkle commented Jun 24, 2022

Some fun things I've been working on, figured I'd post it here for other people to try out, even if it's not quite ready for merge

Adds a vertex loader to vertex ubershaders which removes the need to do anything with the input assembler, reducing the number of unique pipelines needed. This makes the number of different pipelines that actually require recompiles small enough that we can compile them all ahead of time on most renderers. In particular:

  • Metal: Completely eliminates pipeline compiles on all known GPUs
  • DX12: Completely eliminates pipeline compiles on the two GPUs I tested (Radeon Pro 5600M with weird AMD Mac bootcamp drivers and a GT 750M with its ancient Kepler Mobile driver)
  • Vulkan:
    • Completely eliminates pipeline compiles on MoltenVK on Intel, AMD, and Nvidia GPUs
    • Mostly eliminates pipeline compiles on MoltenVK on Apple GPUs
    • Mostly eliminates pipeline compiles on the two GPUs listed above on their Windows Vulkan drivers

The implementation puts the list of offsets and strides into the vertex uniform, which the shader uses to figure out where it should load from in an SSBO.

I'm not super confident in my Vulkan code and even less in my DX12 code, so if someone could look those over especially closely that would be nice.

Before and After recordings (Running with all pipeline and driver caches cleared, exclusive ubershaders, compile shaders before starting)

Before:

Base.mp4

After (that one stutter is an EFB copy pipeline):

UberVertexLoader.mp4

Note: Just waiting on #10747All ready

@TellowKrinkle TellowKrinkle force-pushed the UberVertexLoader branch 2 times, most recently from d00065d to 6242055 Compare June 24, 2022 07:43
@K0bin
Copy link
Contributor

K0bin commented Jun 29, 2022

Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?

@Dentomologist
Copy link
Contributor

In the beginning of the title screen in your after video there's a bunch of glitchiness on the left side. Is that a preexisting issue that just didn't show up in the before video?

@JMC47
Copy link
Contributor

JMC47 commented Jun 29, 2022

I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.

@Rumi-Larry
Copy link

Rumi-Larry commented Jun 29, 2022

I think it would make more sense to make a PR without the Metal code.

@TellowKrinkle
Copy link
Contributor Author

TellowKrinkle commented Jun 30, 2022

Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)?

Probably, I admittedly wasn't really developing this for Vulkan. I'm happy to adjust the Vulkan implementation to check g_ActiveConfig.backend_info.bSupportsDynamicVertexLoader so that the whole thing can be flipped off by writing false into the variable when someone does implement VK_EXT_vertex_input_dynamic_state though. Maybe use a different flag to enable the additional pipelines in the VideoCommon: Compile a few extra pipelines commit as well? Or just always compile them, assuming it'll be really fast on OGL / DX11 since it doesn't actually involve any compilation there?

I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it.

I had to compress those pretty hard to get under GitHub's 10mb limit. Used the same (horribly low) quality settings on both but clearly that doesn't mean x264 will produce similar results. Can confirm that doesn't happen when you actually play.

I think it would make more sense to make a PR without the Metal code.

Just to confirm, you think that even though it would still include the spirv-cross PR code? spirv-cross was merged, so I rebased off Metal

@OatmealDome
Copy link
Member

@MayImilae Would be interesting to see if this fixes the problems with frame times you observed on Intel Macs.

@TellowKrinkle TellowKrinkle marked this pull request as ready for review July 17, 2022 05:05
@Pokechu22
Copy link
Contributor

I haven't looked over this in detail, but it feels like a somewhat strange approach. Dolphin already has a system to reformat vertices from the GameCube's format to one that works well for modern GPUs (VertexLoaderX64/VertexLoaderARM64), though as you've found it generates different formats based on the original one. IMO it would be better to modify that so that a single vertex with all components is generated. That would mean using 156-byte vertices in all cases (even if the input vertex is just a color and a single texture coordinate), though.

@TellowKrinkle
Copy link
Contributor Author

The majority of vertices I've seen are in the 20-28 byte range, which would mean sending everything would be a 6-8x increase in vertex bytes, just to slightly reduce the complexity of ubershaders. Messing with stats reporting shows that it would be pretty common to see games hitting 20MB/frame or a bit over 1GB/s of vertices this way.

Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader.

@JMC47
Copy link
Contributor

JMC47 commented Jul 17, 2022

Ah, Line-width without geometry shaders might actually be faster on Android. That's how slow they are.

Copy link
Contributor

@Pokechu22 Pokechu22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I've looked over it further, this approach doesn't seem that bad (although I think it is a bit jank that we have several things labeled as vertex loaders that all do slightly different things). Since all components are 4-byte aligned, the actual code seems fairly simple, and uintBitsToFloat is also convenient.

It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit (and it would require changing the assembly vertex loaders).

Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader.

Note that there's an existing implementation of this that's rather jank at #10295 (which doesn't use a shader-based vertex loader, but instead duplicates them in IndexGenerator.cpp).

Source/Core/VideoCommon/UberShaderVertex.cpp Outdated Show resolved Hide resolved
Source/Core/VideoCommon/VertexShaderManager.cpp Outdated Show resolved Hide resolved
"\tuint vertex_offset_posmtx;\n"
"\tuint vertex_offset_rawcolor0;\n"
"\tuint vertex_offset_rawcolor1;\n"
"\tuint4 vertex_offset_rawtex[2];\n" // std140 is pain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit odd that these are included in the uniform for specialized shaders (and ubershaders that don't use the vertex loader), though I don't know if 64 additional bytes is really a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS uniform size is currently 4032 bytes, and this makes it 4096 bytes, a ~1.6% increase.

If it becomes a big issue, our full vertex layout can be calculated from the component list plus ~17 bits of component size data (since the order is always the same), just a bit more messily. We could put something like this at the top of main instead:

// vs uniform: u32 component_sizes; // bits 0-16: rawtex, bit 17: position has 3 floats
uint vertex_stride = 0;
uint vertex_offset_posmtx;
uint vertex_offset_rawpos;
uint vertex_offset_rawnormal;
uint vertex_offset_rawtangent;
uint vertex_offset_rawbinormal;
uint vertex_offset_rawcolor0;
uint vertex_offset_rawcolor1;
uint vertex_offset_rawtex[8];
if (components & VB_HAS_POSMTX) {
  vertex_offset_posmtx = vertex_stride;
  vertex_stride++;
}
vertex_offset_position = vertex_stride;
vertex_stride += component_sizes & (1 << 17) ? 3 : 2;
if (components & VB_HAS_NORMAL) {
  vertex_offset_rawnormal = vertex_stride;
  vertex_stride += 3;
}
// ... binormal, tangent, color0, color1
for (int i = 0; i < 8; i++) {
  if (components & (VB_HAS_UV0 << i)) {
    vertex_offset_rawtex[i] = vertex_stride;
    vertex_stride += (component_sizes >> (2 * i)) & 3;
  }
}

Assuming the compiler recognizes those as uniform, AMD and Intel's ISAs probably wouldn't mind too much, it's just a bunch of sgprs on AMD (we usually have plenty to spare) and would take 2/128 gprs on Intel. No clue how Nvidia and Apple would fare though, since I'm not sure those two have special storage for non-constant uniform values.

A second question would be how often do we upload new uniforms for something that doesn't affect non-uber shaders. I would guess this happens more often in PS uniforms than VS uniforms though. Might still be useful to have separate dirty flags for the two in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO putting it in the uniform is better than calculating it in the shader itself. Non-uber shaders seem like the bigger concern to me, but on the other hand the vertex format changing is likely to require a different shader, so I don't think uploading new uniforms will matter that much. (And, often the vertex format changing will only happen when other settings change, though you can set up to 8 VATs and switch between them as easily as switching primitives and that can toggle having the tangent and binormal (and change the vertex format more, but I think everything else is irrelevant by this point in Dolphin). I don't know if this is relevant in practice; the closest case might be Rogue Squadron 2/3 which deliberately don't send tangent/binormal vectors for the VAT they use in one case, but they change other things in between.)

@TellowKrinkle
Copy link
Contributor Author

It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit

Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?

@Pokechu22
Copy link
Contributor

It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit

Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all?

I don't have an exact idea in mind, but it probably would require binding 15 different buffers, yeah. The main benefit would be reducing the amount of calculation that needs to be done to find the index into those buffers since you would always know the correct buffer and stride for given component. But I'm not sure if that would actually be any faster (or if it's possible to avoid syncing data for buffers that aren't relevant), and it would require lots of other work.

@TellowKrinkle TellowKrinkle force-pushed the UberVertexLoader branch 4 times, most recently from f9c7b37 to 678d75b Compare August 12, 2022 20:42
@Pokechu22
Copy link
Contributor

Other than that things look mostly reasonable, though I don't really understand the D3D or Vulkan setup code. I don't see anything that looks obviously wrong, at least.

Also removes cullmode all handling, it's handled in CPU and DX11 backend doesn't specially handle it either
Copy link
Contributor

@Pokechu22 Pokechu22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I don't 100% understand everything, but I think it's good enough to merge for users to find any remaining subtle bugs.

@JMC47 JMC47 merged commit 22197c0 into dolphin-emu:master Sep 20, 2022
11 checks passed
@TellowKrinkle TellowKrinkle deleted the UberVertexLoader branch September 22, 2022 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
7 participants