New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VideoCommon: Add dynamic vertex loader for ubershaders to reduce pipeline count #10781
Conversation
d00065d
to
6242055
Compare
|
Wouldn't it make more sense to use VK_EXT_vertex_input_dynamic_state on Vulkan when supported (Nvidia GPUs)? |
|
In the beginning of the title screen in your after video there's a bunch of glitchiness on the left side. Is that a preexisting issue that just didn't show up in the before video? |
|
I'm guessing that's an encoding issue based on the size of the pixels, so I just ignored it. |
|
I think it would make more sense to make a PR without the Metal code. |
Probably, I admittedly wasn't really developing this for Vulkan. I'm happy to adjust the Vulkan implementation to check
I had to compress those pretty hard to get under GitHub's 10mb limit. Used the same (horribly low) quality settings on both but clearly that doesn't mean x264 will produce similar results. Can confirm that doesn't happen when you actually play.
|
6242055
to
53b713d
Compare
|
@MayImilae Would be interesting to see if this fixes the problems with frame times you observed on Intel Macs. |
53b713d
to
7313f0f
Compare
|
I haven't looked over this in detail, but it feels like a somewhat strange approach. Dolphin already has a system to reformat vertices from the GameCube's format to one that works well for modern GPUs ( |
|
The majority of vertices I've seen are in the 20-28 byte range, which would mean sending everything would be a 6-8x increase in vertex bytes, just to slightly reduce the complexity of ubershaders. Messing with stats reporting shows that it would be pretty common to see games hitting 20MB/frame or a bit over 1GB/s of vertices this way. Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader. |
|
Ah, Line-width without geometry shaders might actually be faster on Android. That's how slow they are. |
5f215ca
to
6a63fde
Compare
6a63fde
to
9c902dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I've looked over it further, this approach doesn't seem that bad (although I think it is a bit jank that we have several things labeled as vertex loaders that all do slightly different things). Since all components are 4-byte aligned, the actual code seems fairly simple, and uintBitsToFloat is also convenient.
It might be a bit simpler if there were, say, individual arrays for each component, some of which don't get populated, but I'm not sure if that would work or have any real benefit (and it would require changing the assembly vertex loaders).
Side note, if you were hoping to keep vertex loaders out of shaders in general, I was planning on implementing vertex shader line to triangle expansion for platforms without geometry shaders, which will also require a shader vertex loader.
Note that there's an existing implementation of this that's rather jank at #10295 (which doesn't use a shader-based vertex loader, but instead duplicates them in IndexGenerator.cpp).
| "\tuint vertex_offset_posmtx;\n" | ||
| "\tuint vertex_offset_rawcolor0;\n" | ||
| "\tuint vertex_offset_rawcolor1;\n" | ||
| "\tuint4 vertex_offset_rawtex[2];\n" // std140 is pain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit odd that these are included in the uniform for specialized shaders (and ubershaders that don't use the vertex loader), though I don't know if 64 additional bytes is really a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VS uniform size is currently 4032 bytes, and this makes it 4096 bytes, a ~1.6% increase.
If it becomes a big issue, our full vertex layout can be calculated from the component list plus ~17 bits of component size data (since the order is always the same), just a bit more messily. We could put something like this at the top of main instead:
// vs uniform: u32 component_sizes; // bits 0-16: rawtex, bit 17: position has 3 floats
uint vertex_stride = 0;
uint vertex_offset_posmtx;
uint vertex_offset_rawpos;
uint vertex_offset_rawnormal;
uint vertex_offset_rawtangent;
uint vertex_offset_rawbinormal;
uint vertex_offset_rawcolor0;
uint vertex_offset_rawcolor1;
uint vertex_offset_rawtex[8];
if (components & VB_HAS_POSMTX) {
vertex_offset_posmtx = vertex_stride;
vertex_stride++;
}
vertex_offset_position = vertex_stride;
vertex_stride += component_sizes & (1 << 17) ? 3 : 2;
if (components & VB_HAS_NORMAL) {
vertex_offset_rawnormal = vertex_stride;
vertex_stride += 3;
}
// ... binormal, tangent, color0, color1
for (int i = 0; i < 8; i++) {
if (components & (VB_HAS_UV0 << i)) {
vertex_offset_rawtex[i] = vertex_stride;
vertex_stride += (component_sizes >> (2 * i)) & 3;
}
}Assuming the compiler recognizes those as uniform, AMD and Intel's ISAs probably wouldn't mind too much, it's just a bunch of sgprs on AMD (we usually have plenty to spare) and would take 2/128 gprs on Intel. No clue how Nvidia and Apple would fare though, since I'm not sure those two have special storage for non-constant uniform values.
A second question would be how often do we upload new uniforms for something that doesn't affect non-uber shaders. I would guess this happens more often in PS uniforms than VS uniforms though. Might still be useful to have separate dirty flags for the two in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO putting it in the uniform is better than calculating it in the shader itself. Non-uber shaders seem like the bigger concern to me, but on the other hand the vertex format changing is likely to require a different shader, so I don't think uploading new uniforms will matter that much. (And, often the vertex format changing will only happen when other settings change, though you can set up to 8 VATs and switch between them as easily as switching primitives and that can toggle having the tangent and binormal (and change the vertex format more, but I think everything else is irrelevant by this point in Dolphin). I don't know if this is relevant in practice; the closest case might be Rogue Squadron 2/3 which deliberately don't send tangent/binormal vectors for the VAT they use in one case, but they change other things in between.)
Wouldn't that require binding like 15 different vertex buffers? Or were you thinking of some other way of accessing them all? |
317ad6d
to
a9e26e6
Compare
I don't have an exact idea in mind, but it probably would require binding 15 different buffers, yeah. The main benefit would be reducing the amount of calculation that needs to be done to find the index into those buffers since you would always know the correct buffer and stride for given component. But I'm not sure if that would actually be any faster (or if it's possible to avoid syncing data for buffers that aren't relevant), and it would require lots of other work. |
f9c7b37
to
678d75b
Compare
|
Other than that things look mostly reasonable, though I don't really understand the D3D or Vulkan setup code. I don't see anything that looks obviously wrong, at least. |
Makes it easier to disable in the future if support for VK_EXT_vertex_input_dynamic_state is added
678d75b
to
e98e84b
Compare
Also removes cullmode all handling, it's handled in CPU and DX11 backend doesn't specially handle it either
e98e84b
to
ee692ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I don't 100% understand everything, but I think it's good enough to merge for users to find any remaining subtle bugs.
Some fun things I've been working on, figured I'd post it here for other people to try out, even if it's not quite ready for merge
Adds a vertex loader to vertex ubershaders which removes the need to do anything with the input assembler, reducing the number of unique pipelines needed. This makes the number of different pipelines that actually require recompiles small enough that we can compile them all ahead of time on most renderers. In particular:
The implementation puts the list of offsets and strides into the vertex uniform, which the shader uses to figure out where it should load from in an SSBO.
I'm not super confident in my Vulkan code and even less in my DX12 code, so if someone could look those over especially closely that would be nice.
Before and After recordings
(Running with all pipeline and driver caches cleared, exclusive ubershaders, compile shaders before starting)Before:
Base.mp4
After (that one stutter is an EFB copy pipeline):
UberVertexLoader.mp4
Note: Just waiting on #10747All ready