Cull vertices on the CPU #11208
edb9b6d
to
2fb062f
Compare
|
If it's globally faster + safe, I see no reason to need an option for it. |
2fb062f
to
b3d6f81
Compare
|
I can recreate some of the visual glitches listed through FifoCI. Is it too early for performance testing? |
|
Yeah, Rogue Squadron 2/3 see huge benefits from this, but they also suffer from big visual glitches lol |
|
Going through a couple of demanding games, the potential for absolutely disgustingly big gains is there, especially in games that aren't very good at culling. Unfortunately right now there's a lot of minor flickering in a lot of titles. |
Yeah, wait for visual glitches to be fixed before performance testing. Since the increased performance comes from drawing fewer things, performance will be higher the more things we cull... even if those things were actually supposed to be drawn. |
|
With most games, it's just a missing floor polygon here and there. Rogue Squadron 2/3 might not be compatible due to their zfreeze bullshit. |
|
So I have mixed feelings on this. I definitely think this is a great feature to provide but I am not sure how I feel about it being always on. A few comments:
My personal approach to this would be to introduce the callback based on the vertex data. Then we can introduce a "Cull non-visible geometry" as a graphics mod or modders can cull individual pieces of geometry. Alternative features using the vertex positions/transforms could be done as well. That being said, I'm fine with the feature as is too, just would like to make it optional, so I can disable it in scenarios where culling is not desired. |
I don't think that's something that has to be a problem. As long as the CPU culling takes into account freelook or any other transformations that would also apply on the GPU, the CPU culling would only be removing invisible stuff. I'm not sure if this current implementation does that or not though. |
Good point @Pokechu22 ! Actually, I was just thinking of something I did in the post processing overhaul. I apply all post processing shaders then I apply the stereoscopic shaders if enabled. We could do something similar here.
I agree, I think that would address all my concerns. That means I can probably make my changes for graphics mod support in a separate PR (I need to look at this code in more detail to see how I could make it more generic). |
510ae67
to
d0a0a3c
Compare
|
This comment in the software renderer's clipper may be relevant for zfreeze: dolphin/Source/Core/VideoBackends/Software/Clipper.cpp Lines 292 to 298 in 060d928
Additionally, zfreeze still gets updated if a triangle is backfacing. But this means that (as long as you're only doing trivial rejection on the CPU) you can do that, then update zfreeze with the last triangle in the batch, and then cull according to Regarding what "trivial rejection" actually refers to, Also, note that the fifoci diffs are between versions of this pull request - if you run |
d0a0a3c
to
a16374e
Compare
CPUCull takes free look into account. It has trouble with the graphicsmods because it wants the transformation matrix earlier than before, in a place that gets called more often than the current call to |
|
It seems games that do weird depth bullshit confuse this. |
If that is the case. Great!
Yeah currently graphics mods use the texture as a key to trigger any sort of modification. So please don't move that over. It would indeed be expensive. If this PR ends up being merged, I will need to use the vertices to ID the draw calls, which can then be used for the move/scale operations and a number of other things. I haven't assimilated all the details but it seems like I should be able to make this code more generic. Great work @TellowKrinkle ! |
a16374e
to
6e62521
Compare
|
I tested a few more games on this, and haven't been noticing as many issues with flickering. |
6e62521
to
54c802c
Compare
|
I took a quick glance at the differences; it looks like Samurai Warriors 3 is the only legitimately broken one. I know that game has super weird depth. The others look more like early memory update/first frame shenanigans. |
9b4896d
to
bb5fe79
Compare
Hmm, I had actually failed to pick up on that aspect of it. I did see this comment, but didn't think about how that actually affected things: dolphin/Source/Core/VideoCommon/VertexLoaderManager.cpp Lines 370 to 374 in 2f3375c
It might be better to rename functions from e.g. |
|
Is this ready or does it need more changes? |
|
|
|
The way you've set up the sse/avx tiering won't work for msvc (meaning: currently none of the >sse2 code is active on msvc builds). I've heard rumors that in the future there will be better support, but for now, you must be able to compile translation units with different compiler flags and then link them together (if you want to have the same function compiled for multiple sse/avx levels, and have all versions of the function present in the same binary). Also ... I haven't looked that closely at the actual code, but is there opportunity for avx512 here? :) |
You sure about that? https://gcc.godbolt.org/z/Y3r3rYG14 (I won't have access to a windows computer for a week or so and the buildbot's builds have no pdbs, so I can't verify in dolphin, but if you want to build locally and check / upload an exe + pdb for me to check, that would be nice)
Yes, but I don't have a computer to test that on, and don't really feel like guessing and hoping for the best here |
2f3375c
to
cfe373c
Compare
Intrinsics on msvc behave differently than you would expect if you come from a gcc background. On msvc, intrinsics are always available no matter the compiler flags (given that the intrinsic being used exists on the base arch - can't use arm intrinsics on x64 obviously). However, the actual code emitted by the intrinsic may depend on compiler flags. For example, certain intrinsics may generate VEX-prefixed opcodes or not depending on flags. Additionally, the compiler flags affect the preprocessor macros defined automatically by the compiler. For example, since we do not enable AVX in Dolphin compiler flags, the code inside Important differences in codegen when changing arch-level compiler flags in msvc to keep in mind:
My guess is that it would be better to target 512-bit registers and let the cpu deal with it (and it allows the code to be a little future-proof in a lazy way). However this doesn't really need to be a part of this PR, especially if you don't have a way to test it. Here is exe+pdb from this PR on the buildbot if you want to take a look: https://drive.google.com/file/d/1rptOdaQmenGj_sOAGT4FIe7AdTnj8CtF/view?usp=sharing ...having typed that up, in the above binary, I do see:
void __fastcall CPUCull::Init(CPUCull *this)
{
// [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]
if ( cpu_info.bAVX )
{
v1 = CPUCull_AVX::TransformVertices_0_0_;
if ( cpu_info.bFMA )
v1 = CPUCull_FMA::TransformVertices_0_0_;
}
else
{
v1 = CPUCull_SSE::TransformVertices_0_0_;
}
Considering godbolt (see https://gcc.godbolt.org/z/q33G55f9K) and the msdn docs agree with what I said, this is kind of a head-scratcher for me :) ah... this is because the PR code is |
Yeah, all the |
|
out of curiosity i tried compiling this pr with /arch:AVX2 (for the entire codebase): https://drive.google.com/file/d/1iNjRe4RJ9BMSkcv7pXcC7sEL1Qi3Tba2/view?usp=sharing
all names should be present, including inlined information. however even in tools using the standard pdb libraries, you typically do not see the inlined info unless you have an active thread context/stack, since inlined things can be interleaved in confusing ways. but, if you have a way to load some 'real' pdb parsing library (via wine or something?) it may work better than alternatives. |
|
Ah, here's an interesting effect: since msvc winds up emitting the same opcodes for
if ( cpu_info.bAVX )
{
v18 = CPUCull_SSE3::CullVertices_3_0_; // !!
}
else
{
if ( cpu_info.bSSE3 )
{
v18 = CPUCull_SSE3::CullVertices_3_0_;
goto LABEL_67;
}
v18 = CPUCull_SSE::CullVertices_3_0_;
}
here's the object file which still includes both copies of the function if you're interested: https://drive.google.com/file/d/1V7JsqU0dfSPRifdHIlSdZMEAA11PL4nj/view?usp=sharing |
| #include "VideoCommon/CPUCullImpl.h" | ||
| #define USE_FMA | ||
| #include "VideoCommon/CPUCullImpl.h" | ||
| #endif |
I think the code is already written in a good way. You should only need to transform this sequence of includes into individual files which do the define + include, then set flags on those files as needed.
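As a rough illustration of the arrangement being discussed, here is a minimal sketch of picking one of several per-tier builds of the same function at startup. Everything in it is hypothetical: `CpuFeatures`, the `Tier_*` namespaces (standing in for separately compiled translation units, each built with its own arch flags), and `SelectCull` are illustrative names, not Dolphin's actual code. Each variant simply reports which tier it is so the selection logic can be checked.

```cpp
// Hypothetical stand-in for the emulator's detected CPU features.
struct CpuFeatures
{
  bool sse3;
  bool avx;
  bool fma;
};

enum class Tier { SSE, SSE3, AVX, FMA };

// Each namespace stands in for one translation unit that would do
// "#define <tier macro>" + "#include <shared impl header>" and be compiled
// with the matching arch flags. Here the body just identifies the tier.
namespace Tier_SSE  { constexpr Tier Cull() { return Tier::SSE; } }
namespace Tier_SSE3 { constexpr Tier Cull() { return Tier::SSE3; } }
namespace Tier_AVX  { constexpr Tier Cull() { return Tier::AVX; } }
namespace Tier_FMA  { constexpr Tier Cull() { return Tier::FMA; } }

using CullFn = Tier (*)();

// Pick the most capable variant the CPU supports, falling back step by step.
constexpr CullFn SelectCull(const CpuFeatures& cpu)
{
  if (cpu.avx)
    return cpu.fma ? &Tier_FMA::Cull : &Tier_AVX::Cull;
  if (cpu.sse3)
    return &Tier_SSE3::Cull;
  return &Tier_SSE::Cull;
}
```

The selected pointer would be stored once at init and called per draw. Whether each variant really contains the instructions its name suggests depends on how the compiler treats the per-file flags, which is the concern raised in this thread.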
00000001`405d050e c5fc116d00 vmovups ymmword ptr [rbp],ymm5
00000001`405d0513 c5fc11b580000000 vmovups ymmword ptr [rbp+80h],ymm6
00000001`405d051b c5fc11bda0000000 vmovups ymmword ptr [rbp+0A0h],ymm7
00000001`405d0523 c57c1185c0000000 vmovups ymmword ptr [rbp+0C0h],ymm8
00000001`405d052b c57c118de0000000 vmovups ymmword ptr [rbp+0E0h],ymm9
00000001`405d0533 c57c115520 vmovups ymmword ptr [rbp+20h],ymm10
00000001`405d0538 c5fc116540 vmovups ymmword ptr [rbp+40h],ymm4
00000001`405d053d c57c115d60 vmovups ymmword ptr [rbp+60h],ymm11
Kind of disappointed, I was hoping for this to be removed by doing The avx version of |
|
optimizing for piledriver (or anything before zen) would be a waste of time imo (zero such cpus show up in dolphin analytics)[1]. have you diff'd e.g. In general, if you see suboptimal/bad codegen, it would be great if you could file a bug at https://developercommunity.visualstudio.com/cpp/report (this stack spill does look like such a case) 1: To expand on that, I think we're at the point where we can essentially rely on AVX and AVX2 being present on cores running dolphin. It's nice to provide some fallback paths, but IRL only a generic/fallback path and something >=AVX2 is really needed. |
They look pretty similar to me There is a noticeable speed difference on Sandy Bridge, but not on newer architectures.
Done. |
|
I would say that's "completely different", considering that you'd think you're getting the AVX version, but you're really (on the current msvc build) getting the SSE one instead. Maybe just put a flashy comment in the source code to the effect of it not really being AVX on msvc currently? Otherwise meh w/e |
cfe373c
to
7413be1
Compare
Something like that? |
|
Lgtm |
|
LGTM. Tested The Legend of Zelda: Twilight Princess and Metroid Prime 3 and saw significant gains. |
|
Newbie question: will Android users also most likely see a perf. boost on those two games (and potentially others) or does the CPU Culling only make sense/apply to desktop CPUs? Tks and sorry for the hijack |
|
Android will also see a performance boost, sometimes a bigger performance boost depending on the bottleneck. In Twilight Princess, my desktop is pretty good at powering through the minimap, but my phone is really slow at it. So it helps a lot more on my phone. |
|
Merging this now, it's disabled by default anyway which should significantly reduce regression risks. Really cool feature... but now I wonder at which point do we just say "fuck it" and just run all the vertex processing on the CPU and keep the GPU for fragments only :p |



Twilight Princess and Metroid Prime 3 seem to like alternating between lines/points and degenerate triangles (working around some hardware bug?). Since each switch requires a change in draw configuration, this puts a lot of strain on the render backend, as it encodes ~27k draw calls per frame in the case of Twilight Princess.
To avoid this, pre-run the coordinate transform part of the vertex shader on the CPU and cull triangles there, skipping the draw entirely if all of its vertices would get culled by the GPU. This nearly halves the number of draw calls in Twilight Princess, bringing it down to just 15k, and raises its frame rate from ~22fps to ~32fps on my laptop, a nearly 50% improvement!
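The idea can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the types and function names (`Vec3`, `ClipPos`, `Mat4`, `DrawIsFullyCulled`) are made up for the example, it assumes a plain triangle list, and it only tests the left/right/top/bottom clip planes, leaving out near/far handling, backface culling, zfreeze, and the SIMD paths.

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };
struct ClipPos { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major, hypothetical layout

// CPU copy of the position transform the vertex shader would perform.
constexpr ClipPos Transform(const Mat4& mvp, const Vec3& v)
{
  return ClipPos{
      mvp.m[0][0] * v.x + mvp.m[0][1] * v.y + mvp.m[0][2] * v.z + mvp.m[0][3],
      mvp.m[1][0] * v.x + mvp.m[1][1] * v.y + mvp.m[1][2] * v.z + mvp.m[1][3],
      mvp.m[2][0] * v.x + mvp.m[2][1] * v.y + mvp.m[2][2] * v.z + mvp.m[2][3],
      mvp.m[3][0] * v.x + mvp.m[3][1] * v.y + mvp.m[3][2] * v.z + mvp.m[3][3]};
}

// One bit per clip plane the vertex lies outside of.
constexpr unsigned Outcode(const ClipPos& p)
{
  unsigned code = 0;
  if (p.x < -p.w) code |= 1u;  // left
  if (p.x > p.w)  code |= 2u;  // right
  if (p.y < -p.w) code |= 4u;  // bottom
  if (p.y > p.w)  code |= 8u;  // top
  return code;
}

// True if every triangle in the list has all three vertices outside the
// same plane, in which case the whole draw can be skipped.
constexpr bool DrawIsFullyCulled(const Mat4& mvp, const Vec3* verts, std::size_t count)
{
  for (std::size_t i = 0; i + 2 < count; i += 3)
  {
    const unsigned shared = Outcode(Transform(mvp, verts[i])) &
                            Outcode(Transform(mvp, verts[i + 1])) &
                            Outcode(Transform(mvp, verts[i + 2]));
    if (shared == 0)
      return false;  // this triangle might be visible, so keep the draw
  }
  return true;
}
```

Only trivial rejection is done here: a triangle is dropped only when all three vertices fall outside the same plane, so anything that might touch the viewport is still sent to the GPU for exact clipping.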
This also seems to help Densha De Go, giving a ~15% improvement. Not sure what it's drawing, but apparently a decent portion is either off screen or facing backwards. (14k draw calls to 9k)
TODO: