Cull vertices on the CPU #11208

Merged
merged 4 commits into dolphin-emu:master from TellowKrinkle:CPUCull on Jan 26, 2023

Conversation

TellowKrinkle
Contributor

Twilight Princess and Metroid Prime 3 seem to like alternating between lines/points and degenerate triangles (working around some hardware bug?). Since each switch requires a change in draw configuration, this puts a lot of strain on the render backend, as it encodes ~27k draw calls per frame in the case of Twilight Princess.

To avoid this, pre-run the coordinate transform part of the vertex shader on the CPU and cull triangles there, skipping a draw entirely if all of its vertices would get culled by the GPU anyway. This nearly halves the number of draw calls in Twilight Princess (down to ~15k) and raises its frame rate from ~22fps to ~32fps on my laptop, a nearly 50% improvement!

This also seems to help Densha De Go, giving a ~15% improvement. Not sure what it's drawing, but apparently a decent portion is either off screen or facing backwards. (14k draw calls to 9k)
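
For intuition, here is a minimal scalar sketch of the kind of check involved. This is not the PR's actual code (which is vectorized and follows the GPU's clip rules); the struct, function names, and the -w..w clip volume below are illustrative only.

struct Vec4
{
  float x, y, z, w;
};

// Bitmask of the frustum planes a clip-space position is outside of.
static int ClipOutcode(const Vec4& v)
{
  int code = 0;
  if (v.x < -v.w) code |= 0x01;  // left
  if (v.x >  v.w) code |= 0x02;  // right
  if (v.y < -v.w) code |= 0x04;  // bottom
  if (v.y >  v.w) code |= 0x08;  // top
  if (v.z < -v.w) code |= 0x10;  // near
  if (v.z >  v.w) code |= 0x20;  // far
  return code;
}

// A triangle can be trivially rejected when all three vertices are outside the
// same frustum plane; a draw call can be skipped only when every triangle in
// it is rejected.
static bool IsTriviallyRejected(const Vec4& v0, const Vec4& v1, const Vec4& v2)
{
  return (ClipOutcode(v0) & ClipOutcode(v1) & ClipOutcode(v2)) != 0;
}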

TODO:

  • GraphicsMods can apparently do things to the projection matrix, and that isn't currently taken into account properly. Is that a problem? What kind of graphics mods is this meant for? (If they don't move previously-offscreen things on screen or spin them around, it won't cause a problem, but if they do, they could be culled accidentally)
  • Arm NEON code is completely untested
  • Should I be enabling this by default? Should I put it in the config UI or leave it hidden?
  • How is performance affected for normal games? I'm seeing ~1-3% CPU time for the culling functions on affected games, but that might be lower in games that don't issue so many draw calls (because we only CPU cull if there's a chance we could remove an entire draw call)

@JMC47
Contributor

JMC47 commented Oct 25, 2022

If it's globally faster + safe, I see no reason to need an option for it.

@dvessel
Contributor

dvessel commented Oct 25, 2022

I can recreate some of the visual glitches listed through FifoCI. Is it too early for performance testing?

@JMC47
Contributor

JMC47 commented Oct 25, 2022

Yeah, Rogue Squadron 2/3 see huge benefits from this, but they also suffer from big visual glitches lol

@JMC47
Contributor

JMC47 commented Oct 25, 2022

Going through a couple of demanding games, the potential for absolutely disgustingly big gains is there, especially in games that aren't very good at culling. Unfortunately right now there's a lot of minor flickering in a lot of titles.

@TellowKrinkle
Contributor Author

I can recreate some of the visual glitches listed through FifoCI. Is it too early for performance testing?

Yeah, wait for the visual glitches to be fixed before performance testing. Since the increased performance comes from drawing fewer things, performance will be higher the more things we cull... even if those things were actually supposed to be drawn.

@JMC47
Contributor

JMC47 commented Oct 25, 2022

With most games, it's just a missing floor polygon here and there. Rogue Squadron 2/3 might not be compatible due to their zfreeze bullshit.

@iwubcode
Contributor

iwubcode commented Oct 26, 2022

So I have mixed feelings on this. I definitely think this is a great feature to provide but I am not sure how I feel about it being always on.

A few comments:

  • Some users don't want less content to be drawn, they want more. Particularly when you think about FreeLook or a potential VR solution in the future. You want all the vertices you can get.
  • Re:Graphics Mods:
    • Currently, users have a way to move objects using the same projection matrix trick we use for FreeLook. At the moment 2D is the more reasonable use case because I don't handle the 'z' component which means 3D content can barely be moved.
    • There is a "bug" with the projection-based move capability in specific instances with some games. In one example I saw, if you try to move something to the left, it actually moves to the right because of how the matrix multiplication works between the projection matrix and the view/model matrix. The solution I hoped to use was, instead of using the projection matrix, to look at the vertices and move those.
    • Also, there are a handful of features I want to accomplish using the vertex data. For instance, being able to use the vertices to tag a specific object by hashing up the vertex positions. My plan was to add a callback into the graphics mods with the vertices. I haven't looked into it in detail but this seems to be challenging due to the loader and assembly in place. This cull code is very interesting as a reference but I also don't want to cull the data in those scenarios.

My personal approach to this would be to introduce the callback based on the vertex data. Then we can introduce a "Cull non-visible geometry" as a graphics mod or modders can cull individual pieces of geometry. Alternative features using the vertex positions/transforms could be done as well.

That being said, I'm fine with the feature as is too; I'd just like it to be optional so that I can disable it in scenarios where culling is not desired.

@Pokechu22
Contributor

Some users don't want less content to be drawn, they want more. Particularly when you think about FreeLook or a potential VR solution in the future. You want all the vertices you can get.

I don't think that's something that has to be a problem. As long as the CPU culling takes into account freelook or any other transformations that would also apply on the GPU, the CPU culling would only be removing invisible stuff. I'm not sure if this current implementation does that or not though.

@iwubcode
Contributor

iwubcode commented Oct 26, 2022

I don't think that's something that has to be a problem

Good point @Pokechu22 ! Actually, I was just thinking of something I did in the post processing overhaul. I apply all post processing shaders then I apply the stereoscopic shaders if enabled.

We could do something similar here.

  • Process the graphics mods based on the vertices / transforms, handling any moving / repositioning (or any other operation) that occurs
  • Cull the vertices after all transformations based on the FreeLook matrix

I agree, I think that would address all my concerns. That means I can probably make my changes for graphics mod support in a separate PR (I need to look at this code in more detail to see how I could make it more generic).

@Pokechu22
Contributor

This comment in the software renderer's clipper may be relevant for zfreeze:

if (IsTriviallyRejected(v0, v1, v2))
{
INCSTAT(g_stats.this_frame.num_triangles_rejected)
// NOTE: The slope used by zfreeze shouldn't be updated if the triangle is
// trivially rejected during clipping
return;
}

Additionally, zfreeze still gets updated if a triangle is backfacing. But this means that (as long as you're only doing trivial rejection on the CPU) you can do that, then update zfreeze with the last triangle in the batch, and then cull according to CullMode.
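
Restating that ordering as code (a sketch only; the helper names are hypothetical, not Dolphin's actual functions):

// Assumed helpers for illustration.
void UpdateZFreezeSlope(const Vec4&, const Vec4&, const Vec4&);
bool IsBackfacing(const Vec4&, const Vec4&, const Vec4&);
void DrawTriangle(const Vec4&, const Vec4&, const Vec4&);

void ProcessTriangle(const Vec4& v0, const Vec4& v1, const Vec4& v2, bool cull_backfacing)
{
  if (IsTriviallyRejected(v0, v1, v2))
    return;  // no draw, and no zfreeze slope update either

  UpdateZFreezeSlope(v0, v1, v2);  // still happens for backfacing triangles

  if (cull_backfacing && IsBackfacing(v0, v1, v2))
    return;  // culled per CullMode, but zfreeze was already updated

  DrawTriangle(v0, v1, v2);
}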

Regarding what "trivial rejection" actually refers to: xfmem.clipDisable has 3 bits, one of which disables it. Here are images of the same scene zoomed out using the hardware fifoplayer: https://imgur.com/a/XqdyV2d (note that if something is out of the -1 to 1 range after projection, clipping will affect it even if the viewport is set up to put it on screen, though those screenshots indicate triangles in the range -2 to +2 are not affected by clipping if they're on screen. Mario Party 8 accidentally depends on this).

Also, note that the fifoci diffs are between versions of this pull request - if you run git rebase --force master then it'll force it to be a diff from master (since the rebased version would not have matching parent commits).

@TellowKrinkle
Contributor Author

Some users don't want less content to be drawn, they want more. Particularly when you think about FreeLook or a potential VR solution in the future. You want all the vertices you can get.

CPUCull takes free look into account. It has trouble with the graphicsmods because it wants the transformation matrix earlier than before, in a place that gets called more often than the current call to SetConstants(). The graphics mod check seems to require loading all the textures used, which seemed like a rather expensive check to be moving over, especially since we don't even know if we'll need to load the textures at all at this point.

@JMC47
Contributor

JMC47 commented Oct 26, 2022

It seems games that do weird depth bullshit confuse this.

@iwubcode
Contributor

iwubcode commented Oct 26, 2022

CPUCull takes free look into account

If that is the case, great!

The graphics mod check seems to require loading all the textures used, which seemed like a rather expensive check to be moving over, especially since we don't even know if we'll need to load the textures at all at this point.

Yeah currently graphics mods use the texture as a key to trigger any sort of modification. So please don't move that over. It would indeed be expensive.

If this PR ends up being merged, I will need to use the vertices to ID the draw calls, which can then be used for the move/scale operations and a number of other things.

I haven't assimilated all the details but it seems like I should be able to make this code more generic. Great work @TellowKrinkle !

@JMC47
Contributor

JMC47 commented Oct 26, 2022

I tested a few more games on this, and haven't been noticing as many issues with flickering.

@JMC47
Contributor

JMC47 commented Oct 29, 2022

I did a quick glance at the differences; it looks like Samurai Warriors 3 is the only legitimately broken one. I know that game has super weird depth. The others look like more early memory update/first frame shenanigans.

@Pokechu22
Contributor

Reusing the vertices on the GPU would be difficult, since the current implementation turns off cpu cull to save CPU cycles as soon as it knows enough to determine it wouldn't be helpful, and therefore doesn't transform all vertices that pass through to the GPU.

Hmm, I had actually failed to pick up on that aspect of it. I did see this comment, but didn't think about how that actually affected things:

// CPUCull's performance increase comes from encoding fewer GPU commands, not sending less data
// Therefore it's only useful to check if culling could remove a flush
const bool can_cpu_cull = g_ActiveConfig.bCPUCull &&
primitive < OpcodeDecoder::Primitive::GX_DRAW_LINES &&
!g_vertex_manager->HasSendableVertices();

It might be better to rename functions from e.g. CullVertices to AreAllVerticesCulled to more accurately convey that that's what happens.

@JMC47
Contributor

JMC47 commented Jan 11, 2023

Is this ready or does it need more changes?

@TellowKrinkle
Contributor Author

TellowKrinkle commented Jan 12, 2023

I haven't done the requested rename yet
(Just noticed it right now)

Done

@shuffle2
Contributor

shuffle2 commented Jan 12, 2023

The way you've set up the sse/avx tiering won't work for msvc (meaning: currently none of the >sse2 code is active on msvc builds). I've heard rumors that in the future there will be better support, but for now, you must be able to compile translation units with different compiler flags and then link them together (if you want to have the same function compiled for multiple sse/avx levels, and have all versions of the function present in the same binary).
In practice, this isn't that gross: just create a .cpp file for each arch target (each of which includes the #ifdef'd code), then set the flags on those files specifically.
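
A rough sketch of that layout (the file names and the USE_AVX macro below are made up for illustration; only USE_FMA appears in the PR):

// CPUCull_SSE.cpp: compiled with the project's default flags
#include "VideoCommon/CPUCullImpl.h"

// CPUCull_AVX.cpp: only this file gets /arch:AVX on MSVC (or -mavx elsewhere)
#define USE_AVX
#include "VideoCommon/CPUCullImpl.h"

// CPUCull_FMA.cpp: only this file gets /arch:AVX2 (or -mavx -mfma)
#define USE_FMA
#include "VideoCommon/CPUCullImpl.h"

Each translation unit then holds its own copy of the functions compiled at the matching arch level, and all of them are linked into the same binary for runtime dispatch.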

Also ... I haven't looked that closely at the actual code, but is there opportunity for avx512 here? :)

@TellowKrinkle
Contributor Author

The way you've set up the sse/avx tiering won't work for msvc (meaning: currently none of the >sse2 code is active on msvc builds).

You sure about that? https://gcc.godbolt.org/z/Y3r3rYG14

(I won't have access to a windows computer for a week or so and the buildbot's builds have no pdbs, so I can't verify in dolphin, but if you want to build locally and check / upload an exe + pdb for me to check, that would be nice)

Also ... I haven't looked that closely at the actual code, but is there opportunity for avx512 here? :)

Yes, but I don't have a computer to test that on, and don't really feel like guessing and hoping for the best here
(BTW if I were to target avx512, do you think I should target 512-bit registers even though most avx512 processors currently don't process them faster than 256 bit registers, or should I aim to save a few instructions in 256-bit with prefix registers?)

@shuffle2
Contributor

shuffle2 commented Jan 12, 2023

The way you've set up the sse/avx tiering won't work for msvc (meaning: currently none of the >sse2 code is active on msvc builds).

You sure about that? https://gcc.godbolt.org/z/Y3r3rYG14

(I won't have access to a windows computer for a week or so and the buildbot's builds have no pdbs, so I can't verify in dolphin, but if you want to build locally and check / upload an exe + pdb for me to check, that would be nice)

Intrinsics on msvc behave differently than you would expect if you come from a gcc background. On msvc, intrinsics are always available no matter the compiler flags (given that the intrinsic being used exists on the base arch - can't use arm intrinsics on x64 obviously). However, the actual code emitted by the intrinsic may depend on compiler flags. For example, certain intrinsics may generate VEX-prefixed opcodes or not depending on flags. Additionally, the compiler flags affect the preprocessor macros defined automatically by the compiler. For example, since we do not enable AVX in Dolphin compiler flags, the code inside #ifdef __AVX__ and similar in this PR will not be active.

Important differences in codegen when changing arch-level compiler flags in msvc to keep in mind:

  • intrinsics may emit different code
  • the compiler is free to emit code throughout the translation unit up to the arch level specified in the compiler flag. e.g. there may be memset/memcpy optimizations automatically generated for non-intrinsic-using c++ code.
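
As a standalone illustration of the macro point above (not Dolphin code): on MSVC the 256-bit intrinsics below always compile, but the __AVX__-guarded path only exists when the whole translation unit is built with /arch:AVX or higher.

#include <immintrin.h>

// Sums eight floats. The AVX path is compiled in only when __AVX__ is defined
// by the compiler flags, even though _mm256_* intrinsics are callable on MSVC
// regardless of /arch.
float SumEight(const float* p)
{
#ifdef __AVX__
  __m256 v = _mm256_loadu_ps(p);
  __m128 s = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
#else
  float sum = 0.0f;
  for (int i = 0; i < 8; i++)
    sum += p[i];
  return sum;
#endif
}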

Also ... I haven't looked that closely at the actual code, but is there opportunity for avx512 here? :)

Yes, but I don't have a computer to test that on, and don't really feel like guessing and hoping for the best here (BTW if I were to target avx512, do you think I should target 512-bit registers even though most avx512 processors currently don't process them faster than 256 bit registers, or should I aim to save a few instructions in 256-bit with prefix registers?)

My guess is that it would be better to target 512-bit registers and let the cpu deal with it (and it allows the code to be a little future-proof in a lazy way). However this doesn't really need to be a part of this PR, especially if you don't have a way to test it.

Here is exe+pdb from this PR on the buildbot if you want to take a look: https://drive.google.com/file/d/1rptOdaQmenGj_sOAGT4FIe7AdTnj8CtF/view?usp=sharing

...having typed that up, in the above binary, I do see:

void __fastcall CPUCull::Init(CPUCull *this)
{
  // [COLLAPSED LOCAL DECLARATIONS. PRESS KEYPAD CTRL-"+" TO EXPAND]

  if ( cpu_info.bAVX )
  {
    v1 = CPUCull_AVX::TransformVertices_0_0_;
    if ( cpu_info.bFMA )
      v1 = CPUCull_FMA::TransformVertices_0_0_;
  }
  else
  {
    v1 = CPUCull_SSE::TransformVertices_0_0_;
  }

Considering godbolt (see https://gcc.godbolt.org/z/q33G55f9K) and the msdn docs agree with what I said, this is kind of a head-scratcher for me :)

ah... this is because the PR code is if (MIN_SSE >= 51 || (cpu_info.bAVX && cpu_info.bFMA)). I ...guess... this is fine? You may just have some suboptimal codegen on msvc because the compiler could miss some inline/optimization opportunities between the chunks of code with intrinsics. and as noted, the actual code emitted by the intrinsic may not be exactly what you expect.

@TellowKrinkle
Contributor Author

ah... this is because the PR code is if (MIN_SSE >= 51 || (cpu_info.bAVX && cpu_info.bFMA)). I ...guess... this is fine?

Yeah, all the MIN_SSE does is compile out sections that will be completely useless in the case that someone does compile with a higher minimum sse/avx version. (It's the minimum sse version supported by the build, not the maximum)
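
In other words, the dispatch amounts to something like this (a simplified sketch; the real PR code, function signatures, and the exact MIN_SSE encoding differ):

// If the whole build already targets a high enough level (MIN_SSE), the lower
// branches are statically dead and get compiled out; otherwise cpu_info picks
// the best variant at runtime.
using TransformFn = void (*)(const void* src, void* dst, int count);

static TransformFn PickTransform()
{
  if (MIN_SSE >= 51 || (cpu_info.bAVX && cpu_info.bFMA))
    return CPUCull_FMA::TransformVertices;
  if (cpu_info.bAVX)
    return CPUCull_AVX::TransformVertices;
  return CPUCull_SSE::TransformVertices;
}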

@TellowKrinkle
Contributor Author

Took a peek at the fma codegen
It does seem to fall back to sse codegen in the final single-element cleanup (even though there are fma intrinsics sprinkled in), which is a bit goofy, but it should perform fine, just with a few extra movs and loads/stores to the stack. The most important part, the 256-bit ops in the hot loop, looks as good as you can get with msvc (even if that highlighted unnecessary vmovaps slightly bothers me).
(screenshot of the disassembly omitted)
tbh I'm more annoyed at the fact that it didn't put names for the static functions into the debug symbols, though it might just be my disassembler not reading the pdb properly

@shuffle2
Contributor

shuffle2 commented Jan 13, 2023

out of curiosity i tried compiling this pr with /arch:AVX2 (for the entire codebase): https://drive.google.com/file/d/1iNjRe4RJ9BMSkcv7pXcC7sEL1Qi3Tba2/view?usp=sharing
the function you show looks like this: https://gist.github.com/shuffle2/f0890a8ed022aabb3cd1aca0e6dc42c6
the only difference is outside of the loop as you indicated before. i haven't checked other functions.
edit: ah, the register spill in the prologue is also using vex-prefix.

tbh I'm more annoyed at the fact that it didn't put names for the static functions into the debug symbols, though it might just be my disassembler not reading the pdb properly

all names should be present, including inlined information. however even in tools using the standard pdb libraries, you typically do not see the inlined info unless you have an active thread context/stack, since inlined things can be interleaved in confusing ways. but, if you have a way to load some 'real' pdb parsing library (via wine or something?) it may work better than alternatives.

@shuffle2
Contributor

shuffle2 commented Jan 13, 2023

Ah, here's an interesting effect: since msvc winds up emitting the same opcodes for CPUCull_SSE3::AreAllVerticesCulled and CPUCull_AVX::AreAllVerticesCulled (on the default /arch setting of sse2), the compiler actually deduplicates the function bodies and the avx version disappears:

  if ( cpu_info.bAVX )
  {
    v18 = CPUCull_SSE3::CullVertices_3_0_; // !!
  }
  else
  {
    if ( cpu_info.bSSE3 )
    {
      v18 = CPUCull_SSE3::CullVertices_3_0_;
      goto LABEL_67;
    }
    v18 = CPUCull_SSE::CullVertices_3_0_;
  }

here's the object file which still includes both copies of the function if you're interested: https://drive.google.com/file/d/1V7JsqU0dfSPRifdHIlSdZMEAA11PL4nj/view?usp=sharing
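
For what it's worth, since the object file still has both copies, this looks like the linker's identical-COMDAT folding (/OPT:ICF, typically enabled together with /Gy for optimized MSVC builds) rather than the compiler itself. A minimal illustration, not tied to this PR:

// With /Gy and /OPT:ICF, two functions that compile to identical bytes can be
// folded to a single address, so distinct-looking function pointers may
// compare equal after linking.
__declspec(noinline) int AddA(int x, int y) { return x + y; }
__declspec(noinline) int AddB(int x, int y) { return x + y; }

// &AddA == &AddB may evaluate to true in the linked binary, which is the same
// effect that makes CPUCull_AVX's copy disappear when it matches the SSE3 one.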

#include "VideoCommon/CPUCullImpl.h"
#define USE_FMA
#include "VideoCommon/CPUCullImpl.h"
#endif

I think the code is already written in a good way. You should only need to transform this sequence of includes into individual files which do the define + include, then set flags on those files as needed.

@TellowKrinkle
Contributor Author

TellowKrinkle commented Jan 14, 2023

https://gist.github.com/shuffle2/f0890a8ed022aabb3cd1aca0e6dc42c6

00000001`405d050e c5fc116d00      vmovups ymmword ptr [rbp],ymm5
00000001`405d0513 c5fc11b580000000 vmovups ymmword ptr [rbp+80h],ymm6
00000001`405d051b c5fc11bda0000000 vmovups ymmword ptr [rbp+0A0h],ymm7
00000001`405d0523 c57c1185c0000000 vmovups ymmword ptr [rbp+0C0h],ymm8
00000001`405d052b c57c118de0000000 vmovups ymmword ptr [rbp+0E0h],ymm9
00000001`405d0533 c57c115520      vmovups ymmword ptr [rbp+20h],ymm10
00000001`405d0538 c5fc116540      vmovups ymmword ptr [rbp+40h],ymm4
00000001`405d053d c57c115d60      vmovups ymmword ptr [rbp+60h],ymm11

Kind of disappointed, I was hoping for this to be removed by doing /arch:AVX2 (since now the whole function knows it can use ymm registers). Guess the "few extra loads/stores to the stack" were just MSVC codegen being great as always.

The avx version of CullVertices was mostly done because it was two lines of code to add with the stuff already set up for TransformVertices (the only benefit would be the removal of extra movaps instructions, which mainly helps Sandy Bridge processors, since Ivy Bridge+ has mov elimination for xmm's), so I'm not sure I really want to bother splitting this into 4 files just for this. (Kind of considered it for the movups [rbp+xxx], ymms because while they're fast for most CPUs, they're 20 cycles each on AMD Piledriver for some reason, meaning that one block adds a cost approximately equivalent to transforming 14 vertices, but it looks like that's unavoidable on MSVC)

@shuffle2
Contributor

shuffle2 commented Jan 14, 2023

optimizing for piledriver (or anything before zen) would be a waste of time imo (zero such cpus show up in dolphin analytics)[1].

have you diff'd e.g. CPUCull_SSE3::CullVertices_0_0_ and CPUCull_AVX::AreAllVerticesCulled_0_0_ in the above sse2 and avx binaries? the code looks completely different.

In general, if you see suboptimal/bad codegen, it would be great if you could file a bug at https://developercommunity.visualstudio.com/cpp/report (this stack spill does look like such a case)

1: To expand on that, I think we're at the point where we can essentially rely on AVX and AVX2 being present on cores running dolphin. It's nice to provide some fallback paths, but IRL only a generic/fallback path and something >=AVX2 is really needed.

@TellowKrinkle
Contributor Author

have you diff'd e.g. CPUCull_SSE3::CullVertices_0_0_ and CPUCull_AVX::AreAllVerticesCulled_0_0_ in the above sse2 and avx binaries ? the code looks completely different.

They look pretty similar to me

Code comparison (disassembly listings and uiCA analysis links collapsed in the original comment):

  • AVX (uiCA analysis)
  • SSE (uiCA analysis)

There is a noticeable speed difference on Sandy Bridge, but not on newer architectures.

In general, if you see suboptimal/bad codegen, it would be great if you could file a bug at https://developercommunity.visualstudio.com/cpp/report (this stack spill does look like such a case)

Done.

@shuffle2
Contributor

I would say that's "completely different", considering that you'd think you're getting the AVX version, but you're really (on the current msvc build) getting the SSE one instead.

Maybe just put a flashy comment in the source code to the effect of it not really being AVX on msvc currently? Otherwise meh w/e

@TellowKrinkle
Contributor Author

Maybe just put a flashy comment in the source code to the effect of it not really being AVX on msvc currently?

Something like that?

@shuffle2
Contributor

Lgtm

@JMC47
Contributor

JMC47 commented Jan 26, 2023

LGTM. Tested The Legend of Zelda: Twilight Princess and Metroid Prime 3 and saw significant gains.

@golivax

golivax commented Jan 26, 2023

Newbie question: will Android users also most likely see a perf. boost on those two games (and potentially others) or does the CPU Culling only make sense/apply to desktop CPUs? Tks and sorry for the hijack

@JMC47
Contributor

JMC47 commented Jan 26, 2023

Android will also see a performance boost, sometimes a bigger performance boost depending on the bottleneck. In Twilight Princess, my desktop is pretty good at powering through the minimap, but my phone is really slow at it. So it helps a lot more on my phone.

@delroth
Member

delroth commented Jan 26, 2023

Merging this now, it's disabled by default anyway which should significantly reduce regression risks. Really cool feature... but now I wonder at which point do we just say "fuck it" and just run all the vertex processing on the CPU and keep the GPU for fragments only :p

@delroth delroth merged commit 9c9310b into dolphin-emu:master Jan 26, 2023
14 checks passed
@TellowKrinkle TellowKrinkle deleted the CPUCull branch January 28, 2023 22:47