Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CPU Core Count detection and Enable Parallel Shader Compilation #9414

Merged
merged 3 commits into from Jan 4, 2022

Conversation

DevJPM
Copy link
Contributor

@DevJPM DevJPM commented Jan 2, 2021

Alternative Title: Actually Allow the Shader Pre-Compiler to go brrrrrrr

Background

The other day I was trying to play Dolphin on a friend's desktop with an 8-core CPU where I noticed that during shader pre-compilation the CPU utilization was laughably low, corresponding to 1-2 threads at most. Further investigation revealed that this is due to GraphicSettings using a default value of 1 for this - which disables automatic thread count choice and can't be changed easily via the GUI. So I updated that to -1 (which has the semantic of "pick an automatic runtime default").
Then I also noticed the thread count is capped at 4 - which makes sense for asynchronous background shader compilation, but not for blocking pre-compilation upon which the user has to actively wait. So a new function was chosen to compute this, allocating more ressources and not capping.

While investigating all of the above I also noticed that Dolphin determines num_cores to be 8 for my 2C4T CPU. So I investigated and found that Dolphin tries to "manually" extract the number of cores from the CPUID... and gets the number of addressable cores back instead of actual cores, skipping the step of checking which of these actually have the same name and are thus actually the same. See this Stackoverflow Q&A for more info and guidance. So, beause all of this topology stuff is really complex and unneeded in Dolphin I just replaced it with a call to std::thread::hardware_concurrency and leave it to the standard library to worry about it.

Changes

This does this following things:

  • Default to the runtime automatic number of threads for (pre-) compiling shaders (which also leaves enough cores for the OS and / or other important Dolphin threads)
  • Adds a distinct automatic thread count computation for pre-compilation (which has less other things going on
    and should scale better beyond 4 cores)
  • Removes the unused logical_core_count field from the CPU detection
  • Changes the semantics of num_cores from maximaum addressable number of cores to actually available CPU cores
    (which is also how it was actually used)
  • Updates the setting of the HTT flag now that AMD no longer lies about it for its Zen processors

Validation Matrices

The below matrices should hopefully document the current state of stability testing as a function of OS, Backend, Hardware Vendor and Driver Version for this feature.

Windows

Backend Nvidia AMD dGPU AMD iGPU Intel iGPU
D3D11 460.89 20.10.35.02 ? 27.20.100.9077
D3D12 460.89 20.10.35.02 ? 27.20.100.9077
OpenGL 460.89 20.10.35.02 ? 27.20.100.9077
Vulkan 460.89 20.10.35.02 ? 27.20.100.9077

Linux

Backend Nvidia AMD dGPU AMD iGPU Intel iGPU
OpenGL ? ? ? ?
Vulkan ? ? ? ?

MacOS

Backend Nvidia AMD dGPU AMD iGPU Intel iGPU
OpenGL macOS 10.15.7 ? ? ?
Vulkan macOS 10.15.7 ? ? ?

Android

Backend Adreno Mali
OpenGL ? ?
Vulkan ? ?

@DevJPM DevJPM force-pushed the master branch 2 times, most recently from ec9b3e0 to 4bff12f Compare January 2, 2021 15:39
@Techjar
Copy link
Contributor

Techjar commented Jan 2, 2021

If I correctly recall my conversations with Stenzek, we chose not to enable parallel pre-compilation by default because it causes crashes on some drivers.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jan 2, 2021

Do you recall the nature of these crashes and whether we could somehow handle / retry with serial compilation on failure? Do we know whether this was "contained" to some backends (e.g. D3D12 / Vulkan not crashing while OpenGL / D3D11 sometimes doing)?

I really think it might be worth the effort here to turn this on at least sometimes - with the appropriate predicates if necessary.

Also, I have looked a bit around in GitHub but found nothing, where could I find such discussions if they happened in public?

@ghost
Copy link

ghost commented Jan 2, 2021

Also, I have looked a bit around in GitHub but found nothing, where could I find such discussions if they happened in public?

@DevJPM Dolphin Emulator official development chat is on freenode's IRC in #dolphin-dev
I'm afraid I don't think a public log is kept, but there should be someone there that's familiar with the history of this.

As far as driver's causing crashes, I do not know about that, however it's been quite some time (couple of years) since pre-compilation was introduces IIRC, the drivers could have improved since. Regardless, I still like the sound of trying to work around the driver crashes if you have the time/patience.

You might not even need to implement any fallback option if this PR is well tested and works fine. If you don't mind then it wouldn't hurt either, an issue may pop up in the future, on an edge case, or some niche device.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jan 3, 2021

Yes, I'm willing to do what I can to land the parallelism in Dolphin.

Also I just checked and the relevant changes were committed on 30 July 2017.

Thanks for the pointer to IRC, I will set up a client solution and then ask over there.

As for testing I'm afraid I only have access to an Nvidia dGPU and an Intel iGPU on Windows, so I guess I'll have to hope that I can find some AMD iGPU / dGPU users (as well as Linux / Mac / Mobile Dolphin users) for testing whether the changes crash Dolphin (and if so how).

@delroth
Copy link
Member

delroth commented Jan 3, 2021

@stenzek to comment on parallel shader compilation by default

@shuffle2
Copy link
Contributor

Big 👍 from me, I really want to see this as well - compilation is instant for me with this enabled (even with only 10 threads or so), and I haven't seen any problems (nv gpu). The advantage is good enough that it would be worth implementing whatever detection is required to downgrade to single-threaded where it's required.

@lynxnb
Copy link
Member

lynxnb commented Jan 13, 2021

Would this work on mobile just like it does on desktop platforms? I'd be happy to test this on the Android side of things if it can be of any help. I'll edit this comment when I find some time in the next days to do some testing. I have an exynos SoC 9810, Mali-G72.

Edit: added my phone's specs. From a first try, I can't seem to be able to trigger pre-compilation of shaders, altough I can only test Wii games as they are the only I have dumps of. In-game, shader compilation stutter appears to be less impactful and to last a bit less, but it could be placebo.

@JosJuice
Copy link
Member

If there is anything that would make this not work on Android, it would probably be caused by the relatively bad state of mobile GPU drivers. So this would be worth testing on both Adreno and Mali.

@Bankaimaster999
Copy link
Contributor

I would also like to test this out especially on the mobile front.
I can contribute Snapdragon 845 testing when I got time.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jan 13, 2021

@sspacelynx in the graphics settings there should be a tickbox for "compile shaders before starting" that enables pre-compilation (for me it's the 4th menu entry) and then you might need to switch to (asychronous?) Ubershades in the option above that to actually get the pre-compilation requested.

@JosJuice
Copy link
Member

If you set it to synchronous ubershaders, the pre-compilation will be either very short or non-existent (I don't remember which). So that setting should be set to one of the three other options.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jan 13, 2021

Based on the things mentioned so far in this thread I have added validation matrices to the opening post which should catch / uncover essentially all driver-related issues if there are any and we manage to test all these platforms.

When confirming stability with this branch please specify on which hardware you tested (CPU + GPU + selected GPU) on which OS with which driver revision using which backend.

I'm not an expert on Dolphin for Android so if there's any other major GPUs please tell me, so I can add them.

@iwubcode
Copy link
Contributor

iwubcode commented Jan 21, 2021

I feel like I discussed this with Stenzek as well but I can't really recall any details.

I tested this with my amd rx480 gpu on a 6700k cpu. Win10 with the radeon driver version 20.10.35.02

Probably since I code reviewed the logic initially, I knew about the hidden settings and actually had my precompile threads at 4. Because of that, I didn't notice a great deal of difference though maybe a smidge faster. I was disappointed after @shuffle2 said it was instant!! I still would like to look at something like SPIRV in the future for even more performance benefit.

Regardless, I think this change seems fine. I would assume any driver specific bugs we can force back to the old default..

I tested all backends: D3D11, D3D12, Vulkan, and even OpenGL.

@shuffle2
Copy link
Contributor

@iwubcode i haven't tried with amd gpu, but maybe you can try with more threads? (considering that cpu, 7-9 just to play around?)
I see almost linear speedup here (just eyeballing it - not really measuring), until I don't really see the imgui progress bar at all anymore ;)

@Miksel12
Copy link
Contributor

I tested this with a GTX 1060 on 460.89 with a R5 3600 on W10.
All backends worked without problem.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jan 23, 2021

I just tried it with my Intel i7-6500U and its Intel HD Graphics 520 (driver revision: 27.20.100.9077) and it works as expected (using 2ish of my 4 CPU threads). Though it appears that on the OpenGL backend the driver might serialize the compilation internally as induced CPU loads didn't exceed 28ish percent, though of course our primary concern here is stability and speed-ups in common usage scenarios.

Also thanks to @Miksel12 and @iwubcode for testing this as well, I have already entered the confirmations into the matrix in the opening post.

@pizuz
Copy link

pizuz commented Jan 31, 2021

Tested on macOS 10.15.7 Catalina and NVIDIA GeForce GT 750M with Paper Mario TTYD (notorious for taking unholy amounts of time to compile shaders):

No real performance benefit in either Vulkan or OpenGL (the latter throws the usual syntax error messages), utilizing 5% CPU during compilation. No crashes, so far.

@@ -84,9 +84,9 @@ const Info<bool> GFX_WAIT_FOR_SHADERS_BEFORE_STARTING{
{System::GFX, "Settings", "WaitForShadersBeforeStarting"}, false};
const Info<ShaderCompilationMode> GFX_SHADER_COMPILATION_MODE{
{System::GFX, "Settings", "ShaderCompilationMode"}, ShaderCompilationMode::Synchronous};
const Info<int> GFX_SHADER_COMPILER_THREADS{{System::GFX, "Settings", "ShaderCompilerThreads"}, 1};
const Info<int> GFX_SHADER_COMPILER_THREADS{{System::GFX, "Settings", "ShaderCompilerThreads"}, -1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this one alone and only change the precompiler threads, as doing async compile on multiple threads might cause more stutters or general performance degradation if a game generates a lot of new shaders at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this would likely require more precise investigation as to if and when using more than one thread there will be beneficial and how good the current heuristic of (# logical CPUs - 3) is.

@ghost
Copy link

ghost commented Feb 13, 2021

I can test Linux with AMDGPU if you want, just tell me what you need and I'll submit results (first time ever trying so just let me know or point me in the right direction).

Sys specs:
CPU: AMD Ryzen 3900XT
GPU AMD Radeon RX 580 8GB

@DevJPM
Copy link
Contributor Author

DevJPM commented Feb 14, 2021

@yokai-64 that would be much appreciated. Testing should be as simple as cloning the master branch of my fork (git clone --recursive https://github.com/DevJPM/dolphin.git) and then following the local Linux build instructions in the main readme file. Then boot up Dolphin, enter the graphics options and set the Shader compilation to "Asynchronous (Ubershaders)" and check "Compile Shaders before Starting". Finally start a game where you get the loading screen that shaders are compiling and see if it crashes. Repeat this test with all two backends (OpenGL + Vulkan). Optionally you can also check your system monitor to see if Dolphin uses more than 2 CPU cores during the compilation progress.

@ghost
Copy link

ghost commented Feb 15, 2021

Vulkan:

I couldn't get any compiler loading screen to show up - maybe something is already compiled? I cleared all the caches I could think of but nothing changed (this happened for all the games I have: Mario Kart Wii, Mario Galaxy 1 & 2 as well as Wii Sports and Wii Sports Resort).

There were also some weird graphical issues that are not present for me on mainstream Dolphin.


OpenGL:

Failed to boot any game, with segmentation fault:
2021-02-15_19-58_1

I couldn't find a log file, I tried running with dolphin-emu -d, dolphin-emu -l and dolphin-emu --logger


I'm aware that this initial analysis isn't all that helpful, so if there's anything else I can do please say and I'll try and improve these results.

For future reference:
CPU: AMD Ryzen 9 3900XT
GPU: AMD ATI Radeon RX 580
Kernel: 5.10.15-1-MANJARO
Display Driver Version: 4.6 (Compatibility Profile) Mesa 20.3.4

@CrusadingNinja
Copy link

What is currently preventing this PR from being merged?

@DevJPM
Copy link
Contributor Author

DevJPM commented Jun 3, 2021

What is currently preventing this PR from being merged?

It is currently blocked on the fact that we aren't sure enough that this won't crash (a lot of?) platforms.
I'm currently formulating an entry into Dolphin's internal bug work-around database to make this safe.

@DevJPM
Copy link
Contributor Author

DevJPM commented Jun 3, 2021

I have gated this multi-threaded shader pre-compilation using the bug feature on all non-D3D backends (which seemed to have done fine) to only be enabled on Windows or on MacOS with a Nvidia GPU. This should make it reasonably safe, while providing the benefits to Windows users (MacOS didn't actually do anything in the test sample we have?).

@PatrickFerry
Copy link
Contributor

The first three commits should be squashed together and there should be just one more as it stands. The commits merging master should be undone, or you can rebase on top of master if you want

@DevJPM DevJPM force-pushed the master branch 2 times, most recently from 8375f20 to 89f2033 Compare June 6, 2021 09:44
@DevJPM
Copy link
Contributor Author

DevJPM commented Jun 6, 2021

The first three commits should be squashed together and there should be just one more as it stands. The commits merging master should be undone, or you can rebase on top of master if you want

@PatrickFerry done. A rebase was needed as DriverDetails has been changed since February and git was confused by my additions.

@CrusadingNinja
Copy link

Is this still being worked on?

@CrusadingNinja
Copy link

Can this get a rebase?

@ghost
Copy link

ghost commented Nov 20, 2021

What is the status of this as of now?

@DevJPM
Copy link
Contributor Author

DevJPM commented Nov 20, 2021

What is the status of this as of now?

The intended status of this PR is that it's ready from my side until an authorized reviewer tells me what's wrong so I can fix that (minus a potential rebase). (In the past it feels like the main "issue" was that the relevant subsystem was authored by stenzek and they never had time to look at this PR)

I'll do another rebase now. (done)

This does this following things:

- Default to the runtime automatic number of threads for pre-compiling shaders
- Adds a distinct automatic thread count computation for pre-compilation  (which has less other things going on
and should scale better beyond 4 cores)
- Removes the unused logical_core_count field from the CPU detection
- Changes the semantics of num_cores from maximaum addressable number of cores to actually available CPU cores
(which is also how it was actually used)
- Updates the computation of the HTT flag now that AMD no longer lies about it for its Zen processors
- Background shader compilation is *not* enabled by default
@DevJPM DevJPM force-pushed the master branch 2 times, most recently from 0eae2f5 to 0b77d68 Compare November 20, 2021 15:21
@PatrickFerry
Copy link
Contributor

You also have to match the code up to the linter. You can either install the linter on your local machine, or click on the details in the failed check and copy and replace the affected lines.

@DevJPM
Copy link
Contributor Author

DevJPM commented Nov 22, 2021

You also have to match the code up to the linter. You can either install the linter on your local machine, or click on the details in the failed check and copy and replace the affected lines.

@PatrickFerry there was a leftover space apparently which I just fixed. (validated with a local lint)

@MayImilae
Copy link
Contributor

@DevJPM how would you like this to be tested? I can test macos on a Mac with Intel Graphics and on an M1 Max.

@DevJPM
Copy link
Contributor Author

DevJPM commented Nov 22, 2021

@DevJPM how would you like this to be tested? I can test macos on a Mac with Intel Graphics and on an M1 Max.

@MayImilae testing this on M1 and Intel iGPU on Macos would be quite appreciated. The concrete questions that a test should answer are:

  • Does it crash or cause any noticeable negative side-effects (e.g. bad video)?
  • Does it actually use more than 1 core (to be checked via CPU load monitoring tools)?

To test, you'll need to enable ubershaders with precompilation, get a minute or so of test gameplay in checking for any obvious bugs and test it on all your platform's graphics backends (which should only be OpenGL & Vulkan on Mac?).

Important Note As the current PR has not been tested for safety with MacOS, beyond MacOS + Nvidia GPU, you'll need to change the default bug value in /Source/Core/VideoCommon/DriverDetails.cpp for BUG_BROKEN_MULTITHREADED_SHADER_PRECOMPILATION (line 137 and 139) from true to false (for both backends) because otherwise it will default to the "safe" single-core compilation.

Depending on your results, I will update the above checks to also support Intel iGPU / M1 on Macos.

@MayImilae
Copy link
Contributor

MayImilae commented Nov 22, 2021

Compiling is kind of a pain for me. Can you compile a macos universal build for me to test?

@DevJPM
Copy link
Contributor Author

DevJPM commented Nov 22, 2021

Compiling is kind of a pain for me. Can you compile a macos universal build for me to test?

@MayImilae https://dl.dolphin-emu.org/prs/82/59/pr-9414-dolphin-latest-universal.dmg should be a universal build that enables parallel shader pre-compilation on all MacOS configurations. Should you report a problem, I will constrain the set of allowed GPU vendors again.

@MayImilae
Copy link
Contributor

MayImilae commented Nov 23, 2021

Well this is going to be a pain to test. Deleting Dolphin's Cache folder does nothing because MacOS is caching Dolphin's compiled shaders somewhere itself, and I haven't had any luck finding it. So I can't really do a repeatable test. Once a configuration has been compiled, that's it. I can never get it to compile again. Worse yet, macOS is caching everything as metal so it doesn't matter if I test with OpenGL or Vulkan - any configuration is cached in one graphics API, it's cached in the other. Since internally to macOS it's all metal anyway. So it's basically impossible for me to do OpenGL and Vulkan.

Fortunately the different MSAA levels are each a new configuration so I could at least test a little with that.

Performance wise, from what I can tell in my very limited testing... there's no discernible difference. It's using the same number of cores.

pr9414

Take the above as only preliminary testing though, as much of the ubershaders were already compiled before this test, the CPU usage is a lot less than my first run. I don't think this is very good data.

Fortunately this PR doesn't appear to be doing anything harmful on my M1. I tried a bunch of games and everything seems to be fine on both OpenGL and Vulkan on my M1 Max. I'm not going to call this definitive though, I'm very annoyed I can't get proper repeatable testing so I'm going to try to dig up that macos cache location and try to get some better results and do the same testing on my Intel mac.

@iwubcode
Copy link
Contributor

@JMC47 - this has been open almost a year with very little traction in terms of testing and has gone out of date several times. Since we have a path forward for fixing any broken drivers, would it make sense to merge this early in the month (after the next progress report) and retroactively fix anything that is broken?

@JMC47
Copy link
Contributor

JMC47 commented Jan 4, 2022

I'm not going to be able to hit a Progress Report this month due to some unfortunate circumstances, so that'll give some time for this to get tested before the next beta.

It's been waiting far too long, and is supposedly disabled on Android which should make things safer.

@JMC47 JMC47 merged commit 9a914d3 into dolphin-emu:master Jan 4, 2022
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet