Fix CPU Core Count detection and Enable Parallel Shader Compilation #9414
ec9b3e0
to
4bff12f
Compare
|
If I correctly recall my conversations with Stenzek, we chose not to enable parallel pre-compilation by default because it causes crashes on some drivers. |
|
Do you recall the nature of these crashes and whether we could somehow handle / retry with serial compilation on failure? Do we know whether this was "contained" to some backends (e.g. D3D12 / Vulkan not crashing while OpenGL / D3D11 sometimes do)? I really think it might be worth the effort here to turn this on at least sometimes, with the appropriate predicates if necessary. Also, I have looked around a bit on GitHub but found nothing; where could I find such discussions if they happened in public? |
@DevJPM Dolphin Emulator's official development chat is on freenode's IRC in #dolphin-dev. As for drivers causing crashes, I do not know about that; however, it's been quite some time (a couple of years) since pre-compilation was introduced IIRC, so the drivers could have improved since. Regardless, I still like the sound of trying to work around the driver crashes if you have the time/patience. You might not even need to implement any fallback option if this PR is well tested and works fine. If you don't mind, though, it wouldn't hurt either: an issue may pop up in the future, on an edge case or some niche device. |
|
Yes, I'm willing to do what I can to land the parallelism in Dolphin. Also I just checked and the relevant changes were committed on 30 July 2017. Thanks for the pointer to IRC, I will set up a client solution and then ask over there. As for testing I'm afraid I only have access to an Nvidia dGPU and an Intel iGPU on Windows, so I guess I'll have to hope that I can find some AMD iGPU / dGPU users (as well as Linux / Mac / Mobile Dolphin users) for testing whether the changes crash Dolphin (and if so how). |
|
@stenzek to comment on parallel shader compilation by default |
|
Big |
|
Would this work on mobile just like it does on desktop platforms? I'd be happy to test this on the Android side of things if it can be of any help. I'll edit this comment when I find some time in the next few days to do some testing. I have an Exynos 9810 SoC with a Mali-G72. Edit: added my phone's specs. From a first try, I can't seem to trigger pre-compilation of shaders, although I can only test Wii games as they are the only ones I have dumps of. In-game, shader compilation stutter appears to be less impactful and to last a bit less, but it could be placebo. |
|
If there is anything that would make this not work on Android, it would probably be caused by the relatively bad state of mobile GPU drivers. So this would be worth testing on both Adreno and Mali. |
|
I would also like to test this out especially on the mobile front. |
|
@sspacelynx in the graphics settings there should be a tickbox for "compile shaders before starting" that enables pre-compilation (for me it's the 4th menu entry), and then you might need to switch to (asynchronous?) Ubershaders in the option above that to actually get the pre-compilation requested. |
|
If you set it to synchronous ubershaders, the pre-compilation will be either very short or non-existent (I don't remember which). So that setting should be set to one of the three other options. |
|
Based on the things mentioned so far in this thread, I have added validation matrices to the opening post, which should uncover essentially all driver-related issues (if there are any), provided we manage to test all these platforms. When confirming stability with this branch, please specify the hardware you tested on (CPU + GPU + selected GPU), the OS, the driver revision and the backend. I'm not an expert on Dolphin for Android, so if there are any other major GPUs please tell me, so I can add them. |
|
I feel like I discussed this with Stenzek as well but I can't really recall any details. I tested this with my amd rx480 gpu on a 6700k cpu. Win10 with the radeon driver version 20.10.35.02 Probably since I code reviewed the logic initially, I knew about the hidden settings and actually had my precompile threads at 4. Because of that, I didn't notice a great deal of difference though maybe a smidge faster. I was disappointed after @shuffle2 said it was instant!! I still would like to look at something like SPIRV in the future for even more performance benefit. Regardless, I think this change seems fine. I would assume any driver specific bugs we can force back to the old default.. I tested all backends: D3D11, D3D12, Vulkan, and even OpenGL. |
|
@iwubcode i haven't tried with amd gpu, but maybe you can try with more threads? (considering that cpu, 7-9 just to play around?) |
|
I tested this with a GTX 1060 on 460.89 with a R5 3600 on W10. |
|
I just tried it with my Intel i7-6500U and its Intel HD Graphics 520 (driver revision: 27.20.100.9077) and it works as expected (using 2ish of my 4 CPU threads). Though it appears that on the OpenGL backend the driver might serialize the compilation internally as induced CPU loads didn't exceed 28ish percent, though of course our primary concern here is stability and speed-ups in common usage scenarios. Also thanks to @Miksel12 and @iwubcode for testing this as well, I have already entered the confirmations into the matrix in the opening post. |
|
Tested on macOS 10.15.7 Catalina and NVIDIA GeForce GT 750M with Paper Mario TTYD (notorious for taking unholy amounts of time to compile shaders): No real performance benefit in either Vulkan or OpenGL (the latter throws the usual syntax error messages), utilizing 5% CPU during compilation. No crashes, so far. |
@@ -84,9 +84,9 @@ const Info<bool> GFX_WAIT_FOR_SHADERS_BEFORE_STARTING{
     {System::GFX, "Settings", "WaitForShadersBeforeStarting"}, false};
 const Info<ShaderCompilationMode> GFX_SHADER_COMPILATION_MODE{
     {System::GFX, "Settings", "ShaderCompilationMode"}, ShaderCompilationMode::Synchronous};
-const Info<int> GFX_SHADER_COMPILER_THREADS{{System::GFX, "Settings", "ShaderCompilerThreads"}, 1};
+const Info<int> GFX_SHADER_COMPILER_THREADS{{System::GFX, "Settings", "ShaderCompilerThreads"}, -1};
Let's leave this one alone and only change the precompiler threads, as doing async compile on multiple threads might cause more stutters or general performance degradation if a game generates a lot of new shaders at once.
Agreed, this would likely require a more precise investigation into if and when using more than one thread there would be beneficial, and how good the current heuristic of (# logical CPUs - 3) is.
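For readers following the thread, the two automatic thread counts under discussion can be sketched roughly as below. Only the "-1 means automatic" convention, the (# logical CPUs - 3) heuristic and the cap of 4 come from this conversation. The function names, the treatment of any negative value as "automatic", and the "all but one logical CPU" formula for pre-compilation are assumptions made for illustration, not the code in this PR.

```cpp
#include <algorithm>

// Hypothetical helper names, for illustration only.

// Background (asynchronous) compilation runs while the game does, so the
// heuristic quoted above leaves three logical CPUs free and caps at four.
constexpr int AutoAsyncCompilerThreads(int logical_cpus)
{
  return std::clamp(logical_cpus - 3, 1, 4);
}

// Blocking pre-compilation has little else running next to it. ASSUMPTION:
// use all but one logical CPU and apply no upper cap.
constexpr int AutoPrecompilerThreads(int logical_cpus)
{
  return std::max(logical_cpus - 1, 1);
}

// ASSUMPTION: a negative configured value means "pick the automatic count";
// any other value is taken literally.
constexpr int ResolveThreadCount(int configured, int automatic)
{
  return configured >= 0 ? configured : automatic;
}
```

Under these assumptions, an 8-thread CPU would get 4 background threads but 7 pre-compilation threads, which is the gap this PR is about.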
|
I can test Linux with AMDGPU if you want, just tell me what you need and I'll submit results (first time ever trying so just let me know or point me in the right direction). Sys specs: |
|
@yokai-64 that would be much appreciated. Testing should be as simple as cloning the master branch of my fork ( |
|
What is currently preventing this PR from being merged? |
It is currently blocked on the fact that we aren't sure enough that this won't crash (a lot of?) platforms. |
|
I have gated this multi-threaded shader pre-compilation using the bug feature: on all non-D3D backends (the D3D backends seemed to have done fine) it is only enabled on Windows or on MacOS with an Nvidia GPU. This should make it reasonably safe, while providing the benefits to Windows users (MacOS didn't actually benefit in the test sample we have?). |
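One way to read the gating described in this comment is as a simple predicate over backend, OS and GPU vendor. The sketch below is a hedged illustration: every type and function name in it is hypothetical, and it does not reproduce Dolphin's actual driver-bug mechanism, only the rule as stated here.

```cpp
// Hypothetical types for illustration; not Dolphin's real enums.
enum class Backend { D3D11, D3D12, OpenGL, Vulkan };
enum class OS { Windows, Linux, MacOS, Android };
enum class Vendor { Nvidia, AMD, Intel, Other };

// Returns true when multi-threaded pre-compilation would be allowed under
// the rule described in the comment above.
constexpr bool AllowParallelPrecompile(Backend backend, OS os, Vendor vendor)
{
  // The D3D backends are not gated at all.
  if (backend == Backend::D3D11 || backend == Backend::D3D12)
    return true;

  // Non-D3D backends: tested configurations only.
  if (os == OS::Windows)
    return true;
  return os == OS::MacOS && vendor == Vendor::Nvidia;
}
```

With a predicate like this, widening support after a successful test report amounts to adding one more condition, and narrowing it again after a crash report is just as local.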
|
The first three commits should be squashed together and there should be just one more as it stands. The commits merging master should be undone, or you can rebase on top of master if you want |
8375f20
to
89f2033
Compare
@PatrickFerry done. A rebase was needed as |
|
Is this still being worked on? |
|
Can this get a rebase? |
|
What is the status of this as of now? |
The intended status of this PR is that it's ready from my side until an authorized reviewer tells me what's wrong so I can fix that (minus a potential rebase). (In the past it feels like the main "issue" was that the relevant subsystem was authored by stenzek and they never had time to look at this PR) I'll do another rebase now. (done) |
This does the following things:
- Defaults to the runtime automatic number of threads for pre-compiling shaders
- Adds a distinct automatic thread count computation for pre-compilation (which has fewer other things going on and should scale better beyond 4 cores)
- Removes the unused logical_core_count field from the CPU detection
- Changes the semantics of num_cores from maximum addressable number of cores to actually available CPU cores (which is also how it was actually used)
- Updates the computation of the HTT flag now that AMD no longer lies about it for its Zen processors
- Background shader compilation is *not* enabled by default
0eae2f5
to
0b77d68
Compare
|
You also have to match the code up to the linter. You can either install the linter on your local machine, or click on the details in the failed check and copy and replace the affected lines. |
@PatrickFerry there was a leftover space apparently which I just fixed. (validated with a local lint) |
|
@DevJPM how would you like this to be tested? I can test macos on a Mac with Intel Graphics and on an M1 Max. |
@MayImilae testing this on M1 and Intel iGPU on Macos would be quite appreciated. The concrete questions that a test should answer are:
To test, you'll need to enable ubershaders with precompilation, get a minute or so of test gameplay in checking for any obvious bugs and test it on all your platform's graphics backends (which should only be OpenGL & Vulkan on Mac?). Important Note As the current PR has not been tested for safety with MacOS, beyond MacOS + Nvidia GPU, you'll need to change the default bug value in Depending on your results, I will update the above checks to also support Intel iGPU / M1 on Macos. |
|
Compiling is kind of a pain for me. Can you compile a macos universal build for me to test? |
@MayImilae https://dl.dolphin-emu.org/prs/82/59/pr-9414-dolphin-latest-universal.dmg should be a universal build that enables parallel shader pre-compilation on all MacOS configurations. Should you report a problem, I will constrain the set of allowed GPU vendors again. |
|
Well this is going to be a pain to test. Deleting Dolphin's Cache folder does nothing because macOS is caching Dolphin's compiled shaders somewhere itself, and I haven't had any luck finding it. So I can't really do a repeatable test. Once a configuration has been compiled, that's it. I can never get it to compile again. Worse yet, macOS is caching everything as Metal, so it doesn't matter if I test with OpenGL or Vulkan: once a configuration is cached in one graphics API, it's cached in the other, since internally to macOS it's all Metal anyway. So it's basically impossible for me to test OpenGL and Vulkan separately. Fortunately the different MSAA levels are each a new configuration, so I could at least test a little with that. Performance-wise, from what I can tell in my very limited testing... there's no discernible difference. It's using the same number of cores. Take the above as only preliminary testing though: as many of the ubershaders were already compiled before this test, the CPU usage is a lot lower than on my first run. I don't think this is very good data. Fortunately this PR doesn't appear to be doing anything harmful on my M1. I tried a bunch of games and everything seems to be fine on both OpenGL and Vulkan on my M1 Max. I'm not going to call this definitive though. I'm very annoyed I can't get proper repeatable testing, so I'm going to try to dig up that macOS cache location, try to get some better results and do the same testing on my Intel Mac. |
|
@JMC47 - this has been open almost a year with very little traction in terms of testing and has gone out of date several times. Since we have a path forward for fixing any broken drivers, would it make sense to merge this early in the month (after the next progress report) and retroactively fix anything that is broken? |
|
I'm not going to be able to hit a Progress Report this month due to some unfortunate circumstances, so that'll give some time for this to get tested before the next beta. It's been waiting far too long, and is supposedly disabled on Android which should make things safer. |





Alternative Title: Actually Allow the Shader Pre-Compiler to go brrrrrrr
Background
The other day I was trying to play Dolphin on a friend's desktop with an 8-core CPU, where I noticed that CPU utilization during shader pre-compilation was laughably low, corresponding to 1-2 threads at most. Further investigation revealed that this is due to GraphicSettings using a default value of 1 for this, which disables automatic thread count choice and can't be changed easily via the GUI. So I updated that to -1 (which has the semantics of "pick an automatic runtime default").
Then I also noticed the thread count is capped at 4, which makes sense for asynchronous background shader compilation, but not for blocking pre-compilation, which the user has to actively wait for. So a new function was added to compute this, allocating more resources and not capping.
While investigating all of the above I also noticed that Dolphin determines num_cores to be 8 for my 2C4T CPU. So I investigated and found that Dolphin tries to "manually" extract the number of cores from the CPUID... and gets the number of addressable cores back instead of actual cores, skipping the step of checking which of these actually have the same name and are thus actually the same. See this Stackoverflow Q&A for more info and guidance. So, because all of this topology stuff is really complex and unneeded in Dolphin, I just replaced it with a call to std::thread::hardware_concurrency and leave it to the standard library to worry about it.
Changes
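This does the following things:
- Defaults to the runtime automatic number of threads for pre-compiling shaders
- Adds a distinct automatic thread count computation for pre-compilation (which has fewer other things going on and should scale better beyond 4 cores)
- Removes the unused logical_core_count field from the CPU detection
- Changes the semantics of num_cores from maximum addressable number of cores to actually available CPU cores (which is also how it was actually used)
- Updates the computation of the HTT flag now that AMD no longer lies about it for its Zen processors
- Background shader compilation is *not* enabled by default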
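As a rough illustration of the core-count change, a minimal sketch of querying the standard library instead of parsing CPUID topology might look like this. The function name is hypothetical, and the fallback to 1 is an assumption added here because std::thread::hardware_concurrency is allowed to return 0 when the value cannot be determined. It is not necessarily how this PR handles that case.

```cpp
#include <thread>

// Hypothetical helper name, for illustration only.
// Asks the standard library how many hardware threads are available rather
// than deriving the count from CPUID topology leaves by hand.
inline int DetectAvailableCpuCount()
{
  const unsigned int reported = std::thread::hardware_concurrency();

  // The standard permits 0 ("not computable"). ASSUMPTION: treat that as a
  // single CPU so callers never see a non-positive count.
  return reported == 0 ? 1 : static_cast<int>(reported);
}
```

On the 2C4T machine mentioned above, a call like this would be expected to report the available hardware threads rather than the 8 addressable IDs that the CPUID-based path produced.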
Validation Matrices
The below matrices should hopefully document the current state of stability testing as a function of OS, Backend, Hardware Vendor and Driver Version for this feature.
Windows
Linux
MacOS
Android