Explore advanced toolchain optimizations (e.g. PGO, LTO, whole program optimization, etcetera) #646
Comments
Maybe start from just -flto?
As an additional note, it might be a good idea to explore including Link Time Optimization (LTO) alongside PGO. There will be a need to tell the compiler what is externally visible. Supposedly, the gold linker can be used to help with this, but that would need investigation. Another idea is to try concatenating all of the .cpp files and building them with
@lieff You beat me to posting it. I edited the title to reflect the nature of this issue as encompassing more than just PGO. Quite frankly, I suspect that concatenating all of the files into a single compilation unit and then using -fwhole-program would be better than LTO, but it is up to the person who volunteers to explore this to decide what to try.

Edit: Concatenating all of the files together and building them together is similar to Chromium's jumbo builds, although doing it to enable -fwhole-program would mean that it is for inter-procedural optimizations rather than for reducing compile time: https://chromium.googlesource.com/chromium/src/+/lkcr/docs/jumbo.md

Also, here is another thought. It would be interesting to try using LLVM/Clang with Google's Souper optimizer, especially with the other optimizations mentioned in place (i.e. PGO and WPO/LTO): https://github.com/google/souper

There are other "superoptimizers" available that could probably be evaluated. They make compile times skyrocket (taking days to months depending on how they are configured), but I hear that they can provide additional performance. I'd stick to the relatively low-hanging fruit of PGO and LTO, or a jumbo build with whole program optimization, first though.
Why not you? http://mesonbuild.com/Builtin-options.html#base-options
A bunch of unit tests covering all aspects for general optimization, or a concrete game you want to optimize DXVK for.
Meson already has b_lto and b_pgo options, so it's a build/packaging question, not really project related.
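For anyone who wants to try those options, a sketch of the Meson invocation (build directory names are illustrative; `b_pgo` is set to `generate` for an instrumented training build and then to `use` for the optimized rebuild):

```shell
# Configure an instrumented build with LTO and PGO instrumentation enabled.
meson setup build.pgo -Db_lto=true -Db_pgo=generate
ninja -C build.pgo

# ... run a representative workload against the instrumented binaries ...

# Rebuild using the collected profile data.
meson configure build.pgo -Db_pgo=use
ninja -C build.pgo
```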
@pchome I am not sure if I have time. If I thought I had time to do it, I would have done it rather than posting about it. We'll see if I do, but I find it doubtful.
The problem with games is that they rely on user input. I suspect that a game would be better than unit tests (although both could be run). We would need some way to start one from the command line, have it run through a benchmark, and then quit.

@lieff How builds work is project related for any project.
I can already tell you it won't. The shader compiler-related stutter happens inside the driver, and it is inherently slow due to differences in the D3D11 and Vulkan designs. It might still be worth looking at, but PGO only really helps optimize for one specific workload, LTO is notoriously broken, and any performance gain would be in the single-digit percentages. We already had unity builds at some point, but for some strange reason they ended up being significantly slower than regular builds.
@doitsujin My replies are inline:
That is unfortunate. Would you mind sharing how you profile? If I recall correctly, my usual profiling tricks don't give me much visibility into binaries running in Wine.
Would you name a few of the differences? I would like to know more. Are you referring to things like D3D binding slots vs. Vulkan descriptor sets?
I suggest leaving it to a volunteer and putting the
That is what I thought until I saw that Firefox improved its JavaScript performance in general with PGO.
If it were up to me, I'd probably just dump all *.cpp files into a single file and then build with
winelib builds of DXVK work with the usual Linux profiling tools and debuggers.
That's causing some pain elsewhere, but the main issue with shader compilation is that you can compile shaders individually in D3D, and the D3D11 driver will do a lot of magic during the respective
That's how Meson unity builds (
@doitsujin I take it that your profiling shows that most of the time there is spent in the graphics driver. This might be asking the obvious question, but is there no way to parallelize that process? For example, n shaders
Also, LTO won't work for winelib builds, because you need an LTO'd WINE. In particular, I'm using LTO and PGO for my whole system wherever possible, and WINE is one of the unreached goals.
@pchome I was leaning toward thinking that a so-called unity build with -fwhole-program would be better than LTO. As I said above, LTO is fragile. If you build everything as one compilation unit with
@doitsujin Never mind about the parallelism. I need to do my own profiling. I have spent more time looking at this code than I really have at the moment, but I think I understand a few bits of it. In particular, the draw calls that you mentioned are likely in DxvkContext::commitGraphicsState. Given how much this piqued my interest, I'll probably profile the code at some point and learn where the time is being spent. Concurrent programming is always fun. ;)
@pchome That is good to know. I still think opportunities for interprocedural optimization from those (with -fwhole-program) could be low-hanging fruit for someone with only minor programming knowledge to explore. After reading what @doitsujin said and looking through the code, I found some more interesting avenues to explore. In particular, I am not seeing much threading, and I see no use of machine prefetch hints in the code. I need to make time to profile to see where the bottlenecks are more clearly.
After some thought, I think I should close this. It is probably not a great use of people's time, although I did learn some interesting things from the discussion.
Why so? At least PGO is real, and quite easy to test. The only thing we should do is create a list of small tests (maybe Wine's d3d11 tests, or some other D3D11 demos) and define a final benchmark to test the results, e.g.
Alright. I am reopening this.
@doitsujin One last thing, as I could not help myself from eyeballing the code a bit more. Does your profiling indicate that the stuttering is from
@doitsujin That is tricky. Couldn't you just cache the DXBC shaders and other things that DXVK receives from the game and turns into a pipeline? Then on subsequent runs, if one of the shaders from a previous session is loaded by a game (identified by a matching checksum), DXVK could load the rest from cache and pre-create the pipeline. That is just a rough idea, but some kind of driver-independent cache seems like the only way around it.
OK, I was able to build it standalone. I am going to do the same for Wine's other dx10/dx11-related tests, and combine them all before sharing.
https://github.com/pchome/wine-playground/tree/master/dx1x-tests

Note:
d3d11 test output: So
The tests that failed probably merit their own issues. |
Mostly no: missing interfaces, specific formats, and probably WINE-internal stuff.
@doitsujin can check it himself, if he wants to. EDIT: A lot of the tests are failing even on Wine.
I have a theory on why unity builds took longer. For large projects, the headers can be substantially more complex than the files themselves. Furthermore, you can have many files to compile, such that even with

I believe that DXVK's headers are not complex enough to save time with unity builds. However, there should be opportunities for strong interprocedural optimizations from unity builds if DXVK is adapted to support
They seem to always be used for me. For comparison (where they are disabled):
@pchome Not for the official DXVK release builds or for the DXVK bundled with Proton. The same goes for the builds for which I posted size numbers. SSE and SSE2 were omitted from the 32-bit binaries. |
I've smoothed some rough corners in my whole-program patch, and I am going to include it in my ebuild repository using

For those who want to test: dxvk-0.80-whole-program-support2.patch.txt

Use

Note:
@pchome If you designed your patch to not change behavior unless told to use wpo, you can just apply it unconditionally. Conditional patching is highly discouraged in Gentoo ebuilds because it makes them less maintainable. You could also call the USE flag wpo. |
I built DXVK using AutoFDO (#646 (comment)). Also, there is a new warning during the build: `[-Wmaybe-uninitialized]` for src/dxbc/dxbc_compiler.cpp:5034:31. Not sure whether I am doing something wrong, or it is a compiler/code issue.

```shell
# AutoFDO
so_path=/tmp/lib64
data_path=/var/tmp/dxvk.profile

# Record profile data
perf record -e branch-instructions -o $data_path/dxvk.data d3d11-triangle.exe

# Create afdo profile
create_gcov --binary=$so_path/d3d11.dll.so --profile=$data_path/dxvk.data --gcov=$data_path/d3d11.dll.so.afdo -gcov_version=1 -use_lbr=false
create_gcov --binary=$so_path/dxgi.dll.so --profile=$data_path/dxvk.data --gcov=$data_path/dxgi.dll.so.afdo -gcov_version=1 -use_lbr=false
profile_merger $data_path/d3d11.dll.so.afdo $data_path/dxgi.dll.so.afdo -output_file=$data_path/dxvk.afdo -gcov_version=1 -use_lbr=false

# Use -fauto-profile=/var/tmp/dxvk.profile/dxvk.afdo
```
Can anyone share

Using something like:

or using pmu-tools and example (
perf-dxvk.svg.gz
@pchome What is AFDO? Also, that shows that a significant amount of time is spent in
@pchome I imagine that you are using your own Wine build. Try applying this patch:

It might make that faster. We would need to profile the native build in Wine to see what it does, but hopefully it does the same thing.
AutoFDO, see above #646 (comment)
It's wine-staging-3.16[custom-cflags] (meaning system *FLAGS, except LTO), patched with rebased esync+proton patches. So it's a "custom Proton-3.16" build, and it already contains this patch.
@pchome In that case, we need to consider reimplementing how time is looked up, perhaps by using the

Also, I spotted a function that seems to take a little too long relative to its code, so I am going to disassemble it to see what it is doing. Maybe we can do better there with better optimization flags, maybe we could do better by changing the code, or maybe that code really does more than I realize. C++ is abstract enough that it is hard to make a ballpark estimate of how expensive code is just by eyeballing a few lines.

Lastly, a program that draws triangles really is not the best choice to profile. It would be better to profile a Unigine benchmark or Overwatch. Also, I hope that you are using a sampling rate of 99 Hz or 997 Hz. Round sample rates of 100 Hz or 1000 Hz can end up failing to spot expensive things that run at the same rate, which can happen.
On second thought, there is a better way of handling time in this sort of application. Instead of polling the time, we could make this interrupt-driven by having a volatile flag variable, set on a timer, that controls whether we flush. The flush code would check that flag; when it is set, it would unset it and do a flush. This should be less expensive than polling in this function. Maybe this could work for that: http://www.cplusplus.com/reference/future/future/

I would need to study the code, but perhaps we could make the flush itself timer-driven so that we do not need that check and can just rely on the timer.
Note: the profile may contain calls from

Also, I want to profile at least something small, to prove that profiling can be done on Wine applications (*.exe/*.dll).
Yes, I planned to use -F99 on something "big", but I'll try it on triangles, thanks. Also, I believe my hardware is not fully suitable for profiling multithreaded applications (no |
Meanwhile, I just built DXVK using AutoFDO, but without any other optimizations: Edit: I see no reason to measure its performance at this stage.
I thought about it some more. I am not certain that a timer is even necessary. If the GPU is about to run out of things to do, then giving it more seems like a good idea. I don't see the point of having a timeout. We could try deleting this code and seeing how that works. |
Why?

```shell
#!/bin/sh
mkdir -p $(pwd)/dxvk-test
export WINEPREFIX="$(pwd)/dxvk-test"
export WINEDLLOVERRIDES="d3dcompiler_47,d3d11,dxgi=n"
export DXVK_HUD=version,devinfo,fps

cp /tmp/lib64/dxgi.dll.so dxvk-test/drive_c/windows/system32/dxgi.dll
cp /tmp/lib64/d3d11.dll.so dxvk-test/drive_c/windows/system32/d3d11.dll

wine d3d11-triangle.exe
```

That's all.
@pchome I am using proton. I don't know why it does not work. It is on my todo list. I doubt that I will have time to do anything else here until later in the week. |
Last time I tried, the winelib builds didn't work on Proton because Proton's Wine version was not recent enough and winelib DXVK couldn't use native Vulkan calls. |
You can have a different WINE version for tests and measurements. What is the problem? I don't get it. If the problem is somewhere in your distro, then
I am juggling several different things. Some things just get assigned a lower priority than others. I did a prototype of that patch a few days earlier, so do we really need to discuss why I have yet to allocate more time to try again with winelib?
You are spending more time randomly adding code you can't even test. You think it's OK to ask people to spend their time testing random changes. Since this issue is not my blog, I'm reporting my experience and expect participants to be able to reproduce it. My interest here is AutoFDO, PGO, LTO and
The previous code had a call to an external symbol. The compiler cannot optimize across function calls to external symbols, so the patched code is easier for the compiler to optimize. Before I put this down, I made another version of the patch that does away with the added atomics, which makes the code even more optimization-friendly. I'll push that version when I am back at my Gentoo development workstation. For what it is worth, doing away with atomics also improves compiler optimization opportunities, but it can be absurdly difficult to do right. The amount of code you see changed to do that is disproportionately small relative to the amount of effort.

During the course of this, we will probably find other source code changes that can make the compiler's optimization passes do a better job. We really ought to make those in addition to experimenting with compiler optimizations.

Also, I tested all versions that I pushed, but only as native binaries, because I am not set up for doing tests with winelib builds yet. Just as you are not set up to build native binaries and can only test winelib builds, I am not set up for winelib builds (at least usable ones) and can only test the native binaries. It would take time to set up something else. I estimate it would take several hours to debug why my winelib builds do not work, and I don't have time for that. It is on my todo list and will be done eventually. It might look like only a few minutes of work to you (and perhaps it is), but I have been sucked into rabbit holes that should take only a few minutes in the past. Writing the patch was a small rabbit hole; I am not willing to jump into another one without adequate time to dedicate to it.

Lastly, your triangles test is a horrible way of profiling for optimization opportunities, whether they are opportunities for better compilation or something else.
Getting the time does not involve a system call thanks to vdso (and possibly also a patch done to proton to stop using CLOCK_MONOTONIC_RAW). There is very little that we can learn from profiling it because it does not exercise the code in a realistic way. Proper profiling requires running something that exercises the graphics API in a realistic way, such as a unigine benchmark. |
I had a chat with one of the LLVM experts from BNL in person yesterday about this. The only thing he could add that we were not already doing was to try building with Clang. I had hoped that he could suggest some compiler options known to generate better code at the expense of absurdly long compilation times, but he is not familiar with those.

I heard about superoptimization from a researcher at the Institute for Advanced Computational Science who does work in compiler optimization. With superoptimization, builds can take days, weeks, or even months, but the performance of the generated code is better than from normal optimization. Someone might want to look into testing every tool that comes up when googling "gcc superoptimizer" or related queries.

I am withholding names because I do not think the people involved would want to receive emails from random people on this topic. The researcher that I mentioned receives so many emails that she only responds to emails with specific keywords in the subject, so random people who email her are unlikely to receive a response.
I have come to the conclusion that all of the low-hanging fruit has been found. Further improvements will come from making code changes guided by profiling. Also, I tested that patch. It makes no improvement in any game, not even in the fairly unrealistic Rise of Nations: Extended Edition title screen.
Someone with time to explore tweaks to the build system should look into doing PGO builds. There are descriptions of how this works here:
https://dom.as/2009/07/27/profile-guided-optimization-with-gcc/
https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/Instrumentation-Options.html
There are glowing reviews of PGO here:
https://cboard.cprogramming.com/tech-board/111902-pgo-amazing.html
https://www.activestate.com/blog/2014/06/python-performance-boost-using-profile-guided-optimization
https://clearlinux.org/blogs/profile-guided-optimization-mariadb-benchmarks
I suspect that PGO might help to reduce "stutter".
There are a couple of questions that need to be answered before PGO builds can be done: