Explore advanced toolchain optimizations (e.g. PGO, LTO, whole program optimization, etcetera) #646

Closed
ryao opened this issue Sep 17, 2018 · 100 comments

Comments

@ryao
Contributor

ryao commented Sep 17, 2018

Someone with time to explore tweaks to the build system should look into doing PGO builds. There are descriptions of how this works here:

https://dom.as/2009/07/27/profile-guided-optimization-with-gcc/
https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/Instrumentation-Options.html

There are glowing reviews of PGO here:

https://cboard.cprogramming.com/tech-board/111902-pgo-amazing.html
https://www.activestate.com/blog/2014/06/python-performance-boost-using-profile-guided-optimization
https://clearlinux.org/blogs/profile-guided-optimization-mariadb-benchmarks

I suspect that PGO might help to reduce "stutter".

There are a couple of questions that need to be answered before PGO builds can be done:

  1. How will the filesystem path for profile data work when using a wine prefix?
  2. What benchmark can be run to generate profile data? Presumably, the benchmark should be something that does not require human interaction.
@lieff

lieff commented Sep 17, 2018

Maybe start with just -flto?

@ryao
Contributor Author

ryao commented Sep 17, 2018

As an additional note, it might be a good idea to explore including Link Time Optimization (LTO) alongside PGO. There will be a need to tell the compiler what is externally visible. Supposedly, the gold linker can be used to help with this, but that would need investigation.

Another idea is to try concatenating all of the .cpp files and building them with -fwhole-program. This will require marking public functions with externally_visible, although it should generate a very well optimized binary.

@ryao ryao changed the title from "PGO builds" to "Explore advanced toolchain optimizations (e.g. PGO, LTO, whole program optimization, etcetera)" Sep 17, 2018
@ryao
Contributor Author

ryao commented Sep 17, 2018

@lieff You beat me to posting it. I edited the title to reflect the nature of this issue as encompassing more than just PGO.

Quite frankly, I suspect that concatenating all of the files into a single compilation unit and then using -fwhole-program would be better than LTO, but it is up to the person who volunteers to explore this to decide what to try.

Edit: Concatenating all of the files together and building them together is similar to Chromium's jumbo builds, although doing it to enable -fwhole-program would mean that it is for inter-procedural optimizations rather than reducing compile time:

https://chromium.googlesource.com/chromium/src/+/lkcr/docs/jumbo.md

Also, here is another thought. It would be interesting to try using LLVM/Clang with Google's Souper optimizer, especially with the other optimizations mentioned in place (i.e. PGO and WPO/LTO):

https://github.com/google/souper

There are other "superoptimizers" available that probably could be evaluated. They would make compile times skyrocket (taking days to months depending on how they are configured), but I hear that they can provide additional performance. I'd stick to the relatively low-hanging fruit of PGO and LTO or a jumbo build with whole program optimization first though.

@pchome
Contributor

pchome commented Sep 17, 2018

@ryao

it is up to the person who volunteers to explore this

why not you?

http://mesonbuild.com/Builtin-options.html#base-options
See b_lto, b_pgo and --unity

  1. What benchmark can be run to generate profile data?

A bunch of unit tests covering all aspects for general optimization, or a concrete game you want to optimize DXVK for.

@lieff

lieff commented Sep 17, 2018

Meson already has b_lto and b_pgo options, so it's a build/packaging question, not really project related.

@ryao
Contributor Author

ryao commented Sep 17, 2018

@pchome I am not sure if I have time. If I thought I had time to do it, I would have done it rather than posting about it. We'll see if I do, but I find it doubtful.

A bunch of unit tests covering all aspects for general optimization, or a concrete game you want to optimize DXVK for.

The problem with games is that they rely on user input. I suspect that a game would be better than unit tests (although both could be run). We would need some way to start one from the command line, have it run through a benchmark and then quit.

@lieff How builds work is project related for any project.

@doitsujin
Owner

doitsujin commented Sep 17, 2018

I suspect that PGO might help to reduce "stutter".

I can already tell you it won't. The shader compiler-related stutter happens inside the driver, and it is inherently slow due to differences in the D3D11 and Vulkan designs.

Might still be worth looking at, but PGO only really helps optimize for one specific workload, LTO is notoriously broken, and any performance gain would be in the single-digit percentages.

We already had Unity builds at some point, but for some strange reason they ended up being significantly slower than regular builds.

@ryao
Contributor Author

ryao commented Sep 17, 2018

@doitsujin My replies are inline:

I can already tell you it won't.

That is unfortunate. Would you mind sharing how you profile? If I recall correctly, my usual profiling tricks don't give me much visibility into binaries running in Wine.

The shader compiler-related stutter happens inside the driver, and it is inherently slow due to differences in the D3D11 and Vulkan designs.

Would you name a few of the differences? I would like to know more. Are you referring to things like D3D binding slots vs Vulkan descriptor sets?

Might still be worth looking at

I suggest leaving it to a volunteer and putting the help-wanted label on this. This sort of experiment is something a volunteer could do.

PGO only really helps optimize for one specific workload

That is what I thought until I saw that Firefox improved its Javascript performance in general with PGO.

LTO is notoriously broken

If it were up to me, I'd probably just dump all *.cpp files into a single file and then build with -fwhole-program. It is less fragile than LTO and should work just as well. The caveat about needing to mark public functions/variables with externally_visible does apply. Otherwise, breakage will occur when the symbols are optimized away. Another issue would be that it would reduce the information available in backtraces.
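
As a rough illustration of the externally_visible caveat, here is a minimal sketch assuming GCC. The WPO_EXPORT macro and both function names are made up for this example and are not taken from DXVK.

```cpp
#include <cassert>

// Hypothetical macro name; DXVK's real export macros may differ.
// The attribute is GCC-specific, so other compilers get an empty macro.
#if defined(__GNUC__) && !defined(__clang__)
#  define WPO_EXPORT __attribute__((externally_visible))
#else
#  define WPO_EXPORT
#endif

// Internal helper: under -fwhole-program GCC may treat this like a
// static function and inline or drop it.
int scale(int value, int factor) {
  return value * factor;
}

// Entry point that other modules must still be able to resolve,
// so it is explicitly kept visible.
extern "C" WPO_EXPORT int exported_entry_point(int value) {
  assert(value >= 0);
  return scale(value, 2);
}
```

Built as a single translation unit with something like `g++ -O2 -fwhole-program`, the unmarked helper can be handled like a static function, while the marked entry point stays visible outside the compilation unit.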

@doitsujin
Owner

doitsujin commented Sep 17, 2018

That is unfortunate. Would you mind sharing how you profile?

winelib builds of DXVK work with the usual Linux profiling tools and debuggers.

Are you referring to things like D3D binding slots vs vulkan descriptor sets?

That's causing some pain elsewhere, but the main issue with shader compilation is that you can compile shaders individually in D3D, and the D3D11 driver will do a lot of magic during the respective Create*Shader call, whereas Vulkan pipelines expect all shaders to be present (in SPIR-V, which then has to be optimized and translated to hardware instructions by the driver), as well as the full state vector, so we have to do all the work on the first draw that a specific shader is used with.

@pchome
Contributor

pchome commented Sep 17, 2018

I'd probably just dump all *.cpp files into a single file

that's how meson unity builds (--unity on) work, but per module
http://mesonbuild.com/Unity-builds.html#unity-builds

@ryao
Contributor Author

ryao commented Sep 17, 2018

@doitsujin I take it that your profiling shows that most of the time there is spent in the graphics driver. This might be asking the obvious question, but is there no way to parallelize that process?

For example, n shaders A[i] for i from 0 to n - 1 must be built, so m worker threads j from 0 to m - 1 are created, and each one builds every A[i] where i % m == j. After they are all finished, the main thread just gathers all of the work from the worker threads. My feeling is that it is not that straightforward, but you piqued my curiosity.
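
The striping scheme described here can be sketched roughly as follows. This is only an illustration of the i % m == j distribution with standard threads; `run_striped` and the `job` callback are hypothetical names, and real shader or pipeline compilation would have dependencies that this ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs job(i) for every i in [0, count). Worker j handles each i with
// i % workers == j, and the caller gathers the results after all joins.
std::vector<int> run_striped(std::size_t count, std::size_t workers,
                             const std::function<int(std::size_t)>& job) {
  assert(workers > 0);
  std::vector<int> results(count);
  std::vector<std::thread> pool;
  pool.reserve(workers);
  for (std::size_t j = 0; j < workers; ++j) {
    pool.emplace_back([&results, &job, count, workers, j] {
      // Each worker writes to distinct elements, so no lock is needed.
      for (std::size_t i = j; i < count; i += workers)
        results[i] = job(i);
    });
  }
  for (std::thread& t : pool)
    t.join();
  return results;
}
```

Static striping like this is simple, but it balances poorly when individual jobs take very different amounts of time; a shared work queue would be the usual alternative.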

@pchome
Contributor

pchome commented Sep 17, 2018

Also, LTO won't work for winelib builds, because you need LTOed WINE, or particularly libwinecrt0.a.

I'm using LTO and PGO for my whole system wherever possible, and WINE is one of the unreached goals.

@ryao
Contributor Author

ryao commented Sep 17, 2018

@pchome I was leaning toward thinking that a so-called unity build with -fwhole-program would be better than LTO. As I said above, LTO is fragile. If you build everything as one compilation unit with -fwhole-program, you don't need LTO.

@ryao
Contributor Author

ryao commented Sep 17, 2018

@doitsujin Nevermind about the parallelism. I need to do my own profiling. I have spent more time looking at this code than I really have at the moment, but I think I understand a few bits of it. In particular, the draw calls that you mentioned are likely in DxvkContext::commitGraphicsState. Given how much this piqued my interest, I'll probably profile the code at some point and learn where the time is being spent. Concurrent programming is always fun. ;)

@pchome
Contributor

pchome commented Sep 17, 2018

@ryao
Contributor Author

ryao commented Sep 17, 2018

@pchome That is good to know. I still think opportunities for interprocedural optimization from those (with -fwhole-program) could be low-hanging fruit for someone who has only minor programming knowledge to explore.

After reading what @doitsujin said and looking through the code, I found some more interesting avenues to explore. In particular, I am not seeing much threading and I see no use of machine prefetch hints in the code. I need to make time to profile to see where the bottlenecks are more clearly.

@ryao
Contributor Author

ryao commented Sep 17, 2018

After some thought, I think I should close this. It is probably not a great use of people's time, although I did learn some interesting things from the discussion.

@ryao ryao closed this as completed Sep 17, 2018
@pchome
Contributor

pchome commented Sep 17, 2018

Why so? At least PGO is real, and quite easy to test.

The only thing we should do -- create a list of small tests (maybe wine's d3d11 tests, or some other d3d11 demos), and define a final benchmark to test results.

e.g.

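#!/bin/sh
run_benchmark    # baseline measurement
build_profile    # build with profiling enabled
run_tests        # exercise the code to collect profile data
use_profile      # rebuild using the collected profile
run_benchmark    # measure again and compare with the baseline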

@ryao
Contributor Author

ryao commented Sep 17, 2018

Alright. I am reopening this.

@ryao ryao reopened this Sep 17, 2018
@ryao
Contributor Author

ryao commented Sep 17, 2018

@doitsujin One last thing as I could not help myself from eyeballing the code a bit more. Does your profiling indicate that the stuttering is from dxvkgraphicspipeline::DxvkGraphicsPipeline()? I see 5 ->createShaderModule() calls there that probably could run in parallel.

@doitsujin
Owner

doitsujin commented Sep 17, 2018

vkCreateShaderModule is literally a memcpy in actual Vulkan drivers. The expensive part is creating the Vulkan pipeline (vkCreateGraphicsPipelines).

@ryao
Contributor Author

ryao commented Sep 17, 2018

@doitsujin That is tricky. Couldn't you just cache the DXBC shaders and other things that DXVK receives from the game and turns into a pipeline? Then on subsequent runs, if one of the shaders from a previous session is loaded by a game (identified by a matching checksum), DXVK could load the rest from cache and pre-create the pipeline. That is just a rough idea, but some kind of driver-independent cache seems like the only way around it.
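
To make the rough idea a bit more concrete, here is a small in-memory sketch of such an index. Everything in it is hypothetical: `hash_shader`, `PipelineKey` and `PipelineIndex` are illustrative names, FNV-1a stands in for whatever checksum would actually be used, and a real key would also have to include the full pipeline state and be persisted to disk between runs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using ShaderHash = std::uint64_t;

// FNV-1a over the shader bytecode; a stand-in checksum for this sketch.
ShaderHash hash_shader(const std::vector<std::uint8_t>& bytecode) {
  ShaderHash h = 0xcbf29ce484222325ull;
  for (std::uint8_t b : bytecode) {
    h ^= b;
    h *= 0x100000001b3ull;
  }
  return h;
}

// A pipeline is identified here only by the set of shaders it uses.
using PipelineKey = std::set<ShaderHash>;

class PipelineIndex {
public:
  // Remember a pipeline that was created during this (or an earlier) run.
  void record(const PipelineKey& key) {
    assert(!key.empty());
    for (ShaderHash h : key)
      m_byShader[h].insert(key);
  }

  // Pipelines that used this shader before: candidates to pre-create
  // as soon as the game loads the shader again.
  std::vector<PipelineKey> candidates(ShaderHash h) const {
    auto it = m_byShader.find(h);
    if (it == m_byShader.end())
      return {};
    return std::vector<PipelineKey>(it->second.begin(), it->second.end());
  }

private:
  std::map<ShaderHash, std::set<PipelineKey>> m_byShader;
};
```

Looking up by a single shader hash is what would allow pre-creation to start before the first draw call that uses the pipeline.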

@pchome
Contributor

pchome commented Sep 17, 2018

@pchome

create a list of small tests
maybe wine's d3d11 tests

OK, I was able to build a standalone dxgi test from the wine sources. It executes quickly and looks like it can be used for PGO needs.
0026:dxgi: 6386 tests executed (0 marked as todo, 270 failures), 5 skipped.

I am going to do the same for the other dx10/dx11-related wine tests, and combine them all together before sharing.

EDIT:
dxgi.test.txt
dxgi_dxgi.log.txt

@pchome
Contributor

pchome commented Sep 18, 2018

https://github.com/pchome/wine-playground/tree/master/dx1x-tests

Note:

  • winelib build
  • unmodified WINE sources
  • test-run.sh contains examples of how to run
  • only dxgi and d3d11 tests finished correctly
  • d3d10.device is a RAM-consuming evil,
    failed with unimplemented function d3d10.dll.D3D10StateBlockMaskDifference
  • d3d10.effect - effects not supported in DXVK
  • d3d10_1 failed w/ exception and d3d10core failed w/ segfault

d3d11 test out: 0025:d3d11: 1154 tests executed (0 marked as todo, 201 failures), 1 skipped.

So dxgi and d3d11 tests could be used as is (for now), despite failures.
The others require patching work.

@ryao
Contributor Author

ryao commented Sep 18, 2018

The tests that failed probably merit their own issues.

@pchome
Contributor

pchome commented Sep 18, 2018

Mostly no: the failures come from missing interfaces, specific formats and probably WINE's internal stuff.

err:   D3D11: Cannot create texture:
  Format:  VK_FORMAT_E5B9G9R9_UFLOAT_PACK32
  Extent:  512x512x1
  Samples: 1
  Layers:  1
  Levels:  1
  Usage:   13
err:   DXGI: CheckInterfaceSupport: Unsupported interface
err:   db6f6ddb-ac77-4e88-8253-819df9bbf140

@doitsujin can check it himself if he wants to.

EDIT: A lot of them (tests) are failing even for wine.
http://test.winehq.org/data/
http://test.winehq.org/data/64d9f309b7f74d4154e685c5d1d78c1b8335c0bc/index_Linux.html

@ryao
Contributor Author

ryao commented Sep 18, 2018

I have a theory on why unity builds took longer. For large projects, the headers can be substantially more complex than the files themselves. Furthermore, you can have many files to compile such that even with -j$(nproc), each core must process a large number of them. The time savings from unity builds comes from parsing the headers only once for all of those files. If all of the additional time spent parsing all of the files that would be handled on other cores is less than the savings from not parsing the headers once for each file on a single core, you save time. If not, you do not save time.
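
The trade-off in this theory can be written down as a toy cost model. This is only a sketch under strong assumptions: every file costs the same, header parsing is a fixed cost per translation unit, the unity build runs on one core, and the function names and millisecond units are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Regular build: each of `cores` jobs compiles ceil(files / cores) files
// and parses the headers once for every file it compiles.
std::uint64_t regular_build_ms(std::uint64_t files, std::uint64_t cores,
                               std::uint64_t header_ms, std::uint64_t source_ms) {
  assert(cores > 0);
  std::uint64_t files_per_core = (files + cores - 1) / cores;
  return files_per_core * (header_ms + source_ms);
}

// Unity build: a single core parses the headers once and then compiles
// the body of every source file.
std::uint64_t unity_build_ms(std::uint64_t files, std::uint64_t header_ms,
                             std::uint64_t source_ms) {
  return header_ms + files * source_ms;
}
```

With 100 files on 8 cores, headers costing 1000 ms and sources 100 ms each, the model gives 14300 ms for the regular build against 11000 ms for the unity build. If the headers cost only 100 ms, it gives 2600 ms against 10100 ms, so the unity build loses.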

I believe that DXVK’s headers are not complex enough to save time with unity builds. However, there should be opportunities for strong interprocedural optimizations from unity builds if DXVK is adapted to support -fwhole-program as part of them. This means marking externally accessible functions with the externally_visible attribute, according to the compiler documentation. This idea needs testing to see if it makes a difference.

@pchome
Contributor

pchome commented Sep 28, 2018

because SSE and SSE2 were not being used

$ gcc -m32 -O3 -Q --target-help | grep -i '\-msse'
$ gcc -march=native -m32 -O3 -Q --target-help | grep -i '\-msse'

They always seem to be enabled for me.

For comparison (where they are disabled):
$ gcc -march=pentium -m32 -O3 -Q --target-help | grep -i '\-msse'

@ryao
Contributor Author

ryao commented Sep 28, 2018

@pchome Not for the official DXVK release builds or for the DXVK bundled with Proton. The same goes for the builds for which I posted size numbers. SSE and SSE2 were omitted from the 32-bit binaries.

@pchome
Contributor

pchome commented Sep 29, 2018

I've smoothed the rough corners in my whole-program patch a bit, and I am going to include it in my ebuild repository behind an experimental USE flag. It is still a rough hack, but it's OK for the compiler. Tested only on winelib builds.

For those who want to test: dxvk-0.80-whole-program-support2.patch.txt

Use $ meson configure -Dwhole_program=true in your build directory, or pass -Dwhole_program=true along with the --unity on option at the configuration stage.
No other actions required (-fwhole-program will be added to project flags).

Note:

@ryao
Contributor Author

ryao commented Sep 29, 2018

@pchome If you designed your patch to not change behavior unless told to use wpo, you can just apply it unconditionally. Conditional patching is highly discouraged in Gentoo ebuilds because it makes them less maintainable. You could also call the USE flag wpo.

@pchome
Contributor

pchome commented Sep 29, 2018

I built DXVK using AutoFDO (#646 (comment)).
But the test program fails with these libs.

Also there is new warning during build: [-Wmaybe-uninitialized] for src/dxbc/dxbc_compiler.cpp:5034:31
And using unity build along with -fwhole-program or -g3 during optimized compilation causes ICEs (during IPA pass: cp -- looks like -fipa-cp).

I am not sure whether I am doing something wrong or whether it is a compiler/code issue.

# AutoFDO
so_path=/tmp/lib64
data_path=/var/tmp/dxvk.profile

# Record profile data
perf record -e branch-instructions -o $data_path/dxvk.data  d3d11-triangle.exe

# Create afdo profile
create_gcov --binary=$so_path/d3d11.dll.so --profile=$data_path/dxvk.data --gcov=$data_path/d3d11.dll.so.afdo -gcov_version=1 -use_lbr=false
create_gcov --binary=$so_path/dxgi.dll.so --profile=$data_path/dxvk.data --gcov=$data_path/dxgi.dll.so.afdo -gcov_version=1 -use_lbr=false

profile_merger $data_path/d3d11.dll.so.afdo $data_path/dxgi.dll.so.afdo -output_file=$data_path/dxvk.afdo -gcov_version=1 -use_lbr=false

# Use -fauto-profile=/var/tmp/dxvk.profile/dxvk.afdo

@pchome
Contributor

pchome commented Sep 29, 2018

whole-program patch
going to include it into my ebuild repository

pchome/dxvk-gentoo-overlay@476a143

@pchome
Contributor

pchome commented Sep 30, 2018

Can anyone share a perf.data generated on an Intel system for DXVK?

Using something like:
perf record -b -e br_inst_retired.near_taken:pp -- wine d3d11-app.exe (I suppose)

or using pmu-tools and example (create_gcov and profile_* : google/autofdo)

@pchome
Contributor

pchome commented Sep 30, 2018

@ryao

Do you have profiling data to show that there is not much room for improvement? Perhaps a flame graph?

perf-dxvk.svg.gz
From profile data collected for AFDO, not perf -ag.

@ryao
Contributor Author

ryao commented Oct 1, 2018

@pchome What is AFDO? Also, that shows a significant amount of time is spent in __clock_gettime(). This can be optimized, although not by the compiler. I can look into it later this week.

@ryao
Contributor Author

ryao commented Oct 1, 2018

@pchome I imagine that you are using your own wine build. Try applying this patch:

ValveSoftware/wine@781fbb0

It might make that faster. We would need to try to profile the native build in wine to see what it does, but hopefully, it does the same thing.

@pchome
Contributor

pchome commented Oct 1, 2018

@ryao

What is AFDO?

AutoFDO, see above #646 (comment)

I imagine that you are using your own wine build. Try applying this patch:

It's wine-staging-3.16[custom-cflags] (meaning system *FLAGS, except LTO), patched w/ rebased esync+proton patches. So it's a "custom Proton-3.16" build that already contains this patch.

@ryao
Contributor Author

ryao commented Oct 1, 2018

@pchome In that case, we need to consider reimplementing how time is looked up, perhaps by using the rdtsc instruction while implementing a fallback for processors where using it is a bad idea.
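
A minimal sketch of what an rdtsc-based lookup with a fallback might look like, assuming GCC or Clang on x86. `read_ticks` and its `trust_tsc` parameter are hypothetical; how to decide that the TSC is unreliable on a given processor is left out, and the two paths return different units (raw TSC ticks versus nanoseconds), so a real implementation would need calibration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#  include <x86intrin.h>   // provides __rdtsc() on GCC and Clang
#  define HAVE_RDTSC 1
#else
#  define HAVE_RDTSC 0
#endif

// Hypothetical helper: returns an increasing tick count.
// trust_tsc selects the rdtsc path where it is available.
inline std::uint64_t read_ticks(bool trust_tsc) {
#if HAVE_RDTSC
  if (trust_tsc)
    return __rdtsc();
#endif
  (void)trust_tsc;  // unused on targets without rdtsc
  // Fallback: monotonic clock from the standard library, in nanoseconds.
  auto now = std::chrono::steady_clock::now().time_since_epoch();
  return static_cast<std::uint64_t>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
}
```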

Also, I spotted a function that seems to take a little too long relative to the code, so I am going to disassemble it to see what it is doing. Maybe we can do better there with better optimization flags, maybe we could do better there by changing the code or maybe that code really does more than I realize. C++ is abstract enough that it is hard to do a ballpark estimate of how expensive code is just by eyeballing a few lines.

Lastly, a program that draws triangles really is not the best choice to profile. It would be better to profile a Unigine benchmark or Overwatch. Also, I hope that you are using a sampling rate of 99Hz or 997Hz. Round-number sample rates of 100Hz or 1000Hz can end up sampling in lockstep with periodic work that runs at the same frequency, in which case the profiler can miss expensive code entirely or overcount it.
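
The lockstep effect can be shown with a small simulation. This is a toy model with made-up numbers: periodic work runs at exactly 100Hz and is busy for the first 10 ticks of each 99-tick period, and time is counted in ticks of 1/9900 s so that both sample rates are exact.

```cpp
#include <cassert>
#include <cstdint>

// One tick is 1/9900 s: a 100 Hz period is 99 ticks, a 99 Hz period is 100 ticks.
constexpr std::uint64_t kEventPeriod = 99;  // periodic work running at 100 Hz
constexpr std::uint64_t kBusyTicks = 10;    // busy for the first 10 ticks of each period

// Counts how many of `samples` profiler samples land inside the busy window.
std::uint64_t count_hits(std::uint64_t sample_period, std::uint64_t offset,
                         std::uint64_t samples) {
  assert(sample_period > 0);
  std::uint64_t hits = 0;
  for (std::uint64_t k = 0; k < samples; ++k) {
    // Position of this sample within the current period of the periodic work.
    std::uint64_t phase = (offset + k * sample_period) % kEventPeriod;
    if (phase < kBusyTicks)
      ++hits;
  }
  return hits;
}
```

Sampling at 100Hz (`sample_period` 99) with an offset of 50 ticks never sees the busy window in 990 samples, and with an offset of 0 it sees it every time. Sampling at 99Hz (`sample_period` 100) drifts through every phase and records 100 hits out of 990, which matches the true busy fraction of 10/99.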

@ryao
Contributor Author

ryao commented Oct 1, 2018

On second thought, there is a better way of doing time in this sort of application. Instead of polling the time, we could make this interrupt driven: a timer sets a volatile flag variable that controls whether we flush, and the flush code checks that flag. When it is set, it would unset it and do a flush. This should be less expensive than polling in this function. Maybe this could work for that:

http://www.cplusplus.com/reference/future/future/

I would need to study the code, but perhaps we could make the flush itself timer driven so that we do not need that check and can just rely on the timer.
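
A rough sketch of the flag idea, with the timer itself left out. It uses `std::atomic<bool>` rather than a plain volatile variable, since volatile alone does not provide synchronization between threads in C++. `FlushRequest` and `flush_if_requested` are hypothetical names, not existing DXVK code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flag shared between a timer thread and the submission path.
class FlushRequest {
public:
  // Called from the timer, for example once per interval.
  void signal() { m_pending.store(true, std::memory_order_release); }

  // Called on the hot path: clears the flag and reports whether it was set,
  // so each signal triggers at most one flush.
  bool consume() { return m_pending.exchange(false, std::memory_order_acquire); }

private:
  std::atomic<bool> m_pending{false};
};

// Hot-path helper: runs `flush` only when the timer asked for it.
template <typename Flush>
bool flush_if_requested(FlushRequest& request, Flush&& flush) {
  if (!request.consume())
    return false;
  flush();
  return true;
}
```

The hot path then pays for one atomic exchange instead of a clock lookup. Whether that is actually cheaper in practice would need measuring.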

@pchome
Contributor

pchome commented Oct 1, 2018

rdtsc

Note: the profile may contain calls from perf itself, and rdtsc is very likely one of them.
Edit: Or it is probably called from DXVK_HUD, which is not a target for optimization.

Also, I want to profile at least something small, to prove profiling can be done on wine applications (*.exe/*.dll).

Also, I hope that you are using a sampling rate of 99Hz or 997Hz.

Yes, I planned to use -F99 on something "big", but I'll try it on triangles, thanks.

Also, I believe my hardware is not fully suitable for profiling multithreaded applications (no -b, :pp, LBR, PEBS, ... support). At least this explains why profiling is easily reproducible on the ./sort example and constantly fails on DXVK.
But if someone can generate such a profile on an Intel system, it can be reused on non-Intel systems (I suppose).

@pchome
Contributor

pchome commented Oct 1, 2018

Meanwhile I just built DXVK using AutoFDO, but w/o any optimizations:
-O0 and -fwhole-program/"unity" build disabled. Optimization levels other than -O0 produce failing *.so files.

Edit: I see no reason to measure its performance at this stage.

@ryao
Contributor Author

ryao commented Oct 1, 2018

@pchome Would you try out this patch and see if it makes a difference in your flame graph?

b9b86b4

@pchome
Contributor

pchome commented Oct 1, 2018

  1. Why don't you generate profile data on your side? (It was easier than I thought to set up and configure all the required parts.)
  2. https://github.com/doitsujin/dxvk/blob/master/src/util/thread.h#L14

@ryao
Contributor Author

ryao commented Oct 1, 2018

@pchome

  1. I have not found time to fix my local winelib builds. I am spending more time on this than I should as it is, so fixing that is going to wait for a while.
  2. I fixed this to use the dxvk::thread class. I did a 64-bit winelib build test, but as you know, I can't do winelib runtime tests at the moment. If you are willing to try it out, here is the revised version:

35ca9bc

@ryao
Contributor Author

ryao commented Oct 1, 2018

I thought about it some more. I am not certain that a timer is even necessary. If the GPU is about to run out of things to do, then giving it more seems like a good idea. I don't see the point of having a timeout. We could try deleting this code and seeing how that works.

@pchome
Contributor

pchome commented Oct 2, 2018

I did a 64-bit winelib build test, but as you know, I can't do winelib runtime tests at the moment.

Why?
Once you built it:

#!/bin/sh
mkdir -p $(pwd)/dxvk-test

export WINEPREFIX="$(pwd)/dxvk-test"
export WINEDLLOVERRIDES="d3dcompiler_47,d3d11,dxgi=n"
export DXVK_HUD=version,devinfo,fps

cp /tmp/lib64/dxgi.dll.so  dxvk-test/drive_c/windows/system32/dxgi.dll
cp /tmp/lib64/d3d11.dll.so  dxvk-test/drive_c/windows/system32/d3d11.dll

wine d3d11-triangle.exe

That's all.

@ryao
Contributor Author

ryao commented Oct 2, 2018

@pchome I am using Proton. I don't know why it does not work. It is on my todo list. I doubt that I will have time to do anything else here until later in the week.

@ssorgatem
Contributor

Last time I tried, the winelib builds didn't work on Proton because Proton's Wine version was not recent enough and winelib DXVK couldn't use native Vulkan calls.

@pchome
Contributor

pchome commented Oct 2, 2018

You can have different WINE versions for tests and measurements. What is the problem? I don't get it.
I have four of them at the same time.

If the problem is somewhere in your distro, then:

  • You can grab winegcc, winebuild, etc. from any wine-3.15+ .deb, .rpm ...
    And put them into your $PATH using different names, or use full path in build-wine32.txt
  • You can grab whole WINE from PlayOnLinux site (I hope they fixed vulkan in their builds), or same .deb, .rpm ...
    And use script
#!/bin/sh

tst_wine="/var/tmp/wine-3.17"

# Export the variables so that the wine process actually sees them.
export WINEPREFIX="/path/to/pfx"

export WINE="${tst_wine}/bin/wine"
export WINESERVER="${tst_wine}/bin/wineserver"
export WINELOADER="${tst_wine}/bin/wine"
export WINEDLLPATH="${tst_wine}/lib64/wine"

"${WINE}" d3d11-triangle.exe

@ryao
Contributor Author

ryao commented Oct 2, 2018

I am juggling several different things. Some things just get assigned a lower priority than others. I did a prototype of that patch a few days early, so do we really need to discuss how I have yet to allocate more time to try again with winelib?

@pchome
Contributor

pchome commented Oct 2, 2018

do we really need to discuss how I have yet to allocate more time to try again with winelib?

You are spending more time randomly sticking in code you can't even test.

You think it's OK to ask people to spend their time testing random changes.
No one is asking you to set up a winelib build. Profile your MinGW build; I don't see any difference.

Since this issue is not my blog, I'm reporting my experience and expect participants can reproduce it.
So yes, if you have something not working to "Explore advanced toolchain optimizations (e.g. PGO, LTO, whole program optimization, etcetera)" -- I'll try to help.

My interest here is AutoFDO, PGO, LTO and -fwhole-program, but not DXVK optimization process.
So if I can't get feedback here, then I'll find a better place to continue my investigations.

@ryao
Contributor Author

ryao commented Oct 2, 2018

The previous code had a call to an external symbol. The compiler cannot optimize across function calls to external symbols. The patched code is therefore easier for the compiler to optimize.

Before I put this down, I made another version of the patch that does away with the added atomics, which makes the code even more compiler optimization friendly. I’ll push that version when I am back at my Gentoo development workstation.

For what it is worth, doing away with atomics also improves compiler optimization opportunities, but it can be absurdly difficult to do right. The amount of code that you see changed to do that is disproportionately small compared to the amount of effort.

During the course of this, we will probably find other source code changes that can make the compiler’s optimization passes do a better job. We really ought to make them in addition to experimenting with compiler optimizations.

Also, I tested all versions that I pushed, but only as native binaries because I am not set up for doing tests with winelib builds yet. Just as you are not set up to build native binaries and can only test winelib builds, I am not set up for winelib builds (at least doing usable ones) and I can only test the native binaries. It would take time to set up something else. I estimate it would take several hours to debug why my winelib builds do not work and I don’t have time for that. It is on my todo list. It will be done eventually.

It might only look like a few minutes to you (and perhaps it is), but I have been sucked into rabbit holes that should take only a few minutes in the past. Writing the patch was a small rabbit hole. I am not willing to jump into another one without adequate time to dedicate to it.

Lastly, your triangles test is a horrible way of profiling for optimization opportunities, whether they are opportunities for better compilation or something else. Getting the time does not involve a system call thanks to the vDSO (and possibly also a patch done to Proton to stop using CLOCK_MONOTONIC_RAW). There is very little that we can learn from profiling it because it does not exercise the code in a realistic way. Proper profiling requires running something that exercises the graphics API in a realistic way, such as a Unigine benchmark.

@ryao
Contributor Author

ryao commented Oct 3, 2018

I had a chat with one of the LLVM experts from BNL in person yesterday about this. The only thing he could add that we were not already doing was to try building with Clang. I had hoped that he could suggest some compiler options known to generate better code at the expense of absurdly long compilation times, but he is not familiar with those.

I heard about superoptimization from a researcher from the Institute for Advanced Computational Science who does work in compiler optimization. With superoptimization, builds can take days, weeks or even months, but the performance of the generated code is better than from normal optimization. Someone might want to look into testing every tool that comes up when googling “gcc superoptimizer” or related queries.

I am withholding names because I do not think the people involved would want to receive emails from random people on this topic. The researcher that I mentioned receives so many emails that she only responds to emails with specific keywords in the subject, so random people who email her are unlikely to receive a response.

@ryao
Contributor Author

ryao commented Oct 3, 2018

I have come to the conclusion that all of the low-hanging fruit has been found. Further improvements will come from making code changes according to profiling.

Also, I tested that patch. It makes no improvement in any game, even in the fairly unrealistic Rise of Nations: Extended Edition title screen.

6 participants