Performance problems/regression with many library dependencies #184

Open
kamrann opened this issue Mar 28, 2022 · 17 comments
Labels
performance (Performance issues or enhancement, not really a bug.)

Comments


kamrann commented Mar 28, 2022

Observed behaviour
b.exe is maxing out multiple cores for long periods during a b update, even when the project is already up to date. Same behaviour is seen for --dry-run mode.

build2 0.15.0-a.0.f7ff90672af8
libbutl 0.15.0-a.0.e9cf760289b4

Details/prior behaviour
Previously, running b update on an up-to-date configuration would take a little while to count through 'targets to update', but once that was done it would very quickly reach '100% of targets updated' and complete. Now it spends minutes going from 96% to 100%.

Unfortunately, having not worked with my project much for a number of months, I'm not able to pinpoint exactly when or what change is responsible for this issue (though I can investigate further given some direction).

  • It could be a regression, in which case I'd lean towards it probably being quite recent, but can't be certain.
  • It's possible it's not a regression at all, but a result of my adding boost as a bpkg dependency (my code hasn't changed, but previously was using boost via a manually added include path).

In case it's relevant, my project is by no means especially large, but it does have a high ratio of libraries to code. It's split up into ~40 libraries (mostly shared) in ~15 packages, with ~15 further 3rd party dependency libraries, plus a few boost dependencies (which transitively ends up being a lot of boost dependencies).

Profiling
I did some very minimal profiling which showed 5 worker threads spending about 50% of their time in dynamic_cast, with the remainder being map lookups/path_traits comparisons. This suggests the real problem is likely just algorithmic complexity - generally execution is ~30 recursive calls deep in process_libraries.

b-update-profiler-call-graph.txt

@boris-kolpackov
Member

It's possible it's not a regression at all, but a result of my adding boost as a bpkg dependency (my code hasn't changed, but previously was using boost via a manually added include path).

I believe this is the most likely explanation but it would be helpful to confirm that this is definitely not a regression. One way to do it would be to try to re-create the same setup using the 0.14.0 version and see if there is any difference (another option, if you can no longer build your project with 0.14.0, would be to create a representative test and try it with both 0.14.0 and staged).

Even if it turns out that this is not a regression, there are still some tricks we can try. But I would like to first determine conclusively that this is not a regression.


kamrann commented Mar 30, 2022

Ok, I've reinstalled 0.14 and built the project in a new configuration with it.
I do seem to be seeing some difference, perhaps a 10-20% increase in times going from 0.14 to recent staged; but I'd have to do more rigorous testing to be certain this isn't just down to some other slight differences in the environment.

Regardless, it's clear that the fundamental issue relates to the number of library dependencies/boost packages, and is not a regression. b -n update: <out-path> is taking 4 minutes plus on an already up to date configuration on both 0.14 and staged.

I'm curious what the high-level algorithm is for process_libraries; I've used a sampling profiler so don't have call counts, but for it to be taking this long with all the time spent in such low level operations as dynamic_cast, the function must be getting called an incredibly high number of times.
Anyway, let me know if there's something further I can test to help investigate.

@boris-kolpackov
Member

I'm curious what the high-level algorithm is for process_libraries; I've used a sampling profiler so don't have call counts, but for it to be taking this long with all the time spent in such low level operations as dynamic_cast, the function must be getting called an incredibly high number of times.

There is some background on this in issue #114 but the gist of it is that for a heavily inter-dependent library like Boost where most of the dependencies are interface dependencies, we could end up with a lot of visits to each library (as in millions). We've done quite a bit of work on this for 0.14.0 and in our tests things were quite reasonable (e.g., an up-to-date check is under a second).
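To give a feel for the numbers, here is a small standalone C++ program (a toy illustration only, not build2 code; the layer width and depth are made up) that builds a layered "diamond" graph where every library in one layer is an interface dependency of every library in the layer above, and counts the encounters an unpruned recursive traversal makes:

#include <cstdint>
#include <iostream>
#include <vector>

struct lib { std::vector<int> deps; }; // indices of interface dependencies

static std::uint64_t visits = 0;

void process (const std::vector<lib>& g, int i)
{
  ++visits; // every encounter is processed in full (no pruning)
  for (int d: g[i].deps)
    process (g, d);
}

int main ()
{
  const int depth = 12, width = 3;

  std::vector<lib> g (depth * width);
  for (int l = 0; l + 1 < depth; ++l)
    for (int i = 0; i != width; ++i)
      for (int j = 0; j != width; ++j)
        g[l * width + i].deps.push_back ((l + 1) * width + j);

  for (int i = 0; i != width; ++i)
    process (g, i); // start from each top-layer library

  std::cout << "libraries: " << g.size () << ", visits: " << visits << '\n';
}

With only 36 libraries this already makes roughly 800,000 visits (on the order of width^depth), which is why a deep, diamond-heavy graph like Boost's can easily reach millions.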

What I suspect is happening in your case is that, due to the large number of libraries involved and those libraries having Boost libraries as interface dependencies, you are in a sense "breeding" duplicate interface dependencies. To test this hypothesis, would you be able to add the $cxx.deduplicate_export_libs() call like below in each of your libraries that has an interface dependency and see if this makes a difference?

import intf_libs  = libboost-array%lib{boost_array}
...
import intf_libs += libboost-unordered%lib{boost_unordered}

intf_libs = $cxx.deduplicate_export_libs($intf_libs)


kamrann commented Apr 2, 2022

So it certainly made a difference. Here is a breakdown of the various tests I've done in case the extra info is useful.

  1. I first added the deduplicate directive only to the library targets which had a direct interface dependency on a boost library (there were 4 of them). This yielded about a 25% time reduction in a dry run update on an up-to-date config.
  2. Next I went through all of my libraries and added the deduplication of interface libraries. This gave around a 90% improvement relative to the base case, but that still meant ~20s for the up-to-date dry run.
  3. I noticed that of the 4 libraries with a direct boost dependency, one was my core library which is a direct dependency of all others (and there are also lots of diamond shapes in my dependency graph, just as within boost). It had a dependency on boost-graph. I disabled that dependency, and supplied a direct -I for boost-graph in the couple of places where this dependency was actually used. With this change (and without any deduplication directives added), the dry run update was back to how it was before, almost instant.

I'm not sure if there is any obvious reason why (2) would fall so far short of (3) in its effects?

One other thing I noticed, somewhat unrelated but perhaps worth looking into. Adding/removing the deduplication directive resulted in recompilation of translation units. To quote you from issue #114:

In case of compilation, while a library can conspire to undefine other library's macro, we can probably ignore this as a pathological case and skip all of the subsequent options. We could also tighten this a bit and skip subsequent options only if they are all -I and -D.

I'm not sure what approach you ended up taking regarding the -I explosion in that issue, but it would seem from the above that in theory (and ideally), adding or removing duplicate dependency libraries should not result in compilation being considered out of date?


boris-kolpackov commented Apr 3, 2022

This gave around a 90% improvement relative to the base case, but that still meant ~20s for the up-to-date dry run.

Yes, that's better but still not usable. Do you think you would be able to describe your project hierarchy (or better yet, create and make available a test one) so that I can try to reproduce this? On my machine an up-to-date check for the whole of Boost (including a test for every library) takes a bit over a second, which is still not great but probably usable. So I am wondering if there is something special about your project, your hardware/OS, or something else.

I'm not sure if there is any obvious reason why (2) would fall so far short of (3) in its effects?

I don't think this is surprising: in (3) you've chopped off a large part of your build graph and replaced it with an assumption that boost-graph and all its dependencies are up-to-date and available via a single -I path.

One other thing I noticed, somewhat unrelated but perhaps worth looking into. Adding/removing the deduplication directive resulted in recompilation of translation units.

I don't think this is unexpected: deduplication is likely to change the order in which libraries are traversed which in turn may alter the order of -I/-D options. And for change tracking purposes we just hash them which means both the content and order of options matter.
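To make the order-sensitivity concrete, here is a minimal sketch (not build2's actual change-tracking code, which has its own hashing machinery) of an order-sensitive checksum over an option list:

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Combine the hashes of the options in sequence so that the position of
// each option affects the result.
std::size_t checksum (const std::vector<std::string>& opts)
{
  std::size_t h = 0;
  for (const std::string& o: opts)
    h = h * 31 + std::hash<std::string> () (o);
  return h;
}

int main ()
{
  std::vector<std::string> a {"-I/boost", "-I/mylib", "-DNDEBUG"};
  std::vector<std::string> b {"-I/mylib", "-I/boost", "-DNDEBUG"};

  // Same options, different order: the checksums will almost certainly
  // differ, so the translation units would be considered out of date.
  std::cout << (checksum (a) == checksum (b) ? "same" : "different") << '\n';
}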


kamrann commented Apr 3, 2022

Yep, when I get a chance I'll do some investigation to try to see if I can repro this in a simpler project. I'll also look into generating a dependency graph for my project, since that may be instructive.


kamrann commented Apr 4, 2022

I don't think this is surprising: in (3) you've chopped off a large part of your build graph and replaced it with an assumption that boost-graph and all its dependencies are up-to-date and available via a single -I path.

This is true. I had assumed that there wasn't much being chopped off since I only removed the boost-graph dependency and still had a few other boost dependencies that would likely keep most of boost-graph's transitive dependencies around, but it turns out boost-graph specifically is a bit of a monster.

So here's a slightly simplified dependency graph I've generated. Note that it doesn't contain boost transitive dependencies.
[image: dependency graph]
So as noted previously, the killer seems to be the combination of boost-graph sitting at the top there and component-core-intf depending on it whilst being a dependency of almost everything else. But this is nothing new really: whatever the specifics of my particular project (it can definitely use some streamlining, and I'll for sure be pruning that boost-graph dependency), the underlying issue is that the algorithmic complexity is non-linear in graph depth, so there comes a point where relatively small increases in project complexity push the processing time up very quickly.

Running bpkg status -r libboost-graph is very instructive by the way (get ready with Ctrl-C!).
I guess the question is, is there a reason process_libraries needs to recurse from nodes which it's already visited and recursed from before? Perhaps there's a way to cache results somehow so subgraphs need only be recursively processed once?


boris-kolpackov commented Apr 11, 2022

Thanks for the graph, quite elaborate, I must say ;-). I've created a test repository that attempts to simulate this by picking (1) nodes that introduce boost dependencies as well as (2) "breeder" nodes that would cause the multiplication of such dependencies. Because I've only recreated a small fraction of the nodes, I compensated for that by duplicating kbuild-temp as kbuild-perm (which should essentially double the dependency graph), plus I've added a bunch of boost libraries with a lot of dependencies to component-core-intf in addition to libboost-graph. I've also omitted any deduplication for now. The result is here: https://github.com/boris-kolpackov/boost-dependency

To test, I measured the time it took to do an up-to-date check (i.e., when everything is already up to date) of building-edit. On my machine (Linux, Xeon E-2288G, good NVMe SSD) it takes ~1.2s. I've also tried on Windows from our CI VM (QEMU/KVM running on Xeon E5-2695v2, good NVMe SSD) and there it took ~7s. Could you try this on your machine? This way I would be able to translate any improvements I make in this test to your real project.

Running bpkg status -r libboost-graph is very instructive by the way (get ready with Ctrl-C!).

Yes, we are aware of this (there is also -i|--immediate which could be more Boost-appropriate). I am not yet sure what we should (if anything) do about this.

I guess the question is, is there a reason process_libraries needs to recurse from nodes which it's already visited and recursed from before? Perhaps there's a way to cache results somehow so subgraphs need only be recursively processed once?

I think I mentioned the reason in #114, but in a nutshell, we prune the graph traversal everywhere except when collecting libraries to link. And there we cannot prune because the library should appear on the command line at the position (relative to other libraries) of the last encounter as opposed to the first (this is because the order in which libraries are linked could be important).
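To illustrate with a toy example (this is an illustration, not the actual build2 code): for a diamond where exe depends on liba and libb and both depend on libc, the unpruned traversal encounters libc twice, and only the last encounter determines its position on the link line:

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using graph = std::map<std::string, std::vector<std::string>>;

// Record every encounter without pruning already-visited libraries.
void collect (const graph& g, const std::string& n, std::vector<std::string>& out)
{
  for (const std::string& d: g.at (n))
  {
    out.push_back (d);
    collect (g, d, out);
  }
}

int main ()
{
  graph g {
    {"exe",  {"liba", "libb"}},
    {"liba", {"libc"}},
    {"libb", {"libc"}},
    {"libc", {}}};

  std::vector<std::string> seq;
  collect (g, "exe", seq); // liba libc libb libc

  // Keep only the last occurrence of each library.
  std::vector<std::string> link;
  for (auto i (seq.rbegin ()); i != seq.rend (); ++i)
    if (std::find (link.begin (), link.end (), *i) == link.end ())
      link.push_back (*i);
  std::reverse (link.begin (), link.end ());

  for (const std::string& l: link)
    std::cout << l << ' '; // prints: liba libb libc
  std::cout << '\n';
}

Note that libc ends up after libb (its last encounter); a pruned traversal that kept the first encounter would place libc before libb, which would be wrong if libc's position mattered.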

I am going to take a look at this again (based on the above test, provided it is representative of the times for your real project) and see if there is anything more we can optimize. One promising idea is to prune recursively-header-only libraries since they don't link anything.

@boris-kolpackov
Member

Added the deduplicate_export_libs branch to the boost-dependency repository where there is a call to $cxx.deduplicate_export_libs() in each library. For me this change slashed the up-to-date check time from 1.2s to 0.6s.


kamrann commented Apr 12, 2022

Could you try this on your machine? This way I would be able to translate any improvements I make in this test to your real project.

To confirm, running b update in cfg-dir/building-edit, twice, timing the second run? Bang on 4s. Down to 2.4s on deduplicate_export_libs branch. I'm on Windows.

Yes, we are aware of this (there is also -i|--immediate which could be more Boost-appropriate). I am not yet sure what we should (if anything) do about this.

Ah okay. I'd assumed the cause was the same and therefore you hadn't been aware of it, but perhaps it's not doing quite the same thing then.

I think I mentioned the reason in #114, but in a nutshell, we prune the graph traversal everywhere except when collecting libraries to link. And there we cannot prune because the library should appear on the command line at the position (relative to other libraries) of the last encounter as opposed to the first (this is because the order in which libraries are linked could be important).

Yep. Have you considered the topological sort approach I suggested in that thread? It has algorithmic complexity linear in graph size: O(num_nodes + num_edges). The full recursion, if I'm guessing correctly what it's doing, I believe has complexity along the lines of worst-case O(degree^depth), where degree is the typical number of direct dependencies of a node and depth is the depth of the dependency graph; depth is the dominant factor. So if it's indeed possible to replace the unpruned recursion with a topological sort followed by a linear, ordered pass over the libraries (even if it means multiple passes), I think this performance issue would entirely disappear. Likely this would be what is wanted for the bpkg status -r case too.
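For reference, here is a minimal sketch of what I have in mind (illustrative only, with made-up library names; it glosses over the complications discussed in #114): a memoized DFS emits each library once in reverse post-order, which guarantees every library appears after all libraries that depend on it (the property the last-encounter rule provides), in O(nodes + edges):

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using graph = std::map<std::string, std::vector<std::string>>;

void visit (const graph& g, const std::string& n,
            std::set<std::string>& done, std::vector<std::string>& post)
{
  if (!done.insert (n).second)
    return; // each node is expanded only once

  for (const std::string& d: g.at (n))
    visit (g, d, done, post);

  post.push_back (n); // post-order: dependencies finish first
}

int main ()
{
  graph g {
    {"exe",  {"liba", "libb"}},
    {"liba", {"libc"}},
    {"libb", {"libc"}},
    {"libc", {}}};

  std::set<std::string> done;
  std::vector<std::string> post;
  visit (g, "exe", done, post);

  std::reverse (post.begin (), post.end ()); // exe libb liba libc

  for (const std::string& n: post)
    if (n != "exe")
      std::cout << n << ' '; // link order: libb liba libc
  std::cout << '\n';
}

The relative order of independent libraries (liba vs libb) may differ from what the recursive traversal produces, but every dependency still comes after all of its dependents, which is what matters for link order.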

@boris-kolpackov
Member

Ok, I've made a bunch of optimizations (including automatic deduplication as part of process_library() as well as pruning of recursively-header-only libraries). For me this reduced the up-to-date check from 1.2s to 0.4s, with the manual deduplication no longer making any difference. These optimizations are available in the latest staged toolchain.

Could you give it a try and see what difference it makes for you? I am interested in both my synthetic test but more importantly in your original project (without manual deduplication) that took 4m.

Also, the b update that you perform actually involves two things: the bdep-sync call (via the hook, which invokes the package manager to check that no new dependencies got introduced, etc.) and the build system update proper. While for me the bdep-sync time is part of the noise, it would be good to separate it, just to make sure there are no surprises. The easiest way to do that is to set the BDEP_SYNC=0 environment variable. Another thing worth checking is disabling progress with the --no-progress option.

Have you considered the topological sort approach I suggested in that thread?

I have, but the difficulties of implementing it (which I mentioned in this comment) are still there. More importantly, with the above optimizations, process_libraries() only takes ~10% of the execution time, at least on my synthetic test. Would you be able to capture the call graph profile from your real project?

@boris-kolpackov
Member

I've optimized things some more and with the latest staged version I get 0.3s on the synthetic test on my machine with the load/match/execute splits being 0.16/0.10/0.04. Out of curiosity I also ran this test on an M1 Mac Mini (8G version) where it takes 0.51s (0.36/0.11/0.04).


kamrann commented Apr 20, 2022

Ok results on latest staged toolchain:

boost-dependency, master branch, in /building-edit
b update --no-progress
sync on: 1.67s
sync off: 1.45s

my project (manual deduplication removed)
b update --no-progress
sync on: 6.5s
sync off: 5.3s

So certainly much improved. And you're right, the performance trace is no longer so straightforward as it was before, so the amount of useful info I can give you is limited (unless you have access to Superluminal profiler, in which case I can upload the trace). Here's an overview:
[image: profiler overview]
The latter 40% is build2::execute, which looks to be the process_libraries call from what I can see. Within that, time is split mostly between the is_a calls and lookup_original. But the first 60% of the overall time is split between a create_bootstrap_outer/bootstrap_src phase (which appears to be mostly waiting on disk reads for buildfiles) and the match phase (which is mostly CPU-bound buildfile parsing). So a bit of a mixed bag, and I guess no single clear point to optimize further.

One tangential thing I noticed that may be of interest, since it was unexpected for me: while the above timings are taken from separate configurations, I initially put both the boost-dependency project and a clone of my own project in a shared configuration. Done that way, boost-dependency with sync on was taking 2.7s, meaning the sync part was 1.2s, the same as for my project. That seemed unexpected, so I split them out into individual configurations to get the above numbers. Is there some reason that a bdep sync for a project should take longer because another larger, unrelated project happens to also be initialized in the same configuration being used?

@boris-kolpackov
Member

Thanks for the update. Definitely an improvement but still quite slow. Do you mind sharing what kind of hardware this is on (specifically, CPU, RAM, and disk)? That would help understand whether this is due to hardware or software (Windows).

Is there some reason that a bdep sync for a project should take longer because another larger, unrelated project happens to also be initialized in the same configuration being used?

Yes, this makes sense: bdep sync synchronizes all the packages in the build configuration. It can't be done in any other way since the "related" and "unrelated" projects might have common dependencies, etc.


kamrann commented Apr 21, 2022

Yes, this makes sense: bdep sync synchronizes all the packages in the build configuration. It can't be done in any other way since the "related" and "unrelated" projects might have common dependencies, etc.

Ok thanks. Just wanted to check there wasn't something unintentional going on there.

Do you mind sharing what kind of hardware this is on (specifically, CPU, RAM, and disk)? That would help understand whether this is due to hardware or software (Windows).

Sure. I'm using an Intel Hades Canyon NUC:
Intel i7-8809G 3.1GHz
32 GB RAM, 2666 MHz
Samsung 970 Pro SSD

Just to add, although clearly there is still room for improvement, with these optimizations it's now definitely back in the realm of being usable for my project, and anyway there are local workarounds I can do if I need to speed things up further. So completely understand if you feel you're at the point you need to shelve further optimization work for the moment. But of course let me know if you want me to test anything further. I'll make sure to keep a copy of my project in its current state for future comparisons.

@boris-kolpackov
Member

Ok, thanks, so it's a somewhat dated (2018) 4-core mobile CPU.

Just to add, although clearly there is still room for improvement, with these optimizations it's now definitely back in the realm of being usable for my project, and anyway there are local workarounds I can do if I need to speed things up further. So completely understand if you feel you're at the point you need to shelve further optimization work for the moment. But of course let me know if you want me to test anything further. I'll make sure to keep a copy of my project in its current state for future comparisons.

Yes, thanks, that will definitely be helpful. I think I've now picked most of the high-impact, low-hanging fruit. One idea I have for further optimization is to get rid of wildcards in package archives (i.e., all the Boost libraries) by rewriting them to the expanded list during distribution. I did a quick test by simply disabling the expansion and that reduced the load part for the synthetic test on my machine from 0.16s to 0.11s so we are likely looking at ~20% overall time reduction (likely more on Mac OS and especially Windows).

@Klaim added the performance label on Jun 23, 2022
@boris-kolpackov
Member

Another optimization idea: similar to rewriting wildcards we could add real directory buildfiles in place of the implied ones thus avoiding the match-load phase switching.
