Adopt intrusive data structures for better performance #3796
Comments
This benchmark is open source. The source files are: https://github.com/wichtounet/articles/blob/master/src/intrusive_list/bench.cpp The sources can be obtained locally by running:

On my modern Ryzen 7 5800X system with g++ 12.3.1_p20230526, I can confirm that the benchmark still shows significant advantages for the intrusive lists over the std::list in most workloads. For example, here is an excerpt from the sorting part of the benchmark:
The benchmark uses generics to such an extreme that it is difficult to tell what is managing the lifetime of objects. If std::list is managing it, I would expect it to do the equivalent of an intrusive data structure behind the scenes, such that std::list should not lose to an intrusive data structure, yet it is losing. Either the object lifetimes are managed externally, or there is some overhead in the std::list that makes it slower than an intrusive linked list even when it should not be. Someone more familiar with C++ should be able to tell how the object lifetimes are managed.
Do you have any real-world numbers on how much CPU time is actually spent operating on those data structures in actual games? I will not adopt boost as a dependency, and writing our own containers is only ever an option if there is a significant amount of performance to be gained. We already have some in places where it matters.
It should not be possible to directly get real-world numbers on how much CPU time is actually spent operating on those data structures, since the compiler inlines code for operating on C++ STL containers into consumers. It would be more feasible to find hot code paths, replace the containers there with intrusive versions and benchmark that, but unfortunately, I do not know how to profile PE binaries on Linux to find the hot code paths. :/
As I wrote earlier, Boost.Intrusive is a header-only library. Just copy the relevant header into the source tree and you can use it. There is no need to depend on Boost itself. That said, Boost is not the only place where you can get intrusive containers, although it is the only implementation I could find that has public micro-benchmarks comparing its intrusive list against std::list. There are other options for such containers. For example:
I haven't really seen it show up when profiling.
For profiling PE binaries you can use my samply fork at https://gist.github.com/ishitatsuyuki/09eef31e16e5247718063cb2390cdc37. Like others, I haven't really seen the data structures in question in the profile. They probably can be optimized, but the returns are extremely diminishing.
@ishitatsuyuki I have never seen a profiler capable of showing time spent in inlined code. Is samply able to do that? My expectation is that the time spent on a data structure would appear as time spent in some caller, with, at most, the memory allocator making an appearance.

The memory allocator is non-deterministic, so it is possible that it will be fast most of the time, then super slow and then fast again, which could appear to be minimal CPU time in total, but in reality include one or more latency spikes. After all, fast things do not typically appear in profiles, but slow things do, although if something is only slow occasionally, it will not show up as very much. :/

Ideally, we would want to generate histograms of function latencies. Then we would pick the ones with the biggest outliers and/or the most CPU time, replace the data structures used with intrusive versions and rebenchmark to see how the histograms were affected.

That said, since C++ memory allocations, at least in Wine, end up going into malloc as far as I know, this seems relevant for how bad these spikes can be: https://www.forrestthewoods.com/blog/benchmarking-malloc-with-doom3/ Forrest Smith benchmarked malloc latencies:
His measured worst case malloc latency was 201 microseconds. This is not the absolute worst case malloc latency; it is just the worst case he measured when profiling Doom 3 for 7 minutes on his high end Windows system. His 99.999th percentile was 95 microseconds. Based on his data, he observed a 0.1ms latency spike from malloc roughly once every 8 seconds.

Of course, these numbers are from neither DXVK on Linux nor a modern game, but I suspect the latencies are not orders of magnitude different from what we would see if we measured on Linux. If we gathered data from DXVK, there is a decent chance that we would find that some minor performance gains could be made by avoiding unnecessary memory allocations in containers.

Note that the 0.1ms-every-8-seconds spike in Forrest Smith's data is somewhat misleading too: there are plenty of smaller spikes that are more frequent, and many malloc calls are being done per second, so the additional latency can add up into bigger intra-frame spikes. As Forrest Smith wrote:
We cannot avoid malloc entirely, but avoiding data structures that add extra malloc operations could be a worthwhile micro-optimization. Of course, we definitely should profile to find where the opportunities to do this are. We could probably use samply to find where the most CPU time is spent, see which containers were inlined into that code, replace them with intrusive versions and then do before/after comparisons. It would definitely be better than blindly guessing, although it would not be as good at identifying problematic data structures and quantifying improvements as the histogram latency approach. I know funclatency from iovisor/bcc can profile function latencies inside the kernel and generate histograms per function, but I am not sure how the equivalent would be done for DXVK.
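As a rough in-process stand-in for the histogram idea (this is not funclatency, just a hedged sketch of bucketing call durations by log2 nanoseconds):

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hedged sketch: time each call of a function and bucket the duration by
// log2(nanoseconds), producing a latency histogram that makes rare spikes
// visible even when total CPU time looks small.
struct LatencyHistogram {
    std::array<uint64_t, 64> buckets{};

    template <typename F>
    auto measure(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        auto result = f();
        auto t1 = std::chrono::steady_clock::now();
        uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        // Find the log2 bucket for this duration.
        std::size_t bucket = 0;
        while (ns > 1 && bucket < 63) { ns >>= 1; ++bucket; }
        ++buckets[bucket];
        return result;
    }

    uint64_t total() const {
        uint64_t n = 0;
        for (auto b : buckets) n += b;
        return n;
    }
};
```

Wrapping a suspect code path in `measure` and dumping `buckets` after a play session would show whether the tail buckets (latency spikes) are populated.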
samply can deal with inlined frames correctly as long as debug symbols can be found. |
Random thoughts on this... It's currently rather cumbersome to experiment with alternative data structures because DXVK uses
Ordered
On Linux, microarchitectural profiling is very useful. On Intel CPUs, use either VTune or toplev from https://github.com/andikleen/pmu-tools/. On AMD, use uProf.
Randomly picking data structures that are probably accessed once per frame is not going to add anything to the discussion. Before claiming that any of that is a problem, please provide profiling evidence, or at least suggest something that could be implemented to quantify the impact.

Here's a samply profile you can load into https://profiler.firefox.com. The dxvk-cs thread is mostly filled with driver overhead and our other state tracking overhead. None of the data structures you mentioned are visible there. PID 8763 2024-01-15 12.51 profile.json.gz

Our biggest problem right now is the constant refcount manipulation going on in the CS thread and the fence (recycling) thread, but solving that without introducing correctness issues is extremely nontrivial and would probably take effort on the scale of rewriting the entire codebase.
I was skeptical that it was possible to even see those data structures due to C++'s inlining, but to my surprise, I found a stack that does show the data structure in the profile that you provided:
That said, the profile suggests that you are right that this micro-optimization is not worth the time it takes to implement it. I am closing this.
That is a hard problem. Thanks for letting me know. If I do more eyeballing for optimization opportunities, I will focus on that area.
I notice that DXVK is using STL data structures everywhere. These are well known to be non-intrusive data structures. Non-intrusive data structures allocate their nodes independently of the objects being stored, while intrusive data structures allocate them as part of those objects. The benefits of intrusive data structures are fewer memory accesses and fewer memory allocations. They are a bit old, but there are benchmarks comparing boost::intrusive::list against std::list, with the intrusive version outperforming the STL version nicely:
https://baptiste-wicht.com/posts/2012/12/cpp-benchmark-std-list-boost-intrusive-list.html
https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive/performance.html
The things that make an intrusive linked list perform better than the STL's non-intrusive linked list also make intrusive versions of other data structures outperform their non-intrusive versions. Consequently, an intrusive self-balancing binary tree is likely to outperform the STL's unordered_map.
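To make the distinction concrete, here is a minimal hand-rolled sketch of the intrusive idea (the `Texture` payload type is hypothetical; Boost.Intrusive provides production-quality hooks of the same shape):

```cpp
#include <cassert>

// The link pointers live inside the object itself, so linking an object
// into the list performs zero heap allocations.
struct ListHook {
    ListHook* prev = nullptr;
    ListHook* next = nullptr;
};

struct Texture {          // hypothetical payload type for illustration
    int id = 0;
    ListHook lru_hook;    // intrusive hook: lets this object sit in one list
};

struct IntrusiveList {
    ListHook head;        // circular list with a sentinel node
    IntrusiveList() { head.prev = head.next = &head; }

    void push_front(ListHook* h) {
        h->next = head.next;
        h->prev = &head;
        head.next->prev = h;
        head.next = h;
    }
    static void unlink(ListHook* h) {   // O(1), no knowledge of the list needed
        h->prev->next = h->next;
        h->next->prev = h->prev;
        h->prev = h->next = nullptr;
    }
    bool empty() const { return head.next == &head; }
};
```

Compare this with std::list<Texture>, which must heap-allocate a node (and copy or move the Texture into it) on every insertion.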
Consider ./src/util/util_lru.h, which contains an example of a multi-index container.

This is roughly what happens:
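As a hedged illustration (this is not the actual util_lru.h code, just the typical shape of such an STL two-index LRU), note where the hidden allocations happen:

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical sketch of a non-intrusive two-index LRU: recency order in a
// std::list, lookup in a std::unordered_map of list iterators. Refreshing a
// hot key erases and reinserts its list node, paying a node deallocation
// plus a fresh allocation on every single hit.
class LruNaive {
public:
    void insert(uint64_t key) {
        auto it = m_map.find(key);
        if (it != m_map.end())
            m_list.erase(it->second);   // frees the old list node
        m_list.push_front(key);         // heap-allocates a brand new node
        m_map[key] = m_list.begin();    // map node may also allocate
    }
    uint64_t leastRecent() const { return m_list.back(); }
    std::size_t size() const { return m_list.size(); }
private:
    std::list<uint64_t> m_list;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> m_map;
};
```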
Before we consider an intrusive data structure version of this, let us consider this enhancement:
Then the code becomes:
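The enhanced code is not reproduced above, but a hedged sketch of the idea, assuming the enhancement is to keep the list iterator in the map and splice on a hit, would look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical sketch: on a hit, std::list::splice relinks the existing
// node to the front. No node is freed and no node is allocated; the stored
// iterator remains valid because splice does not invalidate iterators.
class LruSpliced {
public:
    void insert(uint64_t key) {
        auto it = m_map.find(key);
        if (it != m_map.end()) {
            // Hit: pure pointer relinking, zero allocations.
            m_list.splice(m_list.begin(), m_list, it->second);
            return;
        }
        m_list.push_front(key);          // only a miss allocates
        m_map.emplace(key, m_list.begin());
    }
    uint64_t leastRecent() const { return m_list.back(); }
    std::size_t size() const { return m_list.size(); }
private:
    std::list<uint64_t> m_list;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> m_map;
};
```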
We have avoided the implicit allocations. Now this is truly O(log(N)). However, we can still improve the constant factor by dropping the final O(log(N)) operation through the use of intrusive data structures. Imagine that we replace std::unordered_map and std::list with boost/intrusive/avltree.hpp and boost/intrusive/list.hpp respectively. The semantics are similar to those of the STL data structures, except that they operate on T* and do not manage object lifetimes.
Consequently, this is what would happen:
Notice that no matter how you implement step 3, be it via remove+add or splice, you still have an O(1) operation. Unlike the STL's non-intrusive linked list, intrusive linked lists do not need much thought to be used efficiently.
The memory location of the object has not changed, so there is no need to update the AVL tree index. The only way this could be made even faster would be to use a hash table, although getting the sizing and hash function right is a pain that you do not have with self balancing binary search trees.
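A hand-rolled sketch of this shape (simplified for brevity: the index here is a plain unordered_map of pointers rather than an intrusive AVL tree, but the list hooks are embedded in the entry, which is the part that makes the recency update allocation-free):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: the LRU entry carries its own link pointers, so a
// "touch" is a pure relink. The object never moves in memory, so the index
// pointing at it never needs to be updated.
struct Entry {
    uint64_t key = 0;
    Entry* prev = nullptr;
    Entry* next = nullptr;
};

class IntrusiveLru {
public:
    void insert(Entry* e) {            // caller owns e; no allocation here
        linkFront(e);                   // (except possibly inside the index)
        m_index[e->key] = e;
    }
    Entry* touch(uint64_t key) {
        auto it = m_index.find(key);
        if (it == m_index.end()) return nullptr;
        Entry* e = it->second;
        unlink(e);                      // O(1): relink only,
        linkFront(e);                   // the index stays valid untouched
        return e;
    }
    Entry* leastRecent() const { return m_tail; }
private:
    void linkFront(Entry* e) {
        e->prev = nullptr;
        e->next = m_head;
        if (m_head) m_head->prev = e; else m_tail = e;
        m_head = e;
    }
    void unlink(Entry* e) {
        if (e->prev) e->prev->next = e->next; else m_head = e->next;
        if (e->next) e->next->prev = e->prev; else m_tail = e->prev;
    }
    Entry* m_head = nullptr;
    Entry* m_tail = nullptr;
    std::unordered_map<uint64_t, Entry*> m_index;
};
```

With Boost.Intrusive, the tree hook would be embedded in Entry as well, removing the remaining map-node allocations.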
Let's look at another example from that file:
This is roughly what happens:
Now let us imagine that we have written an intrusive version of this using an intrusive AVL tree and an intrusive linked list. Here is a simple version:
This is truly an O(log(N)) insert operation. When the lookup hits, it is possible to achieve a smaller constant factor by simply updating the external pointers to point to the new object and copying the outgoing pointers from the old object into the new object. Some intrusive AVL tree libraries support doing this via their library functions; I am not sure whether Boost's AVL tree does.
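Boost.Intrusive exposes this single-descent pattern as insert_check/insert_commit on its tree containers. As a stdlib analog (a hedged sketch with a hypothetical `Item` type, using one lower_bound descent reused as an insertion hint), the shape is:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Single-descent insert-or-replace: one O(log N) tree walk locates the
// position; a hit updates in place, a miss reuses the position as a hint
// so the insertion itself is amortized O(1).
struct Item { uint64_t value; };   // hypothetical payload

Item* insertOrReplace(std::map<uint64_t, Item>& tree, uint64_t key, Item item) {
    auto hint = tree.lower_bound(key);            // the only O(log N) descent
    if (hint != tree.end() && hint->first == key) {
        hint->second = item;                      // hit: update in place
        return &hint->second;
    }
    auto it = tree.emplace_hint(hint, key, item); // miss: insert at hint
    return &it->second;
}
```

Contrast this with a naive find-then-insert, which walks the tree twice.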
In any case, it should be clear that adopting intrusive data structures would be a performance improvement, and it is an improvement that applies not only to multi-index containers, but to any case where you manage object lifetimes yourself rather than letting a container handle it. In both cases, you avoid unnecessary memory allocations/deallocations, as well as additional indirections (e.g. storing iterators/pointers as values) needed to access a value through a container.
Every object that will ever be put into an intrusive data structure needs to be modified in advance to support it, which is the main downside of using intrusive data structures. This "downside" can be beneficial when debugging, since you can see whether an object is inside a particular data structure just by examining it. It also makes it easier to keep track of which data structures might contain a specific object by adding comments at the embedded nodes.
That said, Boost has its own comparison table:
https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive/intrusive_vs_nontrusive.html#intrusive.intrusive_vs_nontrusive.properties_of_intrusive
For particularly small objects (e.g. 64 bytes or less) that do not need multi-index containers, where an unordered_map is currently used you would probably find that a B-tree performs better than an intrusive data structure (at least, that has been the experience of the OpenZFS project). However, the intrusive version should still outperform the non-intrusive version whenever object lifetimes are managed outside of the container. Most C projects (including the Linux kernel) use intrusive data structures almost exclusively; the main exceptions would be arrays and B-trees. Intrusive data structures are something you can use just about everywhere.
The Boost intrusive library is header-only, so it is possible to use it to adopt intrusive data structures without adopting the larger Boost library:
https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive.html
Rewriting DXVK to use intrusive data structures everywhere would probably not be an efficient use of time, so my suggestion is to identify hot code paths that use the STL containers and modify them to use the boost intrusive containers.