
Adopt intrusive data structures for better performance #3796

Closed
ryao opened this issue Jan 13, 2024 · 10 comments

@ryao
Contributor

ryao commented Jan 13, 2024

I notice that DXVK is using STL data structures everywhere. These are well known to be non-intrusive data structures. Non-intrusive data structures allocate their nodes independently of the objects being stored, while intrusive data structures allocate them as part of those objects. The benefits of intrusive data structures are fewer memory accesses and fewer memory allocations. The benchmarks below are a bit old, but they compare boost::intrusive::list against std::list, and the intrusive version outperforms the STL version nicely:

https://baptiste-wicht.com/posts/2012/12/cpp-benchmark-std-list-boost-intrusive-list.html

https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive/performance.html

The things that make an intrusive linked list perform better than the STL's non-intrusive linked list also make intrusive versions of other data structures outperform their non-intrusive counterparts. Consequently, an intrusive self-balancing binary tree is likely to outperform the STL's unordered_map.
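To make the difference concrete, here is a minimal sketch, assuming a made-up Texture type (illustration only, not DXVK code): the non-intrusive std::list allocates a separate node for each element, while the intrusive boost::intrusive::list embeds the link pointers in the element itself, so linking and unlinking never touch the allocator.

    #include <cstdint>
    #include <list>
    #include <boost/intrusive/list.hpp>

    // Non-intrusive: std::list allocates a separate node for every push_back()
    // and frees it again on erase(); the element is copied into that node.
    struct Texture { std::uint64_t id; };
    std::list<Texture> textureLru;                  // owns and allocates its nodes

    // Intrusive: the link pointers are embedded in the element via the hook,
    // so linking/unlinking only rewires pointers and never calls the allocator.
    struct TextureNode : boost::intrusive::list_base_hook<> {
      std::uint64_t id;
    };
    boost::intrusive::list<TextureNode> textureLruIntrusive;  // does not own its elements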

Consider ./src/util/util_lru.h, which contains an example of a multi-index container:

    void touch(const T& value) {
      auto cacheIter = m_cache.find(value);
      if (cacheIter == m_cache.end())
        return;

      m_list.erase(cacheIter->second);
      m_list.push_back(value);
      auto iter = m_list.cend();
      --iter;
      m_cache[value] = iter;
    }

This is roughly what happens:

  1. We lookup the node in m_cache. This is O(log(N))
  2. If the node is not in the cache, we exit early. This is O(1)
  3. Then we erase the node from the list, which does an implicit free. The STL documentation claims that the operation itself is O(1), but that is only true if you ignore the implicit free, which may be as bad as O(N^3).
  4. Then we add it again, doing an implicit allocation. Again, this may be as bad as O(N^3). This is also why std::list::push_back() is documented at cppreference as being able to throw out-of-memory exceptions.
  5. We then update the iterator in m_cache, which is O(log(N)).

Before we consider an intrusive data structure version of this, let us consider this enhancement:

    void touch(const T& value) {
      auto cacheIter = m_cache.find(value);
      if (cacheIter == m_cache.end())
        return;
      m_list.splice(m_list.end(), m_list, cacheIter->second);
      m_cache[value] = std::prev(m_list.end());
    }

Then the code becomes:

  1. We lookup the node in m_cache in O(log(N))
  2. If the node is not in the cache, we exit early. This is O(1)
  3. Then we splice the node to the end of the list, which is O(1).
  4. We then update the iterator in m_cache, which is O(log(N))

We have avoided the implicit allocations. Now this is truly O(log(N)). However, we can still improve the constant factor by dropping the final log(N) operation through the use of intrusive data structures. Imagine that we replace std::unordered_map and std::list with boost/intrusive/avltree.hpp and boost/intrusive/list.hpp respectively. The semantics are similar to those of the STL containers, except that they operate on T* and do not own the objects.

Consequently, this is what would happen:

  1. We lookup the node in the AVL tree. This is O(log(N))
  2. If the node is not in the cache, we exit early. This is O(1)
  3. We splice the node to the end of the list. This is O(1)

Notice that no matter how you implement step 3, be it via remove+add or splice, you still have an O(1) operation. Unlike the STL's non-intrusive linked list, intrusive linked lists do not need much thought to be used efficiently.

The memory location of the object has not changed, so there is no need to update the AVL tree index. The only way this could be made even faster would be to use a hash table, although getting the sizing and hash function right is a pain that you do not have with self balancing binary search trees.
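To illustrate, here is a rough sketch of what an intrusive touch() could look like using Boost.Intrusive's avl_set (the set-flavoured wrapper around its AVL tree) and list. The Entry type, the key type, and the IntrusiveLru class are hypothetical, not a drop-in replacement for util_lru.h:

    #include <cstdint>
    #include <boost/intrusive/avl_set.hpp>
    #include <boost/intrusive/list.hpp>

    namespace bi = boost::intrusive;

    // Hypothetical cache entry: the two hooks are the embedded node state used
    // by the AVL index and by the LRU list. The caller owns these objects and
    // must keep them alive while they are linked.
    struct Entry : bi::avl_set_base_hook<>, bi::list_base_hook<> {
      explicit Entry(std::uint64_t k) : key(k) { }
      std::uint64_t key;
      friend bool operator<(const Entry& a, const Entry& b) { return a.key < b.key; }
    };

    // Comparator so the AVL index can be searched by key without building an Entry.
    struct KeyCompare {
      bool operator()(std::uint64_t k, const Entry& e) const { return k < e.key; }
      bool operator()(const Entry& e, std::uint64_t k) const { return e.key < k; }
    };

    class IntrusiveLru {
    public:
      void touch(std::uint64_t key) {
        auto it = m_index.find(key, KeyCompare());  // O(log N)
        if (it == m_index.end())
          return;                                   // O(1)
        // The node's address does not change, so the AVL index stays valid;
        // only the list position moves. No allocation or free happens here.
        m_list.erase(m_list.iterator_to(*it));      // O(1)
        m_list.push_back(*it);                      // O(1)
      }

    private:
      bi::avl_set<Entry> m_index;  // lookup by key
      bi::list<Entry>    m_list;   // least-recently-used order
    };

Note that neither container owns the Entry objects; their lifetime is managed by the caller, which is exactly the external-lifetime case described above.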

Let's look at another example from that file:

    void insert(T value) {
      auto cacheIter = m_cache.find(value);
      if (cacheIter != m_cache.end())
        m_list.erase(cacheIter->second);

      m_list.push_back(value);
      auto iter = m_list.cend();
      iter--;
      m_cache[value] = iter;
    }

This is roughly what happens:

  1. We lookup the value in m_cache. This is O(log(N))
  2. If we have a hit, we remove it from the list. The documentation says this is O(1), but it will invoke the memory allocator's free function, which may be as bad as O(N^3).
  3. We then put the value at the end of the list. The documentation says this is O(1), but it will invoke the memory allocator's allocation function, which may be as bad as O(N^3).
  4. We get an iterator to the element past the end of the list. This is O(1).
  5. We reverse iterate once. This is O(1).
  6. We update m_cache with the iterator to the value at the end of the list. This is documented as being O(log(N)), but it can throw an exception from allocating memory whenever it is inserting a new entry, so it may be as bad as O(N^3). This is unlike the case of the touch() function that updates an existing entry, which should avoid an allocation.

Now let us imagine that we have written an intrusive version of this using an intrusive AVL tree and an intrusive linked list. Here is a simple version:

  1. We lookup the value in the AVL tree. This is O(log(N)).
  2. If we have a hit, we remove it from the list. This is O(1).
  3. We then put the value at the end of the list. This is O(1).
  4. We add the value to the AVL tree. This is O(log(N)).

This is truly a O(log(N)) insert operation. It is possible to give this a smaller constant factor when the lookup has a hit by simply updating the external pointers to point to the new object, and copying pointers pointing out from the old object into the new object. Some intrusive AVL tree libraries support doing this via their library functions. I am not sure if Boost's AVL tree does.
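Continuing the hypothetical sketch from above, the simple four-step insert() could look like the following. It does not attempt the pointer-swapping refinement just mentioned; on a hit it simply unlinks the old node from both containers before linking the new one:

    // Added to the hypothetical IntrusiveLru class from the earlier sketch.
    // Nothing is allocated or freed here, since the containers never own
    // their elements; the caller manages the Entry objects.
    void insert(Entry& e) {
      auto it = m_index.find(e.key, KeyCompare());  // O(log N)
      if (it != m_index.end()) {
        m_list.erase(m_list.iterator_to(*it));      // O(1)
        m_index.erase(it);                          // unlink from the index, no free
      }
      m_index.insert(e);                            // O(log N), no allocation
      m_list.push_back(e);                          // O(1), no allocation
    }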

In any case, it should be clear that adopting intrusive data structures would be a performance improvement, and it is an improvement that applies not only to multi-index containers, but to any case where you manage object lifetimes yourself rather than letting a container handle it. In both cases, you avoid unnecessary memory allocations/deallocations, as well as additional indirections (e.g. storing iterators/pointers as values) to access a value through a container.

Every object that will ever be put into an intrusive data structure needs to be modified to support it in advance, which is the main downside of using intrusive data structures. This "downside" can be beneficial when debugging, since you can see whether an object is inside a particular data structure or not just by examining it. It also makes it easier to keep track of which data structures might contain a specific object, by adding comments next to the embedded node members.

That said, Boost has its own comparison table:

https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive/intrusive_vs_nontrusive.html#intrusive.intrusive_vs_nontrusive.properties_of_intrusive

For particularly small objects (e.g. 64 bytes or less) that do not need multi-index containers, in places where an unordered_map is currently used, you would probably find that a B-Tree performs better than an intrusive data structure (at least that has been the experience of the OpenZFS project). However, the intrusive version should still outperform the non-intrusive version whenever object lifetimes are managed outside of the container. Most C projects (including the Linux kernel) use intrusive data structures almost exclusively; the main exceptions are arrays and B-Trees. Intrusive data structures are something you can use just about everywhere.

Boost.Intrusive is a header-only library, so it is possible to use it to adopt intrusive data structures without adopting the larger Boost library:

https://www.boost.org/doc/libs/1_84_0/doc/html/intrusive.html

Rewriting DXVK to use intrusive data structures everywhere would probably not be an efficient use of time, so my suggestion is to identify hot code paths that use the STL containers and modify them to use the boost intrusive containers.

@ryao
Contributor Author

ryao commented Jan 13, 2024

https://baptiste-wicht.com/posts/2012/12/cpp-benchmark-std-list-boost-intrusive-list.html

This benchmark is open source. The source files are:

https://github.com/wichtounet/articles/blob/master/src/intrusive_list/bench.cpp
https://github.com/wichtounet/articles/blob/master/include/policies.hpp
https://github.com/mattreecebentley/plf_colony/blob/3362b420639d7dbe4cc61c8ea58e43f146c353c7/plf_colony.h

The sources can be obtained locally by running git clone --recursive https://github.com/wichtounet/articles.git, although before building it, you need to add #include "../plf_colony/plf_colony.h" to include/policies.hpp due to wichtounet/articles#14. Afterward, you can build it by doing make intrusive_list. Running it will put a substantial amount of data into graph.html, which despite the extension is not really HTML.

On my modern Ryzen 7 5800X system with g++ 12.3.1_p20230526, I can confirm that the benchmark still shows significant advantages for the intrusive lists over the std::list in most workloads. For example, here is an excerpt from the sorting part of the benchmark:

sort - Normal<32u> (time in ms vs. number of elements):

        Elements  auto_unlink_list  safe_list  normal_ilist    list
          100000              9419       9624          9573   12069
          200000             21273      21208         20995   34057
          300000             35339      35280         39070   49415
          400000             47199      47777         47515   69085
          500000             64322      60253         61057   98367
          600000             89818      86414         92880  157097
          700000            118427     121676        109725  200783
          800000            125920     128668        127506  244285
          900000            145459     146918        141354  274319
         1000000            171367     180252        161327  303405

The benchmark uses generics so heavily that it is difficult to tell what is managing the lifetime of the objects. If std::list is managing it, I would expect it to do the equivalent of an intrusive data structure behind the scenes, such that std::list should not lose to an intrusive data structure, yet it is losing. Either the object lifetimes are managed externally, or there is some overhead in std::list that makes it slower than an intrusive linked list even when it should not be. Someone more familiar with C++ should be able to tell how the object lifetimes are managed.

@doitsujin
Owner

Do you have any real-world numbers on how much CPU time is actually spent operating on those data structures in actual games?

I will not adopt boost as a dependency, and writing our own containers is only ever an option if there is a significant amount of performance to be gained. We already have some in places where it matters.

@ryao
Copy link
Contributor Author

ryao commented Jan 13, 2024

Do you have any real-world numbers on how much CPU time is actually spent operating on those data structures in actual games?

It should not be possible to directly get real-world numbers on how much CPU time is actually spent operating on those data structures, since the compiler inlines code for operating on C++ STL containers into consumers.

It would be more feasible to find hot code paths, replace the containers there with intrusive versions and benchmark that, but unfortunately, I do not know how to profile PE binaries on Linux to find the hot code paths. :/

I will not adopt boost as a dependency, and writing our own containers is only ever an option if there is a significant amount of performance to be gained. We already have some in places where it matters.

As I wrote earlier, Boost.Intrusive is a header-only library. Just copy the relevant headers into the source tree and you can use it. There is no need to depend on Boost itself.

That said, Boost is not the only place where you can get intrusive containers, although it is the only implementation I could find that has public micro-benchmarks comparing its intrusive list against std::list. For example:

https://github.com/graphitemaster/libintrusive

@K0bin
Collaborator

K0bin commented Jan 13, 2024

Do you have any real-world numbers on how much CPU time is actually spent operating on those data structures in actual games?

I haven't really seen it show up when profiling.

@ishitatsuyuki
Contributor

For profiling of PE you can use my samply fork at https://gist.github.com/ishitatsuyuki/09eef31e16e5247718063cb2390cdc37.

Like others, I haven't really seen the data structures in question in the profiles. They probably can be optimized, but the returns are rapidly diminishing.

@ryao
Contributor Author

ryao commented Jan 15, 2024

@ishitatsuyuki I have never seen a profiler capable of showing time spent in inlined code. Is samply able to do that?

My expectation is that the time spent on a data structure would appear as time spent in some caller, with at most the memory allocator making an appearance. The memory allocator is non-deterministic: it can be fast most of the time, then very slow, and then fast again, which could look like minimal CPU time in total while still including one or more latency spikes. After all, fast things do not typically appear in profiles, but slow things do, and if something is only slow occasionally, it will not show up as very much. :/

Ideally, we would want to generate histograms of function latencies. Then we would pick the ones with the biggest outliers and/or the most CPU time, replace the data structures used with intrusive versions and rebenchmark to see how the histograms were affected.

That said, since C++ memory allocations, at least in Wine, end up going into malloc as far as I know, this seems relevant for how bad these spikes can be:

https://www.forrestthewoods.com/blog/benchmarking-malloc-with-doom3/

Forrest Smith benchmarked malloc latencies:

Parsing log file: c:/temp/doom3_journal.txt
Running Replay

== Replay Results ==
Number of Mallocs:    5531709
Number of Frees:      5523236
Total Allocation:     2.47 gigabytes
Max Live Bytes:       330 megabytes
Average Allocation:   480 bytes
Median Allocation:    64 bytes
Average Malloc Time:  57 nanoseconds
Num Leaked Bytes:     7 megabytes

Alloc Time
Best:    21 nanoseconds
p1:      22 nanoseconds
p10:     22 nanoseconds
p25:     23 nanoseconds
p50:     31 nanoseconds
p75:     45 nanoseconds
p90:     58 nanoseconds
p95:     86 nanoseconds
p98:     192 nanoseconds
p99:     316 nanoseconds
p99.9:   3.07 microseconds
p99.99:  25.75 microseconds
p99.999: 95.01 microseconds
Worst:   200.74 microseconds

His measured worst-case malloc latency was 201 microseconds. This is not the absolute worst-case malloc latency; it is just the worst case he measured when profiling Doom 3 for 7 minutes on his high-end Windows system. His 99.999th percentile was 95 microseconds. Based on his data, he observed a 0.1 ms latency spike from malloc approximately once every 8 seconds.

Of course, these numbers are neither from DXVK on Linux nor from a modern game, but I suspect the latency numbers are not orders of magnitude different than what we would see if we measured on Linux. If we gathered data from DXVK, there is a decent chance that we would find that some minor performance gains could be made from avoiding unnecessary memory allocations in containers.

Note that the 0.1 ms-every-8-seconds spike in Forrest Smith's data is somewhat misleading too: smaller spikes are far more frequent, and with the number of malloc calls being made per second, the additional latency can add up into bigger intra-frame spikes.

As Forrest Smith wrote:

The problem isn't malloc per se. The problem is if you call malloc once then you probably call it more than once. It's death by a thousand cuts.

We cannot avoid malloc, but avoiding data structures that add additional malloc operations could be a worthwhile micro-optimization. Of course, we definitely should profile to find where the opportunities to do this are.

We could probably use samply to find where the most CPU time is spent, see which containers were inlined into that code, replace them with intrusive versions and then do before/after comparisons. It would definitely be better than blindly guessing, although it would not be as good at identifying problematic data structures and quantifying improvements as the latency histogram approach. I know funclatency from iovisor/bcc could be used to profile function latencies inside the kernel and generate histograms for functions, but I am not sure how the equivalent would be done for DXVK.

@ishitatsuyuki
Contributor

samply can deal with inlined frames correctly as long as debug symbols can be found.

@turol

turol commented Jan 15, 2024

Random thoughts on this...

It's currently rather cumbersome to experiment with alternative data structures because DXVK uses std data types directly. Adding type aliases for them in the dxvk namespace and replacing all uses with them would make it easier to change them all at once.
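For example, a small header along these lines would let experiments swap container implementations in one place (the header name and the aliases are hypothetical, not an existing DXVK convention):

    // util_containers.h (hypothetical name) -- central aliases so that swapping
    // a container implementation only requires touching this one header.
    #include <functional>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    namespace dxvk {

      template<typename K, typename V,
               typename Hash = std::hash<K>, typename Eq = std::equal_to<K>>
      using unordered_map = std::unordered_map<K, V, Hash, Eq>;

      template<typename K,
               typename Hash = std::hash<K>, typename Eq = std::equal_to<K>>
      using unordered_set = std::unordered_set<K, Hash, Eq>;

      template<typename T>
      using vector = std::vector<T>;

    }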

std::unordered_map and std::unordered_set might increase cache pressure because of their node-based design. On the other hand, DXVK relies on that for pointer-stability reasons. A replacement would need to be compatible with that, like abseil's absl::node_hash_map. Other open-addressing hash maps would need an extra unique_ptr per element, which would probably defeat the point. This might be improved by using a custom pool allocator for nodes.

std::vector is reasonably good. Just make sure to call reserve appropriately. In some places either small vector optimization or static-sized vector might be appropriate.

Ordered std::map and std::set might also benefit from small and static storage optimization. If large ordered sets are required there are B-tree based implementations.

std::list is the one most likely to benefit from intrusiveness but there are only a few uses of it.

std::function is used in places. There are alternatives like https://github.com/TartanLlama/function_ref which are sometimes better.

On Linux, perf can be used for profiling. It should be able to see inlined functions correctly if you have debug symbols. It might work under Wine if you're using binaries built with MinGW, which have DWARF debug symbols. I like hotspot for visualization.

Microarchitectural profiling is very useful. On Intel CPUs, use either VTune or toplev from https://github.com/andikleen/pmu-tools/. On AMD, use their uProf.

@ishitatsuyuki
Contributor

Randomly picking data structures that are probably accessed once per frame is not going to add anything to the discussion. Before claiming that any of that is a problem, please provide profiling evidence or at least suggest something that should be implemented to quantify the impact.

Here's a samply profile you can load into https://profiler.firefox.com. The dxvk-cs thread is mostly filled with driver overhead and our other state tracking overhead. None of the data structures you mentioned are visible there.

PID 8763 2024-01-15 12.51 profile.json.gz

Our biggest problem right now is the constant refcount manipulation going on in the CS thread and the fence (recycling) thread, but solving that without introducing correctness issues is extremely nontrivial and would probably take effort on the scale of rewriting the entire codebase.

@ryao
Contributor Author

ryao commented Jan 16, 2024

Randomly picking data structures that are probably accessed once per frame is not going to add anything to the discussion. Before claiming that any of that is a problem, please provide profiling evidence or at least suggest something that should be implemented to quantify the impact.

Here's a samply profile you can load into https://profiler.firefox.com. The dxvk-cs thread is mostly filled with driver overhead and our other state tracking overhead. None of the data structures you mentioned are visible there.

PID 8763 2024-01-15 12.51 profile.json.gz

I was skeptical that it was possible to even see those data structures due to C++'s inlining, but to my surprise, I found a stack that does show the data structure in the profile that you provided:

free [ucrtbase.dll]
std::_Deallocate(void*, unsigned long long) [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/xmemory]
std::_Default_allocator_traits<std::allocator<std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *> > >::deallocate(std::allocator<std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *> >&, std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *>* const, const unsigned long long) [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/xmemory]
std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *>::_Freenode0(std::allocator<std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *> >&, std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *>*) [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/list]
std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *>::_Freenode(std::allocator<std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *> >&, std::_List_node<std::pair<unsigned long long,std::function<void ()> >,void *>*) [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/list]
std::list<std::pair<unsigned long long,std::function<void ()> >,std::allocator<std::pair<unsigned long long,std::function<void ()> > > >::erase(const std::_List_const_iterator<std::_List_val<std::_List_simple_types<std::pair<unsigned long long,std::function<void ()> > > > >) [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/list]
dxvk::sync::CallbackFence::signal(unsigned long long) [/home/ishitatsuyuki/Documents/dxvk/build/../src/d3d11/../dxvk/../util/sync/sync_signal.h]
dxvk::Presenter::runFrameThread() [/home/ishitatsuyuki/Documents/dxvk/build/../src/dxvk/dxvk_presenter.cpp]
std::_Func_class<void>::operator()() const [/home/ishitatsuyuki/msvc/vc/tools/msvc/14.29.30133/include/functional]
dxvk::thread::threadProc(void*) [/home/ishitatsuyuki/Documents/dxvk/build/../src/util/thread.cpp]
BaseThreadInitThunk [kernel32.dll]
RtlUserThreadStart [ntdll.dll]

That said, the profile suggests that you are right that this micro-optimization is not worth the time it takes to implement it. I am closing this.

Our biggest problem right now is the constant refcount manipulation going on in the CS thread and the fence (recycling) thread, but solving that without introducing correctness issues is extremely nontrivial and would probably take effort on the scale of rewriting the entire codebase.

That is a hard problem. Thanks for letting me know. If I do more eyeballing for optimization opportunities, I will focus on that area.

@ryao ryao closed this as completed Jan 16, 2024