Better SigCache Implementation #8895

Merged
merged 2 commits into from Dec 15, 2016

@JeremyRubin
Contributor

JeremyRubin commented Oct 5, 2016

edited: This PR has been squashed with a few revisions from the initial PR. The unsquashed version is still available here master...JeremyRubin:cuckoocache-pull-request-not-squashed. The PR message has also been updated to reflect the current squashed version more closely.

Summary

This PR replaces the boost::unordered_set in sigcache with CuckooCache::cache.

Benefits

  • Better memory utilization (no pointers/linked lists/stored hashes)
  • Lock-free concurrent reads and erases
  • Lazy deletion of entries (potentially beneficial in case of re-orgs)
  • Low cache thrashing -- a standard lookup and erase should only touch ~9 cache lines (hash locations h0..h7, and the erase flag)
  • Allocation-free design
  • Insert is guaranteed to terminate
  • Generations smartly delete older entries before newer ones

Design

The CuckooCache uses a design similar to a cuckoo hash table. Each inserted key is mapped to eight candidate locations by hash functions h0 through h7.

  • Lookup(k) is a matter of examining locations h0(k)..h7(k).
  • Insert(k) first checks whether any of h0(k)..h7(k) is empty, and inserts into one if so. If not, insert swaps k with the key at the hash location one past the last one used (cycling through h0..h7) and attempts to insert the swapped-out key at one of its other locations, repeating up to a limit if necessary. At that limit, the last displaced entry is dropped. Insert also checks a heuristic (described below) to decide whether to age a generation before inserting.
  • Delete(k) first does a Lookup(k), then sets the corresponding bit in an std::atomic<uint8_t> with a relaxed atomic operation.
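The lookup/insert flow above can be sketched with a toy model (hypothetical code, not the PR's `CuckooCache::cache`; the eight salted hashes and the empty-slot sentinel are illustrative assumptions):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Toy 8-way cuckoo cache over uint64_t keys (illustrative only; 0 is used
// as the "empty slot" sentinel, so key 0 cannot be stored).
struct ToyCuckoo {
    std::vector<uint64_t> table;
    explicit ToyCuckoo(std::size_t n) : table(n, 0) {}

    // Stand-in for h0..h7: derive 8 locations by salting one hash function.
    std::array<uint32_t, 8> locations(uint64_t k) const {
        std::array<uint32_t, 8> locs;
        for (uint32_t i = 0; i < 8; ++i)
            locs[i] = static_cast<uint32_t>(
                std::hash<uint64_t>{}(k ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % table.size());
        return locs;
    }

    // Lookup(k): examine locations h0(k)..h7(k).
    bool contains(uint64_t k) const {
        for (uint32_t loc : locations(k))
            if (table[loc] == k) return true;
        return false;
    }

    // Insert(k): take an empty slot if one exists; otherwise displace a
    // resident key and retry with it, up to depth_limit hops, then drop.
    void insert(uint64_t k, int depth_limit = 16) {
        for (int depth = 0; depth < depth_limit; ++depth) {
            std::array<uint32_t, 8> locs = locations(k);
            for (uint32_t loc : locs)
                if (table[loc] == 0) { table[loc] = k; return; }
            std::swap(k, table[locs[depth % 8]]); // evict and carry on
        }
        // depth_limit reached: the displaced key is dropped (cache semantics).
    }
};
```

Note how insert always terminates: after depth_limit displacements the currently held key is simply dropped, which is acceptable behavior for a cache.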

More information about the synchronization specification is in the code documentation.

Generation Heuristic

A bit vector is used to track whether each element belongs to the current or prior generation. Once the current generation's non-deleted elements exceed ~45% of the cache, the generation is retired, and the previously retired generation is marked deleted. Because the cache deletes lazily, those entries are still present but are more likely to be overwritten than elements in the currently retired generation. Generations are scanned to decide whether they should be retired, using a heuristic of how many inserts would need to have occurred since the last scan to exceed the generation size.

There is a slight (1-bit-per-entry) memory overhead to this approach. It does not hurt this cache because the trade-off is worth it for intelligently deleting older entries. Only one bit is used for simplicity, which yields three represented states (deleted, old, current). Using more generations (say, one more bit) would allow more granular freeing of only the oldest quarter rather than half per scan. That has advantages, but also an additional memory cost, and two generations were sufficient for this use case.
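The one-erase-bit-plus-one-epoch-bit encoding can be sketched as follows (an illustrative classification, not the PR's actual representation):

```cpp
// Per-entry bookkeeping as described above: an "erased" flag plus an "epoch"
// flag yield three effective states; the fourth combination never matters
// because erased entries are treated alike regardless of epoch.
enum class EntryState { Deleted, Old, Current };

inline EntryState classify(bool erased, bool entry_epoch, bool current_epoch) {
    if (erased) return EntryState::Deleted;
    return entry_epoch == current_epoch ? EntryState::Current : EntryState::Old;
}
```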

Simulation

These simulation results are slightly outdated; @morcos will chime in below with updated figures, which should be the same as or better than the following.
Overall, this can be seen as a ~40% improvement in validation times over master on a 16-core machine, simulated over a month of historical data. The simulation methodology replays relayed transactions and blocks to the node in the order they were received historically. On 4- and 1-core simulations, the improvement is not significant (but no worse). Our tests didn't really hit hard saturation of the cache, which suggests that the megabytes allocated to it could be reduced.

Testing

The PR includes a test suite which checks a few key properties:

  1. Check that values not inserted are never readable
  2. Check that a reasonable hit rate is achieved
  3. Check that erased elements' slots are reused reasonably well, rather than fresh contents being replaced
  4. Test that (3) holds when erases happen in parallel.

Future Work

There are future optimizations which may be able to bring some of the advantages seen with 16 cores down to 4 and 1 cores, but this PR is focused on a minimal-complexity patch with a demonstrable improvement over master as a base for future work in this direction.

There are also a few cleanup related items (such as cache initialization) which are not done in this patch to make it a minimal code change. These can be fixed in a later PR if this is accepted.

Acknowledgements

Thanks to @morcos, @sdaftuar, and @TheBlueMatt for their assistance and review on this patch.

@fanquake


Member

fanquake commented Oct 9, 2016

Compiling on OS X, with Xcode 8

  CXX      test/test_test_bitcoin-cuckoocache_tests.o
In file included from test/cuckoocache_tests.cpp:4:
In file included from /usr/local/include/boost/test/unit_test.hpp:18:
In file included from /usr/local/include/boost/test/test_tools.hpp:46:
In file included from /usr/local/include/boost/test/tools/old/impl.hpp:19:
In file included from /usr/local/include/boost/test/unit_test_log.hpp:24:
In file included from /usr/local/include/boost/test/utils/wrap_stringstream.hpp:26:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/sstream:174:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:138:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ios:216:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:18:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/mutex:177:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:992:39: error: 
      type 'float' cannot be used prior to '::' because it has no members
    : public binary_function<typename _Predicate::first_argument_type,
                                      ^
./cuckoocache.h:206:65: note: in instantiation of template class
      'std::__1::binary_negate<float>' requested here
        depth_limit = std::max((uint8_t)1, static_cast<uint8_t>(std::log...
                                                                ^
In file included from test/cuckoocache_tests.cpp:4:
In file included from /usr/local/include/boost/test/unit_test.hpp:18:
In file included from /usr/local/include/boost/test/test_tools.hpp:46:
In file included from /usr/local/include/boost/test/tools/old/impl.hpp:19:
In file included from /usr/local/include/boost/test/unit_test_log.hpp:24:
In file included from /usr/local/include/boost/test/utils/wrap_stringstream.hpp:26:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/sstream:174:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:138:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ios:216:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:18:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/mutex:177:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1002:36: error: 
      type 'float' cannot be used prior to '::' because it has no members
    bool operator()(const typename _Predicate::first_argument_type& __x,
                                   ^
In file included from test/cuckoocache_tests.cpp:5:
./cuckoocache.h:206:70: error: no member named 'log2' in namespace 'std'
        depth_limit = std::max((uint8_t)1, static_cast<uint8_t>(std::log2(static...
                                                                ~~~~~^
3 errors generated.
@JeremyRubin


Contributor

JeremyRubin commented Oct 9, 2016

@fanquake should be fixed, wasn't included for some reason.

@theuni


Member

theuni commented Oct 10, 2016

Concept ACK and code review ACK.

I still need to read up on the security/performance guarantees to understand better, but the implementation looks sane.

@TheBlueMatt


Contributor

TheBlueMatt commented Oct 16, 2016

Concept ACK, if nothing else the increase in entries-per-byte should be a big win for memory usage.

@JeremyRubin


Contributor

JeremyRubin commented Oct 20, 2016

I just pushed up 4 commits which further improve the cuckoo cache. These commits add epochs/generations for the cache entries which improves the hit rate. Below is a brief description of the changes, for more detail, see the code & code documentation.

The first 2 commits are changes to the tests. The first commit simplifies what is checked in the existing tests (it was over-specific to that cache's design). The second adds a new test which fails under the existing code because the code does not prioritize newer entries.

The third patch adds generations to the cache, which work as follows: a bit vector is used to track whether each element belongs to the current or prior generation. Once the current generation's non-deleted elements exceed ~45% of the cache, the generation is retired, and the previously retired generation is marked deleted. Because the cache deletes lazily, those entries are still present but are more likely to be overwritten than elements in the currently retired generation. Generations are scanned to decide whether they should be retired, using a heuristic of how many inserts would need to have occurred since the last scan to exceed the generation size.

The fourth commit increases the number of hashes used per element from 2 to 8. The benefit of this change is that it permits a larger generation size (~45%, versus ~30% with 2 hashes). The generation size is the "effective size" of the cache, i.e., it determines how long new entries stay in the current generation.

Performance is markedly better, especially under attack scenarios. @morcos has run several simulations which back up this claim.

There is a slight (1-bit-per-entry) memory overhead to this approach. It does not hurt this cache because the trade-off is worth it for intelligently deleting older entries. Only one bit is used for simplicity, which yields three represented states (deleted, old, current). Using more generations (say, one more bit) would allow more granular freeing of only the oldest quarter rather than half per scan. That has advantages, but also an additional memory cost, and two generations were sufficient for this use case.

On a slightly separate note, another side benefit of this cache that I failed to mention when I opened the PR (before these added commits) is that inserts are guaranteed to terminate (in the sigcache currently in master, insert may run forever if GetRand is "unlucky").

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:cuckoocache-pull-request branch 2 times, most recently from 1e86d21 to 1cfb7d1 Oct 20, 2016

@JeremyRubin


Contributor

JeremyRubin commented Oct 21, 2016

I've squashed all the commits (the unsquashed version is still available here if you've already looked at this: master...JeremyRubin:cuckoocache-pull-request-not-squashed). I've also edited the PR message.

The current Travis build failure seems to be related to the ongoing Dyn DDoS attack, as Travis is giving me variants of "The repository at JeremyRubin/bitcoin was not found." and "The repository at bitcoin/bitcoin was not found.". When the attack eases, I'll make Travis retry.

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:cuckoocache-pull-request branch from 1cfb7d1 to 3ba2bf4 Oct 22, 2016

*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
return {hash_function.template operator()<0>(e) % size,


@sipa

sipa Oct 22, 2016

Member

Instead of size, store the log2 of the size, and use a bitshift or mask here. Modulus operations are very slow.
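The suggestion can be illustrated as follows (a generic sketch, not the PR's code; `bucket_mod` and `bucket_mask` are hypothetical names). For a power-of-two table, `h % size` and `h & (size - 1)` select the same bucket, and the mask compiles to a single cheap AND:

```cpp
#include <cstdint>

// Bucket selection by modulus: works for any size, but integer division is slow.
inline uint32_t bucket_mod(uint32_t h, uint32_t size) { return h % size; }

// Bucket selection by mask: requires size == 2^log2_size, but is a single AND.
inline uint32_t bucket_mask(uint32_t h, uint32_t log2_size) {
    return h & ((1u << log2_size) - 1);
}
```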


@JeremyRubin

JeremyRubin Oct 22, 2016

Contributor

Yes, this is a rather slow operation, but I don't quite follow your suggestion... doesn't that only work if the size is a power of two?

*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
return {hash_function.template operator()<0>(e) % size,


@sipa

sipa Oct 22, 2016

Member

Yes.


@JeremyRubin

JeremyRubin Oct 22, 2016

Contributor

Unfortunately the size currently isn't restricted to be a power of two, and by default is not a power of two. Are you suggesting we limit it to be a power of two?

*/
inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
{
return {hash_function.template operator()<0>(e) % size,


@sipa

sipa Oct 22, 2016

Member

Yes :) Sorry if that wasn't obvious. We could benchmark whether it matters compared to other operations of course, but restricting to a power of two means you at most get something that's a factor sqrt(2) off of your desired size, which seems acceptable for a cache.


@JeremyRubin

JeremyRubin Oct 22, 2016

Contributor

Apologies, long rambly response below:

Overall, I think it's a reasonable idea to do this. Modulo is slow. It's not the bottleneck, but it's a low cost way to make this code faster.

There are a couple of odd things, though. You currently have about 2 bits of bookkeeping overhead per entry (one bit for erasure, one bit for epoch). If you're restricting the main cache memory to a power-of-two size, you may as well also do some of the following if the user specifies more space: more generations for finer-grained aging; additional fee tracking to preferentially evict low-fee items; some kind of mempool pointer/index map to evict signatures on mempool eviction. (Preface: I don't like this idea, but for the sake of discussion) you could also keep a large and a small cache and look up from both, letting you target size = 2^m + 2^n.

Also, powers of two work reasonably well low in the series (1, 2, 4, 8, 16, 32, 64, 128), but it seems uncomfortable that at, say, a 1 GB cache you have to choose between 1 and 2 GB if you want to increase a little. Yes, the factor is bounded by sqrt(2), but people have fixed memory budgets, so it does matter.

Maybe a better inbetween would be to do as (roughly) follows (should work on GCC, clang, and msvc):

DEFAULT_MAX_SIGCACHE_SIZE = 32;
#ifdef _MSC_VER
#include <intrin.h>
#define __builtin_popcount __popcnt
#endif
bool fast_mod = __builtin_popcount(size) == 1;
compute_hash(E e) {
    if (fast_mod)
        // compute hash with masks
    else
        // compute hash with mod
}

and by default not count the bookkeeping bits in the size computation. That way, if you pass in a power of two megabytes as the parameter, you get this magic boost in performance, and if you pass in a non-power-of-two, you get what you asked for. The branch shouldn't hurt because it should be easily predicted.

@morcos


Member

morcos commented Oct 28, 2016

Tested ACK either 3ba2bf4 or 121dfcd

I gave some nits about the comments offline.
I am ambivalent about the change to eliminate modulo.

I have extensively tested the performance of this patch. Lock contention on the sig cache is a serious bottleneck to ConnectBlock performance with 8 or more cores. This patch appears to all but eliminate that contention and leads to a 40% improvement in ConnectBlock time for 16 cores. It also allows for further performance improvements that would not see any benefit until this bottleneck was removed.

I have also tested the hit rate performance and it is a significant improvement over the existing sigcache in the event of a spam attack, for a long running node, or in the event of a reorg.

In addition to reviewing the included tests, I've also run my own tests and done code review to be sure that there are no false positives.

@morcos


Member

morcos commented Nov 1, 2016

ACK 453aef4

@fanquake


Member

fanquake commented Nov 9, 2016

Testing this (OS X 10.12, Xcode 8.1):
Compiling threw one new warning:

In file included from script/sigcache.cpp:14:
./cuckoocache.h:220:17: warning: suggest braces around initialization of subobject [-Wmissing-braces]
        return {hash_function.template operator()<0>(e) & hash_mask,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./cuckoocache.h:438:40: note: in instantiation of member function 'CuckooCache::cache<uint256, (anonymous
      namespace)::SignatureCacheHasher>::compute_hashes' requested here
        std::array<uint32_t, 8> locs = compute_hashes(e);
                                       ^
script/sigcache.cpp:70:25: note: in instantiation of member function 'CuckooCache::cache<uint256,
      (anonymous namespace)::SignatureCacheHasher>::contains' requested here
        return setValid.contains(entry, erase);

Errors trying to compile the tests:

In file included from test/cuckoocache_tests.cpp:5:
./cuckoocache.h:218:36: error: implicit instantiation of undefined template 'std::__1::array<unsigned int, 8>'
    inline std::array<uint32_t, 8> compute_hashes(const Element& e) const
                                   ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__tuple:116:65: note: template is declared here
template <class _Tp, size_t _Size> struct _LIBCPP_TYPE_VIS_ONLY array;
                                                                ^
In file included from test/cuckoocache_tests.cpp:5:
./cuckoocache.h:368:33: error: implicit instantiation of undefined template 'std::__1::array<unsigned int, 8>'
        std::array<uint32_t, 8> locs = compute_hashes(e);
                                ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__tuple:116:65: note: template is declared here
template <class _Tp, size_t _Size> struct _LIBCPP_TYPE_VIS_ONLY array;
                                                                ^
In file included from test/cuckoocache_tests.cpp:5:
./cuckoocache.h:438:33: error: implicit instantiation of undefined template 'std::__1::array<unsigned int, 8>'
        std::array<uint32_t, 8> locs = compute_hashes(e);
                                ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__tuple:116:65: note: template is declared here
template <class _Tp, size_t _Size> struct _LIBCPP_TYPE_VIS_ONLY array;
  CXX      test/test_test_bitcoin-key_tests.o
In file included from test/cuckoocache_tests.cpp:4:
In file included from /usr/local/include/boost/test/unit_test.hpp:18:
In file included from /usr/local/include/boost/test/test_tools.hpp:46:
In file included from /usr/local/include/boost/test/tools/old/impl.hpp:19:
In file included from /usr/local/include/boost/test/unit_test_log.hpp:18:
In file included from /usr/local/include/boost/test/tree/observer.hpp:17:
In file included from /usr/local/include/boost/test/detail/global_typedef.hpp:15:
In file included from /usr/local/include/boost/test/utils/basic_cstring/basic_cstring.hpp:21:
In file included from /usr/local/include/boost/test/utils/basic_cstring/bcs_char_traits.hpp:25:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/string:439:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:628:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:1740:31: error: no matching constructor
      for initialization of 'block_activity'
            ::new((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                              ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:1656:18: note: in instantiation of
      function template specialization 'std::__1::allocator<block_activity>::construct<block_activity, const unsigned int &, CuckooCache::cache<uint256,
      cuckoocache_tests::uint256Hasher> &>' requested here
            {__a.construct(__p, _VSTD::forward<_Args>(__args)...);}
                 ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:1502:14: note: in instantiation of
      function template specialization 'std::__1::allocator_traits<std::__1::allocator<block_activity> >::__construct<block_activity, const unsigned int
      &, CuckooCache::cache<uint256, cuckoocache_tests::uint256Hasher> &>' requested here
            {__construct(__has_construct<allocator_type, _Tp*, _Args...>(),
             ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/deque:1801:21: note: in instantiation of
      function template specialization 'std::__1::allocator_traits<std::__1::allocator<block_activity> >::construct<block_activity, const unsigned int &,
      CuckooCache::cache<uint256, cuckoocache_tests::uint256Hasher> &>' requested here
    __alloc_traits::construct(__a, _VSTD::addressof(*__base::end()), _VSTD::forward<_Args>(__args)...);
                    ^
test/cuckoocache_tests.cpp:368:18: note: in instantiation of function template specialization 'std::__1::deque<block_activity,
      std::__1::allocator<block_activity> >::emplace_back<const unsigned int &, CuckooCache::cache<uint256, cuckoocache_tests::uint256Hasher> &>'
      requested here
        last_few.emplace_back(BLOCK_SIZE, set);
                 ^
test/cuckoocache_tests.cpp:391:5: note: in instantiation of function template specialization
      'cuckoocache_tests::test_cache_generations<CuckooCache::cache<uint256, cuckoocache_tests::uint256Hasher> >' requested here
    test_cache_generations<CuckooCache::cache<uint256, uint256Hasher>>();
    ^
test/cuckoocache_tests.cpp:324:12: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 2 were provided
    struct block_activity {
           ^
test/cuckoocache_tests.cpp:324:12: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 2 were provided
4 errors generated.
@JeremyRubin


Contributor

JeremyRubin commented Nov 9, 2016

@fanquake I think I can fix the warnings, but I'm curious how much of a goal that is, given that the flag should be deprecated and there are lots of build errors anyway. Happy to push a squashme if you think so.

The build errors should be fixable; it must be a different include order or something. I need to add a #include <array>.

@fanquake


Member

fanquake commented Nov 9, 2016

@JeremyRubin The warning isn't so much of an issue, was just making a note of it.

@JeremyRubin


Contributor

JeremyRubin commented Nov 10, 2016

@fanquake please confirm fix when you have a moment.

@fanquake


Member

fanquake commented Nov 10, 2016

@JeremyRubin Everything is compiling fine now.

Can you suggest tests/benchmarks that reviewers can run to verify the performance increase here, or is that going to be hard for anyone without a lot of cores?

@MarcoFalke MarcoFalke added this to the 0.14.0 milestone Nov 10, 2016

* - The name "allow_erase" is used because the real discard happens later.
* - The name "please_keep" is used because elements may be erased anyways on insert.
*
* @tparam Element should be a POD type that is 32-alignable


@sipa

sipa Nov 15, 2016

Member

Where does the requirement for POD come from?


@JeremyRubin

JeremyRubin Nov 16, 2016

Contributor

I suppose the way it is currently written (using swap and move) it should be safe to remove both requirements (POD and 32-alignable).

Prior versions of the code I think were unsafe for that use case.


@JeremyRubin

JeremyRubin Nov 16, 2016

Contributor

(I can do a squashme with just that unless you have other small feedbacks)

*
* User Must Guarantee:
*
* 1) Write Requires synchronized access (e.g., a lock)


@sipa

sipa Nov 15, 2016

Member

Any reason why these locks aren't integrated into the cache class? It isn't obvious to me how to correctly synchronize the erase method with the rest.


@JeremyRubin

JeremyRubin Nov 16, 2016

Contributor

Yes there is a reason.

The locks aren't actually needed at all in how the cache is currently used. All that needs to happen, per thread, is: at the beginning of block processing, a memory acquire; at the end of block processing, a memory release. Additionally, one more lock must be acquired before block processing by the master to ensure there are no concurrent writers (from addtomempool), and the master must not release it until all slaves have released as well. This is much better than the repeated acquire/release of locking and unlocking on every single signature operation.

I have written a version of the code with these semantics, but feedback from others was that it was fragile and prone to someone else breaking that behavior down the line (bad for maintainability). In the interest of keeping that optimization available without introducing a maintenance hazard, I kept the locking where it was, in the layer above.

@JeremyRubin


Contributor

JeremyRubin commented Nov 17, 2016

@fanquake yeah, it'll be tough; there's not much speed improvement from losing the erase locks until ~8 cores. What you could test is the better memory efficiency, by allocating smaller caches (2, 4, 8, 16, ...) and comparing performance to a node running with that size and the old cache. I'd recommend running one node and peering your test nodes exclusively through it, to ensure the nodes are getting the same messages at a given time. But if you're hardware-limited, that will be hard.

@TheBlueMatt

A few minor nits on the code, but I'd be fine with it as-is...more of my comments were about your comments 👍.

std::swap(mem, d.mem);
}
/** bit_set sets an entry as discardable.


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

Nit: You have some trailing whitespace on 3 lines in this file, including this one.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍

* that is only thread unsafe on calls to setup. This class bit-packs collection
* flags for memory efficiency.
*
* All operations are std::memory_order_relaxed so external mechanisms must


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

Why not use release/acquire? It's the same instructions on x86 and won't require a full flush anywhere else to be correct. (Indeed, relaxed is maybe better the way it's implemented now, but if we want the read/write locks in the sigcache to be more efficient, we might not want a full flush there.)


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

You're right that it's the same on x86, but it isn't on ARM. We want the operation to be relaxed, so release or acquire would be over-constrained.

If later changes want a different memory model, they can make that change then. It's hard to say whether a full flush is better or worse than lots of release/acquires; that would require benchmarking.

atomic instruction mappings: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
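The relaxed bit-packed flags under discussion can be sketched roughly like this (a simplified stand-in for the PR's flag storage, not its actual code; all accesses are std::memory_order_relaxed, so any ordering must be provided externally, as discussed above):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Eight flags per byte; setting a bit marks an entry as discardable.
// All operations are relaxed: correct ordering is the caller's job.
struct PackedFlags {
    std::vector<std::atomic<uint8_t>> mem;
    explicit PackedFlags(std::size_t bits) : mem((bits + 7) / 8) {
        for (auto& b : mem) b.store(0, std::memory_order_relaxed);
    }
    void bit_set(std::size_t i) {
        mem[i / 8].fetch_or(uint8_t(1u << (i % 8)), std::memory_order_relaxed);
    }
    bool bit_is_set(std::size_t i) const {
        return mem[i / 8].load(std::memory_order_relaxed) & (1u << (i % 8));
    }
};
```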

* table, the entry attempted to be inserted is evicted.
*
*/
inline void insert(Element e)


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

It seems a strange API to require a copy as the function parameter and then rely on std::move to make insertion efficient (which we can't do for uint256, since it doesn't hold any non-POD memory). It seems just as good (and much more common) to pass a const reference and let operator=() handle the copy.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

We also do a std::swap on e, so that's why we take a mutable copy.


@TheBlueMatt

TheBlueMatt Dec 1, 2016

Contributor

std::swap is just as slow for uint256, though. Since there is no dynamically allocated memory, we can't really speed it up.

// case behavior (no intermittent erases) would exceed epoch size,
// with a reasonable minimum scan size.
epoch_heuristic_counter = std::max(1u, std::max(epoch_size / 16,
epoch_size - std::min(epoch_size, epoch_unused_count)));


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

The std::min should be a NOP; we're inside an else branch where epoch_size > epoch_unused_count.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍

/** cache implements a cache with properties similar to a cuckoo-set
*
* The cache is able to hold up to (~(uint32_t) 1) elements.


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

nit: technically it can hold up to (~(uint32_t) 1) - 1, because invalid() uses ~(uint32_t) 1, though I think you meant to use 0.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

I thought there was a reason I used ~(uint32_t) 1 rather than ~(uint32_t) 0, but for the life of me I can't recall it. I'll change it to that, but I'd appreciate you reviewing that there was no reason.

* 1) Write Requires synchronized access (e.g., a lock)
* 2) Read Requires no concurrent Write, synchronized with the last insert.
* 3) Erase requires no concurrent Write, synchronized with last insert.
* 4) An Eraser must release all Erases before allowing a new Writer.


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

I believe this also refers to an old/different version of this patch.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

I think it's correct? I'll try to clarify.


@TheBlueMatt

TheBlueMatt Dec 1, 2016

Contributor

Oh, indeed, I believe I was mistaken.

* @tparam Element should be a POD type that is 32-alignable
* @tparam Hash should be a function/callable which takes a template parameter
* hash_select and an Element and extracts a hash from it. Should return
* high-entropy hashes for `Hash h; h<0>(e) and h<1>(e)`.


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

Also now out-of-date: you're using 8 hashes.


@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍

/** epoch_heuristic_counter is used to determine when a epoch
* might be aged & an expensive scan should be done.
* epoch_heuristic_counter is incremented on insert and reset to the


@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

s/incremented/decremented/

@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍

*/
void epoch_check()
{
if ((--epoch_heuristic_counter) != 0)

@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

nit: technically this is an off-by-one: because you check this prior to actually doing the insert you'll always do an "expensive" scan twice for the first (and all) epochs. Practically, you might just want to set epoch_heuristic_count to something like epoch_size + 10 by default, though I suppose it doesnt matter all that much.

@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍 -- I fixed it, I think, by decrementing inside the if block. (didn't want to do postfix to stop the counter from underflowing)
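The fix described above can be sketched as follows (hypothetical names; the real `epoch_check` also does the aging scan and resizing logic): decrementing inside the early-return branch means the counter can never underflow, and the expensive path runs only once the counter has already reached zero.

```cpp
#include <cstdint>

// Toy model of the corrected heuristic counter.
struct EpochChecker {
    uint32_t epoch_heuristic_counter;
    uint32_t epoch_size;
    uint32_t scans_done = 0;  // stand-in for "expensive scan ran"

    explicit EpochChecker(uint32_t size)
        : epoch_heuristic_counter(size), epoch_size(size) {}

    void epoch_check() {
        if (epoch_heuristic_counter != 0) {
            --epoch_heuristic_counter;  // cheap path; cannot underflow
            return;
        }
        ++scans_done;                          // the expensive scan
        epoch_heuristic_counter = epoch_size;  // reset after aging an epoch
    }
};
```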

/** insert loops at most depth_limit times trying to insert a hash
* at various locations in the table via a variant of the Cuckoo Algorithm
* with two hash locations.

@TheBlueMatt

TheBlueMatt Nov 18, 2016

Contributor

s/two/eight/

@JeremyRubin

JeremyRubin Nov 30, 2016

Contributor

👍

@sipa

Member

sipa commented Nov 30, 2016

@JeremyRubin Care to have a look at @TheBlueMatt's comments above?

@JeremyRubin

Contributor

JeremyRubin commented Nov 30, 2016

Should be all addressed.

@TheBlueMatt

Contributor

TheBlueMatt commented Dec 1, 2016

LGTM at 88b58d3

@morcos

Member

morcos commented Dec 5, 2016

reACK 88b58d3

@sdaftuar

Code review ACK, will test. Few comment nits/questions below.

Also, it would be nice if we had a unit test for the sigcache itself...

*
* 1) On first iter, always false so defaults to locs[0]
* 2) Second iter, last_loc == locs[0] so will go to locs[1]
*

@sdaftuar

sdaftuar Dec 7, 2016

Member

I think this comment needs to be updated now that we have more than 2 hashes, right? If I understand right, the algorithm now is:

Swap with the element one past the last one looked at.  Example:

1) On first iter, always false so defaults to locs[0].
2) Second iter, last_loc == locs[k] for some k, so will go to locs[k+1] (wrapping back to 0 if necessary).
*/
inline bool contains(const Element& e, const bool erase) const
{

@sdaftuar

sdaftuar Dec 7, 2016

Member

nit: random newline
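The swap-cycle described in the comment above can be sketched like this (hypothetical helper, not the PR's code): given the eight candidate locations for a key and the location we last displaced from, pick the slot one past it, wrapping back to `locs[0]`.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// On the first iteration no location matches last_loc, so we default to
// locs[0]; thereafter last_loc == locs[k] and we move to locs[(k+1) % 8].
inline uint32_t next_location(const std::array<uint32_t, 8>& locs,
                              uint32_t last_loc)
{
    for (size_t i = 0; i < 8; ++i)
        if (locs[i] == last_loc)
            return locs[(i + 1) % 8];  // one past the last one looked at
    return locs[0];                    // first iteration: no match yet
}
```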

*
* This may exhibit platform endian dependent behavior but because these are
* nonced hashes (random) and this state is only ever used locally it is safe.
* All that matter is local consistency.

@sdaftuar

sdaftuar Dec 7, 2016

Member

nit: "matter" -> "matters"

}
};
// Initialized outside of VerifySignature to avoid atomic operation per call

@sdaftuar

sdaftuar Dec 7, 2016

Member

nit: Can you clarify this comment -- what is the atomic operation you're referring to?

Also, it seems like this comment is here to explain why you moved the code from its old place to the new place (which is helpful for me to understand!), but it's perhaps not very helpful here in its current form to future code readers who haven't seen the previous implementation.

@JeremyRubin

JeremyRubin Dec 7, 2016

Contributor

I believe that a function-local static's initialization uses an atomic (perhaps even a locking!) operation under the hood, to handle the case where concurrent callers enter the function before it has been initialized (C++11 guarantees this to be safe).

I also believe that this is not the case for a global static (initialized before concurrency is allowed).

I'm not 100% certain the spec forces conformity on this point, but I think it is at least the case in most implementations.
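A small sketch of the distinction being discussed (hypothetical names): since C++11, a function-local static must be initialized thread-safely on first use ("magic statics"), which in typical implementations costs a guard-variable check on every call; a namespace-scope static is instead initialized before main() runs, i.e. before any concurrency, so calls pay nothing.

```cpp
namespace {

int expensive_init() { return 42; }

// Dynamic initialization runs before main(), before concurrency exists.
int g_value = expensive_init();

int with_global() { return g_value; }

int with_local_static()
{
    // Guarded first-use initialization; thread-safe per C++11, but
    // typically checks a guard variable on every call.
    static int value = expensive_init();
    return value;
}

} // namespace
```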

@JeremyRubin

Contributor

JeremyRubin commented Dec 12, 2016

@sdaftuar, d4294ad should fix your documentation concerns.

RE: sigcache unit tests, I could whip something up, but I think it's a little hard to test meaningfully (i.e., beyond what other tests using the CachingTransactionSignatureChecker already cover) without opening up the anonymous namespace in script/sigcache.cpp. We can't detect actual cache hits vs. misses with just CachingTransactionSignatureChecker::VerifySignature, as it just returns a bool. As this would require a bit of restructuring, it's probably easiest to review that as a separate PR, if needed.

@sdaftuar

Member

sdaftuar commented Dec 13, 2016

Tested ACK 88b58d3, and the comments in d4294ad look good too, thanks.

Re: sigcache tests, I don't know the best way to address this. If I could come up with a test suite that exercised the logic sufficiently to make changes like this safer, I'd happily suggest it, but it seems like a hard problem. Still, after reviewing I'm comfortable enough with this PR to proceed without additional sigcache tests (and it's great that we have tests for the cuckoocache at least).

@sipa

Member

sipa commented Dec 14, 2016

utACK. Please squash? :)

JeremyRubin added some commits Oct 5, 2016

Add CuckooCache implementation and replace the sigcache map_type with it
SQUASHME: Change cuckoocache to only work for powers of two, to avoid mod operator
SQUASHME: Update Documentation and simplify logarithm logic
SQUASHME: OSX Build Errors
SQUASHME: minor Feedback from sipa + bluematt
SQUASHME: DOCONLY: Clarify a few comments.
Add unit tests for the CuckooCache
SQUASHME: Update Tests for other SQUASHMEs

@JeremyRubin JeremyRubin force-pushed the JeremyRubin:cuckoocache-pull-request branch from d4294ad to 67dac4e Dec 14, 2016

@JeremyRubin

Contributor

JeremyRubin commented Dec 14, 2016

@sipa Squashed down to just cache and testing commits.

unsquashed version preserved at https://github.com/JeremyRubin/bitcoin/tree/cuckoocache-pull-request-unsquashed

@sipa sipa merged commit 67dac4e into bitcoin:master Dec 15, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

sipa added a commit that referenced this pull request Dec 15, 2016

Merge #8895: Better SigCache Implementation
67dac4e Add unit tests for the CuckooCache (Jeremy Rubin)
c9e69fb Add CuckooCache implementation and replace the sigcache map_type with it (Jeremy Rubin)

codablock added a commit to codablock/dash that referenced this pull request Jan 18, 2018

Merge bitcoin#8895: Better SigCache Implementation
67dac4e Add unit tests for the CuckooCache (Jeremy Rubin)
c9e69fb Add CuckooCache implementation and replace the sigcache map_type with it (Jeremy Rubin)

Warrows added a commit to Warrows/PIVX that referenced this pull request Feb 23, 2018

216k155 added a commit to LUX-Core/lux that referenced this pull request Aug 25, 2018

Add - Cuckoocache, Massive NULL -> nullptr
cuckoocache reference: bitcoin/bitcoin#8895

Signed-off-by: k155 <216k155@luxcore.io>