
Refactor ShardedCache for more sharing, static polymorphism #10801

Closed

Conversation

pdillinger (Contributor) commented Oct 10, 2022


Summary: The motivations for this change include
* Free up space in ClockHandle so that we can add data for secondary
cache handling while still keeping within single cache line (64 byte)
size.
  * This change frees up space by eliminating the need for the `hash`
    field by making the fixed-size key itself a hash, using a 128-bit
    bijective (lossless) hash.
* Generally more customizability of ShardedCache (such as hashing)
without worrying about virtual call overheads
  * ShardedCache now uses static polymorphism (template) instead of
    dynamic polymorphism (virtual overrides) for the CacheShard. No
    obvious performance benefit is seen from the change (as mostly
    expected; most calls to virtual functions in CacheShard could
    already be optimized to static calls), but it offers more
    flexibility without incurring the runtime cost of adhering to a
    common interface (without type parameters or static callbacks); a
    sketch of the pattern follows this list.
  * You'll also notice less `reinterpret_cast`ing and other boilerplate
    in the Cache implementations, as this can go in ShardedCache.
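
To make the pattern concrete, here is a minimal sketch of template-based
static polymorphism for a sharded cache. All names (`ShardedCacheSketch`,
`MyShard`, `HandleImpl`) are illustrative assumptions, not RocksDB's
actual declarations:

```
#include <cstddef>
#include <memory>

// Each shard type supplies its own handle type and operations; the
// sharding wrapper is a template, so calls resolve at compile time.
template <class CacheShard>
class ShardedCacheSketch {
 public:
  using HandleImpl = typename CacheShard::HandleImpl;

  explicit ShardedCacheSketch(size_t num_shards)
      : num_shards_(num_shards),
        shards_(std::make_unique<CacheShard[]>(num_shards)) {}

  // No vtable dispatch, and the signature can use the shard's own
  // HandleImpl rather than a generic, reinterpret_cast-prone Handle*.
  HandleImpl* Lookup(size_t hash) {
    return shards_[hash % num_shards_].Lookup(hash);
  }

 private:
  size_t num_shards_;
  std::unique_ptr<CacheShard[]> shards_;
};

// A trivial shard just to show the plug-in shape.
struct MyShard {
  struct HandleImpl { /* entry fields */ };
  HandleImpl* Lookup(size_t /*hash*/) { return nullptr; }
};

// Usage: ShardedCacheSketch<MyShard> cache{16};
```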

More detail:
* Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>`
copies (extra refcount) when LRUCache can be in charge of keeping a
`shared_ptr` (see the ownership sketch after this list).
* Renamed `capacity_mutex_` to `config_mutex_` to better represent the
scope of what it guards.
* Some preparation for 64-bit hash and indexing in LRUCache, but didn't
include the full change because of slight performance regression.
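
A minimal sketch of that ownership arrangement (names are hypothetical
stand-ins; the real classes carry much more state):

```
#include <memory>

struct SecondaryCache {};

// Shards hold a non-owning pointer; a shard never outlives its cache.
struct ShardSketch {
  SecondaryCache* secondary = nullptr;
};

class LRUCacheSketch {
 public:
  explicit LRUCacheSketch(std::shared_ptr<SecondaryCache> sc)
      : secondary_(std::move(sc)) {
    shard_.secondary = secondary_.get();  // hand out a raw pointer only
  }

 private:
  std::shared_ptr<SecondaryCache> secondary_;  // the one owning reference
  ShardSketch shard_;
};
```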

Test Plan: Unit test updates were non-trivial because of major changes
to the ClockCacheShard interface in handling of key vs. hash.

Performance:
Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom
-num=30000000 -disable_wal=1 -bloom_bits=16`

Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
```

Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`

So possibly ~0.1% improvement.

And with `-cache_type=hyper_clock_cache`:
Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`

So roughly 5% improvement!
@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```
virtual std::string GetPrintableOptions() const { return ""; }
size_t average_entries_per_lock, size_t* state) = 0;
void EraseUnRefEntries() = 0;
*/
```
Contributor:

Keeping the above functions pure virtual and marking the derived class as final should have the same devirtualizing effect, no?

pdillinger (Author):

We would not get the convenience of the function signatures using the implementation-specific HandleImpl. I also find it confusing to mix templates and virtual functions. It can be harder to understand what is intended.

(Two resolved, outdated review threads on cache/lru_cache.cc)
@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Oct 21, 2022
Summary: There were some cases where we saved a reference to the hash
value instead of saving the hash value itself. This change fixes that by
generally doing Rollback as soon as an entry is no longer visible to
Lookups, so we don't need to copy the value for later passing to
Rollback.

Test Plan: watch for clean crash test, TSAN
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary:
In #10801 in ClockHandleTable::Evict, we saved a reference to the hash value (`const UniqueId64x2& hashed_key`) instead of saving the hash value itself before marking the handle as empty and thus free for use by other threads. This could lead to Rollback seeing the wrong hash value for updating the `displacements` after an entry is removed.

The fix is (like other places) to copy the hash value before it's released; a sketch of the pattern follows below. (We could Rollback while we own the entry, but that creates more dependences between atomic updates, because in that case, based on the code, the Rollback writes would have to happen before or after the entry is released by marking empty. By doing the relaxed Rollback after marking empty, there's more opportunity for re-ordering / ILP.)
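
A minimal sketch of the bug and fix pattern (types and names are
simplified stand-ins for the actual clock_cache.cc code):

```
#include <array>
#include <atomic>
#include <cstdint>

using UniqueId64x2 = std::array<uint64_t, 2>;

struct Handle {
  UniqueId64x2 hashed_key;
  std::atomic<uint32_t> meta{0};  // nonzero = occupied (simplified)
};

void Rollback(const UniqueId64x2& /*key*/) { /* update displacements */ }

void Evict(Handle& h) {
  // BUG (before): holding a reference into the handle; once the slot is
  // marked empty, another thread may reuse it and overwrite hashed_key
  // before Rollback reads it:
  //   const UniqueId64x2& key = h.hashed_key;
  // FIX: copy the hash value while this thread still owns the entry.
  UniqueId64x2 key = h.hashed_key;
  h.meta.store(0, std::memory_order_release);  // mark empty / reusable
  Rollback(key);  // safe: reads our private copy, not the handle
}
```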

Intended follow-up: refactoring for better code sharing in clock_cache.cc

Pull Request resolved: #10843

Test Plan: watch for clean crash test, TSAN

Reviewed By: siying

Differential Revision: D40579680

Pulled By: pdillinger

fbshipit-source-id: 258e43b3b80bc980a161d5c675ccc6708ecb8025
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Oct 26, 2022
facebook-github-bot pushed a commit that referenced this pull request Nov 3, 2022
Summary:
For clean-up and in preparation for some other anticipated changes, including
* A new dynamically-scaling variant of HyperClockCache
* SecondaryCache support for HyperClockCache

This change does some refactoring for current and future code sharing and reusability. (Including follow-up on #10843)

## clock_cache.h
* TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures.
* A number of helper functions introduced and used.
* Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard.

## clock_cache.cc
* Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations.
* Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no).
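
A sketch of the ordering that measured faster (simplified;
`FreeDataMarkEmpty` is named in the text above, the rest is
illustrative): do the handle's cleanup while it is marked "under
construction" (owned by this thread, invisible to lookups), and only
then mark it empty.

```
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t kEmpty = 0;              // slot free for reuse
constexpr uint64_t kUnderConstruction = 1;  // owned by one thread, hidden

struct Handle {
  std::atomic<uint64_t> meta{kEmpty};
  void* value = nullptr;
};

// The entry must already be "under construction": this thread owns it
// and Lookup cannot see it, so the payload can be freed in place,
// reading the handle's fields directly (no local copy needed).
void FreeDataMarkEmpty(Handle& h) {
  assert(h.meta.load(std::memory_order_relaxed) == kUnderConstruction);
  // ... free/delete h.value here ...
  h.value = nullptr;
  // Only then release the slot for reuse by other threads.
  h.meta.store(kEmpty, std::memory_order_release);
}
```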

Pull Request resolved: #10887

Test Plan:
fixed a unit testing gap; other minor test updates for refactoring

No functionality change

## Performance
Same setup as #10801:

Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec`
After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec`

I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference.

Reviewed By: anand1976

Differential Revision: D40959240

Pulled By: pdillinger

fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949
```
int GetNumShardBits() const;
uint32_t GetNumShards() const;
size_t aepl = opts.average_entries_per_lock;
aepl = std::min(aepl, size_t{1});
```

Contributor:

@pdillinger Shall we use `std::max` instead of `std::min`?
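
For context on the question: with a value like 256, `std::min(aepl, size_t{1})` always yields 1 (a cap), while `std::max(aepl, size_t{1})` yields 256 (a floor of 1, which appears to be the intent of the guard). A minimal illustration:

```
#include <algorithm>
#include <cassert>
#include <cstddef>

int main() {
  size_t aepl = 256;
  // std::min caps at 1: every caller degrades to 1 entry per lock.
  assert(std::min(aepl, size_t{1}) == 1);
  // std::max enforces a floor of 1 (guards against aepl == 0).
  assert(std::max(aepl, size_t{1}) == 256);
  return 0;
}
```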
