
Snapshot management scales poorly #5083

Open
mpoeter opened this issue Mar 19, 2019 · 16 comments
Labels: performance (Issues related to performance that may or may not be bugs)

@mpoeter
Contributor

mpoeter commented Mar 19, 2019

GetSnapshotImpl and ReleaseSnapshot acquire the global mutex_ to perform their operations. This can result in high contention and therefore bad scalability for workloads that make heavy use of snapshots. This can be seen by running the "randomwithverify" benchmark (this seems to be the only one that uses snapshots).

This is a screenshot from a VTune analysis of the "randomwithverify" benchmark with 4 threads. It clearly shows how the snapshot operations are serialized (the yellow lines indicate transitions where a mutex unlock wakes another thread that was waiting for it).
[screenshot: VTune thread timeline of the randomwithverify benchmark]
Obviously, a higher number of threads intensifies this problem...

I have considered different ways to improve the situation:

  • use a lock-free doubly-linked list for the snapshot list (e.g., based on Sundell and Tsigas).
  • use N separate snapshot lists, each protected by its own mutex, and distribute the operating threads evenly among them (see the sketch after this list).
  • if the mutex is already locked, record the operation and let the mutex-owning thread apply the recorded operations (similar to flat combining).
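
For illustration, here is a minimal sketch of the second option (sharded snapshot lists). All names (ShardedSnapshotList, SnapshotShard, kShards) are invented for this example, and snapshots are reduced to bare sequence numbers; it demonstrates the contention-splitting idea, not a drop-in replacement:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical sketch: N lists, each with its own mutex. Threads are spread
// across shards by hashing their id, so snapshot creation/release on
// different shards never contend on the same lock.
struct SnapshotShard {
  std::mutex mu;
  std::list<uint64_t> snapshots;  // sequence numbers, oldest first
};

class ShardedSnapshotList {
 public:
  static constexpr size_t kShards = 16;

  // Insert into the shard selected by the calling thread's id.
  std::list<uint64_t>::iterator New(uint64_t seq, size_t* shard_out) {
    size_t s = ShardForThisThread();
    *shard_out = s;
    std::lock_guard<std::mutex> l(shards_[s].mu);
    return shards_[s].snapshots.insert(shards_[s].snapshots.end(), seq);
  }

  // Release only locks the shard the snapshot was inserted into.
  void Delete(size_t shard, std::list<uint64_t>::iterator it) {
    std::lock_guard<std::mutex> l(shards_[shard].mu);
    shards_[shard].snapshots.erase(it);
  }

  // The downside: finding the oldest snapshot must scan all shards.
  uint64_t Oldest(uint64_t no_snapshot) {
    uint64_t oldest = no_snapshot;
    for (auto& shard : shards_) {
      std::lock_guard<std::mutex> l(shard.mu);
      if (!shard.snapshots.empty() && shard.snapshots.front() < oldest) {
        oldest = shard.snapshots.front();
      }
    }
    return oldest;
  }

 private:
  static size_t ShardForThisThread() {
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
  }

  std::array<SnapshotShard, kShards> shards_;
};
```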

But obviously they all have their own pros and cons, so I would like to get some input from the core team.

I suppose you are aware of this potential bottleneck - do you have other ideas or even plans to improve the scalability of snapshots? If we can agree on how to approach this, I am happy to make the corresponding changes and create a PR.

@maysamyabandeh
Contributor

The overhead is not zero, but it is also not a major bottleneck in the benchmarks that I work with, like sysbench read-write, read-only, and update-noindex. I, as probably many others, tried a couple of approaches in the past but did not see a tangible improvement in the benchmark results.
If you have any ideas, feel free to try them out, and if they yield a tangible improvement with reasonable complexity, feel free to send a PR. My suggestion:

  1. Start with a standard benchmark that highlights the bottleneck.
  2. Do a rapid/hackish prototype of your approach, measure the impact on the benchmark, and share the results.
  3. Using the numbers and the feedback, it will be much easier to decide whether you want to spend time on a production-quality patch for this change.

@mpoeter
Contributor Author

mpoeter commented Mar 21, 2019

I originally found this bottleneck while profiling ArangoDB which uses RocksDB as its storage engine. It looks like the only benchmark in db_bench that uses snapshots is randomwithverify. I ran db_bench with the following parameters:

--format_version=3 --memtablerep=hash_linkedlist --prefix_size=16 --allow_concurrent_memtable_write=0 --use_block_based_filter=1 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_size_ratio=0.2 --cache_numshardbits=-1 --compaction_readahead_size=4124672 --level_compaction_dynamic_level_bytes=1 --level0_file_num_compaction_trigger=2 --benchmarks=randomwithverify --deletepercent=1 --readwritepercent=95 --transaction_db=1 --use_existing_db=0 --num=2428800 --numdistinct=100 --threads=8

For which the profiler recorded the following CPU utilization.
[screenshot: CPU utilization profile of the 8-thread run]
This shows that there is some potential for improvement. 🙂 (in my original use case the results look even worse)
I will experiment a bit and report back with the results, but I have two questions:

  • GetSnapshotImpl acquires the mutex_ before loading the snapshot sequence and inserting the snapshot into the snapshot list. Does this have to be atomic? I suppose not (i.e., I don't have to hold the mutex_ to load the sequence).
  • ReleaseSnapshot unconditionally calls UpdateOldestSnapshot for each column family and possibly schedules a compaction. Wouldn't it be sufficient to do this only if the oldest snapshot has changed, i.e., only if the currently released snapshot was the oldest one? (see the sketch after this list)
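
For the second question, here is a tiny sketch of the suggested check (names invented; snapshots are reduced to sequence numbers in a multiset for illustration):

```cpp
#include <cstdint>
#include <mutex>
#include <set>

// Hypothetical illustration: track snapshot sequence numbers and report on
// release whether the oldest one changed, so the caller could skip
// UpdateOldestSnapshot / compaction scheduling in the common case.
class SnapshotTracker {
 public:
  void Acquire(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    seqs_.insert(seq);
  }

  // Returns true iff the oldest snapshot changed as a result of this
  // release; only then would the expensive per-column-family work be needed.
  bool Release(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    seqs_.erase(seqs_.find(seq));
    return seqs_.empty() || *seqs_.begin() > seq;
  }

 private:
  std::mutex mu_;
  std::multiset<uint64_t> seqs_;  // multiset: snapshots may share a seq
};
```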

@yiwu-arbug
Contributor

rocksdb internally has another way to get a consistent view for Get() or range scans when the user doesn't provide a snapshot. It works by holding a sequence number and a SuperVersion reference. The latter holds a version of the LSM tree. The sequence number can be acquired without a lock, and the SuperVersion is stored thread-locally and can be acquired without a lock in most cases. If lock contention on GetSnapshot is really a problem for some use cases, I'm wondering if it is possible to expose the SuperVersion as another type of snapshot to the user.
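
To make the mechanism concrete, here is a conceptual sketch (not RocksDB's actual code; all names and the layout are simplified for illustration):

```cpp
#include <atomic>
#include <cstdint>

struct SuperVersion {};  // stands in for the immutable memtable/SST view

std::atomic<uint64_t> last_published_sequence{0};  // bumped by the write path
thread_local SuperVersion* local_super_version = nullptr;  // cached per thread

void ReadWithoutExplicitSnapshot() {
  // Both steps are lock-free in the common case:
  SuperVersion* sv = local_super_version;  // thread-local reference, no lock
  uint64_t read_seq =
      last_published_sequence.load(std::memory_order_acquire);  // atomic load
  (void)sv;
  (void)read_seq;
  // ... serve the Get()/scan from sv, ignoring entries newer than read_seq.
}
```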

@mpoeter
Contributor Author

mpoeter commented Mar 31, 2019

Thanks @yiwu-arbug, but according to the comment in the code "accessing members of this class is not thread-safe and requires external synchronization (ie db mutex held or on write thread)", so I don't think that would help in this case.

After a few iterations I now have a solution that scales quite well. I ran the following benchmark with a varying number of threads:

--format_version=3 --memtablerep=hash_linkedlist --prefix_size=16 --allow_concurrent_memtable_write=0 --use_block_based_filter=1 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_size_ratio=0.2 --cache_numshardbits=-1 --compaction_readahead_size=4124672 --level_compaction_dynamic_level_bytes=1 --level0_file_num_compaction_trigger=2 --benchmarks=randomwithverify --deletepercent=1 --readwritepercent=98 --transaction_db=1 --use_existing_db=0 --num=1500000 --numdistinct=100 --compression_type=none

RocksDB:    version 6.0
Date:       Thu Mar 28 12:42:46 2019
CPU:        160 * Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz
CPUCache:   24576 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1500000
Prefix:    16 bytes
Keys per prefix:    0
RawSize:    165.9 MB (estimated)
FileSize:   94.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Memtablerep: hash_linkedlist
Perf Level: 1

The original code produced the following results (the first number is the number of threads).

ORIGINAL
 8 - randomwithverify :   33.127 micros/op 239043 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729778)
16 - randomwithverify :   63.849 micros/op 249517 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:732338)
24 - randomwithverify :   96.734 micros/op 247353 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730886)
32 - randomwithverify :  127.513 micros/op 250369 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730720)
48 - randomwithverify :  191.137 micros/op 250212 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730222)
64 - randomwithverify :  255.991 micros/op 249181 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730322)

My new snapshot list produced the following results:

NEW
 8 - randomwithverify :   20.195 micros/op 393435 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729583)
16 - randomwithverify :   39.700 micros/op 399848 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:731007)
24 - randomwithverify :   30.133 micros/op 773984 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729223)
32 - randomwithverify :   37.077 micros/op 825816 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:728155)
48 - randomwithverify :   54.790 micros/op 852402 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:728579)
64 - randomwithverify :   71.376 micros/op 867057 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:727323)

There are still a few minor issues to solve (e.g., at the moment a long-running snapshot blocks reclamation of older SnapshotImpl instances), but IMHO this approach seems promising.

@yiwu-arbug
Contributor

@mpoeter may I know what your change is?

@mpoeter
Contributor Author

mpoeter commented Apr 1, 2019

I reimplemented the SnapshotList and removed the locks in GetSnapshotImpl and ReleaseSnapshot. Inserting new snapshots via SnapshotList::New is lock-free. SnapshotList::Delete marks a snapshot as deleted (no lock required). If the currently deleted snapshot was the oldest, we remove all deleted entries up to the first snapshot that is not marked as deleted; for this a spin-lock is used. This spin-lock is also used to protect other operations that access snapshot instances, like GetOldestSnapshotTime and GetAll.
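
A minimal sketch of the scheme just described (an illustration with invented names, not the actual patch): New() pushes lock-free onto an atomic head, Delete() only sets a flag, and physically unlinking the deleted tail happens under a spin-lock:

```cpp
#include <atomic>
#include <cstdint>

struct SnapshotNode {
  explicit SnapshotNode(uint64_t s) : seq(s) {}
  uint64_t seq;
  std::atomic<bool> deleted{false};
  SnapshotNode* next{nullptr};  // newest -> oldest
};

class LockFreeSnapshotList {
 public:
  // Lock-free insert at the head (newest end): a classic CAS push.
  SnapshotNode* New(uint64_t seq) {
    auto* n = new SnapshotNode(seq);
    n->next = head_.load(std::memory_order_relaxed);
    while (!head_.compare_exchange_weak(n->next, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
    return n;
  }

  // Mark as deleted without a lock; trimming reclaims the deleted tail.
  void Delete(SnapshotNode* n) {
    n->deleted.store(true, std::memory_order_release);
    TrimDeletedTail();
  }

 private:
  void TrimDeletedTail() {
    while (lock_.test_and_set(std::memory_order_acquire)) {  // spin-lock
    }
    // Everything after the last live node is deleted and unreachable, so it
    // can be freed. (The all-nodes-deleted case and coordination with
    // readers like GetAll are left out of this sketch.)
    SnapshotNode* last_live = nullptr;
    for (SnapshotNode* n = head_.load(std::memory_order_acquire); n;
         n = n->next) {
      if (!n->deleted.load(std::memory_order_acquire)) last_live = n;
    }
    if (last_live != nullptr) {
      SnapshotNode* n = last_live->next;
      last_live->next = nullptr;
      while (n != nullptr) {
        SnapshotNode* next = n->next;
        delete n;
        n = next;
      }
    }
    lock_.clear(std::memory_order_release);
  }

  std::atomic<SnapshotNode*> head_{nullptr};
  std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
};
```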

I will try to improve the code and fix the remaining issues. Once they are sorted out I'll create a PR where we can discuss the changes in more detail.

@yiwu-arbug
Contributor

Making SnapshotList lock-free is nice. Regarding SuperVersion: it can be obtained lock-free; see ColumnFamilyData::GetThreadLocalSuperVersion().

@siying
Contributor

siying commented Apr 1, 2019

@maysamyabandeh MyRocks uses 2PC and has a binlog dependency. It's likely that other applications don't depend on those functionalities, so I would be careful about using sysbench to rule out a performance bottleneck.

@mpoeter
Contributor Author

mpoeter commented Apr 4, 2019

@yiwu-arbug my implementation is only partially lock-free - removing entries from the list still requires a lock. That's why most threads only mark their snapshot as "deleted" and leave the removal of that entry to the owner of the oldest active snapshot. The main reason is that I want to avoid the additional complexity of introducing a reclamation scheme like EBR (epoch-based reclamation) or DEBRA. Also, a lock-free remove operation would be more expensive, resulting in unnecessary overhead for uncontended remove operations.
Thanks for the hint about GetThreadLocalSuperVersion, I will take a look at that.

@maysamyabandeh
Contributor

@yiwu-arbug are you still planning to look into this?

@yiwu-arbug
Contributor

yiwu-arbug commented Sep 9, 2019

No plan on our side. We determined that this is not a bottleneck for us.

@mpoeter
Contributor Author

mpoeter commented Sep 10, 2019

@maysamyabandeh @yiwu-arbug Unfortunately I had to put my experiments on ice for the last few months. I still want to keep working on this at some point; I just have to find the time for it.
Anyway, after some more consideration I think I will probably go with a fully lock-free queue, even if that means I have to introduce some reclamation scheme.

@frew

frew commented Jan 6, 2022

@mpoeter We're running into this issue somewhat (via TiKV's use of RocksDB snapshots) - did you ever end up being able to get back to this?

@mpoeter
Contributor Author

mpoeter commented Jan 16, 2022

@frew Yes, but only for a short time before I was again sucked into a different project. However, I switched jobs in the meantime and am now working even more with RocksDB, so there is a good chance I might look into this again, but probably not in the next weeks.

@riversand963 riversand963 added the performance Issues related to performance that may or may not be bugs label Feb 2, 2022
@riversand963 riversand963 self-assigned this Mar 22, 2022
@frew

frew commented Mar 29, 2022

@riversand963 Are you working on this issue? No worries either way - just trying to figure out logistics on our end.

@riversand963
Contributor

@frew I haven't started this yet, but I am interested in it. Self-assigning to either work on it or follow up.
I haven't caught up with the discussion here so far either, but will do so later.
I wonder whether we can reduce the time holding the db mutex by allowing applications to share snapshots. Currently, IIUC, a new snapshot object is created and inserted into the snapshot list for every GetSnapshot call, even if multiple snapshots may have the same seq. Insertion into the list requires the mutex. If we reduce the number of insertions, I guess we may be able to reduce contention (see the sketch below).
This would require an API change to GetSnapshot, though.
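
For illustration, a rough sketch of that idea (invented names, not a proposed API): GetSnapshot calls that observe the same sequence number share one refcounted entry, so the structure is only modified on the first acquire and the last release of each distinct seq:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

struct SharedSnapshot {
  uint64_t seq = 0;
  int refs = 0;
};

// Hypothetical sketch: a mutex is still taken, but snapshots that share a
// sequence number reuse one entry, so the map sees far fewer insertions.
class SharingSnapshotList {
 public:
  SharedSnapshot* Acquire(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    SharedSnapshot& s = by_seq_[seq];  // creates the entry on first use
    s.seq = seq;
    ++s.refs;
    return &s;  // std::map nodes are stable, so the pointer stays valid
  }

  void Release(SharedSnapshot* s) {
    std::lock_guard<std::mutex> l(mu_);
    if (--s->refs == 0) by_seq_.erase(s->seq);  // last user removes it
  }

  uint64_t Oldest(uint64_t no_snapshot) const {
    std::lock_guard<std::mutex> l(mu_);
    return by_seq_.empty() ? no_snapshot : by_seq_.begin()->first;
  }

 private:
  mutable std::mutex mu_;
  std::map<uint64_t, SharedSnapshot> by_seq_;  // ordered: begin() is oldest
};
```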
