
Snapshot management scales poorly #5083

Open
mpoeter opened this issue Mar 19, 2019 · 16 comments
Labels: performance (Issues related to performance that may or may not be bugs)

@mpoeter
Contributor

mpoeter commented Mar 19, 2019

GetSnapshotImpl and ReleaseSnapshot acquire the global mutex_ to perform their operations. This can result in high contention and therefore bad scalability for workloads that make heavy use of snapshots. This can be seen by running the "randomwithverify" benchmark (this seems to be the only one that uses snapshots).

This is a screenshot from a VTune analysis of the "randomwithverify" benchmark with 4 threads. It clearly shows how the snapshot operations are serialized (the yellow lines indicate transitions where a mutex unlock wakes another thread that was waiting for it).
[screenshot: VTune thread timeline of the randomwithverify benchmark]
Obviously, a higher number of threads intensifies this problem...

I have considered different ways to improve the situation:

  • use a lock-free doubly-linked list for the snapshot list (e.g., based on Sundell and Tsigas).
  • use N separate snapshot lists, each protected by its own mutex, and distribute the operating threads evenly among them (see the sketch after this list).
  • if the mutex is already locked, record the operation and let the mutex-owning thread apply the recorded operations (similar to flat combining).
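
For illustration, here is a minimal sketch of the second option (sharded snapshot lists). All names (ShardedSnapshotList, SnapshotShard, kShards) are invented for this example, and snapshots are reduced to bare sequence numbers; it demonstrates the contention-splitting idea, not a drop-in replacement:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical sketch: N lists, each with its own mutex. Threads are spread
// across shards by hashing their id, so snapshot creation/release on
// different shards never contend on the same lock.
struct SnapshotShard {
  std::mutex mu;
  std::list<uint64_t> snapshots;  // sequence numbers, oldest first
};

class ShardedSnapshotList {
 public:
  static constexpr size_t kShards = 16;

  // Insert into the shard selected by the calling thread's id.
  std::list<uint64_t>::iterator New(uint64_t seq, size_t* shard_out) {
    size_t s = ShardForThisThread();
    *shard_out = s;
    std::lock_guard<std::mutex> l(shards_[s].mu);
    return shards_[s].snapshots.insert(shards_[s].snapshots.end(), seq);
  }

  // Release only locks the shard the snapshot was inserted into.
  void Delete(size_t shard, std::list<uint64_t>::iterator it) {
    std::lock_guard<std::mutex> l(shards_[shard].mu);
    shards_[shard].snapshots.erase(it);
  }

  // The downside: finding the oldest snapshot must scan all shards.
  uint64_t Oldest(uint64_t no_snapshot) {
    uint64_t oldest = no_snapshot;
    for (auto& shard : shards_) {
      std::lock_guard<std::mutex> l(shard.mu);
      if (!shard.snapshots.empty() && shard.snapshots.front() < oldest) {
        oldest = shard.snapshots.front();
      }
    }
    return oldest;
  }

 private:
  static size_t ShardForThisThread() {
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
  }

  std::array<SnapshotShard, kShards> shards_;
};
```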

But obviously they all have their own pros and cons, so I would like to get some input from the core team.

I suppose you are aware of this potential bottleneck - do you have other ideas or even plans to improve the scalability of snapshots? If we can agree on how to approach this, I am happy to make the corresponding changes and create a PR.

@maysamyabandeh
Contributor

The overhead is not zero, but it is also not a major bottleneck in the benchmarks that I work with, like sysbench read-write, read-only, and update-noindex. I, as probably many others, tried a couple of approaches in the past but did not see a tangible improvement in the benchmark results.
If you have any ideas, feel free to try them out, and if they yield a tangible improvement with reasonable complexity, feel free to send a PR. My suggestion:

  1. Start with a standard benchmark that highlights the bottleneck.
  2. Do a rapid/hackish prototype of your approach, measure the impact on the benchmark, and share the results.
  3. Using the numbers and the feedback, it will be much easier to decide whether you want to spend time on a production-quality patch for this change.

@mpoeter
Contributor Author

mpoeter commented Mar 21, 2019

I originally found this bottleneck while profiling ArangoDB which uses RocksDB as its storage engine. It looks like the only benchmark in db_bench that uses snapshots is randomwithverify. I ran db_bench with the following parameters:

--format_version=3 --memtablerep=hash_linkedlist --prefix_size=16 --allow_concurrent_memtable_write=0 --use_block_based_filter=1 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_size_ratio=0.2 --cache_numshardbits=-1 --compaction_readahead_size=4124672 --level_compaction_dynamic_level_bytes=1 --level0_file_num_compaction_trigger=2 --benchmarks=randomwithverify --deletepercent=1 --readwritepercent=95 --transaction_db=1 --use_existing_db=0 --num=2428800 --numdistinct=100 --threads=8

For which the profiler recorded the following CPU utilization.
[screenshot: CPU utilization profile of the 8-thread run]
This shows that there is some potential for improvement. 🙂 (in my original use case the results look even worse)
I will experiment a bit and report back with the results, but I have two questions:

  • GetSnapshotImpl acquires the mutex_ before loading the snapshot sequence and inserting the snapshot into the snapshot list. Does this have to be atomic? I suppose not (i.e., I don't have to hold the mutex_ to load the sequence).
  • ReleaseSnapshot unconditionally calls UpdateOldestSnapshot for each column family and possibly schedules a compaction. Wouldn't it be sufficient to do this only if the oldest snapshot has changed, i.e., only if the currently released snapshot was the oldest one? (see the sketch after this list)
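
For the second question, here is a tiny sketch of the suggested check (names invented; snapshots are reduced to sequence numbers in a multiset for illustration):

```cpp
#include <cstdint>
#include <mutex>
#include <set>

// Hypothetical illustration: track snapshot sequence numbers and report on
// release whether the oldest one changed, so the caller could skip
// UpdateOldestSnapshot / compaction scheduling in the common case.
class SnapshotTracker {
 public:
  void Acquire(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    seqs_.insert(seq);
  }

  // Returns true iff the oldest snapshot changed as a result of this
  // release; only then would the expensive per-column-family work be needed.
  bool Release(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    seqs_.erase(seqs_.find(seq));
    return seqs_.empty() || *seqs_.begin() > seq;
  }

 private:
  std::mutex mu_;
  std::multiset<uint64_t> seqs_;  // multiset: snapshots may share a seq
};
```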

@yiwu-arbug
Contributor

rocksdb internally has another way to get a consistent view for Get() or range scans when the user doesn't provide a snapshot. It works by holding a sequence number and a SuperVersion reference. The latter holds a version of the LSM tree. The sequence number can be acquired without a lock, and the SuperVersion is stored thread-locally and can be acquired without a lock in most cases. If lock contention on GetSnapshot is really a problem for some use cases, I'm wondering if it is possible to expose the SuperVersion as another type of snapshot to the user.
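
To make the mechanism concrete, here is a conceptual sketch (not RocksDB's actual code; all names and the layout are simplified for illustration):

```cpp
#include <atomic>
#include <cstdint>

struct SuperVersion {};  // stands in for the immutable memtable/SST view

std::atomic<uint64_t> last_published_sequence{0};  // bumped by the write path
thread_local SuperVersion* local_super_version = nullptr;  // cached per thread

void ReadWithoutExplicitSnapshot() {
  // Both steps are lock-free in the common case:
  SuperVersion* sv = local_super_version;  // thread-local reference, no lock
  uint64_t read_seq =
      last_published_sequence.load(std::memory_order_acquire);  // atomic load
  (void)sv;
  (void)read_seq;
  // ... serve the Get()/scan from sv, ignoring entries newer than read_seq.
}
```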

@mpoeter
Contributor Author

mpoeter commented Mar 31, 2019

Thanks @yiwu-arbug, but according to the comment in the code "accessing members of this class is not thread-safe and requires external synchronization (ie db mutex held or on write thread)", so I don't think that would help in this case.

After a few iterations I now have a solution that scales quite well. I ran the following benchmark with a varying number of threads:

--format_version=3 --memtablerep=hash_linkedlist --prefix_size=16 --allow_concurrent_memtable_write=0 --use_block_based_filter=1 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_size_ratio=0.2 --cache_numshardbits=-1 --compaction_readahead_size=4124672 --level_compaction_dynamic_level_bytes=1 --level0_file_num_compaction_trigger=2 --benchmarks=randomwithverify --deletepercent=1 --readwritepercent=98 --transaction_db=1 --use_existing_db=0 --num=1500000 --numdistinct=100 --compression_type=none

RocksDB:    version 6.0
Date:       Thu Mar 28 12:42:46 2019
CPU:        160 * Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz
CPUCache:   24576 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1500000
Prefix:    16 bytes
Keys per prefix:    0
RawSize:    165.9 MB (estimated)
FileSize:   94.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Memtablerep: hash_linkedlist
Perf Level: 1

The original code produced the following results (the first number is the number of threads).

ORIGINAL
 8 - randomwithverify :   33.127 micros/op 239043 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729778)
16 - randomwithverify :   63.849 micros/op 249517 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:732338)
24 - randomwithverify :   96.734 micros/op 247353 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730886)
32 - randomwithverify :  127.513 micros/op 250369 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730720)
48 - randomwithverify :  191.137 micros/op 250212 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730222)
64 - randomwithverify :  255.991 micros/op 249181 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:730322)

My new snapshot list produced the following results:

NEW
 8 - randomwithverify :   20.195 micros/op 393435 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729583)
16 - randomwithverify :   39.700 micros/op 399848 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:731007)
24 - randomwithverify :   30.133 micros/op 773984 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:729223)
32 - randomwithverify :   37.077 micros/op 825816 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:728155)
48 - randomwithverify :   54.790 micros/op 852402 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:728579)
64 - randomwithverify :   71.376 micros/op 867057 ops/sec; ( get:1470000 put:15000 del:15000 total:1500000 found:727323)

There are still a few minor issues to solve (e.g., at the moment a long-running snapshot blocks reclamation of older SnapshotImpl instances), but IMHO this approach seems promising.

@yiwu-arbug
Contributor

@mpoeter may I know what your change is?

@mpoeter
Contributor Author

mpoeter commented Apr 1, 2019

I reimplemented the SnapshotList and removed the locks in GetSnapshotImpl and ReleaseSnapshot. Inserting new snapshots via SnapshotList::New is lock-free. SnapshotList::Delete marks a snapshot as deleted (no lock required). If the currently deleted snapshot was the oldest, we remove all deleted entries up to the first snapshot that is not marked as deleted; for this a spin-lock is used. This spin-lock is also used to protect other operations that access snapshot instances, like GetOldestSnapshotTime and GetAll.
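
A minimal sketch of the scheme just described (an illustration with invented names, not the actual patch): New() pushes lock-free onto an atomic head, Delete() only sets a flag, and physically unlinking the deleted tail happens under a spin-lock:

```cpp
#include <atomic>
#include <cstdint>

struct SnapshotNode {
  explicit SnapshotNode(uint64_t s) : seq(s) {}
  uint64_t seq;
  std::atomic<bool> deleted{false};
  SnapshotNode* next{nullptr};  // newest -> oldest
};

class LockFreeSnapshotList {
 public:
  // Lock-free insert at the head (newest end): a classic CAS push.
  SnapshotNode* New(uint64_t seq) {
    auto* n = new SnapshotNode(seq);
    n->next = head_.load(std::memory_order_relaxed);
    while (!head_.compare_exchange_weak(n->next, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
    return n;
  }

  // Mark as deleted without a lock; trimming reclaims the deleted tail.
  void Delete(SnapshotNode* n) {
    n->deleted.store(true, std::memory_order_release);
    TrimDeletedTail();
  }

 private:
  void TrimDeletedTail() {
    while (lock_.test_and_set(std::memory_order_acquire)) {  // spin-lock
    }
    // Everything after the last live node is deleted and unreachable, so it
    // can be freed. (The all-nodes-deleted case and coordination with
    // readers like GetAll are left out of this sketch.)
    SnapshotNode* last_live = nullptr;
    for (SnapshotNode* n = head_.load(std::memory_order_acquire); n;
         n = n->next) {
      if (!n->deleted.load(std::memory_order_acquire)) last_live = n;
    }
    if (last_live != nullptr) {
      SnapshotNode* n = last_live->next;
      last_live->next = nullptr;
      while (n != nullptr) {
        SnapshotNode* next = n->next;
        delete n;
        n = next;
      }
    }
    lock_.clear(std::memory_order_release);
  }

  std::atomic<SnapshotNode*> head_{nullptr};
  std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
};
```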

I will try to improve the code and fix the remaining issues. Once they are sorted out I'll create a PR where we can discuss the changes in more detail.

@yiwu-arbug
Contributor

Making SnapshotList lock-free is nice. Regarding SuperVersion: it can be obtained lock-free; see ColumnFamilyData::GetThreadLocalSuperVersion().

@siying
Contributor

siying commented Apr 1, 2019

@maysamyabandeh MyRocks uses 2PC and has a binlog dependency. It's likely that other applications don't depend on those functionalities, so I would be careful about using sysbench to rule out a performance bottleneck.

@mpoeter
Contributor Author

mpoeter commented Apr 4, 2019

@yiwu-arbug my implementation is only partially lock-free - removing entries from the list still requires a lock. That's why most threads only mark their snapshot as "deleted" and leave the removal of that entry to the owner of the oldest active snapshot. The main reason is that I want to avoid the additional complexity of introducing a reclamation scheme like EBR (epoch-based reclamation) or DEBRA. Also, a lock-free remove operation would be more expensive, resulting in unnecessary overhead for uncontended remove operations.
Thanks for the hint about GetThreadLocalSuperVersion, I will take a look at that.

@maysamyabandeh
Contributor

@yiwu-arbug are you still planning to look into this?

@yiwu-arbug
Contributor

yiwu-arbug commented Sep 9, 2019

No plan on our side. We determined that this is not a bottleneck for us.

@mpoeter
Contributor Author

mpoeter commented Sep 10, 2019

@maysamyabandeh @yiwu-arbug Unfortunately I had to put my experiments on ice for the last few months. I still want to keep working on this at some point; I just have to find the time for it.
Anyway, after some more consideration I think I will probably go with a fully lock-free queue, even if that means I have to introduce some reclamation scheme.

@frew

frew commented Jan 6, 2022

@mpoeter We're running into this issue somewhat (via TiKV's use of RocksDB snapshots) - did you ever end up being able to get back to this?

@mpoeter
Contributor Author

mpoeter commented Jan 16, 2022

@frew Yes, but only for a short time before I was again sucked into a different project. However, I switched jobs in the meantime and am now working even more with RocksDB, so there is a good chance I might look into this again, but probably not in the next weeks.

@riversand963 riversand963 added the performance Issues related to performance that may or may not be bugs label Feb 2, 2022
@riversand963 riversand963 self-assigned this Mar 22, 2022
@frew

frew commented Mar 29, 2022

@riversand963 Are you working on this issue? No worries either way - just trying to figure out logistics on our end.

@riversand963
Contributor

@frew I haven't started this yet, but I am interested in it. Self-assigning to either work on it or follow up.
I haven't caught up with the discussion here so far either, but will do so later.
I wonder whether we can reduce the time holding the db mutex by allowing applications to share snapshots. Currently, IIUC, a new snapshot object is created and inserted into the snapshot list for every GetSnapshot call, even if multiple snapshots may have the same seq. Insertion into the list requires the mutex. If we reduce the number of insertions, I guess we may be able to reduce contention (see the sketch below).
This would require an API change to GetSnapshot, though.
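
For illustration, a rough sketch of that idea (invented names, not a proposed API): GetSnapshot calls that observe the same sequence number share one refcounted entry, so the structure is only modified on the first acquire and the last release of each distinct seq:

```cpp
#include <cstdint>
#include <map>
#include <mutex>

struct SharedSnapshot {
  uint64_t seq = 0;
  int refs = 0;
};

// Hypothetical sketch: a mutex is still taken, but snapshots that share a
// sequence number reuse one entry, so the map sees far fewer insertions.
class SharingSnapshotList {
 public:
  SharedSnapshot* Acquire(uint64_t seq) {
    std::lock_guard<std::mutex> l(mu_);
    SharedSnapshot& s = by_seq_[seq];  // creates the entry on first use
    s.seq = seq;
    ++s.refs;
    return &s;  // std::map nodes are stable, so the pointer stays valid
  }

  void Release(SharedSnapshot* s) {
    std::lock_guard<std::mutex> l(mu_);
    if (--s->refs == 0) by_seq_.erase(s->seq);  // last user removes it
  }

  uint64_t Oldest(uint64_t no_snapshot) const {
    std::lock_guard<std::mutex> l(mu_);
    return by_seq_.empty() ? no_snapshot : by_seq_.begin()->first;
  }

 private:
  mutable std::mutex mu_;
  std::map<uint64_t, SharedSnapshot> by_seq_;  // ordered: begin() is oldest
};
```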
