Create a BlockCacheLookupContext to enable fine-grained block cache tracing. #5421
Conversation
@HaoyuHuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
One general comment regarding BlockBasedTable: the patch in general adds the BlockCacheLookupContext* parameter as the first parameter of methods. Since this tracing is auxiliary functionality, I feel a better place would be closer to the end of the parameter list, e.g. after GetContext* where applicable, similarly to how it's done in the FilterBlockReader classes.
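For illustration, a minimal before/after sketch of the suggested ordering. The stand-in types and the method name here are hypothetical, not the patch's actual signatures:

```cpp
// Stand-in declarations only; the real RocksDB types and methods differ.
struct Status {};
struct ReadOptions {};
struct GetContext {};
struct BlockCacheLookupContext {};

// Before: the tracing context leads the parameter list.
Status ReadBlock(BlockCacheLookupContext* lookup_context,
                 const ReadOptions& options, GetContext* get_context);

// After (suggested): the auxiliary tracing parameter trails the list,
// after GetContext*, matching the FilterBlockReader convention.
Status ReadBlock(const ReadOptions& options, GetContext* get_context,
                 BlockCacheLookupContext* lookup_context);
```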
I've taken a first pass. Can we remove the use of default values for function arguments?
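For illustration, a hedged sketch of the concern with defaults, using hypothetical names rather than anything from the patch:

```cpp
struct BlockCacheLookupContext {};

// With a default, an untraced call path compiles without notice:
void GetWithDefault(BlockCacheLookupContext* lookup_context = nullptr);

// Without one, every call site must pass the context explicitly, so a
// missing lookup_context shows up at review time rather than silently:
void GetExplicit(BlockCacheLookupContext* lookup_context);
```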
…_flush
Summary: Pull Request resolved: facebook#5422
Differential Revision: D15706212
Pulled By: ltamasi
fbshipit-source-id: 0acf060fb8568efee51c033e50b492bcf1095a4c
@HaoyuHuang has updated the pull request.
@HaoyuHuang has updated the pull request.
@HaoyuHuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Special characters like slashes and parentheses are not supported.
Pull Request resolved: facebook#5424
Differential Revision: D15708067
Pulled By: ltamasi
fbshipit-source-id: 90527ec3ee882a0cdd1249c3946f5eff2ff7c115
Another pass with a few comments.
@HaoyuHuang has updated the pull request.
…bleIterator (facebook#5428)
Summary: PR facebook#5111 reduced the number of key comparisons when iterating with upper/lower bounds; however, this caused a regression for MyRocks. Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: facebook#5428
Differential Revision: D15721038
Pulled By: ltamasi
fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
…error (facebook#5412)
Summary: I'm not able to prove it, but the stress test failure may be caused by the following sequence of events:
1. Crash db_stress while writing the log file. This should result in a corrupted WAL.
2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing the manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery, and the CF log number will point to the newer log.
3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error.
Test:
1. stress_crash
2. make check
Pull Request resolved: facebook#5412
Differential Revision: D15699120
Pulled By: anand1976
fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe
Can you run an in-memory benchmark comparing the performance with that of master, and make sure there is no regression?
Sure. I can do that. I will give an update here on Monday.
…lt type (facebook#5432)
Summary: This affects some TSAN builds:
env/env_test.cc: In member function ‘virtual void rocksdb::EnvPosixTestWithParam_MultiRead_Test::TestBody()’:
env/env_test.cc:1126:76: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
  auto data = NewAligned(kSectorSize * 8, static_cast<const char>(i + 1));
env/env_test.cc:1154:77: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
  auto buf = NewAligned(kSectorSize * 8, static_cast<const char>(i*2 + 1));
Pull Request resolved: facebook#5432
Differential Revision: D15727277
Pulled By: ltamasi
fbshipit-source-id: dc0e687b123e7c4d703ccc0c16b7167e07d1c9b0
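The warning fires because a top-level const on a scalar cast result is meaningless and ignored by the compiler. A minimal standalone reproduction and the kind of fix implied by the diagnostic (sketched outside the original test, so the NewAligned helper is omitted):

```cpp
int main() {
  int i = 3;
  // Triggers -Wignored-qualifiers (an error under -Werror):
  //   char c = static_cast<const char>(i + 1);
  // Dropping the meaningless const on the cast result silences the
  // warning without changing behavior:
  char c = static_cast<char>(i + 1);
  return c == 4 ? 0 : 1;
}
```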
I loaded 1 million key-value pairs into the database and ran the readrandom benchmark. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
Summary: The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutexes for PreparedHeap: one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the second, the two write queues no longer compete with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: facebook#5420
Differential Revision: D15741985
Pulled By: maysamyabandeh
fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
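A simplified, illustrative sketch of the two-mutex split described in that summary (not the actual WritePrepared code): AddPrepared touches only the push/pop mutex, while RemovePrepared works under prepared_mutex_ and reaches for the push/pop mutex only when an erase actually pops the heap.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <vector>

// Illustrative only; the real PreparedHeap bookkeeping is more involved.
class PreparedHeapSketch {
 public:
  // Called only from the first write queue.
  void AddPrepared(uint64_t seq) {
    std::lock_guard<std::mutex> lock(push_pop_mutex_);
    heap_.push(seq);
  }
  // Called mostly from the second write queue, so the two queues rarely
  // contend over the same mutex.
  void RemovePrepared(uint64_t seq) {
    std::lock_guard<std::mutex> lock(prepared_mutex_);
    erased_.insert(seq);
    MaybePop();  // occasionally also takes push_pop_mutex_
  }

 private:
  void MaybePop() {
    // Only when erased entries reach the top does erase() end up popping,
    // which requires the push/pop mutex as well.
    std::lock_guard<std::mutex> lock(push_pop_mutex_);
    while (!heap_.empty() && erased_.erase(heap_.top()) > 0) {
      heap_.pop();
    }
  }
  std::mutex push_pop_mutex_;  // guards heap_ push/pop (first write queue)
  std::mutex prepared_mutex_;  // guards erased_ bookkeeping (second queue)
  std::priority_queue<uint64_t, std::vector<uint64_t>,
                      std::greater<uint64_t>> heap_;  // min-heap of seqs
  std::set<uint64_t> erased_;
};
```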
Mostly LGTM. Left a few minor comments.
Status ReadCompressionDictBlock(
    FilePrefetchBuffer* prefetch_buffer,
    std::unique_ptr<const BlockContents>* compression_dict_block) const;
Status PrefetchIndexAndFilterBlocks(
No empty line please.
My bad. Removed.
@@ -1360,9 +1376,11 @@ Status BlockBasedTable::ReadCompressionDictBlock(
}

Status BlockBasedTable::PrefetchIndexAndFilterBlocks(
No empty line please.
Removed.
@@ -3052,6 +3110,8 @@ Status BlockBasedTable::Prefetch(const Slice* const begin,
}

Status BlockBasedTable::VerifyChecksum() {
  // TODO(haoyu): This function is called by external sst ingestion and user.
External SST ingestion is also invoked by the user.
I updated the comments.
…cebook#5413)
Summary: In a regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for a secondary instance, i.e. `DBImplSecondary`, due to the logic of MANIFEST and WAL replay. When replaying the log files of the primary, the secondary instance first replays the MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary, and inserts key-value pairs into memtables. The following can occur when the db has two or more column families. Assume the db has column families "default" and "cf1". At a certain point in time, both "default" and "cf1" have data in memtables.
1. The primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. The secondary replays the MANIFEST and updates its db sequence number to the latest value learned from the MANIFEST.
3. The secondary starts to replay the WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to readers until "default" is later flushed.
Pull Request resolved: facebook#5413
Differential Revision: D15637407
Pulled By: riversand963
fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
@HaoyuHuang has updated the pull request.
…ook#5314)
Summary: Instead of creating a new DataBlockIterator for every key in a MultiGet batch, reuse it if the next key is in the same block. This results in a small 1-2% cpu improvement.
TEST_TMPDIR=/dev/shm/multiget numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Without the change:
multireadrandom : 3.066 micros/op 326122 ops/sec; (29375968 of 29375968 found)
With the change:
multireadrandom : 3.003 micros/op 332945 ops/sec; (29983968 of 29983968 found)
Pull Request resolved: facebook#5314
Differential Revision: D15742108
Pulled By: anand1976
fbshipit-source-id: 220fb0b8eea9a0d602ddeb371528f7af7936d771
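A hedged sketch of the reuse pattern that summary describes; BlockHandle and BlockIter here are stand-ins, not the actual RocksDB types:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in types for illustration only.
struct BlockHandle {
  uint64_t offset = 0;
  bool operator==(const BlockHandle& o) const { return offset == o.offset; }
};

struct BlockIter {
  void SeekToBlock(const BlockHandle& /*h*/) { /* read and pin the block */ }
  void Seek(const std::string& /*key*/) { /* search within the block */ }
};

// For each key in the MultiGet batch, re-create (re-seek) the data block
// iterator only when the key lands in a different block than the previous
// key; consecutive keys in the same block reuse the loaded block.
void MultiGetSketch(
    const std::vector<std::pair<std::string, BlockHandle>>& sorted_keys) {
  BlockIter iter;
  const BlockHandle* prev = nullptr;
  for (const auto& kv : sorted_keys) {
    if (prev == nullptr || !(*prev == kv.second)) {
      iter.SeekToBlock(kv.second);  // expensive path: load a new block
      prev = &kv.second;
    }
    iter.Seek(kv.first);  // cheap path: search the current block
  }
}
```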
LGTM.
@HaoyuHuang has updated the pull request.
@HaoyuHuang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@HaoyuHuang merged this pull request in 5efa0d6.
…racing. (facebook#5421)
Summary: BlockCacheLookupContext only contains the caller for now. We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize)
I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.
Experiment setup:
RocksDB: version 6.2
Date: Mon Jun 10 10:42:51 2019
CPU: 24 * Intel Core Processor (Skylake)
CPUCache: 16384 KB
Keys: 20 bytes each
Values: 100 bytes each (100 bytes after compression)
Entries: 1000000
Prefix: 20 bytes
Keys per prefix: 0
RawSize: 114.4 MB (estimated)
FileSize: 114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: facebook#5421
Differential Revision: D15704258
Pulled By: HaoyuHuang
fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
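Per the summary, the context only carries the caller for now. A sketch of what that shape could look like; the enum name and the exact enumerators are inferred from the callers listed above, not copied from the patch:

```cpp
// Inferred sketch; names are assumptions based on the summary, not the
// actual patch.
enum class BlockCacheLookupCaller : char {
  kUserGet,
  kUserMGet,
  kUserIterator,
  kCompaction,
  kPrefetch,
  kUserApproximateSize,
};

struct BlockCacheLookupContext {
  explicit BlockCacheLookupContext(BlockCacheLookupCaller c) : caller(c) {}
  const BlockCacheLookupCaller caller;
  // Block type, block key, referenced key, and hit/miss information would
  // be added once the tracer is integrated (see the TODOs above).
};
```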