Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a BlockCacheLookupContext to enable fine-grained block cache tracing. #5421

Closed

Conversation

HaoyuHuang
Copy link
Contributor

@HaoyuHuang HaoyuHuang commented Jun 6, 2019

BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:

  1. BlockBasedTable::GetFilter.
  2. BlockBasedTable::GetUncompressedDict.
  3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
  4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
  5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:

  1. BlockBasedTable::Get. (kUserGet)
  2. BlockBasedTable::MultiGet. (kUserMGet)
  3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
  4. BlockBasedTable::Open. (kPrefetch)
  5. Index/Filter::CacheDependencies. (kPrefetch)
  6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB: version 6.2
Date: Mon Jun 10 10:42:51 2019
CPU: 24 * Intel Core Processor (Skylake)
CPUCache: 16384 KB
Keys: 20 bytes each
Values: 100 bytes each (100 bytes after compression)
Entries: 1000000
Prefix: 20 bytes
Keys per prefix: 0
RawSize: 114.4 MB (estimated)
FileSize: 114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:

  1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
  2. Integrate tracer to trace block cache accesses.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoyuHuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ltamasi ltamasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One general comment regarding BlockBasedTable: the patch in general adds the BlockCacheLookupContext* parameter as the first parameter of methods. Since this tracing is auxiliary functionality, I feel a better place would be closer to the end of the parameter list, e.g. after GetContext* where applicable, similarly to how it's done in the FilterBlockReader classes.

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken a first pass. Can we remove the usage of default value of function arguments?

db/version_set.h Outdated Show resolved Hide resolved
db/version_set.h Outdated Show resolved Hide resolved
db/version_set.h Outdated Show resolved Hide resolved
ltamasi and others added 2 commits June 6, 2019 16:17
…_flush

Summary: Pull Request resolved: facebook#5422

Differential Revision: D15706212

Pulled By: ltamasi

fbshipit-source-id: 0acf060fb8568efee51c033e50b492bcf1095a4c
@facebook-github-bot
Copy link
Contributor

@HaoyuHuang has updated the pull request. Re-import the pull request

@facebook-github-bot
Copy link
Contributor

@HaoyuHuang has updated the pull request. Re-import the pull request

@HaoyuHuang
Copy link
Contributor Author

  1. I moved BlockCacheLookupContext to be after GetContext in functions where there is a GetContext. Otherwise, I moved it to be the last function argument.
  2. I removed the default values of functions arguments for the functions that this PR changes.

@HaoyuHuang HaoyuHuang requested review from riversand963 and ltamasi and removed request for riversand963 June 7, 2019 00:20
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoyuHuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary:
Special characters like slashes and parentheses are not supported.
Pull Request resolved: facebook#5424

Differential Revision: D15708067

Pulled By: ltamasi

fbshipit-source-id: 90527ec3ee882a0cdd1249c3946f5eff2ff7c115
Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass with a few comments.

table/block_based/partitioned_filter_block.h Outdated Show resolved Hide resolved
table/block_based/block_based_table_reader.cc Outdated Show resolved Hide resolved
table/block_based/block_based_table_reader.cc Outdated Show resolved Hide resolved
table/block_based/block_based_table_reader.cc Outdated Show resolved Hide resolved
table/block_based/block_based_table_reader.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@HaoyuHuang has updated the pull request. Re-import the pull request

ltamasi and others added 2 commits June 7, 2019 15:17
…bleIterator (facebook#5428)

Summary:
PR facebook#5111 reduced the number of key comparisons when iterating with
upper/lower bounds; however, this caused a regression for MyRocks.
Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: facebook#5428

Differential Revision: D15721038

Pulled By: ltamasi

fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
…error (facebook#5412)

Summary:
I'm not able to prove it, but the stress test failure may be caused by the following sequence of events -

1. Crash db_stress while writing the log file. This should result in a corrupted WAL.
2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery and the CF log number will point to the newer log
3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error

Test -
1. stress_crash
2. make check
Pull Request resolved: facebook#5412

Differential Revision: D15699120

Pulled By: anand1976

fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe
@riversand963
Copy link
Contributor

riversand963 commented Jun 8, 2019

Can you run some in-memory benchmark comparing the performance with that of master, and make sure there is no regression.
Update:
with db_bench will be good enough for now.

@HaoyuHuang
Copy link
Contributor Author

Sure. I can do that. I will give an update here on Monday.

…lt type (facebook#5432)

Summary:
This affects some TSAN builds:

env/env_test.cc: In member function ‘virtual void rocksdb::EnvPosixTestWithParam_MultiRead_Test::TestBody()’:
env/env_test.cc:1126:76: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
       auto data = NewAligned(kSectorSize * 8, static_cast<const char>(i + 1));
                                                                            ^
env/env_test.cc:1154:77: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
       auto buf = NewAligned(kSectorSize * 8, static_cast<const char>(i*2 + 1));
                                                                             ^
Pull Request resolved: facebook#5432

Differential Revision: D15727277

Pulled By: ltamasi

fbshipit-source-id: dc0e687b123e7c4d703ccc0c16b7167e07d1c9b0
@HaoyuHuang
Copy link
Contributor Author

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark. I gave the block cache 1 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: facebook#5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. Left a few minor comments.

Status ReadCompressionDictBlock(
FilePrefetchBuffer* prefetch_buffer,
std::unique_ptr<const BlockContents>* compression_dict_block) const;
Status PrefetchIndexAndFilterBlocks(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No empty line please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Removed.

@@ -1360,9 +1376,11 @@ Status BlockBasedTable::ReadCompressionDictBlock(
}

Status BlockBasedTable::PrefetchIndexAndFilterBlocks(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No empty line please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -3052,6 +3110,8 @@ Status BlockBasedTable::Prefetch(const Slice* const begin,
}

Status BlockBasedTable::VerifyChecksum() {
// TODO(haoyu): This function is called by external sst ingestion and user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

External sst ingestion is also called by user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comments.

riversand963 and others added 2 commits June 10, 2019 12:58
…cebook#5413)

Summary:
In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay.
When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables.
1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST.
3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
Pull Request resolved: facebook#5413

Differential Revision: D15637407

Pulled By: riversand963

fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
@facebook-github-bot
Copy link
Contributor

@HaoyuHuang has updated the pull request. Re-import the pull request

…ook#5314)

Summary:
Instead of creating a new DataBlockIterator for every key in a MultiGet batch, reuse it if the next key is in the same block. This results in a small 1-2% cpu improvement.

TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4

Without the change -
multireadrandom :       3.066 micros/op 326122 ops/sec; (29375968 of 29375968 found)

With the change -
multireadrandom :       3.003 micros/op 332945 ops/sec; (29983968 of 29983968 found)
Pull Request resolved: facebook#5314

Differential Revision: D15742108

Pulled By: anand1976

fbshipit-source-id: 220fb0b8eea9a0d602ddeb371528f7af7936d771
Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@facebook-github-bot
Copy link
Contributor

@HaoyuHuang has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoyuHuang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@HaoyuHuang merged this pull request in 5efa0d6.

vagogte pushed a commit to vagogte/rocksdb that referenced this pull request Jun 18, 2019
…racing. (facebook#5421)

Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB:    version 6.2
Date:       Mon Jun 10 10:42:51 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    1000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    114.4 MB (estimated)
FileSize:   114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: facebook#5421

Differential Revision: D15704258

Pulled By: HaoyuHuang

fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants