-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Two-level Indexes #1814
Two-level Indexes #1814
Conversation
Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for delay. I haven't finish reading the reader part, but some comments for the rest.
include/rocksdb/table.h
Outdated
@@ -86,6 +86,9 @@ struct BlockBasedTableOptions { | |||
// The hash index, if enabled, will do the hash lookup when | |||
// `Options.prefix_extractor` is provided. | |||
kHashSearch, | |||
|
|||
// A two-level index implementation. Both levels are binary search indexes. | |||
kTwoLevelIndexSearch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the new value to block_base_table_index_type_string_map
in util/options_helper.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the feature consider complete (I think you plan to interleave secondary index and filter blocks)? If not, shall we not expose this enum, or have table builder return error for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will add that.
My thinking is to allow enable/disable each feature separately to measure their impact in the benchmarks. That extends to the interleaving phase. I have not thought it through about the switches of how to enable/disable them though and am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can simply add comment here saying the feature is experimental and not ready for use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Will do that.
table/block_based_table_builder.cc
Outdated
std::string key; | ||
std::unique_ptr<IndexBuilder> value; | ||
}; | ||
std::vector<Entry> entries; // list of partitioned indexes and their keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing underscore, e.g. s/entries/entries_/
same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks for catching this.
table/block_based_table_builder.cc
Outdated
std::string handle_encoding; | ||
last_partition_block_handle.EncodeTo(&handle_encoding); | ||
index_block_builder_.Add(last_entry.key, handle_encoding); | ||
entries.erase(entries.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would a deque or list be better than vector, if you want to remove from head?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Will update it.
include/rocksdb/table.h
Outdated
@@ -138,6 +141,9 @@ struct BlockBasedTableOptions { | |||
// Same as block_restart_interval but used for the index block. | |||
int index_block_restart_interval = 1; | |||
|
|||
// number of index keys per partition of indexes in a multi-level index | |||
uint64_t index_per_partition = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Shall we have a better default value?
- Have you consider partition index by size instead of by number of indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thinking. The implementation for using number of blocks per partition was straightforward and I figured it is good enough to enable benchmarking phase and when we got good result we can revise the implementation for better configuration or more optimized implementation. Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good.
if (!s.ok()) { | ||
return s; | ||
auto index_builder_status = r->index_builder->Finish(&index_blocks); | ||
if (index_builder_status.IsIncomplete()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving partitioned index block logic here would make the code easier to read?
if (status.IsInComplete()) {
while (status.IsInComplete()) {
// write partitioned index
}
} else {
if (!status.ok()) {
return status;
}
// write meta
// write index
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed with @maysamyabandeh, please ignore this comment as it will mess up block order.
table/block_based_table_reader.cc
Outdated
@@ -153,8 +153,9 @@ class BlockBasedTable::IndexReader { | |||
virtual ~IndexReader() {} | |||
|
|||
// Create an iterator for index access. | |||
// An iter is passed in, if it is not null, update this one and return it | |||
// If it is null, create a new Iterator | |||
// If a non-null iter is passed in it MIGHT be used instead of creating a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the logic here. When it will use the BlockIter pass in and when it will not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously it would always make use of the passed BlockIter object. Currently the existing classes that inherit from index reader still make use of it. However the new class that is added with this patch, index reader for partitioned indexes, does not. So we needed to update the contract and make the users to check the return value to verify whether the input object was used by NewIterator or it was ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks messy to have to check if result is on heap or stack. Can we update PartitionIndexReader
to use the input iter
?
table/block_based_table_reader.cc
Outdated
@@ -1380,6 +1439,28 @@ class BlockBasedTable::BlockEntryIteratorState : public TwoLevelIteratorState { | |||
bool skip_filters_; | |||
}; | |||
|
|||
BlockBasedTable::IndexPartitionIteratorState::IndexPartitionIteratorState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems IndexPartitionIteratorState
is same as BlockEntryIteratorState
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! Due to similarity of index and data blocks the impl of IndexPartitionIteratorState seems to have evolved to be identical to that of BlockEntryIteratorState. I will remove it then.
include/rocksdb/table.h
Outdated
@@ -86,6 +86,9 @@ struct BlockBasedTableOptions { | |||
// The hash index, if enabled, will do the hash lookup when | |||
// `Options.prefix_extractor` is provided. | |||
kHashSearch, | |||
|
|||
// A two-level index implementation. Both levels are binary search indexes. | |||
kTwoLevelIndexSearch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can simply add comment here saying the feature is experimental and not ready for use.
Please make sure we clearly document the format in the code comments (and in wiki pages after it is committed). |
@maysamyabandeh updated the pull request - view changes - changes since last import |
* The format on the disk would be I I I I I I IP where I is block containing a | ||
* partition of indexes built using ShortenedIndexBuilder and IP is a block | ||
* containing a secondary index on the partitions, built using | ||
* ShortenedIndexBuilder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying I tried to document the format here. Is there any other place in the source that needs to get updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format after commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the link. I updated the wiki.
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good to me in general.
* The format on the disk would be I I I I I I IP where I is block containing a | ||
* partition of indexes built using ShortenedIndexBuilder and IP is a block | ||
* containing a secondary index on the partitions, built using | ||
* ShortenedIndexBuilder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format after commit?
table/block_based_table_reader.h
Outdated
@@ -149,6 +150,9 @@ class BlockBasedTable : public TableReader { | |||
// The key retrieved are internal keys. | |||
Status GetKVPairsFromDataBlocks(std::vector<KVPairBlock>* kv_pair_blocks); | |||
|
|||
//class IndexPartitionIteratorState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused. Remove?
table/block_based_table_reader.cc
Outdated
@@ -153,8 +153,9 @@ class BlockBasedTable::IndexReader { | |||
virtual ~IndexReader() {} | |||
|
|||
// Create an iterator for index access. | |||
// An iter is passed in, if it is not null, update this one and return it | |||
// If it is null, create a new Iterator | |||
// If a non-null iter is passed in it MIGHT be used instead of creating a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks messy to have to check if result is on heap or stack. Can we update PartitionIndexReader
to use the input iter
?
btw. run |
@maysamyabandeh updated the pull request - view changes - changes since last import |
ed442eb
to
a071c44
Compare
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh updated the pull request - view changes - changes since last import |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
2 similar comments
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@IslamAbdelRahman has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Recently we started to see more Travis time-out. We added one extra test case here, which may make the tests run slightly longer, which may be a reason. Can you check whether the newly added test case runs much slower than others? |
Here are the differences between the slow and fast travis runs:
|
@maysamyabandeh I can work on moving some critical tests out of these suites so the remaining ones can run on MemEnv. How about that? |
Is this related to Multi-level index? |
Partition Index blocks and use a Partition-index as a 2nd level index.
The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition
t15539501