Two-level Indexes #1814

maysamyabandeh · 2017-01-27T03:25:55Z

Partition the index blocks and use a partition index as a second-level index.

The two-level index can be enabled by selecting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition.
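
As a rough illustration of the idea, and not the code in this patch, a two-level index can be modeled as a sorted flat index cut into fixed-size partitions plus a small second-level index holding the last key of each partition. The types and function names below (IndexEntry, TwoLevelIndex, BuildTwoLevelIndex, FindBlock) are hypothetical, and block_id stands in for a real block handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: these types are illustrative, not RocksDB classes.
struct IndexEntry {
  std::string last_key;  // last key covered by a data block
  std::size_t block_id;  // stand-in for a block handle
};

struct TwoLevelIndex {
  std::vector<std::vector<IndexEntry>> partitions;  // first-level partitions
  std::vector<std::string> top_level;  // last key of each partition
};

// Splits a sorted flat index into partitions of at most per_partition entries.
TwoLevelIndex BuildTwoLevelIndex(const std::vector<IndexEntry>& flat,
                                 std::size_t per_partition) {
  assert(per_partition > 0);
  TwoLevelIndex index;
  for (std::size_t i = 0; i < flat.size(); i += per_partition) {
    const std::size_t end = std::min(flat.size(), i + per_partition);
    auto first = flat.begin() + static_cast<std::ptrdiff_t>(i);
    auto last = flat.begin() + static_cast<std::ptrdiff_t>(end);
    index.partitions.emplace_back(first, last);
    index.top_level.push_back(flat[end - 1].last_key);
  }
  return index;
}

// Returns the id of the data block that may contain key, or -1 if key sorts
// after every indexed key. Both levels are searched with binary search.
long FindBlock(const TwoLevelIndex& index, const std::string& key) {
  auto top =
      std::lower_bound(index.top_level.begin(), index.top_level.end(), key);
  if (top == index.top_level.end()) return -1;
  const auto& part =
      index.partitions[static_cast<std::size_t>(top - index.top_level.begin())];
  // The chosen partition's last key is >= key, so this search cannot hit end().
  auto it = std::lower_bound(
      part.begin(), part.end(), key,
      [](const IndexEntry& e, const std::string& k) { return e.last_key < k; });
  return static_cast<long>(it->block_id);
}
```

With per_partition set to 1, every partition holds a single entry and the second level is as large as the flat index, so larger values are presumably needed to see any benefit.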

t15539501

facebook-github-bot · 2017-01-27T03:26:21Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-01-27T17:33:58Z

@maysamyabandeh updated the pull request - view changes - changes since last import

facebook-github-bot · 2017-01-27T17:37:37Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

yiwu-arbug

Sorry for the delay. I haven't finished reading the reader part, but here are some comments on the rest.

yiwu-arbug · 2017-01-30T18:48:56Z

include/rocksdb/table.h

@@ -86,6 +86,9 @@ struct BlockBasedTableOptions {
    // The hash index, if enabled, will do the hash lookup when
    // `Options.prefix_extractor` is provided.
    kHashSearch,
+
+    // A two-level index implementation. Both levels are binary search indexes.
+    kTwoLevelIndexSearch,


Add the new value to block_base_table_index_type_string_map in util/options_helper.h.

Also, is the feature considered complete (I think you plan to interleave secondary index and filter blocks)? If not, shall we not expose this enum, or have the table builder return an error for it?

Sure will add that.

My thinking is to allow each feature to be enabled or disabled separately so we can measure its impact in the benchmarks. That extends to the interleaving phase. I have not yet thought through the switches for enabling and disabling them, though, and am open to suggestions.

Or we can simply add a comment here saying the feature is experimental and not ready for use.

Good idea! Will do that.

yiwu-arbug · 2017-01-30T19:30:15Z

table/block_based_table_builder.cc

+    std::string key;
+    std::unique_ptr<IndexBuilder> value;
+  };
+  std::vector<Entry> entries;  // list of partitioned indexes and their keys


nit: missing underscore, e.g. s/entries/entries_/
same below.

Oh, thanks for catching this.

yiwu-arbug · 2017-01-30T20:07:58Z

table/block_based_table_builder.cc

+      std::string handle_encoding;
+      last_partition_block_handle.EncodeTo(&handle_encoding);
+      index_block_builder_.Add(last_entry.key, handle_encoding);
+      entries.erase(entries.begin());


Would a deque or list be better than a vector if you want to remove from the head?

True. Will update it.
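
To make the suggestion concrete, here is a small sketch of a pending-partition queue backed by std::deque. PendingPartitions and its methods are hypothetical names chosen for illustration, and the entries are reduced to plain strings; the member is named entries_ with a trailing underscore, following the nit above.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for the builder's list of pending partitions.
struct PendingPartitions {
  std::deque<std::string> entries_;

  void Push(std::string key) { entries_.push_back(std::move(key)); }

  // Removes and returns the oldest entry. pop_front() on a deque takes
  // constant time, whereas vector::erase(begin()) shifts every remaining
  // element toward the front.
  std::string PopFront() {
    assert(!entries_.empty());
    std::string key = std::move(entries_.front());
    entries_.pop_front();
    return key;
  }
};
```

std::list would give the same constant-time removal, at the cost of one allocation per element.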

yiwu-arbug · 2017-01-30T20:20:35Z

include/rocksdb/table.h

@@ -138,6 +141,9 @@ struct BlockBasedTableOptions {
  // Same as block_restart_interval but used for the index block.
  int index_block_restart_interval = 1;

+  // number of index keys per partition of indexes in a multi-level index
+  uint64_t index_per_partition = 1;


Shall we have a better default value?

Have you considered partitioning the index by size instead of by number of index entries?

Good thinking. Implementing a fixed number of blocks per partition was straightforward, and I figured it is good enough for the benchmarking phase. Once we get good results, we can revisit it for better configurability or a more optimized implementation. Would that work?

sounds good.
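
For comparison, partitioning by size rather than by entry count might look roughly like the following. This is only a sketch of the alternative raised in the review: PartitionBySize is a hypothetical helper, and it counts key bytes alone, ignoring handle encoding and block overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Groups index keys into partitions, starting a new partition once the
// current one has accumulated at least max_bytes of key data.
// Hypothetical helper for illustration, not RocksDB code.
std::vector<std::vector<std::string>> PartitionBySize(
    const std::vector<std::string>& keys, std::size_t max_bytes) {
  assert(max_bytes > 0);
  std::vector<std::vector<std::string>> partitions;
  std::size_t current_bytes = 0;
  for (const std::string& key : keys) {
    if (partitions.empty() || current_bytes >= max_bytes) {
      partitions.emplace_back();  // open a fresh partition
      current_bytes = 0;
    }
    partitions.back().push_back(key);
    current_bytes += key.size();
  }
  return partitions;
}
```

A size threshold keeps partitions roughly uniform in bytes even when key lengths vary, which a fixed entry count does not.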

yiwu-arbug · 2017-01-30T22:23:59Z

table/block_based_table_builder.cc

-  if (!s.ok()) {
-    return s;
+  auto index_builder_status = r->index_builder->Finish(&index_blocks);
+  if (index_builder_status.IsIncomplete()) {


I think moving partitioned index block logic here would make the code easier to read?

    if (status.IsIncomplete()) {
      while (status.IsIncomplete()) {
        // write partitioned index
      }
    } else {
      if (!status.ok()) {
        return status;
      }
      // write meta
      // write index
    }

Discussed with @maysamyabandeh; please ignore this comment, as the change would mess up the block order.
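
For readers unfamiliar with the pattern being discussed, the following toy model shows a builder whose Finish call reports "incomplete" once per partition and "ok" when it hands back the second-level index. Everything here is a simplified assumption: Status is reduced to an enum, BlockHandle to an offset and size, the handle encoding to a readable string, and the file to an in-memory std::string. It is not the code in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Toy stand-ins; the real Status and IndexBuilder classes are richer.
enum class Status { kOk, kIncomplete };

struct BlockHandle {
  std::size_t offset;
  std::size_t size;
};

class PartitionedIndexBuilder {
 public:
  void AddPartition(std::string last_key, std::string contents) {
    pending_.push_back({std::move(last_key), std::move(contents)});
  }

  // Each call emits one block into *out. prev is the handle of the block the
  // caller wrote after the previous call (ignored on the first call).
  // Returns kIncomplete while partitions remain, then kOk together with the
  // second-level index as the final block.
  Status Finish(std::string* out, const BlockHandle& prev) {
    if (!first_call_) {
      top_level_ += last_key_ + "@" + std::to_string(prev.offset) + "+" +
                    std::to_string(prev.size) + ";";
    }
    first_call_ = false;
    if (pending_.empty()) {
      *out = top_level_;
      return Status::kOk;
    }
    last_key_ = std::move(pending_.front().key);
    *out = std::move(pending_.front().contents);
    pending_.pop_front();
    return Status::kIncomplete;
  }

 private:
  struct Entry {
    std::string key;
    std::string contents;
  };
  std::deque<Entry> pending_;
  std::string last_key_;
  std::string top_level_;
  bool first_call_ = true;
};

// Caller loop: write a partition for every kIncomplete, then the final block.
std::string WriteIndex(PartitionedIndexBuilder* builder) {
  assert(builder != nullptr);
  std::string file;
  std::string block;
  BlockHandle handle{0, 0};
  Status s = builder->Finish(&block, handle);
  while (s == Status::kIncomplete) {
    handle = {file.size(), block.size()};
    file += block;
    s = builder->Finish(&block, handle);
  }
  file += block;  // the second-level index is written last
  return file;
}
```

In this model the partitions land in the file before the second-level index, because each partition's handle must be known before the top-level entry pointing at it can be written.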

yiwu-arbug · 2017-01-31T00:22:20Z

table/block_based_table_reader.cc

@@ -153,8 +153,9 @@ class BlockBasedTable::IndexReader {
  virtual ~IndexReader() {}

  // Create an iterator for index access.
-  // An iter is passed in, if it is not null, update this one and return it
-  // If it is null, create a new Iterator
+  // If a non-null iter is passed in it MIGHT be used instead of creating a new


I'm not familiar with the logic here. When will it use the BlockIter passed in, and when will it not?

Previously it would always use the passed-in BlockIter object, and the existing classes that inherit from the index reader still do. However, the new class added in this patch, the index reader for partitioned indexes, does not. So we needed to update the contract and require callers to check the return value to determine whether NewIterator used the input object or ignored it.

It looks messy to have to check whether the result is on the heap or the stack. Can we update PartitionIndexReader to use the input iter?
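
The contract being debated can be sketched as follows. This is a toy model under stated assumptions: Iter, FlatIndexReader, and ReadSource are hypothetical names, the iterator is reduced to a single field, and only the ownership check mirrors the discussion above.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for an index iterator.
struct Iter {
  int source = 0;  // records which reader produced this iterator
};

struct IndexReader {
  virtual ~IndexReader() {}
  // If iter is non-null it MIGHT be reused. The caller compares the return
  // value with iter to learn whether a new heap object was created.
  virtual Iter* NewIterator(Iter* iter) = 0;
};

// Reuses the caller's iterator when one is supplied.
struct FlatIndexReader : IndexReader {
  Iter* NewIterator(Iter* iter) override {
    if (iter == nullptr) iter = new Iter();
    iter->source = 1;
    return iter;
  }
};

// Ignores the caller's iterator and always allocates a new one.
struct PartitionIndexReader : IndexReader {
  Iter* NewIterator(Iter* /*iter*/) override {
    Iter* fresh = new Iter();
    fresh->source = 2;
    return fresh;
  }
};

// Caller side: take ownership only when the result is not the stack object.
int ReadSource(IndexReader* reader) {
  Iter on_stack;
  Iter* result = reader->NewIterator(&on_stack);
  assert(result != nullptr);
  std::unique_ptr<Iter> owned;
  if (result != &on_stack) owned.reset(result);
  return result->source;
}
```

The pointer comparison in ReadSource is the extra step the reviewer finds messy; a reader that always filled the caller's iterator would make it unnecessary.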

yiwu-arbug · 2017-01-31T00:31:35Z

table/block_based_table_reader.cc

@@ -1380,6 +1439,28 @@ class BlockBasedTable::BlockEntryIteratorState : public TwoLevelIteratorState {
  bool skip_filters_;
 };

+BlockBasedTable::IndexPartitionIteratorState::IndexPartitionIteratorState(


seems IndexPartitionIteratorState is same as BlockEntryIteratorState?

Great catch! Due to similarity of index and data blocks the impl of IndexPartitionIteratorState seems to have evolved to be identical to that of BlockEntryIteratorState. I will remove it then.

siying · 2017-01-31T19:24:27Z

Please make sure we clearly document the format in the code comments (and in wiki pages after it is committed).

facebook-github-bot · 2017-02-01T18:38:48Z

@maysamyabandeh updated the pull request - view changes - changes since last import

maysamyabandeh · 2017-02-01T18:39:48Z

table/block_based_table_builder.cc

+ * The format on the disk would be I I I I I I IP where I is block containing a
+ * partition of indexes built using ShortenedIndexBuilder and IP is a block
+ * containing a secondary index on the partitions, built using
+ * ShortenedIndexBuilder.


@siying I tried to document the format here. Is there any other place in the source that needs to get updated?

update https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format after commit?

Thanks for the link. I updated the wiki.

facebook-github-bot · 2017-02-01T18:40:54Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-01T22:43:41Z

@maysamyabandeh updated the pull request - view changes - changes since last import

facebook-github-bot · 2017-02-01T22:44:14Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

yiwu-arbug

Looks good to me in general.

yiwu-arbug · 2017-02-06T19:12:27Z

table/block_based_table_reader.h

@@ -149,6 +150,9 @@ class BlockBasedTable : public TableReader {
  // The key retrieved are internal keys.
  Status GetKVPairsFromDataBlocks(std::vector<KVPairBlock>* kv_pair_blocks);

+  //class IndexPartitionIteratorState;


Unused. Remove?

yiwu-arbug · 2017-02-06T20:15:18Z

Btw, run make format before committing?

facebook-github-bot · 2017-02-06T20:29:39Z

@maysamyabandeh updated the pull request - view changes - changes since last import

facebook-github-bot · 2017-02-06T20:34:29Z

@maysamyabandeh updated the pull request - view changes - changes since last import

facebook-github-bot · 2017-02-06T20:36:02Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-06T21:54:57Z

@maysamyabandeh updated the pull request - view changes - changes since last import

facebook-github-bot · 2017-02-06T21:55:30Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-07T00:16:00Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-07T00:24:13Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-07T00:24:51Z

@IslamAbdelRahman has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2017-02-07T00:28:20Z

@maysamyabandeh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

siying · 2017-02-08T19:56:32Z

Recently we started to see more Travis timeouts. We added one extra test case here, which may make the tests run slightly longer and could be one reason. Can you check whether the newly added test case runs much slower than the others?

maysamyabandeh · 2017-02-08T21:29:54Z

Here are the differences between the slow and fast Travis runs:

$ cat log-fast.txt | sort -nt'(' -k2 -r | grep  OK | head
[       OK ] ExternalSSTFileTest.CompactDuringAddFileRandom (74505 ms)
[       OK ] ExternalSSTFileTest.OverlappingRanges (60118 ms)
[       OK ] DBWALTest.RecoverFromCorruptedWALWithoutFlush (52164 ms)
[       OK ] ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized (38311 ms)
[       OK ] DBIteratorTest.PinnedDataIteratorRandomized (29303 ms)
[       OK ] DBTestCompactionFilter.CompactionFilterWithValueChange (20010 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/0 (19314 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/1 (17737 ms)
[       OK ] DBWALTest.kPointInTimeRecovery (14958 ms)
[       OK ] ManualCompactionTest.Test (13979 ms)

$ cat log-slow.txt | sort -nt'(' -k2 -r | grep OK | head
[       OK ] ExternalSSTFileTest.CompactDuringAddFileRandom (237878 ms)
[       OK ] ExternalSSTFileTest.OverlappingRanges (169291 ms)
[       OK ] ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized (164520 ms)
[       OK ] DBWALTest.RecoverFromCorruptedWALWithoutFlush (74575 ms)
[       OK ] DBIteratorTest.PinnedDataIteratorRandomized (61493 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/0 (39987 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/1 (32871 ms)
[       OK ] DBWALTest.RollLog (27616 ms)
[       OK ] DBTestCompactionFilter.CompactionFilterWithValueChange (25138 ms)
[       OK ] DBCompactionTest.DeleteFileRange (23110 ms)

siying · 2017-02-24T17:28:11Z

@maysamyabandeh I can work on moving some critical tests out of these suites so the remaining ones can run on MemEnv. How about that?

maysamyabandeh · 2017-02-24T17:35:01Z

Is this related to Multi-level index?

Two-level Indexes

fde11c4

maysamyabandeh requested a review from yiwu-arbug January 27, 2017 03:25

facebook-github-bot added the CLA Signed label Jan 27, 2017

Fix java build failure

ddef10c

yiwu-arbug reviewed Jan 30, 2017

yiwu-arbug reviewed Jan 31, 2017

apply comments

4b9cb56

maysamyabandeh commented Feb 1, 2017

fix junit failure

a1f61c9

yiwu-arbug reviewed Feb 6, 2017

yiwu-arbug approved these changes Feb 6, 2017

improve comments

a071c44

maysamyabandeh force-pushed the partition-index branch from ed442eb to a071c44 Compare February 6, 2017 20:34

add username to TODO

8332bd0

facebook-github-bot closed this in 69d5262 Feb 7, 2017

maysamyabandeh mentioned this pull request Feb 24, 2017

Fix some bugs in MockEnv #1914

Closed
