DataBlockHashIndex: Standalone Implementation with Unit Test #4139

fgwu · 2018-07-16T18:14:00Z

Summary:
This is the first step of the DataBlockHashIndex implementation. A string-based hash table is implemented and unit-tested.

DataBlockHashIndexBuilder: Add() takes pairs of <key, restart_index> and formats them into a string when Finish() is called.
DataBlockHashIndex: initialized from the formatted string, which it interprets as a hash table. Key lookup is supported through an iterator.

Test Plan:
Unit test: data_block_hash_index_test
make check -j 32
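To make the builder/reader split described above concrete, here is a minimal sketch. It is not the code in this PR: the names `SketchHash`, `SketchBuilder`, and `SketchLookup` are hypothetical, the hash is an FNV-1a stand-in rather than rocksdb::Hash(), and the serialized layout (bucket payloads, then per-bucket offsets, then the bucket count) is an assumption chosen for illustration. The lookup returns a vector for brevity, whereas the PR exposes an iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in seeded hash (FNV-1a variant). NOT rocksdb::Hash().
inline uint32_t SketchHash(const std::string& key, uint32_t seed) {
  uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : key) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

const uint32_t kSeed = 2018;    // bucket seed, value taken from the PR
const uint32_t kSeedTag = 214;  // tag seed, value taken from the PR

// Little-endian 16-bit helpers (byte order is an assumption of this sketch).
inline void AppendFixed16(std::string* dst, uint16_t v) {
  dst->push_back(static_cast<char>(v & 0xff));
  dst->push_back(static_cast<char>(v >> 8));
}

inline uint16_t ReadFixed16(const std::string& s, size_t pos) {
  return static_cast<uint16_t>(
      static_cast<unsigned char>(s[pos]) |
      (static_cast<unsigned char>(s[pos + 1]) << 8));
}

class SketchBuilder {
 public:
  explicit SketchBuilder(uint16_t num_buckets) : buckets_(num_buckets) {
    assert(num_buckets > 0);
  }

  // Stores a [TAG RESTART_INDEX] pair in the bucket selected by the first hash.
  void Add(const std::string& key, uint16_t restart_index) {
    uint16_t tag = static_cast<uint16_t>(SketchHash(key, kSeedTag));
    auto& bucket = buckets_[SketchHash(key, kSeed) % buckets_.size()];
    bucket.push_back(tag);
    bucket.push_back(restart_index);
  }

  // Assumed layout: [bucket payloads][num_buckets + 1 offsets][num_buckets].
  std::string Finish() const {
    std::string out;
    std::vector<uint16_t> offsets;
    for (const auto& bucket : buckets_) {
      offsets.push_back(static_cast<uint16_t>(out.size()));
      for (uint16_t v : bucket) AppendFixed16(&out, v);
    }
    assert(out.size() <= 0xffff);
    offsets.push_back(static_cast<uint16_t>(out.size()));  // end sentinel
    for (uint16_t off : offsets) AppendFixed16(&out, off);
    AppendFixed16(&out, static_cast<uint16_t>(buckets_.size()));
    return out;
  }

 private:
  std::vector<std::vector<uint16_t>> buckets_;
};

// Returns the restart indices in the key's bucket whose tag matches.
// These are candidates only: the caller must still compare the full key.
inline std::vector<uint16_t> SketchLookup(const std::string& index,
                                          const std::string& key) {
  assert(index.size() >= 2);
  uint16_t num_buckets = ReadFixed16(index, index.size() - 2);
  size_t offsets_pos = index.size() - 2 - 2 * (num_buckets + 1u);
  size_t bucket = SketchHash(key, kSeed) % num_buckets;
  uint16_t begin = ReadFixed16(index, offsets_pos + 2 * bucket);
  uint16_t end = ReadFixed16(index, offsets_pos + 2 * (bucket + 1));
  uint16_t tag = static_cast<uint16_t>(SketchHash(key, kSeedTag));
  std::vector<uint16_t> hits;
  for (size_t pos = begin; pos < end; pos += 4) {
    if (ReadFixed16(index, pos) == tag) {
      hits.push_back(ReadFixed16(index, pos + 2));
    }
  }
  return hits;
}
```

Usage would look like building with `SketchBuilder b(4); b.Add("apple", 3);`, calling `b.Finish()` to get the string, and passing that string to `SketchLookup`. Because two different keys can share both bucket and tag, a hit only narrows the search to a restart interval.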

facebook-github-bot

@fgwu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

sagar0 · 2018-07-16T23:01:29Z

util/coding.h

@@ -32,6 +32,7 @@ namespace rocksdb {
 const unsigned int kMaxVarint64Length = 10;

 // Standard Put... routines append to a string
+extern void PutFixed16(std::string* dst, uint16_t value);


The changes here in coding.h might be a good candidate to split into a separate PR, with their own unit test in coding_test.cc, instead of clubbing them with DataBlockHashIndex discussion and implementation.
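For readers unfamiliar with the fixed-width helpers, the following is a rough sketch of what a 16-bit put/decode pair could look like. The names `SketchPutFixed16` and `SketchDecodeFixed16` are hypothetical stand-ins, not the functions added in this diff, and the little-endian byte order is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends a 16-bit value to dst, least-significant byte first (assumed order).
inline void SketchPutFixed16(std::string* dst, uint16_t value) {
  assert(dst != nullptr);
  char buf[2];
  buf[0] = static_cast<char>(value & 0xff);
  buf[1] = static_cast<char>((value >> 8) & 0xff);
  dst->append(buf, sizeof(buf));
}

// Reads back a 16-bit value written by SketchPutFixed16.
inline uint16_t SketchDecodeFixed16(const char* ptr) {
  return static_cast<uint16_t>(
      static_cast<uint16_t>(static_cast<unsigned char>(ptr[0])) |
      (static_cast<uint16_t>(static_cast<unsigned char>(ptr[1])) << 8));
}
```

A round-trip check of this kind (put a value, decode it, compare) is the sort of test that could live in coding_test.cc if the helpers were split into their own PR.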

sagar0 · 2018-07-16T23:57:53Z

CMakeLists.txt

@@ -921,6 +922,7 @@ if(WITH_TESTS)
        table/cleanable_test.cc
        table/cuckoo_table_builder_test.cc
        table/cuckoo_table_reader_test.cc
+        table/data_block_hash_index.cc


This should be data_block_hash_index_test.cc ... and that's why the windows build failed.

Summary: The first step of the DataBlockHashIndex implementation. A string based hash table is implemented and unit-tested.
DataBlockHashIndexBuilder: Add() takes pairs of <key, restart_index>, and formats it into a string when Finish() is called.
DataBlockHashIndex: initialized by the formatted string, and can interpret it as a hash table. Supporting Seek().
Test Plan: Unit test: data_block_hash_index_test; make check -j 32
Reviewers: Sagar Vemuri

facebook-github-bot · 2018-07-17T07:31:20Z

@fgwu has updated the pull request. Re-import the pull request

facebook-github-bot · 2018-07-17T07:35:09Z

@fgwu has updated the pull request. Re-import the pull request

sagar0 · 2018-07-18T22:43:52Z

table/data_block_hash_index.h

+
+namespace rocksdb {
+
+// The format of the datablock hash map is as follows:


Let's mention here that this is an experimental feature that is being added to reduce point-lookup cost.

Let's also mention that this index is per block and lives within the data block, so that people do not confuse it with the per-table index blocks.

sagar0 · 2018-07-18T22:54:14Z

table/data_block_hash_index.h

+//
+// Each bucket B has the following structure:
+// [TAG RESTART_INDEX][TAG RESTART_INDEX]...[TAG RESTART_INDEX]
+// where TAG is the hash value of the second hash function.


Let's explain here how to locate a key/value by giving an example.
Can you also provide more information here about why you decided to use a second hash function? Please also add a reference to the paper from which this double-hashing idea is taken, so that the original authors get the credit.

sagar0 · 2018-07-18T22:57:44Z

table/data_block_hash_index.cc

+const uint32_t kSeed_tag = 214; /* second hash seed */
+
+inline uint16_t HashToBucket(const Slice& s, uint16_t num_buckets) {
+  return (uint16_t)rocksdb::Hash(s.data(), s.size(), kSeed) % num_buckets;


static_cast<uint16_t>
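Beyond style, spelling the cast out makes the evaluation order visible. In the diff above, the C-style cast binds tighter than `%`, so the hash is truncated to 16 bits before the modulo is taken. The sketch below contrasts the two possible orders; the function names are hypothetical and it is not known from this thread which order the final code uses.

```cpp
#include <cassert>
#include <cstdint>

// Truncates the hash to 16 bits first, then reduces modulo num_buckets.
// This matches the expression in the diff, where the cast applies before %.
inline uint16_t BucketTruncateThenMod(uint32_t hash, uint16_t num_buckets) {
  assert(num_buckets > 0);
  return static_cast<uint16_t>(static_cast<uint16_t>(hash) % num_buckets);
}

// Reduces the full 32-bit hash modulo num_buckets, then narrows the result.
inline uint16_t BucketModThenTruncate(uint32_t hash, uint16_t num_buckets) {
  assert(num_buckets > 0);
  return static_cast<uint16_t>(hash % num_buckets);
}
```

For a hash of 0x12345 and 7 buckets, the first form yields bucket 6 and the second yields bucket 1. Either order works as long as the builder and the reader agree on it.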

sagar0 · 2018-07-18T22:59:15Z

table/data_block_hash_index.h

+ private:
+  const char *data_;
+  uint16_t size_;
+  uint16_t num_buckets_;


Let's add a few comments here explaining how you came to the conclusion that 2 bytes are more than enough.

sagar0 · 2018-07-18T23:00:36Z

table/data_block_hash_index.cc

+namespace rocksdb {
+
+const uint32_t kSeed = 2018;
+const uint32_t kSeed_tag = 214; /* second hash seed */


How did you come up with these seeds? Should we use some prime numbers?

The Hash() I used (util/hash.cc) is not prime-based; instead, it uses bitwise XOR and shift operations, together with multiplication by a constant, to calculate the hash value.

rocksdb/util/hash.cc

Lines 17 to 31 in 1857576

uint32_t Hash(const char* data, size_t n, uint32_t seed) {
  // Similar to murmur hash
  const uint32_t m = 0xc6a4a793;
  const uint32_t r = 24;
  const char* limit = data + n;
  uint32_t h = static_cast<uint32_t>(seed ^ (n * m));

  // Pick up four bytes at a time
  while (data + 4 <= limit) {
    uint32_t w = DecodeFixed32(data);
    data += 4;
    h += w;
    h *= m;
    h ^= (h >> 16);
  }

It does not have to be prime. Examples:

rocksdb/cache/sharded_cache.h

Line 85 in 1857576

return Hash(s.data(), s.size(), 0);

rocksdb/util/hash.h

Line 27 in 1857576

return Hash(s.data(), s.size(), 397);

sagar0 · 2018-07-18T23:03:20Z

table/data_block_hash_index.cc

+  /* push a TAG to avoid false positive */
+  /* the TAG is the hash function value of another seed */
+  uint16_t tag = static_cast<uint16_t>(
+      rocksdb::Hash(key.data(), key.size(), kSeed_tag));


rocksdb::Hash() returns uint32_t ... so what effect will the loss of precision have on our algorithm?

Casting to uint16_t (which effectively chops off the upper 16 bits of the hash) reduces the space overhead and makes each bucket more compact, and therefore more CPU-cache-line friendly. The downside is that there will be more hash collisions. The current experiments show that reducing the TAG length improves throughput, so I assume the benefit outweighs the downside.
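The trade-off can be put into rough numbers. The sketch below assumes a 2-byte restart index, uniformly distributed hash values, and hypothetical helper names; it is back-of-the-envelope arithmetic, not a measurement from the experiments mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Keeps only the low 16 bits of a 32-bit hash, as the cast in the diff does.
inline uint16_t TruncateTag(uint32_t hash) {
  return static_cast<uint16_t>(hash);
}

// Bytes per [TAG RESTART_INDEX] entry, assuming a 2-byte restart index.
inline unsigned EntryBytes(unsigned tag_bytes) { return tag_bytes + 2; }

// Chance that two different keys share a tag, assuming uniform hash values.
inline double TagCollisionProbability(unsigned tag_bits) {
  assert(tag_bits < 64);
  return 1.0 / static_cast<double>(1ull << tag_bits);
}
```

Under these assumptions a 16-bit tag gives 4-byte entries and a per-comparison false-match chance of 1 in 65536, while a 32-bit tag gives 6-byte entries. If a cache line is 64 bytes (also an assumption), that is 16 entries per line instead of 10.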

facebook-github-bot · 2018-07-19T00:19:58Z

@fgwu has updated the pull request. Re-import the pull request

facebook-github-bot · 2018-07-19T00:29:08Z

@fgwu has updated the pull request. Re-import the pull request

sagar0 · 2018-07-19T01:30:21Z

Restarted your failed Travis build. Hopefully these annoying failures will be fixed by #4154 and #4151.

facebook-github-bot · 2018-07-20T22:40:37Z

@fgwu has updated the pull request. Re-import the pull request

sagar0 · 2018-07-20T22:51:50Z

table/data_block_hash_index.h

+
+ private:
+  uint16_t num_buckets_;
+  std::vector<std::vector<uint16_t>> buckets_;


It's fine for now, but let's convert the inner std::vector<uint16_t> to something like struct Bucket { uint16_t tag; uint16_t restart_index; } later, so that it is cleaner and immediately obvious what the fields are. That way you can also avoid writing 2 * sizeof(uint16_t) in various places and use sizeof(Bucket) instead.
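A minimal sketch of that suggestion follows. The struct is named `BucketEntry` here only to distinguish one pair from the whole bucket (the review comment calls it `Bucket`), and the helper functions are hypothetical. The static_assert documents the assumption that the compiler adds no padding between two uint16_t fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One [TAG RESTART_INDEX] pair with named fields.
struct BucketEntry {
  uint16_t tag;
  uint16_t restart_index;
};

// Assumption of this sketch: no padding, so the struct size equals the
// 2 * sizeof(uint16_t) that the original code spells out by hand.
static_assert(sizeof(BucketEntry) == 2 * sizeof(uint16_t),
              "BucketEntry is expected to be exactly two uint16_t wide");

// Serialized size of a bucket, expressed via sizeof(BucketEntry).
inline size_t SerializedBucketBytes(const std::vector<BucketEntry>& bucket) {
  return bucket.size() * sizeof(BucketEntry);
}

// Bounds-checked access to one entry of a bucket.
inline const BucketEntry& EntryAt(const std::vector<BucketEntry>& bucket,
                                  size_t i) {
  assert(i < bucket.size());
  return bucket[i];
}
```

With this shape, `buckets_` would become a `std::vector<std::vector<BucketEntry>>`, and call sites read `entry.tag` and `entry.restart_index` instead of relying on even and odd positions.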

facebook-github-bot

@fgwu is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2018-07-23T21:14:22Z

@fgwu has updated the pull request. Re-import the pull request

facebook-github-bot

@fgwu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2018-07-23T23:08:12Z

@fgwu has updated the pull request. Re-import the pull request

facebook-github-bot

@fgwu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

sagar0

Thanks for fixing the "land" issues.

…k#4139)

Summary: The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.
`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: facebook#4139
Reviewed By: sagar0
Differential Revision: D8866764
Pulled By: fgwu
fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10

facebook-github-bot added the CLA Signed label Jul 16, 2018

facebook-github-bot reviewed Jul 16, 2018

sagar0 reviewed Jul 16, 2018

fgwu added 6 commits July 17, 2018 00:27

DataBlockHashIndex: Replace the inefficient Seek() with NewIterator

1367cc6

Summary: The Seek() in the initial implementation is inefficient in that it needs vector creation and emplace operations, which can be eliminated by an iterator-based implementation.

DataBlockHashIndex: reduce index table size by changing uint32 to uint16

7db1aad

DataBlockHashIndex: Bug fix in GetFixed16(); Remove unused field star…

a6b7c42

…t_ from Iterator

DataBlockHashIndex: second hash tag uint16 conversion fixed

bf03c79

DataBlockHashIndex: Rename test name to DataBlockHashIndex

42c5ce6

fgwu force-pushed the fwu_data_block_hash_index branch from 3b44e3c to 42c5ce6 Compare July 17, 2018 07:31

DataBlockHashIndex: Fixed CMakeLists.txt bug that causing appveyor to…

8ba23a0

… fail

sagar0 reviewed Jul 18, 2018

DataBlockHashIndex: Update algorithm description in comments.

365a9df

DataBlockHashIndex: fixed the size check assertion for the data_block

378fd7f

DataBlockHashIndex: Update the algorithm description.

ba1dfff

sagar0 reviewed Jul 20, 2018

sagar0 approved these changes Jul 20, 2018

facebook-github-bot reviewed Jul 21, 2018

DataBlockHashIndex: Fix buck build bug by adding .cc to TARGETS

cebb844

facebook-github-bot reviewed Jul 23, 2018

DataBlockHashIndex: Fixed Lint warnings

5b256b0

facebook-github-bot reviewed Jul 23, 2018

sagar0 approved these changes Jul 23, 2018

facebook-github-bot closed this in 8805ec2 Jul 24, 2018

fgwu deleted the fwu_data_block_hash_index branch July 26, 2018 06:05

fgwu mentioned this pull request Aug 20, 2018

[RFC] Standalone InBlockHashIndex implementation with unit test #4043

Closed
