Ribbon: initial (general) algorithms and basic unit test #7491

pdillinger · 2020-10-01T22:28:51Z

This is intended as the first commit toward a near-optimal alternative to static Bloom filters for SSTs. Stephan Walzer and I have agreed upon the name "Ribbon" for a PHSF based on his linear system construction in "Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications" ("SGauss") and my much faster "on the fly" algorithm for Gaussian elimination (or for this linear system, "banding"), which can be faster than peeling while also being more compact and flexible. See util/ribbon_alg.h for a more detailed introduction and background.

RIBBON = Rapid Incremental Boolean Banding ON-the-fly

This commit just adds generic (templatized) core algorithms and a basic unit test showing some features, including the ability to construct structures within 2.5% space overhead vs. the information-theoretic lower bound. (Compare to a cache-local Bloom filter's ~50% space overhead -> ~30% reduction anticipated.) This commit does not include the storage scheme necessary to make queries fast, especially for filter queries, nor fractional "result bits", but there is some description already and those implementations will come soon. Nor does this commit add FilterPolicy support, for use in SST files, but that will also come soon.

pdillinger · 2020-10-01T22:39:38Z

Oops, forgot to make sure it compiles with TEST_UINT128_COMPAT

jay-zhuang · 2020-10-23T04:06:54Z

util/ribbon_alg.h

+    for (Index j = 0; j < kResultBits; ++j) {
+      // Compute next solution bit at row i, column j (see derivation below)
+      CoeffRow tmp = state[j] << 1;
+      int bit = BitParity(tmp & cr) ^ ((rr >> j) & 1);


Could it be bool bit?

I believe some people (not sure about compilers) consider it bad style to cast between integral types and bool. And Google style calls for 'int' for small integer values.

But it might generate the same code, with one less static_cast, to do

bool bit = (parity) != 0; tmp |= bit ? CoeffRow{1} : CoeffRow{0};

That new version is generally about the same, except generating one less instruction on gcc 4.9.x :) https://godbolt.org/z/9MP8zT
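To make the two styles discussed above concrete, here is a minimal sketch that puts the int and bool variants side by side and lets a caller check that they agree. The 64-bit CoeffRow, the 32-bit result row, the portable BitParity stand-in, and the names NextStateInt and NextStateBool are assumptions for illustration, not the code in util/ribbon_alg.h.

```cpp
#include <cassert>
#include <cstdint>

using CoeffRow = uint64_t;  // assumption: 64-bit coefficient rows

// Portable parity (XOR of all bits); a stand-in for the PR's BitParity.
inline int BitParity(CoeffRow v) {
  v ^= v >> 32;
  v ^= v >> 16;
  v ^= v >> 8;
  v ^= v >> 4;
  v ^= v >> 2;
  v ^= v >> 1;
  return static_cast<int>(v & 1);
}

// Variant A: keep the new bit as an int and cast when merging it into tmp.
inline CoeffRow NextStateInt(CoeffRow state, CoeffRow cr, uint32_t rr, int j) {
  CoeffRow tmp = state << 1;
  int bit = BitParity(tmp & cr) ^ static_cast<int>((rr >> j) & 1);
  tmp |= static_cast<CoeffRow>(bit);
  return tmp;
}

// Variant B: compare against zero to get a bool, then select 1 or 0.
inline CoeffRow NextStateBool(CoeffRow state, CoeffRow cr, uint32_t rr, int j) {
  CoeffRow tmp = state << 1;
  bool bit = (BitParity(tmp & cr) ^ static_cast<int>((rr >> j) & 1)) != 0;
  tmp |= bit ? CoeffRow{1} : CoeffRow{0};
  return tmp;
}
```

Both functions compute the same value for every input; any difference would only show up in the generated instructions, which depends on the compiler and version.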

jay-zhuang · 2020-10-23T17:36:56Z

util/ribbon_impl.h

+    num_starts_ = num_starts;
+  }
+
+  Index GetNumStarts() const { return num_starts_; }


Wondering why we store num_starts rather than num_slots? I feel num_slots is more straightforward. Or num_keys, if it also stores a kFactor.

I should add a comment for that. It's because num_starts is on the extreme critical path for queries: its value is needed (for FastRange) to know where to start probing in memory. We don't want any extra instructions for getting that value.
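As a rough illustration of why a precomputed num_starts keeps the query path short, the sketch below maps a hash to a start slot with a single multiply and shift. FastRange32, SolutionMeta, and the relation num_starts = num_slots - coeff_bits + 1 are assumptions made for this example and may not match ribbon_impl.h exactly.

```cpp
#include <cassert>
#include <cstdint>

// Map a 32-bit hash roughly uniformly onto [0, range) with a multiply and a
// shift instead of a modulo (the usual "fast range" reduction).
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>((static_cast<uint64_t>(hash) * range) >> 32);
}

// Hypothetical metadata holder: num_starts is derived once, at construction,
// so a query does not repeat the subtraction.
struct SolutionMeta {
  uint32_t num_starts;

  // Assumption: a row spanning coeff_bits slots may start anywhere it still
  // fits, giving num_slots - coeff_bits + 1 possible starts.
  static SolutionMeta FromSlots(uint32_t num_slots, uint32_t coeff_bits) {
    return SolutionMeta{num_slots - coeff_bits + 1};
  }

  uint32_t GetStart(uint32_t hash) const {
    return FastRange32(hash, num_starts);
  }
};
```

With 1000 slots and 128 coefficient bits this gives 873 possible starts, and every start leaves room for a full 128-slot row.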

jay-zhuang · 2020-10-23T17:53:23Z

util/ribbon_impl.h

+  using Key = Hash;
+  using Seed = typename RehasherTypesAndSettings::Seed;
+
+  static Hash HashFn(const Hash& input, Seed seed) {


Just a question: how is this different from XXH3p_64bits_withSeed?

rocksdb/util/xxh3p.h

Lines 1093 to 1099 in 3e74505

XXH3p_64bits_withSeed(const void* input, size_t len, XXH64_hash_t seed)
{
    if (len <= 16) return XXH3p_len_0to16_64b((const xxh_u8*)input, len, kSecret, seed);
    if (len <= 128) return XXH3p_len_17to128_64b((const xxh_u8*)input, len, kSecret, sizeof(kSecret), seed);
    if (len <= XXH3p_MIDSIZE_MAX) return XXH3p_len_129to240_64b((const xxh_u8*)input, len, kSecret, sizeof(kSecret), seed);
    return XXH3p_hashLong_64b_withSeed((const xxh_u8*)input, len, seed);
}

Could they be combined?

XXH3p_len_4to8_64b (used in XXH3p_64bits_withSeed) has some extra mixing that seems to be unnecessary in this application. (Trying to minimize unnecessary hash computation.)

xxh_u64 const mix64 = len + ((keyed ^ (keyed >> 51)) * PRIME32_1);
XXH3p_avalanche((mix64 ^ (mix64 >> 47)) * PRIME64_2)

I should probably say that StandardRehasher is not intended for use (at least untested) with hash-sized values that are not already uniformly distributed.
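For readers wondering what a cheap, seeded, 1-to-1 mixing step could look like, here is a hedged sketch: XOR with the seed followed by multiplication by an odd constant, which is invertible modulo 2^64. The names RehashSketch, UnRehashSketch, InverseOdd, and the constant kIllustrativeOddFactor are hypothetical and are not claimed to be what StandardRehasher uses. As noted above, a step this light only makes sense when the input is already uniformly distributed.

```cpp
#include <cassert>
#include <cstdint>

using Hash = uint64_t;
using Seed = uint32_t;

// Any odd 64-bit constant works for invertibility; this one is illustrative.
constexpr Hash kIllustrativeOddFactor = 0x9E3779B97F4A7C15ULL;

// Apply a seed to an existing hash: XOR, then multiply by an odd constant.
// Both steps are bijections on 64-bit values, so no two inputs collide.
inline Hash RehashSketch(Hash input, Seed seed) {
  return (input ^ seed) * kIllustrativeOddFactor;
}

// Multiplicative inverse of an odd x modulo 2^64 via Newton iteration.
// x * x == 1 (mod 8) for odd x, so the first guess is right to 3 bits and
// each step doubles the number of correct low bits: 3, 6, 12, 24, 48, 96.
inline Hash InverseOdd(Hash x) {
  Hash inv = x;
  for (int i = 0; i < 5; ++i) {
    inv *= 2 - x * inv;
  }
  return inv;
}

// Undo RehashSketch, demonstrating that the transformation is 1-to-1.
inline Hash UnRehashSketch(Hash output, Seed seed) {
  return (output * InverseOdd(kIllustrativeOddFactor)) ^ seed;
}
```

Because the mapping is a bijection for each seed, changing the seed reshuffles where keys land without ever merging two distinct hashes.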

jay-zhuang · 2020-10-23T18:14:04Z

util/ribbon_alg.h

+      // more rows. Thus, for valid solution, the dot product of the
+      // solution column with the coefficient row has to equal the result
+      // at that column,
+      //   BitParity(tmp & cr) == ((rr >> j) & 1)


Question: how do we make sure BitParity(tmp & cr) == ((rr >> j) & 1) holds, given that tmp could change for the next row?
For example, to simplify, assume kResultBits = 1 and r = 2. Then tmp has only 2 bits, and setting the first bit tmp[0] (where tmp[x] means bit x of tmp) ensures (tmp[0] & cr[0]) ^ (tmp[1] & cr[1]) == rr[0]. But tmp[1] is not assigned yet and will be generated later. How do we make sure we don't need to go back and modify tmp[0]?

Our initial back-substitution state is all zeros, though the initial state is unused because that would mean referencing an "out of bounds" slot, and unoccupied banding rows have cr as all zeros.

Each step through here we set new tmp[x+1] (bit x + 1) to old tmp[x] (bit x). Since we're moving backward through the rows, the second bit of tmp (aka tmp[1] or (tmp >> 1) & 1) will hold the value assigned to the "next" (but most recently assigned) row.

I see: the last r rows form an upper triangular matrix, which is used to set the initial state. I was confused by it being a band matrix.
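The exchange above can be traced on a small system. The following is a simplified sketch of on-the-fly banding followed by backward substitution, assuming 64-bit coefficient rows and 8 result bits. Banding, BackSubstitute, and Query are illustrative names, and the sketch leaves out the storage layout and hashing from the PR. Because rows are solved from the last slot to the first, every bit that a row's coefficients can reference at positions 1 and above has already been assigned, so earlier assignments never need revisiting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using CoeffRow = uint64_t;   // r = 64 coefficient bits per row (assumption)
using ResultRow = uint8_t;   // 8 result bits per row (assumption)
constexpr int kCoeffBits = 64;
constexpr int kResultBits = 8;

inline int Parity(CoeffRow v) {
  v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
  v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
  return static_cast<int>(v & 1);
}

struct Banding {
  std::vector<CoeffRow> coeff;    // 0 means the slot is unoccupied
  std::vector<ResultRow> result;
  explicit Banding(size_t num_slots)
      : coeff(num_slots, 0), result(num_slots, 0) {}

  // Incremental elimination. Requires bit 0 of cr to be set and
  // start + kCoeffBits <= number of slots.
  bool Add(size_t start, CoeffRow cr, ResultRow rr) {
    size_t i = start;
    for (;;) {
      if (coeff[i] == 0) {  // free slot: this row becomes banding row i
        coeff[i] = cr;
        result[i] = rr;
        return true;
      }
      cr ^= coeff[i];  // cancel the leading 1 against the stored row
      rr ^= result[i];
      if (cr == 0) return rr == 0;  // redundant if consistent, else failure
      while ((cr & 1) == 0) {       // realign on the next leading 1
        cr >>= 1;
        ++i;
      }
    }
  }
};

// Solve from the last slot to the first. For column j, bit k of state[j]
// holds the solution bit of row i + k once row i has been processed.
inline std::vector<ResultRow> BackSubstitute(const Banding& b) {
  const size_t n = b.coeff.size();
  std::vector<ResultRow> solution(n, 0);
  CoeffRow state[kResultBits] = {};  // all zeros: nothing past the end
  for (size_t i = n; i-- > 0;) {
    const CoeffRow cr = b.coeff[i];
    const ResultRow rr = b.result[i];
    ResultRow row = 0;
    for (int j = 0; j < kResultBits; ++j) {
      CoeffRow tmp = state[j] << 1;  // make room for row i at bit 0
      int bit = Parity(tmp & cr) ^ ((rr >> j) & 1);
      tmp |= static_cast<CoeffRow>(bit);
      state[j] = tmp;
      row |= static_cast<ResultRow>(bit << j);
    }
    solution[i] = row;
  }
  return solution;
}

// XOR together the solution rows selected by the key's coefficient bits.
inline ResultRow Query(const std::vector<ResultRow>& solution, size_t start,
                       CoeffRow cr) {
  ResultRow acc = 0;
  for (int k = 0; k < kCoeffBits; ++k) {
    if ((cr >> k) & 1) acc ^= solution[start + k];
  }
  return acc;
}
```

For example, adding (start 0, cr 0x1, rr 0xAB) and then (start 0, cr 0x3, rr 0x01) stores the second row at slot 1 with result 0xAA, and after back-substitution querying either key returns the result it was added with.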

jay-zhuang · 2020-10-23T18:28:45Z

util/ribbon_impl.h

+    StandardHasher<TypesAndSettings>::ResetSeed();
+    do {
+      Reset(num_slots);
+      bool success = AddRange(begin, end);


What would be the success rate? Say r=128, 1 million keys, kFactor=1.025. rr = 8 bits. Is there a formula for that?

There's no code here for that yet. Code that chooses an appropriate factor for an appropriate success rate (generally staying above 50%) will come later.

jay-zhuang · 2020-10-23T18:32:49Z

util/ribbon_impl.h

+      }
+    } while (StandardHasher<TypesAndSettings>::NextSeed(max_seed));
+    // no seed through max_seed worked
+    return false;


Question: what should the user expect to do if it fails? Increase kNumSlots and retry? If that's the case, a helper function for that might be useful.

I think that's going to be application specific. Increasing kNumSlots could be an approach, but I expect us to fall back on Bloom filter if we find ourselves with extraordinarily bad luck, or something is broken. We should be able to make that reliably extremely rare without approaches like increasing kNumSlots.
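One way an application could wrap the seed loop with the fallback described above is sketched below. BuildWithSeedRetries, FilterChoice, and BuildOutcome are hypothetical names, and try_build stands in for whatever resets the banding and attempts to add all keys under a given seed. If each attempt succeeded independently with probability at least 50%, the chance of exhausting k seeds would be at most 2^-k, which is why the fallback should be rare.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

enum class FilterChoice { kRibbon, kBloomFallback };

struct BuildOutcome {
  FilterChoice choice;
  uint32_t seed;      // meaningful only when choice == kRibbon
  uint64_t attempts;  // number of seeds tried
};

// Try seeds 0..max_seed in order; report a Bloom fallback if none works.
// try_build(seed) is a caller-supplied callable (hypothetical) that returns
// true when construction succeeds with that seed.
inline BuildOutcome BuildWithSeedRetries(
    const std::function<bool(uint32_t)>& try_build, uint32_t max_seed) {
  uint64_t attempts = 0;
  // 64-bit loop counter so the loop terminates even if max_seed is UINT32_MAX.
  for (uint64_t seed = 0; seed <= max_seed; ++seed) {
    ++attempts;
    if (try_build(static_cast<uint32_t>(seed))) {
      return {FilterChoice::kRibbon, static_cast<uint32_t>(seed), attempts};
    }
  }
  return {FilterChoice::kBloomFallback, 0, attempts};
}
```

A caller would branch on the returned choice and build a Bloom filter only in the fallback case.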

facebook-github-bot

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

jay-zhuang · 2020-10-23T23:21:53Z

util/ribbon_alg.h

+// #################### Ribbon on-the-fly banding #######################
+//
+// "Banding" is what we call the process of reducing the inputs to an
+// upper-triangluar r-band matrix ready for finishing a solution with


typo: triangular

jay-zhuang · 2020-10-23T23:22:31Z

util/ribbon_alg.h

+// The enhanced algorithm is based on these observations:
+// - When processing a coefficient row with first 1 in column j,
+//   - If it's the first at column j to be processed, it can be part of
+//     the banding at row j. (And that descision never overwritten, with


typo: decision

jay-zhuang · 2020-10-23T23:24:13Z

util/ribbon_alg.h

+//   // Given a hash value, return the r-bit sequence of coefficients to
+//   // associate with it. It's generally OK if
+//   //   sizeof(CoeffRow) > sizeof(Hash)
+//   // as long as the hash itself is not too prone to collsions for the


typo: collisions

jay-zhuang · 2020-10-23T23:25:25Z

util/ribbon_alg.h

+
+// InterleavedSolutionStorage is row-major at a high level, for good
+// locality, and column-major at a low level, for CPU efficiency
+// especially in filter querys or relatively small number of result bits


jay-zhuang · 2020-10-23T23:26:19Z

util/ribbon_impl.h

+// to apply a different seed. This hasher seeds a 1-to-1 mixing
+// transformation to apply a seed to an existing hash (or hash-sized key).
+//
+// Testing suggests essentially no degredation of solution success rate


degradation?

jay-zhuang

Nice algorithm 👍👍👍

pdillinger · 2020-10-25T20:44:39Z

Thanks, I'll fix those typos (and more) in the next PR.

facebook-github-bot · 2020-10-26T04:37:19Z

@pdillinger merged this pull request in 25d54c7.

Summary: This is intended as the first commit toward a near-optimal alternative to static Bloom filters for SSTs. Stephan Walzer and I have agreed upon the name "Ribbon" for a PHSF based on his linear system construction in "Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications" ("SGauss") and my much faster "on the fly" algorithm for gaussian elimination (or for this linear system, "banding"), which can be faster than peeling while also more compact and flexible. See util/ribbon_alg.h for more detailed introduction and background.

RIBBON = Rapid Incremental Boolean Banding ON-the-fly

This commit just adds generic (templatized) core algorithms and a basic unit test showing some features, including the ability to construct structures within 2.5% space overhead vs. information theoretic lower bound. (Compare to cache-local Bloom filter's ~50% space overhead -> ~30% reduction anticipated.) This commit does not include the storage scheme necessary to make queries fast, especially for filter queries, nor fractional "result bits", but there is some description already and those implementations will come soon. Nor does this commit add FilterPolicy support, for use in SST files, but that will also come soon.

Pull Request resolved: facebook#7491
Reviewed By: jay-zhuang
Differential Revision: D24517954
Pulled By: pdillinger
fbshipit-source-id: 0119ee597e250d7e0edd38ada2ba50d755606fa7

Ribbon start (new name for SGauss)

bbd56d0

facebook-github-bot added the CLA Signed label Oct 1, 2020

pdillinger requested a review from jay-zhuang October 1, 2020 22:39

pdillinger added 9 commits October 1, 2020 15:59

Fix for TEST_UINT128_COMPAT

62f9f92

include array

20ce535

Merge remote-tracking branch 'origin/master' into ribbon1

88eba31

make format

0905ec8

Fix msvc narrowing warning

8e27d81

Merge remote-tracking branch 'origin/master' into ribbon2

035188f

Add Rehasher and enhance Ribbon unit test

4aa86c9

Merge remote-tracking branch 'origin/master' into ribbon2

339c1da

More comments

0a5923f

jay-zhuang reviewed Oct 23, 2020

pdillinger added 4 commits October 23, 2020 14:38

Address comments, and other minor/cosmetic fixes

9af2008

Merge remote-tracking branch 'origin/master' into ribbon1

f9ce0a2

Fix merge issue

f28214d

Fix GFLAGS issue

4d7626f

facebook-github-bot reviewed Oct 23, 2020

Fix uint64_t -> double conversion for msvc

d132775

facebook-github-bot reviewed Oct 23, 2020

jay-zhuang reviewed Oct 23, 2020

jay-zhuang approved these changes Oct 23, 2020

facebook-github-bot closed this in 25d54c7 Oct 26, 2020

facebook-github-bot added the Merged label Oct 26, 2020
