PARQUET-1617: Add more detail to Bloom filter spec (#140)
zivanfi merged 6 commits into apache:master
Conversation
BloomFilter.md
Outdated
| #### Algorithm
| In the initial algorithm, the most significant 32 bits from the hash value are used as the
| index to select a block from bitset. The lower 32 bits of the hash value, along with eight
| constant salt values, are used to compute the bit to set in each lane of the block. The
What is the purpose of salting here?
As the paper notes, the multiply-shift hashing technique is highly dependent on the fixed multiplicand. This follows the common technique of calling a non-key parameter to a hash function the "salt" or the "seed".
A short summary of the multiply-shift hashing technique would be welcome here as the linked paper is 33 pages long.
OK, I replaced it with a link to the Wikipedia article on multiplicative hashing: https://en.wikipedia.org/wiki/Hash_function#Multiplicative_hashing.
I'm sorry, but from the current specification it is still not clear to me what the salt is used for. I have a theory based on @jbapple's comment (see below), please confirm or correct it, but either way it should be described in the document as well.
So here is my theory: A bloom filter requires k different hash functions, each mapping the same input value x to (potentially and hopefully) different output values. Instead of k different hash functions, we can employ a single bivariate function using the same x as one of the input values but k different values for the other input variable. These k different values are the salt. So, am I even close to the correct answer? :)
Basically, you are headed in the right direction. To think of it another way: the different salts are used to construct different hash functions, of the form given in the referenced wiki article:
hash_i(x) = (salt_i * x) >> y
Since the target hash value must lie in [0, 31], we right-shift by y = 27 bits. As a result, we get eight hash values, which are the bit indexes within the tiny Bloom filter block.
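The multiply-shift step described in this exchange can be sketched in C. The `block_mask` helper name is hypothetical, and the salt constants below are illustrative placeholders rather than necessarily the normative values from the spec:

```c
#include <stdint.h>

/* Sketch: each of the eight salts defines one hash function
 * hash_i(x) = (salt_i * x) >> 27, whose surviving top 5 bits give a
 * bit index in [0, 31] for lane i of the 256-bit block.
 * Salt constants are placeholders, not necessarily those in the spec. */
static void block_mask(uint32_t x, uint32_t mask[8]) {
    static const uint32_t salt[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U
    };
    for (int i = 0; i < 8; i++) {
        uint32_t bit = (salt[i] * x) >> 27; /* multiply-shift: 0..31 */
        mask[i] = 1U << bit;                /* one bit per 32-bit lane */
    }
}
```

Setting a value into a block then ORs each `mask[i]` into lane `i`; a membership check ANDs instead.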
Makes sense, thanks. I consider an indexed function with a single parameter the same as a non-indexed one with two parameters, so we are on the same page. Could you please incorporate this explanation into the spec? Additionally, it may be worth mentioning that in the linked Wikipedia article the multiplier a corresponds to the salt, since the article does not use the same terminology. Thanks!
BloomFilter.md
Outdated
| @@ -106,15 +116,19 @@ following formula. The output is in bits per distinct element:
| -8 / log(1 - pow(p, 1.0 / 8));
Is it c = -8 / log(1 - pow(p, 1.0 / 8))?
The three backticks followed by a string tell Markdown renderers what language to render the fixed-width text as. Here, it is the C language.
I was not referring to the c after the three backticks, but to the previous sentence ("The output is in bits per distinct element:") and the definition "Let c := m/n be the bits-per-element rate." in the first linked paper. I think in its current form it's unclear what this formula means; if it is indeed c that we calculate, it would help to state this by prepending "c = " and adding the definition of c. Actually, m, n and k would also benefit from having a definitions section.
OK, I added some definitions in the first section. Please see the latest commit.
zivanfi left a comment:
Thanks for the improvements.
BloomFilter.md
Outdated
| offset is stored in column chunk metadata.
|
| #### Encryption
| The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
I'd suggest mentioning that in columns with sensitive data, bloom filters expose a subset of the sensitive information (with a quick explanation of what is exposed), and therefore need to be encrypted with the column key. Bloom filters of other (not sensitive) columns do not need to be encrypted.
"column chunk metadata which will be encrypted with the column key"
The encryption of column chunk metadata is mostly required for statistics protection, and is not significant in the context of bloom filters, since their offset is not a secret.
BloomFilter.md
Outdated
|
| #### Encryption
| The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
| key when encryption is enabled. The Bloom filter data itself should also be encrypted with column
Bloom filters are stored similarly to pages - with a page header (of the BloomFilterPageHeader type), and a page "data", or bitset. Therefore, they will be encrypted in the same way as we encrypt pages today - encrypting the PageHeader structure, and encrypting the page (bitset) itself, both with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Page" (9) (let me know if you prefer the latter to be called "BloomFilter Bitset") . I'll create a pull request to update the encryption spec with these 2 new types.
Also, it would be good to see the BloomFilter Thrift structures listed and explained in this doc (Bloom filter spec), in the previous section (File Format). It will be useful both in general, and in particular for explanation of their encryption.
I'm OK with "BloomFilter Page"; it pairs nicely with "BloomFilter PageHeader". I will add the Thrift structures in the next commit.
Sounds good. By the way, a question: what are the typical and maximal sizes of the bitsets? (Is there a way to estimate these?) If not too big, we might always encrypt them with the GCM cipher (to make them tamper-proof), in both encryption algorithms.
If we set the false positive rate to 1%, a 1.2 MB bitset can hold 1 million distinct values. So I would take 1 MB as the typical maximal size of a Bloom filter bitset, which should satisfy most cases.
Cool. If this is the order of magnitude of max size of chunk's bitset (and since, by design, Bloom filters are smaller than dictionary pages), we don't have to treat Bloom filter encryption identically to pages, and can apply maximal protection even in the relaxed GCM_CTR algorithm.
More specifically, looks like Bloom filters have two serializable modules (if I got this right):
- One is the PageHeader thrift structure (with its internal fields, including optional BloomFilterPageHeader bloom_filter_page_header). This structure is serialized by Thrift, and written to the file output stream, somewhere close to the footer.
- It is followed by the Bitset, serialized and written right after the filter header.
For filters in sensitive columns, both the PageHeader structure and the Bitset will be encrypted using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Bitset" (9).
Welcome :)
The Excel sheet stops at 2 MB, which is much smaller than the total of a chunk's data pages. When users configure the max size, their goal will also be to make the filter much smaller than the data; otherwise it's just easier to read all the data pages. Moreover, "using less space than dictionaries" is the formal goal of this spec. So I think we should be fine with the second proposal.
The second proposal can be written up along these lines -
Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal fields, including the BloomFilterPageHeader bloom_filter_page_header), and the Bitset. The header structure is serialized by Thrift, and written to file output stream; it is followed by the serialized Bitset.
For Bloom filters in sensitive columns, each of the two modules will be encrypted after serialization, and then written to the file. The encryption will be performed using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter Header" (8) and "BloomFilter Bitset" (9). The length of the encrypted buffer is written before the buffer, as described in the Parquet encryption specification.
So, to sum this up, you have two proposals, with the following trade-off: a somewhat simpler design vs. protection against bitset tampering attacks in the GCM_CTR encryption algorithm.
The performance will be identical in all practical use cases. In extreme cases where the bitset size is comparable to the data size, the throughput of the first proposal would be somewhat higher (only in the GCM_CTR algorithm, and only in Java 8 and below) - but then, I guess, encryption throughput would be the least of your worries; you don't want users to get to this extreme.
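The second proposal above can be summarized as a small sketch. The enumerator names are hypothetical; the numeric AAD module type values (8 and 9) are the ones discussed in this thread and are to be defined in the Parquet encryption spec:

```c
/* Hypothetical constants mirroring the AAD module types discussed
 * above; the normative values belong in the encryption spec. */
enum aad_module_type {
    AAD_BLOOM_FILTER_HEADER = 8, /* encrypts the serialized header */
    AAD_BLOOM_FILTER_BITSET = 9  /* encrypts the serialized Bitset */
};

/* On-disk layout for a Bloom filter in a sensitive column:
 *   [length][AES-GCM(serialized PageHeader, column key, AAD type 8)]
 *   [length][AES-GCM(serialized Bitset,     column key, AAD type 9)]
 * Both modules use the same column key, differing only in AAD type. */
```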
Thanks @ggershinsky, nice writeup.
As per your summary, I would also prefer to adopt AES GCM. So you will update Encryption.md to include these, right?
Yep, I'll open a jira/pr to update Encryption.md with the new module types for Bloom filter encryption.
BloomFilter.md
Outdated
| 3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
| 4. [xxHash](https://cyan4973.github.io/xxHash/)
| 5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
| The Bloom filter data of a column, which contains the size of the filter in bytes, the algorithm,
The Bloom filter data of a column,
nit: The Bloom filter data of a column chunk,
Nice catch, updated in the latest commit.
@ggershinsky, please take a look at the encryption part.

Looks good, thanks.

Thanks @ggershinsky. @zivanfi, the last concern about encryption has been addressed. Can we merge this now?
This PR addresses PARQUET-1617.