PARQUET-1617: Add more detail to Bloom filter spec (#140)
zivanfi merged 6 commits into apache:master
Conversation
BloomFilter.md
Outdated
| #### Algorithm
| In the initial algorithm, the most significant 32 bits from the hash value are used as the
| index to select a block from bitset. The lower 32 bits of the hash value, along with eight
| constant salt values, are used to compute the bit to set in each lane of the block. The
What is the purpose of salting here?
As the paper notes, the multiply-shift hashing technique is highly dependent on the fixed multiplicand. This follows the common technique of calling a non-key parameter to a hash function the "salt" or the "seed".
A short summary of the multiply-shift hashing technique would be welcome here as the linked paper is 33 pages long.
OK, I replaced it with a link to the Wikipedia article on multiplicative hashing: https://en.wikipedia.org/wiki/Hash_function#Multiplicative_hashing.
I'm sorry, but from the current specification it is still not clear to me what the salt is used for. I have a theory based on @jbapple's comment (see below), please confirm or correct it, but either way it should be described in the document as well.
So here is my theory: A bloom filter requires k different hash functions, each mapping the same input value x to (potentially and hopefully) different output values. Instead of k different hash functions, we can employ a single bivariate function using the same x as one of the input values but k different values for the other input variable. These k different values are the salt. So, am I even close to the correct answer? :)
Basically, you are headed in the right direction. To think of it another way: the different salts are used to construct different hash functions, of the form given in the referenced wiki article:
hash_i(x) = (salt_i * x) >> y
Since the target hash value must lie in [0, 31], we right-shift by y = 27 bits. As a result, we get eight hash values, which are the bit indexes within the tiny Bloom filter block.
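The multiply-shift step described in this exchange can be sketched in C. The `block_mask` helper name is hypothetical, and the salt constants below are illustrative placeholders rather than necessarily the normative values from the spec:

```c
#include <stdint.h>

/* Sketch: each of the eight salts defines one hash function
 * hash_i(x) = (salt_i * x) >> 27, whose surviving top 5 bits give a
 * bit index in [0, 31] for lane i of the 256-bit block.
 * Salt constants are placeholders, not necessarily those in the spec. */
static void block_mask(uint32_t x, uint32_t mask[8]) {
    static const uint32_t salt[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U
    };
    for (int i = 0; i < 8; i++) {
        uint32_t bit = (salt[i] * x) >> 27; /* multiply-shift: 0..31 */
        mask[i] = 1U << bit;                /* one bit per 32-bit lane */
    }
}
```

Setting a value into a block then ORs each `mask[i]` into lane `i`; a membership check ANDs instead.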
Makes sense, thanks. I consider an indexed function with a single parameter the same as a non-indexed one with two parameters, so we are on the same page. Could you please incorporate this explanation into the spec? Additionally, it may be worth mentioning that in the linked Wikipedia article the multiplier a corresponds to the salt, since the article does not use the same terminology. Thanks!
BloomFilter.md
Outdated
| @@ -106,15 +116,19 @@ following formula. The output is in bits per distinct element:
| -8 / log(1 - pow(p, 1.0 / 8));
Is it c = -8 / log(1 - pow(p, 1.0 / 8))?
The three backticks followed by a string tell Markdown renderers what language to render the fixed-width text as. Here, it is the C language.
I was not referring to the c after the three backticks, but to the previous sentence ("The output is in bits per distinct element:") and the definition "Let c := m/n be the bits-per-element rate." in the first linked paper. I think in its current form it's unclear what this formula means; if it is indeed c that we calculate, it would help to state this by prepending "c = " and adding the definition of c. Actually, m, n and k would also benefit from having a definitions section.
OK, I added some definitions in the first section. Please see the latest commit.
zivanfi left a comment:
Thanks for the improvements.
BloomFilter.md
Outdated
| offset is stored in column chunk metadata.
|
| #### Encryption
| The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
I'd suggest mentioning that in columns with sensitive data, bloom filters expose a subset of the sensitive information (with a quick explanation of what is exposed), and therefore need to be encrypted with the column key. Bloom filters of other (not sensitive) columns do not need to be encrypted.
"column chunk metadata which will be encrypted with the column key"
The encryption of column chunk metadata is mostly required for statistics protection, and is not significant in the context of bloom filters, since their offset is not a secret.
BloomFilter.md
Outdated
|
| #### Encryption
| The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
| key when encryption is enabled. The Bloom filter data itself should also be encrypted with column
Bloom filters are stored similarly to pages - with a page header (of the BloomFilterPageHeader type), and a page "data", or bitset. Therefore, they will be encrypted in the same way as we encrypt pages today - encrypting the PageHeader structure, and encrypting the page (bitset) itself, both with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Page" (9) (let me know if you prefer the latter to be called "BloomFilter Bitset") . I'll create a pull request to update the encryption spec with these 2 new types.
Also, it would be good to see the BloomFilter Thrift structures listed and explained in this doc (Bloom filter spec), in the previous section (File Format). It will be useful both in general, and in particular for explanation of their encryption.
I'm OK with "BloomFilter Page"; it pairs nicely with "BloomFilter PageHeader". I will add the Thrift structures in the next commit.
Sounds good. By the way, a question: what are the typical and maximal sizes of the bitsets? (Is there a way to estimate these?) If not too big, we might always encrypt them with the GCM cipher (to make them tamper-proof), in both encryption algorithms.
If we set the false positive rate to 1%, a 1.2 MB bitset can hold 1 million distinct values. So I would take 1 MB as the typical maximal size of a Bloom filter bitset, which should satisfy most cases.
Cool. If this is the order of magnitude of max size of chunk's bitset (and since, by design, Bloom filters are smaller than dictionary pages), we don't have to treat Bloom filter encryption identically to pages, and can apply maximal protection even in the relaxed GCM_CTR algorithm.
More specifically, looks like Bloom filters have two serializable modules (if I got this right):
- One is the PageHeader thrift structure (with its internal fields, including optional BloomFilterPageHeader bloom_filter_page_header). This structure is serialized by Thrift, and written to the file output stream, somewhere close to the footer.
- It is followed by the Bitset, serialized and written right after the filter header.
For filters in sensitive columns, both the PageHeader structure and the Bitset will be encrypted using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Bitset" (9).
Welcome :)
The Excel sheet stops at 2 MB, which is much smaller than the total of a chunk's data pages. When users configure the max size, their goal will also be to make the filter much smaller than the data; otherwise it's just easier to read all the data pages. Moreover, "using less space than dictionaries" is the formal goal of this spec. So I think we should be fine with the second proposal.
The second proposal can be written up along these lines -
Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal fields, including the BloomFilterPageHeader bloom_filter_page_header), and the Bitset. The header structure is serialized by Thrift, and written to file output stream; it is followed by the serialized Bitset.
For Bloom filters in sensitive columns, each of the two modules will be encrypted after serialization, and then written to the file. The encryption will be performed using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter Header" (8) and "BloomFilter Bitset" (9). The length of the encrypted buffer is written before the buffer, as described in the Parquet encryption specification.
So, to sum this up, you have two proposals, with the following trade-off: a somewhat simpler design vs. protection against bitset tampering attacks in the GCM_CTR encryption algorithm.
The performance will be identical in all practical use cases. In extreme cases where the bitset size is comparable to the data size, the throughput of the first proposal would be somewhat higher (only in the GCM_CTR algorithm, and only in Java 8 and below) - but then, I guess, encryption throughput would be the least of your worries; you don't want users to get to this extreme.
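The second proposal above can be summarized as a small sketch. The enumerator names are hypothetical; the numeric AAD module type values (8 and 9) are the ones discussed in this thread and are to be defined in the Parquet encryption spec:

```c
/* Hypothetical constants mirroring the AAD module types discussed
 * above; the normative values belong in the encryption spec. */
enum aad_module_type {
    AAD_BLOOM_FILTER_HEADER = 8, /* encrypts the serialized header */
    AAD_BLOOM_FILTER_BITSET = 9  /* encrypts the serialized Bitset */
};

/* On-disk layout for a Bloom filter in a sensitive column:
 *   [length][AES-GCM(serialized PageHeader, column key, AAD type 8)]
 *   [length][AES-GCM(serialized Bitset,     column key, AAD type 9)]
 * Both modules use the same column key, differing only in AAD type. */
```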
Thanks @ggershinsky, nice writeup.
As per your summary, I would also prefer to adopt AES GCM. So you will update Encryption.md to include these, right?
Yep, I'll open a jira/pr to update Encryption.md with the new module types for Bloom filter encryption.
BloomFilter.md
Outdated
| 3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
| 4. [xxHash](https://cyan4973.github.io/xxHash/)
| 5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
| The Bloom filter data of a column, which contains the size of the filter in bytes, the algorithm,
The Bloom filter data of a column,
nit: The Bloom filter data of a column chunk,
Nice catch, updated in the latest commit.
@ggershinsky, please take a look at the encryption part.

Looks good, thanks.

Thanks @ggershinsky. @zivanfi, the last concern about encryption has been addressed. Can we merge this now?
This PR addresses PARQUET-1617.