
PARQUET-1617: Add more detail to Bloom filter spec #140

Merged
zivanfi merged 6 commits into apache:master from chenjunjiedada:PARQUET-1617
Jul 10, 2019

@chenjunjiedada
Contributor

This PR addresses PARQUET-1617.

BloomFilter.md Outdated
#### Algorithm
In the initial algorithm, the most significant 32 bits from the hash value are used as the
index to select a block from bitset. The lower 32 bits of the hash value, along with eight
constant salt values, are used to compute the bit to set in each lane of the block. The
Contributor

What is the purpose of salting here?

Contributor

As the paper notes, the multiply-shift hashing technique is highly dependent on the fixed multiplicand. This follows the common technique of calling a non-key parameter to a hash function the "salt" or the "seed".

Contributor

A short summary of the multiply-shift hashing technique would be welcome here as the linked paper is 33 pages long.

Contributor
@zivanfi, Jul 8, 2019

I'm sorry, but from the current specification it is still not clear to me what the salt is used for. I have a theory based on @jbapple's comment (see below), please confirm or correct it, but either way it should be described in the document as well.

So here is my theory: A bloom filter requires k different hash functions, each mapping the same input value x to (potentially and hopefully) different output values. Instead of k different hash functions, we can employ a single bivariate function using the same x as one of the input values but k different values for the other input variable. These k different values are the salt. So, am I even close to the correct answer? :)

Contributor Author

Basically, you are on the right track. To put it another way, different salts are used to construct different hash functions, following the form given in the referenced Wikipedia article:

hash_i(x) = salt_i * x >> y

Since each target hash value lies in [0, 31], we right-shift by y = 27 bits. As a result, we get eight hash values, which serve as bit indexes into the tiny Bloom filter.
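For readers of this thread, the multiply-shift step described above could be sketched roughly as follows. This is a minimal illustration, not the spec's reference code: it assumes 32-bit wrap-around multiplication, and the `SALTS` values and both function names are placeholders introduced here, so take the authoritative constants from BloomFilter.md.

```python
# Illustrative salts: eight odd 32-bit constants. Treat them as placeholders
# (an assumption of this sketch) and use the values from BloomFilter.md.
SALTS = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)


def lane_bit_indexes(x: int) -> list[int]:
    """Return one bit index in [0, 31] for each of the eight lanes."""
    x &= 0xFFFFFFFF  # keep only the lower 32 bits of the hash value
    # Multiply-shift: multiply modulo 2**32, then keep the top 5 bits (y = 27).
    return [((salt * x) & 0xFFFFFFFF) >> 27 for salt in SALTS]


def block_mask(x: int) -> list[int]:
    """Return eight 32-bit words, each with exactly one bit set."""
    return [1 << index for index in lane_bit_indexes(x)]
```

Each salt plays the role of the multiplier in the multiply-shift family, so one input `x` yields eight different indexes without needing eight unrelated hash functions.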

Contributor

Makes sense, thanks. I consider an indexed function with a single parameter the same as a non-indexed one with two parameters, so we are on the same page. Could you please incorporate this explanation into the spec? Additionally, it may be worth mentioning that in the linked Wikipedia article, `a` corresponds to the salt, since the article does not use the same terminology. Thanks!

BloomFilter.md Outdated
@@ -106,15 +116,19 @@ following formula. The output is in bits per distinct element:
-8 / log(1 - pow(p, 1.0 / 8));
Contributor

Is it c = -8 / log(1 - pow(p, 1.0 / 8))?

Contributor

The three backticks followed by a string tell Markdown renderers what language to render the fixed-width text as. Here, it is the C language.

Contributor

I was not referring to the c after the three backticks, but to the previous sentence ("The output is in bits per distinct element:") and the definition "Let c := m/n be the bits-per-element rate." in the first linked paper. I think in its current form it's unclear what this formula means, and if it is indeed c that we calculate, it would help to state this by prepending "c = " and adding the definition of c. Actually m, n and k would also benefit from having a definitions section.
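To make the reading of the formula concrete, here is a small sketch that names the result `c`, following the quoted definition c := m/n (m bits in the filter, n distinct elements). The function name and the `k` parameter are introduced here for illustration; the formula under review hard-codes k = 8.

```python
import math


def bits_per_element(p: float, k: int = 8) -> float:
    """Return c = m/n for a target false positive probability p.

    Mirrors the expression under review: -8 / log(1 - pow(p, 1.0 / 8)).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return -k / math.log(1.0 - p ** (1.0 / k))


c = bits_per_element(0.01)  # roughly 9.7 bits per distinct element
```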

Contributor Author

OK, I added some definitions in the first section. Please see the latest commit.

Contributor
@zivanfi left a comment

Thanks for the improvements.

BloomFilter.md Outdated
offset is stored in column chunk metadata.

#### Encryption
The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
Contributor

I'd suggest mentioning that in columns with sensitive data, bloom filters expose a subset of the sensitive information (with a quick explanation of what is exposed), and therefore need to be encrypted with the column key. Bloom filters of other (not sensitive) columns do not need to be encrypted.
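As a rough illustration of what an unencrypted filter exposes: anyone who can read it can test guessed values for membership. The sketch below uses a toy classic Bloom filter, not the Parquet block layout; the class, the `probe` helper, and the sample values are all hypothetical.

```python
import hashlib


class ToyBloomFilter:
    """Classic Bloom filter for illustration only; not the Parquet layout."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: bytes):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(value, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def probe(bloom: ToyBloomFilter, guesses: list[bytes]) -> list[bytes]:
    """Return the guesses that the filter cannot rule out."""
    return [guess for guess in guesses if bloom.might_contain(guess)]


# Hypothetical sensitive column chunk with two values.
leaked = ToyBloomFilter()
for name in (b"alice", b"bob"):
    leaked.add(name)

# Every value really present is always reported; absent guesses are
# usually, but not always, ruled out (false positives are possible).
hits = probe(leaked, [b"alice", b"carol", b"bob"])
```

With a small enough candidate domain (names, IDs, dates), such probing recovers much of the column's distinct values, which is the motivation for encrypting the filter with the column key.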

BloomFilter.md Outdated
offset is stored in column chunk metadata.

#### Encryption
The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
Contributor
@ggershinsky, Jul 9, 2019

"column chunk metadata which will be encrypted with the column key"
The encryption of column chunk metadata is mostly required for statistics protection, and is not significant in the context of bloom filters, since their offset is not a secret.

Contributor Author

I see.

BloomFilter.md Outdated

#### Encryption
The Bloom filter offset is stored in column chunk metadata which will be encrypted with the column
key when encryption is enabled. The Bloom filter data itself should also be encrypted with column
Contributor

Bloom filters are stored similarly to pages - with a page header (of the BloomFilterPageHeader type), and a page "data", or bitset. Therefore, they will be encrypted in the same way as we encrypt pages today - encrypting the PageHeader structure, and encrypting the page (bitset) itself, both with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Page" (9) (let me know if you prefer the latter to be called "BloomFilter Bitset"). I'll create a pull request to update the encryption spec with these 2 new types.

Also, it would be good to see the BloomFilter Thrift structures listed and explained in this doc (Bloom filter spec), in the previous section (File Format). It will be useful both in general, and in particular for explanation of their encryption.

Contributor Author

I'm OK with "BloomFilter Page"; it pairs well with "BloomFilter PageHeader". I will add the Thrift structures in the next commit.

Contributor

Sounds good. By the way, a question - what are the typical and maximal sizes of the bitsets? (Is there a way to estimate these?) If not too big, we might always encrypt them with the GCM cipher (to make them tamper-proof), in both encryption algorithms.

Contributor Author

If we set the false positive rate to 1%, a 1.2 MB bitset can hold 1 million distinct values. So I would take 1 MB as the typical maximum size of a Bloom filter bitset, which should satisfy most cases.
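The 1.2 MB figure can be checked against the spec's bits-per-element formula. The sketch below is only a back-of-the-envelope estimate: the function name is introduced here, and it ignores any rounding (for example to whole blocks) that an implementation might apply.

```python
import math


def estimated_bitset_bytes(num_distinct: int, p: float) -> int:
    """Estimate the bitset size in bytes for num_distinct values at rate p."""
    bits_per_value = -8 / math.log(1 - p ** (1 / 8))
    return math.ceil(num_distinct * bits_per_value / 8)


# One million distinct values at a 1% false positive rate:
size = estimated_bitset_bytes(1_000_000, 0.01)  # about 1.21 million bytes
```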

Contributor

Cool. If this is the order of magnitude of max size of chunk's bitset (and since, by design, Bloom filters are smaller than dictionary pages), we don't have to treat Bloom filter encryption identically to pages, and can apply maximal protection even in the relaxed GCM_CTR algorithm.

More specifically, looks like Bloom filters have two serializable modules (if I got this right).

  • One is the PageHeader thrift structure (with its internal fields, including optional BloomFilterPageHeader bloom_filter_page_header). This structure is serialized by Thrift, and written to file output stream, somewhere close to the footer.

  • It is followed by the Bitset, serialized and written right after the filter header.

For filters in sensitive columns, both the PageHeader structure and the Bitset will be encrypted using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter PageHeader" (8) and "BloomFilter Bitset" (9).

Contributor

Welcome :)
The Excel sheet stops at 2 MB, which is much smaller than the total size of a chunk's data pages. When users configure the max size, their goal will also be to make the filter much smaller than the data; otherwise it's just easier to read all data pages. Moreover, "using less space than dictionaries" is the formal goal of this spec. So I think we should be fine with the second proposal.

Contributor

The second proposal can be written up along these lines -
Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal fields, including the BloomFilterPageHeader bloom_filter_page_header), and the Bitset. The header structure is serialized by Thrift, and written to file output stream; it is followed by the serialized Bitset.

For Bloom filters in sensitive columns, each of the two modules will be encrypted after serialization, and then written to the file. The encryption will be performed using the AES GCM cipher, with the same column key, but with different AAD module types - "BloomFilter Header" (8) and "BloomFilter Bitset" (9). The length of the encrypted buffer is written before the buffer, as described in the Parquet encryption specification.
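The write path in this proposal could look roughly like the sketch below. Several details are assumptions made for illustration and should be checked against the Parquet encryption specification: the 4-byte little-endian length prefix, the AAD layout in `module_aad`, and the `encrypt` callable, which stands in for AES GCM with the column key. Only the module type values 8 and 9 come from this thread.

```python
import struct
from typing import Callable

# AAD module types proposed in this thread.
BLOOM_FILTER_HEADER = 8
BLOOM_FILTER_BITSET = 9

# Stand-in for AES GCM with the column key: (plaintext, aad) -> encrypted buffer.
Encryptor = Callable[[bytes, bytes], bytes]


def module_aad(file_aad: bytes, module_type: int, row_group: int, column: int) -> bytes:
    """Assumed AAD layout: file AAD, 1-byte module type, two 2-byte LE ordinals."""
    return file_aad + bytes([module_type]) + struct.pack("<hh", row_group, column)


def write_encrypted_filter(header: bytes, bitset: bytes, file_aad: bytes,
                           row_group: int, column: int, encrypt: Encryptor) -> bytes:
    """Encrypt the serialized header and bitset separately, each length-prefixed."""
    out = bytearray()
    modules = ((BLOOM_FILTER_HEADER, header), (BLOOM_FILTER_BITSET, bitset))
    for module_type, plaintext in modules:
        aad = module_aad(file_aad, module_type, row_group, column)
        encrypted = encrypt(plaintext, aad)
        out += struct.pack("<I", len(encrypted))  # assumed 4-byte LE length
        out += encrypted
    return bytes(out)
```

Because the two modules use different AAD values, a header cannot be swapped for a bitset (or for a module from another column chunk) without the GCM tag check failing on read.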

So, to sum this up, you have two proposals, with the following trade-off: a somewhat simpler design vs. protection against bitset tampering attacks in the GCM_CTR encryption algorithm.

The performance will be identical in all practical use cases. In extreme cases where the bitset size is comparable to the data size, the throughput of the first proposal would be somewhat higher (only in the GCM_CTR algorithm, and only in Java 8 and below) - but then, I guess, encryption throughput would be the least of your worries; you don't want users to get to this extreme.

Contributor Author

Thanks, @ggershinsky, nice write-up.

Given your summary, I would also prefer to adopt AES GCM. So you will update Encryption.md to include these, right?

Contributor

Yep, I'll open a jira/pr to update Encryption.md with the new module types for Bloom filter encryption.

BloomFilter.md Outdated
3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
4. [xxHash](https://cyan4973.github.io/xxHash/)
5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
The Bloom filter data of a column, which contains the size of the filter in bytes, the algorithm,
Contributor
@ggershinsky, Jul 9, 2019

"The Bloom filter data of a column,"
nit: "The Bloom filter data of a column chunk,"

Contributor Author

Nice catch, updated in the latest commit.

@chenjunjiedada
Contributor Author

@ggershinsky, please take a look at the encryption part.

@ggershinsky
Contributor

Looks good, thanks.

@chenjunjiedada
Contributor Author

Thanks @ggershinsky.

@zivanfi, the last concern about encryption has been addressed. Can we merge this now?

@zivanfi zivanfi merged commit 17e5abf into apache:master Jul 10, 2019