PARQUET-1630: Update Bloom filter format #146

chenjunjiedada · 2019-08-04T14:15:39Z

No description provided.

chenjunjiedada · 2019-08-04T14:41:53Z

Hi @jbapple,

Could you please take a look first?

jbapple · 2019-08-05T01:03:38Z

In the interest of expediency, I have added my own PR, here:

#147

Would you be willing to limit this patch to the section about the file format only?

chenjunjiedada · 2019-08-05T01:45:42Z

Thanks @jbapple, the algorithm description looks much clearer :)

Will update this to cover just the format section.

chenjunjiedada · 2019-08-05T02:13:16Z

Done! @jbapple, please take a look at this PR as well.

jbapple · 2019-08-05T02:19:58Z

BloomFilter.md

-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
-thrift:
+Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block Bloom
+filter contains a header of Bloom filter, which must includes the size of the filter in bytes, the algorithm,


When you say "a header of Bloom filter", do you mean "the header of one Bloom filter"?

Also, "include the size", not "includes the size".

jbapple · 2019-08-05T02:20:34Z

BloomFilter.md

@@ -181,6 +182,9 @@ struct ColumnMetaData {

 ```

+The Bloom filter data is stored right after pages indexes, the file layout is look like:


Suggested change

-The Bloom filter data is stored right after pages indexes, the file layout is look like:
+The Bloom filter data is stored right after the page indexes, and the file layout looks like:

jbapple · Aug 5, 2019

This is still incorrect. Can you look at my suggestion more carefully and identify what you missed?

chenjunjiedada · Aug 5, 2019 (edited)

Oops. I think I forgot to use `git add` before `git commit --amend`.

jbapple · 2019-08-05T03:01:23Z

BloomFilter.md

+filter contains the header of one Bloom filter, which must include the size of the filter in bytes, the algorithm,
+the hash function, and the Bloom filter bitset. The offset in column chunk metadata points to the start of
+the Bloom filter header. 
+Here are Bloom filter definitions in thrift:


Suggested change

-Here are Bloom filter definitions in thrift:
+Here are the Bloom filter definitions in thrift:
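
For readers following the thread, the layout described in the diff above (an offset in the column chunk metadata pointing at a header, with the bitset after it) can be sketched roughly as follows. This is only an illustration: the spec defines the header in Thrift, and the fixed-width encoding, the `HEADER` struct, and the names `BloomFilterBlob`, `write_filter`, and `read_filter` are all hypothetical stand-ins, not the Parquet format or any library's API.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-width header: filter size in bytes, algorithm id, hash id.
# The real header is a Thrift structure; this encoding is for illustration only.
HEADER = struct.Struct("<iBB")


@dataclass
class BloomFilterBlob:
    num_bytes: int       # size of the bitset in bytes
    algorithm: int       # hypothetical numeric id for the algorithm
    hash_function: int   # hypothetical numeric id for the hash function
    bitset: bytes        # the filter bits, stored right after the header


def write_filter(algorithm: int, hash_function: int, bitset: bytes) -> bytes:
    """Serialize the header followed immediately by the bitset."""
    return HEADER.pack(len(bitset), algorithm, hash_function) + bitset


def read_filter(file_bytes: bytes, offset: int) -> BloomFilterBlob:
    """Read one filter, where `offset` points at the start of its header."""
    num_bytes, algorithm, hash_function = HEADER.unpack_from(file_bytes, offset)
    start = offset + HEADER.size
    bitset = file_bytes[start:start + num_bytes]
    return BloomFilterBlob(num_bytes, algorithm, hash_function, bitset)
```

The point of the sketch is that a reader needs only the offset from the column chunk metadata: the header says how many bitset bytes follow it.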

jbapple · 2019-08-05T05:14:29Z

LGTM

chenjunjiedada · 2019-08-07T07:11:51Z

@jbapple, from the mail thread, we are planning to add compression to the Bloom filter file format, so how about adding a field to store the compression algorithm?

jbapple · 2019-08-08T13:40:23Z

I think this PR (and other PRs) should do one thing each.

I said in the email thread "I'll send a PR for that after #147 is checked in to avoid rebase racing."

chenjunjiedada · 2019-08-09T01:26:38Z

@rdblue, could you please take a look at this as well? This doesn't include the compression change; Jim will submit a separate PR for that, as mentioned.

rdblue · 2019-08-21T16:45:33Z

@chenjunjiedada, this is getting close. Please have a look at my latest comments.

rdblue · 2019-08-26T23:27:27Z

+1

chenjunjiedada added 2 commits August 4, 2019 22:14

PARQUET-1630: Resolve Bloom filter spec concerns

0b87f16

minor updates

9107f85

Revert algorithm update

6e89d59

jbapple suggested changes Aug 5, 2019

Fix grammar issues

ea3ce71

jbapple reviewed Aug 5, 2019

fix gramma issues

defc267

jbapple approved these changes Aug 6, 2019

rdblue reviewed Aug 12, 2019

rdblue reviewed Aug 12, 2019

address comments

3a2b108

chenjunjiedada force-pushed the PARQUET-1630 branch 2 times, most recently from ea57431 to cbfcbc1 Compare August 19, 2019 16:21

address comments

6e7622a

chenjunjiedada force-pushed the PARQUET-1630 branch from cbfcbc1 to 6e7622a Compare August 19, 2019 16:22

chenjunjiedada changed the title from "PARQUET-1630: Resolve Bloom filter spec concerns" to "PARQUET-1630: Update Bloom filter format" Aug 20, 2019

rdblue reviewed Aug 21, 2019

update words

d228c6c

rdblue merged commit 3fb10e0 into apache:master Aug 26, 2019

chenjunjiedada deleted the PARQUET-1630 branch May 15, 2020 01:41

asfimport mentioned this pull request Jun 23, 2024

Resolve Bloom filter spec concerns #376

Closed
