PARQUET-1630: Update Bloom filter format #146

Merged
merged 8 commits into from
Aug 26, 2019

chenjunjiedada
Contributor

No description provided.

@chenjunjiedada
Contributor Author

Hi @jbapple,

Could you please take a look first?

@jbapple
Contributor

jbapple commented Aug 5, 2019

In the interest of expediency, I have added my own PR, here:

#147

Would you be willing to limit this patch to the section about the file format only?

@chenjunjiedada
Contributor Author

Thanks @jbapple, the algorithm description looks much clearer :)

I will update this to cover just the format section.

@chenjunjiedada
Contributor Author

Done! @jbapple, please take a look at this PR as well.

BloomFilter.md Outdated
filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
thrift:
Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block Bloom
filter contains a header of Bloom filter, which must includes the size of the filter in bytes, the algorithm,
Contributor

When you say "a header of Bloom filter", do you mean "the header of one Bloom filter"?

Also, "include the size", not "includes the size".

BloomFilter.md Outdated
@@ -181,6 +182,9 @@ struct ColumnMetaData {

```

The Bloom filter data is stored right after pages indexes, the file layout is look like:
Contributor

Suggested change
- The Bloom filter data is stored right after pages indexes, the file layout is look like:
+ The Bloom filter data is stored right after the page indexes, and the file layout looks like:

Contributor

This is still incorrect. Can you look at my suggestion more carefully and identify what you missed?

Contributor Author

chenjunjiedada commented Aug 5, 2019

Oops. I think I forgot to use `git add` before `git commit --amend`.

BloomFilter.md Outdated
filter contains the header of one Bloom filter, which must include the size of the filter in bytes, the algorithm,
the hash function, and the Bloom filter bitset. The offset in column chunk metadata points to the start of
the Bloom filter header.
Here are Bloom filter definitions in thrift:
Contributor

Suggested change
- Here are Bloom filter definitions in thrift:
+ Here are the Bloom filter definitions in thrift:

@jbapple
Contributor

jbapple commented Aug 5, 2019

LGTM

@chenjunjiedada
Contributor Author

@jbapple From the mail thread, we are planning to add compression to the Bloom filter file format, so how about adding a field to store the compression algorithm?

@jbapple
Contributor

jbapple commented Aug 8, 2019

I think this PR (and other PRs) should do one thing each.

I said in the email thread "I'll send a PR for that after #147 is checked in to avoid rebase racing."

@chenjunjiedada
Contributor Author

@rdblue, could you please take a look at this as well? This doesn't include the compression change; Jim will submit a separate PR for that, as mentioned.

chenjunjiedada force-pushed the PARQUET-1630 branch 2 times, most recently from ea57431 to cbfcbc1 (August 19, 2019 16:21)
chenjunjiedada changed the title from "PARQUET-1630: Resolve Bloom filter spec concerns" to "PARQUET-1630: Update Bloom filter format" on Aug 20, 2019
@rdblue
Contributor

rdblue commented Aug 21, 2019

@chenjunjiedada, this is getting close. Please have a look at my latest comments.

@rdblue
Contributor

rdblue commented Aug 26, 2019

+1

rdblue merged commit 3fb10e0 into apache:master on Aug 26, 2019
chenjunjiedada deleted the PARQUET-1630 branch on May 15, 2020 01:41