New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2257: Add bloom_filter_length to ColumnMetaData #194
Conversation
@@ -753,6 +753,9 @@ struct ColumnMetaData { | |||
|
|||
/** Byte offset from beginning of file to Bloom filter data. **/ | |||
14: optional i64 bloom_filter_offset; | |||
|
|||
/** Size of Bloom filter data, in bytes. **/ | |||
15: optional i32 bloom_filter_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that if length exists, offset must exists. However, if offset not exists, length should not exist?
And reader should check that:
- Does it have offset
- If has, can it go fast path that using bloom_filter_length to read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the writer side:
- Old writer only writes offset.
- New writer should write length as well.
On the reader side:
- Old reader only checks offset.
- New reader checks offset and then checks if length exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for future implementers we should add a comment noting that existing of one element does not indicate existence of the other (even though it is already implied by mapping).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
I have thought to add a comment to say that bloom_filter_length
is introduced by v2.10.0 (the next unreleased version) and readers may not expect its existence from legacy files or some implementations.
@emkornfield @pitrou Do you have any comment? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment otherwise looks OK to me, apologies for the late review.
I have updated the comment. Please take a look, thanks! @emkornfield |
The specs has only added
bloom_filter_offset
to locate the bloom filter. The reader cannot load the bloom filter in a single shot until it parses the bloom filter header to get the total size.This patch proposes to add an optional bloom_filter_length field to track the size of bloom filter to facilitate I/O scheduling. The specs already do the similar things for column index and offset index.