Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2257: Add bloom_filter_length to ColumnMetaData #194

Merged
merged 2 commits into from Apr 15, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Mar 13, 2023

The specs has only added bloom_filter_offset to locate the bloom filter. The reader cannot load the bloom filter in a single shot until it parses the bloom filter header to get the total size.

struct ColumnMetaData {
  /** Byte offset from beginning of file to Bloom filter data. **/
  14: optional i64 bloom_filter_offset;
}

This patch proposes to add an optional bloom_filter_length field to track the size of bloom filter to facilitate I/O scheduling. The specs already do the similar things for column index and offset index.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 13, 2023

@@ -753,6 +753,9 @@ struct ColumnMetaData {

/** Byte offset from beginning of file to Bloom filter data. **/
14: optional i64 bloom_filter_offset;

/** Size of Bloom filter data, in bytes. **/
15: optional i32 bloom_filter_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that if length exists, offset must exists. However, if offset not exists, length should not exist?

And reader should check that:

  1. Does it have offset
  2. If has, can it go fast path that using bloom_filter_length to read?

Copy link
Member Author

@wgtmac wgtmac Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the writer side:

  • Old writer only writes offset.
  • New writer should write length as well.

On the reader side:

  • Old reader only checks offset.
  • New reader checks offset and then checks if length exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for future implementers we should add a comment noting that existing of one element does not indicate existence of the other (even though it is already implied by mapping).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

I have thought to add a comment to say that bloom_filter_length is introduced by v2.10.0 (the next unreleased version) and readers may not expect its existence from legacy files or some implementations.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 20, 2023

@emkornfield @pitrou Do you have any comment?

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment otherwise looks OK to me, apologies for the late review.

@wgtmac
Copy link
Member Author

wgtmac commented Apr 6, 2023

One minor comment otherwise looks OK to me, apologies for the late review.

I have updated the comment. Please take a look, thanks! @emkornfield

@wgtmac wgtmac merged commit 9b040cc into apache:master Apr 15, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants