Skip to content

GH-46404: [C++][Python] add support for max_page_header_size for pyarrow#47758

Open
ayushbansal07 wants to merge 1 commit into
apache:mainfrom
ayushbansal07:feature/max_page_header_size-pyarrow
Open

GH-46404: [C++][Python] add support for max_page_header_size for pyarrow#47758
ayushbansal07 wants to merge 1 commit into
apache:mainfrom
ayushbansal07:feature/max_page_header_size-pyarrow

Conversation

@ayushbansal07
Copy link
Copy Markdown
Contributor

@ayushbansal07 ayushbansal07 commented Oct 8, 2025

Rationale for this change

Support in pyarrow to read parquet files with page header statistics which contain values larger than 8MiB.

What changes are included in this PR?

  • Add max_page_header_size in ReaderOptions ReaderProperties.
  • Remove max_page_header_size_ from SerializedPageReader and started reading it from properties_. Had to make properties_ inside SerializedPageReader non-const for set_max_page_header_size() api.

Are these changes tested?

Yes, tried with a sample python code.

Are there any user-facing changes?

Yes

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Oct 8, 2025

⚠️ GitHub issue #46404 has been automatically assigned in GitHub to PR creator.

@@ -260,7 +261,7 @@ class SerializedPageReader : public PageReader {
// Fills in data_page_statistics.
bool ShouldSkipPage(EncodedStatistics* data_page_statistics);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest keeping it as const. We can change the function set_max_page_header_size to do nothing after this PR and mark it as deprecated.

arrow::TimeUnit::type coerce_int96_timestamp_unit = arrow::TimeUnit::NANO;
Type::type binary_type = Type::BINARY;
Type::type list_type = Type::LIST;
uint32_t max_page_header_size = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 0 but not a meaningful default?

};

// 16 MB is the default maximum page header size
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also let's make this int32_t not uint32_t.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 9, 2025
/// Set the size of the buffered stream buffer in bytes.
void set_buffer_size(int64_t size) { buffer_size_ = size; }

/// Return the size of the buffered stream buffer. 0 means default
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can you make sure the docstrings actually describe the methods inside of blindly copy-pasting them?

Comment on lines +108 to +110
uint32_t max_page_header_size() const { return max_page_header_size_; }
/// Set the size of the buffered stream buffer in bytes. 0 means default
void set_max_page_header_size(uint32_t size) { max_page_header_size_ = size; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change these to take int32_t? We try to avoid unsigned integers in public API.

};

// 16 MB is the default maximum page header size
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also let's make this int32_t not uint32_t.

the canonical `arrow.uuid` extension type).
max_page_header_size : int, default None
If not None, override the maximum size of a page header.
Deafults to 16MB, which should be sufficient for most Parquet files.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Deafults to 16MB, which should be sufficient for most Parquet files.
Defaults to 16MB, which should be sufficient for most Parquet files.

thrift_container_size_limit=None, filesystem=None,
page_checksum_verification=False, arrow_extensions_enabled=True):
page_checksum_verification=False, arrow_extensions_enabled=True,
max_page_header_size=None):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, but is the new argument also available on pq.read_table and friends?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants