GH-46404: [C++][Python] add support for max_page_header_size for pyarrow#47758
GH-46404: [C++][Python] add support for max_page_header_size for pyarrow#47758ayushbansal07 wants to merge 1 commit into
Conversation
|
|
| @@ -260,7 +261,7 @@ class SerializedPageReader : public PageReader { | |||
| // Fills in data_page_statistics. | |||
| bool ShouldSkipPage(EncodedStatistics* data_page_statistics); | |||
|
|
|||
There was a problem hiding this comment.
I'd suggest keeping it as const. We can change the function set_max_page_header_size to do nothing after this PR and mark it as deprecated.
| arrow::TimeUnit::type coerce_int96_timestamp_unit = arrow::TimeUnit::NANO; | ||
| Type::type binary_type = Type::BINARY; | ||
| Type::type list_type = Type::LIST; | ||
| uint32_t max_page_header_size = 0; |
There was a problem hiding this comment.
Why 0 but not a meaningful default?
| }; | ||
|
|
||
| // 16 MB is the default maximum page header size | ||
| static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
| static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; | |
| constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Also let's make this int32_t not uint32_t.
| /// Set the size of the buffered stream buffer in bytes. | ||
| void set_buffer_size(int64_t size) { buffer_size_ = size; } | ||
|
|
||
| /// Return the size of the buffered stream buffer. 0 means default |
There was a problem hiding this comment.
Please, can you make sure the docstrings actually describe the methods inside of blindly copy-pasting them?
| uint32_t max_page_header_size() const { return max_page_header_size_; } | ||
| /// Set the size of the buffered stream buffer in bytes. 0 means default | ||
| void set_max_page_header_size(uint32_t size) { max_page_header_size_ = size; } |
There was a problem hiding this comment.
Can we change these to take int32_t? We try to avoid unsigned integers in public API.
| }; | ||
|
|
||
| // 16 MB is the default maximum page header size | ||
| static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Also let's make this int32_t not uint32_t.
| the canonical `arrow.uuid` extension type). | ||
| max_page_header_size : int, default None | ||
| If not None, override the maximum size of a page header. | ||
| Deafults to 16MB, which should be sufficient for most Parquet files. |
There was a problem hiding this comment.
| Deafults to 16MB, which should be sufficient for most Parquet files. | |
| Defaults to 16MB, which should be sufficient for most Parquet files. |
| thrift_container_size_limit=None, filesystem=None, | ||
| page_checksum_verification=False, arrow_extensions_enabled=True): | ||
| page_checksum_verification=False, arrow_extensions_enabled=True, | ||
| max_page_header_size=None): |
There was a problem hiding this comment.
It's ok, but is the new argument also available on pq.read_table and friends?
Rationale for this change
Support in pyarrow to read parquet files with page header statistics which contain values larger than 8MiB.
What changes are included in this PR?
Are these changes tested?
Yes, tried with a sample python code.
Are there any user-facing changes?
Yes
pyarrow#46404