@zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Nov 17, 2025

Which issue does this PR close?

When I wrap ParquetObjectReader to implement our internal metadata cache, I pass the options through to the inner ParquetObjectReader. However, the inner reader does not respect the page index policy option even when the policy is not skip: page index loading is always false, so the page index that I want to prefetch and cache is never loaded.

cc @alamb

Rationale for this change

Respect the page index policy passed in the options when it is not skip.
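
To illustrate the intended behavior, here is a minimal, self-contained sketch. It does not use the parquet crate: the `PageIndexPolicy` enum, `policy_from_preload`, and `effective_policy` below are hypothetical stand-ins, and the mapping from a preload flag to `Required` is an assumption made for the example. The rule it models is the one described above: a non-skip policy in the per-call options takes effect, otherwise the reader's own preload setting decides.

```rust
/// Toy stand-in for a page index policy (hypothetical, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Skip,
    Optional,
    Required,
}

/// Hypothetical reader-side default derived from a preload flag.
/// Assumption: `true` maps to `Required`, `false` maps to `Skip`.
fn policy_from_preload(preload: bool) -> PageIndexPolicy {
    if preload {
        PageIndexPolicy::Required
    } else {
        PageIndexPolicy::Skip
    }
}

/// Resolve the policy used for one metadata load.
/// A non-Skip policy from the options wins; otherwise fall back to the
/// reader's preload flag.
fn effective_policy(preload: bool, options: Option<PageIndexPolicy>) -> PageIndexPolicy {
    match options {
        Some(p) if p != PageIndexPolicy::Skip => p,
        _ => policy_from_preload(preload),
    }
}

fn main() {
    // The case from the description: preload is off, but the caller asks
    // for the page index through the options.
    println!("{:?}", effective_policy(false, Some(PageIndexPolicy::Required)));
    println!("{:?}", effective_policy(false, Some(PageIndexPolicy::Optional)));
    // Options say Skip (or are absent): the preload flag decides.
    println!("{:?}", effective_policy(true, Some(PageIndexPolicy::Skip)));
    println!("{:?}", effective_policy(false, None));
}
```

In this model a wrapper that forwards its options to the inner reader gets the page index it asked for, even if the inner reader was built with preloading disabled.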

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label Nov 17, 2025
Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas -- this makes sense to me

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@zhuqi-lucas
Contributor Author

> Thank you @zhuqi-lucas -- this makes sense to me

Thank you @alamb for the quick review!

Contributor

@etseidl etseidl left a comment

This looks ok to me. I'm not up on all uses of this API so I wonder if down the road we can just get rid of the preloading options and require use of ArrowReaderOptions if one wants control over page index loading.

I think for now we should add a note to with_preload_column_index and with_preload_offset_index that settings in ArrowReaderOptions might override, as this is non-obvious.

@zhuqi-lucas
Contributor Author

> This looks ok to me. I'm not up on all uses of this API so I wonder if down the road we can just get rid of the preloading options and require use of ArrowReaderOptions if one wants control over page index loading.
>
> I think for now we should add a note to with_preload_column_index and with_preload_offset_index that settings in ArrowReaderOptions might override, as this is non-obvious.

Thank you @etseidl for the review and the good suggestion. I added the comments in the latest commit. I also created a follow-up ticket to track whether we can get rid of the preloading options and require use of ArrowReaderOptions:
#8862

@alamb
Contributor

alamb commented Nov 18, 2025

> This looks ok to me. I'm not up on all uses of this API so I wonder if down the road we can just get rid of the preloading options and require use of ArrowReaderOptions if one wants control over page index loading.
>
> I think for now we should add a note to with_preload_column_index and with_preload_offset_index that settings in ArrowReaderOptions might override, as this is non-obvious.

I agree the current state of the structures is confusing (nothing that you did @zhuqi-lucas).

In my opinion, part of the confusion stems from having ParquetObjectReader responsible both for the IO with object_store AND for reading/caching the ParquetMetadata.

I think those are separate concerns, and in an ideal API they wouldn't be in the same object.
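
As a rough illustration of that separation (a sketch only, not a proposal for the actual API), the code below splits the two roles into a byte-range fetcher and a metadata cache that borrows it. Every name here (`RangeFetcher`, `InMemoryStore`, `MetadataCache`) is hypothetical, and "metadata" is simplified to the last few bytes of an object.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical IO-only abstraction: fetch a byte range, nothing else.
trait RangeFetcher {
    fn fetch(&mut self, path: &str, range: Range<usize>) -> Vec<u8>;
}

/// In-memory stand-in for an object store that counts how often it is hit.
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
    fetches: usize,
}

impl RangeFetcher for InMemoryStore {
    fn fetch(&mut self, path: &str, range: Range<usize>) -> Vec<u8> {
        self.fetches += 1;
        self.objects[path][range].to_vec()
    }
}

/// Hypothetical metadata layer: owns the cache, borrows IO only on a miss.
/// Here the "metadata" is just the trailing `footer_len` bytes.
struct MetadataCache {
    footer_len: usize,
    cached: HashMap<String, Vec<u8>>,
}

impl MetadataCache {
    fn get(&mut self, io: &mut dyn RangeFetcher, path: &str, object_len: usize) -> &[u8] {
        if !self.cached.contains_key(path) {
            let start = object_len.saturating_sub(self.footer_len);
            let bytes = io.fetch(path, start..object_len);
            self.cached.insert(path.to_string(), bytes);
        }
        &self.cached[path]
    }
}

fn main() {
    let mut store = InMemoryStore {
        objects: HashMap::from([("a.parquet".to_string(), b"datadataFOOT".to_vec())]),
        fetches: 0,
    };
    let mut cache = MetadataCache { footer_len: 4, cached: HashMap::new() };

    // The second lookup is served from the cache and does not touch the store.
    let first = cache.get(&mut store, "a.parquet", 12).to_vec();
    let second = cache.get(&mut store, "a.parquet", 12).to_vec();
    println!("{:?} {:?} fetches={}", first, second, store.fetches);
}
```

With this split, a caller who wants a custom metadata cache replaces only the cache layer and leaves the IO type untouched.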

@alamb alamb merged commit e4fdefb into apache:main Nov 18, 2025
16 checks passed
@alamb
Contributor

alamb commented Nov 18, 2025

Thank you @zhuqi-lucas and @etseidl

Development

Successfully merging this pull request may close these issues.

Respect page index policy option for ParquetObjectReader when it's not skip