[C++][Parquet] Parquet page index read support #33596

Closed
Tracked by #26168
asfimport opened this issue Dec 14, 2022 · 2 comments · Fixed by #14964

Comments

@asfimport
Collaborator

Implement read support for the Parquet page index and expose it through the reader API.

Reporter: Gang Wu / @wgtmac
Assignee: Gang Wu / @wgtmac

PRs and other links:

Note: This issue was originally created as ARROW-18434. Please see the migration documentation for further details.

pitrou added a commit that referenced this issue Feb 2, 2023
The patch provides the following implementation:
- Define `class RowGroupPageIndexReader` to read the page index from a Parquet row group. Internally it borrows the approach used in Apache Impala ([link](https://github.com/apache/impala/blob/efa426453a8af3728bc272b9158f5564ce37e0ea/be/src/exec/parquet/parquet-page-index.cc#L41)) to merge the I/O chunks of the page index within the same row group.
- Define `class PageIndexReader` to create a `RowGroupPageIndexReader` for each row group.
- `ParquetFileReader` internally creates and caches a single `PageIndexReader` object and exposes it to the end user.
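
To illustrate the idea of merging page index I/O within a row group, here is a minimal sketch that computes one covering byte range for all column indexes (or all offset indexes) of a row group, so they can be fetched with a single read. It is not the code from the patch or from Impala: `ByteRange`, `ColumnChunkIndexLocation`, and `CoveringRange` are hypothetical names, and the sketch assumes each column chunk optionally records the offset and length of its two index structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a byte range inside a Parquet file.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical per-column-chunk metadata: where its column index and offset
// index are stored. Either one may be absent.
struct ColumnChunkIndexLocation {
  std::optional<ByteRange> column_index;
  std::optional<ByteRange> offset_index;
};

// Returns the smallest range covering every present index of the requested
// kind across all column chunks, or std::nullopt if none is present.
std::optional<ByteRange> CoveringRange(
    const std::vector<ColumnChunkIndexLocation>& chunks, bool want_column_index) {
  int64_t begin = 0;
  int64_t end = 0;
  bool found = false;
  for (const auto& chunk : chunks) {
    const auto& loc = want_column_index ? chunk.column_index : chunk.offset_index;
    if (!loc) continue;  // this column chunk has no index of that kind
    assert(loc->offset >= 0 && loc->length >= 0);
    const int64_t loc_end = loc->offset + loc->length;
    if (!found) {
      begin = loc->offset;
      end = loc_end;
      found = true;
    } else {
      begin = std::min(begin, loc->offset);
      end = std::max(end, loc_end);
    }
  }
  if (!found) return std::nullopt;
  return ByteRange{begin, end - begin};
}
```

A caller would issue one read for the returned range and then deserialize each column's index from its slice of that buffer. The trade-off is that any gap between indexes is read as well, which is usually small because writers tend to store the indexes of a row group contiguously.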

Limitation:
- Reading the page index from an encrypted Parquet file is not yet supported. This requires further study of the specs and will be done in a separate patch once the page index write logic is complete.
* Closes: #33596

Lead-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 12.0.0 milestone Feb 2, 2023
@pitrou
Member

pitrou commented Feb 2, 2023

@wgtmac Can you add a comment so that we can assign you here?

@wgtmac
Member

wgtmac commented Feb 3, 2023

> @wgtmac Can you add a comment so that we can assign you here?

Thanks @pitrou!

sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
…e#14964)

gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
…e#14964)

fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…e#14964)

3 participants