Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34053: [C++][Parquet] Write parquet page index #34054

Merged
merged 5 commits into from
Apr 10, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Feb 6, 2023

Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.

Are these changes tested?

Not yet, will be added later.

Are there any user-facing changes?

WriterProperties::enable_write_page_index() and WriterProperties::disable_write_page_index() have been added to toggle it on and off.

@github-actions
Copy link

github-actions bot commented Feb 6, 2023

@github-actions
Copy link

github-actions bot commented Feb 6, 2023

⚠️ GitHub issue #34053 has been automatically assigned in GitHub to PR creator.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 6, 2023

@pitrou @emkornfield Could you please take a look? The interface and implementation are complete with detailed comments. I will add test gradually.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 9, 2023

I have hit several issues while working on this patch. They must be resolved before proceeding. So I have switched it to draft and will work on the blocking issues first.

Blocking issues:

@etseidl
Copy link
Contributor

etseidl commented Feb 10, 2023

@wgtmac another issue to consider as you implement the page index is rows that span multiple pages. With nested columns, it is possible to have single rows that are so large that they exceed the requested page size. arrow-cpp currently will honor the page size by splitting these rows across multiple pages. The current parquet spec, however, seems to require that pages begin at row boundaries (i.e. the repetition level R is 0 for the first value in each page, see here and here). Do you concur and think this should be another blocking issue or part of this PR?

@wgtmac
Copy link
Member Author

wgtmac commented Feb 11, 2023

@wgtmac another issue to consider as you implement the page index is rows that span multiple pages. With nested columns, it is possible to have single rows that are so large that they exceed the requested page size. arrow-cpp currently will honor the page size by splitting these rows across multiple pages. The current parquet spec, however, seems to require that pages begin at row boundaries (i.e. the repetition level R is 0 for the first value in each page, see here and here). Do you concur and think this should be another blocking issue or part of this PR?

Thanks for the information. @etseidl

Yes I have already noticed that a record may span across different pages. But in the parquet-cpp, the page size check always happens at the end of each batch. Therefore it guarantees that a page will not split any record. Please check this function as well as where it is called for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376

@etseidl
Copy link
Contributor

etseidl commented Feb 11, 2023

Yes I have already noticed that a record may span across different pages. But in the parquet-cpp, the page size check always happens at the end of each batch. Therefore it guarantees that a page will not split any record. Please check this function as well as where it is called for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376

Perhaps I'm misunderstanding, but it appears that the function you referenced is called after a batch of values is written...I don't see where it is guaranteed that the end of a batch is also the end of a row. But thanks for working on the page indexes, I think it's an important feature that arrow-cpp currently lacks.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 11, 2023

Yes I have already noticed that a record may span across different pages. But in the parquet-cpp, the page size check always happens at the end of each batch. Therefore it guarantees that a page will not split any record. Please check this function as well as where it is called for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376

Perhaps I'm misunderstanding, but it appears that the function you referenced is called after a batch of values is written...I don't see where it is guaranteed that the end of a batch is also the end of a row. But thanks for working on the page indexes, I think it's an important feature that arrow-cpp currently lacks.

Please correct me if I am wrong. At least the arrow parquet writer guarantees this by calling ColumnWriter::WriteArrow like this: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L154. Yes, the ParquetFileWriter itself does not prevent this.

@emkornfield
Copy link
Contributor

Please correct me if I am wrong. At least the arrow parquet writer guarantees this by calling ColumnWriter::WriteArrow like this: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L154. Yes, the ParquetFileWriter itself does not prevent this.

I think @etseidl is correct here, WriteArrow is working on leaf arrays and IIRC rep/def levels in the code references are the only way to recover record boundaries. Sorry its been a busy week will aim to catchup on reviews next week. It would also be nice to not special case this for Arrow even it does somehow work there.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 11, 2023

Please correct me if I am wrong. At least the arrow parquet writer guarantees this by calling ColumnWriter::WriteArrow like this: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L154. Yes, the ParquetFileWriter itself does not prevent this.

I think @etseidl is correct here, WriteArrow is working on leaf arrays and IIRC rep/def levels in the code references are the only way to recover record boundaries. Sorry its been a busy week will aim to catchup on reviews next week. It would also be nice to not special case this for Arrow even it does somehow work there.

@emkornfield Thanks for the explanation! No problem and this is not ready to review due to a series of blocking issues ahead.

I strongly agree that writing via arrow should not be a special case. It sounds like splitting page at record boundary is a new blocking issue now.

@wgtmac wgtmac force-pushed the write_page_index branch 2 times, most recently from 43facfa to 62a5528 Compare February 24, 2023 08:39
@wgtmac wgtmac marked this pull request as ready for review February 24, 2023 08:40
@wgtmac
Copy link
Member Author

wgtmac commented Feb 24, 2023

This is ready to review now. Please take a look when you have time, thanks! @emkornfield @wjones127 @pitrou

@wgtmac wgtmac force-pushed the write_page_index branch 2 times, most recently from 5d70802 to c753018 Compare February 24, 2023 09:21
@wgtmac
Copy link
Member Author

wgtmac commented Feb 24, 2023

Opened an issue for the CI failure which is unrelated to this PR: #34328

cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 6, 2023
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I also tested it locally by turning on page index by default and verify the tests pass for C++ and Python (after changing assertions that it is turned off by default).

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting change review Awaiting change review awaiting merge Awaiting merge labels Apr 7, 2023
@wgtmac
Copy link
Member Author

wgtmac commented Apr 10, 2023

This looks good to me. I also tested it locally by turning on page index by default and verify the tests pass for C++ and Python (after changing assertions that it is turned off by default).

Thanks for your review!

@wgtmac
Copy link
Member Author

wgtmac commented Apr 10, 2023

@wjones127 Could we make it to the v12.0.0 release?

@wjones127
Copy link
Member

I will merge this later today if @emkornfield has no further comments.

@wjones127 wjones127 merged commit 0cf4ffa into apache:main Apr 10, 2023
@ursabot
Copy link

ursabot commented Apr 11, 2023

Benchmark runs are scheduled for baseline = a7c4e05 and contender = 0cf4ffa. 0cf4ffa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.62% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.15% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0cf4ffaa ec2-t3-xlarge-us-east-2
[Finished] 0cf4ffaa test-mac-arm
[Finished] 0cf4ffaa ursa-i9-9960x
[Finished] 0cf4ffaa ursa-thinkcentre-m75q
[Finished] a7c4e05a ec2-t3-xlarge-us-east-2
[Finished] a7c4e05a test-mac-arm
[Finished] a7c4e05a ursa-i9-9960x
[Finished] a7c4e05a ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
### Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

### What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.

### Are these changes tested?

Not yet, will be added later.

### Are there any user-facing changes?

`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.
* Closes: apache#34053

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
### Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

### What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.

### Are these changes tested?

Not yet, will be added later.

### Are there any user-facing changes?

`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.
* Closes: apache#34053

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
### Rationale for this change

Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.

### What changes are included in this PR?

Parquet file writer collects page index from all data pages and serializes page index into the file.

### Are these changes tested?

Not yet, will be added later.

### Are there any user-facing changes?

`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.
* Closes: apache#34053

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Write parquet page index
7 participants