GH-34053: [C++][Parquet] Write parquet page index #34054
@pitrou @emkornfield Could you please take a look? The interface and implementation are complete, with detailed comments. I will add tests gradually.
I have hit several issues while working on this patch that must be resolved before proceeding, so I have switched it to draft and will work on the blocking issues first. Blocking issues:
@wgtmac Another issue to consider as you implement the page index is rows that span multiple pages. With nested columns, a single row can be so large that it exceeds the requested page size. arrow-cpp currently honors the page size by splitting such rows across multiple pages. The current Parquet spec, however, seems to require that pages begin at row boundaries (i.e. the repetition level R is 0 for the first value in each page, see here and here). Do you concur, and do you think this should be another blocking issue or part of this PR?
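To make the row-boundary condition concrete, here is a minimal sketch of how row starts can be recovered from repetition levels alone. It relies only on the rule quoted above (a repetition level of 0 marks the first value of a new row). `RowStarts` and `PageStartsAtRowBoundary` are hypothetical helper names for illustration and are not part of arrow-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the positions in `rep_levels` at which a new
// row (record) starts. A repetition level of 0 marks the first value of a row.
std::vector<std::size_t> RowStarts(const std::vector<int16_t>& rep_levels) {
  std::vector<std::size_t> starts;
  for (std::size_t i = 0; i < rep_levels.size(); ++i) {
    if (rep_levels[i] == 0) starts.push_back(i);
  }
  return starts;
}

// Hypothetical helper: true if a page whose first level sits at `offset`
// begins on a row boundary, which is what the page index appears to require.
bool PageStartsAtRowBoundary(const std::vector<int16_t>& rep_levels,
                             std::size_t offset) {
  assert(offset < rep_levels.size());
  return rep_levels[offset] == 0;
}
```

For the levels `{0, 1, 1, 0, 1, 0}`, rows start at positions 0, 3 and 5, so a page that started at position 1 would split the first row.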
Thanks for the information, @etseidl. Yes, I have already noticed that a record may span different pages. But in parquet-cpp, the page size check always happens at the end of each batch, which guarantees that a page will not split any record. Please check this function, as well as where it is called, for reference: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1376
Perhaps I'm misunderstanding, but it appears that the function you referenced is called after a batch of values is written. I don't see where it is guaranteed that the end of a batch is also the end of a row. But thanks for working on the page indexes; I think it's an important feature that arrow-cpp currently lacks.
Please correct me if I am wrong. At least the arrow parquet writer guarantees this by calling |
I think @etseidl is correct here. WriteArrow works on leaf arrays and, IIRC, the rep/def levels in the referenced code are the only way to recover record boundaries. Sorry, it's been a busy week; I will aim to catch up on reviews next week. It would also be nice not to special-case this for Arrow, even if it does somehow work there.
@emkornfield Thanks for the explanation! No problem, this is not ready for review anyway because of the series of blocking issues ahead. I strongly agree that writing via Arrow should not be a special case. It sounds like splitting pages at record boundaries is now a new blocking issue.
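As a rough illustration of what splitting pages at record boundaries could look like, the sketch below closes a page only when the target has been reached and the next level starts a new record (repetition level 0). It is not how arrow-cpp implements this: the name `PageStartOffsets` is hypothetical, and a count of levels stands in for the real byte-size check.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns the offsets at which each page starts.
// A new page is opened only when the current page already holds at least
// `target_levels_per_page` levels AND the next level begins a new record.
// Assumes `rep_levels` is a valid level stream, i.e. its first level is 0.
std::vector<std::size_t> PageStartOffsets(
    const std::vector<int16_t>& rep_levels,
    std::size_t target_levels_per_page) {
  std::vector<std::size_t> page_starts;
  if (rep_levels.empty()) return page_starts;
  page_starts.push_back(0);
  std::size_t levels_in_page = 0;
  for (std::size_t i = 0; i < rep_levels.size(); ++i) {
    if (levels_in_page >= target_levels_per_page && rep_levels[i] == 0) {
      page_starts.push_back(i);  // safe split point: a record starts here
      levels_in_page = 0;
    }
    ++levels_in_page;
  }
  return page_starts;
}
```

With levels `{0, 1, 1, 1, 1, 0, 1, 0, 0}` and a target of 2, the pages start at offsets 0, 5 and 7. The first page overshoots the target because its single record holds five values, which is the trade-off of never splitting a record.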
This is ready to review now. Please take a look when you have time, thanks! @emkornfield @wjones127 @pitrou |
Opened an issue for the CI failure which is unrelated to this PR: #34328 |
This looks good to me. I also tested it locally by turning on the page index by default and verifying that the tests pass for C++ and Python (after changing the assertions that expect it to be off by default).
Thanks for your review! |
@wjones127 Could we make it to the v12.0.0 release? |
I will merge this later today if @emkornfield has no further comments. |
Benchmark runs are scheduled for baseline = a7c4e05 and contender = 0cf4ffa. 0cf4ffa is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
### Rationale for this change Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it. ### What changes are included in this PR? Parquet file writer collects page index from all data pages and serializes page index into the file. ### Are these changes tested? Not yet, will be added later. ### Are there any user-facing changes? `WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off. * Closes: apache#34053 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Will Jones <willjones127@gmail.com>
Rationale for this change
Parquet C++ reader supports reading page index from file, but the writer does not yet support writing it.
What changes are included in this PR?
Parquet file writer collects page index from all data pages and serializes page index into the file.
Are these changes tested?
Not yet, will be added later.
Are there any user-facing changes?
`WriterProperties::enable_write_page_index()` and `WriterProperties::disable_write_page_index()` have been added to toggle it on and off.
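For readers unfamiliar with what "collecting the page index from all data pages" involves, here is a simplified, self-contained sketch. It is not the implementation in this PR. `PageIndexCollector` and `PageIndexSketch` are hypothetical names, values are restricted to `int64_t`, and offsets ignore page headers and the page's real position in the file. The field names `offset`, `compressed_page_size` and `first_row_index` follow the `PageLocation` struct in the Parquet format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the fields of PageLocation in the Parquet format.
struct PageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

// Hypothetical, simplified stand-in for a column index plus offset index.
struct PageIndexSketch {
  std::vector<int64_t> min_values;  // one entry per data page
  std::vector<int64_t> max_values;  // one entry per data page
  std::vector<PageLocation> page_locations;
};

// Hypothetical collector: call AddPage once per data page, in write order.
class PageIndexCollector {
 public:
  void AddPage(const std::vector<int64_t>& values, int64_t num_rows,
               int32_t compressed_size) {
    assert(!values.empty());  // all-null pages are out of scope for this sketch
    auto min_max = std::minmax_element(values.begin(), values.end());
    index_.min_values.push_back(*min_max.first);
    index_.max_values.push_back(*min_max.second);
    index_.page_locations.push_back(
        PageLocation{next_offset_, compressed_size, next_row_});
    next_offset_ += compressed_size;  // simplification: no page headers
    next_row_ += num_rows;
  }

  const PageIndexSketch& index() const { return index_; }

 private:
  PageIndexSketch index_;
  int64_t next_offset_ = 0;
  int64_t next_row_ = 0;
};
```

A reader can use the per-page min/max to skip pages that cannot match a predicate, and `first_row_index` to map the surviving pages back to row ranges. That mapping is the reason the row-boundary question discussed in this conversation matters.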