support write page index #1777

liukun4515 · 2022-06-02T08:33:52Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As discussed in issue #1705, the page index is an important feature for both reading and writing Parquet in arrow-rs.
@Ted-Jiang is working on the read path for the page index in #1749.

I will work on the write path for the page index in parquet-rs.

@alamb
@tustvold

tustvold · 2022-06-06T17:16:46Z

One thing I stumbled across whilst reviewing the read side is that the ColumnIndex has an implicit assumption that semantic records don't span page boundaries. That is, the first repetition level of each page is 0. I think we may need to enforce this when writing a file with a column index. https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L920

nevi-me · 2022-06-12T19:35:11Z

> One thing I stumbled across whilst reviewing the read side, is that the ColumnIndex has an implicit assumption that semantic records don't span page boundaries. That is the first repetition level of each page is 0. I think we may need to enforce this when writing file with a column index. https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L920

I think it should be considered a bug if a record is split across a page boundary. It would make the record ambiguous, and could force a reader to scan the preceding page backwards to find the start of the record. Great catch!
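
To make the constraint concrete, here is a minimal sketch of a page splitter that only cuts where a new record starts, plus a check for the invariant. The function names and the `target_len` parameter are hypothetical and are not part of the parquet crate; it only assumes that a repetition level of 0 marks the first value of a record.

```rust
/// Split a column's repetition levels into pages of at least `target_len`
/// values, cutting only where a new record starts (repetition level 0).
/// Returns the start index of each page within `rep_levels`.
/// Hypothetical helper, not an arrow-rs API.
fn page_starts(rep_levels: &[i16], target_len: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    if rep_levels.is_empty() {
        return starts;
    }
    starts.push(0);
    let mut page_start = 0;
    for (i, &level) in rep_levels.iter().enumerate().skip(1) {
        // Only a value with repetition level 0 begins a new record, so it is
        // the only position where a page may be cut.
        if level == 0 && i - page_start >= target_len {
            starts.push(i);
            page_start = i;
        }
    }
    starts
}

/// Check the invariant the ColumnIndex relies on: every page begins a record.
fn pages_start_on_records(rep_levels: &[i16], starts: &[usize]) -> bool {
    starts.iter().all(|&s| rep_levels[s] == 0)
}
```

Note that a page can grow past `target_len` when a single record is long, which is the price of never splitting a record.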

liukun4515 · 2022-06-21T06:48:34Z

@tustvold
I have read the write path in https://github.com/apache/parquet-mr and am now very familiar with how Parquet files are written.
Different column chunks in the same row group may contain different numbers of pages.
A value is always encoded into exactly one page.

liukun4515 · 2022-06-21T07:36:02Z

After going through the current write path in arrow-rs, I think we need a minor refactor to collect the page index and write it at the tail of the Parquet file.

liukun4515 · 2022-06-21T07:50:36Z

 * For example:
 *
 * <pre>
 * rows   col1   col2   col3
 *      ┌──────┬──────┬──────┐
 *   0  │  p0  │      │      │
 *      ╞══════╡  p0  │  p0  │
 *  20  │ p1(X)│------│------│
 *      ╞══════╪══════╡      │
 *  40  │ p2(X)│      │------│
 *      ╞══════╡ p1(X)╞══════╡
 *  60  │ p3(X)│      │------│
 *      ╞══════╪══════╡      │
 *  80  │  p4  │      │  p1  │
 *      ╞══════╡  p2  │      │
 * 100  │  p5  │      │      │
 *      └──────┴──────┴──────┘
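
The diagram can be read as row ranges per page. The sketch below derives those ranges from each page's first row index and finds the pages that fall entirely inside a skipped row range. The helper names are hypothetical, and the first-row values and the total of 120 rows used in the usage note are read off the diagram as an assumption.

```rust
use std::ops::Range;

/// Turn the first row index of each page into the row range the page covers.
/// The last page runs to `total_rows`. Hypothetical helper.
fn page_row_ranges(first_rows: &[usize], total_rows: usize) -> Vec<Range<usize>> {
    first_rows
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = first_rows.get(i + 1).copied().unwrap_or(total_rows);
            start..end
        })
        .collect()
}

/// Indexes of the pages that lie entirely inside the `skipped` row range.
fn fully_skipped_pages(pages: &[Range<usize>], skipped: &Range<usize>) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.start >= skipped.start && p.end <= skipped.end)
        .map(|(i, _)| i)
        .collect()
}
```

With col1 starting pages at rows 0, 20, 40, 60, 80, 100, col2 at 0, 40, 80 and col3 at 0, 60, skipping rows 20..80 drops pages 1 to 3 of col1 and page 1 of col2, while no page of col3 can be dropped whole. These are the pages marked (X) in the diagram.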

liukun4515 · 2022-06-24T07:52:05Z

The current Parquet writer in the Rust implementation works as follows:
1. start a row group
2. start a column
3. write data to the column
4. end the column, which returns the column chunk metadata
5. end the row group

Before flushing the file metadata to disk, we should edit the column chunk metadata to add the index offset and index length.
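
A rough sketch of that last step, under stated assumptions: the struct and function below are hypothetical stand-ins rather than the parquet crate's types, the index bytes are treated as already serialized, and a `Vec<u8>` plays the role of the file. Only the four field names are taken from the optional index fields of `ColumnChunk` in parquet.thrift. Writing all column indexes before all offset indexes is also an assumption about the layout.

```rust
/// Hypothetical stand-in for the column chunk metadata. The four fields mirror
/// the optional index fields of `ColumnChunk` in parquet.thrift.
#[derive(Debug, Default, Clone, PartialEq)]
struct ColumnChunkMeta {
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
    offset_index_offset: Option<i64>,
    offset_index_length: Option<i32>,
}

/// Already-serialized index bytes collected while one column chunk was written.
struct CollectedIndex {
    column_index: Vec<u8>,
    offset_index: Vec<u8>,
}

/// Append the collected indexes to the tail of `file` and record where each
/// one landed in the matching column chunk metadata. This must run before the
/// file metadata (footer) is serialized, so the footer carries the offsets.
fn write_page_indexes(
    file: &mut Vec<u8>,
    chunks: &mut [ColumnChunkMeta],
    indexes: &[CollectedIndex],
) {
    assert_eq!(chunks.len(), indexes.len());
    // First pass: column indexes.
    for (chunk, index) in chunks.iter_mut().zip(indexes) {
        chunk.column_index_offset = Some(file.len() as i64);
        chunk.column_index_length = Some(index.column_index.len() as i32);
        file.extend_from_slice(&index.column_index);
    }
    // Second pass: offset indexes.
    for (chunk, index) in chunks.iter_mut().zip(indexes) {
        chunk.offset_index_offset = Some(file.len() as i64);
        chunk.offset_index_length = Some(index.offset_index.len() as i32);
        file.extend_from_slice(&index.offset_index);
    }
}
```

The point of the sketch is the ordering: the metadata is patched after the data pages are written but before the footer is flushed, which is why the writer has to keep the collected indexes around until the end.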

liukun4515 · 2022-07-16T07:58:21Z

This issue can be closed.

liukun4515 added the enhancement label Jun 2, 2022

alamb mentioned this issue Jun 2, 2022

Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex) #1705

Closed

2 tasks

liukun4515 mentioned this issue Jun 24, 2022

Add column index writer for parquet #1935

Merged

liukun4515 closed this as completed Jul 16, 2022

alamb added the parquet label and removed the enhancement label Jul 21, 2022

This was referenced Feb 7, 2023

Variable fragment sizes for Parquet writer rapidsai/cudf#12685

Merged

Should Parquet pages begin on the start of a row? #3680

Closed
