Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support write page index #1777

Closed
Tracked by #1705
liukun4515 opened this issue Jun 2, 2022 · 7 comments
Closed
Tracked by #1705

support write page index #1777

liukun4515 opened this issue Jun 2, 2022 · 7 comments
Labels
parquet Changes to the parquet crate

Comments

@liukun4515
Copy link
Contributor

liukun4515 commented Jun 2, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

From the issue #1705, the feature page index is important for arrow-rs reading and writing.
@Ted-Jiang is working on the reading path for the page index #1749.

I will work on the writing path for page index in the parquet-rs.

@alamb
@tustvold

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@liukun4515 liukun4515 added the enhancement Any new improvement worthy of a entry in the changelog label Jun 2, 2022
@tustvold
Copy link
Contributor

tustvold commented Jun 6, 2022

One thing I stumbled across whilst reviewing the read side, is that the ColumnIndex has an implicit assumption that semantic records don't span page boundaries. That is the first repetition level of each page is 0. I think we may need to enforce this when writing file with a column index. https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L920

@nevi-me
Copy link
Contributor

nevi-me commented Jun 12, 2022

One thing I stumbled across whilst reviewing the read side, is that the ColumnIndex has an implicit assumption that semantic records don't span page boundaries. That is the first repetition level of each page is 0. I think we may need to enforce this when writing file with a column index. https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L920

I think it should be considered a bug if part of a record is recorded over 2 page boundaries. It would create ambiguity on the record, and force a reader to potentially read the page in reverse to find the start of the record. Great catch!

@liukun4515
Copy link
Contributor Author

@tustvold
I have read the write path in the https://github.com/apache/parquet-mr and am very familiar with the write path of parquet.
Diff column chunk in the same row group may contain diff number of pages.
A value just can be encoded into only one page.

@liukun4515
Copy link
Contributor Author

After go through current write path in the arrow-rs, I think we need to do some minor refactor to collect and write Page Index to the tail of the parquet file.

@liukun4515
Copy link
Contributor Author

 * For example:
 *
 * <pre>
 * rows   col1   col2   col3
 *      ┌──────┬──────┬──────┐
 *   0  │  p0  │      │      │
 *      ╞══════╡  p0  │  p0  │
 *  20  │ p1(X)│------│------│
 *      ╞══════╪══════╡      │
 *  40  │ p2(X)│      │------│
 *      ╞══════╡ p1(X)╞══════╡
 *  60  │ p3(X)│      │------│
 *      ╞══════╪══════╡      │
 *  80  │  p4  │      │  p1  │
 *      ╞══════╡  p2  │      │
 * 100  │  p5  │      │      │
 *      └──────┴──────┴──────┘

@liukun4515
Copy link
Contributor Author

In the current parquet writer for rust version:
start row group
start a column
write data to the column
end the column and will get the column chunk metadata
end row group

before flush the filemetadata to the disk, we should edit the column chunk metadata and add the index offset and index length metadata.

@liukun4515
Copy link
Contributor Author

this issue can't be closed.

@alamb alamb added parquet Changes to the parquet crate and removed enhancement Any new improvement worthy of a entry in the changelog labels Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

4 participants