
Support multi diskRanges for ChunkReader #1955

Closed
Ted-Jiang opened this issue Jun 28, 2022 · 6 comments
Labels
parquet Changes to the parquet crate

Comments

@Ted-Jiang
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
related to #1775
While implementing page index skipping in #1792, I found:

/// The ChunkReader trait generates readers of chunks of a source.
/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
/// For an object store reader, each read can be mapped to a range request.
pub trait ChunkReader: Length + Send + Sync {
    type T: Read + Send;
    /// Get a serially readable slice of the current reader.
    /// This should fail if the slice exceeds the current bounds
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}

It assumes we read the whole column chunk byte array, but consider a layout like:

     * rows   col1   col2   col3
     *      ┌──────┬──────┬──────┐
     *   0  │  p0  │      │      │
     *      ╞══════╡  p0  │  p0  │
     *  20  │ p1(X)│------│------│
     *      ╞══════╪══════╡      │
     *  40  │ p2   │      │------│
     *      ╞══════╡ p1(X)╞══════╡
     *  60  │ p3(X)│      │------│
     *      ╞══════╪══════╡      │
     *  80  │  p4  │      │  p1  │
     *      ╞══════╡  p2  │      │
     * 100  │  p5  │      │      │
     *      └──────┴──────┴──────┘

To read col1 page 1 and page 3, we need to skip the other pages,
so we should pass two offsets.

Describe the solution you'd like
Pass multiple start offsets and lengths:

 fn get_read(&self, start: Vec<u64>, length: Vec<usize>) -> Result<Self::T>;
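A minimal sketch of how such a multi-range variant could behave, using a simplified stand-in for the trait (the `MultiChunkReader` name and the in-memory backing are assumptions for illustration, not the parquet crate's API; the real trait also requires `Length + Send + Sync` and uses parquet's `Result`):

```rust
use std::io::{Cursor, Read};

// Simplified stand-in for a multi-range ChunkReader.
trait MultiChunkReader {
    type T: Read;
    /// Return a reader over the concatenation of several byte ranges,
    /// so non-adjacent pages can be read while skipping the rest.
    fn get_multi_read(&self, starts: &[u64], lengths: &[usize]) -> std::io::Result<Self::T>;
}

struct InMemoryReader {
    data: Vec<u8>,
}

impl MultiChunkReader for InMemoryReader {
    type T = Cursor<Vec<u8>>;

    fn get_multi_read(&self, starts: &[u64], lengths: &[usize]) -> std::io::Result<Self::T> {
        assert_eq!(starts.len(), lengths.len());
        let mut out = Vec::new();
        for (&start, &len) in starts.iter().zip(lengths) {
            let start = start as usize;
            let end = start + len;
            if end > self.data.len() {
                // Mirrors the doc comment: fail if a slice exceeds the bounds.
                return Err(std::io::Error::new(
                    std::io::ErrorKind::UnexpectedEof,
                    "range exceeds bounds",
                ));
            }
            out.extend_from_slice(&self.data[start..end]);
        }
        Ok(Cursor::new(out))
    }
}

fn main() -> std::io::Result<()> {
    // Bytes 0..10; fetch ranges [2..4) and [7..9), skipping everything between.
    let reader = InMemoryReader { data: (0u8..10).collect() };
    let mut buf = Vec::new();
    reader.get_multi_read(&[2, 7], &[2, 2])?.read_to_end(&mut buf)?;
    println!("{:?}", buf); // prints [2, 3, 7, 8]
    Ok(())
}
```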


@Ted-Jiang Ted-Jiang added the enhancement Any new improvement worthy of a entry in the changelog label Jun 28, 2022

Ted-Jiang commented Jun 28, 2022

@tustvold @alamb PTAL


tustvold commented Jun 28, 2022

Why not just call get_read for each page instead of for the entire column chunk? There is no requirement for get_read to delimit column chunks, after all the same trait is used to read the footer, etc...

Somewhat related, but something to keep in mind is how this will all work with ParquetRecordBatchStream. This does not make use of ChunkReader, and is instead push-based, needing to know the ranges to fetch up-front. It should just be a case of making InMemoryColumnChunk sparse and teaching InMemoryColumnChunkReader to read it correctly, but it is probably worth thinking about how this will work, especially as apache/datafusion#2677 moves to using it
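The suggestion above, calling `get_read` once per selected page rather than once per column chunk, can be sketched against a simplified version of the existing trait (the `SliceReader` backing and the `PageLocation` fields here are illustrative assumptions, not the crate's exact types):

```rust
use std::io::{Cursor, Read};

// Simplified stand-in for parquet's ChunkReader (the real trait also
// requires Length + Send + Sync and returns parquet's Result type).
trait ChunkReaderLike {
    type T: Read;
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T>;
}

struct SliceReader<'a>(&'a [u8]);

impl<'a> ChunkReaderLike for SliceReader<'a> {
    type T = Cursor<Vec<u8>>;
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T> {
        let start = start as usize;
        Ok(Cursor::new(self.0[start..start + length].to_vec()))
    }
}

/// Illustrative page location: offset and size of one page in the file.
struct PageLocation {
    offset: u64,
    compressed_page_size: usize,
}

fn main() -> std::io::Result<()> {
    let file: Vec<u8> = (0u8..100).collect();
    let reader = SliceReader(&file);
    // Only the pages surviving the page-index filter are listed;
    // get_read is called once per page, so skipped pages are never fetched.
    let selected = [
        PageLocation { offset: 10, compressed_page_size: 5 },
        PageLocation { offset: 40, compressed_page_size: 5 },
    ];
    for page in &selected {
        let mut buf = Vec::new();
        reader
            .get_read(page.offset, page.compressed_page_size)?
            .read_to_end(&mut buf)?;
        println!("page at {} -> {:?}", page.offset, buf);
    }
    Ok(())
}
```

This keeps the trait's single-range signature unchanged: the page index supplies the offsets, and the caller drives one `get_read` per page.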

@Ted-Jiang
Member Author

Why not just call get_read for each page instead of for the entire column chunk? There is no requirement for get_read to delimit column chunks, after all the same trait is used to read the footer, etc...

Makes sense.

how this will all work with ParquetRecordBatchStream

😂 For now I only check the page filter in ParquetRecordBatchReader; for AsyncFileReader I haven't checked the code (because I found DataFusion currently uses the iterator mode).

Is there any need to support this in ParquetRecordBatchReader, or do they reuse a lot of logic between them (supporting one is almost like supporting both)?

@tustvold what does the expert think? 😊

@Ted-Jiang
Member Author

It seems that using the IOx ObjectStore will only support the async reader?

Could you show me a code example of IOx integrating with arrow-rs?


tustvold commented Jun 28, 2022

Is there any need to support this in ParquetRecordBatchReader, or do they reuse a lot of logic between them (supporting one is almost like supporting both)?

They reuse a lot of logic, however, the logic that differs concerns the IO for fetching pages. So support for this would need to be explicitly added.

could you show me the code example of IOX integrate with arrow-rs

Currently IOx fetches the entire file to memory and does not perform IO to object storage directly. This was partly driven by the limited support for more sophisticated predicate pushdown, and the fact IO was not a dominating factor for our query workloads.

That being said, apache/datafusion#2677 switches DataFusion to using the async interface directly, and apache/datafusion#2504 has more about how I envisage this fitting with the rayon-based scheduler longer-term. Any feedback would be most welcome 😄

@Ted-Jiang
Member Author

Wow, wonderful work! It changes with each passing day 😂 I will catch up 😊

@alamb alamb added parquet Changes to the parquet crate and removed enhancement Any new improvement worthy of a entry in the changelog labels Jul 7, 2022