
Use Standard Library IO Abstractions in Parquet #1163

Closed
tustvold opened this issue Jan 12, 2022 · 9 comments · Fixed by #4156
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@tustvold
Contributor

tustvold commented Jan 12, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The parquet crate has a number of IO abstractions that at least on the surface appear very similar to those in the Rust standard library.

The major distinction appears to concern mutability, with various components using a mixture of TryClone and RefCell internally to placate the borrow checker. This makes the code fairly hard to reason about, as cloned file descriptors share the same seek position. Additionally it prevents creating readers from &mut File or similar.
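
To illustrate the shared seek position, here is a minimal sketch using plain std behaviour (not parquet-specific; the file name is a placeholder):

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut a = File::open("data.parquet")?;
    let mut b = a.try_clone()?; // duplicates the handle, not the cursor

    a.seek(SeekFrom::Start(100))?;

    // `b` observes the seek performed on `a`: the offset is shared, so
    // interleaved reads through the two handles step on each other
    assert_eq!(b.seek(SeekFrom::Current(0))?, 100);
    Ok(())
}
```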

At a more holistic level, I'm not really sure of the use-case for non-exclusive IO, but I could be missing something here?

A non-exhaustive list of potential candidates for replacement:

  • Position - std::io::Seek::seek(SeekFrom::Current(0)) (sketched after this list)
  • Length - std::io::Seek::seek(SeekFrom::End(0))
  • ChunkReader - std::io::Read
  • SliceableCursor - std::io::Cursor
  • ParquetReader - std::io::Seek + std::io::Read
  • ParquetWriter - std::io::Seek + std::io::Write
  • FileSink - BufWriter<File>
  • FileSource - BufReader<File>
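
For the first two items, a minimal sketch of the mapping onto std::io::Seek (helper names here are illustrative only, not a proposed API):

```rust
use std::io::{Result, Seek, SeekFrom};

/// What the custom `Position` trait provides: the current stream position
fn position<R: Seek>(reader: &mut R) -> Result<u64> {
    reader.seek(SeekFrom::Current(0))
}

/// What the custom `Length` trait provides: the total stream length
fn length<R: Seek>(reader: &mut R) -> Result<u64> {
    let pos = reader.seek(SeekFrom::Current(0))?;
    let len = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(pos))?; // restore the original position
    Ok(len)
}
```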

Describe the solution you'd like

I would ideally like to be able to create a parquet reader with anything implementing std::io::Seek and std::io::Read.

Similarly, I would like to be able to create a parquet writer with anything implementing std::io::Seek and std::io::Write.

These are standard traits within the Rust ecosystem; supporting them will simplify interoperation with other crates and reduce cognitive load on users and contributors.

The blanket implementations of these traits for mutable references will, for free, allow using &mut references instead of owned values.
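
As a rough sketch of what this buys (open_reader here is a hypothetical stand-in, not an existing constructor):

```rust
use std::fs::File;
use std::io::{Cursor, Read, Seek, SeekFrom};

// Hypothetical bound for a reader constructor
fn open_reader<R: Read + Seek>(mut input: R) -> std::io::Result<u64> {
    // e.g. seek to the end as a first step towards locating the footer
    input.seek(SeekFrom::End(0))
}

fn main() -> std::io::Result<()> {
    let mut file = File::open("data.parquet")?;
    open_reader(&mut file)?;                  // mutable reference, via the blanket impls
    open_reader(file)?;                       // owned value
    open_reader(Cursor::new(vec![0u8; 16]))?; // in-memory buffer
    Ok(())
}
```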

Describe alternatives you've considered

Preserve the current behaviour

@tustvold tustvold added the enhancement Any new improvement worthy of an entry in the changelog label Jan 12, 2022
@tustvold tustvold changed the title Parquet Read/Write Traits Use Standard Library IO Abstractions in Parquet Jan 12, 2022
@alamb
Contributor

alamb commented Jan 12, 2022

I think the difference I have observed in the past is that the parquet abstractions also allow Clone -- as I recall, this was to let clients read from different columns concurrently: each column would get its own clone of the I/O object (i.e. its own file descriptor) so they could be read and advanced independently.

There may well be a better way to handle this

@alamb
Contributor

alamb commented Jan 12, 2022

FWIW I think it would be great to avoid all the custom abstractions 👍

@tustvold
Contributor Author

tustvold commented Jan 12, 2022

Aah yes, the FileReader API expects to be able to give out RowGroupReader which in turn give out PageReader. These are all owned constructs and so expect to be able to pass around Arc<FileReader>.

However, looking at the code I'm not sure it can actually be used concurrently. It doesn't appear to have any synchronisation, and yet is using the same underlying file descriptor. I think it might be possible to make the code race if used in such a way... Some testing is needed 🤔

Edit: In fact it is impossible to use the reader in this way, as FileReader and friends return boxed trait objects without Send bounds. This is how FileSource can get away with using a RefCell: the whole chain isn't Send and therefore cannot be sent to another thread.
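
A minimal illustration of that last point, using plain std types rather than the parquet ones:

```rust
use std::io::Read;

fn requires_send<T: Send>(_: T) {}

fn main() {
    // A boxed trait object without an explicit `+ Send` bound is not `Send`,
    // even when the concrete type behind it would be
    let reader: Box<dyn Read> = Box::new(std::io::empty());

    // requires_send(reader); // error[E0277]: `dyn Read` cannot be sent between threads safely
    requires_send(0u8); // fine: `u8` is `Send`
    drop(reader);
}
```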

@tustvold
Contributor Author

I had a brief play around with this and found the following.

The write side is serial, and so it should be possible to use standard library abstractions. The current trait topology will likely require using Rc<RefCell<W>> or similar. I tried using mutable borrows, but this runs into issues: the types need to be boxed in order to be used in traits (due to the lack of GATs), yet by-value trait methods (i.e. fn close(self)) aren't object safe, which hurts the ergonomics of such an API as you end up calling std::mem::drop manually.
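
To make the by-value method problem concrete, a sketch with made-up names:

```rust
use std::io;

// Made-up trait, just to show the shape of the problem
trait ColumnSink {
    fn write_batch(&mut self, data: &[u8]) -> io::Result<()>;

    // Consuming `self` requires a statically sized value, so `close` cannot
    // be invoked through `Box<dyn ColumnSink>`; callers of a boxed writer
    // end up dropping it rather than closing it explicitly
    fn close(self) -> io::Result<()>;
}
```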

The read side is more complicated; the problem can be seen in SerializedRowGroupReader::get_column_page_reader. This wants to return a Box<dyn PageReader> which can be used asynchronously (although not concurrently) with respect to others from the same row group. This is what requires FileSource: we want buffered reads on a shared file descriptor.

It occurs to me that one of the things an async API able to support object stores will need is a sparse file abstraction; ultimately, this is what the read path wants. I'll therefore park this for now and see what shakes out of that.

@alamb
Contributor

alamb commented Jan 13, 2022

Also related #937

@rdettai
Contributor

rdettai commented Mar 13, 2022

Hi @tustvold ! I am the weird mind who introduced the ChunkReader 😄.

The main benefit compared to a regular reader is that you can specify the size of the chunk you plan to read, which enables you to set the right range when you call GET on the object store. Not sure how we could achieve this with plain std::io::Read. But I ended up offloading the download scheduling to a separate module anyway, and I think that this is what you will want to do in most cases to optimize your link with the object store (this is also what is done in IOx, isn't it?).
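
For reference, a paraphrase of the idea (not the exact trait as it exists in the crate):

```rust
use std::io::{self, Read};

// The caller states up front which byte range it intends to read, which maps
// directly onto a ranged GET against an object store, something plain
// `std::io::Read` cannot express on its own
trait ChunkRead {
    type Reader: Read;

    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::Reader>;
}
```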

Another initial goal was indeed to try to achieve parallelism between columns, but I never succeeded because the entire structure of the parquet reader was against it, and I didn't have enough Rust experience to fight it 😉.

@tustvold
Contributor Author

tustvold commented Apr 1, 2022

Hi @rdettai, I realize I never responded directly to your comment. I have been documenting my experiments on the parquet API in #1473, which I think overlaps a fair deal with the points you raise about the object store interface and introducing more parallelism into the parquet reader. I would be very interested in hearing any thoughts you might have.

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit that referenced this issue May 25, 2022
…`ParquetWriter` trait (#1717) (#1163) (#1719)

* Rustify parquet writer (#1717) (#1163)

* Fix parquet_derive

* Fix benches

* Fix parquet_derive tests

* Use raw vec instead of Cursor

* Review feedback

* Fix unnecessary unwrap
@tustvold
Contributor Author

tustvold commented Aug 17, 2022

Coming back to this, we now have OffsetIndex integration which lets us read pages without requiring buffered reading. With #2478 this integrates very nicely with in-memory data.

If it is acceptable to pre-fetch the entire column chunk in the absence of an OffsetIndex, which FWIW is what we do with async IO, then we can likely greatly simplify the implementation, removing FileSource, etc...

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 17, 2022
alamb pushed a commit that referenced this issue Aug 17, 2022
@tustvold
Contributor Author

tustvold commented Dec 7, 2022

If it is acceptable to pre-fetch the entire column chunk in the absence of an OffsetIndex, which FWIW is what we do with async IO, then we can likely greatly simplify the implementation, removing FileSource, etc...

The results in tustvold/access-log-bench#2 would definitely suggest this idea may have merit.
