
Use Standard Library IO Abstractions in Parquet #1163

Closed
tustvold opened this issue Jan 12, 2022 · 9 comments · Fixed by #4156
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@tustvold
Contributor

tustvold commented Jan 12, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The parquet crate has a number of IO abstractions that at least on the surface appear very similar to those in the Rust standard library.

The major distinction appears to concern mutability, with various components using a mixture of TryClone and RefCell internally to placate the borrow checker. This makes the code fairly hard to reason about, as cloned file descriptors share the same seek position. Additionally it prevents creating readers from &mut File or similar.
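
To illustrate the shared seek position, here is a minimal sketch using plain std behaviour (not parquet-specific; the file name is a placeholder):

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut a = File::open("data.parquet")?;
    let mut b = a.try_clone()?; // duplicates the handle, not the cursor

    a.seek(SeekFrom::Start(100))?;

    // `b` observes the seek performed on `a`: the offset is shared, so
    // interleaved reads through the two handles step on each other
    assert_eq!(b.seek(SeekFrom::Current(0))?, 100);
    Ok(())
}
```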

At a more holistic level, I'm not really sure of the use-case for non-exclusive IO, but I could be missing something here?

A non-exhaustive list of potential candidates for replacement:

  • Position - std::io::Seek::seek(SeekFrom::Current(0)) (sketched after this list)
  • Length - std::io::Seek::seek(SeekFrom::End(0))
  • ChunkReader - std::io::Read
  • SliceableCursor - std::io::Cursor
  • ParquetReader - std::io::Seek + std::io::Read
  • ParquetWriter - std::io::Seek + std::io::Write
  • FileSink - BufWriter<File>
  • FileSource - BufReader<File>
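
For the first two items, a minimal sketch of the mapping onto std::io::Seek (helper names here are illustrative only, not a proposed API):

```rust
use std::io::{Result, Seek, SeekFrom};

/// What the custom `Position` trait provides: the current stream position
fn position<R: Seek>(reader: &mut R) -> Result<u64> {
    reader.seek(SeekFrom::Current(0))
}

/// What the custom `Length` trait provides: the total stream length
fn length<R: Seek>(reader: &mut R) -> Result<u64> {
    let pos = reader.seek(SeekFrom::Current(0))?;
    let len = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(pos))?; // restore the original position
    Ok(len)
}
```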

Describe the solution you'd like

I would ideally like to be able to create a parquet reader with anything implementing std::io::Seek and std::io::Read.

Similarly, I would like to be able to create a parquet writer with anything implementing std::io::Seek and std::io::Write.

These are standard traits within the Rust ecosystem; supporting them will simplify interoperation with other crates and reduce cognitive load on users and contributors.

The blanket implementations of these traits for mutable references will, for free, allow using &mut references instead of owned values.
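
As a rough sketch of what this buys (open_reader here is a hypothetical stand-in, not an existing constructor):

```rust
use std::fs::File;
use std::io::{Cursor, Read, Seek, SeekFrom};

// Hypothetical bound for a reader constructor
fn open_reader<R: Read + Seek>(mut input: R) -> std::io::Result<u64> {
    // e.g. seek to the end as a first step towards locating the footer
    input.seek(SeekFrom::End(0))
}

fn main() -> std::io::Result<()> {
    let mut file = File::open("data.parquet")?;
    open_reader(&mut file)?;                  // mutable reference, via the blanket impls
    open_reader(file)?;                       // owned value
    open_reader(Cursor::new(vec![0u8; 16]))?; // in-memory buffer
    Ok(())
}
```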

Describe alternatives you've considered

Preserve the current behaviour

@tustvold tustvold added the enhancement Any new improvement worthy of an entry in the changelog label Jan 12, 2022
@tustvold tustvold changed the title Parquet Read/Write Traits Use Standard Library IO Abstractions in Parquet Jan 12, 2022
@alamb
Contributor

alamb commented Jan 12, 2022

I think the difference I have observed in the past is that the parquet abstractions also allow Clone -- as I recall, this was to let clients read from different columns concurrently: each column would get its own clone of the I/O object (i.e. its own file descriptor) so they could be read and advanced independently.

There may well be a better way to handle this

@alamb
Contributor

alamb commented Jan 12, 2022

FWIW I think it would be great to avoid all the custom abstractions 👍

@tustvold
Contributor Author

tustvold commented Jan 12, 2022

Aah yes, the FileReader API expects to be able to give out RowGroupReader which in turn give out PageReader. These are all owned constructs and so expect to be able to pass around Arc<FileReader>.

However, looking at the code I'm not sure it can actually be used concurrently. It doesn't appear to have any synchronisation, and yet is using the same underlying file descriptor. I think it might be possible to make the code race if used in such a way... Some testing is needed 🤔

Edit: In fact it is impossible to use the reader in this way, as FileReader and friends return boxed trait objects without Send bounds. This is how FileSource can get away with using a RefCell: the whole chain isn't Send and therefore cannot be sent to another thread.
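
A minimal illustration of that last point, using plain std types rather than the parquet ones:

```rust
use std::io::Read;

fn requires_send<T: Send>(_: T) {}

fn main() {
    // A boxed trait object without an explicit `+ Send` bound is not `Send`,
    // even when the concrete type behind it would be
    let reader: Box<dyn Read> = Box::new(std::io::empty());

    // requires_send(reader); // error[E0277]: `dyn Read` cannot be sent between threads safely
    requires_send(0u8); // fine: `u8` is `Send`
    drop(reader);
}
```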

@tustvold
Contributor Author

I had a brief play around with this and found the following.

The write side is serial, and so it should be possible to use standard library abstractions. The current trait topology will likely require using Rc<RefCell<W>> or similar. I tried using mutable borrows, but this runs into issues: the types need to be boxed in order to be used in traits (due to the lack of GATs), yet by-value trait methods (i.e. fn close(self)) aren't object safe, which hurts the ergonomics of such an API as you end up calling std::mem::drop manually.
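
To make the by-value method problem concrete, a sketch with made-up names:

```rust
use std::io;

// Made-up trait, just to show the shape of the problem
trait ColumnSink {
    fn write_batch(&mut self, data: &[u8]) -> io::Result<()>;

    // Consuming `self` requires a statically sized value, so `close` cannot
    // be invoked through `Box<dyn ColumnSink>`; callers of a boxed writer
    // end up dropping it rather than closing it explicitly
    fn close(self) -> io::Result<()>;
}
```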

The read side is more complicated; the problem can be seen in SerializedRowGroupReader::get_column_page_reader. This wants to return a Box<dyn PageReader> which can be used asynchronously (although not concurrently) with respect to others from the same row group. This is what requires FileSource: we want buffered reads on a shared file descriptor.

It occurs to me that one of the things an async API able to support object stores will need is a sparse file abstraction; ultimately, this is what the read path wants. I'll therefore park this for now and see what shakes out of that.

@alamb
Contributor

alamb commented Jan 13, 2022

Also related #937

@rdettai
Contributor

rdettai commented Mar 13, 2022

Hi @tustvold ! I am the weird mind who introduced the ChunkReader 😄.

The main benefit compared to a regular reader is that you can specify the size of the chunk you plan to read, which enables you to set the right range when you call GET on the object store. Not sure how we could achieve this with plain std::io::Read. But I ended up offloading the download scheduling to a separate module anyway, and I think that this is what you will want to do in most cases to optimize your link with the object store (this is also what is done in IOx, isn't it?).
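
For reference, a paraphrase of the idea (not the exact trait as it exists in the crate):

```rust
use std::io::{self, Read};

// The caller states up front which byte range it intends to read, which maps
// directly onto a ranged GET against an object store, something plain
// `std::io::Read` cannot express on its own
trait ChunkRead {
    type Reader: Read;

    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::Reader>;
}
```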

Another initial goal was indeed to try to achieve parallelism between columns, but I never succeeded because the entire structure of the parquet reader was against it, and I didn't have enough Rust experience to fight it 😉.

@tustvold
Contributor Author

tustvold commented Apr 1, 2022

Hi @rdettai, I realize I never responded directly to your comment. I have been documenting my experiments on the parquet API in #1473, which I think overlaps a fair deal with the points you raise about the object store interface and introducing more parallelism into the parquet reader. I would be very interested in hearing any thoughts you might have.

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit that referenced this issue May 25, 2022
…`ParquetWriter` trait (#1717) (#1163) (#1719)

* Rustify parquet writer (#1717) (#1163)

* Fix parquet_derive

* Fix benches

* Fix parquet_derive tests

* Use raw vec instead of Cursor

* Review feedback

* Fix unnecessary unwrap
@tustvold
Contributor Author

tustvold commented Aug 17, 2022

Coming back to this, we now have OffsetIndex integration which lets us read pages without requiring buffered reading. With #2478 this integrates very nicely with in-memory data.

If it is acceptable to pre-fetch the entire column chunk in the absence of an OffsetIndex, which FWIW is what we do with async IO, then we can likely greatly simplify the implementation, removing FileSource, etc...

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 17, 2022
alamb pushed a commit that referenced this issue Aug 17, 2022
@tustvold
Contributor Author

tustvold commented Dec 7, 2022

If it is acceptable to pre-fetch the entire column chunk in the absence of an OffsetIndex, which FWIW is what we do with async IO, then we can likely greatly simplify the implementation, removing FileSource, etc...

The results in tustvold/access-log-bench#2 would definitely suggest this idea may have merit.
