-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources #8300
Conversation
I first moved the footer parsing out of the |
from https://issues.apache.org/jira/browse/ARROW-10135, it seems like your goal is to support reading parquet files from sources other than files. I wonder if you have tried implementing the Then you can create a file reader from that other source like: impl ParquetReader for ThingThatImplementsParquetReader {
...
}
...
let source = ThingThatImplementsParquetReader::new();
let file_reader = SerializedFileReader::new(source);
... |
The discussion with @alamb about the need for an intermediate layer when reading a parquet file is discussed on JIRA The highlights of the current implementation:
@alamb : can you take a look at the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdettai -- I looked at ChunckReader
and skimmed how it was used in the rest of the code.
TLDR is I think it is a nice improvement and in my opinion this PR moves the Rust parquet writer forward.
Some comments:
- I think using a more standard spelling of
Chunk
rather thanChunck
might improve the readability somewhat - I wonder how important separating out the types and
get_read
andget_read_seek
are? Given the underlying implementation ofChunckReader
will have to implement seekable streams anyways, I wonder how much more performance is to be gained for the additional complexity in API (indeed bothFile
andSliceableCursor
appear to use the same type forT
andU
) - I like
SliceableCursor
-- it is a nice way to avoid the copies that occur with the actual Cursor implementation - I think we should begin moving the parquet reader towards using
Arc
rather thanRc
so as to better prepare ourselves for multi-threading
rust/parquet/src/file/reader.rs
Outdated
|
||
pub trait ReadChunck: Read + Length {} | ||
pub trait ReadSeekChunck: ReadChunck + Seek {} | ||
impl<T: Read + Length> ReadChunck for T {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this pattern (of automatically implementing the combined trait for anything that implements Read
+ Length
(I struggled doing that with FileReader when working with the Parquet API earlier in my Rust days)
|
||
/// This is object to use if your file is already in memory. | ||
/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices". | ||
/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally suggest moving to Arc
here rather than Rc
across the board in the Parquet reader. I suggest Arc
to "future proof" the code as Arc
almost always is needed with async
and multi-threading (as Rc
doesn't implement Send
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we implement this separately? That'd be my preference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes maybe move this into a separate PR? given the current one is already big enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a new feature but an adaptation of what we had before. Without SliceableCursor
there is no way to read from RAM the way it was done with Cursor
. Instead of directly implementing the new ChunkReader
trait for Cursor
and having the same problem of copying the data all over the place as we did before, I tweeked the Cursor
a bit to share its data between the Read
slices.
Thanks a lot for your time and comments!
You mean "a more correct spelling" 😉
Right, I was just looking at this. There might be some simplifications that could be implemented here.
That is true, but there is a major problem with the way Give me a bit of time, I will try to fix these. I'll come back to you if I have further doubts. Thanks again !!! |
@alamb I think I addressed most of your concerns. The only one that remains is the necessity to prepare for async, but I have digged a little bit into and I think that tackling this properly will require work that is not really in the scope of this PR. There is more to it than converting Rc to Arc because of this shared handle to the file. Lets do that work in an other PR associated to a sub-tasks of ARROW-9464 or simply associated to ARROW-9674. I will work on it next week, but until then we can validate and merge this one. |
I agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to compile an internal project that uses this branch against an internal project we have (that uses SerializedFileReader
with something other than File
and I hit several compiler errors which I haven't fully worked through.
The first is with TryClone
being private to the crate (comments inline below) though upon further reading of this PR it seems like the right thing for me to have done is to change the struct we are using to implement ChunkReader instead of TryClone, Length, etc.
Another error is the following:
error[E0599]: no method named `metadata` found for struct `parquet::file::serialized_reader::SerializedFileReader<R>` in the current scope
--> delorean_parquet/src/metadata.rs:46:35
|
46 | let parquet_metadata = reader.metadata();
| ^^^^^^^^ private field, not a method
Sadly I ran out of time this morning to keep looking at this. I will try and spend some more time either later today or perhaps tomorrow.
/// get a serialy readeable slice of the current reader | ||
/// This should fail if the slice exceeds the current bounds | ||
fn get_read(&self, start: u64, length: usize) -> Result<Self::T>; | ||
} | ||
|
||
// ---------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried to compile an in-house project with this branch I got the following error:
error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
--> delorean_parquet/src/lib.rs:13:28
|
13 | file::reader::{Length, TryClone},
| ^^^^^^^^ no `TryClone` in `parquet::file::reader`
error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
When I tried to use the copy in util::io
I got:
error[E0603]: module `util` is private
--> delorean_parquet/src/lib.rs:14:5
|
14 | util::io::{TryClone},
| ^^^^ private module
|
When I used the import in seriailized_reader
I got a similar error:
error[E0603]: trait `TryClone` is private
--> delorean_parquet/src/lib.rs:14:30
|
14 | file::serialized_reader::TryClone,
| ^^^^^^^^ private trait
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb is this still an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is (as the ParquetWriter
uses TryClone
) but I may not have tested with the latest changes from @rdettai
Sorry I don't have more time to spend on this -- I am still struggling to get my existing project code to compile against this branch and this PR is pretty massive to digest. One other thing I found this morning is that the I have also been trying to figure out if the So I guess all in all my conclusion is that the ideas behind this PR are good but:
|
I'll also only be able to review this in the next 2 week(end)s, and I'd like to go through it before it's merged. I'm also going to reach out to a few active crates that depend on Parquet, so I can solicit their feedback on these changes. |
I should probably make clear that anyone using the I actually think the changes in this PR will make such custom input sources easier to write in the future, but existing code will have to be rejiggered. |
I'm open to moving traits such as It is a good observation that |
@rdettai --
Yes, actually that was our experience (when we used a buffered implementation, we found that the underlying (large) buffer got copies around a lot. I think in general, the cleaner thing (and my preference) would be to simply port my code to use
The upside is that it would be somewhat more backwards compatible, but unless other reviewers feel strongly I personally think we should just do the breaking change, merge this PR, (maybe even remove |
An argument in favor of having |
6997952
to
a084211
Compare
I'm principally fine with the changes, and the direction of the PR. I'll highly likely be working more on the non-Arrow side of Parquet, so I expect that I might be making a lot of changes on top of this. There seems to be some demand for supporting sources and sinks other than files, and as this helps with that; I'm pro getting it merged. @rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review. |
@nevi-me -- I think this is good enough to merge; Once it is merged, I'll try and port my internal project asap (which has custom readers and writers) and make a PR to do any touchups needed (like re-exporting TryClone if need be) |
From the description the changes proposed here makes sense to me as well - but seems the PR itself is messed up and contains many unrelated changes. |
also removed seek from ChunkReader and fixed typo on "chunck" word
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
a084211
to
5f9ba42
Compare
@nevi-me Depends on whether we want @sunchao I tried to restrain myself on this PR 😄. Honestly, I had to move quite a lot of things around because this touches a core API and things were very "monolithic". There are two points were I'm getting a little bit out of the main concern:
|
Thanks @rdettai . I'll take a look at this PR today. |
/// The ChunkReader trait generates readers of chunks of a source. | ||
/// For a file system reader, each chunk might contain a clone of File bounded on a given range. | ||
/// For an object store reader, each read can be mapped to a range request. | ||
pub trait ChunkReader: Length { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe ChunkRead
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used Reader because it does not implem Read
but generates things that implem Read
. But I admit that I'm not 100% happy with ChunkReader
and it would be nice to have something more explicit as this is a very public part of the API.
/// Length should return the total number of bytes in the input source. | ||
/// It's mainly used to read the metadata, which is at the end of the source. | ||
#[allow(clippy::len_without_is_empty)] | ||
pub trait Length { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe io.rs is more suitable for this trait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not because because it is also used by ChunkReader
and io.rs contains internal implementations. But I admit that I still don't have it very clear how modules should be structured in Rust 😄
|
||
/// This is object to use if your file is already in memory. | ||
/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices". | ||
/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes maybe move this into a separate PR? given the current one is already big enough.
I'm having second thoughts about the |
I personally think the By adding more sophistication (FromStart, FromEnd) you may save an S3 or other object store` metadata request to get the length in theory, but I suspect most applications will already have the length metadata from other sources anyway (e.g. because they had to list the objects for some other reason) so acquiring the length may not be as expensive as it first appears. |
I agree that this PR is hanging, but as this is an API change, I guess its better to think things through before moving forward! 😄 This should maybe have been prepared in a design document rather than in the PR... 😅 Its true that most of the time you will have the length around from an other source. But shouldn't we prefer a trait that contains exactly the minimum amount of information we need to make the
I can have the implem ready by the end of the day! |
Hi! I did some experimentation with this new interface (with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my side. Thanks @rdettai !
FYI I have filed https://issues.apache.org/jira/browse/ARROW-10390 for the issue with |
…arquet writers As of #8300, it no longer appears possible to implement a `ParquetWriter` to provide a custom writer because it is not possible to implement the trait as `TryClone` is not publicly exported. This PR publically exports that trait and adds an end to end test demonstrating it is possible to create a custom writer Here is what happens if you try and use `TryClone` on master ``` error[E0603]: module `util` is private --> delorean_parquet/src/writer.rs:11:5 | 11 | util::io::TryClone, | ^^^^ private module | note: the module `util` is defined here --> /Users/alamb/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/7155cd5/rust/parquet/src/lib.rs:39:1 | 39 | mod util; | ^^^^^^^^^ ``` Closes #8528 from alamb/alamb/ARROW-10390-custom-parquet-writer Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…ileSource ## Rationale 7155cd5 / #8300 Reworked how the parquet reader traits were implemented to be interms of a `ChunkReader` trait (for the better, in my opinion). That commit includes two helper classes, `SliceableCursor` and `FileSource`, which implement `ChunkReader` for a `Cursor` like thing and `File`s, respectively. My project instantiates a `SerializedFileWriter` from the parquet crate with `struct`s that wrap `File` and `Cursor` and thus I would like to re-use the logic in `SliceableCursor` and `FileSource` without having to copy/paste them. ## Changes 1. Publically export `SliceableCursor` and `FileSource` 2. Implement `Seek` for SliceableCursor 3. Implement `Debug` for both `SliceableCursor` and `FileSource` Closes #8534 from alamb/alamb/ARROW-10396-expose-helpers Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…arquet writers As of apache/arrow#8300, it no longer appears possible to implement a `ParquetWriter` to provide a custom writer because it is not possible to implement the trait as `TryClone` is not publicly exported. This PR publically exports that trait and adds an end to end test demonstrating it is possible to create a custom writer Here is what happens if you try and use `TryClone` on master ``` error[E0603]: module `util` is private --> delorean_parquet/src/writer.rs:11:5 | 11 | util::io::TryClone, | ^^^^ private module | note: the module `util` is defined here --> /Users/alamb/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/c148da1/rust/parquet/src/lib.rs:39:1 | 39 | mod util; | ^^^^^^^^^ ``` Closes #8528 from alamb/alamb/ARROW-10390-custom-parquet-writer Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…ileSource ## Rationale apache/arrow@c148da1 / apache/arrow#8300 Reworked how the parquet reader traits were implemented to be interms of a `ChunkReader` trait (for the better, in my opinion). That commit includes two helper classes, `SliceableCursor` and `FileSource`, which implement `ChunkReader` for a `Cursor` like thing and `File`s, respectively. My project instantiates a `SerializedFileWriter` from the parquet crate with `struct`s that wrap `File` and `Cursor` and thus I would like to re-use the logic in `SliceableCursor` and `FileSource` without having to copy/paste them. ## Changes 1. Publically export `SliceableCursor` and `FileSource` 2. Implement `Seek` for SliceableCursor 3. Implement `Debug` for both `SliceableCursor` and `FileSource` Closes #8534 from alamb/alamb/ARROW-10396-expose-helpers Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
…rces https://issues.apache.org/jira/browse/ARROW-10135 Closes apache#8300 from rdettai/ARROW-10135-parquet-file-reader Authored-by: rdettai <rdettai@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…arquet writers As of apache#8300, it no longer appears possible to implement a `ParquetWriter` to provide a custom writer because it is not possible to implement the trait as `TryClone` is not publicly exported. This PR publically exports that trait and adds an end to end test demonstrating it is possible to create a custom writer Here is what happens if you try and use `TryClone` on master ``` error[E0603]: module `util` is private --> delorean_parquet/src/writer.rs:11:5 | 11 | util::io::TryClone, | ^^^^ private module | note: the module `util` is defined here --> /Users/alamb/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/7155cd5/rust/parquet/src/lib.rs:39:1 | 39 | mod util; | ^^^^^^^^^ ``` Closes apache#8528 from alamb/alamb/ARROW-10390-custom-parquet-writer Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…ileSource ## Rationale apache@7155cd5 / apache#8300 Reworked how the parquet reader traits were implemented to be interms of a `ChunkReader` trait (for the better, in my opinion). That commit includes two helper classes, `SliceableCursor` and `FileSource`, which implement `ChunkReader` for a `Cursor` like thing and `File`s, respectively. My project instantiates a `SerializedFileWriter` from the parquet crate with `struct`s that wrap `File` and `Cursor` and thus I would like to re-use the logic in `SliceableCursor` and `FileSource` without having to copy/paste them. ## Changes 1. Publically export `SliceableCursor` and `FileSource` 2. Implement `Seek` for SliceableCursor 3. Implement `Debug` for both `SliceableCursor` and `FileSource` Closes apache#8534 from alamb/alamb/ARROW-10396-expose-helpers Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
https://issues.apache.org/jira/browse/ARROW-10135