feat: Implement an AsyncReader for avro using ObjectStore #8930
jecsand838
left a comment
Flushing a partial review with some high-level thoughts.
I'll wait for you to finish before resuming.
[features]
- default = ["deflate", "snappy", "zstd", "bzip2", "xz"]
+ default = ["deflate", "snappy", "zstd", "bzip2", "xz", "object_store"]
Not sure about having object_store as a default imo. Seems a bit heavy to me.
Oops, that was a development thing, not my intention^^
/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is 0, file_size is 0, or `range.end` is less than the header length, finish immediately.
pub struct AsyncAvroReader {
    store: Arc<dyn object_store::ObjectStore>,
I think the biggest high-level concern I have is the object_store hardwiring. My gut tells me we'd be better off with a generic AsyncFileReader<T: AsyncRead + AsyncSeek> or similar trait as the primary abstraction, with object_store as one feature flagged adapter imo.
Yeah, I was wondering how to implement this.
Perhaps just let the user (e.g. DataFusion) provide the impl and be completely agnostic.
Very fair. I think there's incredible value for an AsyncFileReader in arrow-avro, especially if implemented in a generic manner which is highly re-usable across downstream projects. Also object_store makes sense as a first class adapter imo.
My original intention was providing the building blocks for projects such as DataFusion to use for more concrete domain specific implementations.
I'd recommend looking into the parquet crate for inspiration. It uses an abstraction and provides a ParquetObjectReader.
Honestly, I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.
100%, I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code I mean to fully refactor (or would welcome a full refactor of). I knew it would eventually come back.
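To make the abstraction discussed in this thread concrete, here is a minimal sketch of a storage-agnostic reader trait with one adapter. Everything in it is an assumption for illustration: the trait name `AsyncFileReader`, its `get_bytes` and `file_size` methods, the `InMemoryReader` adapter, and the `read_prefix` and `block_on` helpers are hypothetical and are not taken from this PR or from the parquet crate's actual API. An `object_store`-backed adapter would implement the same trait behind a feature flag.

```rust
use std::future::Future;
use std::io;
use std::ops::Range;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future alias so the trait needs no external crates.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical storage-agnostic reader; the Avro reader would depend only on this.
pub trait AsyncFileReader: Send {
    /// Fetch the bytes in `range` (absolute file offsets).
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>>;
    /// Total size of the file in bytes.
    fn file_size(&mut self) -> BoxFuture<'_, io::Result<u64>>;
}

/// In-memory adapter, used here only to show the shape of an implementation.
pub struct InMemoryReader {
    data: Arc<Vec<u8>>,
}

impl InMemoryReader {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data: Arc::new(data) }
    }
}

impl AsyncFileReader for InMemoryReader {
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>> {
        let data = Arc::clone(&self.data);
        Box::pin(async move {
            let (start, end) = (range.start as usize, range.end as usize);
            if start > end || end > data.len() {
                return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"));
            }
            Ok(data[start..end].to_vec())
        })
    }

    fn file_size(&mut self) -> BoxFuture<'_, io::Result<u64>> {
        let len = self.data.len() as u64;
        Box::pin(async move { Ok(len) })
    }
}

/// Generic consumer: reads at most `n` leading bytes through any adapter.
pub async fn read_prefix<R: AsyncFileReader>(reader: &mut R, n: u64) -> io::Result<Vec<u8>> {
    let size = reader.file_size().await?;
    reader.get_bytes(0..n.min(size)).await
}

/// Waker that does nothing; enough to drive futures that are ready immediately.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polling executor for this example only; real code would use a runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

With this shape, a caller such as DataFusion could pass any adapter to code written against `R: AsyncFileReader`, for example `block_on(read_prefix(&mut reader, 4))` to fetch the four magic bytes at the start of an Avro file.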
pub async fn try_new(
    store: Arc<dyn object_store::ObjectStore>,
    location: Path,
    range: Option<Range<u64>>,
    file_size: u64,
    reader_schema: Option<AvroSchema>,
    batch_size: usize,
) -> Result<Self, ArrowError> {
    let file_size = if file_size == 0 {
        store
            .head(&location)
            .await
            .map_err(|err| {
                ArrowError::AvroError(format!("HEAD request failed for file, {err}"))
            })?
            .size
    } else {
        file_size
    };
Also, I'd probably consider either using a builder pattern or defining an AsyncAvroReaderOptions struct for these params.
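As a rough illustration of that suggestion, the sketch below groups the optional `try_new` parameters into an options struct with builder-style setters. The field set mirrors the constructor in the diff, but the method names, the default batch size of 1024, the use of `Option<u64>` instead of a `0` sentinel for the file size, and the `String` stand-in for `AvroSchema` are all assumptions made for this example, not part of the PR.

```rust
use std::ops::Range;

/// Hypothetical options struct for the optional reader parameters.
#[derive(Debug, Clone, PartialEq)]
pub struct AsyncAvroReaderOptions {
    /// Byte range of the file to read; `None` means the whole file.
    pub range: Option<Range<u64>>,
    /// Known file size; `None` means "look it up" (instead of passing 0).
    pub file_size: Option<u64>,
    /// Reader schema as JSON text (a stand-in for the crate's `AvroSchema`).
    pub reader_schema_json: Option<String>,
    /// Maximum number of rows per decoded batch.
    pub batch_size: usize,
}

impl Default for AsyncAvroReaderOptions {
    fn default() -> Self {
        Self {
            range: None,
            file_size: None,
            reader_schema_json: None,
            batch_size: 1024, // assumed default, chosen only for this sketch
        }
    }
}

impl AsyncAvroReaderOptions {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_range(mut self, range: Range<u64>) -> Self {
        self.range = Some(range);
        self
    }

    pub fn with_file_size(mut self, file_size: u64) -> Self {
        self.file_size = Some(file_size);
        self
    }

    pub fn with_reader_schema_json(mut self, schema: impl Into<String>) -> Self {
        self.reader_schema_json = Some(schema.into());
        self
    }

    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }

    /// Cheap sanity checks that can run before any I/O is issued.
    pub fn validate(&self) -> Result<(), String> {
        if self.batch_size == 0 {
            return Err("batch_size must be greater than zero".to_string());
        }
        if let Some(range) = &self.range {
            if range.start > range.end {
                return Err("range start is after range end".to_string());
            }
            if let Some(size) = self.file_size {
                if range.end > size {
                    return Err("range extends past the end of the file".to_string());
                }
            }
        }
        Ok(())
    }
}
```

A constructor could then take only the required arguments plus one `AsyncAvroReaderOptions` value, so adding a new option later would not change the public signature.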
Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, which allows for file splitting: a read can start mid-block (it skips ahead until a sync marker is found) and continues reading past the end of the range until the end of the current block is found.
This reader currently requires that a reader_schema be provided if type promotion, schema evolution, or projection is desired.
This is because #8928 currently blocks proper parsing from an ArrowSchema.
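The mid-block start described above relies on the 16-byte sync marker that the Avro object container format writes after every block. The following is a minimal sketch of that scanning step, not the code in this PR: the names `next_block_start` and `SyncScanner` are hypothetical, and the scanner assumes the marker has already been read from the file header.

```rust
/// Length of the sync marker in an Avro object container file.
pub const SYNC_LEN: usize = 16;

/// Returns the offset just past the first occurrence of `sync` in `buf`,
/// i.e. where the next block begins, or `None` if the marker is absent.
pub fn next_block_start(buf: &[u8], sync: &[u8; SYNC_LEN]) -> Option<usize> {
    buf.windows(SYNC_LEN)
        .position(|window| window == &sync[..])
        .map(|pos| pos + SYNC_LEN)
}

/// Scans a stream of chunks for the marker, including a marker that is
/// split across two chunks. Offsets are relative to the first byte fed.
pub struct SyncScanner {
    sync: [u8; SYNC_LEN],
    /// Up to `SYNC_LEN - 1` trailing bytes kept from earlier chunks.
    carry: Vec<u8>,
    /// Total number of bytes fed so far.
    consumed: u64,
}

impl SyncScanner {
    pub fn new(sync: [u8; SYNC_LEN]) -> Self {
        Self { sync, carry: Vec::new(), consumed: 0 }
    }

    /// Feeds the next chunk. Returns the offset of the next block start once
    /// the marker has been seen; the scanner should not be reused after a hit.
    pub fn feed(&mut self, chunk: &[u8]) -> Option<u64> {
        // Offset of the first carried byte within the overall stream.
        let base = self.consumed - self.carry.len() as u64;
        let mut joined = std::mem::take(&mut self.carry);
        joined.extend_from_slice(chunk);
        self.consumed += chunk.len() as u64;

        if let Some(end) = next_block_start(&joined, &self.sync) {
            return Some(base + end as u64);
        }

        // Keep just enough of the tail to detect a marker spanning chunks.
        let keep = joined.len().min(SYNC_LEN - 1);
        self.carry = joined.split_off(joined.len() - keep);
        None
    }
}
```

A range-based reader would feed fetched chunks to the scanner from the start of its range, begin decoding at the returned offset, and keep reading past the end of its range until the block it is in has been fully consumed.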
Are these changes tested?
Yes
Are there any user-facing changes?
Only the addition.
Other changes are internal to the crate (namely the way Decoder is created from parts).