@EmilyMatt

Which issue does this PR close?

Rationale for this change

Allows for proper file splitting within an asynchronous context.

What changes are included in this PR?

The raw implementation, which allows for file splitting: starting mid-block (reading until a sync marker is found) and reading further until the end of the block is found.
This reader currently requires that a reader_schema be provided if type promotion, schema evolution, or projection are desired.
This is because #8928 currently blocks proper parsing from an ArrowSchema.
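
For illustration, the mid-block start can be sketched roughly as below. This is a minimal sketch and not the code in this PR: `first_block_start` and `SYNC_LEN` are hypothetical names, and it assumes the 16-byte sync marker has already been read from the file header.

```rust
/// Length of an Avro object container file sync marker, in bytes.
const SYNC_LEN: usize = 16;

/// Hypothetical helper (not part of this PR): returns the offset, relative to
/// `buf`, of the first byte after the first occurrence of `sync` in `buf`.
/// That offset is where the next complete block begins. Returns `None` when
/// the marker does not appear in `buf`.
fn first_block_start(buf: &[u8], sync: &[u8; SYNC_LEN]) -> Option<usize> {
    buf.windows(SYNC_LEN)
        // Compare each 16-byte window against the marker from the header.
        .position(|window| window == &sync[..])
        // Skip past the marker itself so decoding starts at a block boundary.
        .map(|pos| pos + SYNC_LEN)
}
```

A marker can straddle two fetched chunks, so a real reader would likely need to carry the last 15 bytes of one chunk over into the next search.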

Are these changes tested?

Yes

Are there any user-facing changes?

Only the addition.
Other changes are internal to the crate (namely the way Decoder is created from parts)

Contributor

@jecsand838 jecsand838 left a comment

Flushing a partial review with some high level thoughts.

I'll wait for you to finish before resuming.


 [features]
-default = ["deflate", "snappy", "zstd", "bzip2", "xz"]
+default = ["deflate", "snappy", "zstd", "bzip2", "xz", "object_store"]
Contributor

Not sure about having object_store as a default imo. Seems a bit heavy to me.

Author

Oops, that was a development thing, not my intention^^

/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is 0, file_size is 0, or `range.end` is less than the header length, finish immediately.
pub struct AsyncAvroReader {
    store: Arc<dyn object_store::ObjectStore>,
Contributor

I think the biggest high-level concern I have is the object_store hardwiring. My gut tells me we'd be better off with a generic AsyncFileReader<T: AsyncRead + AsyncSeek> or similar trait as the primary abstraction, with object_store as one feature flagged adapter imo.

Author

@EmilyMatt EmilyMatt Nov 26, 2025

Yeah, I was wondering how to implement this.
Perhaps just let the user (e.g. DataFusion) provide the impl and be completely agnostic.

Contributor

@jecsand838 jecsand838 Nov 26, 2025

Very fair. I think there's incredible value for an AsyncFileReader in arrow-avro, especially if implemented in a generic manner which is highly re-usable across downstream projects. Also object_store makes sense as a first class adapter imo.

My original intention was providing the building blocks for projects such as DataFusion to use for more concrete domain specific implementations.

I'd recommend looking into the parquet crate for inspiration. It uses an abstraction and provides a ParquetObjectReader
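
A rough sketch of the shape such an abstraction could take, loosely in the spirit of the parquet crate's approach. Everything here is an assumption for illustration: `AsyncRangeReader`, `InMemoryReader`, `read_magic`, and `block_on` are hypothetical names, not existing arrow-avro or parquet APIs, and only the standard library is used so the snippet stands alone.

```rust
use std::future::Future;
use std::io;
use std::ops::Range;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future alias so the trait needs no extra crates.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical storage-agnostic abstraction: fetch a byte range asynchronously.
trait AsyncRangeReader: Send {
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>>;
}

/// Toy in-memory adapter. An object_store-backed adapter would implement the
/// same trait behind a feature flag.
struct InMemoryReader {
    data: Vec<u8>,
}

impl AsyncRangeReader for InMemoryReader {
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>> {
        Box::pin(async move {
            let (start, end) = (range.start as usize, range.end as usize);
            self.data
                .get(start..end)
                .map(|slice| slice.to_vec())
                .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
        })
    }
}

/// Reader logic written once against the trait, independent of the backing store.
/// Checks for the Avro container magic bytes: "Obj" followed by the byte 1.
async fn read_magic<R: AsyncRangeReader>(reader: &mut R) -> io::Result<bool> {
    let head = reader.get_bytes(0..4).await?;
    Ok(head == b"Obj\x01")
}

/// Waker that does nothing; sufficient for the busy-polling executor below.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for this example only; real code would use tokio or similar.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

With this split, something like `block_on(read_magic(&mut reader))` works for any adapter, and downstream projects can plug in their own I/O without pulling in object_store.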

@EmilyMatt
Author

> Flushing a partial review with some high level thoughts.
>
> I'll wait for you to finish before resuming.

Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.

@jecsand838
Contributor

jecsand838 commented Nov 26, 2025

> > Flushing a partial review with some high level thoughts.
> > I'll wait for you to finish before resuming.
>
> Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.

100% I'm working on that right now and won't stop until I have a PR. That was a solid catch.

The schema logic is an area of the code I've been meaning to fully refactor (or would welcome a full refactor of). I knew it would eventually come back.

Comment on lines +101 to +119
pub async fn try_new(
    store: Arc<dyn object_store::ObjectStore>,
    location: Path,
    range: Option<Range<u64>>,
    file_size: u64,
    reader_schema: Option<AvroSchema>,
    batch_size: usize,
) -> Result<Self, ArrowError> {
    let file_size = if file_size == 0 {
        store
            .head(&location)
            .await
            .map_err(|err| {
                ArrowError::AvroError(format!("HEAD request failed for file, {err}"))
            })?
            .size
    } else {
        file_size
    };
Contributor

Also I'd probably consider using either a builder pattern or defining an AsyncAvroReaderOptions struct for these params.
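
Something along these lines, as a minimal sketch only. The field set mirrors the `try_new` parameters above, but the `with_*` method names, the `String` stand-in for `AvroSchema`, and the default batch size of 1024 are all assumptions made for illustration.

```rust
use std::ops::Range;

/// Hypothetical options struct grouping the optional `try_new` parameters.
#[derive(Debug, Clone, PartialEq)]
struct AsyncAvroReaderOptions {
    /// Byte range to read; `None` means the whole file.
    range: Option<Range<u64>>,
    /// Known file size; `None` means it must be looked up (e.g. via a HEAD request).
    file_size: Option<u64>,
    /// Stand-in for `AvroSchema`, kept as a `String` so the sketch is self-contained.
    reader_schema: Option<String>,
    /// Maximum number of rows per batch.
    batch_size: usize,
}

impl Default for AsyncAvroReaderOptions {
    fn default() -> Self {
        Self {
            range: None,
            file_size: None,
            reader_schema: None,
            // Assumed default for this sketch, not a value taken from the PR.
            batch_size: 1024,
        }
    }
}

impl AsyncAvroReaderOptions {
    fn with_range(mut self, range: Range<u64>) -> Self {
        self.range = Some(range);
        self
    }

    fn with_file_size(mut self, file_size: u64) -> Self {
        self.file_size = Some(file_size);
        self
    }

    fn with_reader_schema(mut self, schema: impl Into<String>) -> Self {
        self.reader_schema = Some(schema.into());
        self
    }

    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }
}
```

Callers would then write `AsyncAvroReaderOptions::default().with_range(0..1024).with_batch_size(512)`, and modelling `file_size` as an `Option` avoids using `0` as a sentinel for "unknown".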

Labels

arrow: Changes to the arrow crate
arrow-avro: arrow-avro crate
parquet: Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Implement an async AvroReader

2 participants