feat(rust/sedona-datasource): Implement generic RecordBatchReader-based format #251
Conversation
| /// Simple file format specification | ||
| /// | ||
| /// In DataFusion, various parts of the file format are split among the | ||
| /// FileFormatFactory, the FileFormat, the FileSource, the FileOpener, | ||
| /// and a few other traits. This trait is designed to provide a few | ||
| /// important features of a natively implemented FileFormat but consolidating | ||
| /// the components of implementing the format in the same place. This is | ||
| /// intended to provide a less verbose way to implement readers for a wide | ||
| /// variety of spatial formats. | ||
| #[async_trait] | ||
| pub trait ExternalFormatSpec: Debug + Send + Sync { | ||
| /// Infer a schema for a given file | ||
| /// | ||
| /// Given a single file, infer what schema [ExternalFormatSpec::open_reader] | ||
| /// would produce in the absence of any other guidance. | ||
| async fn infer_schema(&self, location: &Object) -> Result<Schema>; | ||
| ||
| /// Open a [RecordBatchReader] for a given file | ||
| /// | ||
| /// The implementation must handle the `file_projection`; however, | ||
| /// need not handle the `filters` (but may use them for pruning). | ||
| async fn open_reader(&self, args: &OpenReaderArgs) | ||
| -> Result<Box<dyn RecordBatchReader + Send>>; |
This is the user-facing API that would be used to implement readers
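To make the shape of that contract concrete, here is a minimal std-only sketch of the two-method idea (infer a schema, then open a reader that honours the projection). Everything in it is a hypothetical stand-in: `MiniFormatSpec`, `OpenArgs`, `CsvLikeSpec`, and `Row` are invented for illustration, the methods are synchronous, and plain strings replace Arrow schemas, record batches, and object store locations. It is not the trait from this PR.

```rust
/// Hypothetical stand-in for a record batch: one row of string values.
pub type Row = Vec<String>;

/// Hypothetical stand-in for the PR's `OpenReaderArgs`.
pub struct OpenArgs {
    /// File contents; the real API would point at an object store location.
    pub contents: String,
    /// Column indices to return, in order. `None` means all columns.
    pub file_projection: Option<Vec<usize>>,
    /// Filter expressions; a reader may ignore these or use them for pruning.
    pub filters: Vec<String>,
}

/// Simplified, synchronous analogue of a two-method format specification.
pub trait MiniFormatSpec: Send + Sync {
    /// Infer the column names the reader would produce for this file.
    fn infer_schema(&self, contents: &str) -> Result<Vec<String>, String>;

    /// Open a reader. The projection must be applied; filters are optional.
    fn open_reader(
        &self,
        args: &OpenArgs,
    ) -> Result<Box<dyn Iterator<Item = Row> + Send>, String>;
}

/// Toy comma-separated format with a header row (no quoting support).
pub struct CsvLikeSpec;

impl MiniFormatSpec for CsvLikeSpec {
    fn infer_schema(&self, contents: &str) -> Result<Vec<String>, String> {
        let header = contents
            .lines()
            .next()
            .ok_or_else(|| "empty file".to_string())?;
        Ok(header.split(',').map(|s| s.trim().to_string()).collect())
    }

    fn open_reader(
        &self,
        args: &OpenArgs,
    ) -> Result<Box<dyn Iterator<Item = Row> + Send>, String> {
        let schema = self.infer_schema(&args.contents)?;

        // Resolve the projection, defaulting to every column.
        let projection: Vec<usize> = match &args.file_projection {
            Some(p) => p.clone(),
            None => (0..schema.len()).collect(),
        };
        if let Some(bad) = projection.iter().find(|&&i| i >= schema.len()) {
            return Err(format!("projection index {} out of bounds", bad));
        }

        // `args.filters` is deliberately ignored: the caller is expected to
        // re-apply filters, so skipping them here is still correct.
        let rows: Vec<Row> = args
            .contents
            .lines()
            .skip(1)
            .filter(|line| !line.trim().is_empty())
            .map(|line| {
                let fields: Vec<&str> = line.split(',').map(|s| s.trim()).collect();
                projection
                    .iter()
                    .map(|&i| fields.get(i).copied().unwrap_or("").to_string())
                    .collect()
            })
            .collect();

        // Rows are materialized so the returned iterator owns its data.
        Ok(Box::new(rows.into_iter()))
    }
}
```

Calling `open_reader` with `file_projection: Some(vec![2, 0])` on a file whose header is `id,name,wkt` yields rows containing only the `wkt` and `id` values, in that order.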
| /// Create a [ListingTable] from an [ExternalFormatSpec] and one or more URLs | ||
| /// | ||
| /// This can be used to resolve a format specification into a TableProvider that | ||
| /// may be registered with a [SessionContext]. | ||
| pub async fn external_listing_table( | ||
| spec: Arc<dyn ExternalFormatSpec>, | ||
| context: &SessionContext, | ||
| table_paths: Vec<ListingTableUrl>, | ||
| check_extension: bool, | ||
| ) -> Result<ListingTable> { |
This is the wrapper we'd use to implement read_xxxx() in Python
| /// Create a [FileFormatFactory] from a [ExternalFormatSpec] | ||
| /// | ||
| /// The FileFormatFactory is the object that may be registered with a | ||
| /// SessionStateBuilder to allow SQL queries to access this format. | ||
| #[derive(Debug)] | ||
| pub struct ExternalFormatFactory { | ||
| spec: Arc<dyn ExternalFormatSpec>, | ||
| } |
This is what we'd use to register a format with the session so that things like SELECT * FROM 'gpkg/*.gpkg' work in SQL.
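As a rough illustration of why registration lets a path like 'gpkg/*.gpkg' resolve to a format, the sketch below maps file extensions to format objects. It is a toy under stated assumptions: `NamedFormat`, `FormatRegistry`, and `GeoPackageFormat` are hypothetical names, and in DataFusion itself the lookup goes through the session state rather than a structure like this one.

```rust
use std::collections::HashMap;
use std::path::Path;
use std::sync::Arc;

/// Hypothetical stand-in for a registered file format.
pub trait NamedFormat: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical format used only for this example.
pub struct GeoPackageFormat;

impl NamedFormat for GeoPackageFormat {
    fn name(&self) -> &str {
        "geopackage"
    }
}

/// Toy registry keyed by lower-case file extension (without the dot).
#[derive(Default)]
pub struct FormatRegistry {
    by_extension: HashMap<String, Arc<dyn NamedFormat>>,
}

impl FormatRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a format for an extension such as "gpkg" or ".gpkg".
    pub fn register(&mut self, extension: &str, format: Arc<dyn NamedFormat>) {
        let key = extension.trim_start_matches('.').to_ascii_lowercase();
        self.by_extension.insert(key, format);
    }

    /// Resolve a path or glob to a format using its final extension.
    pub fn resolve(&self, path: &str) -> Option<Arc<dyn NamedFormat>> {
        let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
        self.by_extension.get(&ext).cloned()
    }
}
```

With a format registered under "gpkg", `resolve("gpkg/*.gpkg")` returns that format, while a path with an unregistered extension returns `None`.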
rust/sedona-datasource/Cargo.toml
Outdated
| chrono = { workspace = true } | ||
| datafusion = { workspace = true, features = ["parquet"] } | ||
| datafusion-catalog = { workspace = true } | ||
| datafusion-common = { workspace = true } | ||
| datafusion-execution = { workspace = true } | ||
| datafusion-expr = { workspace = true } | ||
| datafusion-physical-expr = { workspace = true } | ||
| datafusion-physical-plan = { workspace = true } | ||
| float_next_after = { workspace = true } | ||
| geo-traits = { workspace = true } | ||
| futures = { workspace = true } | ||
| object_store = { workspace = true } | ||
| parquet = { workspace = true } | ||
| sedona-common = { path = "../sedona-common" } | ||
| sedona-expr = { path = "../sedona-expr" } | ||
| sedona-functions = { path = "../sedona-functions" } | ||
| sedona-geometry = { path = "../sedona-geometry" } | ||
| sedona-schema = { path = "../sedona-schema" } | ||
| serde = { workspace = true } | ||
| serde_json = { workspace = true } | ||
| serde_with = { workspace = true } |
Are there unneeded dependencies? I believe that chrono, float_next_after, and parquet are not needed here.
rust/sedona-datasource/src/format.rs
Outdated
| _conf: FileSinkConfig, | ||
| _order_requirements: Option<LexRequirement>, | ||
| ) -> Result<Arc<dyn ExecutionPlan>> { | ||
| not_impl_err!("writing not yet supported for SimpleSedonaFormat") |
Should this be "ExternalFileFormat"?
rust/sedona-datasource/src/format.rs
Outdated
| fn open(&self, file_meta: FileMeta, _file: PartitionedFile) -> Result<FileOpenFuture> { | ||
| if file_meta.range.is_some() { | ||
| return sedona_internal_err!( | ||
| "Expected SimpleOpener to open a single partition per file" |
I believe that this should be ExternalFileOpener.
Kontinuation
left a comment
The API looks good to me; I only have a few comments about some names in error messages.
This PR implements a simple "file format" aimed at making it easier to implement GDAL/OGR reading, although the boilerplate is applicable to reading various formats and/or implementing them in a higher-level language like Python.
This is basically a watered-down version of the DataFusion FileFormat that's a bit easier to implement (at the expense of not supporting some features of the file format). The basic idea is that if you can answer these two questions:
...you can implement a file format. This is significantly easier than the existing situation, which involves chasing listing tables, file formats, file format factories, and file openers. The OpenReaderArgs passes along a filter and projection so that a format can push down a filter if it would like.
This might need some tweaks when we actually implement a file format with it, but I think this is ready for review!