Conversation

@paleolimbot (Member) commented on Oct 28, 2025

This PR implements a simple "file format" aimed at making it easier to implement GDAL/OGR reading, although the boilerplate is applicable to reading various other formats and/or implementing them in a higher-level language like Python.

This is basically a watered-down version of the DataFusion FileFormat that's a bit easier to implement (at the expense of not supporting some of its features). The basic idea is that if you can answer these two questions:

#[async_trait]
pub trait RecordBatchReaderFormatSpec: Debug + Send + Sync {
    async fn infer_schema(&self, location: &Object) -> Result<Schema>;
    async fn open_reader(&self, args: &OpenReaderArgs)
        -> Result<Box<dyn RecordBatchReader + Send>>;
    // ... more options for advanced features
}

...you can implement a file format. This is significantly easier than the existing situation which involves chasing listing tables, file formats, file format factories, and file openers. The OpenReaderArgs passes along a filter and projection so that a format can push down a filter if it would like.
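
As a rough sketch of what answering those two questions might look like, here is a hypothetical implementation (everything below is illustrative: TrivialFormat, the OpenReaderArgs field names, and the import paths are assumptions layered on top of the trait above, which the merged code names ExternalFormatSpec):

// Hypothetical sketch only; `Object`, `OpenReaderArgs`, and the trait itself
// come from the crate this PR adds, so their import paths are not shown here.
use std::sync::Arc;

use arrow_array::{RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow_schema::{ArrowError, DataType, Field, Schema};
use async_trait::async_trait;
use datafusion_common::Result;

#[derive(Debug)]
struct TrivialFormat;

#[async_trait]
impl RecordBatchReaderFormatSpec for TrivialFormat {
    async fn infer_schema(&self, _location: &Object) -> Result<Schema> {
        // A real format would peek at the file; this sketch returns a fixed schema.
        Ok(Schema::new(vec![Field::new(
            "wkb_geometry",
            DataType::Binary,
            true,
        )]))
    }

    async fn open_reader(
        &self,
        args: &OpenReaderArgs,
    ) -> Result<Box<dyn RecordBatchReader + Send>> {
        // `args` is assumed to carry the file location plus the optional
        // `file_projection`/`filters`; a real reader applies the projection
        // and may use the filters for pruning.
        let schema = Arc::new(self.infer_schema(&args.location).await?);
        let batches: Vec<std::result::Result<RecordBatch, ArrowError>> = vec![];
        Ok(Box::new(RecordBatchIterator::new(batches, schema)))
    }
}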

This might need some tweaks when we actually implement a file format with it, but I think this is ready for review!

@paleolimbot mentioned this pull request on Oct 31, 2025
Comment on lines +30 to +52
/// Simple file format specification
///
/// In DataFusion, various parts of the file format are split among the
/// FileFormatFactory, the FileFormat, the FileSource, the FileOpener,
/// and a few other traits. This trait is designed to provide a few
/// important features of a natively implemented FileFormat while
/// consolidating the components of implementing the format in the same
/// place. This is intended to provide a less verbose way to implement
/// readers for a wide variety of spatial formats.
#[async_trait]
pub trait ExternalFormatSpec: Debug + Send + Sync {
    /// Infer a schema for a given file
    ///
    /// Given a single file, infer what schema [ExternalFormatSpec::open_reader]
    /// would produce in the absence of any other guidance.
    async fn infer_schema(&self, location: &Object) -> Result<Schema>;

    /// Open a [RecordBatchReader] for a given file
    ///
    /// The implementation must handle the `file_projection`; however, it
    /// need not handle the `filters` (but may use them for pruning).
    async fn open_reader(&self, args: &OpenReaderArgs)
        -> Result<Box<dyn RecordBatchReader + Send>>;
@paleolimbot (Member, Author) commented:

This is the user-facing API that would be used to implement readers

Comment on lines +32 to +41
/// Create a [ListingTable] from an [ExternalFormatSpec] and one or more URLs
///
/// This can be used to resolve a format specification into a TableProvider that
/// may be registered with a [SessionContext].
pub async fn external_listing_table(
    spec: Arc<dyn ExternalFormatSpec>,
    context: &SessionContext,
    table_paths: Vec<ListingTableUrl>,
    check_extension: bool,
) -> Result<ListingTable> {
@paleolimbot (Member, Author) commented:

This is the wrapper we'd use to implement read_xxxx() in Python
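
In other words, a read_xxxx() wrapper would presumably boil down to something like the following (a hypothetical sketch: MyFormat is a made-up ExternalFormatSpec implementation, and the helper name and signature are taken from the snippet above):

use std::sync::Arc;

use datafusion::datasource::listing::ListingTableUrl;
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};

// Hypothetical sketch; `MyFormat` is a made-up ExternalFormatSpec implementation,
// and the imports for it and for `external_listing_table` are omitted.
async fn read_my_format(ctx: &SessionContext, path: &str) -> Result<DataFrame> {
    let spec: Arc<dyn ExternalFormatSpec> = Arc::new(MyFormat::default());
    let urls = vec![ListingTableUrl::parse(path)?];
    // Resolve the spec into a ListingTable, then hand it to the context as a DataFrame
    let table = external_listing_table(spec, ctx, urls, true).await?;
    ctx.read_table(Arc::new(table))
}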

Comment on lines +47 to +54
/// Create a [FileFormatFactory] from an [ExternalFormatSpec]
///
/// The FileFormatFactory is the object that may be registered with a
/// SessionStateBuilder to allow SQL queries to access this format.
#[derive(Debug)]
pub struct ExternalFormatFactory {
    spec: Arc<dyn ExternalFormatSpec>,
}
@paleolimbot (Member, Author) commented:

This is what we'd use to register a format with the session so that things like SELECT * FROM 'gpkg/*.gpkg' work in SQL.
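
Assuming the factory can be built from an Arc<dyn ExternalFormatSpec>, registration would presumably follow DataFusion's usual FileFormatFactory path, roughly like this (the ExternalFormatFactory::new constructor shown here is an assumption):

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::execution::session_state::SessionStateBuilder;
use datafusion::prelude::SessionContext;

// Hypothetical sketch; `ExternalFormatFactory::new` and the `spec` argument
// are assumptions, and the factory's import is omitted.
fn session_with_format(spec: Arc<dyn ExternalFormatSpec>) -> Result<SessionContext> {
    let mut state = SessionStateBuilder::new().with_default_features().build();
    // Registering the factory is what lets SQL like
    // SELECT * FROM 'gpkg/*.gpkg' resolve the format by extension.
    state.register_file_format(Arc::new(ExternalFormatFactory::new(spec)), true)?;
    Ok(SessionContext::new_with_state(state))
}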

@paleolimbot marked this pull request as ready for review on November 6, 2025 16:02
Comment on lines 46 to 66
chrono = { workspace = true }
datafusion = { workspace = true, features = ["parquet"] }
datafusion-catalog = { workspace = true }
datafusion-common = { workspace = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-physical-expr = { workspace = true }
datafusion-physical-plan = { workspace = true }
float_next_after = { workspace = true }
geo-traits = { workspace = true }
futures = { workspace = true }
object_store = { workspace = true }
parquet = { workspace = true }
sedona-common = { path = "../sedona-common" }
sedona-expr = { path = "../sedona-expr" }
sedona-functions = { path = "../sedona-functions" }
sedona-geometry = { path = "../sedona-geometry" }
sedona-schema = { path = "../sedona-schema" }
serde = { workspace = true }
serde_json = { workspace = true }
serde_with = { workspace = true }
A reviewer (Member) commented:

Are there unneeded dependencies? I believe that chrono, float_next_after, and parquet are not needed here.

    _conf: FileSinkConfig,
    _order_requirements: Option<LexRequirement>,
) -> Result<Arc<dyn ExecutionPlan>> {
    not_impl_err!("writing not yet supported for SimpleSedonaFormat")
A reviewer (Member) commented:

Should this be "ExternalFileFormat"?

fn open(&self, file_meta: FileMeta, _file: PartitionedFile) -> Result<FileOpenFuture> {
    if file_meta.range.is_some() {
        return sedona_internal_err!(
            "Expected SimpleOpener to open a single partition per file"
A reviewer (Member) commented:

I believe that this should be ExternalFileOpener.

@Kontinuation (Member) left a comment:

The API looks good to me, I only have a few comments about some names in error messages.

@paleolimbot merged commit 964c4dd into apache:main on Nov 7, 2025
12 checks passed
@paleolimbot deleted the get-me-gdal branch on November 7, 2025 20:46
@paleolimbot added this to the 0.2.0 milestone on Nov 27, 2025