feat(rust/sedona-geoparquet): Add GeoParquet writer for non-spatial output #52
Pull Request Overview
This PR adds support for writing GeoParquet files in Sedona's Rust implementation by implementing a GeoParquet writer that properly handles spatial columns. The changes enable DataFrames containing spatial data to be written to Parquet format with appropriate GeoParquet metadata.
Key changes:
- Added `write_geoparquet` method to the `SedonaDataFrame` trait with support for custom write options
- Implemented GeoParquet metadata generation and physical plan creation for writing spatial data
- Extended Python API with `to_parquet` method for DataFrame objects supporting partitioning and sorting
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| rust/sedona/src/context.rs | Added write_geoparquet method and SedonaWriteOptions struct for DataFrame writing |
| rust/sedona-geoparquet/src/writer.rs | New writer implementation with GeoParquet metadata generation |
| rust/sedona-geoparquet/src/options.rs | New options struct for GeoParquet version configuration |
| rust/sedona-geoparquet/src/metadata.rs | Added Default implementations for metadata structs |
| rust/sedona-geoparquet/src/format.rs | Updated format factory to support writer functionality |
| python/sedonadb/python/sedonadb/dataframe.py | Added to_parquet method to Python DataFrame API |
| python/sedonadb/src/dataframe.rs | Rust implementation of Python to_parquet method |
    /// Write GeoParquet 1.0 metadata
    ///
    /// GeoParquet 1.0 has the widest support among readers and writers; however
    /// it does not include row-group level statistics.
    V1_0,

    /// Write GeoParquet 1.1 metadata and optional bounding box column
    ///
    /// A bbox column will be included for any column where the Parquet options would
    /// have otherwise written statistics (which it will by default).
    /// This option may be more computationally expensive; however, it will result in
    /// row-group level statistics that some readers (e.g., SedonaDB) can use to prune
    /// row groups on read.
    V1_1,

    /// Write GeoParquet 2.0
    ///
    /// The GeoParquet 2.0 option is identical to GeoParquet 1.0 except the underlying storage
    /// of spatial columns is Parquet native geometry, where the Parquet writer will include
    /// native statistics according to the underlying Parquet options. Some readers
    /// (e.g., SedonaDB) can use these statistics to prune row groups on read.
    V2_0,

    /// Do not write GeoParquet metadata
    ///
    /// This option suppresses GeoParquet metadata; however, spatial types will be written as
    /// Parquet native Geometry/Geography when this is supported by the underlying writer.
    Omitted,
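For readers skimming the thread, here is a minimal, self-contained sketch of how an option enum with these four variants might be parsed from a user-supplied string and queried. The names `GeoParquetVersionSketch`, `writes_geo_metadata`, and `writes_bbox_column`, the accepted strings (`"1.0"`, `"1.1"`, `"2.0"`, `"none"`), and the choice of `V1_0` as the default are all assumptions made for illustration; they are not taken from the PR.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the option enum documented above.
/// The default of V1_0 is an assumption, not confirmed by the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum GeoParquetVersionSketch {
    #[default]
    V1_0,
    V1_1,
    V2_0,
    Omitted,
}

impl GeoParquetVersionSketch {
    /// True for every variant except `Omitted`, which suppresses the metadata.
    pub fn writes_geo_metadata(&self) -> bool {
        !matches!(self, Self::Omitted)
    }

    /// Only the 1.1 variant is documented as adding a bbox column.
    pub fn writes_bbox_column(&self) -> bool {
        matches!(self, Self::V1_1)
    }
}

impl FromStr for GeoParquetVersionSketch {
    type Err = String;

    /// Parses a user-facing string; the accepted spellings are hypothetical.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "1.0" => Ok(Self::V1_0),
            "1.1" => Ok(Self::V1_1),
            "2.0" => Ok(Self::V2_0),
            "none" => Ok(Self::Omitted),
            other => Err(format!("unsupported GeoParquet version: {other}")),
        }
    }
}
```

With this sketch, `"1.1".parse::<GeoParquetVersionSketch>()` yields the `V1_1` variant, and a writer could branch on `writes_bbox_column()` to decide whether to add the extra projection.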
@jiayuasu Only V1_0 is implemented in this PR, but this is what I had in mind for how a user will specify output.
I think this is good enough. Since this PR follows V1_0, does it have a bbox value in its file metadata?
Not in this PR, because that requires a true override/copy of the ParquetSink. I will probably try V1_1 before attempting that (it also doesn't involve an override of the sink, because the bbox column is a simple projection).
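To make the "simple projection" idea concrete, here is a rough sketch of the arithmetic behind a bbox column: one box per row, merged into a per-row-group box that a reader can test against a query window. It works on plain coordinate pairs rather than WKB, and `BBox`, `from_coords`, `union`, `intersects`, and `row_group_bbox` are hypothetical names for illustration, not SedonaDB or DataFusion APIs.

```rust
/// Axis-aligned bounding box with the four fields GeoParquet 1.1 uses
/// for its bbox covering (xmin, ymin, xmax, ymax).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

impl BBox {
    /// Per-row projection: the box covering one geometry's coordinates.
    /// Returns None for an empty coordinate list.
    pub fn from_coords(coords: &[(f64, f64)]) -> Option<BBox> {
        let (&(x0, y0), rest) = coords.split_first()?;
        let mut b = BBox { xmin: x0, ymin: y0, xmax: x0, ymax: y0 };
        for &(x, y) in rest {
            b.xmin = b.xmin.min(x);
            b.ymin = b.ymin.min(y);
            b.xmax = b.xmax.max(x);
            b.ymax = b.ymax.max(y);
        }
        Some(b)
    }

    /// Smallest box covering both inputs.
    pub fn union(&self, other: &BBox) -> BBox {
        BBox {
            xmin: self.xmin.min(other.xmin),
            ymin: self.ymin.min(other.ymin),
            xmax: self.xmax.max(other.xmax),
            ymax: self.ymax.max(other.ymax),
        }
    }

    /// True when the two boxes overlap or touch.
    pub fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Row-group level box: the union of every row's box (None if no rows).
/// This mirrors what min/max column statistics on the bbox fields provide.
pub fn row_group_bbox(rows: &[BBox]) -> Option<BBox> {
    rows.iter().copied().reduce(|a, b| a.union(&b))
}
```

A reader can skip any row group whose box does not satisfy `intersects` against the query window, which is the pruning behavior the V1_1 doc comment describes.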
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR adds a writer for Parquet that writes GeoParquet metadata for output with spatial columns. Currently this doesn't write a bbox because that requires something a bit more sophisticated than just popping some metadata into the Parquet write options.
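As an illustration of what "popping some metadata into the Parquet write options" amounts to, the sketch below builds a minimal GeoParquet 1.0 `geo` metadata value by hand. The field names (`version`, `primary_column`, `columns`, `encoding`, `geometry_types`) come from the GeoParquet 1.0 specification; the function names `geo_metadata_json` and `escape_json`, the use of the first geometry column as the primary column, and the omission of optional fields such as `crs` and `bbox` are assumptions made for brevity. The PR itself may well build this with a JSON library rather than string formatting.

```rust
/// Escapes the characters that must not appear raw inside a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the value for the "geo" key in the Parquet key/value metadata.
/// Assumption: the first geometry column is treated as the primary column.
/// Returns None when there are no geometry columns, in which case no
/// GeoParquet metadata would be written.
pub fn geo_metadata_json(geometry_columns: &[&str]) -> Option<String> {
    let primary = geometry_columns.first()?;
    // An empty geometry_types list means the geometry types are not known.
    let columns: Vec<String> = geometry_columns
        .iter()
        .map(|name| {
            format!(
                r#""{}":{{"encoding":"WKB","geometry_types":[]}}"#,
                escape_json(name)
            )
        })
        .collect();
    Some(format!(
        r#"{{"version":"1.0.0","primary_column":"{}","columns":{{{}}}}}"#,
        escape_json(primary),
        columns.join(",")
    ))
}
```

The resulting string would be stored under the `geo` key of the file-level key/value metadata. A file-level `bbox` entry is left out here, in line with the description above: computing it requires seeing the data as it is written, not just the schema.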