@paleolimbot paleolimbot commented Sep 10, 2025

This PR adds a writer for Parquet that writes GeoParquet metadata for output with spatial columns. Currently this doesn't write a bbox because that requires something a bit more sophisticated than just popping some metadata into the Parquet write options.

import sedonadb

sd = sedonadb.connect()

sd.sql("SELECT ST_SetSRID(ST_GeomFromText('POINT (0 1)'), 4326) as geometry").to_parquet("foofy.parquet")
sd.read_parquet("foofy.parquet").show()
#> ┌────────────┐
#> │  geometry  │
#> │  geometry  │
#> ╞════════════╡
#> │ POINT(0 1) │
#> └────────────┘
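
For reviewers unfamiliar with the format, here is a rough sketch of the kind of file-level `geo` metadata that GeoParquet 1.0 expects for an output like the one above. It is not the code in this PR: `build_geo_metadata` is a hypothetical helper written for illustration, and the `crs` field is left out (the spec treats a missing `crs` as OGC:CRS84). No `bbox` is written, matching the current behaviour described above.

```python
import json


def build_geo_metadata(geometry_columns, primary_column=None):
    """Build a GeoParquet 1.0.0 "geo" metadata dict for WKB-encoded columns.

    Hypothetical helper for illustration; geometry_columns is a list of names.
    """
    if not geometry_columns:
        raise ValueError("at least one geometry column is required")
    columns = {
        name: {
            "encoding": "WKB",
            # An empty list means the geometry types are not known
            "geometry_types": [],
        }
        for name in geometry_columns
    }
    return {
        "version": "1.0.0",
        "primary_column": primary_column or geometry_columns[0],
        "columns": columns,
    }


geo = build_geo_metadata(["geometry"])
# Parquet key/value metadata is string-valued, so the dict is stored as JSON
key_value_metadata = {"geo": json.dumps(geo)}
print(key_value_metadata["geo"])
```

The required pieces are the `version`, the `primary_column`, and an `encoding` plus `geometry_types` entry per spatial column; everything else in the spec is optional.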

@paleolimbot paleolimbot force-pushed the parquet-writer branch 2 times, most recently from 9e08bc3 to 25c6a8b Compare September 12, 2025 18:15
@paleolimbot paleolimbot requested a review from Copilot September 13, 2025 03:11

Copilot AI left a comment

Pull Request Overview

This PR adds support for writing GeoParquet files in Sedona's Rust implementation by implementing a GeoParquet writer that properly handles spatial columns. The changes enable DataFrames containing spatial data to be written to Parquet format with appropriate GeoParquet metadata.

Key changes:

  • Added write_geoparquet method to the SedonaDataFrame trait with support for custom write options
  • Implemented GeoParquet metadata generation and physical plan creation for writing spatial data
  • Extended Python API with to_parquet method for DataFrame objects supporting partitioning and sorting

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| rust/sedona/src/context.rs | Added write_geoparquet method and SedonaWriteOptions struct for DataFrame writing |
| rust/sedona-geoparquet/src/writer.rs | New writer implementation with GeoParquet metadata generation |
| rust/sedona-geoparquet/src/options.rs | New options struct for GeoParquet version configuration |
| rust/sedona-geoparquet/src/metadata.rs | Added Default implementations for metadata structs |
| rust/sedona-geoparquet/src/format.rs | Updated format factory to support writer functionality |
| python/sedonadb/python/sedonadb/dataframe.py | Added to_parquet method to Python DataFrame API |
| python/sedonadb/src/dataframe.rs | Rust implementation of Python to_parquet method |

@paleolimbot paleolimbot marked this pull request as ready for review September 13, 2025 03:13
Comment on lines +24 to +51
/// Write GeoParquet 1.0 metadata
///
/// GeoParquet 1.0 has the widest support among readers and writers; however
/// it does not include row-group level statistics.
V1_0,

/// Write GeoParquet 1.1 metadata and optional bounding box column
///
/// A bbox column will be included for any column where the Parquet options would
/// have otherwise written statistics (which it will by default).
/// This option may be more computationally expensive; however, it will result in
/// row-group level statistics that some readers (e.g., SedonaDB) can use to prune
/// row groups on read.
V1_1,

/// Write GeoParquet 2.0
///
/// The GeoParquet 2.0 option is identical to GeoParquet 1.0 except the underlying storage
/// of spatial columns is Parquet native geometry, where the Parquet writer will include
/// native statistics according to the underlying Parquet options. Some readers
/// (e.g., SedonaDB) can use these statistics to prune row groups on read.
V2_0,

/// Do not write GeoParquet metadata
///
/// This option suppresses GeoParquet metadata; however, spatial types will be written as
/// Parquet native Geometry/Geography when this is supported by the underlying writer.
Omitted,
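
As a reading aid, the four options above can be summarised as a small lookup table. This is a hypothetical Python mirror of the Rust variants, not part of the PR; the trait names (`geo_metadata`, `bbox_column`, `native_geometry`) are made up for illustration, and the V2_0 row assumes "identical to GeoParquet 1.0" means the metadata is still written.

```python
from enum import Enum


class GeoParquetVersion(Enum):
    """Hypothetical Python mirror of the Rust variants quoted above."""

    V1_0 = "1.0"
    V1_1 = "1.1"
    V2_0 = "2.0"
    OMITTED = "omitted"


# What each option writes, as read from the doc comments above.
# native_geometry applies only when the underlying writer supports it.
TRAITS = {
    GeoParquetVersion.V1_0: {"geo_metadata": True, "bbox_column": False, "native_geometry": False},
    GeoParquetVersion.V1_1: {"geo_metadata": True, "bbox_column": True, "native_geometry": False},
    GeoParquetVersion.V2_0: {"geo_metadata": True, "bbox_column": False, "native_geometry": True},
    GeoParquetVersion.OMITTED: {"geo_metadata": False, "bbox_column": False, "native_geometry": True},
}

# Per the PR discussion, only V1_0 is implemented at this point.
IMPLEMENTED = {GeoParquetVersion.V1_0}


def resolve(version):
    """Return the traits for a version, or raise if it is not implemented yet."""
    if version not in IMPLEMENTED:
        raise NotImplementedError(f"GeoParquet option {version.name} is not implemented yet")
    return TRAITS[version]


print(resolve(GeoParquetVersion.V1_0))
```

Failing loudly on the unimplemented variants is one possible design; it keeps the public option set stable while the remaining versions land in follow-up PRs.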
Member Author

@jiayuasu Only V1_0 is implemented in this PR, but this is what I had in mind for how a user will specify the output version.

Member

I think this is good enough. Since this PR follows V1_0, does it write a bbox value in its file metadata?

Member Author

Not in this PR, because that requires a true override/copy of the ParquetSink. I will probably try V1_1 before attempting that (V1_1 also doesn't require overriding the sink, because it is a simple projection).
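
One way to read the distinction in the reply above: a per-row bbox depends only on that row's geometry, so it can be computed as an extra projected column, while the file-level bbox is the union over every row and is only known once all rows have been seen. The sketch below illustrates that difference in plain Python; it is not the SedonaDB implementation, the helper names are hypothetical, and geometries are simplified to lists of `(x, y)` pairs.

```python
def row_bbox(coords):
    """Return the bounding box of one geometry given as (x, y) pairs, or None if empty."""
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # Field names follow the GeoParquet 1.1 bbox covering struct
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}


def file_bbox(row_bboxes):
    """Union the per-row boxes into [xmin, ymin, xmax, ymax], skipping empty rows."""
    boxes = [b for b in row_bboxes if b is not None]
    if not boxes:
        return None
    return [
        min(b["xmin"] for b in boxes),
        min(b["ymin"] for b in boxes),
        max(b["xmax"] for b in boxes),
        max(b["ymax"] for b in boxes),
    ]


rows = [
    [(0.0, 1.0)],               # a point
    [(2.0, 3.0), (4.0, -1.0)],  # a two-vertex line
    [],                         # an empty geometry
]

# Row-wise: each value depends only on its own row (a projection)
bbox_column = [row_bbox(r) for r in rows]
print(bbox_column[1])

# File-wise: needs every row before the value is known (an aggregation)
print(file_bbox(bbox_column))
```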

@jiayuasu jiayuasu merged commit 42a16a1 into apache:main Sep 13, 2025
12 checks passed
@paleolimbot paleolimbot deleted the parquet-writer branch October 8, 2025 20:05