@paleolimbot paleolimbot commented Nov 28, 2025

The issue identified by the GDAL filter pushdown in #384 / #380 also affects GeoParquet reads. The following query wouldn't complete for me before this change (after this change it completes in ~40s on a debug build).

import sedona.db

sd = sedona.db.connect()

buildings_url = (
    "s3://overturemaps-us-west-2/release/2025-11-19.0/theme=buildings/type=building/"
)

target_wkt = (
    "POLYGON ((-73.21 44.03, -73.21 43.98, -73.11 43.97, -73.12 44.03, -73.21 44.03))"
)

sd.read_parquet(
    buildings_url,
    options={"aws.skip_signature": True, "aws.region": "us-west-2"},
).to_view("buildings")

sd.sql(f"""
SELECT * FROM buildings
WHERE ST_Intersects(geometry, ST_SetSRID(ST_GeomFromText('{target_wkt}'), 4326))
""").count()
#> 2845

I tried to keep the changes mostly isolated to sedona-expr so that this change stays tightly scoped. I opened #389 to solve this more holistically, since we're about to revisit that code when we update DataFusion to provide Geometry/Geography Parquet support (unless there's consensus that we should just do that now).

@paleolimbot paleolimbot marked this pull request as ready for review November 29, 2025 04:07
Comment on lines 415 to 416
/// name <https://github.com/apache/sedona-db/pull/385>. This option may
/// be removed if the incorrect index can be resolved upstream.
Collaborator

If you have any (and I mean any) hunches of what the exact behavior causing this in DataFusion is, now would be a good time to mention it while you're fresh off your investigation. Not top priority, but I'd eventually like to narrow it down enough to submit an issue at least.

Member Author

Thank you for the continued prods to investigate this more deeply... I think this particular prod resulted in me finding the root cause here (not a DataFusion issue 😬). I'll push what is hopefully a fix shortly.

paleolimbot and others added 3 commits November 29, 2025 21:16
@paleolimbot paleolimbot requested a review from Copilot November 30, 2025 04:16
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug in GeoParquet pruning where spatial filters fail when the number of projected columns is less than the geometry column's index. The fix introduces TableGeoStatistics to handle column resolution by name instead of position, avoiding index out-of-bounds errors.

  • Introduces TableGeoStatistics enum to resolve statistics by name or position
  • Updates SpatialFilter::evaluate to return Result<bool> and use TableGeoStatistics
  • Refactors statistics handling in GeoParquet file opener to use the new approach
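To make the failure mode concrete, here is a minimal sketch of the idea, not the code from this PR. `BBox`, `TableStats`, `resolve`, and `may_match` are hypothetical names invented for illustration; the real `TableGeoStatistics` and `SpatialFilter::evaluate` in sedona-expr may differ in shape and signature. The sketch assumes statistics are per-column bounding boxes and that a missing entry means "unknown", in which case the file cannot be pruned.

```rust
use std::collections::HashMap;

/// Simplified stand-in for per-column bounding-box statistics (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// True when the two boxes overlap or touch.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Hypothetical analogue of the enum described above: statistics can be
/// addressed either by column position or by column name.
enum TableStats {
    ByPosition(Vec<Option<BBox>>),
    ByName(HashMap<String, BBox>),
}

impl TableStats {
    /// Look up statistics for a column. `Ok(None)` means "unknown".
    /// A positional lookup past the end is reported as an error instead of
    /// panicking, which is why the caller has to return a `Result`.
    fn resolve(&self, name: &str, index: usize) -> Result<Option<BBox>, String> {
        match self {
            TableStats::ByPosition(stats) => stats.get(index).copied().ok_or_else(|| {
                format!(
                    "column index {} out of bounds ({} columns)",
                    index,
                    stats.len()
                )
            }),
            TableStats::ByName(stats) => Ok(stats.get(name).copied()),
        }
    }
}

/// Returns `Ok(false)` only when the statistics prove the file cannot match.
fn may_match(
    stats: &TableStats,
    name: &str,
    index: usize,
    query: &BBox,
) -> Result<bool, String> {
    Ok(match stats.resolve(name, index)? {
        Some(bbox) => bbox.intersects(query),
        None => true, // unknown statistics: keep the file
    })
}
```

With a single projected column and a filter that refers to the geometry column at index 3, the `ByPosition` lookup in this sketch returns an error, while the `ByName` lookup for `"geometry"` still finds the bounding box and the pruning decision can proceed.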

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| rust/sedona-geoparquet/src/file_opener.rs | Updates spatial filter evaluation calls to use TableGeoStatistics and handle new Result return type |
| rust/sedona-expr/src/statistics.rs | Makes GeoStatistics::unspecified() a const for efficiency |
| rust/sedona-expr/src/spatial_filter.rs | Introduces TableGeoStatistics enum and refactors SpatialFilter::evaluate to return Result |
| rust/sedona-datasource/src/spec.rs | Adds documentation clarifying that filter Column indices are relative to file_projection |

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@paleolimbot paleolimbot merged commit 24c4a11 into apache:main Dec 1, 2025
14 checks passed
@paleolimbot paleolimbot deleted the pruning-geoparquet-by-name branch December 1, 2025 02:21
paleolimbot added a commit that referenced this pull request Dec 1, 2025
…lumns is less than the geometry column index (#385)

Co-authored-by: Peter Nguyen <petern0408@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
pwrliang pushed a commit to pwrliang/sedona-db that referenced this pull request Dec 6, 2025
…lumns is less than the geometry column index (apache#385)

Co-authored-by: Peter Nguyen <petern0408@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>