[GH-2095] Implement scalable workaround for GeoSeries.__init__()#2096

Closed
petern48 wants to merge 7 commits into apache:master from petern48:scalable_workaround

@petern48
Contributor

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

In a previous PR, I wrote a workaround for the behavior of creating a Series from a Series. That workaround required the use of to_pandas(), which is not scalable. I previously submitted a fix to Spark but was told it would be a new feature that could not be backported. I then realized I could just implement the logic manually in our codebase in the meantime, since it's all in the constructor.

How was this patch tested?

Ensured existing tests pass

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

@petern48 petern48 requested a review from zhangfengcdt July 15, 2025 16:46
@petern48 petern48 marked this pull request as ready for review July 15, 2025 16:46
@petern48 petern48 requested a review from jiayuasu as a code owner July 15, 2025 16:46
@petern48 petern48 marked this pull request as draft July 15, 2025 22:22
@petern48
Contributor Author

This PR makes the code lose the CRS info again. It's because of the way Sedona serializes the shapely Geometry objects. Unfortunately, it doesn't include the CRS info, so I have to look into the best way to address this.

@petern48
Contributor Author

petern48 commented Jul 18, 2025

I think what's happening:
When I call .apply() to convert geometries into EWKB, Spark applies my specified function, but then calls GeometryType()'s serialize and deserialize functions on my data as if the values were still Geometry objects, which garbles the bytes. Looks like I need to find a different workaround.
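
For context on why EWKB matters here, the following is a minimal sketch using only the standard library. It assumes the PostGIS-style EWKB layout (an SRID flag in the geometry type word, followed by a 4-byte SRID); it is not Sedona's internal serialization format, and `encode_point` and `read_srid` are hypothetical helper names used only for illustration.

```python
import struct

# PostGIS EWKB convention: this bit in the type word means "an SRID follows".
EWKB_SRID_FLAG = 0x20000000
WKB_POINT = 1


def encode_point(x, y, srid=None):
    """Encode a 2D point as little-endian WKB, or as EWKB when srid is given."""
    header = b"\x01"  # byte-order marker: 1 = little-endian
    coords = struct.pack("<dd", x, y)
    if srid is None:
        # Plain WKB has no slot for an SRID at all.
        return header + struct.pack("<I", WKB_POINT) + coords
    # EWKB sets the flag and inserts the SRID right after the type word.
    return header + struct.pack("<Ii", WKB_POINT | EWKB_SRID_FLAG, srid) + coords


def read_srid(buf):
    """Return the SRID embedded in a (E)WKB buffer, or None if there is none."""
    order = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", buf, 1)
    if not geom_type & EWKB_SRID_FLAG:
        return None
    (srid,) = struct.unpack_from(order + "i", buf, 5)
    return srid


plain = encode_point(1.0, 2.0)
tagged = encode_point(1.0, 2.0, srid=4326)
```

The two buffers differ only in the flagged type word and the four extra SRID bytes, so any step that re-encodes the geometry as plain WKB silently drops the SRID.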

@petern48
Contributor Author

petern48 commented Jul 18, 2025

@jiayuasu Do you happen to know off the top of your head where Sedona stores its SRID/CRS info (e.g., in a separate column, in metadata, or in the serialization format)? It's got to be somewhere, since it can be accessed by worker nodes.

EDIT: I've dug into the C code and see that it's serialized in the format. Still not sure why it's not retained once it becomes a shapely object for me.

@jiayuasu
Member

@Kontinuation did you have answers for Peter's question?

@petern48
Contributor Author

I figured it all out after quite a bit of digging. SRID info is included in the serialization process per geometry object. It was just a handful of bugs in Sedona and Spark covering things up.

Closing. New working PR is here: #2121

@petern48 petern48 closed this Jul 19, 2025
@Kontinuation
Member

Kontinuation commented Jul 19, 2025

I'm submitting PR #2123 to resolve #2122. Hopefully it will fix the SRID preservation issue.

Development

Successfully merging this pull request may close these issues.

Geopandas.GeoSeries: Implement scalable workaround for GeoSeries.__init__()

3 participants