[GH-2095] Implement scalable workaround for GeoSeries.__init__()#2096
petern48 wants to merge 7 commits into apache:master
This PR makes the code lose the CRS info again, because of the way Sedona serializes shapely geometry objects. Unfortunately, the serialization doesn't include the CRS info, so I have to look into the best way to address this.
I think what's happening:
@jiayuasu Do you happen to know off the top of your head where Sedona stores its SRID/CRS info (e.g., in a separate column, in metadata, or in the serialization format)? It has to be somewhere, since worker nodes can access it. EDIT: I've dug into the C code and see that it's serialized in the format. I'm still not sure why it isn't retained once the geometry becomes a shapely object on my end.
@Kontinuation do you have an answer to Peter's question?
I figured it all out after quite a bit of digging. SRID info is included in the serialization for each geometry object; a handful of bugs in Sedona and Spark were just covering that up. Closing. The new working PR is here: #2121
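For readers unfamiliar with the idea of an SRID travelling inside each serialized geometry, here is a minimal sketch using the PostGIS-style EWKB layout for a 2D point. This is an illustration only: it is not Sedona's own serialization format, and the helper names `point_to_ewkb` and `point_from_ewkb` are hypothetical. It only shows how a per-geometry SRID can be embedded in the bytes and recovered on the other side.

```python
import struct

# EWKB sets this bit in the geometry type word when an SRID follows the header.
EWKB_SRID_FLAG = 0x20000000
WKB_POINT = 1  # geometry type code for Point


def point_to_ewkb(x, y, srid=None):
    """Serialize a 2D point as little-endian (E)WKB, embedding the SRID if given.

    Hypothetical helper for illustration; not part of Sedona or shapely.
    """
    geom_type = WKB_POINT
    if srid is not None:
        geom_type |= EWKB_SRID_FLAG
    out = struct.pack("<BI", 1, geom_type)  # byte order marker 1 = little-endian
    if srid is not None:
        out += struct.pack("<I", srid)
    out += struct.pack("<dd", x, y)
    return out


def point_from_ewkb(data):
    """Parse bytes produced by point_to_ewkb and return (x, y, srid)."""
    byte_order, geom_type = struct.unpack_from("<BI", data, 0)
    if byte_order != 1:
        raise ValueError("this sketch only handles little-endian input")
    offset = 5  # 1 byte for the byte order + 4 bytes for the type word
    srid = None
    if geom_type & EWKB_SRID_FLAG:
        (srid,) = struct.unpack_from("<I", data, offset)
        offset += 4
    x, y = struct.unpack_from("<dd", data, offset)
    return x, y, srid


# The SRID survives the round trip only because it is part of the payload.
with_srid = point_from_ewkb(point_to_ewkb(1.0, 2.0, srid=4326))
without_srid = point_from_ewkb(point_to_ewkb(1.0, 2.0))
```

If a deserializer ignores the flag, or a later step rebuilds the geometry without carrying the parsed SRID along, the CRS is silently dropped even though it was present in the bytes, which is the general kind of failure described in this thread.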
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #2095 (Geopandas.GeoSeries: Implement scalable workaround for GeoSeries.__init__()).
What changes were proposed in this PR?
In a previous PR, I wrote a workaround for the behavior of creating a Series from a Series. That workaround required the use of to_pandas(), which is not scalable. I previously submitted a fix to Spark but was told it would be a new feature that could not be backported. I then realized I could just implement the logic manually in our codebase in the meantime, since it's all in the constructor.
How was this patch tested?
Ensured existing tests pass
Did this PR include necessary documentation updates?