-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-227] Python Serde Refactor #745
Conversation
@Kontinuation @Imbruced Please feel free to chime in with comments :-) |
LGTM. The refactored implementation looks much more pythonic while having better performance. It is also a great move to remove the expensive |
@Kontinuation What's your take on dropping support for empty geometry inside of multi-geometries in the new serialization format? They don't have representation in any other form so I'm not sure that supporting them provides any value. |
I think it is OK to ignore empty geometries inside geometry collections since it has no impact on the topological properties of the geometries, though it may change some structural properties of the geometry collection such as indices of geometry elements ( If I understand it correctly, both WKT and WKB support empty geometries in geometry collections according to the OGC standard Simple Feature Access - Part 1: Common Architecture. The standard does not explicitly forbid empty geometries inside geometry collections, and these kinds of geometries could be represented according to the BNF of WKT:
A MultiLineString containing 2 empty LineStrings could be represented as
WKB may also represent
Based on the above observations, I prefer supporting empty geometries inside geometry collections to preserve their structural properties, since it may already be supported by various popular geometry representations, and users won't be surprised by accidental geometry structure changes. |
I stand corrected :) I should have been a little more thorough. I think empty geometries in collections should be supported then, since the standards expressly provide for them. |
Is the shapely section refering to wkb.loads and wkb.dumps methods ? What do you think about letting user to chose if the Z/M support should be turned on ? (I guess that majority of use cases are about 2D data). At the end we need sth low level written I am wondering if we can create some PR's in the shapely project ? |
if coord_type == CoordinateType.XYM or coord_type == CoordinateType.XYZM: | ||
raise NotImplementedError("XYM or XYZM coordinates are not supported") | ||
|
||
if num_coords < 50: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that some magical number ? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I wrestled a leprechaun for it :)
In reality, it's roughly where the performances of the two approaches crossed each other on my system so it's fairly arbitrary. In my experience geometry either has a low number of points or a lot. Unfortunately, neither approach here performs well at both scales. So, 50 seemed like a reasonable "gonna call it here" kind of threshold.
I created a constant with a better name and am using that now.
Do you have a chance to compare the memory footprint between wkb.loads approach and the changes after 29.12 and the current version ? |
I'd love too. Unfortunately, it won't be till tomorrow or early this next week. |
@douglasdennis Is this PR ready to be merged? :-) |
@Imbruced Took me a little longer but here are some memory profiling results.
Deserialization Average Memory Increment Results (MiB):
Method: This would give a printout like this:
I then averaged the "Increment" amount for line 23 across all five trials. Additionally, I plotted a representative run of the serialize operations (if you click on these you should get a bigger image):
Lastly here is the serialization trial script: import sys
from typing import List
from sedona.utils.geometry_serde import serialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps
from memory_profiler import profile
def make_point():
return Point(12.3, 45.6)
def make_linestring():
return LineString([(n, n) for n in range(5)])
def make_polygon():
return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
@profile
def run_serialize_profile(serializer, geom):
x = [serializer(geom) for _ in range(100_000)]
return x
def main(args: List[str]) -> int:
if len(args) < 2:
print("Usage: mem_profile_run.py <sedona or wkb> <point>")
return 1
serializer_type = args[0]
geom_type = args[1]
if geom_type == "point":
geom = make_point()
elif geom_type == "linestring":
geom = make_linestring()
elif geom_type == "polygon":
geom = make_polygon()
else:
print(f"Geometry type is not supported: {geom_type}")
return 1
if serializer_type == "sedona":
serializer = serialize
elif serializer_type == "wkb":
serializer = dumps
else:
print(f"Serializer type is not supported: {serializer_type}")
return 1
run_serialize_profile(serializer, geom)
if __name__ == "__main__":
exit(main(sys.argv[1:])) Deserialize trial script: import sys
from typing import List
from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import Point, LineString, Polygon
from shapely.wkb import dumps, loads
from memory_profiler import profile
def make_point():
return Point(12.3, 45.6)
def make_linestring():
return LineString([(n, n) for n in range(5)])
def make_polygon():
return Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
@profile
def run_deserialize_profile(deserializer, geom):
x = [deserializer(geom) for _ in range(100_000)]
return x
def main(args: List[str]) -> int:
if len(args) < 2:
print("Usage: mem_profile_run.py <sedona or wkb> <point>")
return 1
serializer_type = args[0]
geom_type = args[1]
if geom_type == "point":
geom = make_point()
elif geom_type == "linestring":
geom = make_linestring()
elif geom_type == "polygon":
geom = make_polygon()
else:
print(f"Geometry type is not supported: {geom_type}")
return 1
if serializer_type == "sedona":
geom = serialize(geom)
deserializer = deserialize
elif serializer_type == "wkb":
geom = dumps(geom)
deserializer = loads
else:
print(f"Serializer type is not supported: {serializer_type}")
return 1
run_deserialize_profile(deserializer, geom)
if __name__ == "__main__":
exit(main(sys.argv[1:])) |
@jiayuasu I feel good about this PR. If the others do as well then I think it can merge. This code will probably have to be returned to at a later date once shapely decides how it wants to handle M coordinates and once Shapely < 2.0 is no longer supported. |
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
A refactor of some of the python serialization/deserialization functions is proposed. This is due to the performance regression experienced from the change in serialization formats. Most of the functions were refactored to increase performance. Attempts were made to maintain readability of the code at the same time.
A significant pain point involves interacting with shapely geometry. Most of the shapely methods are too slow and involve repeated calls to native code. Once shapely 2.0 is adopted by Sedona and its users, it may be a good idea to revisit this code and attempt to utilize some of the libgeos connections that shapely exposes.
There are other issues in the current implementation of serde:
Should the serialization format allow for empty geometry within collections? They have no WKT representation, so I'm not sure that should be allowed.Finally, here are performance results between current master and this branch:
Here is the python script I used to generate these results:
How was this patch tested?
A parameterized set of unit tests were added that test the serialization/deserialization round trip of a geometry object between python and spark.
Did this PR include necessary documentation updates?