Clip Module - polygon should use overlay() rather than the rtree approach & Multi Poly Issues #112
Comments
I took a stab at running some timing experiments to compare the speed of GeoPandas' The order of arguments to When EarthPy's Here's a gist that does the timing experiments: https://gist.github.com/mbjoseph/75e40aed8627ee69b8b505124b50a69e I'm not sure how helpful this is @lwasser but maybe it's a start toward quantifying whether/when/if |
thanks @mbjoseph !! i'm also pulling @nkorinek into this! So the issue with geopandas overlay is that it doesn't handle points and lines, BUT conventional GIS tools (e.g. ArcGIS) don't consider geometry structure when you clip. The idea i had, however, was that IF overlay is faster for polygons (which geopandas does handle), it could be better to just call that rather than my customized rtree approach, which in theory is faster BUT only with BIG datasets. See this lesson: https://www.earthdatascience.org/courses/earth-analytics-python/spatial-data-vector-shapefiles/clip-vector-data-in-python-geopandas-shapely/ The global layers took some time to clip!! I'd suggest trying this approach with the larger global datasets with MANY MANY points, which in theory benefit from the rtree approach. This is great - thank you so much!
Thanks for the info @lwasser! Are you suggesting that we try increasing numbers of points (e.g., And if we want the latter, then are we interested in comparisons for one of these cases, or all of them?
Yes. Start with clipping more points! i think multi polygons is less of an issue because if i recall i use a In theory, from what i've read, rtree / spatial indexing can be slower with smaller numbers of features and then faster with many features. But of course i left this issue open because i haven't had time to really test it, especially with the built-in geopandas functions that work on polygons only.
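To make the trade-off above concrete, here is a minimal sketch of how a spatial index narrows down candidates before any exact geometry test. It is a toy uniform-grid index over bounding boxes, not the rtree library and not EarthPy's implementation; `GridIndex`, `bbox_intersects`, and the sample boxes are hypothetical names used only for illustration.

```python
from collections import defaultdict
from math import floor


def bbox_intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Toy spatial index: buckets feature ids by the grid cells their bbox covers."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # (col, row) -> feature ids
        self.boxes = {}                # feature id -> bbox

    def _cells_for(self, box):
        # Enumerate every grid cell touched by the bounding box
        x0, y0 = floor(box[0] / self.cell_size), floor(box[1] / self.cell_size)
        x1, y1 = floor(box[2] / self.cell_size), floor(box[3] / self.cell_size)
        for col in range(x0, x1 + 1):
            for row in range(y0, y1 + 1):
                yield (col, row)

    def insert(self, fid, box):
        self.boxes[fid] = box
        for cell in self._cells_for(box):
            self.cells[cell].add(fid)

    def query(self, box):
        # Only features sharing a cell with the query box are tested
        candidates = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        return sorted(f for f in candidates if bbox_intersects(self.boxes[f], box))


# Hypothetical feature bounding boxes and a clip extent
features = {0: (0, 0, 1, 1), 1: (5, 5, 6, 6), 2: (0.5, 0.5, 2, 2), 3: (10, 10, 11, 11)}
index = GridIndex(cell_size=2.0)
for fid, box in features.items():
    index.insert(fid, box)

hits = index.query((0, 0, 3, 3))
print(hits)
```

Building the index costs one pass over every feature, which is the overhead that can outweigh the savings when only a few features are involved.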
I'm possibly missing something, but how should we compare the speed of clipping points with EarthPy vs. GeoPandas, when GeoPandas |
oh no - sorry @mbjoseph. Only test it using many polygons, i.e. when you clip a file that has many polygon features. i brought that up because the approach i'm using seems to work BETTER with many features, BUT i suspected that geopandas would be faster and optimized properly for a polygon operation.
@lwasser I have some updated experiments across a range of number of polygons, with the caveat that For example - if we generate random circles over Colorado counties like this: What GeoPandas
Thank you @mbjoseph !! To make the operations comparable you'd run
So the code above would be comparable. |
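Thanks @lwasser I see now how the unary union can be used to get equivalent output. Just to be clear: is the comparison you want between our current implementation and the function below?

import geopandas as gpd

def clip(to_clip, clip_shp):
    """Alternative clip function"""
    # Dissolve all clip features into a single geometry
    union = gpd.GeoDataFrame(
        gpd.GeoSeries([clip_shp.unary_union]),
        columns=['geometry'],
        crs=clip_shp.crs
    )
    # Keep only the parts of to_clip that fall inside the dissolved geometry
    return gpd.overlay(to_clip, union, how='intersection')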
@lwasser can you take another look at this? I wanted to confirm that the function above is the one you want to use as a comparison with our current implementation. Performance aside, if we can remove the dependency on rtree, that would be a win from an installation standpoint! |
Alright @lwasser have a look at this gist: https://gist.github.com/mbjoseph/75e40aed8627ee69b8b505124b50a69e I compared the current clip function with the one above. I'm fairly sure that this is the comparison you wanted to make. The rtree approach is slightly faster when many polygons are being clipped (e.g., about 1.2x faster for 1000 points, and 1.3x faster for 10,000 points).
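For anyone who wants to reproduce this kind of "N times faster" figure, a minimal timing sketch follows. It is not the code from the gist: `best_time` and `speedup` are hypothetical helpers, and `slow_clip` / `fast_clip` are stand-in workloads where the two real clip implementations would go.

```python
import timeit


def best_time(fn, repeat=5, number=10):
    """Lowest total time in seconds over `repeat` runs of `number` calls each."""
    return min(timeit.repeat(fn, repeat=repeat, number=number))


def speedup(baseline_fn, candidate_fn, repeat=5, number=10):
    """How many times faster candidate_fn is than baseline_fn (> 1 means faster)."""
    return best_time(baseline_fn, repeat, number) / best_time(candidate_fn, repeat, number)


# Stand-in workloads (assumptions); swap in the two clip functions being compared
def slow_clip():
    return sum(range(200_000))


def fast_clip():
    return sum(range(1_000))


ratio = speedup(slow_clip, fast_clip)
print(f"candidate is {ratio:.1f}x faster than baseline")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the difference being measured is as small as 1.2x.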
am i misreading this @mbjoseph? So i'm hearing that you'd like to remove rtree. i'm fairly surprised that geopandas isn't using rtree - i guess i just assumed they were, as it's supposed to be faster for larger operations?!! i haven't had any issues with rtree yet, so can you tell me what the goal of this is IF it is in fact faster for larger operations.
there is also the other issue which rtree helps with - geopandas overlay doesn't handle lines. just polygons. |
@lwasser the y-axis is execution time: the new function is slower than the current function, particularly as the number of polygons being clipped increases. So, if we are optimizing for performance we want to keep using the rtree approach. |
ok great. yes let's keep it then if we can. i went that route after a good bit of testing because i read it was the fastest option and would be particularly beneficial as the data got larger. |
Sounds good! |
Currently the clip module uses the same approach for lines and polygons. This is because overlay(), which handles clip, does not handle lines. We should do the following:
Use the overlay() function with how='intersection' as an option. Adjust function accordingly.
a. When a multi object is provided to the function, it gracefully fails, telling the user to handle multi objects, potentially via explode(), OR
b. The function is smart enough to find multi objects and explode them before trying to run an intersection.
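Option (a) above could look roughly like the sketch below. It assumes each geometry exposes a `geom_type` string the way shapely geometries do ("Polygon", "MultiPolygon", and so on); `find_multipart` and `check_no_multipart` are hypothetical helper names, and the stand-in objects only mimic that one attribute so the example runs without any GIS libraries.

```python
from types import SimpleNamespace


def find_multipart(geometries):
    """Return positions of geometries whose geom_type starts with 'Multi'."""
    return [i for i, geom in enumerate(geometries) if geom.geom_type.startswith("Multi")]


def check_no_multipart(geometries):
    """Raise a descriptive error if any multi-part geometry is present."""
    multi = find_multipart(geometries)
    if multi:
        raise ValueError(
            f"Found multi-part geometries at positions {multi}. "
            "Split them into single parts (for example with explode()) before clipping."
        )


# Stand-ins (assumption) that only mimic the geom_type attribute of real geometries
geoms = [
    SimpleNamespace(geom_type="Polygon"),
    SimpleNamespace(geom_type="MultiPolygon"),
    SimpleNamespace(geom_type="LineString"),
]

try:
    check_no_multipart(geoms)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

For option (b), the same check could decide whether to explode the input automatically instead of raising.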
Which brings me to the next missing test