Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clip Module - polygon should use overlay() rather than the rtree approach & Multi Poly Issues #112

Closed
4 tasks done
lwasser opened this issue Nov 7, 2018 · 16 comments
Closed
4 tasks done
Labels
enhancement New feature or request

Comments

@lwasser
Copy link

lwasser commented Nov 7, 2018

Currently the clip module uses the same approach for lines and polygons. this is because overlay() which handles clip does not handle lines. We should do the following

  • Test whether the clip fun I wrote is faster or slower than the overlay() function with how=intersects as an option. Adjust function accordingly.
  • It follows that clip will fail if provided a multi-polygon (or line i presume) layer. There are two options here
    a. When a multi object is provided to the function, it gracefully fails telling the user to handle multi objects - potentially via explode() OR
    b. The function is smart enough to finds multi objects and explode them before trying to run an intersection.
    Which brings me to the next missing test
  • Test what happens when a clip object provided has multiple polygons in it OR
  • Test what happens when a clip object provided is a point or line (should return error)
@lwasser lwasser changed the title Clip Module - polygon should maybe use overlay() rather than the rtree approach Clip Module - polygon should use overlay() rather than the rtree approach & Multi Poly Issues Nov 7, 2018
@mbjoseph
Copy link
Member

I took a stab at running some timing experiments to compare the speed of GeoPandas' overlay(..., how='intersection') and EarthPy's clip_shp(). It's somewhat hard to do a 1:1 comparison because they appear to behave slightly differently.

The order of arguments to clip_shp matters: the first arg is clipped by the second arg. The order of arguments to overlay is not so important (caveat: if I understand the results correctly): the polygon(s) returned by overlay will always be broken up into the same set of polygons, such that the order does not matter.

When EarthPy's clip_shp is used to generate output like GeoPandas' overlay, it is about as fast, or maybe a little slower. But, because the order matters for clip_shp, it can be faster depending on the output that the user wants (i.e., depending on what clips what).

Here's a gist that does the timing experiments: https://gist.github.com/mbjoseph/75e40aed8627ee69b8b505124b50a69e

I'm not sure how helpful this is @lwasser but maybe it's a start toward quantifying whether/when/if overlay or clip_shp are more efficient.

@lwasser
Copy link
Author

lwasser commented Mar 1, 2019

thanks @mbjoseph !! i'm also pulling @nkorinek into this! So the issue with geopandas overlay is it doesn't handle points and lines. BUT conventional GIS methods don't consider geometry structure when you clip (eg ArcGIS). The idea i had was however that IF overlay is faster for polygons (which geopandas does handle), it could be better to just call that rather than my custimzed rtree approach which in theory is faster BUT only with BIG datasets. see this lesson: https://www.earthdatascience.org/courses/earth-analytics-python/spatial-data-vector-shapefiles/clip-vector-data-in-python-geopandas-shapely/ the global layers took some time to clip!!

Id suggest trying this approach with the larger global datasets with MANY MANY points which in theory benefit from the rtree approach. this is great - thank you so much!

@mbjoseph
Copy link
Member

mbjoseph commented Mar 1, 2019

Thanks for the info @lwasser! Are you suggesting that we try increasing numbers of points (e.g., clip_shp(lots_of_points, clip_obj)), or increasing numbers of polygons?

And if we want the latter, then are we interested in comparisons for one of these cases, or all of them?

  • clip_shp(lots_of_polygons, clip_obj)
  • clip_shp(shp, lots_of_polygons)
  • clip_shp(lots_of_polygons, lots_of_polygons)

@lwasser
Copy link
Author

lwasser commented Mar 1, 2019

Yes. Start with clipping more points! i think multi polygons is less of an issue because if i recall i use a unary_union to ensure there is one clip extent. I spent a lot of time testing speed associated with clipping lots of points and lots of lines. i would imagine polygons would present a similar issue.

In theory from what i've read - rtree / spatial indexing can be slower with smaller numbers of features and then faster with many features. But ofcourse i left this issue open because i haven't had time to really test it especially with the built in geopandas functions that work on polygons only.

@mbjoseph
Copy link
Member

mbjoseph commented Mar 1, 2019

I'm possibly missing something, but how should we compare the speed of clipping points with EarthPy vs. GeoPandas, when GeoPandas overlay() only has clipping functionality for polygons?

@lwasser
Copy link
Author

lwasser commented Mar 1, 2019

oh no - sorry @mbjoseph . Only test it using many polygons. When you have a clip a file that has many polygon features. i brought that up because the approach i'm using seems to work BETTER with many features BUT i suspected that geopandas would be faster and optimized properly for a polygon operation.

@mbjoseph
Copy link
Member

mbjoseph commented Mar 4, 2019

@lwasser I have some updated experiments across a range of number of polygons, with the caveat that clip_shp and overlay do not do the same thing, which prevents a one-to-one speed comparison.

For example - if we generate random circles over Colorado counties like this:

circles-over-colorado

What GeoPandas overlay does:

GeoPandas' overlay(..., how='intersection') does the intersection:

gpd-intersection

The result of overlay is a GeoDataFrame with many more rows than circles - each circle is broken up in to polygons according to the intersection with the counties.

What EarthPy's clip module does:

We can clip the circles by counties:

eclip1

This will give us the same number of rows out as we have points.

Or we can clip the counties by circles:

eclip2

This will give us a row for each county that remains after clipping.

So, timing experiments need to be taken with a grain of salt, because GeoPandas and EarthPy are generating totally different outputs. But, we can compare the speed of doing an intersection vs. clipping across a range of number of polygons (1, 10, 100, and 1000 circles), and clipping is faster when the number of circles being clipped gets large.

speed-comps

For all the details see this gist (note it's big enough that you may need to download the raw JSON for the notebook because GitHub can choke rendering large notebooks): https://gist.github.com/mbjoseph/75e40aed8627ee69b8b505124b50a69e

@lwasser
Copy link
Author

lwasser commented Mar 4, 2019

Thank you @mbjoseph !! To make the operations compariable you'd run unary_union on the intersection object (the one that is doing the clipping so the circles in this case). Clip clips a set of features to the geometric / spatial extent of another set of features. it does not intersect them as defined! we are trying to mimic typical arcgis behavior!

gpd_out = gpd.overlay(pts.unary_union, counties, how='intersection')

So the code above would be comparable.

@mbjoseph
Copy link
Member

mbjoseph commented Mar 5, 2019

Thanks @lwasser I see now how the unary union can be used to get equivalent output.

Just to be clear: is the comparison you want between clip_shp() and a function like the one below, as the number of polygons in to_clip (the number of polygons being clipped) increases?

def clip(to_clip, clip_shp):
    """Alternative clip function"""
    union = gpd.GeoDataFrame(
        gpd.GeoSeries([clip_shp.unary_union]), 
        columns=['geometry'], 
        crs=clip_shp.crs
    )
    return gpd.overlay(to_clip, union, how='intersection')

@mbjoseph mbjoseph added the enhancement New feature or request label Mar 14, 2019
@mbjoseph
Copy link
Member

@lwasser can you take another look at this? I wanted to confirm that the function above is the one you want to use as a comparison with our current implementation.

Performance aside, if we can remove the dependency on rtree, that would be a win from an installation standpoint!

@mbjoseph
Copy link
Member

Alright @lwasser have a look at this gist: https://gist.github.com/mbjoseph/75e40aed8627ee69b8b505124b50a69e

I compared the current clip function with the one above. I'm fairly sure that this is the comparison you wanted to make. The rtree approach is slightly faster when many polygons are being clipped (e.g., about 1.2x faster for 1000 points, and 1.3x faster for 10,000 points.

@lwasser
Copy link
Author

lwasser commented Mar 27, 2019

am i misreading this @mbjoseph
Screen Shot 2019-03-27 at 9 48 46 AM
it looks like the "new function" is faster as # poly's increases? perhaps i'm misreading this!

So i'm hearing that you'd like to remove rtree . i'm fairly surprised that geopandas isn't using rtree - i guess i just assumed they were as it's supposed to be faster for larger operations? !! i haven't had any issues with rtree yet so can you tell me what the goal of this is IF it is in fact faster for larger operations.

@lwasser
Copy link
Author

lwasser commented Mar 27, 2019

there is also the other issue which rtree helps with - geopandas overlay doesn't handle lines. just polygons.

@mbjoseph
Copy link
Member

@lwasser the y-axis is execution time: the new function is slower than the current function, particularly as the number of polygons being clipped increases. So, if we are optimizing for performance we want to keep using the rtree approach.

@lwasser
Copy link
Author

lwasser commented Mar 27, 2019

ok great. yes let's keep it then if we can. i went that route after a good bit of testing because i read it was the fastest option and would be particularly beneficial as the data got larger.

@mbjoseph
Copy link
Member

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants