New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: (optionally) use pygeos for vectorized GeometryArray operations #1154
ENH: (optionally) use pygeos for vectorized GeometryArray operations #1154
Conversation
Also to add to TODO: update |
Codecov Report
@@ Coverage Diff @@
## master #1154 +/- ##
==========================================
- Coverage 89.37% 88.05% -1.33%
==========================================
Files 21 21
Lines 2024 2051 +27
==========================================
- Hits 1809 1806 -3
- Misses 215 245 +30
Continue to review full report at Codecov.
|
@jorisvandenbossche I was playing around with this and found that for some reason the implementation in your branch Could you please help me figure out what I am doing wrong, or what is going on? Specifically timing the section that builds the result df (line 129 to line 165) for a given dataset: from timeit import default_timer
t = default_timer()
if len(r_idx) > 0 and len(l_idx) > 0:
....
else:
result = pd.DataFrame(columns=["_key_left", "_key_right"], dtype=float)
print(f"time to build result df: {default_timer()-t}") For I'm guessing from timeit import default_timer
t = default_timer()
if len(r_idx) > 0 and len(l_idx) > 0:
check_left = left_df.geometry.values[l_idx]
check_right = right_df.geometry.values[r_idx]
match_bool = getattr(check_left, op)(check_right)
res_vec = np.column_stack((l_idx, r_idx))[match_bool]
result_columns = ["_key_left", "_key_right"]
result = pd.DataFrame(res_vec, columns=result_columns)
else:
# when output from the join has no overlapping geometries
result = pd.DataFrame(columns=["_key_left", "_key_right"], dtype=float)
print(f"time to build result: {default_timer()-t}") Result: ~ 0.9 s Querying the spatial index is also marginally slower (0.1 s slower for this dataset), I assume that also means some unnecessary shapely conversion is happening when |
@adriangb FYI - there is work underway in This could involve migrating to One of the biggest differences right now in spatial joins between what is in See pygeos/pygeos#87 , pygeos/pygeos#92 for work in this space. The other part that is missing in For my own work, I've been seeing 3-10x or better speedups using a purely |
@brendan-ward thank you for the thorough reply! Very informative. It sounds like there is some very interesting stuff going on over at For my part, I was just trying to see how much the slower part of the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am playing with this branch to see how does it affect the performance of momepy
and to be fair, in most of the cases where I see a noticeable change, I actually get a slowdown. There are few with speedups, but not many and I am getting slowdowns even in situations I would not expect like this snippet below. (All tests on Polygon geometry (n=13810))
gdf.apply(lambda row: ((4 * math.sqrt(row['a'])) / (row['l'])) ** 2, axis=1)
# pygeos 2.11 s ± 261 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master 1.33 s ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I know it can be done like ((np.sqrt(blg.a) * 4) / blg.l) ** 2
to make it significantly faster, but take it as an example to make the point.
I know that it is likely caused by the conversion to shapely geometry during loops (and momepy is loop-heavy at the moment), but overall I get speedup for functions I didn't really need it as it was fast enough (like a difference between 2.26 ms and 29.9 ms in .area
) and significant slowdown elsewhere, even if geometry is not involved (like the example above). So for the use with momepy, I would probably go with gpd.options.use_pygeos = False
by default now. But everything seems to work, all tests passed using pygeos.
I assume that this my be the issue with other packages and use cases as well, that is why I am writing it.
However, since one has to opt-in to use pygeos
now, I think it is ready for a public beta. There are lot of cases where we can get significant speedup using it as it is now. But I would be hesitant to make this a default behaviour before shapely 2.0 resolving the conversion-based slowdown.
I have no problem with the implementation if we'll be able to get rid of double implementation in _vectorized.py
at some point. I would not be happy to keep it for a long time as we would need to maintain both options.
I also went through the code which looks fine. I've left some minor comments, mostly focused on consistency. I also might have commented some old code which was just moved, so that can be ignored.
I am really looking forward having geopandas fully pygeos-ed. This is exciting even though my post might not sound like that :).
geopandas/tests/test_array.py
Outdated
@@ -133,10 +134,11 @@ def test_from_wkb(): | |||
assert all(v.equals(t) for v, t in zip(res, points_no_missing)) | |||
|
|||
# missing values | |||
L_wkb.extend([b"", None]) | |||
# XXX pygeos does not support empty strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use something more informative than XXX
? It looks like a placeholder, like something is missing there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made sure all TODO items are mentioned as "TODO(pygeos)". And will also open issues for them
geopandas/tests/test_array.py
Outdated
@@ -186,10 +188,12 @@ def f(x): | |||
assert all(v.almost_equals(t) for v, t in zip(res, points_no_missing)) | |||
|
|||
# missing values | |||
L_wkt.extend([f(""), None]) | |||
# L_wkt.extend([f(""), None]) | |||
# XXX pygeos does not support empty strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
geopandas/array.py
Outdated
return geom | ||
if geom is None: | ||
return None | ||
elif pygeos.is_empty(geom) and pygeos.get_type_id(geom) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment with ref to that shapely issue to make clear why is this separately?
Yes, as long as we are still using shapely as scalar object while using pygeos internally (so as long pygeos is not upstreamed to shapely), there will be a slowdown when not using one of the vectorized functions, but using your custom loops. Which is certainly annoying. I also ran the benchmarks, and we see it there as well. For example, all the read_file benchmarks have a 2x slowdown (going to look into that to see if that can already be improved with the currently available pygeos functionality, but it might need to implement conversion from a mapping to pygeos). @martinfleis I assume that while you don't see a speedup mostly, the slowdown is also not really problematic for the use cases you have? (assuming you wouldn't turn it of, or if you would want to use it for one part, ..) (and fully agree that we can only make this default once shapely/pygeos integration is better, which is the reason I hope to work on that in the near future!) |
@martinfleis Thanks a lot for the review! |
@jorisvandenbossche my normal use case is to run 50+ functions on a single dataset, so I am mostly interested about the sum of times needed to run each. The simple loop for i, r in gdf.iterrows():
pass can make a difference:
It may add several minutes in total in my case, which is nothing significant as the total run time is usually few hours. It all depends on what happens inside the loop. If it's something simple, then you will notice slowdown. If something heavy, you probably won't. Anyway, we need to make sure that people using this will be aware of these temporary drawbacks. |
I ran a quick test to see if caching/memoizing would help, since a lot of the slowdown is repeated conversion. Specifically, I added
pygeos without caching:
pygeos with caching
Edit: I did full benchmark runs. |
Updated for feedback. @martinfleis I also optimized the pygeos <-> shapely conversion (which is only used when pygeos and shapely use the same GEOS version, as in that case pointers are passed through). This should help for the cases you mentioned above where you still iterate through the geometries (and thus constantly convert from pygeos to shapely). For example, with:
I get with
and using pygeos (and with pygeos and shapely being compatible):
so not much slower anymore. Moreover, you should also try to avoid (with pygeos, and much faster)
Additionally, I also fixed a huge slowdown in |
Certainly! But that's something I would rather handle in a separate PR (this one is already big enough). And I am going to look at your PR! (sorry for being slow on giving feedback there) |
Sounds great! I have some work in progress for that here, I'll open another PR after you merge this one. |
That's also a cool idea! I think that with the faster version of |
The current results from running the benchmark suite with this PR compared against master:
And some benchmarks are now shown because they didn't change significantly (eg reading, sjoin). Interiors is slower because that is still using the old implementation, |
Thanks for the changes and benchmarks! That certainly looks like an improvement and once we'll fully switch, it'll be awesome :). I have no more comments, so if you're happy we can merge and release beta. |
OK, let's go ahead with this then .. ;) Exciting! |
Thanks! |
Closes #1155
Proof of concept using
pygeos
for the element-wise spatial operations of GeometryArray (and thus GeoSeries/GeoDataFrame).Right now this still needs a branch in pygeos (https://github.com/caspervdw/pygeos/pull/45).With those changes, all tests are passing.
It are relatively few changes, and part of them can also be applied on master first.
EDIT: I now moved most of the "compatibility layer" into a
_vectorized.py
file. So there the switch is done between pygeos vectorized function or our in house "looping over shapely objects" code. This gives a bigger diff, but for the shapely ops, it is mostly a copy paste from array.py to _vectorized.py.