Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: (optionally) use pygeos for vectorized GeometryArray operations #1154

Merged
merged 33 commits into from Mar 24, 2020

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2019

Closes #1155

Proof of concept using pygeos for the element-wise spatial operations of GeometryArray (and thus GeoSeries/GeoDataFrame). Right now this still needs a branch in pygeos (https://github.com/caspervdw/pygeos/pull/45).

With those changes, all tests are passing.

It are relatively few changes, and part of them can also be applied on master first.


EDIT: I now moved most of the "compatibility layer" into a _vectorized.py file. So there the switch is done between pygeos vectorized function or our in house "looping over shapely objects" code. This gives a bigger diff, but for the shapely ops, it is mostly a copy paste from array.py to _vectorized.py.

@snowman2
Copy link
Contributor

Also to add to TODO: update to_crs()

@codecov
Copy link

codecov bot commented Oct 31, 2019

Codecov Report

Merging #1154 into master will decrease coverage by 1.32%.
The diff coverage is 91.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1154      +/-   ##
==========================================
- Coverage   89.37%   88.05%   -1.33%     
==========================================
  Files          21       21              
  Lines        2024     2051      +27     
==========================================
- Hits         1809     1806       -3     
- Misses        215      245      +30
Impacted Files Coverage Δ
geopandas/array.py 85.74% <91.05%> (-8.19%) ⬇️
geopandas/tools/sjoin.py 97.82% <0%> (+0.45%) ⬆️
geopandas/plotting.py 92.51% <0%> (+1.18%) ⬆️
geopandas/geoseries.py 94.01% <0%> (+1.79%) ⬆️
geopandas/tools/_show_versions.py 90.9% <0%> (+2.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 500bceb...d98a39a. Read the comment docs.

@adriangb
Copy link
Contributor

adriangb commented Feb 9, 2020

@jorisvandenbossche I was playing around with this and found that for some reason the implementation in your branch jorisvandenbossche:pygeos is much slower than geopandas:master.

Could you please help me figure out what I am doing wrong, or what is going on?

Specifically timing the section that builds the result df (line 129 to line 165) for a given dataset:

from timeit import default_timer
t = default_timer()
if len(r_idx) > 0 and len(l_idx) > 0:
   ....
else:
    result = pd.DataFrame(columns=["_key_left", "_key_right"], dtype=float)
print(f"time to build result df: {default_timer()-t}")

For geopandas:master: ~ 0.26 sec
For jorisvandenbossche:pygeos: ~ 4.5 sec

I'm guessing sjoin in your branch ends up converting from shapely to pygeos objects unnecessarily, so I tried to optimize to ensure proper vectorized operations using your branch:

from timeit import default_timer
t = default_timer()
if len(r_idx) > 0 and len(l_idx) > 0:
    check_left = left_df.geometry.values[l_idx]
    check_right = right_df.geometry.values[r_idx]
    match_bool = getattr(check_left, op)(check_right)
    res_vec = np.column_stack((l_idx, r_idx))[match_bool]
    result_columns = ["_key_left", "_key_right"]
    result = pd.DataFrame(res_vec, columns=result_columns)
else:
    # when output from the join has no overlapping geometries
    result = pd.DataFrame(columns=["_key_left", "_key_right"], dtype=float)
print(f"time to build result: {default_timer()-t}")

Result: ~ 0.9 s
Muuuuch faster (and pretty concise actually), but still lagging behind geopandas:master.

Querying the spatial index is also marginally slower (0.1 s slower for this dataset), I assume that also means some unnecessary shapely conversion is happening when df.geometry.apply(lambda x: x.bounds) is being called.

@brendan-ward
Copy link
Member

@adriangb FYI - there is work underway in pygeos that may make use of pygeos geometries much faster for spatial joins here, but it's still pretty early. I don't think that optimizing performance was @jorisvandenbossche 's goal in this proof of concept, but rather to define a path forward for core objects that could unlock serious performance improvements once that core is in place. Like you indicate here, there are likely several places in geopandas that would need to be updated to minimize object conversion.

This could involve migrating to pygeos.STRtree for the spatial index, since it works very well with pygeos geometries. See: pygeos/pygeos#52

One of the biggest differences right now in spatial joins between what is in geopandas and implementing a similar approach based on pygeos is the use of prepared geometries; depending on the nature of the geometries in the join, prepared geometries can have a huge performance impact. These are automatically constructed in geopandas sjoin now, but are not yet available in pygeos.

See pygeos/pygeos#87 , pygeos/pygeos#92 for work in this space.

The other part that is missing in pygeos is vectorizing queries to the spatial index (STRtree), since right now STRtree.query() works on an individual source geometry. I've been pondering how to approach that once the solution to prepared geometries is in place. I imagine once that is implemented, we'll see even better speedups.

For my own work, I've been seeing 3-10x or better speedups using a purely pygeos based sjoin implementation that uses less-than-ideal vectorization of querying the tree (it's nowhere near as sophisticated as the sjoin here).

@adriangb
Copy link
Contributor

adriangb commented Feb 9, 2020

@brendan-ward thank you for the thorough reply!

Very informative. It sounds like there is some very interesting stuff going on over at pygeos that might really affect how things get implemented in geopandas, or even might replace geopandas's sjoin altogether.

For my part, I was just trying to see how much the slower part of the current sjoin could benefit from vectorization. I think the answer is that it is too soon to test. I'll sit tight for now.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review February 18, 2020 15:44
@jorisvandenbossche jorisvandenbossche changed the title POC: use pygeos for vectorized GeometryArray operations ENH: (optionally) use pygeos for vectorized GeometryArray operations Feb 18, 2020
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am playing with this branch to see how does it affect the performance of momepy and to be fair, in most of the cases where I see a noticeable change, I actually get a slowdown. There are few with speedups, but not many and I am getting slowdowns even in situations I would not expect like this snippet below. (All tests on Polygon geometry (n=13810))

gdf.apply(lambda row: ((4 * math.sqrt(row['a'])) / (row['l'])) ** 2, axis=1)

# pygeos 2.11 s ± 261 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master 1.33 s ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I know it can be done like ((np.sqrt(blg.a) * 4) / blg.l) ** 2 to make it significantly faster, but take it as an example to make the point.

I know that it is likely caused by the conversion to shapely geometry during loops (and momepy is loop-heavy at the moment), but overall I get speedup for functions I didn't really need it as it was fast enough (like a difference between 2.26 ms and 29.9 ms in .area) and significant slowdown elsewhere, even if geometry is not involved (like the example above). So for the use with momepy, I would probably go with gpd.options.use_pygeos = False by default now. But everything seems to work, all tests passed using pygeos.

I assume that this my be the issue with other packages and use cases as well, that is why I am writing it.

However, since one has to opt-in to use pygeos now, I think it is ready for a public beta. There are lot of cases where we can get significant speedup using it as it is now. But I would be hesitant to make this a default behaviour before shapely 2.0 resolving the conversion-based slowdown.

I have no problem with the implementation if we'll be able to get rid of double implementation in _vectorized.py at some point. I would not be happy to keep it for a long time as we would need to maintain both options.

I also went through the code which looks fine. I've left some minor comments, mostly focused on consistency. I also might have commented some old code which was just moved, so that can be ignored.

I am really looking forward having geopandas fully pygeos-ed. This is exciting even though my post might not sound like that :).

@@ -133,10 +134,11 @@ def test_from_wkb():
assert all(v.equals(t) for v, t in zip(res, points_no_missing))

# missing values
L_wkb.extend([b"", None])
# XXX pygeos does not support empty strings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something more informative than XXX? It looks like a placeholder, like something is missing there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made sure all TODO items are mentioned as "TODO(pygeos)". And will also open issues for them

@@ -186,10 +188,12 @@ def f(x):
assert all(v.almost_equals(t) for v, t in zip(res, points_no_missing))

# missing values
L_wkt.extend([f(""), None])
# L_wkt.extend([f(""), None])
# XXX pygeos does not support empty strings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

geopandas/array.py Show resolved Hide resolved
geopandas/array.py Show resolved Hide resolved
return geom
if geom is None:
return None
elif pygeos.is_empty(geom) and pygeos.get_type_id(geom) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment with ref to that shapely issue to make clear why is this separately?

geopandas/_vectorized.py Outdated Show resolved Hide resolved
geopandas/_vectorized.py Outdated Show resolved Hide resolved
geopandas/_vectorized.py Show resolved Hide resolved
geopandas/_vectorized.py Show resolved Hide resolved
geopandas/_config.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

I know that it is likely caused by the conversion to shapely geometry during loops (and momepy is loop-heavy at the moment)

Yes, as long as we are still using shapely as scalar object while using pygeos internally (so as long pygeos is not upstreamed to shapely), there will be a slowdown when not using one of the vectorized functions, but using your custom loops.

Which is certainly annoying.

I also ran the benchmarks, and we see it there as well. For example, all the read_file benchmarks have a 2x slowdown (going to look into that to see if that can already be improved with the currently available pygeos functionality, but it might need to implement conversion from a mapping to pygeos).

@martinfleis I assume that while you don't see a speedup mostly, the slowdown is also not really problematic for the use cases you have? (assuming you wouldn't turn it of, or if you would want to use it for one part, ..)

(and fully agree that we can only make this default once shapely/pygeos integration is better, which is the reason I hope to work on that in the near future!)

@jorisvandenbossche
Copy link
Member Author

@martinfleis Thanks a lot for the review!

@martinfleis
Copy link
Member

I assume that while you don't see a speedup mostly, the slowdown is also not really problematic for the use cases you have?

@jorisvandenbossche my normal use case is to run 50+ functions on a single dataset, so I am mostly interested about the sum of times needed to run each. The simple loop

for i, r in gdf.iterrows():
    pass

can make a difference:

pygeos: 1.72 s ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
master: 813 ms ± 7.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It may add several minutes in total in my case, which is nothing significant as the total run time is usually few hours. It all depends on what happens inside the loop. If it's something simple, then you will notice slowdown. If something heavy, you probably won't.

Anyway, we need to make sure that people using this will be aware of these temporary drawbacks.

@adriangb
Copy link
Contributor

adriangb commented Mar 15, 2020

I ran a quick test to see if caching/memoizing would help, since a lot of the slowdown is repeated conversion. Specifically, I added @lru_cache(maxsize=2048, typed=False) to _vectorized._pygeos_to_shapely. I ran benchmarks for sjoin only (since I knew that slowed down a lot). I also reduced the array sizes because it was taking too long on my laptop.
master:

     op               
------------ ---------
 intersects   166±0ms 
  contains    145±0ms 
   within     124±0ms 

pygeos without caching:

     op               
------------ ---------
 intersects   2.49±0s 
  contains    2.05±0s 
   within     2.81±0s 

pygeos with caching

     op               
------------ ---------
 intersects   270±0ms 
  contains    273±0ms 
   within     298±0ms 

Edit: I did full benchmark runs.
pygeos_nocache.txt
master_nocache.txt
pygeos_cache.txt

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 21, 2020

Updated for feedback.

@martinfleis I also optimized the pygeos <-> shapely conversion (which is only used when pygeos and shapely use the same GEOS version, as in that case pointers are passed through). This should help for the cases you mentioned above where you still iterate through the geometries (and thus constantly convert from pygeos to shapely).

For example, with:

points = GeoSeries( 
    [Point(x, y) for x, y in zip(np.random.random(10000), 
                                 np.random.random(10000))]) 
 
gdf = GeoDataFrame({'val1': np.random.randn(

I get with geopandas.use_pygeos = False:

In [7]: %%timeit 
   ...: for i, r in gdf.iterrows(): 
   ...:     pass 
   ...:      
487 ms ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and using pygeos (and with pygeos and shapely being compatible):

In [10]: %%timeit 
    ...: for i, r in gdf.iterrows(): 
    ...:     pass 
    ...: 
513 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so not much slower anymore.

Moreover, you should also try to avoid iterrows .. ;-)

(with pygeos, and much faster)

In [12]: %%timeit 
    ...: for i in gdf.itertuples(): 
    ...:     pass 
    ...: 
65.4 ms ± 550 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Additionally, I also fixed a huge slowdown in sjoin. Now it is back to more or less the same time (at least for the case included in the benchmark). It's not yet faster, since we don't yet use pygeos' tree in sjoin.

@jorisvandenbossche
Copy link
Member Author

@jorisvandenbossche can't we use bulk_query on pygeos' STRTree (which will check the predicate itself) directly?

Certainly! But that's something I would rather handle in a separate PR (this one is already big enough). And I am going to look at your PR! (sorry for being slow on giving feedback there)

@adriangb
Copy link
Contributor

adriangb commented Mar 21, 2020

Certainly! But that's something I would rather handle in a separate PR (this one is already big enough). And I am going to look at your PR! (sorry for being slow on giving feedback there)

Sounds great! I have some work in progress for that here, I'll open another PR after you merge this one.

@jorisvandenbossche
Copy link
Member Author

I ran a quick test to see if caching/memoizing would help, since a lot of the slowdown is repeated conversion. Specifically, I added @lru_cache(maxsize=2048, typed=False) to _vectorized._pygeos_to_shapely

That's also a cool idea! I think that with the faster version of _pygeos_to_shapely (which clones a geometry pointer, instead of going through WKB), it should be fine now though.

@jorisvandenbossche
Copy link
Member Author

The current results from running the benchmark suite with this PR compared against master:

       before           after         ratio
     [83153dde]       [c07efd2b]
     <master>         <pygeos>  
+      8.79±0.1ms       19.1±0.3ms     2.17  geom_methods.Bench.time_unary_geo('interiors')
+     17.4±0.09ms       28.5±0.4ms     1.64  geom_methods.Bench.time_geom_type
+       113±0.9ms         168±10ms     1.49  plotting.Bench.time_plot_values('Point')
+         118±5ms          160±3ms     1.36  plotting.Bench.time_plot_series('Point')
+      1.23±0.02s          1.35±0s     1.10  io.Bench.time_write_series('.json')
+      1.37±0.02s       1.48±0.01s     1.08  io.Bench.time_write_series('.gpkg')
-         1.48±0s       1.39±0.01s     0.94  overlay.Countries.time_overlay('union')
-         1.48±0s          1.38±0s     0.93  overlay.Countries.time_overlay('identity')
-         478±2ms          361±2ms     0.76  overlay.Countries.time_overlay('intersection')
-         1.92±0s          1.38±0s     0.72  transform.CRS.time_transform_wgs84
-        112±20ms        79.6±10ms     0.71  geom_methods.Bench.time_binary_geo('difference')
-      28.7±0.2ms       19.9±0.5ms     0.70  geom_methods.Bench.time_binary_predicate_vector('geom_equals')
-      28.6±0.2ms       19.7±0.1ms     0.69  geom_methods.Bench.time_binary_predicate_vector('intersects')
-      28.7±0.2ms      19.7±0.09ms     0.69  geom_methods.Bench.time_binary_predicate_vector('crosses')
-      28.5±0.1ms      19.6±0.09ms     0.69  geom_methods.Bench.time_binary_predicate_vector('disjoint')
-     28.9±0.04ms       19.8±0.1ms     0.68  geom_methods.Bench.time_binary_predicate_vector('touches')
-     29.0±0.06ms       19.9±0.2ms     0.68  geom_methods.Bench.time_binary_predicate_vector('contains')
-      28.9±0.2ms      19.7±0.07ms     0.68  geom_methods.Bench.time_binary_predicate_vector('overlaps')
-     29.2±0.09ms      19.6±0.05ms     0.67  geom_methods.Bench.time_binary_predicate_vector('within')
-        18.0±1ms       11.5±0.5ms     0.64  geom_methods.Bench.time_binary_predicate('touches')
-      17.0±0.2ms      10.3±0.04ms     0.61  geom_methods.Bench.time_unary_predicate('is_valid')
-      18.4±0.8ms       11.0±0.9ms     0.60  geom_methods.Bench.time_binary_predicate('intersects')
-         123±3ms         71.5±6ms     0.58  geom_methods.Bench.time_binary_geo('union')
-        18.1±2ms         10.2±1ms     0.56  geom_methods.Bench.time_binary_predicate('crosses')
-      72.1±0.7ms       37.8±0.5ms     0.52  geom_methods.Bench.time_binary_geo_vector('union')
-      62.6±0.6ms       32.7±0.2ms     0.52  geom_methods.Bench.time_binary_geo_vector('difference')
-      72.1±0.4ms       37.5±0.4ms     0.52  geom_methods.Bench.time_binary_geo_vector('intersection')
-        19.8±1ms         10.3±2ms     0.52  geom_methods.Bench.time_binary_predicate('disjoint')
-      63.0±0.9ms       32.5±0.2ms     0.52  geom_methods.Bench.time_binary_geo_vector('symmetric_difference')
-        19.9±1ms         10.2±1ms     0.51  geom_methods.Bench.time_binary_predicate('overlaps')
-     7.84±0.05ms      3.80±0.03ms     0.49  geom_methods.Bench.time_unary_predicate('is_simple')
-         115±7ms         54.7±8ms     0.47  geom_methods.Bench.time_binary_geo('intersection')
-        159±10ms        63.1±10ms     0.40  geom_methods.Bench.time_binary_geo('symmetric_difference')
-      9.05±0.1ms      2.70±0.07ms     0.30  geom_methods.Bench.time_binary_predicate_vector('geom_almost_equals')
-      15.4±0.06s       4.38±0.02s     0.28  sjoin.Bench.time_sjoin('within')
-      15.5±0.4ms       3.72±0.4ms     0.24  geom_methods.Bench.time_binary_float('distance')
-      16.1±0.1ms      3.03±0.01ms     0.19  geom_methods.Bench.time_binary_float_vector('distance')
-      25.4±0.3ms      3.80±0.03ms     0.15  geom_methods.Bench.time_unary_predicate('is_ring')
-      14.5±0.3ms      2.15±0.02ms     0.15  geom_methods.Bench.time_unary_geo('convex_hull')
-      4.86±0.4ms        550±700μs     0.11  geom_methods.Bench.time_binary_predicate('contains')
-      8.72±0.1ms         978±10μs     0.11  geom_methods.Bench.time_unary_geo('exterior')
-     13.3±0.08ms      1.44±0.01ms     0.11  geom_methods.Bench.time_unary_geo('centroid')
-      13.3±0.1ms      1.22±0.01ms     0.09  geom_methods.Bench.time_unary_geo('envelope')
-      13.8±0.1ms      1.26±0.02ms     0.09  geom_methods.Bench.time_unary_geo_representative_point
-     12.3±0.05ms          968±3μs     0.08  geom_methods.Bench.time_unary_geo('boundary')
-         1.24±0s       70.6±0.5ms     0.06  transform.CRS.time_transform_many_points
-     6.05±0.03ms        293±0.8μs     0.05  geom_methods.Bench.time_binary_predicate('geom_almost_equals')
-     4.38±0.02ms          175±3μs     0.04  geom_methods.Bench.time_binary_predicate('geom_equals')
-     2.99±0.02ms        105±0.3μs     0.04  geom_methods.Bench.time_unary_predicate('is_empty')
-     33.3±0.06ms          969±5μs     0.03  geom_methods.Bench.time_unary_float('length')
-      32.9±0.5ms          886±3μs     0.03  geom_methods.Bench.time_unary_float('area')
-      58.9±0.9ms         845±30μs     0.01  geom_methods.Bench.time_bounds

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

And some benchmarks are now shown because they didn't change significantly (eg reading, sjoin).

Interiors is slower because that is still using the old implementation, geom_type because I am doing a slow conversion from int ids to strings (that should be easy to optimize). A bunch of the binary predicates should still be faster once pygeos uses prepared geometries.

@martinfleis
Copy link
Member

Thanks for the changes and benchmarks! That certainly looks like an improvement and once we'll fully switch, it'll be awesome :).

I have no more comments, so if you're happy we can merge and release beta.

@jorisvandenbossche
Copy link
Member Author

OK, let's go ahead with this then .. ;) Exciting!

@jorisvandenbossche jorisvandenbossche merged commit 5d1181a into geopandas:master Mar 24, 2020
@martinfleis
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrating pygeos in GeoPandas for vectorized array operations
6 participants