[cython] specific case where new sjoin is much slower #563

jorisvandenbossche · 2017-09-27T22:22:46Z

@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4

I ran it locally and can confirm this. The example joins 20,000 points with 44,000 polygons, which takes only about 5s on master but 30-60s on the cython branch.

I tried to profile it, but the profile indicates that virtually all of the time is spent within the cython cysjoin function (and thus the C sjoin function). That is also strange, because the actual pandas code in the user-facing sjoin function should take some time as well.
I have not yet checked that the results of both versions are the same; possibly one of the two implementations is doing something wrong.

cc @mrocklin

@andreas-h could you simplify the example a little bit, so that it does not depend on the emiprepr library (e.g. by constructing the polygons directly inside the notebook)?

mrocklin · 2017-10-01T14:15:41Z

Finding cases where spatial joins are not running optimally would be very welcome. Even knowing that such cases exist gives me a lot of optimism that we can improve things further. @andreas-h if you do have time to make an example like this I would really appreciate it. The first few cells in this notebook provide a decent baseline dataset from which to work.

I'm curious, does your performance change significantly if you replace within with intersects?

mrocklin · 2017-10-01T14:18:57Z

There was some special logic around within that I didn't carry over. I'll implement that and push up a branch. If you have a chance to test it I'd be grateful.

jorisvandenbossche · 2017-10-01T15:13:54Z

@mrocklin I converted the notebook of @andreas-h to a reproducible example without the additional dependencies (but with the same characteristics: joining points within a grid of polygons): http://nbviewer.ipython.org/ca67a9681ae7386beeff89a003911df9

jorisvandenbossche · 2017-10-01T15:18:00Z

From a quick test, switching to 'intersects' does not seem to help, and neither does switching the order and using 'contains' (I updated the notebook; both are even slower).

mrocklin · 2017-10-01T15:23:19Z

I'm getting 25s for geopandas-cython, 11s for this branch, and 4s for master

I'll reactivate some profiling code at the c level and see if that helps to identify the bottleneck.

mrocklin · 2017-10-01T15:28:56Z

We spend almost all of this time querying the STRTree in this line:

GEOSSTRtree_query_r(handle, tree, left[l], strtree_query_callback, &vec);
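
For readers unfamiliar with this call, here is a minimal pure-Python sketch of the pattern it follows: for each left geometry, the index is queried with that geometry's bounding box and a callback collects the candidate hits. `BoxIndex`, `boxes_overlap` and `candidate_pairs` are hypothetical names, and the linear scan is only a stand-in for the tree, not how GEOS' STRtree works internally.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class BoxIndex:
    """Hypothetical stand-in for a spatial index: a linear scan over boxes."""

    def __init__(self, boxes: List[Box]) -> None:
        self.boxes = boxes

    def query(self, box: Box, callback: Callable[[int], None]) -> None:
        # Call `callback` once per stored box that overlaps `box`,
        # mirroring the callback style of GEOSSTRtree_query_r.
        for i, candidate in enumerate(self.boxes):
            if boxes_overlap(candidate, box):
                callback(i)


def candidate_pairs(left: List[Box], index: BoxIndex) -> List[Tuple[int, int]]:
    """Collect (left_position, right_position) candidate pairs."""
    pairs: List[Tuple[int, int]] = []
    for l, box in enumerate(left):
        # Bind `l` as a default argument so each callback records its own row.
        index.query(box, lambda r, l=l: pairs.append((l, r)))
    return pairs


# Two unit grid cells and two points (points are zero-area boxes).
grid = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0)]
points = [(0.5, 0.5, 0.5, 0.5), (1.5, 0.5, 1.5, 0.5)]
pairs = candidate_pairs(points, BoxIndex(grid))
print(pairs)  # [(0, 0), (1, 1)]
```

The pairs found this way are only bounding-box candidates; a join would still apply the exact predicate (for example within) to each pair afterwards.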

mrocklin · 2017-10-01T15:51:43Z

The next thing to try is probably to swap out GEOS' STRTree implementation for libspatialindex, which is what we were using before. The internet has good things to say about libspatialindex.

I think that it has a C-API. I'll take a look soon.

jorisvandenbossche · 2017-10-01T15:56:46Z

Yes, I was also thinking that libspatialindex is possibly better optimized than GEOS' STRTree (although that would add another C dependency ...)

mrocklin · 2017-10-01T15:57:30Z

That C dependency is already pretty common. Master branch depends on it currently.

jorisvandenbossche · 2017-10-01T16:00:40Z

Yes, I know. But depending on it through rtree is a bit easier (and it is optional now as well), as all compatibility checking / issues are handled there. Also, if we want to keep our own sindex property using rtree, then the user also has to make sure that both geopandas and rtree are built against the same libspatialindex.
Anyhow, there is probably nothing to do about it if we want optimal performance, I suppose :-). It is just another added complexity for building / installing.

mrocklin · 2017-10-01T16:23:44Z

OK. I took a look at libspatialindex. This is doable to try, but will take some effort, I suspect around a day. I'm not sure when I'll next have a full day free for something like this. I wouldn't expect this to be done any time in the next week or two.

brendancol · 2017-10-01T16:29:08Z

@mrocklin FYI libspatialindex has some thread-safety issues, see libspatialindex/libspatialindex#71

mrocklin · 2017-10-01T23:36:15Z

Yes, we may have to be careful at times. I'm not yet particularly concerned about this.

snowman2 · 2019-05-26T01:02:51Z

I did some profiling of the two different tools in python (gist link):

It definitely appears that rtree (libspatialindex) has much better performance when looking up geometries. Although it is a bit slower to create the rtree index, that is a one-time operation, and the lookup speed is around 35x faster.

adriangb · 2020-03-25T23:58:00Z

I re-ran these tests (gist). I'm posting here as well as in #1344 to try and give some closure to this issue.

Namely, I added PyGEOS, which also uses GEOS' STRTree but with different Python bindings and geometry data structures:

So it seems to me that most of the slowdown comes from Shapely/Python stuff, not GEOS.

snowman2 · 2020-03-26T00:34:50Z

pygeos is impressive!

jorisvandenbossche · 2020-03-26T21:34:50Z

@adriangb thanks for testing!
Hmm, something must have been wrong in the old c/cython implementation of this (although it is using almost the same code / approach as what we have in pygeos now).

But I can confirm your findings, as I also ran my original notebook, and the sjoin on those data in master now takes around 4s, and doing the bulk query with pygeos takes less than 200ms (while this took 30s in the notebook with the old c/cython implementation):

In [12]: %time joined = geopandas.sjoin(rgeoms, grid, op='within')
CPU times: user 3.43 s, sys: 3.05 ms, total: 3.44 s
Wall time: 3.45 s

In [13]: %%timeit
    ...: tree = pygeos.STRtree(array_grid)
    ...: idx1, idx2= tree.query_bulk(array_rgeoms, predicate="within")
179 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The bulk query here still needs to index the dataframes and merge them, to be equivalent to the sjoin, but that's not very expensive (not seconds, at least).
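
The remaining indexing and merging step could look roughly like the sketch below. It assumes `idx1` holds positions into the left (queried) table and `idx2` positions into the right (tree) table. The helper `join_from_indices`, the toy tables and the index arrays are made up for illustration; this is not geopandas' actual sjoin code and it does not reproduce sjoin's exact column naming.

```python
import numpy as np
import pandas as pd


def join_from_indices(left, right, left_idx, right_idx):
    """Assemble an inner-join table from positional match indices."""
    # Repeat the left rows once per match, keeping the left index labels.
    joined = left.iloc[left_idx].copy()
    matched_right = right.iloc[right_idx]
    # Record which right-hand row each left row matched.
    joined["index_right"] = matched_right.index.to_numpy()
    # Append the right-hand attribute columns positionally.
    for col in matched_right.columns:
        name = f"{col}_right" if col in left.columns else col
        joined[name] = matched_right[col].to_numpy()
    return joined


# Hypothetical attribute tables standing in for rgeoms and grid.
points = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 11, 12])
cells = pd.DataFrame({"cell": ["x", "y"]}, index=[100, 101])

# Made-up query output: point 0 lies in cell 1, point 2 in cell 0.
idx1 = np.array([0, 2])
idx2 = np.array([1, 0])

joined = join_from_indices(points, cells, idx1, idx2)
print(joined)
```

Only positional indexing and column assignment are involved here, which is consistent with this step being cheap compared to the spatial query itself.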

martinfleis · 2023-01-05T22:29:49Z

I assume that this is no longer relevant :).

jorisvandenbossche added the geopandas-cython label Sep 27, 2017

mrocklin mentioned this issue Oct 1, 2017

Implement sjoin within with contains #575

Merged

jorisvandenbossche mentioned this issue Mar 25, 2020

API: How to deal with different spatial index implementations? #1344

Open

jorisvandenbossche added ops:sjoin performance and removed geopandas-cython labels Apr 19, 2020

martinfleis closed this as completed Jan 5, 2023
