Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cython] specific case where new sjoin is much slower #563

Closed
jorisvandenbossche opened this issue Sep 27, 2017 · 18 comments
Closed

[cython] specific case where new sjoin is much slower #563

jorisvandenbossche opened this issue Sep 27, 2017 · 18 comments

Comments

@jorisvandenbossche
Copy link
Member

@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4

I ran it locally and I can confirm this. It is joining 20,000 points with 44,000 polygons (this only takes ca 5s on master, but 30-60s on the cython branch).

I tried to profile it, but it seems to indicate that virtually all time is spent within the cython cysjoin function (and thus c sjoin fucntion). Which is also strange because also the actual pandas code in the user-facing sjoin function should take some time.
I did not yet check that the actual results of both versions are the same; possibly one of both implementations is doing something wrong.

cc @mrocklin

@andreas-h could you simplify the example a little bit? (to not depend on the emiprepr library, eg just construct the polygons directly inside the notebook)

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

Finding cases where spatial joins are not running optimally would be very welcome. Even knowing that such cases exist gives me a lot of optimism that we can improve things further. @andreas-h if you do have time to make an example like this I would really appreciate it. The first few cells in this notebook provide a decent baseline dataset from which to work.

I'm curious, does your performance change significantly if you replace within with intersects?

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

There was some special logic around within that I didn't carry over. I'll implement that and push up a branch. If you have a chance to test it I'd be grateful.

@jorisvandenbossche
Copy link
Member Author

@mrocklin I converted the notebook of @andreas-h to a reproducible example without the additional dependencies (but with the same characteristics: joining points within a grid of polygons): http://nbviewer.ipython.org/ca67a9681ae7386beeff89a003911df9

@jorisvandenbossche
Copy link
Member Author

From a quick test, switching to 'intersects' does not seem to help, neither does switching the order and using 'contains' (updated the notebook, they are both even slower)

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

I'm getting 25s for geopandas-cython, 11s for this branch, and 4s for master

I'll reactivate some profiling code at the c level and see if that helps to identify the bottleneck.

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

We spend almost all of this time querying the STRTree in this line:

GEOSSTRtree_query_r(handle, tree, left[l], strtree_query_callback, &vec);

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

The next thing to try is probably to switch out GEOS' STRTree implementation with libspatialindex, which is what we were using before. The internet has good things to say about libspatialindex.

I think that it has a C-API. I'll take a look soon.

@jorisvandenbossche
Copy link
Member Author

Yes, I was also thinking that libspatialindex is possibly better optimized than GEOS' STRTree (that would give another C dependency however ..)

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

That C dependency is already pretty common. Master branch depends on it currently.

@jorisvandenbossche
Copy link
Member Author

Yes, I know. But depending on it through rtree is a bit easier (and it is optional now as well), as all compatibility checking / issues are handled there. Also if we want to keep our own sindex property using rtree, then the user also has to make sure that both geopandas and rtree are build against the same libspatialindex.
Anyhow, probably nothing to do about it for optimal performance I suppose :-), it is just another added complexity for building / installing

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

OK. I took a look at libspatialindex. This is doable to try, but will take some effort, I suspect around a day. I'm not sure when I'll next have a full day free for something like this. I wouldn't expect this to be done any time in the next week or two.

@brendancol
Copy link

@mrocklin FYI libspatialindex has some thread-safety issues libspatialindex/libspatialindex#71

@mrocklin
Copy link
Member

mrocklin commented Oct 1, 2017

Yes, we may have to be careful at times. I'm not yet particularly concerned about this.

@snowman2
Copy link
Contributor

snowman2 commented May 26, 2019

I did some profiling of the two different tools in python (gist link):
image

It definitely appears that rtree (libspatialindex) has much better performance when looking up geometries. Although it is a bit slower to create the rtree index, it is a one time operation and the lookup speed is around 35x faster.

@adriangb
Copy link
Contributor

I re-ran these tests (gist), I'm posting here as well as in #1344 to try and give some closure to this issue.

Namely, I added PyGEOS which also uses GEOS' STRTree but different Python binding and geometry data structures:
build
query

So it seems to me that most of the slowdown comes from Shapely/Python stuff, not GEOS.

@snowman2
Copy link
Contributor

pygeos is impressive!

@jorisvandenbossche
Copy link
Member Author

@adriangb thanks for testing!
Hmm, something must have been wrong in the old c/cython implementation of this (although it is using almost the same code / approach as what we have in pygeos now).

But I can confirm your findings, as I also ran my original notebook, and the sjoin on those data in master now takes around 4s, and doing the bulk query with pygeos takes less than 200ms (while this took 30s in the notebook with the old c/cython implementation):

In [12]: %time joined = geopandas.sjoin(rgeoms, grid, op='within')
CPU times: user 3.43 s, sys: 3.05 ms, total: 3.44 s
Wall time: 3.45 s

In [13]: %%timeit
    ...: tree = pygeos.STRtree(array_grid)
    ...: idx1, idx2= tree.query_bulk(array_rgeoms, predicate="within")
179 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The bulk query here still needs to index the dataframes and merge them, to be equivalent to the sjoin, but that's not very expensive (not seconds, at least).

@martinfleis
Copy link
Member

I assume that this is no longer relevant :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants