[cython] specific case where new sjoin is much slower #563
Comments
Finding cases where spatial joins are not running optimally would be very welcome. Even knowing that such cases exist gives me a lot of optimism that we can improve things further. @andreas-h if you do have time to make an example like this I would really appreciate it. The first few cells in this notebook provide a decent baseline dataset from which to work. I'm curious, does your performance change significantly if you replace |
There was some special logic around within that I didn't carry over. I'll implement that and push up a branch. If you have a chance to test it I'd be grateful. |
@mrocklin I converted the notebook of @andreas-h to a reproducible example without the additional dependencies (but with the same characteristics: joining points within a grid of polygons): http://nbviewer.ipython.org/ca67a9681ae7386beeff89a003911df9 |
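For readers without access to the notebook, here is a minimal sketch of a test case with the same shape: random points joined against a regular grid of square cells. It uses only the Python standard library. The grid size, point count, and helper names (`make_grid`, `make_points`, `brute_force_within`) are assumptions for illustration and are not taken from the notebook. The brute-force loop stands in for the 'within' predicate and is only meant as a slow reference answer, not as a replacement for sjoin.

```python
import random


def make_grid(nx, ny, cell=1.0):
    """Return a dict mapping cell id -> (xmin, ymin, xmax, ymax)."""
    cells = {}
    for i in range(nx):
        for j in range(ny):
            cells[i * ny + j] = (i * cell, j * cell, (i + 1) * cell, (j + 1) * cell)
    return cells


def make_points(n, nx, ny, cell=1.0, seed=0):
    """Return n reproducible random (x, y) points covering the grid extent."""
    rng = random.Random(seed)
    return [(rng.uniform(0, nx * cell), rng.uniform(0, ny * cell)) for _ in range(n)]


def brute_force_within(points, cells):
    """Reference join: (point index, cell id) pairs for points strictly inside a cell."""
    pairs = []
    for p_idx, (x, y) in enumerate(points):
        for c_idx, (xmin, ymin, xmax, ymax) in cells.items():
            if xmin < x < xmax and ymin < y < ymax:
                pairs.append((p_idx, c_idx))
    return pairs


# Illustrative sizes only; the case in this issue is far larger.
cells = make_grid(20, 20)
points = make_points(500, 20, 20)
pairs = brute_force_within(points, cells)
print(len(cells), "cells,", len(points), "points,", len(pairs), "matches")
```

Because the cells do not overlap, each point can match at most one cell, which makes the expected output easy to reason about when comparing implementations.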
From a quick test, switching to 'intersects' does not seem to help, neither does switching the order and using 'contains' (updated the notebook, they are both even slower) |
I'm getting 25s for geopandas-cython, 11s for this branch, and 4s for master. I'll reactivate some profiling code at the C level and see if that helps to identify the bottleneck. |
We spend almost all of this time querying the STRTree in this line: GEOSSTRtree_query_r(handle, tree, left[l], strtree_query_callback, &vec); |
The next thing to try is probably to switch out GEOS' STRTree implementation with libspatialindex, which is what we were using before. The internet has good things to say about libspatialindex. I think that it has a C-API. I'll take a look soon. |
Yes, I was also thinking that libspatialindex is possibly better optimized than GEOS' STRTree (that would give another C dependency however ..) |
That C dependency is already pretty common. Master branch depends on it currently. |
Yes, I know. But depending on it through rtree is a bit easier (and it is optional now as well), as all compatibility checking / issues are handled there. Also if we want to keep our own |
OK. I took a look at libspatialindex. This is doable to try, but will take some effort, I suspect around a day. I'm not sure when I'll next have a full day free for something like this. I wouldn't expect this to be done any time in the next week or two. |
@mrocklin FYI |
Yes, we may have to be careful at times. I'm not yet particularly concerned about this. |
I did some profiling of the two different tools in python (gist link): It definitely appears that |
I re-ran these tests (gist). I'm posting here as well as in #1344 to try to give some closure to this issue. Namely, I added PyGEOS, which also uses GEOS' STRTree but with different Python bindings and geometry data structures: So it seems to me that most of the slowdown comes from Shapely/Python stuff, not GEOS. |
pygeos is impressive! |
@adriangb thanks for testing! But I can confirm your findings, as I also ran my original notebook, and the
The bulk query here still needs to index the dataframes and merge them, to be equivalent to the |
I assume that this is no longer relevant :). |
@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4
I ran it locally and I can confirm this. It joins 20,000 points with 44,000 polygons (this takes only ca. 5s on master, but 30-60s on the cython branch).
I tried to profile it, but the profile seems to indicate that virtually all time is spent within the cython cysjoin function (and thus the C sjoin function). That is also strange, because the actual pandas code in the user-facing sjoin function should take some time as well.
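As a rough illustration of this kind of profiling run, here is a minimal sketch using `cProfile` and `pstats` from the standard library. The functions `expensive_core` and `join_wrapper` are hypothetical stand-ins for a compiled core routine and its user-facing wrapper; they are not the geopandas functions, and the workload is invented.

```python
import cProfile
import io
import pstats


def expensive_core(n):
    # Hypothetical stand-in for the compiled join routine.
    total = 0
    for i in range(n):
        total += i * i
    return total


def join_wrapper(n):
    # Hypothetical stand-in for the user-facing function wrapping the core.
    result = expensive_core(n)
    return [result]


profiler = cProfile.Profile()
profiler.enable()
join_wrapper(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time shows how much of the wrapper's time is spent in the function it calls. If the wrapper's own time is close to zero, nearly everything is attributed to the inner call, which is the pattern described above.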
I have not yet checked that the results of both versions are the same; possibly one of the two implementations is doing something wrong.
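A sketch of how such a check could look, assuming the output of each implementation can be reduced to a list of (left index, right index) pairs. The helper name `compare_join_pairs` and the sample pairs are hypothetical, made up purely to show the comparison.

```python
def compare_join_pairs(pairs_a, pairs_b):
    """Return the pairs found only in a and only in b, ignoring row order."""
    set_a, set_b = set(pairs_a), set(pairs_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Invented sample data: the same matches in a different order, plus one mismatch.
master_pairs = [(0, 3), (1, 7), (2, 7)]
cython_pairs = [(2, 7), (0, 3), (1, 8)]

only_master, only_cython = compare_join_pairs(master_pairs, cython_pairs)
print("only in master:", only_master)
print("only in cython:", only_cython)
```

Comparing as sets means that row order does not matter, but it also hides duplicated rows, so a row count comparison would be a sensible companion check.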
cc @mrocklin
@andreas-h could you simplify the example a little bit? (to not depend on the emiprepr library, e.g. just construct the polygons directly inside the notebook)