ENH: implement sindex.nearest_N_neigbours for both rtree and strtree in a consistent way (equal results) #1509

Open
srenoes opened this issue Jul 9, 2020 · 1 comment

Comments

@srenoes
Contributor

srenoes commented Jul 9, 2020

Not quite sure what the plan is. Pull request #1271 contains most of the discussion, but it is not an issue. Nearest-neighbour indexing is also partly addressed in #1455. The original related issue seems to be #1096.

My suggestion would be to make all the changes in geopandas, so that the differences between GEOS and rtree are taken into account there.

There were some issues with that: in pygeos only one nearest neighbour is returned, not several when they are at the same distance.
See the discussion in issue pygeos/pygeos#110 and the pygeos pull request pygeos/pygeos#111.

Part of the idea, it seems, was to have a method with two additional inputs, N_neighbours and maxdistance (#1271).

This is relatively easy to implement with existing methods in geopandas, and through those it would work with both rtree and strtree. For pygeos, an implementation could look like this:

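import pandas

# The function wrapper and its name are illustrative, added so that the return is valid.
def nearest_N_neighbours(gdf, gdf2, N_neighbours, max_distance):
    sindex = gdf.sindex
    # query_bulk returns two rows: positions in gdf2 (input), then positions in gdf (tree)
    pairs = sindex.query_bulk(gdf2.geometry.buffer(max_distance), predicate='intersects').T

    # reset the index so that distance() compares the geometries row by row
    geom1 = gdf.geometry.iloc[pairs[:, 1]].reset_index(drop=True)
    geom2 = gdf2.geometry.iloc[pairs[:, 0]].reset_index(drop=True)

    distance = geom1.distance(geom2)

    df_for_sorting = pandas.DataFrame({'geom2_idx': pairs[:, 0], 'geom1_idx': pairs[:, 1], 'distance': distance})
    # keep the N smallest distances per geom1_idx, including ties
    smallest = df_for_sorting.groupby('geom1_idx')['distance'].nsmallest(N_neighbours, keep='all')
    indices_to_keep = smallest.index.get_level_values(-1)
    newpairs = df_for_sorting.loc[indices_to_keep, ['geom2_idx', 'geom1_idx']].to_numpy().T
    return newpairs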

For rtree it can (but does not have to) be different, maybe usefully so.
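
To check that both backends really give equal results, a brute-force reference could help. The following is only a minimal sketch in plain Python on points. The function name `brute_force_nearest` and the tuple-based inputs are assumptions for illustration, not geopandas or pygeos API. It keeps every neighbour tied at the N-th distance, which is the behaviour discussed above.

```python
import math


def brute_force_nearest(queries, targets, n_neighbours, max_distance):
    """Return (query_idx, target_idx) pairs, keeping ties at the n-th distance."""
    pairs = []
    for qi, (qx, qy) in enumerate(queries):
        # Distance from this query point to every target, sorted ascending.
        dists = sorted(
            (math.hypot(qx - tx, qy - ty), ti)
            for ti, (tx, ty) in enumerate(targets)
        )
        # Drop everything beyond max_distance.
        dists = [(d, ti) for d, ti in dists if d <= max_distance]
        if not dists or n_neighbours < 1:
            continue
        # The cut-off is the n-th smallest distance (or the largest one
        # available if fewer than n targets are in range).
        cutoff = dists[min(n_neighbours, len(dists)) - 1][0]
        pairs.extend((qi, ti) for d, ti in dists if d <= cutoff)
    return pairs


queries = [(0.0, 0.0)]
targets = [(1.0, 0.0), (0.0, 1.0), (3.0, 0.0), (10.0, 0.0)]
# Two targets are tied at distance 1, so asking for one neighbour returns both.
print(brute_force_nearest(queries, targets, n_neighbours=1, max_distance=5.0))
# → [(0, 0), (0, 1)]
```

Any rtree or strtree implementation could be compared against the output of such a reference on small random inputs.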

Performance-wise this might be a problem with a large maxdistance (too many intersections to calculate distances for). In rtree, the candidates can first be filtered by N nearest neighbours and then by distance.
What about the other performance issues with building the tree mentioned in #1271, @martinfleis? Or are these solved by using the strtree from pygeos for big geometries?
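
As far as I can tell, the two filtering orders (distance first, then N nearest, or N nearest first, then distance) select the same neighbours as long as ties are kept. Here is a small sketch in plain Python to illustrate that. All function names are hypothetical, and it only demonstrates that the results are equal, not the performance difference, since no spatial index is involved.

```python
import math


def _distances(query, targets):
    # (distance, target position) for every target point
    qx, qy = query
    return [(math.hypot(qx - tx, qy - ty), i) for i, (tx, ty) in enumerate(targets)]


def _k_nearest_with_ties(dists, k):
    # Keep the k smallest distances, plus everything tied with the k-th one.
    dists = sorted(dists)
    if not dists or k < 1:
        return []
    cutoff = dists[min(k, len(dists)) - 1][0]
    return [(d, i) for d, i in dists if d <= cutoff]


def distance_then_k(query, targets, k, max_distance):
    # Buffer-style approach: restrict by distance, then pick the k nearest.
    within = [(d, i) for d, i in _distances(query, targets) if d <= max_distance]
    return [i for _, i in _k_nearest_with_ties(within, k)]


def k_then_distance(query, targets, k, max_distance):
    # rtree-style approach: pick the k nearest, then drop those too far away.
    nearest = _k_nearest_with_ties(_distances(query, targets), k)
    return [i for d, i in nearest if d <= max_distance]


targets = [(1.0, 0.0), (0.0, 1.0), (3.0, 0.0), (10.0, 0.0)]
print(distance_then_k((0.0, 0.0), targets, k=3, max_distance=5.0))
print(k_then_distance((0.0, 0.0), targets, k=3, max_distance=5.0))
```

Both calls print the same positions, so the order could be chosen per backend purely on performance grounds.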

Any opinions about that?

@adriangb
Contributor

adriangb commented Aug 26, 2020

I still think this should be implemented using the algorithm proposed in pygeos/pygeos#111 (comment) or something similar, ideally in PyGEOS/Shapely, since it will be much faster.

Regarding the other two inputs proposed in #1271, I think those only make sense in the context of the rtree solution.

2 participants