Vectorized distance produce skewed result #673
Comments
If you do
Due to the alignment, you see NaNs at the end of the result, because both frames are probably not the same length. That is also the reason that
What was the desired result? Do you want, for each point, the distance to the closest harbor?
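The alignment behaviour described above can be reproduced with a small sketch. The series names and coordinates here are made up for illustration, and the exact warning text depends on the geopandas version.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical data: three points, but only two harbours
points = gpd.GeoSeries([Point(0, 0), Point(1, 1), Point(2, 2)], index=[0, 1, 2])
harbours = gpd.GeoSeries([Point(0, 1), Point(1, 2)], index=[0, 1])

# Series-to-series distance is computed row by row after aligning on the index.
# Index 2 has no partner in `harbours`, so that row comes out as NaN.
aligned = points.distance(harbours)

# Passing a single shapely geometry instead broadcasts it against every row.
to_one_harbour = points.distance(Point(0, 1))

print(aligned)
print(to_one_harbour)
```

So `a.distance(b)` answers "how far is row i of a from row i of b", not "how far is each row of a from the closest row of b".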
Indeed, the description you gave is consistent with the result; however, this is not obvious from the description of the function (Returns a
The end use case is to have something like nearest_point from shapely, but vectorized and without the need to produce a cascaded union, i.e. to receive the row id of, and the distance to, the nearest other geometry.
Where did you find this description? I get
(but I agree this is certainly not clear enough! a better explanation and some examples would help here)
This is a feature that I would like to see in geopandas as well (implement vectorized versions of those shapely.ops), but is not yet implemented. For now, what you can do is something like:
This will not be fully vectorized, but at least it will be vectorized for each point.
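A minimal sketch of that per-point approach, assuming point geometries and hypothetical names (`nearest_brute_force`, `ships`, `harbours`); it is an illustration of the idea, not the exact snippet from this comment:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

def nearest_brute_force(points, candidates):
    """For each geometry in `points`, find the closest geometry in `candidates`."""
    rows = []
    for pt in points:
        # Vectorized over all candidates, but still a Python loop over points
        d = candidates.distance(pt)
        rows.append((d.idxmin(), d.min()))
    return pd.DataFrame(rows, columns=["nearest_id", "distance"], index=points.index)

# Hypothetical example data
ships = gpd.GeoSeries([Point(0, 0), Point(5, 5)])
harbours = gpd.GeoSeries([Point(1, 0), Point(5, 7), Point(10, 10)])

result = nearest_brute_force(ships, harbours)
print(result)
```

The cost grows with the number of points times the number of candidates, so this is only practical for modest sizes.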
Ah, I see this is in the latest master version (not the cython branch): http://geopandas.readthedocs.io/en/latest/reference.html#geopandas.GeoSeries.distance
#674 should at least resolve the ambiguity. I will follow up later with a PR adding some more explanation (you are always welcome to open a PR yourself!)
Oh no, the description is from https://github.com/geopandas/geopandas/blob/geopandas-cython/geopandas/base.py. The intention of this issue was to clean up the description so that no one else spends half an hour trying to figure out what happened. BTW, IMHO I would recommend using sindex.nearest(self, other, x>1) before df.distance for geometries of any reasonable size (x should be more than 1 because the bounding box is not the geometry). However, this makes the algorithm fully unvectorized.
SciPy's cKDTree can be used to get a nearest-neighbor solution that operates on geopandas dataframes and is effectively vectorized. It is orders of magnitude faster than the brute-force method @jorisvandenbossche suggested (find all pairwise distances and then take the minimum), and faster even than the RTree spatial index nearest method that @avnovikov suggested (which is fast for a single-point lookup but requires looping over rows if we want the nearest neighbors of all points in a geodataframe). The code below illustrates how to use the cKDTree query method to write a function that operates on two dataframes, finding for each point in dataframe
I'm not a developer, but given the tremendous speed-up and usefulness of the cKDTree methods, could functionality like this be built into a future geopandas release?
Now the helper function
Let's test it by searching for the nearest harbor (of N=1000 harbors) to each of N=1000 ships. The function returns a two-column dataframe with the distance and the value of the 'harbor_id' column.
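A minimal sketch of such a helper, assuming both geodataframes contain only point geometries in the same projected CRS. The function name `ckd_nearest` and the tiny example frames are hypothetical, not the code originally posted here.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point

def ckd_nearest(gdf_a, gdf_b, b_col):
    """For each point in gdf_a, return the distance to the nearest point
    in gdf_b together with that point's value in column `b_col`."""
    a_xy = np.column_stack([gdf_a.geometry.x, gdf_a.geometry.y])
    b_xy = np.column_stack([gdf_b.geometry.x, gdf_b.geometry.y])
    tree = cKDTree(b_xy)                # build the tree once on the candidates
    dist, pos = tree.query(a_xy, k=1)   # one batched query for all of gdf_a
    return pd.DataFrame(
        {"distance": dist, b_col: gdf_b[b_col].to_numpy()[pos]},
        index=gdf_a.index,
    )

# Hypothetical example data
ships = gpd.GeoDataFrame({"ship_id": [0, 1]}, geometry=[Point(0, 0), Point(5, 5)])
harbours = gpd.GeoDataFrame(
    {"harbor_id": ["A", "B", "C"]},
    geometry=[Point(1, 0), Point(5, 7), Point(10, 10)],
)

nearest = ckd_nearest(ships, harbours, "harbor_id")
print(nearest)
```

The tree works on raw coordinates with Euclidean distance, so it suits points in a projected CRS; lines or polygons would need a different approach.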
And time it:
Compare this with the brute-force method (slightly adapting @jorisvandenbossche's code to also return a two-column dataframe):
The cKDTree method is efficient (I think I read somewhere that a lookup is O(log N)). When I increase the dataframe sizes to N=100,000 rows (so the potential pairwise distance comparisons rise to 10 billion), the cKDTree method can still find the 100,000 nearest-neighbor points in about 4.6 s.
Is it possible to compare arrays of different sizes? I keep getting 'ValueError: arrays must all be same length' when comparing geodataframes of different sizes :/
Hi @caiodu,
GeoPandas got completely confused when calculating the distance between two objects. geo_points and gdf_harbours are GeoDataFrames with a few thousand rows
while
and
I was unable to reconstruct this result using binary_vector_float as
kills the notebook's kernel immediately.
My versions are