Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized distance produce skewed result #673

Open
avnovikov opened this issue Feb 16, 2018 · 9 comments
Open

Vectorized distance produce skewed result #673

avnovikov opened this issue Feb 16, 2018 · 9 comments

Comments

@avnovikov
Copy link

avnovikov commented Feb 16, 2018

GeoPandas got completely confused when calculating distance between two objects. geo_points and gdf_harbours are GeoDataFrames with few thousand rows

geo_points.distance(gdf_harbours)
 0        138.419503
 1        138.464243
 2        138.425727
 ...
 65496           NaN
 65497           NaN
 65498           NaN
 Length: 65499, dtype: float64

while

gdf_harbours.distance(geo_points.loc[0]).min()
Out[47]: 7.344255335164139

and

gdf_harbours.distance(geo_points.loc[65498]).min()
Out[48]: 0.00654932231511796

I was unable to reconstruct this result using binary_vector_float as

gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data)

kills notebook's kernel immediately.
My versions are

geopandas                 1.0.0.dev                py36_1    conda-forge/label/dev
geos                      3.6.2                h5470d99_2
@avnovikov avnovikov changed the title Vectorized distance produce skewув result Vectorized distance produce skewed result Feb 16, 2018
@jorisvandenbossche
Copy link
Member

GeoPandas got completely confused when calculating distance between two objects.

If you do geo_points.distance(gdf_harbours) where both geo_points and geo_harbours are GeoDataFrames, it will align both frames and then calculate the distance between rows 0 of both frame, between rows 1 of both frames, ...
So it might be that the result you show above is actually correct, but just not what you wanted (or expected).

Due to the alignment, you see NaNs in the end of the result, because both frames are probably not the same length. That is also the reason that gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data) kills your kernel, as this function does not do boundchecking (for performance reasons, and because it is not meant to be called by users directly) so it expects two arrays that are the same length.

What was the desired result? You want for each point the distance to the closest harbor?

@avnovikov
Copy link
Author

avnovikov commented Feb 16, 2018

So it might be that the result you show above is actually correct, but just not what you wanted (or expected).

Indeed the description you gave is in alignment with the result, however this is not obvious from the description of the function (Returns a Series containing the minimum distance to other.)

The end use case is to have something like nearest_point from shapely, but vectorized and without necessity to produce cascaded union, i.e. to receive row id and distance of to the nearest other.

@jorisvandenbossche
Copy link
Member

(Returns a Series containing the minimum distance to other.)

Where did you find this description? I get

In [19]: geopandas.GeoSeries.distance?
Signature: geopandas.GeoSeries.distance(self, other)
Docstring: Return distance of each geometry to *other*

(but I agree this is certainly not clear enough! a better explanation and some examples would help here)

The end use case is to have something like nearest_point from shapely, but vectorized and without necessity to produce cascaded union, i.e. to receive row id and distance of to the nearest other.

This is a feature that I would like to see in geopandas as well (implement vectorized versions of those shapely.ops), but is not yet implemented.

For now, what you can do is something like:

def dist_to_nearest(point, df):
    return df.distance(point).min()

gdf_point.geometry.apply(lambda p: dist_to_nearest(p, gdf_harbour))

This will not be fully vectorized, but at least it will be vectorized for each point.

@jorisvandenbossche
Copy link
Member

(Returns a Series containing the minimum distance to other.)

Where did you find this description?

Ah, I see this is in the latest master version (not the cython branch): http://geopandas.readthedocs.io/en/latest/reference.html#geopandas.GeoSeries.distance
That's not correct, that's an error that slipt in, sorry about that!

@jorisvandenbossche
Copy link
Member

#674 should at least resolve the ambiguity, should do later a PR with some more explanation (always welcome to do a PR!)
The 'minimum' distance is a bit ambiguous, as maybe that author was thinking of 'minimum' distance as the 'shortest distance in straight line' to the single other object (not 'minimum' distance of the distances to multiple objects)

@avnovikov
Copy link
Author

avnovikov commented Feb 16, 2018

Where did you find this description?

Oh no the description is from the https://github.com/geopandas/geopandas/blob/geopandas-cython/geopandas/base.py. The intention of this issue was to clean up description so that no one will spent half an hour trying to figure out what had happened.

BTW: IMHO I would recommend to use sindex.nearest(self, other, x>1) before df.distance for geometries of any reasonable size (and x should be more than 1 as bounding box is not the geometry). However this makes the algorithm fully unvectorized.

@jhconning
Copy link

Scipy's cKDTree can be used to get a nearest neighbor(s) solution that operates on geopandas dataframes and is effectively vectorized. It is orders of magnitude faster than the brute force method @jorisvandenbossche suggested (find all pairwise distances and then find a minimum) and even the RTree spatial index nearest method that @avnovikov 's suggested (which is fast for single point lookup but requires looping over rows if we want nearest neighbors to all points in a geodataframe).

The code below illustrates how to use cKDTree query method to write a function that operates on two dataframes, finding for each point in dataframe gdfA(e.g. ships) the distance to its nearest neighbor in target gdfB (e.g. harbors) as well as requested column information about that neighbor (e.g. 'harbor_id'). The function returns a two column dataframe.

I'm not a developer but given tremendous speed up and usefulness from using cKDTree methods could functionality like this be built into a future geopandas release?

import numpy as np
import geopandas as gpd
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point
import random

# Build  example `ships` and a `harbors` geodaframes (small but just to illustrate)
N = 1000
nharbors = np.random.uniform(0., 5000., (N, 2))  
nships = np.random.uniform(0., 5000., (N, 2))  

df = pd.DataFrame(nharbors,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['harbor_id'] = random.sample(range(N), N)
harbors = gpd.GeoDataFrame(df, geometry='geometry')

df = pd.DataFrame(nships,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['ship_id'] = random.sample(range(N), N)
ships = gpd.GeoDataFrame(df, geometry='geometry')

Now the helper function

def ckdnearest(gdA, gdB, bcol):
    nA = np.array(list(zip(gdA.geometry.x, gdA.geometry.y)) )
    nB = np.array(list(zip(gdB.geometry.x, gdB.geometry.y)) )
    btree = cKDTree(nB)
    dist, idx = btree.query(nA,k=1)
    df = pd.DataFrame.from_dict({'distance': dist,
                'bcol' : gdB.loc[idx, bcol].values })
    return df

Let's test it: searching for nearest harbor (of N=1000 harbors) to each of N=1000 ships. The function returns a two column dataframe with distance and the value of the 'harbor_id' column

In  []:  ckdnearest(ships, harbors, 'harbor_id').head(2)
Out: []:	
	distance	bcol
0	130.539793	491
1	102.736394	932

And time it:

In  []: %%timeit
ckdnearest(ships, harbors, 'harbor_id')
43.6 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compare this to use the brute force method (slightly adapting @jorisvandenbossche 's code to also return a 2 column dataframe):

def dist_to_nearest(point, gdf):
    howfar = gdf.geometry.distance(point)
    return  howfar.min(), howfar.idxmin()

In  []: %%timeit
mdistances = ships.geometry.apply(lambda p: dist_to_nearest(p, harbors))
mdistances.apply(pd.Series)

5.3 s ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The cKDTree method is efficient (I think I read it's O(log N) somewhere). When I increase the dataframe sizes to N=100,000 rows (so the potential pairwise distance comparisons rise to 10 billion) the cKDTree method can still find 100,000 nearest neighbor points about 4.6s.

@caiodu
Copy link

caiodu commented Jun 4, 2019

Is it possible to compare arrays with different sizes ? I keep getting 'ValueError: arrays must all be same length' when comparing geodataframes with different sizes :/

@martinfleis
Copy link
Member

Hi @caiodu,
What exactly are you trying to do? For clarity I would recommend either opening new issue (if it is an issue or might be) or post your question on https://stackoverflow.com/questions/tagged/geopandas (if you need help with workflow). It is better for future reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants