Vectorized distance produce skewed result #673

avnovikov · 2018-02-16T07:32:03Z

GeoPandas got completely confused when calculating distance between two objects. geo_points and gdf_harbours are GeoDataFrames with few thousand rows

geo_points.distance(gdf_harbours)
 0        138.419503
 1        138.464243
 2        138.425727
 ...
 65496           NaN
 65497           NaN
 65498           NaN
 Length: 65499, dtype: float64

while

gdf_harbours.distance(geo_points.loc[0]).min()
Out[47]: 7.344255335164139

and

gdf_harbours.distance(geo_points.loc[65498]).min()
Out[48]: 0.00654932231511796

I was unable to reconstruct this result using binary_vector_float as

gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data)

kills notebook's kernel immediately.
My versions are

geopandas                 1.0.0.dev                py36_1    conda-forge/label/dev
geos                      3.6.2                h5470d99_2

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-02-16T08:50:22Z

GeoPandas got completely confused when calculating distance between two objects.

If you do geo_points.distance(gdf_harbours) where both geo_points and geo_harbours are GeoDataFrames, it will align both frames and then calculate the distance between rows 0 of both frame, between rows 1 of both frames, ...
So it might be that the result you show above is actually correct, but just not what you wanted (or expected).

Due to the alignment, you see NaNs in the end of the result, because both frames are probably not the same length. That is also the reason that gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data) kills your kernel, as this function does not do boundchecking (for performance reasons, and because it is not meant to be called by users directly) so it expects two arrays that are the same length.

What was the desired result? You want for each point the distance to the closest harbor?

avnovikov · 2018-02-16T09:40:13Z

So it might be that the result you show above is actually correct, but just not what you wanted (or expected).

Indeed the description you gave is in alignment with the result, however this is not obvious from the description of the function (Returns a Series containing the minimum distance to other.)

The end use case is to have something like nearest_point from shapely, but vectorized and without necessity to produce cascaded union, i.e. to receive row id and distance of to the nearest other.

jorisvandenbossche · 2018-02-16T09:57:19Z

(Returns a Series containing the minimum distance to other.)

Where did you find this description? I get

In [19]: geopandas.GeoSeries.distance?
Signature: geopandas.GeoSeries.distance(self, other)
Docstring: Return distance of each geometry to *other*

(but I agree this is certainly not clear enough! a better explanation and some examples would help here)

The end use case is to have something like nearest_point from shapely, but vectorized and without necessity to produce cascaded union, i.e. to receive row id and distance of to the nearest other.

This is a feature that I would like to see in geopandas as well (implement vectorized versions of those shapely.ops), but is not yet implemented.

For now, what you can do is something like:

def dist_to_nearest(point, df):
    return df.distance(point).min()

gdf_point.geometry.apply(lambda p: dist_to_nearest(p, gdf_harbour))

This will not be fully vectorized, but at least it will be vectorized for each point.

jorisvandenbossche · 2018-02-16T09:59:04Z

(Returns a Series containing the minimum distance to other.)

Where did you find this description?

Ah, I see this is in the latest master version (not the cython branch): http://geopandas.readthedocs.io/en/latest/reference.html#geopandas.GeoSeries.distance
That's not correct, that's an error that slipt in, sorry about that!

jorisvandenbossche · 2018-02-16T10:05:47Z

#674 should at least resolve the ambiguity, should do later a PR with some more explanation (always welcome to do a PR!)
The 'minimum' distance is a bit ambiguous, as maybe that author was thinking of 'minimum' distance as the 'shortest distance in straight line' to the single other object (not 'minimum' distance of the distances to multiple objects)

avnovikov · 2018-02-16T10:13:05Z

Where did you find this description?

Oh no the description is from the https://github.com/geopandas/geopandas/blob/geopandas-cython/geopandas/base.py. The intention of this issue was to clean up description so that no one will spent half an hour trying to figure out what had happened.

BTW: IMHO I would recommend to use sindex.nearest(self, other, x>1) before df.distance for geometries of any reasonable size (and x should be more than 1 as bounding box is not the geometry). However this makes the algorithm fully unvectorized.

jhconning · 2018-10-30T14:49:56Z

Scipy's cKDTree can be used to get a nearest neighbor(s) solution that operates on geopandas dataframes and is effectively vectorized. It is orders of magnitude faster than the brute force method @jorisvandenbossche suggested (find all pairwise distances and then find a minimum) and even the RTree spatial index nearest method that @avnovikov 's suggested (which is fast for single point lookup but requires looping over rows if we want nearest neighbors to all points in a geodataframe).

The code below illustrates how to use cKDTree query method to write a function that operates on two dataframes, finding for each point in dataframe gdfA(e.g. ships) the distance to its nearest neighbor in target gdfB (e.g. harbors) as well as requested column information about that neighbor (e.g. 'harbor_id'). The function returns a two column dataframe.

I'm not a developer but given tremendous speed up and usefulness from using cKDTree methods could functionality like this be built into a future geopandas release?

import numpy as np
import geopandas as gpd
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point
import random

# Build  example `ships` and a `harbors` geodaframes (small but just to illustrate)
N = 1000
nharbors = np.random.uniform(0., 5000., (N, 2))  
nships = np.random.uniform(0., 5000., (N, 2))  

df = pd.DataFrame(nharbors,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['harbor_id'] = random.sample(range(N), N)
harbors = gpd.GeoDataFrame(df, geometry='geometry')

df = pd.DataFrame(nships,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['ship_id'] = random.sample(range(N), N)
ships = gpd.GeoDataFrame(df, geometry='geometry')

Now the helper function

def ckdnearest(gdA, gdB, bcol):
    nA = np.array(list(zip(gdA.geometry.x, gdA.geometry.y)) )
    nB = np.array(list(zip(gdB.geometry.x, gdB.geometry.y)) )
    btree = cKDTree(nB)
    dist, idx = btree.query(nA,k=1)
    df = pd.DataFrame.from_dict({'distance': dist,
                'bcol' : gdB.loc[idx, bcol].values })
    return df

Let's test it: searching for nearest harbor (of N=1000 harbors) to each of N=1000 ships. The function returns a two column dataframe with distance and the value of the 'harbor_id' column

In  []:  ckdnearest(ships, harbors, 'harbor_id').head(2)
Out: []:	
	distance	bcol
0	130.539793	491
1	102.736394	932

And time it:

In  []: %%timeit
ckdnearest(ships, harbors, 'harbor_id')
43.6 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compare this to use the brute force method (slightly adapting @jorisvandenbossche 's code to also return a 2 column dataframe):

def dist_to_nearest(point, gdf):
    howfar = gdf.geometry.distance(point)
    return  howfar.min(), howfar.idxmin()

In  []: %%timeit
mdistances = ships.geometry.apply(lambda p: dist_to_nearest(p, harbors))
mdistances.apply(pd.Series)

5.3 s ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The cKDTree method is efficient (I think I read it's O(log N) somewhere). When I increase the dataframe sizes to N=100,000 rows (so the potential pairwise distance comparisons rise to 10 billion) the cKDTree method can still find 100,000 nearest neighbor points about 4.6s.

caiodu · 2019-06-04T17:18:13Z

Is it possible to compare arrays with different sizes ? I keep getting 'ValueError: arrays must all be same length' when comparing geodataframes with different sizes :/

martinfleis · 2019-06-04T20:03:10Z

Hi @caiodu,
What exactly are you trying to do? For clarity I would recommend either opening new issue (if it is an issue or might be) or post your question on https://stackoverflow.com/questions/tagged/geopandas (if you need help with workflow). It is better for future reference.

avnovikov changed the title ~~Vectorized distance produce skewув result~~ Vectorized distance produce skewed result Feb 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorized distance produce skewed result #673

Vectorized distance produce skewed result #673

avnovikov commented Feb 16, 2018 •

edited

jorisvandenbossche commented Feb 16, 2018

avnovikov commented Feb 16, 2018 •

edited

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

avnovikov commented Feb 16, 2018 •

edited

jhconning commented Oct 30, 2018

caiodu commented Jun 4, 2019

martinfleis commented Jun 4, 2019

Vectorized distance produce skewed result #673

Vectorized distance produce skewed result #673

Comments

avnovikov commented Feb 16, 2018 • edited

jorisvandenbossche commented Feb 16, 2018

avnovikov commented Feb 16, 2018 • edited

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

jorisvandenbossche commented Feb 16, 2018

avnovikov commented Feb 16, 2018 • edited

jhconning commented Oct 30, 2018

caiodu commented Jun 4, 2019

martinfleis commented Jun 4, 2019

avnovikov commented Feb 16, 2018 •

edited

avnovikov commented Feb 16, 2018 •

edited

avnovikov commented Feb 16, 2018 •

edited