Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spatial-join algorithm #475

Merged
merged 14 commits into from Aug 20, 2017

Conversation

mrocklin
Copy link
Member

@mrocklin mrocklin commented Aug 5, 2017

This uses GEOS's STRTree to implement a spatial join.

@mrocklin
Copy link
Member Author

mrocklin commented Aug 5, 2017

I get 10-15x speedups with this over the previous implementation.

@mrocklin
Copy link
Member Author

mrocklin commented Aug 5, 2017

Along with #474 we're now at 32 failed tests (mostly because we just replaced/deleted all of the previously failing spatial join tests)

===== 32 failed, 255 passed, 5 skipped, 6 xfailed, 12 warnings in 72.11 seconds =====

@jorisvandenbossche
Copy link
Member

Just to know the status of this: does the new algorithm pass the existing tests? Or is there some change in behaviour?

@jorisvandenbossche
Copy link
Member

Is it possible you forgot to include the actual algos.c file?

@mrocklin
Copy link
Member Author

mrocklin commented Aug 7, 2017

Is it possible you forgot to include the actual algos.c file?

Thanks added

Just to know the status of this: does the new algorithm pass the existing tests? Or is there some change in behaviour?

The existing tests themselves rely on pandas components that don't function well. We do support all of the options that the old implementation supported but I wouldn't be surprised if there were changes around column ordering or similar issues.

At least in my basic experiments this implementation has performed well

@mrocklin
Copy link
Member Author

@jorisvandenbossche I'm inclined to merge this into the geopandas-cython branch. Any thoughts or objections?

@jorisvandenbossche
Copy link
Member

Taking a more detailed look now.

@jorisvandenbossche
Copy link
Member

Checked out the branch, and while trying to get my original examples/tests working, I am running into a strange issue:

First it works, but after restarting the ipython session (without doing anything else in the meantime), it fails returning empty geometry objects (in fact I was trying things out in a notebook and got empty polygons, and therefore made a reproducible example, did a clean rebuild of the branch, and then got the following running this example in a terminal session)

(dev) joris@joris-XPS-13-9350:~/scipy$ ipython
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from geopandas import GeoSeries, GeoDataFrame, sjoin
   ...: from shapely.geometry import Polygon

In [2]: polys1 = GeoSeries(
   ...:     [Polygon([(0, 0), (5, 0), (5, 5), (0, 5)]),
   ...:      Polygon([(5, 5), (6, 5), (6, 6), (5, 6)]),
   ...:      Polygon([(6, 0), (9, 0), (9, 3), (6, 3)])])
   ...: 
   ...: polys2 = GeoSeries(
   ...:     [Polygon([(1, 1), (4, 1), (4, 4), (1, 4)]),
   ...:      Polygon([(4, 4), (7, 4), (7, 7), (4, 7)]),
   ...:      Polygon([(7, 7), (10, 7), (10, 10), (7, 10)])])
   ...: 
   ...: df1 = GeoDataFrame({'geometry': polys1, 'df1': [0, 1, 2]})
   ...: df2 = GeoDataFrame({'geometry': polys2, 'df2': [3, 4, 5]})

In [3]: sjoin(df1, df2, 'intersects')
Out[3]: I am densified (3 elements)

   df1  df2  right_index                             geometry
0    0    4            1  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
0    0    3            0  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
1    1    4            1  POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))

In [4]: quit()
(dev) joris@joris-XPS-13-9350:~/scipy$ ipython
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from geopandas import GeoSeries, GeoDataFrame, sjoin
   ...: from shapely.geometry import Polygon

In [2]: polys1 = GeoSeries(
   ...:     [Polygon([(0, 0), (5, 0), (5, 5), (0, 5)]),
   ...:      Polygon([(5, 5), (6, 5), (6, 6), (5, 6)]),
   ...:      Polygon([(6, 0), (9, 0), (9, 3), (6, 3)])])
   ...: 
   ...: polys2 = GeoSeries(
   ...:     [Polygon([(1, 1), (4, 1), (4, 4), (1, 4)]),
   ...:      Polygon([(4, 4), (7, 4), (7, 7), (4, 7)]),
   ...:      Polygon([(7, 7), (10, 7), (10, 10), (7, 10)])])
   ...: 
   ...: df1 = GeoDataFrame({'geometry': polys1, 'df1': [0, 1, 2]})
   ...: df2 = GeoDataFrame({'geometry': polys2, 'df2': [3, 4, 5]})

In [3]: sjoin(df1, df2, 'intersects')
Out[3]: I am densified (3 elements)

   df1  df2       geometry
0    0    4  POLYGON EMPTY
0    0    3  POLYGON EMPTY
1    1    4  POLYGON EMPTY

Trying to debug it in the failing case, and all goes fine inside sjoin until the final GeoDataFrame construction.

@jorisvandenbossche
Copy link
Member

Seems to boil down to a problem with passing a custom ordering in the GeoDataFrame constructor (which is also happening in the upstream geopandas-cython branch):

In [1]: from geopandas import GeoSeries, GeoDataFrame, sjoin
   ...: from shapely.geometry import Polygon, Point
   ...: 

In [2]: GeoDataFrame({'a': [1, 2, 3],
   ...:               'geometry': GeoSeries([Point(1, 1), Point(2, 2), Point(3, 3)])},
   ...:            columns=['geometry', 'a'], index=pd.Index([0, 0, 1]), geometry='geometry')
   ...:            
Out[2]: I am densified (3 elements)

        geometry
0  POLYGON EMPTY
1  POLYGON EMPTY
2  POLYGON EMPTY

In [3]: GeoDataFrame({'a': [1, 2, 3],
   ...:               'geometry': GeoSeries([Point(1, 1), Point(2, 2), Point(3, 3)])},
   ...:            columns=['a', 'geometry'], index=pd.Index([0, 0, 1]), geometry='geometry')
   ...:            
Out[3]: I am densified (3 elements)

   a     geometry
0  1  POINT (1 1)
1  2  POINT (2 2)
2  3  POINT (3 3)

(only the order in column=[..] is different)

but I still don't understand why it was not always failing ..

@mrocklin
Copy link
Member Author

I think I've resolved this in the latest commit

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly some comments/questions for just better understanding the code.

Further:

  • could you leave the old tests intact? (or at least the first class with some pre-made objects and results, not necessarily the one using nybb) I tested this branch on those examples (https://gist.github.com/jorisvandenbossche/751809b631f30579b9859e6bf75d2657), and the new implementation gives the same results, with this exception of the exact ordering of columns / indices.

  • more binary predicates are now supported? Can you update the sjoin docstring for that?

// double time_spent = 0, time_spent_1 = 0, time_spent_2 = 0;

size_t l, r;
size_vector out;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is size_vector?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see this is defined in algos.h. Can you put a comment there what it is used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

} geom_vector;


void strtree_query_callback(void *item, void *vec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some more comments/docstrings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some more docstrings. Please feel free to point out further areas where this can be improved.

#include <time.h>
#include <stdbool.h>

typedef char (*GEOSPredicate)(GEOSContextHandle_t handle, const GEOSGeometry *left, const GEOSGeometry *right);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only the prepared one is used here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. Should we remove the other?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is not used for anything else, I would say yes



@pytest.mark.skip
def test_bench_sjoin():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not keep this in the test file?

out = np.empty((sv.n // 2, 2), dtype=np.uintp)

with nogil:
for idx in range(0, sv.n // 2):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add here some comments as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is doing it manually faster than taking step-2 slices?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately sv.a doesn't support slicing. It isn't a numpy array

for name, series in left.iteritems():
if name in right.columns:
name = lsuffix + '_' + name
series.index = index[:n_left]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this one needed?
Because index is already equal to series.index no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, for how='right' this is not the case. But can't we put this one then in the if block below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems not. This causes some of the tests to fail.

if how == 'left':
missing = pd.Index(np.arange(len(left_df))).difference(pd.Index(left_indices))
if len(missing):
left_indices = np.concatenate([left_indices, missing])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we sort the indices here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this typical?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that I wonder it, is because (IIUC) that now entries in the left without a match will all end at the bottom (in case of a left/inner join), while it could make sense to keep the original ordering of the left frame ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the ordering of entries in the right frame when there are duplicate matches (duplicated rows of the right frame) does not always match the original.

Eg using the frames from the tests

geopandas.vectorized.cysjoin(df1.geometry._geometry_array.data,
                      df2.geometry._geometry_array.data,
                      'intersects')

gives (with the previous commit -> still 2D array)

array([[0, 1],
       [0, 0],
       [1, 1]], dtype=uint64)

while it could make sense to ensure it is like

array([[0, 0],
       [0, 1],
       [1, 1]], dtype=uint64)

But, that would need a nested sort, and will maybe give a performance impediment that is not worth it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an opinion about this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have thoughts about the values of sorting or not. I think that you probably have a better sense of both the performance costs and the usability expectations.

name = lsuffix + '_' + name
series.index = index[:n_left]
if how == 'right':
new = series.iloc[:0].reindex(index[n:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to create an empty series of the correct dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

names = []
for name, series in left.iteritems():
if name in right.columns:
name = lsuffix + '_' + name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lsuffix + '_' + name -> name + '_' + lsuffix ?

names.append(name)
for name, series in right.iteritems():
if name in left.columns:
name = rsuffix + '_' + name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (switch around)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now using name_suffix pattern

@jorisvandenbossche
Copy link
Member

BTW, that was maybe not clear, but I am really happy with this new implementation of sjoin! (the python part, once the indexers are obtained) is now much more understandable)

'covers',
'crosses',
'disjoint',
# 'equals',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason that this is commented out (seems useful)? Because equals has no prepared version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

allowed_hows = ['left', 'right', 'inner']
if how not in allowed_hows:
raise ValueError("`how` was \"%s\" but is expected to be in %s" % \
(how, allowed_hows))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping these checks

(op, allowed_ops))

if left_df.crs != right_df.crs:
print('Warning: CRS does not match!')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep this one as well?

@mrocklin
Copy link
Member Author

Regarding the old tests, how do you feel about column ordering.

(Pdb) pp res
I am densified (3 elements)
   df1  df2  index_right                             geometry
0    0    4            1  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
0    0    3            0  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
1    1    4            1  POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))
(Pdb) pp exp
   df1                             geometry  index_right  df2
0    0  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))            0    3
0    0  POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))            1    4
1    1  POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))            1    4

I'm inclined to put the geometry to one side, either left or right. The current implementation puts it in the middle.

@jorisvandenbossche
Copy link
Member

Yes, I think putting it at a side is actually better than in the middle. Since we default on putting it at the right side when constructing a frame, maybe best to do that here as well.
The only advantage of putting it in the middle is that it is clearer which columns have been added by the join.

pd.DataFrame(result[result['match_bool']==1])
.drop('match_bool', axis=1)
)
print("Warning: CRS does not match")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we should better turn this into an actual UserWarning

@mrocklin
Copy link
Member Author

I've added back the old tests. What do you recommend for sorting by the secondary index? (Also please feel free to push commits to this branch if you like).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications! Based on that, posted a new round of questions (using myself as a noob trying to understand the code well, to make sure it is as clear as possible for the future)

{
size_t n, m;
GEOSGeometry **a;
} geom_vector;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this one used for? (I don't seem to find a usage of geom_vector)



/* Spatial join of two arrays of geometries over the predicate
* This creates an index for the right side
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence I don't fully understand. Doesn't it create an index for both sides?


/* Spatial join of two arrays of geometries over the predicate
* This creates an index for the right side
* Finds all polygons in both sides that intersect with each other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is general for 'geometries' and not limited to 'polygons' ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This intersection step is what is done by the STRTree ?

@@ -0,0 +1,90 @@
/* The MIT License
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put at the top of this file a short note about why this file is includes (what it provides (implementation of resizeable vector) and why we need it (used to store matching indices in the spatial join implementation in algos.c)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a small note to the top of kvec.h


size_t l, r; // indices for left and right sides
GEOSPreparedGeometry* prepared; // Temporary prepared geometry for right side
size_vector out; // output array of matching indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that a resizable vector is used is because we do not know in advance how many matches there will be, is that correct? If so, can you put somewhere a small comment about it (eg in the explanation above of how the algorithm works)

(starting a bit more to understand the use of the size_vector, at least I think :-))

if (left[l] != NULL)
{
// begin_1 = clock();
GEOSSTRtree_query_r(handle, tree, left[l], strtree_query_callback, &vec);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this stores indices of all intersecting geoms of the right array that intersect with left[l] in temporary array vec, correct?

)
print("Warning: CRS does not match")

left_indices, right_indices = cysjoin(left_df.geometry._geometry_array.data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left_df.geometry._geometry_array.data can in principle be shortened to left._geometry_array.data

@jorisvandenbossche
Copy link
Member

Thanks for the updates!
I can take a look at the old tests if you want (probably sorting both before asserting is easiest for now to ignore column/row ordering -> but then we now this is for now the difference with the old sjoin)

@mrocklin
Copy link
Member Author

mrocklin commented Aug 19, 2017 via email

@jorisvandenbossche
Copy link
Member

Yes, will do.
BTW, I am adding a benchmark suite (#497), and added there the example you had here before in the tests.
(but didn't compare yet the geopandas-cython branch with master)

@jorisvandenbossche
Copy link
Member

So to get the old tests running (locally), I only had to sort the rows and change an error type (apart from the column ordering you already did?)

@jorisvandenbossche
Copy link
Member

Just wanted to look at the travis test results, but apparently something goes wrong during compiling (so everything fails)
This was not the case for previous PRs / geopandas-cython branch. But I also don't directly see an indication what could be the cause in this PR.

@mrocklin
Copy link
Member Author

Builds now work

@jorisvandenbossche
Copy link
Member

Some vectorized tests for sjoin are failing (because the result is no longer an array, but tuple of arrays)

@mrocklin
Copy link
Member Author

Some vectorized tests for sjoin are failing (because the result is no longer an array, but tuple of arrays)

Fixed

@mrocklin
Copy link
Member Author

@jorisvandenbossche any objection to merging this in?

.gitignore Outdated
@@ -9,6 +9,7 @@ geopandas/version.py
doc/_static/world_*
examples/nybb_*.zip
.cache/
*.c
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to ignore algos.c

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that is added, so then it is no problem?
I thought this just means that you have to force add new c files, but maybe that's an incorrect git understanding

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing is that now vectorized.c is in the list of untracked file, I can of course also add geopandas/vectorized.c explicitly here in .gitignore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified that your understanding is correct. I retract my previous comment.

@@ -227,7 +225,6 @@ def test_sjoin_values(self):
self.assertEquals(df.shape, (12,8))

@unittest.skipIf(str(pd.__version__) < LooseVersion('0.19'), pandas_0_18_problem)
@pytest.mark.xfail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is still failing (so the only sjoin test that should fail), but with an interesting error somewhere in the pandas munging of the frames. So one I have to look at when improving the pandas interactions, therefore removed it for now

@jorisvandenbossche jorisvandenbossche merged commit 43546c4 into geopandas:geopandas-cython Aug 20, 2017
@jorisvandenbossche
Copy link
Member

@mrocklin Fixed the merge conflict and merged. Thanks a lot for this!

I think of the next follow-up issues:

  • Implement for 'equals' as well? (needs some adjustments because has no prepared version) This seems useful to me to be able to join on equal shapes.
  • Think/decide on preserving row ordering (want to keep original row ordering or not vs perf penalty to sort)

Do you think of others?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 20, 2017

BTW, I ran the sjoin benchmarks of this branch vs upstream/master:
(before and after are logically switched)

       before           after         ratio
     [dddf039c]       [baa1601c]
+           855ms            19.6s    22.91  sjoin.Bench.time_sjoin('within')
+           1.57s            19.6s    12.45  sjoin.Bench.time_sjoin('intersects')
+           1.62s            19.9s    12.24  sjoin.Bench.time_sjoin('contains')

So some nice improvements!

jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Jul 1, 2019
* add sjoin algorithm (c/cythonized)

* Fix constructor logic when geometry and columns both provided
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 18, 2019
* add sjoin algorithm (c/cythonized)

* Fix constructor logic when geometry and columns both provided
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 30, 2019
* add sjoin algorithm (c/cythonized)

* Fix constructor logic when geometry and columns both provided
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants