New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spatial-join algorithm #475
Add spatial-join algorithm #475
Conversation
I get 10-15x speedups with this over the previous implementation. |
Along with #474 we're now at 32 failed tests (mostly because we just replaced/deleted all of the previously failing spatial join tests)
|
Just to know the status of this: does the new algorithm pass the existing tests? Or is there some change in behaviour? |
Is it possible you forgot to include the actual |
Thanks added
The existing tests themselves rely on pandas components that don't function well. We do support all of the options that the old implementation supported but I wouldn't be surprised if there were changes around column ordering or similar issues. At least in my basic experiments this implementation has performed well |
@jorisvandenbossche I'm inclined to merge this into the geopandas-cython branch. Any thoughts or objections? |
Taking a more detailed look now. |
Checked out the branch, and while trying to get my original examples/tests working, I am running into a strange issue: First it works, but after restarting the ipython session (without doing anything else in the meantime), it fails returning empty geometry objects (in fact I was trying things out in a notebook and got empty polygons, and therefore made a reproducible example, did a clean rebuild of the branch, and then got the following running this example in a terminal session)
Trying to debug it in the failing case, and all goes fine inside |
Seems to boil down to a problem with passing a custom ordering in the GeoDataFrame constructor (which is also happening in the upstream geopandas-cython branch):
(only the order in but I still don't understand why it was not always failing .. |
I think I've resolved this in the latest commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly some comments/questions for just better understanding the code.
Further:
-
could you leave the old tests intact? (or at least the first class with some pre-made objects and results, not necessarily the one using nybb) I tested this branch on those examples (https://gist.github.com/jorisvandenbossche/751809b631f30579b9859e6bf75d2657), and the new implementation gives the same results, with this exception of the exact ordering of columns / indices.
-
more binary predicates are now supported? Can you update the sjoin docstring for that?
geopandas/algos.c
Outdated
// double time_spent = 0, time_spent_1 = 0, time_spent_2 = 0; | ||
|
||
size_t l, r; | ||
size_vector out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is size_vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see this is defined in algos.h. Can you put a comment there what it is used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
geopandas/algos.c
Outdated
} geom_vector; | ||
|
||
|
||
void strtree_query_callback(void *item, void *vec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some more comments/docstrings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some more docstrings. Please feel free to point out further areas where this can be improved.
geopandas/algos.c
Outdated
#include <time.h> | ||
#include <stdbool.h> | ||
|
||
typedef char (*GEOSPredicate)(GEOSContextHandle_t handle, const GEOSGeometry *left, const GEOSGeometry *right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the prepared one is used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. Should we remove the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is not used for anything else, I would say yes
geopandas/tests/test_vectorized.py
Outdated
|
||
|
||
@pytest.mark.skip | ||
def test_bench_sjoin(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not keep this in the test file?
geopandas/vectorized.pyx
Outdated
out = np.empty((sv.n // 2, 2), dtype=np.uintp) | ||
|
||
with nogil: | ||
for idx in range(0, sv.n // 2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add here some comments as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is doing it manually faster than taking step-2 slices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately sv.a doesn't support slicing. It isn't a numpy array
geopandas/tools/sjoin.py
Outdated
for name, series in left.iteritems(): | ||
if name in right.columns: | ||
name = lsuffix + '_' + name | ||
series.index = index[:n_left] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this one needed?
Because index
is already equal to series.index
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, for how='right'
this is not the case. But can't we put this one then in the if block below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems not. This causes some of the tests to fail.
geopandas/tools/sjoin.py
Outdated
if how == 'left': | ||
missing = pd.Index(np.arange(len(left_df))).difference(pd.Index(left_indices)) | ||
if len(missing): | ||
left_indices = np.concatenate([left_indices, missing]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we sort the indices here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this typical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that I wonder it, is because (IIUC) that now entries in the left without a match will all end at the bottom (in case of a left/inner join), while it could make sense to keep the original ordering of the left frame ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the ordering of entries in the right frame when there are duplicate matches (duplicated rows of the right frame) does not always match the original.
Eg using the frames from the tests
geopandas.vectorized.cysjoin(df1.geometry._geometry_array.data,
df2.geometry._geometry_array.data,
'intersects')
gives (with the previous commit -> still 2D array)
array([[0, 1],
[0, 0],
[1, 1]], dtype=uint64)
while it could make sense to ensure it is like
array([[0, 0],
[0, 1],
[1, 1]], dtype=uint64)
But, that would need a nested sort, and will maybe give a performance impediment that is not worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an opinion about this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have thoughts about the values of sorting or not. I think that you probably have a better sense of both the performance costs and the usability expectations.
geopandas/tools/sjoin.py
Outdated
name = lsuffix + '_' + name | ||
series.index = index[:n_left] | ||
if how == 'right': | ||
new = series.iloc[:0].reindex(index[n:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to create an empty series of the correct dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
geopandas/tools/sjoin.py
Outdated
names = [] | ||
for name, series in left.iteritems(): | ||
if name in right.columns: | ||
name = lsuffix + '_' + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lsuffix + '_' + name
-> name + '_' + lsuffix
?
geopandas/tools/sjoin.py
Outdated
names.append(name) | ||
for name, series in right.iteritems(): | ||
if name in left.columns: | ||
name = rsuffix + '_' + name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here (switch around)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now using name_suffix
pattern
BTW, that was maybe not clear, but I am really happy with this new implementation of sjoin! (the python part, once the indexers are obtained) is now much more understandable) |
geopandas/tests/test_vectorized.py
Outdated
'covers', | ||
'crosses', | ||
'disjoint', | ||
# 'equals', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason that this is commented out (seems useful)? Because equals has no prepared version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
geopandas/tools/sjoin.py
Outdated
allowed_hows = ['left', 'right', 'inner'] | ||
if how not in allowed_hows: | ||
raise ValueError("`how` was \"%s\" but is expected to be in %s" % \ | ||
(how, allowed_hows)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping these checks
geopandas/tools/sjoin.py
Outdated
(op, allowed_ops)) | ||
|
||
if left_df.crs != right_df.crs: | ||
print('Warning: CRS does not match!') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep this one as well?
Regarding the old tests, how do you feel about column ordering. (Pdb) pp res
I am densified (3 elements)
df1 df2 index_right geometry
0 0 4 1 POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
0 0 3 0 POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))
1 1 4 1 POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))
(Pdb) pp exp
df1 geometry index_right df2
0 0 POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0)) 0 3
0 0 POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0)) 1 4
1 1 POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5)) 1 4 I'm inclined to put the geometry to one side, either left or right. The current implementation puts it in the middle. |
Yes, I think putting it at a side is actually better than in the middle. Since we default on putting it at the right side when constructing a frame, maybe best to do that here as well. |
geopandas/tools/sjoin.py
Outdated
pd.DataFrame(result[result['match_bool']==1]) | ||
.drop('match_bool', axis=1) | ||
) | ||
print("Warning: CRS does not match") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably we should better turn this into an actual UserWarning
I've added back the old tests. What do you recommend for sorting by the secondary index? (Also please feel free to push commits to this branch if you like). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications! Based on that, posted a new round of questions (using myself as a noob trying to understand the code well, to make sure it is as clear as possible for the future)
geopandas/algos.c
Outdated
{ | ||
size_t n, m; | ||
GEOSGeometry **a; | ||
} geom_vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this one used for? (I don't seem to find a usage of geom_vector
)
geopandas/algos.c
Outdated
|
||
|
||
/* Spatial join of two arrays of geometries over the predicate | ||
* This creates an index for the right side |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence I don't fully understand. Doesn't it create an index for both sides?
geopandas/algos.c
Outdated
|
||
/* Spatial join of two arrays of geometries over the predicate | ||
* This creates an index for the right side | ||
* Finds all polygons in both sides that intersect with each other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is general for 'geometries' and not limited to 'polygons' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This intersection step is what is done by the STRTree ?
@@ -0,0 +1,90 @@ | |||
/* The MIT License |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put at the top of this file a short note about why this file is includes (what it provides (implementation of resizeable vector) and why we need it (used to store matching indices in the spatial join implementation in algos.c)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a small note to the top of kvec.h
geopandas/algos.c
Outdated
|
||
size_t l, r; // indices for left and right sides | ||
GEOSPreparedGeometry* prepared; // Temporary prepared geometry for right side | ||
size_vector out; // output array of matching indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that a resizable vector is used is because we do not know in advance how many matches there will be, is that correct? If so, can you put somewhere a small comment about it (eg in the explanation above of how the algorithm works)
(starting a bit more to understand the use of the size_vector, at least I think :-))
geopandas/algos.c
Outdated
if (left[l] != NULL) | ||
{ | ||
// begin_1 = clock(); | ||
GEOSSTRtree_query_r(handle, tree, left[l], strtree_query_callback, &vec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this stores indices of all intersecting geoms of the right array that intersect with left[l]
in temporary array vec
, correct?
geopandas/tools/sjoin.py
Outdated
) | ||
print("Warning: CRS does not match") | ||
|
||
left_indices, right_indices = cysjoin(left_df.geometry._geometry_array.data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left_df.geometry._geometry_array.data
can in principle be shortened to left._geometry_array.data
Thanks for the updates! |
I suspect that you are more adept at normalizing outputs on the pandas
side. If you have the time to take a look I would appreciate it.
…On Sat, Aug 19, 2017 at 11:02 AM, Joris Van den Bossche < ***@***.***> wrote:
Thanks for the updates!
I can take a look at the old tests if you want (probably sorting both
before asserting is easiest for now to ignore column/row ordering -> but
then we now this is for now the difference with the old sjoin)
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#475 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AASszMx3HOuPX9HJIf5JXR66hwUQlnS2ks5sZvj7gaJpZM4OuUNf>
.
|
Yes, will do. |
bdb9c73
to
ebd1aaa
Compare
So to get the old tests running (locally), I only had to sort the rows and change an error type (apart from the column ordering you already did?) |
Just wanted to look at the travis test results, but apparently something goes wrong during compiling (so everything fails) |
Builds now work |
Some vectorized tests for sjoin are failing (because the result is no longer an array, but tuple of arrays) |
Fixed |
@jorisvandenbossche any objection to merging this in? |
.gitignore
Outdated
@@ -9,6 +9,7 @@ geopandas/version.py | |||
doc/_static/world_* | |||
examples/nybb_*.zip | |||
.cache/ | |||
*.c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to ignore algos.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that is added, so then it is no problem?
I thought this just means that you have to force add new c files, but maybe that's an incorrect git understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thing is that now vectorized.c is in the list of untracked file, I can of course also add geopandas/vectorized.c
explicitly here in .gitignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified that your understanding is correct. I retract my previous comment.
@@ -227,7 +225,6 @@ def test_sjoin_values(self): | |||
self.assertEquals(df.shape, (12,8)) | |||
|
|||
@unittest.skipIf(str(pd.__version__) < LooseVersion('0.19'), pandas_0_18_problem) | |||
@pytest.mark.xfail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one is still failing (so the only sjoin test that should fail), but with an interesting error somewhere in the pandas munging of the frames. So one I have to look at when improving the pandas interactions, therefore removed it for now
@mrocklin Fixed the merge conflict and merged. Thanks a lot for this! I think of the next follow-up issues:
Do you think of others? |
BTW, I ran the sjoin benchmarks of this branch vs upstream/master: before after ratio
[dddf039c] [baa1601c]
+ 855ms 19.6s 22.91 sjoin.Bench.time_sjoin('within')
+ 1.57s 19.6s 12.45 sjoin.Bench.time_sjoin('intersects')
+ 1.62s 19.9s 12.24 sjoin.Bench.time_sjoin('contains') So some nice improvements! |
* add sjoin algorithm (c/cythonized) * Fix constructor logic when geometry and columns both provided
* add sjoin algorithm (c/cythonized) * Fix constructor logic when geometry and columns both provided
* add sjoin algorithm (c/cythonized) * Fix constructor logic when geometry and columns both provided
This uses GEOS's STRTree to implement a spatial join.