ENH: add attribute requirement with spatial join #3231

nicholas-ys-tan · 2024-03-26T11:59:17Z

Addresses #3049

Proposed approach for a spatial join that matches rows only when the spatial predicate holds and the values in a shared attribute column are equal. Thought I'd get some eyes on it before I progress further as I'm new to contributing to this repo.

Things left to do:

- more tests
- changelog
- don't add suffix to the sharedAttribute
- consider if sharedAttribute should allow more than one attribute column, i.e. accept a list and require the values in all listed columns to be equal
- more details in docstring

martinfleis · 2024-03-28T10:09:31Z

Thanks! We should probably get #2353 in before touching sjoin code elsewhere.

One minor nit - please use snake_case in the code, not camelCase. Thanks!

nicholas-ys-tan · 2024-03-30T10:35:08Z

Thanks Martin,

I noted #2353 has been merged. I've since re-based and resolved conflicts.

I've made further updates so it can take lists and tuples too, similar to the merge() function in pandas. I renamed shared_attribute (already corrected to snake case) to on_attribute, to be a bit closer to the on kwarg used in pandas. Happy to change it to any argument name you think is preferable. I have also updated it so that the "_left" and "_right" suffixes do not get appended to the on_attribute columns.
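To make the described behaviour concrete, here is a minimal pure-Python sketch. It is not the geopandas implementation: the function names, the row dictionaries, and the `pairs` input (standing in for pairs that already satisfy a spatial predicate) are all assumptions made for illustration. It shows an `on_attribute` argument that accepts a string, list, or tuple, and leaves those columns unsuffixed while other overlapping columns get `_left`/`_right`.

```python
def normalize_on_attribute(on_attribute):
    """Accept a single column name, a list, or a tuple; return a list of names."""
    if isinstance(on_attribute, str):
        return [on_attribute]
    return list(on_attribute)


def toy_attribute_join(left_rows, right_rows, pairs, on_attribute,
                       lsuffix="left", rsuffix="right"):
    """Combine candidate (left, right) row pairs whose on_attribute values match.

    `pairs` is a hypothetical stand-in for the output of a spatial predicate:
    positions of rows whose geometries already satisfy it.
    """
    keys = normalize_on_attribute(on_attribute)
    joined = []
    for i, j in pairs:
        left, right = left_rows[i], right_rows[j]
        # Keep the pair only if every listed attribute is equal.
        if any(left[k] != right[k] for k in keys):
            continue
        # Shared attribute columns: a single copy, no suffix.
        row = {k: left[k] for k in keys}
        for col, value in left.items():
            if col not in keys:
                row[f"{col}_{lsuffix}" if col in right else col] = value
        for col, value in right.items():
            if col not in keys:
                row[f"{col}_{rsuffix}" if col in left else col] = value
        joined.append(row)
    return joined


left_rows = [{"zone": "a", "year": 2020, "name": "p1"},
             {"zone": "b", "year": 2020, "name": "p2"}]
right_rows = [{"zone": "a", "year": 2020, "name": "poly1"},
              {"zone": "a", "year": 2021, "name": "poly2"}]
# Pretend every left geometry intersects every right geometry.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

result = toy_attribute_join(left_rows, right_rows, pairs, ("zone", "year"))
print(result)
```

Only the pair that agrees on both `zone` and `year` survives, and `name` is the only column that receives suffixes.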

martinfleis

Thanks. Couple of notes in the code.

CHANGELOG.md

geopandas/tools/sjoin.py

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

martinfleis

Can you also update the docstring of GeoDataFrame.sjoin in geodataframe.py?

Otherwise this looks good to my eyes. Thanks!

geopandas/tools/sjoin.py

geopandas/tools/tests/test_sjoin.py

nicholas-ys-tan · 2024-04-30T14:42:00Z

> Can you also update the docstring of GeoDataFrame.sjoin in geodataframe.py?
>
> Otherwise this looks good to my eyes. Thanks!

I've added to the docstring. I also noticed that the distance kwarg was missing from the GeoDataFrame.sjoin docstring, so I've added it here (though I wasn't sure whether it should be a separate commit).

martinfleis

Looks good to me, thanks!

CHANGELOG.md

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>

martinfleis · 2024-05-21T07:00:02Z

@nicholas-ys-tan can you merge main here to resolve conflicts?

jorisvandenbossche

For sjoin this looks good!
I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate.
Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.
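A toy illustration of the two orderings, in pure Python rather than geopandas code: points are tested against axis-aligned boxes with a hypothetical `within` predicate, and each geometry carries one attribute value. All names and data here are assumptions made for the sketch. Both orderings return the same pairs; only the amount of spatial work differs.

```python
from collections import defaultdict


def within(point, box):
    """Toy spatial predicate: is (x, y) inside (minx, miny, maxx, maxy)?"""
    x, y = point
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def spatial_then_attribute(points, boxes):
    """Evaluate the predicate on all pairs, then keep attribute matches."""
    spatial = [(i, j)
               for i, (p, _) in enumerate(points)
               for j, (b, _) in enumerate(boxes)
               if within(p, b)]
    return sorted((i, j) for i, j in spatial if points[i][1] == boxes[j][1])


def attribute_then_spatial(points, boxes):
    """Group boxes by attribute first, then test only same-attribute pairs."""
    boxes_by_attr = defaultdict(list)
    for j, (_, attr) in enumerate(boxes):
        boxes_by_attr[attr].append(j)
    return sorted((i, j)
                  for i, (p, attr) in enumerate(points)
                  for j in boxes_by_attr.get(attr, [])
                  if within(p, boxes[j][0]))


# Each entry is (geometry, attribute value).
points = [((1, 1), "a"), ((1, 1), "b"), ((5, 5), "a")]
boxes = [((0, 0, 2, 2), "a"), ((0, 0, 6, 6), "b"), ((4, 4, 6, 6), "c")]

first = spatial_then_attribute(points, boxes)
second = attribute_then_spatial(points, boxes)
print(first, second)
```

In this example the spatial-first version evaluates the predicate on all nine pairs, while the attribute-first version evaluates it on three; which one is faster on real data would need measuring.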

However, that also made me wonder whether this approach is correct for sjoin_nearest, because in that case the order in which you do those operations does matter, I think. I would assume that I get "the closest geometry among those with a matching attribute", but in practice it will give "the overall closest geometry if that closest geometry has a matching attribute".
That seems an important difference in expected behaviour, which I am not sure has been discussed?

I would maybe suggest to focus this first PR on the sjoin function (where the behaviour is clearer), and leave out sjoin_nearest for later.
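The difference for a nearest join can be seen in a small pure-Python sketch. This is not the sjoin_nearest code; the function names and the point data are assumptions made for illustration. One left point has attribute "a", its overall nearest neighbour has attribute "b", and a more distant neighbour has attribute "a".

```python
import math


def nearest_then_attribute(left, right):
    """Find the overall nearest neighbour, then drop it if attributes differ."""
    out = []
    for i, (p, attr) in enumerate(left):
        j = min(range(len(right)), key=lambda k: math.dist(p, right[k][0]))
        if right[j][1] == attr:
            out.append((i, j))
    return out


def attribute_then_nearest(left, right):
    """Restrict to same-attribute candidates, then find the nearest among them."""
    out = []
    for i, (p, attr) in enumerate(left):
        same = [k for k, (_, a) in enumerate(right) if a == attr]
        if same:
            j = min(same, key=lambda k: math.dist(p, right[k][0]))
            out.append((i, j))
    return out


# Each entry is (point, attribute value).
left = [((0.0, 0.0), "a")]
right = [((1.0, 0.0), "b"), ((3.0, 0.0), "a")]

overall_first = nearest_then_attribute(left, right)
attribute_first = attribute_then_nearest(left, right)
print(overall_first, attribute_first)
```

Filtering after the nearest search returns no match for the left point, whereas filtering first pairs it with the farther point that shares its attribute.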

geopandas/tools/sjoin.py

geopandas/geodataframe.py

geopandas/tools/sjoin.py

m-richards · 2024-05-21T11:20:36Z

> For sjoin this looks good! I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate. Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.

I will confess I was not engaging with this PR and the original issue precisely because of this. I would think the best order of operations depends quite a bit on the data in question, and for that reason I was wondering if this actually belongs in geopandas. But I can still see value in providing a convenience utility for the general case, even if it's not necessarily the most performant in all circumstances.

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

nicholas-ys-tan · 2024-05-21T12:42:29Z

> For sjoin this looks good! I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate. Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.
>
> However, that also made me wonder whether this approach is correct for sjoin_nearest, because in that case the order in which you do those operations does matter, I think. I would assume that I get "the closest geometry among those with a matching attribute", but in practice it will give "the overall closest geometry if that closest geometry has a matching attribute". That seems an important difference in expected behaviour, which I am not sure has been discussed?
>
> I would maybe suggest to focus this first PR on the sjoin function (where the behaviour is clearer), and leave out sjoin_nearest for later.

Thank you @jorisvandenbossche , that's great food for thought re performance and I will do some investigation into that.

Also thank you for pointing out the implications for sjoin_nearest; the sequence and its impact on the output had not crossed my mind. I've since removed all on_attribute joins from sjoin_nearest.

I will open a new issue to discuss the ordering of operations and how on_attribute should behave in sjoin_nearest.

CHANGELOG.md

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

nicholas-ys-tan · 2024-06-15T13:12:34Z

> I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate.

@jorisvandenbossche, is this roughly what you had in mind in terms of evaluating the attribute match first, then the spatial predicate? I'm essentially passing smaller dataframes, already filtered by attribute, into _geom_predicate_query so that their geometries are evaluated separately. I haven't done any performance testing yet, as I am not sure this is the optimal approach; it feels a bit naive at the moment and I wanted to run it by you first.

An initial test with test_sjoin_shared_attribute in this PR suggests this approach (~0.045 s) would be slower than the approach currently in the PR (~0.025 s). However, this may not be a representative example, as the dataset is relatively small. Batched processing of geometry joins might become preferable on larger datasets, but that is pure speculation.

Additionally, this does not yet work for multiple on_attribute columns.

def _geom_predicate_query_on_attribute_wrapper(
    left_df, right_df, predicate, distance, on_attribute
):
    # Proposed alternative: filter both frames by attribute value first, then
    # run the spatial query on each pair of subsets. Only the first
    # on_attribute column is handled so far.
    attr_col = on_attribute[0]
    l_idx = []
    r_idx = []
    for attr in left_df[attr_col].unique():
        # Subset both frames to the rows sharing this attribute value.
        left_df_attr = left_df[left_df[attr_col] == attr]
        right_df_attr = right_df[right_df[attr_col] == attr]

        # Positional indices into the two subsets.
        left_df_idx, right_df_idx = _geom_predicate_query(
            left_df_attr, right_df_attr, predicate, distance
        )

        # Map subset positions back to index labels of the original frames.
        l_idx += left_df_attr.index[left_df_idx].to_list()
        r_idx += right_df_attr.index[right_df_idx].to_list()

    return l_idx, r_idx
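One possible way to extend the batched idea to several on_attribute columns is to group rows by the tuple of their attribute values. The sketch below is pure Python and only illustrative: `toy_predicate_query` is a hypothetical stand-in for `_geom_predicate_query` that treats geometries as 1-D intervals, and the row dictionaries and the `geom` key are assumptions, not geopandas structures.

```python
from collections import defaultdict


def toy_predicate_query(left_geoms, right_geoms):
    """Hypothetical stand-in for a spatial query.

    Geometries are (start, end) intervals; returns positional pairs that overlap.
    """
    return [(i, j)
            for i, (ls, le) in enumerate(left_geoms)
            for j, (rs, re_) in enumerate(right_geoms)
            if ls <= re_ and rs <= le]


def group_positions(rows, keys):
    """Map each tuple of attribute values to the row positions holding it."""
    groups = defaultdict(list)
    for pos, row in enumerate(rows):
        groups[tuple(row[k] for k in keys)].append(pos)
    return groups


def query_per_attribute_group(left_rows, right_rows, on_attribute):
    """Run the spatial query once per shared combination of attribute values."""
    left_groups = group_positions(left_rows, on_attribute)
    right_groups = group_positions(right_rows, on_attribute)
    l_idx, r_idx = [], []
    for key, left_pos in left_groups.items():
        right_pos = right_groups.get(key)
        if not right_pos:
            continue  # no right rows share this attribute combination
        pairs = toy_predicate_query(
            [left_rows[p]["geom"] for p in left_pos],
            [right_rows[p]["geom"] for p in right_pos],
        )
        # Translate positions within each subset back to positions in the inputs.
        for i, j in pairs:
            l_idx.append(left_pos[i])
            r_idx.append(right_pos[j])
    return l_idx, r_idx


left_rows = [{"zone": "a", "year": 2020, "geom": (0, 2)},
             {"zone": "a", "year": 2021, "geom": (0, 2)},
             {"zone": "b", "year": 2020, "geom": (5, 6)}]
right_rows = [{"zone": "a", "year": 2020, "geom": (1, 3)},
              {"zone": "a", "year": 2021, "geom": (10, 11)},
              {"zone": "b", "year": 2020, "geom": (5.5, 7)}]

l_idx, r_idx = query_per_attribute_group(left_rows, right_rows, ["zone", "year"])
print(l_idx, r_idx)
```

Grouping both sides up front avoids re-filtering the right-hand rows for every attribute value, though whether that outweighs the cost of many small spatial queries is untested here.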

jorisvandenbossche · 2024-06-24T09:18:13Z

Thanks @nicholas-ys-tan

nicholas-ys-tan force-pushed the issue3049 branch from 97ff8e5 to bc8df71 Compare March 29, 2024 03:19

nicholas-ys-tan marked this pull request as ready for review April 7, 2024 03:27

nicholas-ys-tan force-pushed the issue3049 branch from 230b2e6 to 0b31a68 Compare April 7, 2024 03:48

martinfleis reviewed Apr 8, 2024

nicholas-ys-tan and others added 11 commits April 20, 2024 23:51

ENH: add shared_attribute argument to sjoin

c35f244

ENH: on_attribute can be list or tuple, added tests

cf6510c

ENH: updated type hinting for tests

c979f1b

ENH: updated test on datatypes to improve readability

f8b4e9d

ENH: updated changelog

646d025

Update CHANGELOG.md

a0d1a10

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update sjoin docstring

4b0ab35

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update sjoin docstring

b2446af

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update tests, use f-string in error msg

ad2609a

fix slicing to use correct iloc method

a03f1fc

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Fix CHANGELOG formatting

c29abbe

nicholas-ys-tan force-pushed the issue3049 branch from ac36f80 to c29abbe Compare April 20, 2024 13:55

martinfleis added this to the 1.0 milestone Apr 28, 2024

martinfleis reviewed Apr 30, 2024

geopandas/tools/sjoin.py (outdated)

geopandas/tools/tests/test_sjoin.py

updated tests, updated error msg, updated docstring

3808b0c

martinfleis approved these changes May 2, 2024

martinfleis mentioned this pull request May 20, 2024

GeoPandas 1.0 release #3201

Open

11 tasks

kylebarron reviewed May 20, 2024

CHANGELOG.md (outdated)

Fix typo in CHANGELOG.md

23b0f1c

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>

nicholas-ys-tan and others added 2 commits May 21, 2024 17:07

Merge branch 'main' into issue3049

bc8eef5

linting

6c476a6

jorisvandenbossche reviewed May 21, 2024

geopandas/tools/sjoin.py (outdated)

geopandas/geodataframe.py (outdated)

geopandas/tools/sjoin.py (outdated)

nicholas-ys-tan and others added 4 commits May 21, 2024 21:57

Improve speed of reading dataframe in sjoin on attribute

64d8b56

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Update sjoin docstring

28d13f8

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Update sjoin docstring

8ee9272

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Remove on_attribute kwarg for sjoin_nearest

8dcf32d

Merge branch 'main' into issue3049

7e1056c

nicholas-ys-tan mentioned this pull request Jun 6, 2024

ENH: add attribute requirement to sjoin_nearest #3327

Open

jorisvandenbossche reviewed Jun 10, 2024

CHANGELOG.md (outdated)

jorisvandenbossche mentioned this pull request Jun 10, 2024

ENH: joint spatial and attribute join #3049

Open

update changelog

67b1f62

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jorisvandenbossche merged commit 217772b into geopandas:main Jun 24, 2024
19 of 20 checks passed
