Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add attribute requirement with spatial join #3231

Merged
merged 21 commits into from
Jun 24, 2024

Conversation

nicholas-ys-tan
Copy link
Contributor

Addresses #3049

Proposed approach to do a spatial join when both geometry and an attribute column are equal. Thought I'd get some eyes on it before I progress further as I'm new to contributing to this repo.

Things left to do:

  • more tests
  • changelog
  • don't add suffix to the sharedAttribute
  • consider if sharedAttribute should allow more than one attribute column, i.e. accept list and force all attributes in listed columns to be equal
  • more details in docstring

@martinfleis
Copy link
Member

Thanks! We should probably get #2353 in before touching sjoin code elsewhere.

One minor nit - please use snake_case in the code, not camelCase. Thanks!

@nicholas-ys-tan
Copy link
Contributor Author

Thanks Martin,

I noted #2353 has been merged. I've since re-based and resolved conflicts.

I've made further updates so it can take lists and tuples too - similar to the merge() function in pandas. Renamed shared_attribute (corrected to snake case) to on_attribute - to be a bit more similar to kwarg on used in pandas. Happy to change to any argument you think is preferable. Have also updated to the "_left", "_right" suffix does not get appended to the on_attribute columns.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Couple of notes in the code.

CHANGELOG.md Outdated Show resolved Hide resolved
geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the docstring of GeoDataFrame.sjoin in geodataframe.py?

Otherwise this looks good to my eyes. Thanks!

geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
geopandas/tools/tests/test_sjoin.py Show resolved Hide resolved
@nicholas-ys-tan
Copy link
Contributor Author

Can you also update the docstring of GeoDataFrame.sjoin in geodataframe.py?

Otherwise this looks good to my eyes. Thanks!

I've added to the docstring, I also noted the distance kwarg wasn't added to the GeoDataFrame.sjoin docstring, I've added that here (though wasn't sure if it really should be a separate commit)

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

@martinfleis martinfleis mentioned this pull request May 20, 2024
11 tasks
CHANGELOG.md Outdated Show resolved Hide resolved
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
@martinfleis
Copy link
Member

@nicholas-ys-tan can you merge main here to resolve conflicts?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sjoin this looks good!
I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or either first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate.
Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.

However, that also made me wonder if this approach is correct for the sjoin_nearest? Because in that case it does matter in which order you do those operations, I think? I would assume that I get "the closest geometry among those with a matching attribute", but in practice it will give "the overall closest geometry if that closest geometry has a matching attribute".
That seems an important difference in expected behaviour, which I am not sure has been discussed?

I would maybe suggest to focus this first PR on the sjoin function (where the behaviour is clearer), and leave out sjoin_nearest for later.

geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
geopandas/geodataframe.py Outdated Show resolved Hide resolved
geopandas/tools/sjoin.py Outdated Show resolved Hide resolved
@m-richards
Copy link
Member

For sjoin this looks good! I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or either first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate. Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.

I will confess I was not engaging with this PR and the original issue precisely because of this, I would think the best order of operations depends quite a bit of the data in question, and for that reason was wondering if this actually belongs in geopandas. But I can still see value in providing a convenience utility for this for the general case, even if it's not necessarily the most performant in all circumstances.

nicholas-ys-tan and others added 4 commits May 21, 2024 21:57
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@nicholas-ys-tan
Copy link
Contributor Author

For sjoin this looks good! I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or either first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate. Before I looked at the code in this PR, I was assuming this PR was for the latter, but of course for sjoin this should give identical results. So for a first version this is probably just fine. But I do think it might be interesting to explore performance characteristics of both on some example datasets in the future.

However, that also made me wonder if this approach is correct for the sjoin_nearest? Because in that case it does matter in which order you do those operations, I think? I would assume that I get "the closest geometry among those with a matching attribute", but in practice it will give "the overall closest geometry if that closest geometry has a matching attribute". That seems an important difference in expected behaviour, which I am not sure has been discussed?

I would maybe suggest to focus this first PR on the sjoin function (where the behaviour is clearer), and leave out sjoin_nearest for later.

Thank you @jorisvandenbossche , that's great food for thought re performance and I will do some investigation into that.

Also thank you for pointing out the implications on sjoin_nearest, the sequence and its impact on the output was not something that had crossed my mind. I've since removed all on_attribute joins from sjoin_nearest.

I will open up a new issue for discussion on the ordering of operations how on_attribute behaves in sjoin_nearest.

CHANGELOG.md Outdated Show resolved Hide resolved
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@nicholas-ys-tan
Copy link
Contributor Author

I was wondering what the best approach would be, though, but I can imagine that this will depend on the characteristics of your data. But in general you could either first evaluate the spatial predicate and then the attribute match (as you did here), or either first perform the attribute merge (as a way to reduce the set of geometries for which to evaluate the spatial predicate?) and then the spatial predicate.

@jorisvandenbossche , is this sort of what you had in mind in terms of evaluating the characteristics first, then the spatial predicate? I'm essentially passing in smaller dataframes into that already have the attribute filtered to have geometries evaluated separately. I haven't yet done any performance testing as I am not sure if this is the optimal approach - it feels a bit naive at the moment and wanted to run it by you first.

An initial test with test_sjoin_shared_attribute in this PR suggests this approach (~0.045 secs) would be slower than the approach currently in the PR (~0.025 secs). But, this may not be a great example with the dataset being relatively small. I imagine maybe the batched processing of geometry joins may become more preferable on larger datasets (pure speculation).

Additionally, this does not currently work for multiple on_attributes yet.

def _geom_predicate_query_on_attribute_wrapper(left_df, right_df, predicate, distance, on_attribute):

    unique_attrs = left_df[on_attribute[0]].unique()
    l_idx = []
    r_idx = []
    for attr in unique_attrs:
        right_attr_index = right_df[on_attribute[0]]==attr
        left_attr_index = left_df[on_attribute[0]]==attr
        right_df_attr = right_df[right_attr_index]
        left_df_attr = left_df[left_attr_index]

        left_df_idx, right_df_idx = _geom_predicate_query(left_df_attr, 
                                                          right_df_attr,
                                                          predicate, 
                                                          distance)

        l_idx += left_df_attr.index[left_df_idx].to_list()
        r_idx += right_df_attr.index[right_df_idx].to_list()


    return l_idx, r_idx

@jorisvandenbossche jorisvandenbossche merged commit 217772b into geopandas:main Jun 24, 2024
19 of 20 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @nicholas-ys-tan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants