Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Table fuzzy or cross-match join with SkyCoord, Quantity, Column #10169

Merged
merged 13 commits into from May 2, 2020

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Apr 18, 2020

Description

This PR adds the ability to specify a custom matching function for table joins. In particular this makes it possible to do cross-match table joins on SkyCoord, Quantity, or standard columns, where column entries within a specified distance are considered to be matched.

The change to join is to add a new kwarg join_funcs that is a dict of functions which are responsible for doing the actual cross-matching for a key column.

With sufficient ingenuity one could doing fuzzy text joins by supplying an appropriate join function.

To do:

  • Narrative docs describing using this and also how to write your own join_func.
  • Add join_funcs to join docstring.
  • Check for issues this might close.

@taldcroft
Copy link
Member Author

@mhvk - ping to make sure this is on your radar for 4.1.

@mhvk
Copy link
Contributor

mhvk commented Apr 28, 2020

Definitely on the radar, but trying to get #10210 wrapped up first... Unfortuantely, timing of feature freeze is bad in Spring, with lots of student projects/committee meetings.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really neat! Until I saw the join code, I hadn't actually quite appreciated it was also a quite tricky problem generally (though obviously so thinking about it), so those helper functions are going to be very useful as inspiration for others.

As you can see, I only have two comment, which are quite trivial. This makes me worry a bit, especially as it is quite late and hence I'm a bit sleepy, so maybe I can ping @adrn to have a look as well (as someone who might well want to use this...).

----------
distance : Quantity (angle or length)
Maximum distance between points to be considered a join match
distance_kind : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, this could be inferred based on the unit of distance.

Copy link
Member Author

@taldcroft taldcroft Apr 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even then I much prefer one less kwarg (simple is better than complex), on further reflection I think this "guessing" may be ambiguous because of u.dimensionless_angles(). Because of this equivalency one can pass in a dimensionless Quantity and it could either be a 3d cartesian distance or a dimensionless angular distance. I don't know of any rock solid way to disambiguate, and even if we come up with one it will be complicated to explain to users. So I think it might be best to just leave this be (explicit is better than implicit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can dimensionless ever be a 3-d distance? I guess the test would be ...unit.is_equivalent(u.radian).

But perhaps an alternative is to have the default be determined by the unit? It is good to have the user be able to pass in an explicit distance_kind so that things will simply fail if distance has the wrong units.

return join_func


def join_distance(distance, kdtree_args=None, query_args=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it make sense to re-purpose people's experience with isclose and have an atol and rtol? (in which case this might become join_close).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need to be a completely new function since this is explicitly applicable to N-dimensional points and uses KDTree to find the neighbors. For N-dim there is no concept of rtol.

For the 1-d case, I actually don't know how to do a cross match with atol and rtol without just doing the N^2 cross-match.

There is another problem I thought of, which is that this whole concept is based on the cross-matching being symmetric. If point A is "close" to point B, then it is required that B is "close" to A. That is true for a distance (atol), but breaks down for rtol.

In [7]: np.isclose(1, 0, atol=0.8, rtol=0.8)                                                                                                                  
Out[7]: False

In [8]: np.isclose(0, 1, atol=0.8, rtol=0.8)                                                                                                                  
Out[8]: True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, my brain was not quite there with the 2-d case. Let's just stick with distance which is nicely defined. (and, yes, isclose is not symmetric unlike the math equivalent; numpy/numpy#10161)

@mhvk
Copy link
Contributor

mhvk commented Apr 29, 2020

circle-ci failures try to import scipy...

@taldcroft
Copy link
Member Author

circle-ci failures try to import scipy...

Ah right, so need to skip test if no scipy.

Copy link
Member

@adrn adrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks great, and yes this will make writing simple cross-matching code much faster!

@@ -87,9 +88,238 @@ def _get_out_class(objs):
return out_class


def join_skycoord(distance, distance_kind='sky'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm usually against arguments that allow string values - why not require passing in the distance function directly, e.g.,

Suggested change
def join_skycoord(distance, distance_kind='sky'):
def join_skycoord(distance, distance_func=search_around_sky):

(some code below can be simplified)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but passing in the function is harder for the user (extra import required). We have had this philosophical difference going back to forever, but in the end we agreed that passing in strings should generally be allowed instead of requiring imported objects.

How about meeting in the middle and allowing a string which must correspond to a function in coordinatesi.e. (try getattr(astropy.coordinates, distance_func), or a function that has the same API as the current search_around functions. So instead 'sky' or '3d' it would be the full name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'll say that I am usually against putting in functionality where we have not yet identified a realistic use case. In this case allowing passing in a user-defined distance function that meets the search_around API exactly. So a plan B is to not allow passing in a function now and wait for user to request that feature. And in fact in that case the user could make their own join_func.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, philosophy. For completeness: This doesn't require another import if import astropy.coordinates as coord is already done (pass coord.search_around_sky). And the (weak) argument against string values is that it requires reading to see what values are allowed - if the default was coord.search_around_sky, the user might think to try coord.search_<TAB> (in a notebook) to see what other options there are. But, yes, this is a weak argument and perhaps too "behavioral"...

In practice, though, I don't feel very strongly about this, so I leave it up to you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in my world I just do from astropy.coordinates import SkyCoord and that is all I need! 😄 And in practice time is running out ... I wasn't totally happy with distance_kind and I like distance_func better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood - thanks for your work on this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if the default of the function is based on the unit, this will very rarely be needed!

docs/table/operations.rst Outdated Show resolved Hide resolved
@taldcroft taldcroft merged commit 9ad423b into astropy:master May 2, 2020
@taldcroft taldcroft deleted the table-fuzzy-join branch May 2, 2020 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants