Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

table.join_skycoord fails with large input tables: too many matches. SkyCoord.search_around_sky works fine though #13054

Open
alcione-mora opened this issue Apr 4, 2022 · 4 comments

Comments

@alcione-mora
Copy link

Dear Astropy team,

I am using search_around_sky to do a cross-match between two astropy tables. However, I find wrong results when the input tables are large (1000 and 852,128). However, SkyCoord.search_around_sky works fine. I provide an example code and input files reproducing the problem.

Any help would be appreciated

Thank you very much

Alcione Mora

from astropy import table
from astropy.coordinates import search_around_sky
import astropy.units as u
import numpy as np


# Load input tables (HEALPix grid close to south pole)
t1 = table.Table.read('astropy_sky_coord_bug/t1.fits')
t2 = table.Table.read('astropy_sky_coord_bug/t2.fits')


# xmatch using join_skycoord function WRONG!!!!
join_func = table.join_skycoord(0.05 * u.deg)
t12 = table.join(t1, t2, join_funcs={'sc': join_func})
t12['sep'] = t12['sc_1'].separation(t12['sc_2'])
t12.sort('sep', reverse=True)
print(t12)


# Re-cross-match worst offenders. It works: results are correct for small input tables?
sub_1 = t12['sc_1', 'sc_2'][:10]
sub_2 = t12['sc_1', 'sc_2'][:10]
sub_1.remove_column('sc_2')
sub_2.remove_column('sc_1')
sub_1.rename_column('sc_1', 'sc')
sub_2.rename_column('sc_2', 'sc')
join_func = table.join_skycoord(0.05 * u.deg)
sub_12 = table.join(sub_1, sub_2, join_funcs={'sc': join_func})
sub_12['separation'] = sub_12['sc_1'].separation(sub_12['sc_2'])
sub_12.sort('separation', reverse=True)
print(sub_12)


# Full cross-match using search_around_sky
idx1a, idx2a, sep2da, dist3da = search_around_sky(t1['sc'], t2['sc'], 0.05 * u.deg)
print(f'Maximum separation reported by search_around_sky: {np.max(sep2da)}')
separations = t1['sc'][idx1a].separation(t2['sc'][idx2a])
print(f'Paranoid check computing distances by hand: {np.max(separations)}')

t1.fits.zip
t2.fits.zip

@github-actions
Copy link

github-actions bot commented Apr 4, 2022

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

GitHub issues in the Astropy repository are used to track bug reports and feature requests; If your issue poses a question about how to use Astropy, please instead raise your question in the Astropy Discourse user forum and close this issue.

If you feel that this issue has not been responded to in a timely manner, please leave a comment mentioning our software support engineer @embray, or send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

@eerovaher
Copy link
Member

It looks like this issue about table.join() with join_skycoord() is what the warning in https://docs.astropy.org/en/stable/table/operations.html#joining-coordinates-and-custom-join-functions is about. You can get the expected results either by removing rows from the tables (which is what you have done in the example) or by using a smaller separation threshold (0.008 deg seems to work fine for your data).

Getting the expected output from table.join() with your full data and with the distance threshold you have specified would require replacing the joining algorithm, but documenting the issue in the relevant docstrings would be a good idea.

@alcione-mora alcione-mora changed the title search_around_sky fails with large input tables: too many matches. SkyCoord.search_around_sky works fine though table.join_skycoord fails with large input tables: too many matches. SkyCoord.search_around_sky works fine though Apr 4, 2022
@alcione-mora
Copy link
Author

alcione-mora commented Apr 4, 2022

Thanks a lot for the quick answer.

I understand the quick answer is keep using SkyCoord.search_around_sky if I want predictable and repeatable output not based on the input table length.

As a naïve user, I would say I would never expect fuzzy logic to be applied to table joins, at least not as the default behaviour. I understand it might help analyzing really big tables, but the unpredictability renders it useless for me. I would really recommend updating the docstring, so people only use table.join_skycoord it when they know what they are doing . I would also suggest adding a way to do exact table joins on sky distance. Maybe the answer is to use SkyCoord.search_around_sky, but it should be explicit.

Thanks again

Alcione

@lboogaard
Copy link

[..] but documenting the issue in the relevant docstrings would be a good idea.
[..] I would say I would never expect fuzzy logic to be applied to table joins, at least not as the default behaviour.

+1 on the above statements and thanks for reporting this.

I was looking for the Astropy-way to join two tables, looping through the first table and find all matches within in a few arcsec radius from a second table (what I'd consider a fairly normal use case). I found the join_skycoord functionality which looked promising (via the join_skycoord page) and was surprised to get completely unexpected and non-deterministic behavior depending on how short I trimmed my input source catalog (I actually only found this out by accident because I was testing it on a few sources before running the whole catalog).

I think the current implementation is indeed unexpected (if not dangerous), for what one expects to be a deterministic operation with a single solution. At least a warning would be very useful.

FWIW, here's what I ended up using, in case anyone is looking for something similar:

import astropy.units as u
from astropy.table import hstack
from astropy.coordinates import SkyCoord

# two catalogs with to-be-defined ra and dec columns in some unit
c1 = SkyCoord(cat1[ra1], cat1[dec1], unit=unit1)
c2 = SkyCoord(cat2[ra2], cat2[dec2], unit=unit2)

# match within 5 arcsec; note reversed order c2 and c1
idc1, idc2, d2d, d3d = c2.search_around_sky(c1, 5*u.arcsec)

matched_tables = hstack([cat1[idc1], cat2[idc2]], join_type='exact')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants