Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-matching fails with ArrowInvalid error #335

Closed
2 of 3 tasks
hombit opened this issue May 24, 2024 · 3 comments
Closed
2 of 3 tasks

Cross-matching fails with ArrowInvalid error #335

hombit opened this issue May 24, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@hombit
Copy link
Contributor

hombit commented May 24, 2024

Bug report

import lsdb

ztf_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/ztf/ztf_dr14/'
gaia_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia/'
gaia_margin_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia_10arcs/'

cone_search = dict(ra=283.44639, dec=33.06623, radius_arcsec=3600)

ztf_catalog = lsdb.read_hipscat(ztf_path, columns=["ra", "dec"]).cone_search(**cone_search)
gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
gaia_catalog = lsdb.read_hipscat(
    gaia_path,
    columns=["ra", "dec"],
    margin_cache=gaia_margin_catalog,
).cone_search(**cone_search)

lsdb_result = ztf_catalog.crossmatch(gaia_catalog).compute()
Expand
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 17
     10 gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
     11 gaia_catalog = lsdb.read_hipscat(
     12     gaia_path,
     13     columns=["ra", "dec"],
     14     margin_cache=gaia_margin_catalog,
     15 ).cone_search(**cone_search)
---> 17 lsdb_result = ztf_catalog.crossmatch(gaia_catalog).compute()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/catalog/dataset/dataset.py:39, in Dataset.compute(self)
     37 def compute(self) -> pd.DataFrame:
     38     """Compute dask distributed dataframe to pandas dataframe"""
---> 39     return self._ddf.compute()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/dask/base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
    351 def compute(self, **kwargs):
    352     """Compute this dask collection
    353 
    354     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    373     dask.compute
    374     """
--> 375     (result,) = compute(self, traverse=False, **kwargs)
    376     return result

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/dask/base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    658     postcomputes.append(x.__dask_postcompute__())
    660 with shorten_traceback():
--> 661     results = schedule(dsk, keys, **kwargs)
    663 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/dask/crossmatch_catalog_data.py:59, in perform_crossmatch(left_df, right_df, right_margin_df, left_pix, right_pix, right_margin_pix, left_catalog_info, right_catalog_info, right_margin_catalog_info, algorithm, suffixes, right_columns, meta_df, **kwargs)
     56 if len(left_df) == 0:
     57     return meta_df
---> 59 right_joined_df = concat_partition_and_margin(right_df, right_margin_df, right_columns)
     61 return algorithm(
     62     left_df,
     63     right_joined_df,
   (...)
     71     suffixes,
     72 ).crossmatch(**kwargs)

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/lsdb/dask/merge_catalog_functions.py:55, in concat_partition_and_margin(partition, margin, right_columns)
     53 margin_renamed = margin[margin_columns_no_hive].rename(columns=rename_columns)
     54 margin_filtered = margin_renamed[right_columns]
---> 55 joined_df = pd.concat([partition, margin_filtered]) if margin_filtered is not None else partition
     56 return joined_df

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/table.pxi:574, in pyarrow.lib.ChunkedArray.cast()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/compute.py:404, in cast(arr, target_type, safe, options, memory_pool)
    402     else:
    403         options = CastOptions.safe(target_type)
--> 404 return call_function("cast", [arr], options, memory_pool)

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/_compute.pyx:590, in pyarrow._compute.call_function()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /ocean/projects/phy210048p/malanche/lsdb-xmatch-perf/cenv/lib/python3.12/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Integer value 4180559770064781312 not in range: 0 to 9007199254740992

It is the same error as if I would try to cast a large uint64 to double:

pa.scalar(4180559770064781312, type=pa.uint64()).cast(pa.float64())

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a descriprion of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
@hombit hombit added the bug Something isn't working label May 24, 2024
@hombit
Copy link
Contributor Author

hombit commented May 24, 2024

I created a pandas issue for this, but I'm not 100% sure that we have this issue...
pandas-dev/pandas#58819

@hombit
Copy link
Contributor Author

hombit commented May 24, 2024

It looks like the root of the issue is that margin catalog has a wrong index dtype

import lsdb

gaia_margin_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/gaia_dr3/gaia_10arcs/'
gaia_margin_catalog = lsdb.read_hipscat(gaia_margin_path, columns=["ra", "dec"])
print(gaia_margin_catalog.index.dtype)

prints int64[pyarrow]

@hombit
Copy link
Contributor Author

hombit commented Jun 4, 2024

It happened to be the margin catalog issue, fixed by astronomy-commons/hipscat-import#325

@hombit hombit closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

No branches or pull requests

1 participant