# Crossmatching Catalogs

## Learning Objectives

At the end of this tutorial, you will understand:

- how to test a crossmatch by limiting inputs with cone searches and column selection
- how to crossmatch objects between two catalogs by `ra` and `dec`
- the importance of margin caches when crossmatching, and the default margin caches

You should already have an understanding of:

- how to open a catalog

## Introduction

To crossmatch two catalogs is to create a new catalog that contains columns from both inputs, with the rows aligned based on the geometric match of their `ra` and `dec` columns.

In [None]:
import lsdb
from dask.distributed import Client

## 1. Open a catalog

In this section, we create a basic Dask client and open two existing HATS catalogs: ZTF DR22 and Gaia DR3.

We start with the Dask client, limiting the number of workers. This keeps subsequent operations from using more of our compute resources than we might intend, which is helpful in any case but especially when working on a shared resource.

In [None]:
import warnings

# Suppress the specific warning about Dask dashboard port usage.
# Comment this out if you want to attach to the Dask dashboard.
# For the purposes of this tutorial, it's not necessary.
warnings.filterwarnings("ignore", message="Port 8787 is already in use.")


client = Client(n_workers=4, memory_limit="auto")
client

We open two catalogs, ZTF DR22 and Gaia DR3.  Both of these are catalog *collections* on http://data.lsdb.io, and each of them has a default margin cache catalog that is implicitly loaded.  The margin cache is important for crossmatching, as it will help with cases when objects of interest are right at the edge of the catalog pixels.
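
To see why a margin matters, here is a toy one-dimensional sketch. It is not how LSDB or HATS implement margin caches: the partition boundaries, the `nearest_within` and `match_partition` helpers, and all of the numbers are made up for illustration. The idea it shows is that when each partition is matched on its own, a right-hand object just across the boundary is invisible unless the partition also carries a margin of nearby neighbors.

```python
# Toy 1-D illustration of why margins matter (hypothetical, not LSDB's implementation).
# Positions are in arbitrary units; partitions are half-open intervals [lo, hi).


def nearest_within(x, candidates, radius):
    """Return the candidate closest to x, or None if none lies within radius."""
    in_range = [c for c in candidates if abs(c - x) <= radius]
    return min(in_range, key=lambda c: abs(c - x)) if in_range else None


def match_partition(left, right, lo, hi, radius, margin=0.0):
    """Match left points in [lo, hi) against right points in [lo - margin, hi + margin)."""
    left_part = [x for x in left if lo <= x < hi]
    right_part = [y for y in right if lo - margin <= y < hi + margin]
    return {x: nearest_within(x, right_part, radius) for x in left_part}


left = [4.0, 9.9]  # 9.9 sits right at the edge of partition [0, 10)
right = [4.1, 10.05]  # 10.05 belongs to the neighboring partition [10, 20)

without_margin = match_partition(left, right, 0, 10, radius=0.5)
with_margin = match_partition(left, right, 0, 10, radius=0.5, margin=0.5)

print("without margin:", without_margin)
print("with margin:   ", with_margin)
```

Without the margin, the point at `9.9` finds no partner even though a right-hand object sits only `0.15` away on the other side of the boundary; with the margin, the match is recovered.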

In [None]:
ztf22 = lsdb.open_catalog(
    "https://data.lsdb.io/hats/ztf_dr22",
)

gaia3 = lsdb.open_catalog(
    "https://data.lsdb.io/hats/gaia_dr3",
)

In [None]:
display(ztf22.columns)
display(gaia3.columns)

In [None]:
ztf22

### 1.1 Will these catalogs have any overlap at all?

To find an area of the sky where we know that crossmatching will succeed, we can use `.plot_pixels()` to get a quick view of the sky coverage for each catalog.

In [None]:
ztf22.plot_pixels()

In [None]:
gaia3.plot_pixels()

Yes, looks like they will, above `dec=-30`.

### 1.2 Work with small sections of the catalogs first

Before firing up the whole compute cluster to match both catalogs, it's good practice to choose a small
section of each catalog first, greatly limiting both I/O and compute.  We can limit *spatially* by using
a search filter such as a cone search or box search, and can limit *structurally* by only loading
columns of interest.

Let's use a cone search to isolate a tiny part of both catalogs in the same area, so that we can test our crossmatch.  We'll set the search filters of both catalogs to `lsdb.ConeSearch(280, 0, radius_arcsec=36)`.

We'll use the suffix `_sm` (small) to distinguish these from the full catalog.
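
Conceptually, a cone search keeps the objects whose angular separation from the cone center is at most the given radius. The sketch below shows that test in plain Python using the haversine formula. The `angular_separation_arcsec` and `in_cone` helpers and the sample points are hypothetical, and this is only the per-object test, not a reproduction of how `lsdb.ConeSearch` works at catalog scale.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds between two (ra, dec) points given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """True if (ra, dec) lies within radius_arcsec of the cone center."""
    return angular_separation_arcsec(ra, dec, center_ra, center_dec) <= radius_arcsec


# Hypothetical points near the tutorial's cone center (280, 0).
points = [(280.0, 0.0), (280.005, 0.005), (280.02, 0.0)]
inside = [p for p in points if in_cone(*p, 280, 0, radius_arcsec=36)]
print(inside)
```

The third point is 0.02 degrees (72 arcseconds) from the center, so it falls outside the 36-arcsecond cone.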

In [None]:
ztf22_sm = lsdb.open_catalog(
    "https://data.lsdb.io/hats/ztf_dr22",
    columns=["objectid", "objra", "objdec", "nepochs", "hmjd", "mag", "magerr"],
    search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)
ztf22_sm

### 1.3 Nesting columns with list data

Note that ZTF DR22 stores its lightcurve data as lists under each column.  To access and crossmatch these efficiently, we will use `.nest_lists` to arrange these lists into a single "nest" in the catalog.

Three of the columns we loaded into `ztf22_sm` (`hmjd`, `mag`, and `magerr`) are of type `list`, so let's put those lists into a nest
named `lc`, to make them more tractable.

In [None]:
ztf22_sm = ztf22_sm.nest_lists(
    list_columns=["hmjd", "mag", "magerr"],
    name="lc",
)
ztf22_sm

This small version, which we call `ztf22_sm`, occupies only one partition.
Pulling this into memory with `.compute()` won't take too long.

We're interested in seeing how many *objects* are in our chosen cone search.
While the number of rows is a fairly reliable proxy, the `set()` of identifiers
will be the most accurate, especially once we create the crossmatch catalog.

In [None]:
%%time
ztf_cone_objs = ztf22_sm["objectid"].compute()
print("Rows in the ZTF cone search:", len(ztf_cone_objs))
print("Unique ZTF objects in the cone:", len(set(ztf_cone_objs)))

### 1.4 Viewing the objects within the cone searches

We can plot these points graphically:

In [None]:
%%time
from astropy.coordinates import SkyCoord
import astropy.units as u

# Center our view on the cone.
center = SkyCoord(280 * u.deg, 0 * u.deg)
fov = (1 * u.arcmin, 1 * u.arcmin)

ztf22_sm.plot_points(center=center, fov=fov)

Now let's treat the Gaia catalog the same way, using the same small cone search, and
selecting only a few columns of interest.  Again, we'll use the suffix `_sm` for "small":

In [None]:
gaia3_sm = lsdb.open_catalog(
    "https://data.lsdb.io/hats/gaia_dr3",
    columns=["source_id", "ra", "dec", "parallax", "phot_g_n_obs", "phot_g_mean_mag"],
    search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)
gaia3_sm

In [None]:
%%time
gaia_cone_objs = gaia3_sm["source_id"].compute()
print("Rows in the Gaia cone search:", len(gaia_cone_objs))
print("Unique Gaia objects in the cone:", len(set(gaia_cone_objs)))

In [None]:
%%time
gaia3_sm.plot_points(center=center, fov=fov)

We can see that the Gaia catalog is much sparser than the ZTF catalog, with
almost a tenth as many objects in the same area of the sky.

## 2. Perform the crossmatch on the small catalogs

The default crossmatch algorithm is the `KdTreeCrossmatch`, and it is performed in a manner similar
to an inner join, in that the resulting catalog will only have rows containing objects that exist
in both catalogs.

### 2.1 Effect of ordering on crossmatching

However, the order of the catalogs still matters.  As we'll see,
`gaia X ztf` gives us different results than `ztf X gaia`, even though the unique objects
from the two catalogs participating in the crossmatch remain the same.

The algorithm takes each object in the left catalog and finds the closest spatial match
from the right catalog.  If this left catalog is *denser*, this means that more than one
object in the left may match the same object on the right.  If the left catalog is
*sparser*, the result will have fewer rows, as in the next example.
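
This asymmetry is easy to reproduce with a brute-force toy. The sketch below is not LSDB's `KdTreeCrossmatch`: it uses made-up object names, flat offsets in arcseconds instead of sky coordinates, and a hypothetical `nearest_match` helper with an assumed maximum matching radius.

```python
import math

# Hypothetical toy catalogs: name -> (x, y) offsets in arcseconds on a flat patch of sky.
dense = {"d1": (0.0, 0.0), "d2": (0.3, 0.0), "d3": (10.0, 0.0), "d4": (50.0, 0.0)}
sparse = {"s1": (0.1, 0.0), "s2": (10.2, 0.0)}


def nearest_match(left, right, radius):
    """For each left object, keep the closest right object if it lies within radius."""
    matches = []
    for left_id, left_pos in left.items():
        right_id, right_pos = min(
            right.items(), key=lambda item: math.dist(left_pos, item[1])
        )
        if math.dist(left_pos, right_pos) <= radius:
            matches.append((left_id, right_id))
    return matches


sparse_x_dense = nearest_match(sparse, dense, radius=1.0)
dense_x_sparse = nearest_match(dense, sparse, radius=1.0)

print(len(sparse_x_dense), sparse_x_dense)
print(len(dense_x_sparse), dense_x_sparse)
```

With the sparser catalog on the left there are two rows, one per left object. With the denser catalog on the left there are three: `s1` is the closest match for both `d1` and `d2`, while `d4` is dropped because nothing lies within the radius.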

#### 2.1.1 gaia X ztf (sparser on the left)

We'll start by matching Gaia against ZTF, or `gaia X ztf`.  Note that `gaia_x_ztf` will be the
uncomputed catalog, and we'll use the prefix `c_`, or `c_gaia_x_ztf` to indicate the computed
result.  The computed result from this small crossmatch will fit easily in memory, and it will
be a Pandas DataFrame.

In [None]:
%%time
gaia_x_ztf = gaia3_sm.crossmatch(ztf22_sm)
c_gaia_x_ztf = gaia_x_ztf.compute()
c_gaia_x_ztf

Note that the crossmatch catalog's columns include:

  * all the columns from Gaia, suffixed with `_gaia`
  * all the columns from ZTF, suffixed with `_ztf_lc`
  * a new column, `_dist_arcsec`, which gives the shortest (great circle) distance between the two matched objects in each row, in arcseconds
  
**NOTE:** The catalog identifiers `gaia` and `ztf_lc` come from the `.name` properties of the two input catalogs.  The `suffixes=` argument of `.crossmatch` can be used to override these.

In [None]:
c_gaia_x_ztf["_dist_arcsec"].sort_values()

Earlier, we plotted the points for both of the input catalogs.  Now we can plot the points of the output
catalog to see how well the crossmatch aligned.  Since each row combines columns from both input
catalogs, every matched row carries two (ra, dec) positions, one from each source catalog.

In our case, this means `(ra_gaia, dec_gaia)` and `(objra_ztf_lc, objdec_ztf_lc)`.

**NOTE:** We are using the *catalog objects*, not the *computed results*, for the plotting, because `.plot_points()`
is a method on the catalog, not on the Pandas DataFrame.

**NOTE:** See https://docs.lsdb.io/en/stable/tutorials/pre_executed/plotting.html for this trick as well as others.

In [None]:
%%time

# First the Gaia points
gaia_x_ztf.plot_points(
    ra_column="ra_gaia",
    dec_column="dec_gaia",
    center=center,
    fov=fov,
    c="red",
    marker="x",
    s=40,
    label="gaia points",
)

gaia_x_ztf.plot_points(
    ra_column="objra_ztf_lc",
    dec_column="objdec_ztf_lc",
    # Can skip the center & fov args here, since this is an overlay
    c="green",
    marker="+",
    s=30,
    label="ztf points",
)

#### 2.1.2 ztf X gaia (denser on the left)

Now let's crossmatch the same catalogs, but switching the left and right catalog.  Now we have the
denser catalog on the left.

The algorithm is, in pseudocode:

```
for every point in the left catalog:
    find the closest match in the right catalog
```

When there are more points in the left catalog than in the right, as in this case, an object in
the right catalog can be matched to more than one object in the left.

In [None]:
ztf_x_gaia = ztf22_sm.crossmatch(gaia3_sm)
ztf_x_gaia

In [None]:
%%time
c_ztf_x_gaia = ztf_x_gaia.compute()
c_ztf_x_gaia

47 matches in this case, vs 16 matches in the first case.  Let's look at the plot again:

In [None]:
%%time

# First the ZTF points
ztf_x_gaia.plot_points(
    ra_column="objra_ztf_lc",
    dec_column="objdec_ztf_lc",
    center=center,
    fov=fov,
    c="red",
    marker="x",
    s=30,
    label="ztf points",
)

# Then the Gaia points
ztf_x_gaia.plot_points(
    ra_column="ra_gaia",
    dec_column="dec_gaia",
    # Can skip the center & fov args here, since this is an overlay
    c="green",
    marker="+",
    s=40,
    label="gaia points",
)

There are several red `x` markers (left) for each green `+` marker (right).

And yet, out of the 129 points in the ZTF cone, only 47 of those could be considered matches at all;
the other 82 aren't even candidates.

### 2.2 Effect of `n_neighbors`

One of the arguments to `.crossmatch` is `n_neighbors`, which defaults to `1`, meaning that for each
object in the left catalog, we're looking for one best match from the right catalog.

But we can look for more than one best match.  By increasing this argument, we can get gaia X ztf to
produce the same results as ztf X gaia.
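
The same idea can be sketched on toy data. This is a brute-force illustration rather than LSDB's algorithm: the object names, the flat arcsecond offsets, the assumed matching radius, and the `k_nearest_matches` helper are all hypothetical.

```python
import heapq
import math

# Hypothetical toy catalogs: name -> (x, y) offsets in arcseconds on a flat patch of sky.
dense = {"d1": (0.0, 0.0), "d2": (0.3, 0.0), "d3": (10.0, 0.0)}
sparse = {"s1": (0.1, 0.0), "s2": (10.2, 0.0)}


def k_nearest_matches(left, right, radius, n_neighbors=1):
    """For each left object, keep up to n_neighbors closest right objects within radius."""
    pairs = set()
    for left_id, left_pos in left.items():
        in_range = [
            (math.dist(left_pos, right_pos), right_id)
            for right_id, right_pos in right.items()
            if math.dist(left_pos, right_pos) <= radius
        ]
        for _, right_id in heapq.nsmallest(n_neighbors, in_range):
            pairs.add((left_id, right_id))
    return pairs


one = k_nearest_matches(sparse, dense, radius=1.0, n_neighbors=1)
many = k_nearest_matches(sparse, dense, radius=1.0, n_neighbors=10)
reverse = k_nearest_matches(dense, sparse, radius=1.0, n_neighbors=1)

print(len(one), len(many), len(reverse))
# Flip the reversed pairs so both sets hold (sparse_id, dense_id) before comparing.
print(many == {(s, d) for d, s in reverse})
```

In this toy, raising `n_neighbors` lets `s1` pick up both `d1` and `d2`, so the sparse-on-the-left result contains the same pairs as the dense-on-the-left result. That equality depends on the data and the radius; it is not guaranteed in general.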

In [None]:
gaia_x_ztf_n10 = gaia3_sm.crossmatch(ztf22_sm, n_neighbors=10)
c_gaia_x_ztf_n10 = gaia_x_ztf_n10.compute()
len(c_gaia_x_ztf_n10)

47 results, just like ztf X gaia with `n_neighbors=1`.

### 2.3 Verifying object identifiers

Converting the catalog identifiers from the crossmatches to Python sets makes it easy to
verify that we've got not only the right count but the same results.  We're showing that
the computed gaia X ztf crossmatch with 10 neighbors produces the same set of results
as the computed ztf X gaia crossmatch with 1 neighbor.

In [None]:
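# Compare the (Gaia id, ZTF id) pairs row by row.  Calling set() directly on a
# DataFrame iterates over its column names, so it would compare only the labels.
id_cols = ["source_id_gaia", "objectid_ztf_lc"]
set(c_gaia_x_ztf_n10[id_cols].itertuples(index=False, name=None)) == set(
    c_ztf_x_gaia[id_cols].itertuples(index=False, name=None)
)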

## Closing the Dask client

In [None]:
client.close()

## About

**Authors**: Derek Jones

**Last updated on**:  June 4, 2025

If you use `lsdb` for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).