# The Catalog Object


## Learning Objectives

In this tutorial, you will learn:

- The purpose and scope of the `Catalog` object in an LSDB pipeline.
- How to load catalogs in LSDB and inspect their metadata.

## Introduction

The `Catalog` object encapsulates all of the information that LSDB knows about an astronomical catalog, and is the basis for performing operations on the underlying catalog data.

There are two types of catalog data that the `Catalog` object exposes:
1. High-level metadata: the columns and table schema of the catalog, the number of data partitions, sky coverage, provenance information, and basic aggregate statistics about the data.
1. Leaf-level tabular data: the full rows of data from the objects and/or observations in the catalog.


## 1. Getting Data into LSDB

The simplest way to load a catalog in LSDB is to call `lsdb.open_catalog()` with a path to a catalog in the HATS format. This returns a `Catalog` object with all the high-level metadata that LSDB needs to let you work with the catalog. We recommend visiting our website, [data.lsdb.io](https://data.lsdb.io), where large surveys in HATS format are publicly available. If you want to get your own external data into LSDB, see the [Import Catalogs](import_catalogs.html) topic instead.

Let's open Gaia DR3 as an example and take a look at the object we get back.

In [None]:
import lsdb

In [None]:
gaia_path = "https://data.lsdb.io/hats/gaia_dr3"
gaia = lsdb.open_catalog(gaia_path)
gaia

The Gaia catalog is very wide, so opening it without selecting columns would request its whole set of >150 columns. We can see all of the columns available in the catalog by using the `all_columns` property.

In [None]:
gaia.all_columns[:10]  # Truncating the output to not display the whole list

Note that it's important (and highly recommended) to:

- **Pre-select a small subset of columns** that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available, use the `catalog.all_columns` property, then load the catalog with only the necessary columns.

- **Load catalogs with their respective margin caches**, when available. These margins are necessary to obtain accurate results in several operations, such as joining and crossmatching. If you're working with catalogs from [data.lsdb.io](https://data.lsdb.io), the `open_catalog()` call provided there for you to copy will include the margin cache if one is available. For more information about margins, please visit our [Margins](margins.ipynb) topic notebook.
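As a small illustration of the first recommendation, the sketch below checks a desired column list against the names a catalog reports before passing it to `columns=`. The helper `split_known_columns` is hypothetical (it is not part of LSDB), and the `available` list is a short stand-in for what `gaia.all_columns` would return.

```python
def split_known_columns(wanted, available):
    """Split `wanted` into names found in `available` and names that are missing.

    Hypothetical helper for illustration only; it is not part of the LSDB API.
    """
    available_set = set(available)
    known = [name for name in wanted if name in available_set]
    missing = [name for name in wanted if name not in available_set]
    return known, missing


# Stand-in for `gaia.all_columns`; the real list has more than 150 entries.
available = ["source_id", "ra", "dec", "phot_g_mean_mag", "phot_proc_mode"]

# One name is deliberately misspelled to show how it gets reported.
wanted = ["source_id", "ra", "dec", "phot_g_mean_magg"]

known, missing = split_known_columns(wanted, available)
print("usable columns:", known)
print("not in catalog:", missing)
```

Catching a misspelled name at this stage is cheaper than discovering it after a pipeline has already been planned around the catalog.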

Let's define the set of columns we need and add the margin catalog's path to our `open_catalog` call.

In [None]:
gaia = lsdb.open_catalog(
    "https://data.lsdb.io/hats/gaia_dr3/gaia/",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs/",
    columns=[
        "source_id",
        "ra",
        "dec",
        "phot_g_mean_mag",
        "phot_proc_mode",
        "azero_gspphot",
        "classprob_dsc_combmod_star",
    ],
)
gaia

When we look at the catalog's representation above, we can see all the columns in the catalog object along with their data types, and information about how HATS has partitioned the catalog. But there's one thing we can't see: we haven't loaded any of the data yet! That's why `...` appears as a placeholder for the data, along with the warning at the bottom. LSDB's operations are what we call *lazy*: they don't perform any work on the data when you call them; they only plan the pipeline of operations to be performed later. This is how LSDB can work on huge catalogs with billions of rows and run on any scale of device, from a laptop up to a supercomputer. To learn more about how lazy operations work, take a look at our [lazy operations tutorial](lazy_operations.html).
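To make the idea of planning first and executing later more concrete, here is a toy sketch in plain Python. It does not reflect how LSDB is implemented; `LazyPipeline` and `load_magnitudes` are invented names used only to show that no data is read until the result is explicitly requested.

```python
class LazyPipeline:
    """Toy lazy object: it records steps and only runs them on compute()."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that produces the rows
        self._steps = tuple(steps)   # functions to apply, in order

    def map(self, func):
        # Planning only: return a new pipeline with one more step, touching no data.
        return LazyPipeline(self._source, self._steps + (func,))

    def compute(self):
        # Execution: load the rows, then apply every planned step in order.
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return rows


loads = []  # records each time the "catalog" is actually read


def load_magnitudes():
    loads.append("read")
    return [12.5, 18.1, 20.7, 15.3]


# Plan a filter for bright sources; nothing is loaded at this point.
bright = LazyPipeline(load_magnitudes).map(lambda rows: [m for m in rows if m < 16])
reads_before = len(loads)

# Only now is the data read and the filter applied.
result = bright.compute()
print("reads before compute:", reads_before)
print("result:", result, "reads after compute:", len(loads))
```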

## 2. Inspecting the Catalog metadata

The natural next step once you have a catalog object is to explore the metadata that has been loaded to understand what kind of data is inside your catalog.

First, we will generate a basic plot showing the sky coverage of the catalog. The `Catalog` object's `plot_pixels` method plots the HATS partitioning of the catalog. Gaia is a survey that covers the whole sky, so we see the whole sky covered in pixels. The colors of the pixels represent the pixel sizes. The main advantage of HATS partitioning is that the partitions all contain roughly the same number of rows, so the smaller the pixels, the denser the catalog is in that area. This explains why we see smaller pixels in the galactic bulge.

In [None]:
gaia.plot_pixels()
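The link between pixel size and source density can be made concrete with a little arithmetic. HATS partitions are based on HEALPix pixels, and at order `k` the sky is divided into `12 * 4**k` equal-area pixels. The sketch below computes the area of one pixel at a few orders and the density implied by a fixed number of rows per partition. The value of `rows_per_partition` is an assumption chosen purely for illustration, not a property of the Gaia catalog.

```python
import math

FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2  # about 41,253 square degrees


def healpix_pixel_count(order):
    """Number of equal-area HEALPix pixels covering the sky at a given order."""
    return 12 * 4**order


def pixel_area_sq_deg(order):
    """Area of one pixel at the given order, in square degrees."""
    return FULL_SKY_SQ_DEG / healpix_pixel_count(order)


# Assumed, purely illustrative partition size; real catalogs choose their own threshold.
rows_per_partition = 1_000_000

for order in range(0, 7, 2):
    area = pixel_area_sq_deg(order)
    density = rows_per_partition / area
    print(f"order {order}: {area:10.3f} sq deg per pixel, ~{density:,.0f} rows per sq deg")
```

Each step up in order divides the pixel area by four, so a partition holding the same number of rows at the next order corresponds to a region roughly four times denser.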

We can also get an idea of the schema of data that's stored in the catalog, by looking at the `columns` and `dtypes`.

In [None]:
gaia.columns

In [None]:
gaia.dtypes

The `columns` property shows the columns that have been loaded and will be available to any operations on the catalog. You can still see all of the columns in the catalog by using the `all_columns` property, but to use any column that isn't in `columns`, the catalog needs to be opened again with that column selected.

In [None]:
gaia.all_columns[:10]  # Truncating the output to not display the whole list

We can also see how many objects are in the catalog, which is another piece of metadata that is loaded by `open_catalog`.

In [None]:
len(gaia)

## Working with Catalog Data

Now that we have a catalog object, we're ready to start planning and executing operations on the data! Our next tutorials will explain how that works and all the operations you can do with LSDB.

## About

**Authors**: Sandro Campos, Melissa DeLucchi, and Sean McGuire

**Last updated on**: Jun 26, 2025

If you use `lsdb` for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).