# The Catalog Object


## Learning Objectives

In this tutorial, you will learn:

- The purpose and scope of the `Catalog` object in an LSDB pipeline.
- What lazy operations are and how LSDB uses them to run pipelines at scale.

## Introduction

The `Catalog` object encapsulates all of the information that LSDB knows about an astronomical catalog, and is the basis for performing operations on the underlying catalog data.

There are two types of catalog data that the `Catalog` object exposes:
1. High-level metadata: the columns and table schema of the catalog, the number of data partitions, sky coverage, provenance information, and basic aggregate statistics about the data.
1. Leaf-level tabular data: the full rows of data for the objects and/or observations in the catalog.


## 1. Loading a catalog

The simplest way to load a catalog in LSDB is to call `lsdb.open_catalog()` with a path to a catalog in the HATS format. This returns a `Catalog` object with all the high-level metadata that LSDB needs to work with the catalog already loaded. We recommend visiting our website, [data.lsdb.io](https://data.lsdb.io), where you can find large surveys publicly available in HATS format. Let's open GAIA DR3 as an example and take a look at the object we get back.

In [None]:
import lsdb

In [None]:
gaia_path = "https://data.lsdb.io/hats/gaia_dr3"
gaia = lsdb.open_catalog(gaia_path)
gaia

### Lazy operations

When we look at the catalog's representation above, we can see all the columns that are in the catalog object along with their datatypes, and information about how HATS has partitioned the catalog. But there's one thing that we can't see: we haven't loaded any of the data yet! That's why we have the `...` as placeholders for the data, and the warning at the bottom. This is because LSDB's operations are what we call *lazy*: they don't actually perform any work on the data when you call them, they just plan out the pipeline of operations to be performed later. This is how LSDB can work on huge catalogs with billions of rows, and run on any scale of device from a laptop up to a supercomputer.

<video src="../_static/lazy-flowchart.mp4" loop autoplay controls style="width: 100%;"></video>

As explained in the video above, calling LSDB operations on the catalog builds up a task graph: an object that keeps track of the pipeline of operations you want to perform on the catalog. To actually execute the operations, call the `catalog.compute()` method, which runs the pipeline and returns the resulting data as a pandas `DataFrame`.
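
To make the idea of planning first and executing later concrete, here is a minimal pure-Python sketch. It is a toy, not how LSDB or Dask is implemented: the `LazyTable` class and its `filter` and `select` methods are hypothetical names invented for this illustration. Each call only appends a step to a list of pending operations, and nothing touches the rows until `compute()` is called.

```python
# Toy illustration of lazy evaluation; this is NOT how LSDB or Dask is implemented.
class LazyTable:
    """Records operations on a list of row dicts and runs them only on compute()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        # The "task graph" of this toy: an ordered tuple of pending steps.
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Plan a row filter; no rows are touched yet.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # Plan a column selection; no rows are touched yet.
        return LazyTable(self._rows, self._steps + (("select", columns),))

    def compute(self):
        # Only now is the planned pipeline executed, step by step.
        result = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                result = [row for row in result if arg(row)]
            else:  # "select"
                result = [{col: row[col] for col in arg} for row in result]
        return result


rows = [
    {"id": 1, "ra": 10.0, "mag": 12.5},
    {"id": 2, "ra": 20.0, "mag": 18.1},
    {"id": 3, "ra": 30.0, "mag": 9.7},
]

# Build the plan: at this point `planned` only holds two pending steps.
planned = LazyTable(rows).filter(lambda row: row["mag"] < 15).select("id", "mag")

# Execute the plan.
bright = planned.compute()
print(bright)
```

Because building the plan is cheap, the same pattern lets a real framework decide later how to split the work across partitions and workers.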

## 2. Inspecting the Catalog metadata

The natural next step once you have a catalog object is to explore the metadata that has been loaded to understand what kind of data is inside your catalog.

First, we will generate a basic plot showing the sky coverage of the catalog. The `Catalog` object's `plot_pixels` method plots the HATS partitioning of the catalog. GAIA is a survey that covers the whole sky, so we see the whole sky covered in pixels. The colors of the pixels represent the pixel sizes. The main advantage of HATS partitioning is that all partitions contain roughly the same number of rows, so the smaller the pixels, the denser the catalog is in that area. This explains why we see smaller pixels in the galactic bulge.

In [None]:
gaia.plot_pixels()
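
The link between pixel size and density can be sketched with a toy adaptive split. This is only an illustration under stated assumptions, not the procedure HATS uses to build a catalog: the `adaptive_partition` function, the `threshold` parameter, and the sample indices are hypothetical. It assumes HEALPix NESTED indexing, in which pixel `p` at one order has the four children `4 * p` to `4 * p + 3` at the next order, and splits any pixel holding more rows than the threshold.

```python
def adaptive_partition(point_pixels, max_order, threshold):
    """Return {(order, pixel): row_count}, splitting pixels above `threshold` rows.

    `point_pixels` holds one NESTED pixel index per row, given at `max_order`.
    Toy sketch only; not the actual HATS import algorithm.
    """
    partitions = {}
    # Start from the 12 base pixels at order 0.
    stack = [(0, pixel) for pixel in range(12)]
    while stack:
        order, pixel = stack.pop()
        # Dropping 2 bits per order maps a max_order index to its ancestor.
        shift = 2 * (max_order - order)
        count = sum(1 for p in point_pixels if p >> shift == pixel)
        if count == 0:
            continue  # empty pixels produce no partition
        if count > threshold and order < max_order:
            # Too many rows: split into the four child pixels.
            stack.extend((order + 1, 4 * pixel + child) for child in range(4))
        else:
            partitions[(order, pixel)] = count
    return partitions


# Eight rows crowded into base pixel 0, three rows spread elsewhere.
points = [0, 1, 2, 3, 4, 5, 6, 7, 40, 100, 101]
parts = adaptive_partition(points, max_order=2, threshold=4)
for (order, pixel), count in sorted(parts.items()):
    print(f"order={order} pixel={pixel} rows={count}")
```

The crowded region ends up in two smaller order-1 pixels, while the sparse regions stay in large order-0 pixels, which mirrors the pattern visible in the plot.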

We can also get an idea of the schema of data that's stored in the catalog, by looking at the `columns` and `dtypes`.

In [None]:
gaia.columns

In [None]:
gaia.dtypes

We can also see how many objects are in the catalog, which is another piece of metadata that is loaded by `open_catalog`.

In [None]:
len(gaia)

## 3. Operating on the Catalog

Once we have a catalog object, we can start planning operations on it. In the rest of the tutorials, we'll look deeper into exactly what kind of operations you can do with a catalog. The catalog is based on pandas `DataFrames` so you'll see some functions that work the same as in pandas, such as `columns`, `dtypes`, `query`, and selecting columns or filtering with `[]`.

After you've planned your operations, you can call `catalog.compute()` to execute the pipeline, but be aware that this runs on the entire catalog!

## 4. Previewing part of the data

Computing an entire catalog will result in loading all of its data into memory on your local machine after the workers have computed it, which is expensive and may lead to out-of-memory issues.

Often, our goal is just to peek at a slice of the data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and that their values have been processed properly). `head()` is a pandas-like method that lets us preview part of the data for this purpose. It runs the pipeline on the catalog partitions one by one and returns the first few rows of the results.
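
The partition-by-partition behaviour can be pictured with a small pure-Python sketch. It is one plausible reading of the description above rather than LSDB's actual implementation: `lazy_head`, `bright_only`, and the sample partitions are hypothetical names and data made up for the illustration. The point is that the pipeline only runs on as many partitions as are needed to collect `n` rows.

```python
def lazy_head(partitions, pipeline, n=5):
    """Collect the first `n` result rows, processing partitions one at a time.

    Returns the rows and the number of partitions that had to be processed.
    Toy sketch only; not LSDB's implementation of head().
    """
    collected = []
    processed = 0
    for partition in partitions:
        processed += 1
        collected.extend(pipeline(partition))
        if len(collected) >= n:
            break  # enough rows: the remaining partitions are never touched
    return collected[:n], processed


def bright_only(partition):
    # Stand-in for a planned pipeline: keep rows brighter than magnitude 12.
    return [row for row in partition if row["mag"] < 12]


partitions = [
    [{"id": 1, "mag": 17.0}, {"id": 2, "mag": 11.0}],
    [{"id": 3, "mag": 9.0}, {"id": 4, "mag": 8.5}],
    [{"id": 5, "mag": 10.0}],
    [{"id": 6, "mag": 7.0}],
]

preview, processed = lazy_head(partitions, bright_only, n=3)
print([row["id"] for row in preview], "partitions processed:", processed)
```

Here only two of the four partitions are processed, which is why a preview is so much cheaper than computing the whole catalog.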

### Making a Dask client

LSDB is built on top of the [Dask](https://www.dask.org) framework, which allows the pipelines to be executed on distributed workers. Before doing anything that executes the pipeline, such as `head()` or `compute()`, we recommend creating a Dask client.

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, memory_limit="auto")
client

In [None]:
gaia.head()

By default, the first 5 rows of data are shown, but we can request more if needed.

In [None]:
gaia.head(n=10)

### Closing the Dask client

In [None]:
client.close()

## About

**Authors**: Sandro Campos, Melissa DeLucchi, and Sean McGuire

**Last updated on**: Jun 26, 2025

If you use `lsdb` for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).