# The Catalog Object


## Learning Objectives

In this tutorial, we will discuss the purpose and scope of the `Catalog` object in an LSDB pipeline.

TODO - Help wanted

https://github.com/astronomy-commons/lsdb/issues/661

## Introduction

The `Catalog` object encapsulates all of the information that LSDB knows about the survey data, and is the basis for providing operations on the underlying catalog data.

There are two primary types of catalog data that the `Catalog` object exposes:
1. High-level metadata: number of data partitions, sky coverage, provenance information, and basic aggregate statistics about the data.
2. Leaf-level tabular data: the Parquet files containing objects and/or observations.


## 1. Load a catalog

We create a basic Dask client and load an existing HATS catalog, the ZTF DR22 catalog.

The catalog is loaded lazily: we can see its metadata, but no actual data has been read yet.

We will be defining more operations in this notebook. Operations are executed only when we call `compute()` on the resulting catalog; that is when data is read from the leaf Parquet files into memory for processing.

In [None]:
import lsdb
from dask.distributed import Client

In [None]:
client = Client(n_workers=4, memory_limit="auto")
client

In [None]:
ztf_object_path = "https://data.lsdb.io/hats/ztf_dr22/ztf_lc"
ztf_object = lsdb.read_hats(ztf_object_path)
ztf_object
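
The deferred execution described in this section can be illustrated with a toy, pure-Python sketch. `LazyTable`, `map_rows`, `load_rows`, and `add_ra_hours` are hypothetical names invented for this illustration; this is not how LSDB or Dask are implemented, only a minimal model of "record operations now, load and run them on `compute()`".

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated catalog (hypothetical, not LSDB's API)."""

    def __init__(self, loader, operations=()):
        self._loader = loader                  # callable that reads the rows when invoked
        self._operations = tuple(operations)   # pending steps, not yet executed

    def map_rows(self, func):
        # Record the step and return a new lazy object; nothing is loaded here.
        return LazyTable(self._loader, self._operations + (func,))

    def compute(self):
        # Only now is the data loaded and every recorded step applied in order.
        rows = self._loader()
        for func in self._operations:
            rows = [func(row) for row in rows]
        return rows


load_calls = []


def load_rows():
    # Pretend this reads leaf files; each call is recorded to show when it happens.
    load_calls.append("loaded")
    return [{"ra": 10.0, "dec": -5.0}, {"ra": 20.0, "dec": 30.0}]


def add_ra_hours(row):
    # Example operation: derive a new column from an existing one.
    return {**row, "ra_hours": row["ra"] / 15.0}


table = LazyTable(load_rows).map_rows(add_ra_hours)
calls_before_compute = len(load_calls)  # still zero: no data has been read
result = table.compute()
calls_after_compute = len(load_calls)   # the loader ran exactly once
```

In this sketch, building `table` costs nothing; the loader runs only inside `compute()`, which mirrors why inspecting a freshly loaded catalog is cheap.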

## 2. Basic catalog inspection

The natural next step once you have loaded a catalog is to understand what kind of data it contains.

First, we will generate a basic plot showing the sky coverage of the catalog. ZTF is a northern hemisphere survey, so we only see data for regions where the declination satisfies `dec > -30 deg`. The lighter areas in the plot trace the shape of the galactic plane.

NOTE: The darker blue/green areas represent partitions that cover a larger angular area, where there are fewer objects. The yellow areas are denser regions of the sky, where each partition covers a smaller angular area. This adaptive partitioning is fundamental to the HATS format.

In [None]:
ztf_object.plot_pixels()
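
To make the link between partition size and angular area concrete, the sketch below computes the area of a HEALPix pixel at several orders, using the standard HEALPix relation of 12 base pixels that each split into 4 at every higher order. The function names are made up for this illustration, and the range of orders printed is arbitrary rather than taken from the ZTF catalog.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians converted to degrees.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2  # about 41253 sq deg


def healpix_pixel_count(order):
    # HEALPix has 12 base pixels; each order splits every pixel into 4.
    return 12 * 4**order


def healpix_pixel_area_sq_deg(order):
    # All pixels at a given order have equal area.
    return FULL_SKY_SQ_DEG / healpix_pixel_count(order)


for order in range(8):
    print(
        f"order {order}: {healpix_pixel_count(order):>7} pixels, "
        f"{healpix_pixel_area_sq_deg(order):10.3f} sq deg each"
    )
```

Each step up in order divides the pixel area by four, which is why dense regions of the sky end up in higher-order, smaller-area partitions.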

We can also get an idea of the kind of data stored in the catalog by looking at its `columns` and `dtypes`.

In [None]:
ztf_object.columns

In [None]:
ztf_object.dtypes

We can see how many objects are in the catalog, again without needing to scan the entire dataset.

In [None]:
len(ztf_object)

## 3. Previewing part of the data

Computing an entire catalog requires loading all of its resulting data into memory, which is expensive and may lead to out-of-memory issues. 

Often, our goal is to peek at a slice of the data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and that their values have been properly processed). `head()` is a pandas-like method that lets us preview part of the data for this purpose. It iterates over the existing catalog partitions, in sequence, and collects up to `n` rows.

Notice that this method implicitly calls `compute()`.

In [None]:
ztf_object.head()

By default, the first 5 rows of data are shown, but we can request more if needed.

In [None]:
ztf_object.head(n=10)
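
The partition-by-partition behaviour of `head()` described in this section can be sketched in plain Python. This is a simplified model under stated assumptions, not LSDB's actual implementation: `head`, `make_loader`, and the list-of-loaders representation of partitions are all hypothetical.

```python
def head(partition_loaders, n=5):
    """Collect up to n rows, loading partitions one at a time and stopping early."""
    rows = []
    for load in partition_loaders:
        if len(rows) >= n:
            break  # enough rows collected: remaining partitions are never loaded
        # Take only as many rows as are still needed from this partition.
        rows.extend(load()[: n - len(rows)])
    return rows


loaded = []


def make_loader(index, size):
    # Build a hypothetical partition loader that records when it is called.
    def load():
        loaded.append(index)
        return [f"partition{index}-row{i}" for i in range(size)]

    return load


# Four partitions with 3, 0, 4, and 4 rows; empty partitions are simply skipped over.
loaders = [make_loader(0, 3), make_loader(1, 0), make_loader(2, 4), make_loader(3, 4)]
preview = head(loaders, n=5)
```

Here the preview is filled from the first three partitions and the fourth is never read, which is the reason a preview is much cheaper than computing the whole catalog.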

## Closing the Dask client

In [None]:
client.close()

## About

**Authors**: Sandro Campos and Melissa DeLucchi

**Last updated on**: April 14, 2025

If you use `lsdb` for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).