# Lazy Operations in LSDB

## Learning Objectives

By the end of this tutorial, you will learn:

* What lazy operations are and how LSDB uses them to run pipelines at scale.
* How to preview a small part of the data.

## Introduction - What are Lazy Operations?

In the previous tutorial we looked at loading a catalog and inspecting its metadata. When we call `open_catalog()`, only the catalog's metadata is loaded, not any of the data in the rows of the catalog. This is because operations in LSDB are *lazy*: calling an operation does not execute it immediately.

<video src="../_static/lazy-flowchart.mp4" loop autoplay controls style="width: 100%;"></video>

As explained in the video above, calling an operation only plans it instead of executing it. The catalog object records the pipeline of operations you want to perform on the catalog in a task graph. This way, you can write the code for the pipeline locally, and the task graph is sent to the workers, which execute the pipeline in parallel. The task graph can also be optimized to make the workflow as efficient as possible. This is how LSDB can scale from your local machine to clusters or the cloud without any code changes.

To actually execute the operations, you call the `catalog.compute()` method, which will execute the pipeline and return the resulting data as a pandas `DataFrame`.
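The plan-then-compute idea can be illustrated with a small pure-Python sketch. This is a toy model, not LSDB's or Dask's actual implementation: the `LazyTable` class, its `filter` and `assign` methods, and the `load_rows` function are all hypothetical names invented for this illustration. Each method only appends a step to a plan, and the data is read only when `compute()` is called.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated catalog (hypothetical, not LSDB's API)."""

    def __init__(self, load_rows, steps=()):
        self._load_rows = load_rows  # callable that produces the rows when asked
        self._steps = tuple(steps)   # planned operations, in order

    def filter(self, predicate):
        # Plan a filter; nothing is executed yet.
        return LazyTable(self._load_rows, self._steps + (("filter", predicate),))

    def assign(self, name, func):
        # Plan a new column computed from each row; nothing is executed yet.
        return LazyTable(self._load_rows, self._steps + (("assign", (name, func)),))

    def compute(self):
        # Only now is the data loaded and the planned steps run in order.
        rows = [dict(row) for row in self._load_rows()]
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                name, func = arg
                rows = [{**row, name: func(row)} for row in rows]
        return rows


load_calls = []  # records every time the data is actually read


def load_rows():
    load_calls.append(1)
    return [{"id": 1, "mag": 9.5}, {"id": 2, "mag": 14.2}, {"id": 3, "mag": 11.0}]


table = LazyTable(load_rows)
planned = table.filter(lambda row: row["mag"] < 12).assign(
    "bright", lambda row: row["mag"] < 10
)

print(len(load_calls))  # planning has not read any data yet
result = planned.compute()
print(len(load_calls))  # compute() triggered exactly one load
print(result)
```

In this toy, the plan is a simple list of steps executed in one process. In LSDB the plan is a task graph that can be optimized and distributed across workers, but the user-facing pattern is the same: chain operations first, then call `compute()`.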

You will find that most use cases start with **LAZY** loading and planning operations, followed by more expensive **COMPUTE** operations. The data is only loaded into memory when we trigger the workflow computations, usually with a `compute` call.

![Lazy workflow diagram](../_static/lazy_diagram.svg)

In [None]:
import lsdb

In [None]:
gaia = lsdb.open_catalog("https://data.lsdb.io/hats/gaia_dr3/gaia/")
gaia

The `...` placeholders shown in place of the data, and the warning at the bottom, tell us that this catalog has been loaded lazily. Now that we have the catalog object, we're ready to start performing operations!

## Operating on the Catalog

Once we have a catalog object, we can start planning operations on it. In the rest of the tutorials, we'll look deeper into exactly what kind of operations you can do with a catalog. The catalog is based on pandas `DataFrames`, so you'll see some attributes and methods that work the same as in pandas, such as `columns`, `dtypes`, `query`, and selecting columns or filtering with `[]`.

After you've planned your operations, you can call `catalog.compute()` to execute the pipeline, but this will run on the entire catalog!

## Previewing part of the data

Computing an entire catalog will result in loading all of its data into memory on your local machine after the workers have computed it, which is expensive and may lead to out-of-memory issues.

Often, our goal is to have a peek at a slice of data to make sure the workflow output is reasonable (e.g., to check that newly created columns are present and that their values have been properly processed). `head()` is a pandas-like method that allows us to preview part of the data for this purpose. It runs the pipeline on the catalog partitions one by one and returns the first few rows of the results.
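To see why a partition-by-partition preview is cheaper than a full compute, here is a minimal pure-Python sketch of the idea. It is an illustration under assumptions, not LSDB's implementation: the `head` function below, the `make_partition` helper, and the `bright_only` pipeline are hypothetical names. Partitions are modeled as callables that load their data on demand, and the loop stops as soon as enough rows have been collected.

```python
from itertools import islice


def head(partitions, pipeline, n=5):
    """Collect the first n pipeline results, loading partitions one at a time."""
    rows = []
    for load_partition in partitions:
        # Run the pipeline on this partition only, taking just the rows still needed.
        rows.extend(islice(pipeline(load_partition()), n - len(rows)))
        if len(rows) >= n:
            break  # enough rows: later partitions are never loaded
    return rows


loaded = []  # records which partitions were actually read


def make_partition(index, values):
    # Hypothetical helper: returns a callable that "loads" the partition on demand.
    def load():
        loaded.append(index)
        return values

    return load


partitions = [
    make_partition(0, [21.0, 8.5, 19.3]),
    make_partition(1, [7.2, 9.9, 30.1]),
    make_partition(2, [5.0, 6.0]),
]


def bright_only(values):
    # Example pipeline: keep values below 10.
    return (v for v in values if v < 10)


preview = head(partitions, bright_only, n=3)
print(preview)
print(loaded)  # the third partition was never read
```

Because the first partition yields only one matching row, the sketch moves on to the second partition and then stops. A heavily filtered pipeline may therefore need to touch several partitions before it finds enough rows.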

### Making a Dask client

LSDB is built on top of the [Dask](https://www.dask.org) framework, which allows the pipelines to be executed on distributed workers. Before we do anything that executes the pipeline, such as `head()` or `compute()`, we recommend creating a Dask client.

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, memory_limit="auto")
client

In [None]:
gaia.head()

By default, the first 5 rows of data are shown, but we can request more if needed.

In [None]:
gaia.head(n=10)

### Closing the Dask client

In [None]:
client.close()

## About

**Authors**: Sean McGuire

**Last updated on**: June 27, 2025

If you use `lsdb` for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).