# Intro to Using Rubin Data

## Learning Objectives

In this tutorial, you will learn:

  * The basics of column discovery
  * How to work with an individual lightcurve
  * Common filtering operations, shown by example
  * How to calculate basic aggregation statistics

## Introduction

This tutorial showcases a handful of basic LSDB operations that are useful when working with Rubin (DP1) data. These operations are likely to be needed regardless of science case, and the examples here are meant to be easy to adapt. For example, while we filter by photometric band in one of the examples below, that filter can easily be modified to select on a quality flag in the data instead.

## Loading Data

The details of loading Rubin data are discussed in [How to Access Data](TODO:linktorubin), so we'll just provide a starter codeblock below:

In [None]:
from upath import UPath
import lsdb
from lsdb import ConeSearch

# This will eventually work
# base_path = UPath("/rubin/lincc_lsb_data")
# object_collection = lsdb.read_hats(base_path / "object_collection_lite")

# In the meantime
# Cone search on ECDFS (Extended Chandra Deep Field South)
object_collection = lsdb.read_hats(
    "/sdf/data/rubin/shared/lsdb_commissioning/hats/v29_0_0/dia_object_collection",
    search_filter=ConeSearch(ra=52.838, dec=-28.279, radius_arcsec=5000),
)
object_collection

As mentioned beneath the catalog dataframe, the view above is a "lazy" view of the data. Often, it's nice to preview the first few rows to better understand the contents of the dataset:

In [None]:
object_collection.head(5)

### Viewing Available Columns

The [schema browser](https://sdm-schemas.lsst.io/dp1.html) provides the most complete information on available DP1 columns, but the LSDB API also offers a handful of properties for quick column discovery. First, `all_columns` lists **all** columns available in the HATS catalog, even if only a subset was selected on load:

In [None]:
object_collection.all_columns

Any nested columns have their own sets of sub-columns as well. We can first identify the nested columns programmatically using the `nested_columns` property:

In [None]:
object_collection.nested_columns

To view the available sub-columns, we use the `nest` accessor for one of the nested columns:

In [None]:
object_collection["diaObjectForcedSource"].nest.fields
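 
As a mental model for what a nested column holds, the sketch below uses plain pandas on a hypothetical flat source table (toy values, not DP1 data). It is not how LSDB stores the data internally; it only illustrates that each object row carries its own small table of observations, and that sub-columns are addressed with dotted paths such as "diaObjectForcedSource.psfMag".

```python
import pandas as pd

# Hypothetical flat source table: one row per observation (toy values).
sources = pd.DataFrame(
    {
        "diaObjectId": [1, 1, 1, 2],
        "band": ["g", "r", "g", "i"],
        "midpointMjdTai": [60623.1, 60623.2, 60630.4, 60625.0],
        "psfMag": [21.3, 20.9, 21.1, 22.4],
    }
)

# Grouping by object ID gives one small table per object,
# which is the role a nested column plays in the catalog.
lightcurves = {
    object_id: group.drop(columns="diaObjectId").reset_index(drop=True)
    for object_id, group in sources.groupby("diaObjectId")
}

# Sub-column names of one object's table, and their dotted paths.
sub_columns = list(lightcurves[1].columns)
dotted_paths = [f"diaObjectForcedSource.{name}" for name in sub_columns]
print(sub_columns)
print(dotted_paths)
```

The dotted paths are the same form used later in this tutorial when querying or reducing on a sub-column.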

## Viewing a Single Lightcurve

Selecting a single lightcurve is most effectively done via the `id_search` function. In this case, we have a particular "diaObjectId" in mind:

In [None]:
objectid = 609782208097419314
single_id = object_collection.id_search(values={"diaObjectId": objectid}).compute()
single_id

In [None]:
from matplotlib.patches import Patch
import matplotlib.pyplot as plt
import numpy as np

first_lc = single_id.diaObjectForcedSource.iloc[0]

# Compute symmetric y-limits around 0 from the 97.5th percentile of |flux|, plus padding
flux = first_lc["psfDiffFlux"].dropna()
limit = np.percentile(np.abs(flux), 97.5) + 100
y_min, y_max = -limit, limit

# Start plot
fig, ax = plt.subplots(2, 1, figsize=(10, 10), dpi=200)

# Define band → color mapping
band_colors = {"u": "blue", "g": "green", "r": "red", "i": "orange", "z": "purple", "y": "brown"}

# Plot each band with its color
for band, color in band_colors.items():
    band_data = first_lc[first_lc["band"] == band]
    if band_data.empty:
        continue
    ax[0].errorbar(
        band_data["midpointMjdTai"],
        band_data["psfDiffFlux"],
        yerr=band_data["psfDiffFluxErr"],
        fmt="o",
        color=color,
        ecolor=color,
        elinewidth=2,
        capsize=2,
        alpha=0.8,
        markeredgecolor="k",
        label=band,
    )

    ax[1].errorbar(
        band_data["midpointMjdTai"],
        band_data["psfMag"],
        yerr=band_data["psfMagErr"],
        fmt="o",
        color=color,
        ecolor=color,
        elinewidth=2,
        capsize=2,
        alpha=0.8,
        markeredgecolor="k",
        label=band,
    )

fig.suptitle(
    f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}'
)

# The flux panel keeps the default orientation (set_ylim below fixes it); only the magnitude panel is inverted
ax[0].set_xlabel("MJD (midpointMjdTai)")
ax[0].set_ylabel("psfDiffFlux")
# ax[0].set_title(f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}', fontsize=12)
ax[0].set_ylim(y_min, y_max)
ax[0].set_xlim(60622, 60658)
ax[0].grid(True)
ax[0].legend(title="Band", loc="best")

ax[1].invert_yaxis()
ax[1].set_xlabel("MJD (midpointMjdTai)")
ax[1].set_ylabel("psfMag")
# ax[1].set_title(f'Object ID: {single_id["diaObjectId"].values[0]} RA: {single_id["ra"].values[0]:.5f}, Dec: {single_id["dec"].values[0]:.5f}', fontsize=12)
# ax[1].set_ylim(y_min, y_max)
ax[1].set_xlim(60622, 60658)
ax[1].grid(True)
ax[1].legend(title="Band", loc="best")
plt.tight_layout()
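
The y-limit calculation in the plotting cell can be pulled into a small helper. The sketch below is a minimal version under assumed defaults: the function name `symmetric_limits` and its parameters are hypothetical, and the toy flux array is made up. It shows why a percentile of the absolute flux is used rather than the maximum: a single extreme point does not blow up the axis range.

```python
import numpy as np


def symmetric_limits(values, percentile=97.5, padding=100.0):
    """Return (y_min, y_max) centered on zero, ignoring non-finite values."""
    values = np.asarray(values, dtype=float)
    finite = values[np.isfinite(values)]
    limit = np.percentile(np.abs(finite), percentile) + padding
    return -limit, limit


# Toy fluxes (hypothetical): 199 well-behaved points, one extreme outlier, one NaN
toy_flux = np.concatenate([np.linspace(-100.0, 100.0, 199), [1.0e6, np.nan]])
y_min, y_max = symmetric_limits(toy_flux)
print(y_min, y_max)
```

With only a handful of points the percentile interpolates toward the largest value, so the outlier protection is most effective for longer lightcurves.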

## Common Filtering Operations

### Filtering by Number of Sources

Provided the Source table(s) haven't been modified by any filtering operations, the "nDiaSources" column is directly provided and allows for easy filtering based on lightcurve length.

In [None]:
oc_long_lcs = object_collection.query("nDiaSources > 10")
oc_long_lcs.head(5)
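
The query string above follows the expression style of pandas' `DataFrame.query`. The sketch below uses a plain pandas DataFrame with made-up values (not DP1 data) to show how a threshold can be parameterized and how conditions can be combined. It assumes the same expressions carry over to the LSDB catalog, which is worth checking against the LSDB documentation.

```python
import pandas as pd

# Hypothetical toy object table (values are made up for illustration)
toy_objects = pd.DataFrame(
    {
        "diaObjectId": [101, 102, 103, 104],
        "nDiaSources": [3, 25, 11, 10],
    }
)

# Build the expression from a variable so the threshold is easy to change
min_sources = 10
long_lcs = toy_objects.query(f"nDiaSources > {min_sources}")

# Conditions can be combined with `and` / `or`
mid_length = toy_objects.query("nDiaSources > 10 and nDiaSources < 20")

print(long_lcs["diaObjectId"].tolist())
print(mid_length["diaObjectId"].tolist())
```

Note that the comparison is strict, so the object with exactly 10 sources is excluded.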

### Filtering by Photometric Band

Another common operation is to filter by band, which can be done similarly to the above, but using a sub-column query:

In [None]:
oc_long_lcs_g = oc_long_lcs.query("diaObjectForcedSource.band == 'g'")
oc_long_lcs_g.head(5)

> **Note**: Filtering operations on "diaObjectForcedSource" are not propagated to "diaSource". Any filtering operations on "diaSource" should be applied in addition to any operations done on "diaObjectForcedSource".

### Filtering Empty Lightcurves

Sometimes, filters on lightcurves throw out all observations for certain objects, leaving empty lightcurves, as seen for one of the objects above. In this case, we can remove objects with empty lightcurves using `dropna`:

In [None]:
oc_long_lcs_g = oc_long_lcs_g.dropna(subset="diaObjectForcedSource")
oc_long_lcs_g.head(5)
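
The two steps above (filter within each lightcurve, then drop objects left with nothing) can be illustrated with plain pandas. The sketch below uses a hypothetical dictionary of per-object tables with toy values. It is a conceptual stand-in, not the LSDB implementation.

```python
import pandas as pd

# Hypothetical per-object lightcurves (toy values); object 2 has no g-band points
lightcurves = {
    1: pd.DataFrame({"band": ["g", "r", "g"], "psfMag": [21.3, 20.9, 21.1]}),
    2: pd.DataFrame({"band": ["i", "r"], "psfMag": [22.4, 22.0]}),
}

# Step 1: keep only g-band observations within each lightcurve
g_only = {obj_id: lc[lc["band"] == "g"] for obj_id, lc in lightcurves.items()}

# Step 2: drop objects whose lightcurve is now empty
non_empty = {obj_id: lc for obj_id, lc in g_only.items() if not lc.empty}

print({obj_id: len(lc) for obj_id, lc in g_only.items()})
print(sorted(non_empty))
```

After step 1 the object still exists but has zero observations, which is why the second step is needed.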

## Calculating Basic Statistics

While Rubin DP1 data has many statistics pre-computed in object table columns, custom computation of statistics remains broadly useful.

Simple aggregations can be applied via the `reduce` function. Below, we define a very simple mean-magnitude function and pass it to `reduce`, selecting the "psfMag" sub-column of "diaObjectForcedSource" so that the mean is computed for each object.

In [None]:
import numpy as np


def mean_mag(mag):
    return {"mean_psfMag": np.mean(mag)}


# meta defines the expected structure of the result
# append_columns adds the result as a column to the original catalog
oc_mean_mags_g = oc_long_lcs_g.reduce(
    mean_mag, "diaObjectForcedSource.psfMag", meta={"mean_psfMag": np.float64}, append_columns=True
)
oc_mean_mags_g.head(10)[["mean_psfMag"]]
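
The same pattern extends to functions of more than one sub-column. As a minimal sketch, the hypothetical function below (`weighted_mean_mag` and the output name `wmean_psfMag` are illustrative names, not part of LSDB) computes an inverse-variance weighted mean and skips non-finite points, which a plain `np.mean` would turn into NaN. It is exercised here on toy arrays only.

```python
import numpy as np


def weighted_mean_mag(mag, mag_err):
    """Inverse-variance weighted mean magnitude, skipping non-finite points."""
    mag = np.asarray(mag, dtype=float)
    mag_err = np.asarray(mag_err, dtype=float)
    good = np.isfinite(mag) & np.isfinite(mag_err) & (mag_err > 0)
    if not good.any():
        # No usable observations for this object
        return {"wmean_psfMag": np.nan}
    weights = 1.0 / mag_err[good] ** 2
    return {"wmean_psfMag": float(np.sum(weights * mag[good]) / np.sum(weights))}


# Toy values (hypothetical): the NaN magnitude is ignored
toy = weighted_mean_mag([20.0, 22.0, np.nan], [0.1, 0.2, 0.1])
empty = weighted_mean_mag([np.nan], [0.1])
print(toy, empty)
```

Assuming `reduce` passes each listed sub-column as a positional argument, this function could be supplied in place of `mean_mag` with "diaObjectForcedSource.psfMag" and "diaObjectForcedSource.psfMagErr" as the selected sub-columns and `meta={"wmean_psfMag": np.float64}`.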

## About
**Author(s):** Doug Branton

**Last updated on:** 26 June 2025

If you use lsdb for published research, please cite it following these [instructions](https://docs.lsdb.io/en/stable/citation.html).