# PPDB in HATS: demo and validation

In this notebook we will:

1. Import the pre-existing HATS catalog of PPDB.
2. Increment the catalog with data for 5 days.
3. Perform the full weekly reimport.
4. Validate that the incremented and reimported catalogs contain the same objects and sources.
5. Compare access speed between the incremented catalog and the full reimport.

In [None]:
import lsdb
import nested_pandas as npd
import pandas as pd

from datetime import date
from glob import glob
from pathlib import Path

PPDB_HATS_DIR = Path("/sdf/data/rubin/shared/lsdb_commissioning/ppdb")
PPDB_LSST_DIR = Path("/sdf/scratch/rubin/ppdb/data/ppdb_lsstcam")

In [None]:
%ls {PPDB_LSST_DIR}/2026/01/

In [None]:
def _find_files_per_date(date_path):
    # `date_path` is a "YYYY/MM/DD" string; the name avoids shadowing `datetime.date`
    obj_files = glob(f"{PPDB_LSST_DIR}/{date_path}/**/DiaObject*.parquet")
    src_files = glob(f"{PPDB_LSST_DIR}/{date_path}/**/DiaSource*.parquet")
    fsrc_files = glob(f"{PPDB_LSST_DIR}/{date_path}/**/DiaForcedSource*.parquet")
    print(f"Number of DiaObject files: {len(obj_files)}")
    print(f"Number of DiaSource files: {len(src_files)}")
    print(f"Number of DiaForcedSource files: {len(fsrc_files)}")
    return obj_files, src_files, fsrc_files
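
One detail worth knowing about the glob patterns above: in Python's `glob`, `**` only matches across nested directory levels when `recursive=True` is passed; otherwise it behaves like a single `*`. The sketch below illustrates the difference on a throwaway directory tree. The directory and file names are made up for illustration and do not reflect the real PPDB layout.

```python
import tempfile
from glob import glob
from pathlib import Path


def count_matches(root, recursive):
    """Count DiaObject parquet files under `root` using a `**` pattern."""
    pattern = f"{root}/**/DiaObject*.parquet"
    return len(glob(pattern, recursive=recursive))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one file one level down, one file two levels down.
    (root / "a").mkdir()
    (root / "a" / "DiaObject_1.parquet").touch()
    (root / "b" / "c").mkdir(parents=True)
    (root / "b" / "c" / "DiaObject_2.parquet").touch()

    shallow_count = count_matches(root, recursive=False)
    deep_count = count_matches(root, recursive=True)

print(f"Without recursive=True: {shallow_count} file(s)")
print(f"With recursive=True: {deep_count} file(s)")
```

If the daily directories always hold their parquet files exactly one level down, the non-recursive form is sufficient.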

### Import pre-existing catalog

The pre-existing catalog has data from Sept 2025 to Jan 20, 2026.

In [None]:
collection_dir = PPDB_HATS_DIR / "dia_object_collection"
collection = lsdb.open_catalog(collection_dir)
collection

In [None]:
collection.plot_coverage()

All the input parquet paths are stored under `input_paths`:

In [None]:
!tail -10 {collection_dir}/input_paths/dia_object.txt

### Daily increment

We will increment with data from a couple of days (Jan 24 and Jan 25), one at a time.

In [None]:
obj_files, src_files, fsrc_files = _find_files_per_date("2026/01/24")

In [None]:
from ppdb_hats.config import get_default_config

config = get_default_config(until_date=date(2026, 1, 24))
config

In [None]:
from ppdb_hats import DailyPipeline

DailyPipeline(config=config).execute()

#### Validation of first increment

Let's validate that the data looks consistent:

In [None]:
!tail -10 {collection_dir}/input_paths/dia_object.txt

In [None]:
increment = lsdb.open_catalog(collection_dir)
print(f"The collection has {len(increment):,} objects.")
print(f"The increment added {len(increment) - len(collection):,} new objects.")

The count above is the number of objects after de-duplication. As a check, every object ID in the new input files should be present in the incremented catalog:

In [None]:
new_objects = npd.read_parquet(obj_files)
unique_ids = new_objects["diaObjectId"].unique()
increment_ids = increment["diaObjectId"].compute()
assert set(unique_ids).issubset(set(increment_ids))
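
As a rough illustration of what de-duplication means here, the sketch below keeps a single row per `diaObjectId`, choosing the one with the largest `validityStartMjdTai`. This mirrors how the notebook picks the latest version of an object from the input files, but it is an assumption about the pipeline's behaviour, not its actual implementation. The toy rows and the `latest_per_object` helper are hypothetical.

```python
def latest_per_object(rows):
    """Keep, for each diaObjectId, the row with the latest validity start."""
    latest = {}
    for row in rows:
        obj_id = row["diaObjectId"]
        current = latest.get(obj_id)
        if current is None or row["validityStartMjdTai"] > current["validityStartMjdTai"]:
            latest[obj_id] = row
    return list(latest.values())


# Toy input: object 1 appears twice with different validity start times.
rows = [
    {"diaObjectId": 1, "validityStartMjdTai": 61064.1, "nDiaSources": 3},
    {"diaObjectId": 2, "validityStartMjdTai": 61064.2, "nDiaSources": 1},
    {"diaObjectId": 1, "validityStartMjdTai": 61064.3, "nDiaSources": 4},
]

deduplicated = latest_per_object(rows)
dedup_ids = {row["diaObjectId"] for row in deduplicated}

# Same subset-style check as in the notebook: every input ID survives.
assert {row["diaObjectId"] for row in rows} <= dedup_ids
print(f"{len(rows)} input rows -> {len(deduplicated)} unique objects")
```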

Looking at a specific object:

In [None]:
increment.query(f"diaObjectId == {unique_ids[1]}").compute()

In [None]:
new_objects.query(f"diaObjectId == {unique_ids[1]}").sort_values("validityStartMjdTai").tail(1)

Making sure this object's new sources were appended correctly:

In [None]:
new_sources = npd.read_parquet(src_files)
obj_new_sources = new_sources.query(f"diaObjectId == {unique_ids[1]}")
obj_new_sources

In [None]:
increment_obj = increment.query(f"diaObjectId == {unique_ids[1]}")
obj_sources = increment_obj["diaSource"].compute().explode()
# Look up each new source ID among the sources nested under this object
pd.concat(
    [obj_sources.query(f"diaSourceId == {source_id}") for source_id in obj_new_sources["diaSourceId"]]
)

We keep track of the parquet paths that have already been imported. On each increment, those files are skipped.
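
A minimal sketch of this bookkeeping, assuming the tracked paths live in a plain text file with one path per line (as the `input_paths/dia_object.txt` file suggests). The helper names `select_new_paths` and `record_imported` are hypothetical and are not part of `ppdb_hats`.

```python
import tempfile
from pathlib import Path


def select_new_paths(candidates, tracking_file):
    """Return the candidate paths that are not yet listed in the tracking file."""
    tracking_file = Path(tracking_file)
    seen = set()
    if tracking_file.exists():
        seen = set(tracking_file.read_text().splitlines())
    return [path for path in candidates if path not in seen]


def record_imported(paths, tracking_file):
    """Append newly imported paths to the tracking file, one per line."""
    with open(tracking_file, "a") as handle:
        for path in paths:
            handle.write(f"{path}\n")


with tempfile.TemporaryDirectory() as tmp:
    tracking = Path(tmp) / "dia_object.txt"

    # First increment: nothing has been imported yet, so both files are new.
    day_one = ["2026/01/24/a/DiaObject.parquet", "2026/01/24/b/DiaObject.parquet"]
    first_batch = select_new_paths(day_one, tracking)
    record_imported(first_batch, tracking)

    # Second increment: the earlier files show up again and are skipped.
    day_two = day_one + ["2026/01/25/a/DiaObject.parquet"]
    second_batch = select_new_paths(day_two, tracking)
    record_imported(second_batch, tracking)

print(second_batch)
```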

In [None]:
obj_files, src_files, fsrc_files = _find_files_per_date("2026/01/25")

In [None]:
config = get_default_config(until_date=date(2026, 1, 25))
DailyPipeline(config=config).execute()

In [None]:
!tail -10 {collection_dir}/input_paths/dia_object.txt

In [None]:
increment2 = lsdb.open_catalog(collection_dir)
print(f"The collection has {len(increment2):,} objects.")
print(f"The increment added {len(increment2) - len(increment):,} new objects.")

In [None]:
new_objects = npd.read_parquet(obj_files)
unique_ids = new_objects["diaObjectId"].unique()
increment2_ids = increment2["diaObjectId"].compute()
assert set(unique_ids).issubset(set(increment2_ids))

Let's check the object we were working on previously:

In [None]:
# Grab the most recent object data from the input files
# (ascending sort, so the last row has the latest validity start)
new_objects.query("diaObjectId == 169747015140900902").sort_values(
    "validityStartMjdTai"
).tail(1)

In [None]:
# It's present in the updated HATS collection:
increment2.query("diaObjectId == 169747015140900902").compute()

Each import now writes its own parquet file, so we can query the updates for a specific date (even with other parquet readers):

In [None]:
%ls {collection_dir}/dia_object_lc/dataset/Norder=2/Dir=0/Npix=106
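
To make the directory layout above a little more concrete, here is a small sketch that parses a HATS-style partition path. It assumes two conventions: a HEALPix order `k` has `12 * 4**k` pixels, and the `Dir` value is the pixel number rounded down to a multiple of 10,000. The `parse_partition` helper is hypothetical and not part of `hats` or `lsdb`.

```python
import re

PARTITION_RE = re.compile(r"Norder=(\d+)/Dir=(\d+)/Npix=(\d+)")


def parse_partition(path):
    """Extract (order, directory, pixel) from a HATS-style partition path."""
    match = PARTITION_RE.search(str(path))
    if match is None:
        raise ValueError(f"No Norder/Dir/Npix components found in {path!r}")
    order, directory, pixel = (int(group) for group in match.groups())

    # A HEALPix map at order k has 12 * 4**k pixels.
    n_pixels = 12 * 4**order
    if pixel >= n_pixels:
        raise ValueError(f"Npix={pixel} is out of range for Norder={order}")

    # Assumed convention: Dir groups pixels in blocks of 10,000.
    if directory != (pixel // 10_000) * 10_000:
        raise ValueError(f"Dir={directory} does not match Npix={pixel}")
    return order, directory, pixel


order, directory, pixel = parse_partition("dia_object_lc/dataset/Norder=2/Dir=0/Npix=106")
print(f"Order {order}, pixel {pixel} of {12 * 4**order}")
```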

### Weekly reimport

### Validation

In [None]:
# Assert that the reimported catalog equals to the