# Understanding the NestedFrame

## Learning Objectives

By the end of this tutorial, you will:

* Learn the general capabilities of the NestedFrame, the DataFrame-like API that LSDB is built on.
* Understand common methods to package data as nested columns.
* Understand common approaches to working with nested columns.

## Introduction

The LSDB catalog is built on top of the pandas/dask ecosystem, where much of the look and feel of a pandas DataFrame is present in catalog operations. Pandas DataFrames are fairly ubiquitous, both in astronomy-specific applications and the broader scientific python community. If you haven't encountered them before, you can read about them in the [pandas documentation](https://pandas.pydata.org/docs/user_guide/10min.html). 

However, for complex and interconnected astronomical datasets, the pandas DataFrame has several limitations. The [nested-pandas](https://nested-pandas.readthedocs.io/en/latest/) package was developed with these specific limitations in mind, and provides a performant extension class, the NestedFrame. The main idea of the NestedFrame is to nest DataFrames within DataFrames. For example, in time-domain astronomy, the photometry table corresponding to an object table can be nested directly within a column of the object table.

The core advantages are:
* **No Duplication of Data**: In native pandas one might be tempted to simply join the tables, but this will duplicate all columns from the object table. With many object table columns, this can severely increase the memory footprint of the full dataset.
* **Avoiding Grouping Operations**: Nested datasets are pre-grouped by the top-level index, avoiding the grouping operations required when these datasets are stored as flat tables (for example, when applying an operation to each lightcurve).
* **Interacting with Nested Datasets as Columns**: Because the DataFrames are directly nested, DataFrame operations apply to the nested data intuitively. For example, filtering on object IDs also removes the nested photometry of the discarded objects, just like any other data in the row.
* **Pyarrow-backed storage and operation**: DataFrames within DataFrames is the core idea, but not the actual implementation. Instead, data is stored internally in pyarrow data structures, allowing performant storage of and operations on nested datasets. You can read more about pyarrow in its [documentation](https://arrow.apache.org/docs/python/index.html).
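
To make the first two advantages concrete, here is a minimal plain-pandas sketch of the flat-table workflow they describe. It does not use nested-pandas at all, and the `objects` and `photometry` tables, their column names, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables: 2 objects and 5 photometry points in total
objects = pd.DataFrame(
    {"ra": [10.0, 20.0], "dec": [-5.0, 5.0]},
    index=pd.Index([0, 1], name="obj_id"),
)
photometry = pd.DataFrame(
    {
        "obj_id": [0, 0, 0, 1, 1],
        "flux": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)

# Flat approach: join the object columns onto every photometry row
flat = photometry.join(objects, on="obj_id")

# Each object's "ra" and "dec" are now stored once per observation
n_object_values = objects.size                # 2 objects x 2 columns
n_repeated_values = flat[["ra", "dec"]].size  # 5 rows x 2 columns

# Any per-object statistic needs an explicit grouping step
mean_flux = flat.groupby("obj_id")["flux"].mean()

print(n_object_values, n_repeated_values)
print(mean_flux)
```

With a NestedFrame, the photometry would instead live in a nested column of the object table, so the object columns are stored once and the per-object grouping is already in place.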

> **NOTE**: *In this tutorial, the terms "NestedFrame" and "DataFrame" will at times be used interchangeably. In general, these objects will always be NestedFrames but "DataFrame" is often a nice term to describe a single container of data.*

## 1. The Nested Data Representation

We can begin by walking through the basics of the NestedFrame representation. Below, we generate a toy NestedFrame which has 5 base columns ("ra", "dec", "id", "a", "b") and 1 nested column ("nested", renamed to "lightcurve" in the code below). In a notebook, the nested column is represented as a small inner DataFrame per row, with information on the available sub-columns (sometimes referred to as fields), a preview of the first row, and the number of additional rows.

In [100]:
from lsdb.nested import generate_data

# Generate toy data
nf = generate_data(5, 10, seed=1).compute()
# Do some reformatting for a nice output
nf = nf[["id", "ra", "dec", "a", "b", "nested"]].rename({"nested": "lightcurve"}, axis=1)
nf = nf.sort_values("id")
nf

Unnamed: 0,id,ra,dec,a,b,lightcurve
1,2,342.166931,40.950381,0.720324,0.37252,"{t: 13.70439, flux: 41.405599, band: g, flux_err: 2.07028}, +9 rows"
4,22,112.259323,-70.888248,0.146756,1.077633,"{t: 0.547752, flux: 4.995346, band: r, flux_err: 0.249767}, +9 rows"
0,24,184.255785,-8.820946,0.417022,0.184677,"{t: 8.38389, flux: 10.233443, band: g, flux_err: 0.511672}, +9 rows"
2,36,51.897461,-10.463070,0.000114,0.691121,"{t: 4.089045, flux: 69.440016, band: g, flux_err: 3.472001}, +9 rows"
3,...,...,...,...,...,"{t: 17.562349, flux: 41.417927, band: g, flux_err: 2.070896}, +9 rows"


These sub-DataFrames act as any other value stored in a DataFrame and can be accessed through normal pandas indexing:

In [102]:
nf["lightcurve"][0]  # look at the "lightcurve" dataframe for the row with index label 0
# or, equivalently
# nf.loc[0]["lightcurve"]

Unnamed: 0,t,flux,band,flux_err
0,8.38389,10.233443,g,0.511672
1,13.40935,53.589641,g,2.679482
2,16.014891,90.340192,g,4.51701
3,17.892133,16.53542,g,0.826771
4,1.966937,88.330609,g,4.41653
5,6.310313,89.588622,r,4.479431
6,19.777222,11.474597,g,0.57373
7,8.957871,23.702698,g,1.185135
8,0.387339,32.66449,g,1.633225
9,1.067251,62.336012,r,3.116801


When represented lazily, nested columns are identified by their custom dtype, which also provides dtype information about the sub-columns:

In [107]:
generate_data(5, 10).rename(columns={"nested": "lightcurve"})

npartitions=1,ra,dec,id,a,b,lightcurve
0,float64,float64,int64,float64,float64,"nested<t: [double], flux: [double], band: [string], flux_err: [double]>"
4,...,...,...,...,...,...


Beyond just visual identification, there are programmatic ways to determine nested columns:

In [108]:
nf.nested_columns

['lightcurve']

## 2. What We Can Do With Nested Columns

NestedFrames are pandas DataFrames, but expanded to work directly with nested columns. This is done in a few ways:
* Introduces some new access dynamics for nested columns.
* Adds several new functions specific to working with nested columns.
* Modifies some existing DataFrame functions to enable support for nested columns.


### Sub-Column Access

The first to touch on are the access dynamics, through nested sub-column access:

In [109]:
# flux is a sub-column of the "lightcurve" column
nf["lightcurve.flux"]

1    41.405599
1    66.379465
       ...    
3    35.726976
3    69.089692
Name: flux, Length: 50, dtype: double[pyarrow]

Using "[column].[sub-column]" allows access to the sub-column data. In the case above, it pulls out a flat pandas Series of all flux values from all nested DataFrames, with each value indexed by its parent row. This sub-column access pattern is also available in many pandas/LSDB functions and augments their behavior. For example, `query`:

In [110]:
nf.query("lightcurve.band == 'g'")

Unnamed: 0,id,ra,dec,a,b,lightcurve
1,2,342.166931,40.950381,0.720324,0.37252,"{t: 13.70439, flux: 41.405599, band: g, flux_err: 2.07028}, +4 rows"
4,22,112.259323,-70.888248,0.146756,1.077633,"{t: 17.56285, flux: 72.599799, band: g, flux_err: 3.62999}, +5 rows"
0,24,184.255785,-8.820946,0.417022,0.184677,"{t: 8.38389, flux: 10.233443, band: g, flux_err: 0.511672}, +7 rows"
2,36,51.897461,-10.463070,0.000114,0.691121,"{t: 4.089045, flux: 69.440016, band: g, flux_err: 3.472001}, +4 rows"
3,...,...,...,...,...,"{t: 17.562349, flux: 41.417927, band: g, flux_err: 2.070896}, +5 rows"


By using "lightcurve.band" in the query string, we signal that the query should be applied to the nested DataFrames, in this case filtering them by the "band" sub-column. Some operations may leave some of the sub-DataFrames empty. Below, we apply a harsh filter on flux:

In [111]:
nf_highflux = nf.query("lightcurve.flux > 95.")
nf_highflux

Unnamed: 0,id,ra,dec,a,b,lightcurve
1,2,342.166931,40.950381,0.720324,0.37252,
4,22,112.259323,-70.888248,0.146756,1.077633,"{t: 13.995167, flux: 99.732285, band: g, flux_err: 4.986614}, +0 rows"
0,24,184.255785,-8.820946,0.417022,0.184677,
2,36,51.897461,-10.463070,0.000114,0.691121,"{t: 16.692513, flux: 96.484005, band: g, flux_err: 4.8242}, +0 rows"


Empty DataFrames are represented as None values. To reject rows that have empty DataFrames, we could do something like this:

In [112]:
nf_highflux.dropna(subset="lightcurve")

Unnamed: 0,id,ra,dec,a,b,lightcurve
4,22,112.259323,-70.888248,0.146756,1.077633,"{t: 13.995167, flux: 99.732285, band: g, flux_err: 4.986614}, +0 rows"
2,36,51.897461,-10.463070,0.000114,0.691121,"{t: 16.692513, flux: 96.484005, band: g, flux_err: 4.8242}, +0 rows"


For the most part, the pandas functions that nested-pandas modifies accept sub-column names, so that the function is applied to the inner DataFrames. Not every function has been modified, and you can refer to [this page](https://nested-pandas.readthedocs.io/en/latest/reference/nestedframe.html#extended-pandas-dataframe-interface) to see the list of augmented functions.
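
To make the semantics of the nested `query` and `dropna` steps above more tangible, here is a rough flat-table analogue in plain pandas. It is only a sketch of the behavior described in this section, not how nested-pandas implements it, and the `objects` and `photometry` tables and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical flat tables standing in for the base and nested data
objects = pd.DataFrame({"id": [2, 22, 24]}, index=[1, 4, 0])
photometry = pd.DataFrame(
    {"flux": [41.4, 66.4, 99.7, 5.0, 10.2, 53.6]},
    index=[1, 1, 4, 4, 0, 0],  # the index ties each point to its object
)

# Analogue of nf.query("lightcurve.flux > 95."):
# only the photometry rows are filtered, every object row is kept
high_flux = photometry[photometry["flux"] > 95.0]

# Analogue of dropna(subset="lightcurve"):
# drop the objects that have no photometry rows left
kept_objects = objects[objects.index.isin(high_flux.index)]

print(high_flux)
print(kept_objects)
```

In the flat version the two tables have to be kept in sync by hand, whereas the nested query and `dropna` operate on a single NestedFrame.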

### Using Nested Columns in Analysis

Another important feature of the NestedFrame API is `reduce`, which allows columns and sub-columns to be used in external analysis functions:

In [113]:
import numpy as np

# Calculate the mean of the "flux" sub-column per row of the NestedFrame
mean_flux = nf.reduce(np.mean, "lightcurve.flux")
mean_flux

Unnamed: 0,0
1,55.903454
4,55.829467
0,47.879572
2,62.510849
3,55.587439


A common use case is to use `reduce` to calculate object-level metrics, as done above, which can then be added back to the NestedFrame:

In [124]:
nf.join(mean_flux).rename(columns={0: "mean_flux"})

Unnamed: 0,id,ra,dec,a,b,lightcurve,mean_flux
1,2,342.166931,40.950381,0.720324,0.37252,"{t: 13.70439, flux: 41.405599, band: g, flux_err: 2.07028}, +9 rows",55.903454
4,22,112.259323,-70.888248,0.146756,1.077633,"{t: 0.547752, flux: 4.995346, band: r, flux_err: 0.249767}, +9 rows",55.829467
0,24,184.255785,-8.820946,0.417022,0.184677,"{t: 8.38389, flux: 10.233443, band: g, flux_err: 0.511672}, +9 rows",47.879572
2,36,51.897461,-10.463070,0.000114,0.691121,"{t: 4.089045, flux: 69.440016, band: g, flux_err: 3.472001}, +9 rows",62.510849
3,...,...,...,...,...,"{t: 17.562349, flux: 41.417927, band: g, flux_err: 2.070896}, +9 rows",55.587439


We use sub-column access to signal to `reduce` that we are applying the mean to the "flux" sub-column of the "lightcurve" column. Let's look at a more complex example:

In [114]:
def flux_ranges(flux, band, modifier, single_band=False):
    # Calculate the flux range in a single band with a modifier
    flux = flux + modifier

    # filter to just a single band
    if single_band:
        mask = band == single_band
        flux = flux[mask]
    return {"min_flux": flux.min(), "max_flux": flux.max()}


# We use the "a" column as a modifier
nf.reduce(flux_ranges, "lightcurve.flux", "lightcurve.band", "a", single_band="g")

Unnamed: 0,min_flux,max_flux
1,2.302449,95.66925
4,27.139545,99.879041
0,10.650465,90.757214
2,13.927749,96.484119
3,41.72026,94.761808


`flux_ranges` is still a relatively simple function, but exemplifies some of the important aspects of `reduce`:

1. We were able to pass both nested sub-columns ("lightcurve.flux", "lightcurve.band") and a base column ("a"). In this case, "a" is used to add a flat offset to the flux array.
2. The function returns a dictionary with string keys mapped to the resulting data. This is useful because it allows column names to be generated for the output.
3. Our `flux_ranges` function has a kwarg that is not sourced from our NestedFrame. In this case, we passed it at the end of the `reduce` call using kwarg syntax (`single_band='g'`). Any arguments (not just kwargs) that are not sourced from the NestedFrame must be passed in this way.
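
The three points above can be illustrated with a small plain-pandas stand-in for the calling pattern. The helper `reduce_sketch` is hypothetical and written only for this illustration; it is not the nested-pandas implementation. The toy table uses array-valued cells in place of real nested sub-columns.

```python
import numpy as np
import pandas as pd


def reduce_sketch(df, func, *columns, **kwargs):
    """Hypothetical stand-in for the row-wise calling pattern described above."""
    results = {}
    for idx, row in df.iterrows():
        # Column values are passed positionally, in the order requested
        args = [row[col] for col in columns]
        # Extra keyword arguments are forwarded unchanged to every call
        results[idx] = func(*args, **kwargs)
    # Dictionary keys returned by func become the output column names
    return pd.DataFrame.from_dict(results, orient="index")


def flux_ranges(flux, band, modifier, single_band=None):
    flux = flux + modifier
    if single_band is not None:
        flux = flux[band == single_band]
    return {"min_flux": flux.min(), "max_flux": flux.max()}


# Toy data: array-valued cells play the role of nested sub-columns
toy = pd.DataFrame(
    {
        "flux": [np.array([1.0, 5.0, 3.0]), np.array([2.0, 8.0])],
        "band": [np.array(["g", "r", "g"]), np.array(["g", "g"])],
        "a": [0.5, 1.0],
    },
    index=[10, 11],
)

out = reduce_sketch(toy, flux_ranges, "flux", "band", "a", single_band="g")
print(out)
```

Each row produces one dictionary, and stacking those dictionaries by index gives one output row per object with `min_flux` and `max_flux` columns.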

### The `nest` Accessor

Another NestedFrame aspect to touch on is the `.nest` accessor:

In [116]:
nf.lightcurve.nest

<nested_pandas.series.accessor.NestSeriesAccessor at 0x15382a3f0>

This accessor object can be called on any nested column and has a small suite of additional functions used for viewing and transforming nested data. For example, we can programmatically view the available sub-columns through the accessor:

In [118]:
nf.lightcurve.nest.fields

['t', 'flux', 'band', 'flux_err']

We can use the accessor to grab different views of our nested data:

In [119]:
# View a flat version of the nested data
# All sub-dataframes are brought to one level
nf.lightcurve.nest.to_flat()

Unnamed: 0,t,flux,band,flux_err
1,13.70439,41.405599,g,2.07028
1,8.346096,66.379465,r,3.318973
...,...,...,...,...
3,5.310933,35.726976,r,1.786349
3,11.786111,69.089692,g,3.454485


In [120]:
# Or view the data as lists
nf.lightcurve.nest.to_lists()

Unnamed: 0,t,flux,band,flux_err
1,[13.70439001 8.34609605 19.36523151 1.700884...,[41.40559878 66.37946452 13.74747041 92.750858...,['g' 'r' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'g'],[2.07027994 3.31897323 0.68737352 4.6375429 3...
4,[ 0.54775186 3.96202978 17.52778305 17.562850...,[ 4.99534589 58.65550405 39.7676837 72.599798...,['r' 'r' 'r' 'g' 'g' 'g' 'g' 'g' 'r' 'g'],[0.24976729 2.9327752 1.98838418 3.62998993 1...
0,[ 8.38389029 13.4093502 16.01489137 17.892133...,[10.23344288 53.58964059 90.34019153 16.535419...,['g' 'g' 'g' 'g' 'g' 'r' 'g' 'g' 'g' 'r'],[0.51167214 2.67948203 4.51700958 0.82677099 4...
2,[ 4.08904499 11.17379657 6.26848356 0.781095...,[69.44001577 51.48891121 13.92763473 34.776585...,['g' 'r' 'g' 'r' 'g' 'g' 'r' 'r' 'r' 'g'],[3.47200079 2.57444556 0.69638174 1.7388293 3...
3,[17.56234873 2.80773877 13.84645231 3.396608...,[41.41792695 94.4594756 80.73912887 75.081210...,['g' 'g' 'g' 'g' 'r' 'g' 'r' 'r' 'r' 'g'],[2.07089635 4.72297378 4.03695644 3.75406052 1...
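
The relationship between the flat view and the list view can be mimicked in plain pandas, which may help when reasoning about what each view contains. This is only an analogy under assumed toy data (the `lists` table below is invented); it uses `DataFrame.explode` and a group-by rather than the `nest` accessor.

```python
import pandas as pd

# Hypothetical list-column table, similar in shape to the list view above
lists = pd.DataFrame(
    {
        "t": [[1.0, 2.0, 3.0], [4.0, 5.0]],
        "flux": [[10.0, 20.0, 30.0], [40.0, 50.0]],
    },
    index=[7, 8],
)

# Flat view: one row per list element, with the parent index repeated
flat = lists.explode(["t", "flux"]).astype(float)

# List view again: collect the rows of each index value back into lists
relisted = flat.groupby(level=0).agg(list)

print(flat)
print(relisted)
```

The repeated index in `flat` plays the same role as the repeated index in the flat view of the nested column: it records which top-level row each value belongs to.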


## 3. Turning Data into Nested Columns

In many data products, the relevant data will already be packaged as nested columns. However, it's very possible your data will be provided in other formats that you would like to convert into nested columns. Let's walk through a few common examples.


### Converting from List Columns

In [65]:
import nested_pandas as npd

nf_lists = npd.NestedFrame(
    {
        "a": [10, 20, 30],
        "b": ["red", "blue", "green"],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[11, 22, 33], [44, 55, 66], [77, 88, 99]],
    }
)
nf_lists

Unnamed: 0,a,b,c,d
0,10,red,"[1, 2, 3]","[11, 22, 33]"
1,20,blue,"[4, 5, 6]","[44, 55, 66]"
2,30,green,"[7, 8, 9]","[77, 88, 99]"


In this case, "c" and "d" are list columns. Because the lengths of "c" and "d" match for each row, they can be nested together. We can achieve this with the `nest_lists` function:

In [121]:
nf_lists.nest_lists(name="packed", columns=["c", "d"])

Unnamed: 0,a,b,packed
0,10,red,"{c: 1, d: 11}, +2 rows"
1,20,blue,"{c: 4, d: 44}, +2 rows"
2,30,green,"{c: 7, d: 77}, +2 rows"


### Converting from Flat Tables

Another case may be that your data is available as two separate tables:

In [None]:
import nested_pandas as npd

nf_base = npd.NestedFrame({"a": [10, 20, 30], "b": ["red", "blue", "green"]})
nf_flat = npd.NestedFrame(
    {"c": [1, 2, 3, 4, 5, 6, 7, 8, 9], "d": [11, 22, 33, 44, 55, 66, 77, 88, 99]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
nf_base

Unnamed: 0,a,b
0,10,red
1,20,blue
2,30,green


In [80]:
nf_flat

Unnamed: 0,c,d
0,1,11
0,2,22
0,3,33
1,4,44
1,5,55
1,6,66
2,7,77
2,8,88
2,9,99


In this case, `nf_flat` has an index that matches `nf_base`, and we can easily nest it using `add_nested`:

In [122]:
nf_nested = nf_base.add_nested(nf_flat, name="packed")
nf_nested

Unnamed: 0,a,b,packed
0,10,red,"{c: 1, d: 11}, +2 rows"
1,20,blue,"{c: 4, d: 44}, +2 rows"
2,30,green,"{c: 7, d: 77}, +2 rows"


If the indexes do not match, you can set the `on` kwarg to a column shared by the two DataFrames, which makes the operation behave more like a join.
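
Conceptually, nesting on a shared column resembles grouping the flat rows by that column and joining the result onto the base table. The plain-pandas sketch below shows that idea with list columns; it is an analogy only, since nested-pandas stores nested data in pyarrow structures rather than Python lists, and the `base` and `flat` tables and the `obj` column are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share an "obj" column instead of an index
base = pd.DataFrame({"obj": ["x", "y", "z"], "a": [10, 20, 30]})
flat = pd.DataFrame({"obj": ["x", "x", "y", "y", "y"], "c": [1, 2, 3, 4, 5]})

# Gather the flat rows of each object into one list per key value
packed = flat.groupby("obj").agg(list)

# Left-join the packed lists onto the base table via the shared column
joined = base.join(packed, on="obj")

print(joined)
```

In this sketch, object "z" has no matching flat rows and ends up with a missing value in the "c" column.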

This notebook touches on a few of the most common concepts of the NestedFrame. While this notebook works directly with the NestedFrame, operations on the LSDB catalog largely work the same way. However, minor differences do exist: certain functions may not be available in the LSDB catalog API, and arguments may differ depending on LSDB's ability to support a given operation. You're encouraged to consult the docs for a specific function whenever you aren't sure of its behavior. For more information on NestedFrames, and nested-pandas in general, consult the [nested-pandas documentation](https://nested-pandas.readthedocs.io/en/latest/).

## About

**Authors**: Doug Branton

**Last updated on**: June 18, 2025

If you use ``lsdb`` for published research, please cite it following these
[instructions](https://docs.lsdb.io/en/stable/citation.html).