# Architecture

Each data schema supported by XRADIO is organized into its own sub-package, with a shared `_utils` directory that contains code common to multiple sub-packages, as shown in [Figure 1](#figure-1). The current architecture includes the `measurement_set` and `image` sub-packages ([see the list of planned XRADIO schemas](overview.ipynb#XRADIO-Schemas)).

The user-facing API is implemented in the `.py` files located at the top level of each sub-package directory, while private functions are housed in a dedicated sub-directory, such as `_measurement_set`. This sub-directory contains folders for each supported storage backend, as well as a `_utils` folder for common functions used across backends.

For instance, in the `measurement_set` sub-package, XRADIO currently supports a `zarr`-based backend. Additionally, we offer limited support for `casacore` table Measurement Set v2 (`MS v2`) through a conversion function that converts data from Measurement Set v2 (stored in Casacore tables) to Measurement Set v4 (stored using zarr). The conversion function for MS v2 requires the optional dependency `python-casacore`, or alternatively CASA's `casatools` backend (see [casatools I/O backend](measurement_set/guides/backends.md)).

<!--Link to google drawing: https://docs.google.com/drawings/d/1afPe5oro26NMTkAKpK9iif0adNA0B4R9otLookOixvI/edit?usp=sharing -->

<div style="text-align: center;">
    <figure id="figure-1" style="display: inline-block;">
        <img src="https://docs.google.com/drawings/d/e/2PACX-1vSZvo-TDFELunK_2Oa0lryGGytcb98WfIKk0UdnIIue8Fb-GEXoqf-YFVPjehSqreOl3aZ0GyKfEapN/pub?w=921&amp;h=838"
             alt="diagram showing the XRADIO architecture: dependencies, modules, functions, etc."
             style="display: block; margin: auto;">
        <figcaption>Figure 1: XRADIO Architecture.</figcaption>
    </figure>
</div>

# Software Framework

XRADIO is built using the following core packages:

- `xarray`: Provides a framework of labelled multi-dimensional arrays for defining and implementing data schemas.
- `dask` and `distributed`: Enable parallel execution for handling large datasets efficiently.
- `zarr` (zarr specification, [v2](https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html) and [v3](https://zarr-specs.readthedocs.io/en/latest/specs.html)): Used as a storage backend for scalable, chunked and compressed n-dimensional data.
- Optionally, [python-casacore](https://github.com/casacore/python-casacore) ([Casacore Table Data System (CTDS) File Formats](https://casacore.github.io/casacore-notes/260.pdf)): Used to convert data from MS v2 to MS v4 in Zarr format, with ongoing development toward a lightweight, pure Python replacement. Alternatively, the [casatools I/O backend](measurement_set/guides/backends.md) can be used.
- Optionally, [pyasdm](https://github.com/casangi/pyasdm) (under development): A Python-based storage backend in progress, designed for accessing ASDM (Astronomy Science Data Model) data.


# Schema Design

For this section to make sense, please complete the [foundational reading](overview.ipynb#Foundational-Reading) on Xarray terminology first.

For each of the schemas, data is organized into:


- [xarray Datasets](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html): A multi-dimensional, in-memory, array database of labeled n-dimensional arrays.
- [xarray DataTrees](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html): A hierarchical representation of multiple heterogeneous datasets, used in particular to group collections of `xarray Datasets`.

`Processing Sets` are XRADIO's implementation of xarray DataTree objects: a collection of nodes, each representing a `Measurement Set` that is itself an xarray DataTree. Each `Measurement Set` groups a collection of `xarray Datasets`, including the correlated dataset (either a Spectrum or a Visibilities dataset), the antenna dataset, the field_and_source dataset, and others.

## Translating from a Table-based to an Xarray-based Schema

When creating an Xarray-based schema from a table-based schema, we use the following criteria to decide what type of Xarray structure is used:

- **Coordinates**: Values used to label plots (e.g., numbers or strings). Coordinate names are always lowercase snake_case (e.g., `antenna_name`).
- **Data Variables**: Numerical values used for plotting. Data variable names are always uppercase, with words separated by underscores (e.g., `FIELD_PHASE_CENTER`).

For instance, in the [Measurement Set v4 schema](measurement_set/schema_and_api/measurement_set_schema.rst), `antenna_name` and `frequency` are coordinates, while `VISIBILITY` data are data variables.
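
As a rough sketch of this naming convention, the snippet below builds a toy dataset with one uppercase data variable and two lowercase coordinates. The shapes and values are made up for illustration and are much simpler than a real Measurement Set v4 dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical toy sizes; real MS v4 datasets have more dimensions and metadata.
n_time, n_freq = 2, 3

xds = xr.Dataset(
    data_vars={
        # Data variable: numerical values, uppercase name.
        "VISIBILITY": (
            ("time", "frequency"),
            np.zeros((n_time, n_freq), dtype=complex),
        ),
    },
    coords={
        # Coordinates: values that label the axes, lowercase names.
        "time": np.arange(n_time, dtype=float),
        "frequency": np.linspace(1.0e9, 1.2e9, n_freq),
    },
)

coord_names = sorted(xds.coords)
data_var_names = sorted(xds.data_vars)
```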

## Measures

Both data variables and coordinates can have additional metadata, such as measures information, stored in their attributes. XRADIO’s measures are based on [python-casacore measures](https://casacore.github.io/python-casacore/casacore_measures.html), with updates to align with [astropy coordinate](https://docs.astropy.org/en/stable/coordinates/index.html) naming conventions. The schema tables below describe the different types of XRADIO measures:

.. autoclass:: xradio.measurement_set.schema.TimeArray()

.. xradio_array_schema_table:: xradio.measurement_set.schema.TimeArray

.. autoclass:: xradio.measurement_set.schema.SpectralCoordArray()

.. xradio_array_schema_table:: xradio.measurement_set.schema.SpectralCoordArray

.. autoclass:: xradio.measurement_set.schema.SkyCoordArray()

.. xradio_array_schema_table:: xradio.measurement_set.schema.SkyCoordArray

.. autoclass:: xradio.measurement_set.schema.LocationArray()

.. xradio_array_schema_table:: xradio.measurement_set.schema.LocationArray

.. autoclass:: xradio.measurement_set.schema.DopplerArray()

.. xradio_array_schema_table:: xradio.measurement_set.schema.DopplerArray



## Coordinate Labels

For some types of measures, the data consists of values that are labeled using coordinate labels. These labels provide context for interpreting the data:

| Coordinate Label Name       | Values         | Related Measures Type |
|-----------------------------|----------------|-----------------------|
| ellipsoid_dir_label         | lon, lat       | location |
| ellipsoid_dis_label         | height         | location |
| cartesian_pos_label         | x, y, z        | location |
| cartesian_local_pos_label   | p, q, r        | location |
| galactic_sky_dir_label      | lon, lat       | sky_coord |
| local_sky_dir_label         | az, alt        | sky_coord |
| local_sky_dis_label         | dist           | sky_coord |
| sky_dir_label               | ra, dec        | sky_coord |
| sky_dist_label              | dist           | sky_coord |
| uvw_label                   | u, v, w        | uvw |
| receptor_label              | pol_0, pol_1   | quantity |
| tone_label                  | tone_0, tone_1 | spectral_coord |
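
To illustrate how a coordinate label is used, the sketch below attaches `cartesian_pos_label` to a toy array of antenna positions and selects components by label. The position values and the minimal attribute set (`type`, `units`) are assumptions made for this example, not a complete location measure.

```python
import numpy as np
import xarray as xr

# Hypothetical antenna positions in metres; the numbers are made up.
positions = xr.DataArray(
    np.array([[3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]),
    dims=["antenna_name", "cartesian_pos_label"],
    coords={
        "antenna_name": ["ant_0", "ant_1"],
        "cartesian_pos_label": ["x", "y", "z"],
    },
    attrs={"type": "location", "units": "m"},  # assumed minimal attributes
)

# Select a single component by its label instead of by integer index.
x = positions.sel(cartesian_pos_label="x")

# Reduce over the label dimension to get the distance from the origin.
distance = np.sqrt((positions**2).sum(dim="cartesian_pos_label"))
```

Selecting by label keeps the code independent of the order in which the components are stored.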



## Measures Example

The following example illustrates how measures information is included in both a data variable (`FIELD_PHASE_CENTER`) and a coordinate (`time`). The `FIELD_PHASE_CENTER` data variable has the dimensions `time` and `sky_dir_label`. Note that the `sky_coord` measure requires only the `sky_dir_label` dimension, not the `time` dimension. 

```python
import numpy as np
import pandas as pd
import xarray as xr

# Create an empty Xarray Dataset.
xds = xr.Dataset()

# Create the time coordinate with time measures attributes
# (Unix time in seconds, UTC scale).
time = xr.DataArray(
    pd.date_range("2000-01-01", periods=3).astype("datetime64[s]").astype(int),
    dims="time",
    attrs={"type": "time", "units": "s", "format": "unix", "scale": "utc"},
)

# Create FIELD_PHASE_CENTER data variable with coordinates time x sky_dir_label.
coords = {"time": time, "sky_dir_label": ["ra", "dec"]}

data = np.array(
    [
        [-2.10546176, -0.29611873],
        [-2.10521098, -0.29617315],
        [-2.1050196, -0.2961987],
    ]
)

xds["FIELD_PHASE_CENTER"] = xr.DataArray(
    data, coords=coords, dims=["time", "sky_dir_label"]
)

# Add sky_coord measures attributes to FIELD_PHASE_CENTER.
xds["FIELD_PHASE_CENTER"].attrs = {
    "type": "sky_coord",
    "units": ["rad", "rad"],
    "frame": "icrs",
}

xds
```

```python
# Example of creating an Astropy SkyCoord object from the FIELD_PHASE_CENTER data variable.
from astropy.coordinates import SkyCoord

phase_center = xds.FIELD_PHASE_CENTER
astropy_skycoord = SkyCoord(
    ra=phase_center.sel(sky_dir_label="ra").values,
    dec=phase_center.sel(sky_dir_label="dec").values,
    unit="rad",
    frame=phase_center.attrs["frame"],
)
astropy_skycoord
```

```
<SkyCoord (ICRS): (ra, dec) in deg
    [(239.36592723, -16.96635346), (239.38029586, -16.9694715 ),
     (239.39126113, -16.97093541)]>
```

## Lazy and Eager Functions

- Functions prefixed with `open_` perform **lazy execution**: only metadata, such as coordinates and attributes, is loaded into memory. Data variables are not loaded immediately; they are represented as lazy [Dask Arrays](https://docs.dask.org/en/stable/generated/dask.array.Array.html). These arrays load data into memory only when you explicitly call `.compute()`, `.load()`, or a related method.

- Functions prefixed with `load_` perform **eager execution**, loading all data into memory immediately. These functions can be integrated with [dask.delayed](https://docs.dask.org/en/stable/delayed.html) for more flexible execution.
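
The difference between the two modes can be sketched with xarray and dask alone. The snippet below does not call any XRADIO function: the dask-backed dataset stands in for the result of an `open_` function, and `load_chunk` is a hypothetical stand-in for an eager `load_` function wrapped with `dask.delayed`.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

# Lazy: the data variable is backed by a Dask Array, so nothing is computed yet.
lazy_xds = xr.Dataset(
    {"VISIBILITY": (("time", "frequency"), da.zeros((4, 8), chunks=(2, 8)))}
)
is_lazy = isinstance(lazy_xds["VISIBILITY"].data, da.Array)

# Calling .compute() materializes the values as in-memory NumPy arrays.
eager_xds = lazy_xds.compute()
is_in_memory = isinstance(eager_xds["VISIBILITY"].data, np.ndarray)


def load_chunk(time_slice):
    """Hypothetical eager loader: returns an in-memory array for one time slice."""
    return np.ones((time_slice.stop - time_slice.start, 8))


# Wrapping the eager function with dask.delayed defers each call until compute.
tasks = [dask.delayed(load_chunk)(slice(start, start + 2)) for start in (0, 2)]
chunks = dask.compute(*tasks)
total_rows = sum(chunk.shape[0] for chunk in chunks)
```

Wrapping an eager loader in `dask.delayed` lets each task load only the slice it needs, which keeps memory use per task bounded by the slice size.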