## yt_xarray

linking yt & xarray

* https://github.com/data-exp-lab/yt_xarray/
* https://yt-xarray.readthedocs.io/en/latest/

(this presentation is built with jupyterlab-deck: https://github.com/deathbeds/jupyterlab-deck)

## xarray

Multidimensional array IO:

* self-describing data formats (netcdf, ...)

* arbitrary dimension names

* distributed support (chunks to files): 
    * dask arrays 
    * zarr arrays

Load in a [GEOS](https://gmao.gsfc.nasa.gov/GEOS_systems/) dataset (~2 GB, NASA Global Modeling and Assimilation Office):

In [1]:
import xarray as xr 
import os 

fname_geos = os.path.expanduser("~/hdd/data/yt_data/yt_sample_sets/geos/GEOS.fp.asm.inst3_3d_aer_Nv.20180822_0900.V01.nc4")
ds = xr.open_dataset(fname_geos)
ds

view metadata for a variable:

In [2]:
ds.AIRDENS   # or ds.data_vars["AIRDENS"]

extract ordered dimension names:

In [3]:
ds.AIRDENS.dims

('time', 'lev', 'lat', 'lon')

## Data selection with xarray 

### array access and slicing

In [4]:
ds.AIRDENS[0,0,:,:]  # ranges, masking, etc...

need to remember axis ordering!

extracting raw np arrays:

In [5]:
dens = ds.AIRDENS[0,0,:,:].to_numpy()
type(dens)

numpy.ndarray
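Positional indexing only works if you know the axis order. The sketch below uses a small synthetic array (a stand-in for `ds.AIRDENS`, not the GEOS data) to show how an axis position can be looked up by dimension name instead of memorized:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for ds.AIRDENS with the same dimension order.
airdens = xr.DataArray(
    np.arange(2 * 3 * 4 * 5, dtype="float64").reshape(2, 3, 4, 5),
    dims=("time", "lev", "lat", "lon"),
)

# Look up the axis position of a dimension by name.
lat_axis = airdens.get_axis_num("lat")

# Positional slicing: first time step, first level, all lat/lon.
slab = airdens[0, 0, :, :].to_numpy()

print(lat_axis, slab.shape, type(slab).__name__)
```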

### selection by coordinate **name**

by index:

In [6]:
ds.AIRDENS.isel(lev=1, time=0, lon=3, lat=4)

by **exact** value:

In [7]:
ds.sel(lev=2.0, lat=-89.0)

with some fuzziness: 

In [8]:
ds.sel(lev=2.0, lat=-89.013, method="nearest")

finally, with dictionary:

In [9]:
ds.sel({"lev":2.0, "lat":-89.0})  # important for yt_xarray!
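To compare the four selection styles side by side, here is a minimal sketch on a synthetic dataset. The variable name and coordinate values are made up to resemble the GEOS file; every call picks out the same cell:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the GEOS dataset (hypothetical coordinate values).
lev = np.array([1.0, 2.0, 3.0])
lat = np.arange(-90.0, 91.0)     # -90, -89, ..., 90
lon = np.arange(-180.0, 180.0)   # -180, -179, ..., 179
values = np.arange(lev.size * lat.size * lon.size, dtype="float64")
demo = xr.Dataset(
    {"AIRDENS": (("lev", "lat", "lon"), values.reshape(lev.size, lat.size, lon.size))},
    coords={"lev": lev, "lat": lat, "lon": lon},
)

# By integer index along named dimensions.
by_index = demo.AIRDENS.isel(lev=1, lat=1, lon=0)
# By exact coordinate value.
by_value = demo.AIRDENS.sel(lev=2.0, lat=-89.0, lon=-180.0)
# By nearest coordinate value.
by_nearest = demo.AIRDENS.sel(lev=2.0, lat=-89.013, lon=-180.0, method="nearest")
# By dictionary, which is convenient when the names are only known at runtime.
by_dict = demo.AIRDENS.sel({"lev": 2.0, "lat": -89.0, "lon": -180.0})

print(by_index.item(), by_value.item(), by_nearest.item(), by_dict.item())
```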

## xarray & dask 

In [10]:
ds.close()
del ds

Test data set ([generated from here](https://github.com/chrishavlin/yt-xarray-dask-sandbox/blob/main/example.ipynb)):
* random field data 
* some chunks
* 1 chunk = 1 .nc file

In [13]:
data_dir = os.path.expanduser("~/src/yt_/daskify/yt-xarray-dask-sandbox/data_small")
dask_test_ds = os.path.join(data_dir, "*.nc")
ds = xr.open_mfdataset(dask_test_ds)
ds

Output (dask array summary, repeated for each data variable):

| | Array | Chunk |
|---|---|---|
| Bytes | 4.52 MiB | 72.35 kiB |
| Shape | (84, 84, 84) | (21, 21, 21) |
| Dask graph | 64 chunks in 149 graph layers | |
| Data type | float64 numpy.ndarray | |


In [14]:
ds.temperature

| | Array | Chunk |
|---|---|---|
| Bytes | 4.52 MiB | 72.35 kiB |
| Shape | (84, 84, 84) | (21, 21, 21) |
| Dask graph | 64 chunks in 149 graph layers | |
| Data type | float64 numpy.ndarray | |


* **Coordinates** are held in memory and span all chunks!
* **Data variables** are dask arrays
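This split can be checked on a small synthetic dataset. The sketch below chunks an in-memory dataset with `.chunk()` (this needs dask installed) rather than opening files, so it only approximates what `open_mfdataset` produces:

```python
import numpy as np
import xarray as xr

# Synthetic in-memory dataset (hypothetical stand-in for the multi-file test data).
n = 8
coords = {name: np.linspace(0.0, 1.0, n) for name in ("x", "y", "z")}
demo = xr.Dataset(
    {"temperature": (("x", "y", "z"), np.random.default_rng(0).random((n, n, n)))},
    coords=coords,
)

# Split the data variables into dask chunks of 4 cells per dimension.
demo = demo.chunk({"x": 4, "y": 4, "z": 4})

# Data variables are chunked dask arrays ...
print(demo.temperature.chunks)
# ... while the index coordinates stay in memory and report no chunks.
print(demo.x.chunks)
```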

Returning in-memory values:

In [15]:
ds.temperature.mean()

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Dask graph | 1 chunks in 152 graph layers | |
| Data type | float64 numpy.ndarray | |


In [16]:
ds.temperature.mean().values  # triggers the computation and returns a NumPy array

array(9.99834171)

In [19]:
ds.temperature.mean().load()  # to preserve xarray-ness
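The three ways of forcing a computation differ in what they return. A minimal sketch on a small synthetic dask-backed array (needs dask installed):

```python
import numpy as np
import xarray as xr

# Small dask-backed array (synthetic data, every value is 10.0).
temperature = xr.DataArray(np.full((4, 4), 10.0), dims=("x", "y")).chunk({"x": 2})

lazy_mean = temperature.mean()   # still a lazy, dask-backed DataArray
as_numpy = lazy_mean.values      # computes and returns a NumPy array
as_xarray = lazy_mean.compute()  # computes and returns a new in-memory DataArray
in_place = lazy_mean.load()      # computes and stores the result in lazy_mean itself

print(type(as_numpy).__name__, type(as_xarray).__name__, in_place is lazy_mean)
```

`.compute()` leaves the original object lazy, while `.load()` replaces its dask array with the computed values.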

**selections are also delayed (important for yt_xarray!):**

In [17]:
vals = ds.temperature.isel(z=range(10)).sel(x=1, y=2, method="nearest")
vals

| | Array | Chunk |
|---|---|---|
| Bytes | 80 B | 80 B |
| Shape | (10,) | (10,) |
| Dask graph | 1 chunks in 151 graph layers | |
| Data type | float64 numpy.ndarray | |


In [18]:
vals.values

array([ 7.43742365,  3.47532451,  2.02274815,  8.86139121, 12.68530917,
        5.15708079, 15.90768807,  3.42061532,  8.44047676, 17.54390986])

## what about yt?



previously:

1. load in arrays
2. use yt generic data loader (`yt.load_uniform_grid(...)`)
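The manual workflow can be sketched as follows. The helper names are hypothetical, and taking the bounding box from the coordinate min/max is a simplifying assumption. `yt` is imported inside the function so the array-preparation step runs without it:

```python
import numpy as np
import xarray as xr


def uniform_grid_args(da):
    """Return (data, shape, bbox) for a 3D DataArray on a uniform grid.

    Assumption: the bounding box is the min/max of each coordinate,
    i.e. the coordinate values are treated as the domain edges.
    """
    data = {da.name: da.to_numpy()}  # step 1: load the array into memory
    bbox = np.array(
        [[float(da[dim].min()), float(da[dim].max())] for dim in da.dims]
    )
    return data, da.shape, bbox


def load_manually(da, length_unit="km"):
    """Step 2: hand the in-memory array to yt's generic loader."""
    import yt  # imported lazily so the helper above works without yt

    data, shape, bbox = uniform_grid_args(da)
    return yt.load_uniform_grid(data, shape, length_unit=length_unit, bbox=bbox)


# Synthetic 3D field standing in for a real dataset variable.
coords = {
    "x": np.linspace(0, 1, 4),
    "y": np.linspace(0, 2, 5),
    "z": np.linspace(0, 3, 6),
}
field = xr.DataArray(
    np.ones((4, 5, 6)), coords=coords, dims=("x", "y", "z"), name="density"
)
data, shape, bbox = uniform_grid_args(field)
print(shape, bbox.tolist())
```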


**yt_xarray** v0.1.1: yt datasets from xarray datasets

automates steps 1 and 2 as much as possible!

## **yt_xarray** usage overview

yt_xarray provides a `yt` "accessor object":

In [None]:
import yt_xarray

In [None]:
ds.yt

In [None]:
ds.yt.  # press <Tab> here to list the accessor's methods

### Loading all data (not always possible):

In [None]:
ds_yt = ds.yt.load_grid(length_unit="km")

In [None]:
ds_yt.field_list

In [None]:
import yt
yt.SlicePlot(ds_yt, "x", ("stream", "gauss"))

### not always so easy...

In [None]:
ds = yt_xarray.open_dataset('wrf/wrfout_d03_2016-06-01.nc')  # checks yt paths

In [None]:
import xwrf

In [None]:
ds_x = ds.xwrf.postprocess()
ds_x

1. different dimensionality of fields (including time)
2. yt has strict coordinate names (latitude, longitude, altitude)

### choose a subset of fields

In [None]:
# ds_x.yt.load_grid()
ds_yt = ds_x.yt.load_grid(
    fields=('geopotential', 'geopotential_height')
)

### choose a time to load

In [None]:
ds_yt = ds_x.yt.load_grid(
    fields=('geopotential', 'geopotential_height'),                      
    sel_dict={'Time':0})
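A selection dictionary reduces a field with extra dimensions down to the three spatial ones. The sketch below shows the equivalent step in plain xarray on a synthetic array. It assumes the dictionary is applied by index, as `isel` does, and the dimension names are illustrative:

```python
import numpy as np
import xarray as xr

# Synthetic 4D field with a time dimension (dimension names are illustrative).
geopotential = xr.DataArray(
    np.zeros((3, 4, 5, 6)),
    dims=("Time", "z_stag", "y", "x"),
)

# Pick the first time step; the result has only the spatial dimensions left.
sel_dict = {"Time": 0}
reduced = geopotential.isel(sel_dict)

print(reduced.dims, reduced.shape)
```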

### COORDINATE ALIASING

In [None]:
yt_xarray.known_coord_aliases

In [None]:
yt_xarray.known_coord_aliases["z_stag"] = "z"
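Conceptually, aliasing is a dictionary lookup from a dataset's coordinate name to the name yt expects. A minimal sketch with a hypothetical alias table and helper (the actual contents of `yt_xarray.known_coord_aliases` may differ):

```python
# Hypothetical alias table; not the real contents of known_coord_aliases.
coord_aliases = {"lat": "latitude", "lon": "longitude"}


def to_yt_name(name, aliases):
    """Return the yt coordinate name for a dataset coordinate name."""
    # Names without an alias are passed through unchanged.
    return aliases.get(name, name)


# Register a new alias, mirroring the assignment above.
coord_aliases["z_stag"] = "z"

yt_names = [to_yt_name(c, coord_aliases) for c in ("z_stag", "lat", "lon", "x")]
print(yt_names)
```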

In [None]:
ds_yt = ds_x.yt.load_grid(fields=('geopotential', 'geopotential_height'),
                          sel_dict={'Time':0},
                          length_unit='m',
                          use_callable=False)

There is a separate problem with the 3D data (a bug where the interpolation goes wrong), so select a single level as well:

In [None]:
ds_yt = ds_x.yt.load_grid(fields=('geopotential', 'geopotential_height'),
                          sel_dict={'Time':0, 'z_stag':4},
                          length_unit='m')   

finally ... 

In [None]:
slc = yt.SlicePlot(ds_yt, "z", ("stream", "geopotential_height"))
slc.set_log("all", False)

**Note**: yt functions need the yt coordinate names (e.g. `"z"` rather than `"z_stag"`)

**What is [geopotential height](https://legacy.climate.ncsu.edu/images/climate/enso/geo_heights.php)?**: 

* cold air is denser than warm air
* pressure at a point in the atmosphere comes from the weight of the overlying air

geopotential height = the altitude at which a given pressure is reached
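More formally, geopotential height is the geopotential divided by standard gravity, g0 = 9.80665 m/s². A short sketch of that conversion (the function name is illustrative and the input values are made up):

```python
import numpy as np

G0 = 9.80665  # standard gravity, m/s**2


def geopotential_to_height(geopotential):
    """Convert geopotential (m**2/s**2) to geopotential height (m)."""
    return np.asarray(geopotential, dtype="float64") / G0


heights = geopotential_to_height([0.0, 9806.65, 98066.5])
print(heights)
```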


### yt_xarray chunking

create a test dataset with a dask array:

In [20]:
# from dask import array as da
# import numpy as np

# shp = (800, 650, 500)
# f1 = da.random.random(shp , chunks=100)
# coords = {'x': np.linspace(0, 1, shp[0]),
#           'y': np.linspace(0, 1, shp[1]),
#           'z': np.linspace(0, 1, shp[2])}

# data = {'test_field': xr.DataArray(f1, coords=coords, dims=('x', 'y', 'z'))}
# ds = xr.Dataset(data)
# ds.test_field
print("REBUILD THE DATASET")
data_dir = os.path.expanduser("~/src/yt_/daskify/yt-xarray-dask-sandbox/data")
dask_test_ds = os.path.join(data_dir, "*.nc")
ds = xr.open_mfdataset(dask_test_ds)
ds

Output (dask array summary, repeated for each data variable):

| | Array | Chunk |
|---|---|---|
| Bytes | 13.20 GiB | 13.52 MiB |
| Shape | (1210, 1210, 1210) | (121, 121, 121) |
| Dask graph | 1000 chunks in 2111 graph layers | |
| Data type | float64 numpy.ndarray | |


In [23]:
import yt_xarray
yt_ds = ds.yt.load_grid(fields=("gauss",), length_unit='m', chunksizes=121)

yt_xarray : [INFO ] 2023-02-21 17:19:23,279:  Inferred geometry type is cartesian. To override, use ds.yt.set_geometry
yt_xarray : [INFO ] 2023-02-21 17:19:23,281:  Attempting to detect if yt_xarray will require field interpolation:
yt_xarray : [INFO ] 2023-02-21 17:19:23,282:      Cartesian geometry on uniform grid: yt_xarray will not interpolate.
yt_xarray : [INFO ] 2023-02-21 17:19:23,283:  Constructing a yt chunked grid with 1000 chunks.
yt : [INFO     ] 2023-02-21 17:19:23,393 Parameters: current_time              = 0.0
yt : [INFO     ] 2023-02-21 17:19:23,394 Parameters: domain_dimensions         = [1210 1210 1210]
yt : [INFO     ] 2023-02-21 17:19:23,395 Parameters: domain_left_edge          = [-8.67361738e-19 -8.67361738e-19 -8.67361738e-19]
yt : [INFO     ] 2023-02-21 17:19:23,397 Parameters: domain_right_edge         = [10. 10. 10.]
yt : [INFO     ] 2023-02-21 17:19:23,397 Parameters: cosmological_simulation   = 0
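The chunk count in the log follows from the array shape and `chunksizes`: 1210 / 121 = 10 chunks per dimension, so 10 × 10 × 10 = 1000 chunks. A minimal sketch of that bookkeeping (the helper is hypothetical, not yt_xarray's implementation):

```python
import itertools

import numpy as np


def chunk_index_ranges(shape, chunksize):
    """Return the (start, stop) index pairs per dimension for every chunk."""
    per_dim = []
    for n in shape:
        starts = np.arange(0, n, chunksize)
        # The last chunk is clipped to the array size.
        stops = np.minimum(starts + chunksize, n)
        per_dim.append(list(zip(starts.tolist(), stops.tolist())))
    # Every combination of per-dimension ranges is one chunk.
    return list(itertools.product(*per_dim))


chunks = chunk_index_ranges((1210, 1210, 1210), 121)
print(len(chunks), chunks[0], chunks[-1])
```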


In [None]:
# index

In [26]:
import yt
slc = yt.SlicePlot(yt_ds, "z", ("stream", "gauss"))
slc.annotate_grids()
slc.show()

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 528 in H5Aopen_by_name(): can't open attribute
    major: Attribute
    minor: Can't open object
  #001: H5VLcallback.c line 1091 in H5VL_attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: H5VLcallback.c line 1058 in H5VL__attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: H5VLnative_attr.c line 130 in H5VL__native_attr_open(): can't open attribute
    major: Attribute
    minor: Can't open object
  #004: H5Aint.c line 545 in H5A__open_by_name(): unable to load attribute info from object header
    major: Attribute
    minor: Unable to initialize object
  #005: H5Oattribute.c line 494 in H5O__attr_open_by_name(): can't locate attribute: '_QuantizeBitGroomNumberOfSignificantDigits'
    major: Attribute
    minor: Object not found

# yt_xarray code tour 

yt_xarray loads data through yt's Stream frontend via `yt.load_amr_grids`:


```python
yt.load_amr_grids(
    grid_data,  # the data OR FUNCTION for the grid(s)
    data_shp,  # global grid shape, (Nx, Ny, Nz)
    geometry=geom,  # e.g., ('cartesian', ('x', 'z', 'y'))
    bbox=bbox,  # the bounding box
    length_unit=length_unit,
    **kwargs,
)
```

Form of `grid_data` depends on:

* grid type (uniform, stretched)
* memory management: delayed reads (`use_callable`) vs in memory
* chunking
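For the simplest case (uniform grid, data in memory, chunked along one axis), `grid_data` is a list of dictionaries, one per grid, each holding the grid's edges, refinement level, dimensions, and field data. The sketch below builds such a list with a hypothetical helper. It assumes the x axis divides evenly into chunks and is not how yt_xarray itself constructs the list:

```python
import numpy as np


def build_grid_data(field, n_chunks_x, bbox):
    """Split a 3D array along x into level-0 grid dictionaries.

    Assumption: field.shape[0] is divisible by n_chunks_x.
    """
    nx = field.shape[0]
    x_edges = np.linspace(bbox[0][0], bbox[0][1], n_chunks_x + 1)
    index_edges = np.linspace(0, nx, n_chunks_x + 1).astype(int)
    grids = []
    for i in range(n_chunks_x):
        block = field[index_edges[i]:index_edges[i + 1]]
        grids.append(
            {
                "left_edge": [float(x_edges[i]), bbox[1][0], bbox[2][0]],
                "right_edge": [float(x_edges[i + 1]), bbox[1][1], bbox[2][1]],
                "level": 0,
                "dimensions": list(block.shape),
                # In-memory case: the field entry is the array itself.
                ("stream", "gauss"): block,
            }
        )
    return grids


field = np.random.default_rng(0).random((8, 4, 4))
bbox = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
grid_data = build_grid_data(field, n_chunks_x=2, bbox=bbox)
print(len(grid_data), grid_data[1]["left_edge"], grid_data[1]["dimensions"])
```

A list like this would be passed as the first argument to `yt.load_amr_grids`, together with the global shape `(8, 4, 4)`.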

We'll look at:

* `yt_xarray.accessor.accessor.YtAccessor` : the top level accessor object
* `yt_xarray.accessor._xr_to_yt.Selection` : yt-xr translation, mapping of selections
* `yt_xarray.accessor._readers._get_xarray_reader`: building a function to load the data when needed
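The idea behind a delayed reader can be sketched as a closure that remembers which index range to read and only touches the data when called. The helper below is hypothetical; its signature is not that of `_get_xarray_reader` or of the callables yt expects:

```python
import numpy as np
import xarray as xr


def make_reader(da, start, stop):
    """Return a function that reads da[start:stop] only when called.

    Hypothetical helper, not yt_xarray's internal reader.
    """
    selection = {dim: slice(s, e) for dim, s, e in zip(da.dims, start, stop)}

    def reader():
        # With a dask-backed DataArray, this is where data would be read.
        return da.isel(selection).to_numpy()

    return reader


# Synthetic field; values encode their own position (16*i + 4*j + k).
field = xr.DataArray(
    np.arange(4 * 4 * 4, dtype="float64").reshape(4, 4, 4),
    dims=("x", "y", "z"),
)
read_block = make_reader(field, start=(0, 0, 0), stop=(2, 2, 2))

# Nothing has been read yet; the data is only extracted by this call.
block = read_block()
print(block.shape)
```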
