# Load Input Data in Parallel with Dask and UXarray 

## Overview

This usage example shows how to load unstructured-grid input data in parallel with Dask and UXarray to reduce memory usage.

## Imports
This notebook requires the following packages to be installed in the notebook environment. 
```
mamba install -c conda-forge uxarray dask 
```

In [22]:
import numpy as np 
import xarray as xr
import uxarray as ux
import dask as da
import glob 
from dask.diagnostics import ProgressBar

## Reading Data in Parallel

### Data

The data loaded in this notebook are simulated output from the Department of Energy (DOE) Energy Exascale Earth System Model (E3SM) version 2. The case is an atmosphere-only (AMIP) simulation with present-day control forcing (F2010) at 1-degree horizontal resolution (ne30pg2), with sea surface temperatures and sea ice set to the E3SMv2 defaults. The case is run for 6 years.

In [89]:
# Load a single file, chunked into blocks of 4 hybrid levels at midpoints (4 lev values per chunk)

data_file_monthonly = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/20231220.F2010.ENSO_ctl.lagreg.ne30pg2_EC30to60E2r2.keeling.eam.h0.0006-12.nc"
grid_file = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/E3SM_grid/ne30pg2_grd.nc"
uxds_e3sm_mon = ux.open_dataset(grid_file, data_file_monthonly, chunks={"lev": 4})

In [88]:
uxds_e3sm_mon.Q

| | Array | Chunk |
|---|---|---|
| Bytes | 5.93 MiB | 337.50 kiB |
| Shape | (1, 72, 21600) | (1, 4, 21600) |
| Dask graph | 18 chunks in 2 graph layers | 18 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |
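
The figures in the repr above follow directly from the array shape and the 4-byte `float32` element size. The following is a small sketch in plain Python that reproduces them; the helper names `nbytes` and `n_chunks` are illustrative and are not part of Dask or UXarray.

```python
import math

FLOAT32_BYTES = 4  # size of one float32 element


def nbytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed to hold an array of the given shape."""
    return math.prod(shape) * itemsize


def n_chunks(shape, chunk_shape):
    """Number of chunks when `shape` is tiled by `chunk_shape`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))


array_shape = (1, 72, 21600)  # (time, lev, n_face) as reported for Q
chunk_shape = (1, 4, 21600)   # 4 levels per chunk, from chunks={"lev": 4}

array_mib = nbytes(array_shape) / 2**20
chunk_kib = nbytes(chunk_shape) / 2**10
chunks = n_chunks(array_shape, chunk_shape)

print(f"{array_mib:.2f} MiB, {chunk_kib:.2f} kiB, {chunks} chunks")
# prints: 5.93 MiB, 337.50 kiB, 18 chunks
```

Only one chunk of about 338 kiB needs to be held in memory at a time when a computation can proceed level block by level block, rather than the full 5.93 MiB array.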


In [93]:
# Load multiple files, requesting chunks of 12 time steps along the time dimension
data_files = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/*.nc"
uxds_e3sm_multi = ux.open_mfdataset(grid_file, data_files, chunks={"time": 12})

In [94]:
uxds_e3sm_multi.TS

| | Array | Chunk |
|---|---|---|
| Bytes | 5.93 MiB | 84.38 kiB |
| Shape | (72, 21600) | (1, 21600) |
| Dask graph | 72 chunks in 145 graph layers | 72 chunks in 145 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |
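
Although `chunks={"time": 12}` was requested, the output above reports 72 chunks of one time step each. A likely explanation is that chunking is applied to each file before the files are concatenated, so a chunk cannot span more than one file, and the single-file example earlier shows one time step per monthly file. The sketch below is a simplified model of that behavior, not the library's implementation; `chunks_per_file` is a hypothetical helper, and the six yearly files in the second call are an invented contrast case.

```python
def chunks_per_file(file_lengths, requested):
    """Chunk sizes along one dimension when each file is chunked on its own.

    file_lengths: number of steps along the dimension in each file
    requested:    requested chunk size along that dimension
    """
    sizes = []
    for n in file_lengths:
        full, rest = divmod(n, requested)
        sizes.extend([requested] * full)  # whole chunks inside this file
        if rest:
            sizes.append(rest)            # leftover partial chunk, if any
    return tuple(sizes)


# 72 monthly files with one time step each, as in this notebook
monthly = chunks_per_file([1] * 72, requested=12)

# Hypothetical contrast: the same 72 steps stored as 6 yearly files
yearly = chunks_per_file([12] * 6, requested=12)

print(len(monthly), "chunks of size", set(monthly))
print(len(yearly), "chunks of size", set(yearly))
```

If larger chunks along `time` are wanted for files like these, one option is to rechunk after opening, for example with Xarray's `Dataset.chunk` method, which adds a rechunking step to the task graph.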


In [110]:
# With plain Xarray, for comparison: load the same files, requesting chunks of 6 time steps (no grid file is attached)
data_files = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/*.nc"
xrds_e3sm_multi = xr.open_mfdataset(data_files, chunks={"time": 6})

In [111]:
xrds_e3sm_multi.TS

| | Array | Chunk |
|---|---|---|
| Bytes | 5.93 MiB | 84.38 kiB |
| Shape | (72, 21600) | (1, 21600) |
| Dask graph | 72 chunks in 145 graph layers | 72 chunks in 145 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


## Performance Improvement with `parallel`

In [99]:
%%time
uxds_e3sm_basic_load = ux.open_mfdataset(grid_file, data_files)

CPU times: user 20 s, sys: 313 ms, total: 20.3 s
Wall time: 22.5 s


In [100]:
%%time
uxds_e3sm_parallel_load = ux.open_mfdataset(grid_file, data_files, chunks={"time": 12}, parallel=True)

CPU times: user 20.1 s, sys: 376 ms, total: 20.5 s
Wall time: 21.8 s
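
To put the two timings above on a common scale, the short calculation below derives the speedup from the reported wall times. The numbers are copied from the cell outputs; the variable names are illustrative.

```python
# Wall-clock times reported by the two %%time cells above, in seconds
wall_basic = 22.5     # ux.open_mfdataset(grid_file, data_files)
wall_parallel = 21.8  # same call with chunks={"time": 12}, parallel=True

speedup = wall_basic / wall_parallel
saved_percent = 100 * (wall_basic - wall_parallel) / wall_basic

print(f"speedup: {speedup:.2f}x, time saved: {saved_percent:.1f}%")
# prints: speedup: 1.03x, time saved: 3.1%
```

On this run the difference is about 3%, which could plausibly fall within run-to-run variation, so a single pair of timings is not conclusive. The second call also sets both `chunks` and `parallel=True`, so the comparison does not isolate the effect of `parallel` alone.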
