## Summary

### Variability related to resampling methods

The figures below show the resampling duration and peak memory allocation for each method and zoom level. Here are some key takeaways:

- ODC and Rioxarray (both based on GDAL) consistently had relatively fast times and low memory usage.
- Rasterio had similar memory usage and time for local tile generation to ODC and Rioxarray, but was much slower for remote files. This was likely due to the use of the NetCDF driver, which is not optimized for remote access patterns.
- XESMF without pre-generated weights was consistently the slowest method, so pre-generating weights is advisable. Pre-generated weights can be reused whenever the input grid structure, output grid structure, and resampling method are the same, even if the data differ.
- Generating a tile for an ultra-high resolution dataset (MUR SST - 0.01 degree global) was prohibitively slow with XESMF. For ultra-high resolution datasets, regridding with XESMF would likely require pre-generating the weights using `mpirun` and the `ESMF_regrid` tool.
- Resampling methods typically had a peak memory allocation roughly twice the size of the required portion of the input dataset. For example, the `analysed_sst` variable of the MUR SST dataset is ~5.2 GB and generating a global web mercator tile had ~10 GB peak memory allocation for most methods. Pyresample was a notable exception, with much greater peak memory allocation for global and nearly global tiles.
- Rasterio was not included as a resampling method for GPM IMERG due to a lack of simple methods for handling the non-standard axis order (e.g., (time, x, y) instead of (time, y, x)).
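
Weights can be reused because they depend only on the input grid, the output grid, and the resampling method, not on the data values. The sketch below illustrates the idea with plain NumPy for 1-D linear interpolation. It is not XESMF's API; `build_linear_weights` and the example grids are hypothetical names chosen for illustration.

```python
import numpy as np


def build_linear_weights(src_coords, dst_coords):
    """Build a dense (n_dst, n_src) linear-interpolation weight matrix.

    Hypothetical helper: assumes src_coords is strictly increasing and
    every dst coordinate lies within [src_coords[0], src_coords[-1]].
    """
    src = np.asarray(src_coords, dtype=float)
    dst = np.asarray(dst_coords, dtype=float)
    # Index of the left source point bracketing each destination point.
    left = np.clip(np.searchsorted(src, dst, side="right") - 1, 0, src.size - 2)
    # Fractional position of each destination point between its two neighbors.
    frac = (dst - src[left]) / (src[left + 1] - src[left])
    weights = np.zeros((dst.size, src.size))
    rows = np.arange(dst.size)
    weights[rows, left] = 1.0 - frac
    weights[rows, left + 1] = frac
    return weights


# The expensive step: computed once per (input grid, output grid, method).
src_grid = np.linspace(0.0, 10.0, 11)
dst_grid = np.linspace(0.0, 10.0, 5)
weights = build_linear_weights(src_grid, dst_grid)

# The cheap step: the same weights regrid any data on the same input grid.
field_a = 2.0 * src_grid + 1.0
field_b = -0.5 * src_grid
regridded_a = weights @ field_a
regridded_b = weights @ field_b
```

Real regridding libraries store such weights as sparse matrices because each output cell depends on only a few input cells. XESMF follows the same pattern by writing the weights to a file once and reading them back on later runs.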

In [1]:
from plotting import (
    plot_duration_by_weboptimization,
    plot_memory,
    plot_memory_by_weboptimization,
    plot_time,
    plot_time_by_format,
)

In [2]:
# Plot time required for resampling GPM IMERG
gpm_imerg_local = plot_time("gpm_imerg", local=True, format="netcdf")
gpm_imerg_remote = plot_time("gpm_imerg", local=False, format="netcdf")
(gpm_imerg_local + gpm_imerg_remote).cols(1)

In [3]:
# Plot time required for resampling MUR SST
mur_sst_local = plot_time("mursst", local=True, format="netcdf")
mur_sst_remote = plot_time("mursst", local=False, format="netcdf")
(mur_sst_local + mur_sst_remote).cols(1)

In [4]:
# Plot memory required for resampling GPM IMERG
gpm_imerg_local = plot_memory("gpm_imerg", local=True, format="netcdf")
gpm_imerg_remote = plot_memory("gpm_imerg", local=False, format="netcdf")
(gpm_imerg_local + gpm_imerg_remote).cols(1)

In [5]:
# Plot memory required for resampling MUR SST
mur_sst_local = plot_memory("mursst", local=True, format="netcdf")
mur_sst_remote = plot_memory("mursst", local=False, format="netcdf")
(mur_sst_local + mur_sst_remote).cols(1)
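
As a rough check on the ~5.2 GB and ~10 GB figures quoted in the takeaways, the sketch below estimates the in-memory size of one global 0.01 degree field. It assumes a regular 18000 x 36000 grid decoded to 8-byte floats. The actual grid shape and decoded dtype used in the benchmarks are not stated here, so treat the result as an approximation. `global_grid_nbytes` is a hypothetical helper.

```python
def global_grid_nbytes(resolution_deg, bytes_per_value=8):
    """Estimate bytes for one global lat/lon field at a given resolution."""
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat * n_lon * bytes_per_value


# Size of the input field, and the roughly 2x peak observed for most methods.
input_gb = global_grid_nbytes(0.01) / 1e9
expected_peak_gb = 2 * input_gb
print(f"input: {input_gb:.2f} GB, approximate peak: {expected_peak_gb:.2f} GB")
```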

### Variability related to I/O

The figures below show the resampling duration and peak memory allocation for three storage and access configurations: data stored as NetCDF and accessed through the H5NetCDF library, data stored as NetCDF but virtualized into Zarr and accessed via the Zarr and Icechunk libraries, and data converted to Zarr and accessed via the Zarr and Icechunk libraries. Here are some key takeaways:

- Virtualizing the data as Zarr gives a >2x performance improvement relative to loading with the H5NetCDF library.
- If the chunk sizes remain the same, virtualization gives the same performance benefit as conversion to a cloud-optimized data format like Zarr. Differences would be observed if the chunk configuration and size were optimized for the particular workflow.

In [6]:
plot_time_by_format("mursst")

In [7]:
plot_time_by_format("gpm_imerg")

### Variability related to web-optimization

The figures below show the resampling duration and peak memory allocation for tile generation from COGs relative to virtualized NetCDF and "web-optimized Zarr". Here are some key takeaways:

- Overviews dramatically improve the performance of tile generation at all zoom levels. For example, tile generation was 20x as fast at zoom level 0 and 3x as fast at zoom level 6.
- Resampling from Web-Optimized Zarr (WOZ) using rioxarray added overhead relative to resampling from WOZ using rasterio, due to the increased import times and object instantiation times in Xarray relative to using Zarr, Numpy, and Rasterio alone. While the performance differences between COG and WOZ resampling with rasterio could likely be eliminated with future development, rasterio will likely always be faster than rioxarray when using overviews.
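
One way to see why overviews help most at low zoom levels is to compare the ground size of a tile pixel with the native pixel size of the data. The sketch below assumes 256-pixel Web Mercator tiles, the WGS84 equatorial radius, and a native resolution of 0.01 degree measured at the equator. These are illustrative assumptions, not values taken from the benchmark configuration, and both function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumed tile width in pixels


def tile_pixel_size_m(zoom):
    """Ground size of one tile pixel at the equator for a zoom level."""
    return 2 * math.pi * EARTH_RADIUS_M / (TILE_SIZE_PX * 2**zoom)


def native_pixel_size_m(resolution_deg):
    """Approximate ground size of one native pixel at the equator."""
    return math.radians(resolution_deg) * EARTH_RADIUS_M


native = native_pixel_size_m(0.01)
for zoom in (0, 6):
    ratio = tile_pixel_size_m(zoom) / native
    print(f"zoom {zoom}: one tile pixel spans ~{ratio:.1f} native pixels per side")
```

Under these assumptions a zoom 0 tile pixel covers roughly 140 native pixels per side, while a zoom 6 tile pixel covers about 2. Without overviews, a low-zoom tile therefore has to read and reduce far more source data, which is consistent with the larger speedup at zoom level 0.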

In [8]:
plot_duration_by_weboptimization()

In [9]:
plot_memory_by_weboptimization()

### Implications for future development

- Virtualizing archival file formats greatly improves performance relative to archival file readers such as h5netcdf and motivates the generation of virtual references whenever possible.
- The Web-Optimized Zarr example shows the potential for Zarr overviews to enable highly performant visualization and motivates the development of the GeoZarr and multi-scales Zarr specifications.
- Pyinstrument showed that a significant fraction of the total time when resampling Web-Optimized Zarr using rioxarray went towards Xarray importing Pandas and guessing the chunk manager. Both of these components could be improved or removed through future development.
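
Import overhead of the kind described above can be approximated without a profiler. The sketch below is not the Pyinstrument workflow used for these benchmarks. It is a minimal standard-library approach, and `time_import` is a hypothetical helper name.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds spent importing a module in this process.

    The result is near zero if the module is already cached in
    sys.modules, so use a fresh interpreter for meaningful numbers.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# "json" is used only because it is always available; substitute the
# module of interest (for example "pandas" or "xarray") if installed.
already_loaded = "json" in sys.modules
elapsed = time_import("json")
print(f"json import took {elapsed:.4f} s (already loaded: {already_loaded})")
```

For a per-module breakdown, CPython's `-X importtime` option (for example `python -X importtime -c "import xarray"`) reports the time attributed to each imported module.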