# Global snowmelt runoff onset zarr store creation

This notebook creates and initializes the global zarr store that will contain the snowmelt runoff onset dataset. The zarr format provides efficient compressed storage and chunked access patterns optimized for cloud-based processing and analysis.

## Zarr dataset overview

### Variables

- **runoff_onset**: Day of water year (DOWY) of snowmelt runoff onset for each water year (2015-2024)
- **runoff_onset_median**: Median DOWY across all available water years 
- **runoff_onset_mad**: Median absolute deviation of runoff onset timing
- **temporal_resolution**: Effective temporal resolution of runoff_onset for each water year
- **temporal_resolution_median**: Median temporal resolution across water years
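
To make the DOWY unit concrete, here is a minimal sketch of converting a calendar date to a water year and day of water year. It follows the water-year convention recorded in the dataset's `water_year` attribute (October 1 start in the northern hemisphere, April 1 start in the southern hemisphere) and assumes that day 1 is the first day of the water year. The function name `day_of_water_year` is illustrative and is not part of the project's code.

```python
from datetime import date


def day_of_water_year(d: date, southern_hemisphere: bool = False) -> tuple[int, int]:
    """Return (water_year, DOWY) for a calendar date.

    Assumed convention: NH water year N runs from (N-1)-10-01 to N-09-30,
    SH water year N runs from N-04-01 to (N+1)-03-31, and DOWY starts at 1.
    """
    start_month = 4 if southern_hemisphere else 10
    # Calendar year in which the enclosing water year began
    start_year = d.year if d.month >= start_month else d.year - 1
    start = date(start_year, start_month, 1)
    # NH water years are labeled by the year they end in, SH by the year they start in
    water_year = start_year if southern_hemisphere else start_year + 1
    return water_year, (d - start).days + 1


print(day_of_water_year(date(2014, 10, 1)))                            # (2015, 1)
print(day_of_water_year(date(2015, 9, 30)))                            # (2015, 365)
print(day_of_water_year(date(2016, 3, 31), southern_hemisphere=True))  # (2015, 366)
```

The last call lands on day 366 because that southern hemisphere water year spans 29 February 2016, which is why the valid range extends to 366.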

### Data structure

- **Spatial resolution**: 0.00072000072000072° (~80m at equator)
- **Spatial extent**: Global (-180° to 180°, -60° to 81.1°)
- **Temporal coverage**: Water years 2015-2024
- **Chunking strategy**: 2048×2048 pixel spatial chunks, with one water year per chunk for the yearly variables
- **Compression**: Blosc with the zstd compressor
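
As a quick sanity check on the chunking numbers, the arithmetic below derives the chunk counts and uncompressed sizes from the grid shape. The shape `(195970, 499998)` is taken from the dataset repr printed later in this notebook; the constant names are only for illustration.

```python
import math

SHAPE_YX = (195970, 499998)  # (latitude, longitude) pixels, from the dataset repr
CHUNK = 2048                 # spatial_chunk_dim_zarr_output in the config
N_WATER_YEARS = 10           # water years 2015-2024
INT16_BYTES = 2

# Edge chunks are partial, so round up in each dimension
chunks_y = math.ceil(SHAPE_YX[0] / CHUNK)
chunks_x = math.ceil(SHAPE_YX[1] / CHUNK)
chunks_per_layer = chunks_y * chunks_x
chunks_per_yearly_variable = chunks_per_layer * N_WATER_YEARS

# Uncompressed int16 sizes of one 2D layer and of one full chunk
layer_gib = SHAPE_YX[0] * SHAPE_YX[1] * INT16_BYTES / 2**30
chunk_mib = CHUNK * CHUNK * INT16_BYTES / 2**20

print(chunks_y, chunks_x)            # 96 245
print(chunks_per_layer)              # 23520
print(chunks_per_yearly_variable)    # 235200
print(round(layer_gib, 2), chunk_mib)  # 182.51 8.0
```

These values agree with the chunk counts (23520 and 235200) and the 182.51 GiB per-layer size shown in the notebook outputs. The 16 MiB and 32 MiB chunk sizes reported when the store is reopened reflect decoding to float32 and float64 rather than the stored int16.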

### Coordinate encoding

The coordinate arrays are regularly spaced, so the store encodes them with a chain of codecs:
- **FixedScaleOffset**: Converts floating-point coordinates to integer grid indices
- **Delta encoding**: Stores differences between consecutive values, which are all 1 on a regular grid
- **Blosc compression**: Compresses the resulting run of identical values

This approach achieves significant storage savings while preserving the coordinate values to within floating-point tolerance. It was heavily inspired by the [serverless-datacube-demo GitHub repo](https://github.com/earth-mover/serverless-datacube-demo) and the [accompanying Earthmover blog post](https://earthmover.io/blog/serverless-datacube-pipeline) by Ryan Abernathey.
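
The idea behind this codec chain can be sketched with plain NumPy. The example below is not the notebook's implementation: it hand-rolls the scale-offset and delta steps that the `FixedScaleOffset` and `Delta` codecs perform, and it uses `zlib` from the standard library as a stand-in for Blosc/zstd. The 100,000-point longitude axis and the function names are hypothetical.

```python
import zlib

import numpy as np


def encode_regular_coords(values, dx):
    """Scale-offset, then delta-encode a regularly spaced coordinate axis."""
    # Scale-offset step: (value - first) / dx is the integer grid index 0, 1, 2, ...
    index = np.round((values - values[0]) / dx).astype("i8")
    # Delta step: keep the first index, then store differences (all 1 on a regular grid)
    delta = np.empty(index.shape, dtype="i2")
    delta[0] = index[0]
    delta[1:] = np.diff(index)
    return delta


def decode_regular_coords(delta, first, dx):
    """Invert both steps: cumulative sum, then rescale and re-offset."""
    index = np.cumsum(delta.astype("i8"))
    return index * dx + first


dx = 0.00072000072000072
lon = -179.999 + dx * np.arange(100_000)  # hypothetical regularly spaced axis

delta = encode_regular_coords(lon, dx)
restored = decode_regular_coords(delta, lon[0], dx)

# Compare how well the raw floats and the encoded deltas compress
raw_size = len(zlib.compress(lon.tobytes()))
encoded_size = len(zlib.compress(delta.tobytes()))
print(f"raw: {raw_size} bytes, encoded: {encoded_size} bytes")
print("round trip within tolerance:", np.allclose(restored, lon))
```

Because the encoded array is a long run of identical small integers, a general-purpose compressor shrinks it far more than the raw float64 values. The round trip is checked with a tolerance rather than exact equality, for the same reason the notebook uses `assert_allclose`.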

In [1]:
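import numpy as np
import zarr
import odc.stac
import odc.geo.xr  # provides odc.geo.xr.wrap_xr, used when building the dataset
import xarray as xr
import adlfs
import pathlib
import configparser
import dask
import dask.array  # dask.array.full is not guaranteed to be loaded by `import dask`
from global_snowmelt_runoff_onset.config import Config, Tile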

In [2]:
config = Config('../config/global_config_v9.txt')

Configuration loaded:
resolution = 0.00072000072000072
bands = vv
mountain_snow_only = False
spatial_chunk_dim_s1_read = 2048
spatial_chunk_dim_s1_process = 512
spatial_chunk_dim_zarr_output = 2048
bbox_left = -179.999
bbox_right = 179.999
bbox_top = 81.099
bbox_bottom = -59.999
wy_start = 2015
wy_end = 2024
low_backscatter_threshold = 0.001
min_monthly_acquisitions = 1
max_allowed_days_gap_per_orbit = 30
min_years_for_median_std = 3
extend_search_window_beyond_sdd_days = 16
min_consec_snow_days_for_seasonal_snow = 56
valid_tiles_geojson_path = ../processing/tile_data/global_tiles_with_seasonal_snow.geojson
tile_results_path = ../processing/tile_data/tile_results_v9.csv
global_runoff_zarr_store_azure_path = snowmelt/snowmelt_runoff_onset/global_v9.zarr
seasonal_snow_mask_zarr_store_azure_path = snowmelt/snow_cover/global_modis_snow_cover_reprojected.zarr
seasonal_snow_mask_reproject_method = precomputed


## Build global dataset 

The `build_global_runoff_onset_dataset` function creates the structure of the global dataset. It builds lazy, nodata-filled arrays for each output variable, attaches descriptive metadata, and returns the per-variable encoding (data type, fill value, scale factor, compressor, and chunk sizes) used when writing to zarr.

### Data types and fill values
- All variables are stored as `int16`, with -9999 as the nodata value
- `runoff_onset` and `runoff_onset_median` are in units of DOWY, with valid values 1-366
- `runoff_onset_mad`, `temporal_resolution`, and `temporal_resolution_median` are in units of days and are stored with `scale_factor=0.1`, giving 0.1-day precision
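
The scaled `int16` storage can be illustrated with a small pack and unpack round trip. This is a sketch of the CF-style convention (`decoded = stored * scale_factor + add_offset`, with `add_offset = 0` here), not the code xarray runs internally; the helper names `pack_days` and `unpack_days` are hypothetical.

```python
import numpy as np

NODATA = np.int16(-9999)
SCALE_FACTOR = 0.1  # days per stored integer unit


def pack_days(values_days):
    """Convert float days (NaN = missing) to scaled int16 with a nodata sentinel."""
    values_days = np.asarray(values_days, dtype=np.float64)
    scaled = np.round(values_days / SCALE_FACTOR)
    # Replace NaN with the sentinel before casting, since NaN has no integer form
    return np.where(np.isnan(values_days), NODATA, scaled).astype(np.int16)


def unpack_days(stored):
    """Convert scaled int16 back to float32 days, restoring NaN for nodata."""
    stored = np.asarray(stored, dtype=np.int16)
    decoded = stored.astype(np.float32) * np.float32(SCALE_FACTOR)
    decoded[stored == NODATA] = np.nan
    return decoded


packed = pack_days([3.5, 12.0, np.nan])
decoded = unpack_days(packed)
print(packed)   # [   35   120 -9999]
print(decoded.dtype)
```

With a scale factor of 0.1, `int16` can represent values up to 3276.7 days, which comfortably covers quantities bounded by the length of a water year.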

In [3]:
def build_global_runoff_onset_dataset(global_geobox, water_years, chunk_size):

    nodata_int16 = np.int16(-9999)
    # All variables are stored as int16 with -9999 as nodata.
    # MAD and temporal resolution use scale_factor=0.1, so 0.1-day precision fits in int16.

    def empty_layer(name, per_water_year=False):
        # Lazy, nodata-filled array on the global grid
        data = dask.array.full(shape=global_geobox.shape.yx, fill_value=nodata_int16,
                               dtype=np.int16, chunks=-1)
        layer = odc.geo.xr.wrap_xr(data, global_geobox)
        if per_water_year:
            layer = layer.expand_dims({'water_year': water_years})
        return layer.rename(name)

    global_ds = xr.combine_by_coords([
        empty_layer('runoff_onset', per_water_year=True),
        empty_layer('runoff_onset_median'),
        empty_layer('runoff_onset_mad'),
        empty_layer('temporal_resolution', per_water_year=True),
        empty_layer('temporal_resolution_median'),
    ])



    global_ds.water_year.attrs['description'] = ("Water year. In northern hemisphere, water year starts on October 1st "
                                    "and ends on September 30th. For the southern hemisphere, water year "
                                    "starts on April 1st and ends on March 31st. e.g. in NH WY 2015 is "
                                    "[2014-10-01,2015-09-30] and in SH WY 2015 is [2015-04-01,2016-03-31].")

    global_ds.runoff_onset.attrs['description'] = "Estimated day of water year of snowmelt runoff onset for a given water year [unit=DOWY]."
    global_ds.runoff_onset_median.attrs['description'] = "Median estimated day of water year of snowmelt runoff onset for all water years [unit=DOWY]."
    global_ds.runoff_onset_mad.attrs['description'] = "Median absolute deviation of snowmelt runoff onset for all water years [unit=days]."
    global_ds.temporal_resolution_median.attrs['description'] = "Median temporal resolution for all water years [unit=days]."
    global_ds.temporal_resolution.attrs['description'] = "Temporal resolution of runoff onset for a given water year [unit=days]."


    global_ds.attrs['processed_tiles'] = []

    #from https://github.com/earth-mover/serverless-datacube-demo/blob/main/src/lib.py
    def optimize_coord_encoding(values, dx):
        dx_all = np.diff(values)
        # dx = dx_all[0]
        np.testing.assert_allclose(dx_all, dx, err_msg="must be regularly spaced")

        offset_codec = zarr.FixedScaleOffset(
            offset=values[0], scale=1 / dx, dtype=values.dtype, astype="i8"
        )
        delta_codec = zarr.Delta("i8", "i2")
        compressor = zarr.Blosc(cname="zstd")

        enc0 = offset_codec.encode(values)
        # consecutive encoded values should now differ by exactly 1
        np.testing.assert_equal(np.unique(np.diff(enc0)), [1])
        enc1 = delta_codec.encode(enc0)
        # a run of identical deltas should now compress extremely well
        enc2 = compressor.encode(enc1)
        decoded = offset_codec.decode(delta_codec.decode(compressor.decode(enc2)))

        # will produce numerical precision differences
        # np.testing.assert_equal(values, decoded)
        np.testing.assert_allclose(values, decoded)

        return {"compressor": compressor, "filters": (offset_codec, delta_codec)}

    lon_encoding = optimize_coord_encoding(global_ds.longitude.values, global_geobox.resolution.x)
    lat_encoding = optimize_coord_encoding(global_ds.latitude.values, global_geobox.resolution.y)

    encoding = {
        "longitude": {"chunks": global_ds.longitude.shape, **lon_encoding},
        "latitude": {"chunks": global_ds.latitude.shape, **lat_encoding},
        "water_year": {
            "chunks": global_ds.water_year.shape,
            "compressor": zarr.Blosc(cname="zstd"),
        },
        "runoff_onset": {
            "chunks": (1,) + chunk_size,
            "compressor": zarr.Blosc(cname="zstd"),
            "_FillValue": nodata_int16,
            "dtype": "int16",
        },
        "runoff_onset_median": {
            "chunks": chunk_size,
            "compressor": zarr.Blosc(cname="zstd"),
            "_FillValue": nodata_int16,
            "dtype": "int16",
        },
        "runoff_onset_mad": {
            "chunks": chunk_size,
            "compressor": zarr.Blosc(cname="zstd"),
            "_FillValue": nodata_int16,
            "dtype": "int16",
            "scale_factor": np.float32(0.1),
            "add_offset": np.float32(0.0),
        },
        "temporal_resolution_median": {
            "chunks": chunk_size,
            "compressor": zarr.Blosc(cname="zstd"),
            "_FillValue": nodata_int16,
            "dtype": "int16",
            "scale_factor": np.float32(0.1),
            "add_offset": np.float32(0.0),
        },
        "temporal_resolution": {
            "chunks": (1,) + chunk_size,
            "compressor": zarr.Blosc(cname="zstd"),
            "_FillValue": nodata_int16,
            "dtype": "int16",
            "scale_factor": np.float32(0.1),
            "add_offset": np.float32(0.0),
        },
    }

    for var in global_ds.data_vars:
        global_ds[str(var)].attrs['grid_mapping'] = 'spatial_ref' 

    return global_ds, encoding

## Initialize global dataset

Create the global dataset structure using the configuration parameters from the config file, passing in the global geobox scheme, water years of interest, and chunk sizes.

In [4]:
global_ds, encoding = build_global_runoff_onset_dataset(config.global_geobox, config.water_years, config.spatial_chunk_dims_zarr)
global_ds

| Variables | Shape | Chunk shape | Array size | Chunk size | Dask graph | Data type |
|---|---|---|---|---|---|---|
| runoff_onset, temporal_resolution | (10, 195970, 499998) | (10, 195970, 499998) | 1.78 TiB | 1.78 TiB | 1 chunk in 2 graph layers | int16 |
| runoff_onset_median, runoff_onset_mad, temporal_resolution_median | (195970, 499998) | (195970, 499998) | 182.51 GiB | 182.51 GiB | 1 chunk in 1 graph layer | int16 |


## Write to zarr store

Write the empty dataset structure to Azure Blob Storage. This creates:
- Zarr metadata files (.zmetadata, .zattrs, .zarray)
- Directory structure for chunked data
- Coordinate encoding specifications

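Two parameters keep this step lightweight. `compute=False` writes the metadata eagerly but defers the array data, returning a Dask `Delayed` object instead. `write_empty_chunks=False` tells zarr to skip any chunk that contains only the fill value, so data chunks are created only as tiles are processed.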

In [5]:
global_ds.to_zarr(
    store=config.global_runoff_store,
    encoding=encoding,
    compute=False,  # write metadata only; returns a dask Delayed
    write_empty_chunks=False,  # skip chunks that contain only the fill value
    mode='w',
    # zarr_version=3,
)

Delayed('_finalize_store-adc6e85b-7bed-411e-8b74-ff7725ca6ab7')

## Verify empty zarr creation

Load the newly created zarr store to verify:
- Correct dimensions and coordinates
- Proper data types and fill values
- Chunking configuration
- Metadata attributes

In [6]:
# NOTE: known dtype issue with scaled variables.
# Although scale_factor and add_offset are given as np.float32 in the encoding,
# xarray decodes runoff_onset_mad, temporal_resolution, and temporal_resolution_median
# as float64 when the store is opened. This stems from xarray's CF decoding
# (see GitHub issue #9041): the float32 value is preserved in the zarr metadata,
# but it is read back as a Python builtin float, which xarray treats as float64.
# Possible workarounds are casting after load, or opening with mask_and_scale=False
# and scaling manually. For now the data are functionally correct but use more
# memory than intended.
# https://docs.xarray.dev/en/stable/user-guide/io.html#scaling-and-type-conversions
# https://github.com/pydata/xarray/pull/8946
# https://github.com/pydata/xarray/issues/9041
# To inspect after opening: global_zarr_ds.runoff_onset_mad.encoding

global_zarr_ds = xr.open_zarr(
    config.global_runoff_store,
    consolidated=True,
    decode_coords="all",
    mask_and_scale=True,  # False would skip masking and scale/offset decoding and return the raw int16 values
)
global_zarr_ds

| Variable | Shape | Chunk shape | Array size | Chunk size | Dask graph | Data type |
|---|---|---|---|---|---|---|
| runoff_onset | (10, 195970, 499998) | (1, 2048, 2048) | 3.56 TiB | 16.00 MiB | 235200 chunks in 2 graph layers | float32 |
| runoff_onset_mad | (195970, 499998) | (2048, 2048) | 730.04 GiB | 32.00 MiB | 23520 chunks in 2 graph layers | float64 |
| runoff_onset_median | (195970, 499998) | (2048, 2048) | 365.02 GiB | 16.00 MiB | 23520 chunks in 2 graph layers | float32 |
| temporal_resolution | (10, 195970, 499998) | (1, 2048, 2048) | 7.13 TiB | 32.00 MiB | 235200 chunks in 2 graph layers | float64 |
| temporal_resolution_median | (195970, 499998) | (2048, 2048) | 730.04 GiB | 32.00 MiB | 23520 chunks in 2 graph layers | float64 |


In [7]:
for var in global_zarr_ds.data_vars:
    print(f'Encoding for {var}:')
    print(global_zarr_ds[var].encoding)

Encoding for runoff_onset:
{'chunks': (1, 2048, 2048), 'preferred_chunks': {'water_year': 1, 'latitude': 2048, 'longitude': 2048}, 'compressor': Blosc(cname='zstd', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-9999), 'dtype': dtype('int16'), 'grid_mapping': 'spatial_ref'}
Encoding for runoff_onset_mad:
{'chunks': (2048, 2048), 'preferred_chunks': {'latitude': 2048, 'longitude': 2048}, 'compressor': Blosc(cname='zstd', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-9999), 'scale_factor': 0.10000000149011612, 'add_offset': 0.0, 'dtype': dtype('int16'), 'grid_mapping': 'spatial_ref'}
Encoding for runoff_onset_median:
{'chunks': (2048, 2048), 'preferred_chunks': {'latitude': 2048, 'longitude': 2048}, 'compressor': Blosc(cname='zstd', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': np.int16(-9999), 'dtype': dtype('int16'), 'grid_mapping': 'spatial_ref'}
Encoding for temporal_resolution:
{'chunks'

## Next steps

With the zarr store initialized, tiles can be processed and written to it in `process_tiles.ipynb`. Because each chunk is stored as a separate object, multiple workers can write concurrently as long as they write to different chunks, which enables parallel processing of the global dataset.
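
Concurrent writers are only safe if no two of them write to the same chunk. A rough way to check this is to compute which chunks a pixel window overlaps. The sketch below assumes the 2048-pixel chunk size from the config; the function name and the pixel windows are hypothetical and do not reflect the project's actual tile layout.

```python
CHUNK = 2048  # spatial_chunk_dim_zarr_output in the config


def chunks_touched(row_start, row_stop, col_start, col_stop, chunk=CHUNK):
    """Return the set of (chunk_row, chunk_col) indices a pixel window overlaps.

    Stops are exclusive, matching Python slice conventions.
    """
    rows = range(row_start // chunk, (row_stop - 1) // chunk + 1)
    cols = range(col_start // chunk, (col_stop - 1) // chunk + 1)
    return {(r, c) for r in rows for c in cols}


# Two chunk-aligned, adjacent windows and one misaligned window
tile_a = chunks_touched(0, 4096, 0, 4096)
tile_b = chunks_touched(0, 4096, 4096, 8192)
tile_c = chunks_touched(0, 4096, 4000, 8192)

print(tile_a.isdisjoint(tile_b))  # True: safe to write concurrently
print(tile_a.isdisjoint(tile_c))  # False: both windows touch chunk column 1
```

Windows whose edges fall on multiples of the chunk size never share a chunk with their neighbors, which is one reason to align processing tiles with the zarr chunk grid.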