Skip to content

Latest commit

 

History

History
667 lines (595 loc) · 21.3 KB

ZEP0006.md

File metadata and controls

667 lines (595 loc) · 21.3 KB
layout title description parent nav_order
default
ZEP0006
Zarr Object Models (ZOMs)
draft ZEPs
6

ZEP 6 - A Zarr Object Model

Authors:

Status: Draft

Type: Specification

Created: 2023-07-20

Abstract

This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies.

A ZOM defines two types of nodes: arrays and groups, which schematize Zarr arrays and Zarr groups, respectively. ZOM arrays and groups both have an attributes property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. ZOM groups have a property called members, which is an object with string keys and values that are either arrays or groups; thus, the ZOM can represent an arbitrarily structured tree of arrays and groups.

These definitions are designed to be abstract enough to be implemented by a range of programming languages, and expressed in a wide range of interchange formats. Applications can use the ZOM for a particular Zarr version to implement declarative APIs for accessing structured Zarr hierarchies.

Motivation and Scope

The Zarr specifications define models for arrays and groups, but the Zarr specifications do not define models for hierarchies of arrays and groups. This is unfortunate, because many applications using Zarr operate on the level of structured hierarchies rather than individual groups or arrays.

For example, the python library xarray defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata (a key called "_ARRAY_DIMENSIONS", which must have a list of strings as its value). The full specification of this format can be found here. For xarray (or any other application that works with structured hierarchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, xarray must first check if the potential source of data is an xarray-compliant Zarr hierarchy.

These actions access Zarr hierarchies, not individual arrays or groups. Accessing Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that creates a Zarr hiearchiy consistent with that definition. The data models defined in the Zarr specifcations are only sufficient to design procedural Zarr hierarchy manipulation. This is a level of abstraction too low -- tasks like checking if an arbitrary Zarr group is compatible with the xarray format would easier with a declarative API. But a declarative hierarchy API requires defining data models for Zarr hierarchies. This is the central goal of this document.

Definition of hierarchy structure

This document distinguishes the structure of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, and the metadata of those arrays and groups. This definition omits the data stored in the arrays, and the particular storage backend used to store data and metadata. By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values, and / or the hierarchies are stored using different storage backends.

Specification of the base Zarr Object Model

We begin with a definition of a "base" Zarr Object Model. On its own, the base ZOM is not useful for working with actual Zarr hierarchies, because it contains a reference to an unspecified Zarr version. By supplying definitions from a particular Zarr version, we can specialize the base ZOM and produce an object that can be used for doing actual work.

A node is an object with a property called attributes, which is a key-value data structure that contains content described as "arbitrary user metadata" in Zarr specifications. As of Zarr versions 2 and 3, attributes must be a JSON-serializable object with string keys.

The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects represented by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups".

ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest way possible that still conforms to the definition of "node" given above. Thus, a ZOM array is a node with properties identical to those defined in a particular specification of Zarr array metadata, unless one of those Zarr array properties contains user metadata, in which case a ZOM array does not include that property (since user metadata is already represented by the attributes property of the array). This definition is parametric with respect to a particular Zarr specification in order to accomodate future versions of Zarr that may add new properties to Zarr arrays.

Similarly, a ZOM group is a node with properties identical to those defined in a specification of Zarr group metadata, unless one of those properties contains user metadata, in which case a ZOM group does not contain that property, for the same reason given above for arrays. Beyond the properties of Zarr groups defined in a particular Zarr specification, a ZOM group has an additional property:

  • members: a key-value data structure where the keys are the subset of strings that are permitted node names according to a particular Zarr specification, and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them.

If future versions of Zarr use a property called members for some element of Zarr group metadata, then there would be a naming collision between the members property of a Zarr group and the members property of a ZOM group. In this case, the ZOM group would rename the Zarr group's members property to _members, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that members and _members are both listed in Zarr group metadata, then the schema group representation would map the members property of the Zarr group to a property called __members.

Thus, ZOM groups and ZOM arrays can represent the structure of a Zarr hierarchy, per the description given in [#definition-of-hierarchy-structure].

Zarr Object Models in JSON

The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object.

See below an example of a Zarr version 3 ZOM group representing a Zarr group that contains a single Zarr array. Both the Zarr group and the Zarr array contain user metadata.

{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {
    "foo": 42,
    "bar": false
  },
  "members": {
    "array": {
      "zarr_format": 3,
      "node_type": "array",
      "attributes": {
        "baz": [
          1,
          2,
          3
        ]
      },
      "shape": [
        1000,
        1000
      ],
      "data_type": "|u1",
      "chunk_grid": {
        "name": "regular",
        "configuration": {
          "chunk_shape": [
            1000,
            100
          ]
        }
      },
      "chunk_key_encoding": {
        "name": "default",
        "configuration": {
          "separator": "/"
        }
      },
      "fill_value": 0,
      "codecs": [
        {
          "name": "GZip",
          "configuration": {
            "level": 1
          }
        }
      ],
      "storage_transformers": null,
      "dimension_names": [
        "rows",
        "columns"
      ]
    }
  }
}

A similar hierarchy defined in Zarr V2 can also be represented as a ZOM group:

{
  "zarr_format" : 2,
  "attributes": {
    "foo" : 42, 
    "bar" : false
    },
  "members": {
    "foo": {
      "zarr_format" : 2,
      "shape" : [1000, 1000],
      "chunks": [100, 100],
      "dtype": "|u1",
      "compressor": null,
      "fill_value": 0,
      "order": "C",
      "dimension_separator":  "/",
      "filters": {},
      "attributes" : {
        "baz": true
        }
    }
  }
}

Zarr Object Models in JSON schema

A ZOM can also be represented as a JSON schema. Here is a the ZOM for Zarr V3 expressed as a JSON schema:

{
  "$defs": {
    "ArraySpec": {
      "additionalProperties": false,
      "description": "A model of a Zarr version 3 Array",
      "properties": {
        "zarr_format": {
          "const": 3,
          "default": 3,
          "title": "Zarr Format"
        },
        "node_type": {
          "const": "array",
          "default": "array",
          "title": "Node Type"
        },
        "attributes": {
          "default": {},
          "title": "Attributes",
          "type": "object"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "data_type": {
          "title": "Data Type",
          "type": "string"
        },
        "chunk_grid": {
          "$ref": "#/$defs/NamedConfig"
        },
        "chunk_key_encoding": {
          "$ref": "#/$defs/NamedConfig"
        },
        "fill_value": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "integer"
            },
            {
              "type": "number"
            },
            {
              "const": "Infinity"
            },
            {
              "const": "-Infinity"
            },
            {
              "const": "NaN"
            },
            {
              "type": "string"
            },
            {
              "maxItems": 2,
              "minItems": 2,
              "prefixItems": [
                {
                  "type": "number"
                },
                {
                  "type": "number"
                }
              ],
              "type": "array"
            },
            {
              "items": {
                "type": "integer"
              },
              "type": "array"
            }
          ],
          "title": "Fill Value"
        },
        "codecs": {
          "items": {
            "$ref": "#/$defs/NamedConfig"
          },
          "title": "Codecs",
          "type": "array"
        },
        "storage_transformers": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/NamedConfig"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Storage Transformers"
        },
        "dimension_names": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "title": "Dimension Names"
        }
      },
      "required": [
        "shape",
        "data_type",
        "chunk_grid",
        "chunk_key_encoding",
        "fill_value",
        "codecs"
      ],
      "title": "ArraySpec",
      "type": "object"
    },
    "GroupSpec": {
      "additionalProperties": false,
      "properties": {
        "zarr_format": {
          "const": 3,
          "default": 3,
          "title": "Zarr Format"
        },
        "node_type": {
          "const": "group",
          "default": "group",
          "title": "Node Type"
        },
        "attributes": {
          "title": "Attributes",
          "type": "object"
        },
        "members": {
          "additionalProperties": {
            "anyOf": [
              {
                "$ref": "#/$defs/GroupSpec"
              },
              {
                "$ref": "#/$defs/ArraySpec"
              }
            ]
          },
          "default": {},
          "title": "Members",
          "type": "object"
        }
      },
      "required": [
        "attributes",
        "members"
      ],
      "title": "GroupSpec",
      "type": "object"
    },
    "NamedConfig": {
      "additionalProperties": false,
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "configuration": {
          "anyOf": [
            {
              "type": "object"
            },
            {
              "type": "null"
            }
          ],
          "title": "Configuration"
        }
      },
      "required": [
        "name",
        "configuration"
      ],
      "title": "NamedConfig",
      "type": "object"
    }
  },
  "allOf": [
    {
      "$ref": "#/$defs/GroupSpec"
    }
  ]
}

And likewise for Zarr V2:

{
  "$ref": "#/definitions/Group",
  "definitions": {
    "Array": {
      "title": "Array",
      "description": "Model of a Zarr Version 2 Array",
      "type": "object",
      "properties": {
        "attributes": {
          "title": "Attributess",
          "type": "object"
        },
        "shape": {
          "title": "Shape",
          "type": "array",
          "items": {
            "type": "integer"
          }
        },
        "chunks": {
          "title": "Chunks",
          "type": "array",
          "items": {
            "type": "integer"
          }
        },
        "dtype": {
          "title": "Dtype",
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          ]
        },
        "compressor": {
          "title": "Compressor",
          "type": "object"
        },
        "fill_value": {
          "title": "Fill Value"
        },
        "order": {
          "title": "Order",
          "enum": [
            "C",
            "F"
          ],
          "type": "string"
        },
        "filters": {
          "title": "Filters",
          "type": "array",
          "items": {
            "type": "object"
          }
        },
        "dimension_separator": {
          "title": "Dimension Separator",
          "enum": [
            ".",
            "/"
          ],
          "type": "string"
        },
        "zarr_version": {
          "title": "Zarr Version",
          "default": 2,
          "type": "integer"
        }
      },
      "required": [
        "attributess",
        "shape",
        "chunks",
        "dtype",
        "compressor",
        "order",
        "filters"
      ],
      "additionalProperties": false
    },
    "Group": {
      "title": "Group",
      "description": "Model of a Zarr Version 2 Group",
      "type": "object",
      "properties": {
        "attributes": {
          "title": "Attributes",
          "type": "object"
        },
        "members": {
          "title": "Members",
          "type": "object",
          "additionalProperties": {
            "anyOf": [
              {
                "$ref": "#/definitions/Array"
              },
              {
                "$ref": "#/definitions/Group"
              }
            ]
          }
        },
        "zarr_version": {
          "title": "Zarr Version",
          "default": 2,
          "type": "integer"
        }
      },
      "required": [
        "attributes",
        "members"
      ],
      "additionalProperties": false
    }
  }
}

To facilitate adoption of new Zarr versions, it may be desirable to define a mapping from ZOM to ZOM, e.g. ZOM for Zarr v2 -> ZOM for Zarr v3. Programs could use such mappings to execute programmatic conversions of hierarchies to newer Zarr versions.

Implementing consolidated metadata via Zarr Object Models

The time required to traverse large, deeply-nested Zarr hierarchies stored on high-latency backends (e.g., cloud storage) can be onerous for applications that consume Zarr containers. One solution to this problem is to consolidate the metadata for each node or group in the hierarchy into a document stored at the root of the hierarchy.

Validating Zarr Object Models

A key motivator for developing the ZOM was the need to validate Zarr hierarchies. Communities that use Zarr often define conventions for storing their particular data models in Zarr hierarchies, and often these conventions could be statically checked if there was a data model for the hierarchy that was amenable to static type checking. The ZOM satisfies this constraint.

For example, consider the xarray model introduced earlier, which requires a key ("_ARRAY_DIMENSIONS") to be present in array metadata. If we define the ZOM for Zarr V2 as dataclasses, we can use the python type-checking tool mypy to statically check that a Zarr hierarchy complies with the xarray convention:

from dataclasses import dataclass
from typing import Generic, TypeVar, Mapping, List, Any, Union

TAttrs = TypeVar('TAttrs')
TMember = TypeVar('TMember')

@dataclass
class GroupSpec(Generic[TAttrs, TMember]):
  zarr_version = 2
  attributes: TAttrs
  members: Mapping[str, TMember]

@dataclass
class ArraySpec(Generic[TAttrs]):
  zarr_version = 2
  attributes: TAttrs
  shape: List[int]
  dtype: str
  chunks: List[int]
  compressor: Any
  dimension_separator: Union[Literal["."], Literal["/"]]

@dataclass
class XAttrs:
  _ARRAY_DIMENSIONS: List[str]

XArraySpec = ArraySpec[XAttrs]
XGroupSpec = GroupSpec[Any, Union[XArray, GroupSpec]]

# this passes the type checker
valid = XGroupSpec(
    attributes={}, 
    members={
        'array_0': XArraySpec(
        shape=[10,10],
        attributes=XAttrs(_ARRAY_DIMENSIONS= ['a','b']),
        dtype='uint8',
        chunks=[10,10],
        compressor=None,
        dimension_separator='.')},
        )

# This fails type checking because of the mssing _ARRAY_DIMENSIONS array attribute
invalid = XGroupSpec(
    attributes={}, 
    members={
        'array_0': XArraySpec(
        shape=[10,10],
        attributes={'foo': 10},
        dtype='uint8',
        chunks=[10,10],
        compressor=None,
        dimension_separator='/')},
        )
"""
Argument "attributes" to "ArraySpec" has incompatible type "dict[str, int]"; expected "XAttrs"
"""

The same result is possible in TypeScript:

type GroupSpec<TAttr, TMember> = {
  zarr_version: 2
  attributes: TAttr
  members: {[key: string]: TMember}
}

type ArraySpec<TAttr> = {
  zarr_version: 2
  attributes: TAttr
  shape: number[]
  chunks: number[]
  dtype: string
  compressor: any
  dimension_separator: "." | "/"
}

type XAttrs = {
  _ARRAY_DIMENSIONS: string[]
}

type XArraySpec = ArraySpec<XAttrs>
type XGroupSpec = GroupSpec<any, XArraySpec>

const valid: XGroupSpec = {
  zarr_version: 2,
  attributes: {},
  members: {
    'array_0': {
      zarr_version: 2,
      dtype: 'uint8',
      attributes: {'_ARRAY_DIMENSIONS': ['a', 'b']},
      shape: [10, 10],
      chunks: [10, 10],
      compressor: undefined,
      dimension_separator: "/"
      }
    }
  }

// This fails type checking because of the mssing _ARRAY_DIMENSIONS array attribute
  const invalid: XGroupSpec = {
  zarr_version: 2,
  attributes: {},
  members: {
    'array_0': {
      zarr_version: 2,
      dtype: 'uint8',
      attributes: {'foo': ['a', 'b']},
      shape: [10, 10],
      chunks: [10, 10],
      compressor: undefined,
      dimension_separator: "/"
      }
    }
  }
/*
Type '{ foo: string[]; }' is not assignable to type 'XAttrs'.
  Object literal may only specify known properties, and ''foo'' does not exist in type 'XAttrs'.(2322)
*/

Static type checking on ZOM data structures offers an additional level of safety for applications that manipulate structured Zarr hierarchies, but not every invariant of a structured hierarchy can be expressed statically -- consider the constraint "the length of the _ARRAY_DIMENSIONS attribute must match the length of the shape attribute". This is not statically checkable, because the shape of an array may not be known before runtime. Such value-dependent validation can be added by runtime type checkers like pydantic for python, or zod for TypeScript.

Related Work

Implementation

  • pydantic zarr
  • ?

Discussion

References and Footnotes

License

CC0
To the extent possible under law, the authors have waived all copyright and related or neighboring rights to ZEP 1.