The generator is an xcube feature which allows users to create, manipulate, and write xcube datasets according to a supplied configuration. The same configuration can be used to generate a dataset on the user's local computer or remotely, using an online server.
The generator offers two main user interfaces: a Python API, configured using Python objects, and a command-line interface, configured using YAML or JSON files. The Python and file-based configurations have the same structure and are interconvertible.
The online generator service interfaces with the xcube client via a well-defined REST API; it is also possible for third-party clients to make use of this API directly, though it is expected that the Python and command-line interfaces will be more convenient in most cases.
This document aims to provide a brief overview of the generation process and the available configuration options. More details are available in other documents and in the code itself:
- Probably the most thorough documentation is available in the Jupyter demo notebooks in the xcube repository. These can be run in any JupyterLab environment containing an xcube installation. They combine explanation with interactive worked examples to demonstrate practical usage of the generator in typical use cases.
- For the Python API in particular, the xcube API documentation is generated from the docstrings included in the code itself and serves as a detailed low-level reference for individual Python classes and methods. The docstrings can also be read from a Python environment (e.g. using the `?` postfix in IPython or JupyterLab) or, of course, by browsing the source code itself.
- For the YAML/JSON configuration syntax used with the command-line interface, there are several examples available in the `examples/gen2/configs` subdirectory of the xcube repository.
- For the REST API underlying the Python and command-line interfaces, there is a formal definition on SwaggerHub, and one of the example notebooks demonstrates its usage with the Python requests library.
The usual cube generation process is as follows:
- The generator opens the input data store using the store identifier and parameters in the input configuration.
- The generator reads from the input store the data specified in the cube configuration and uses them to create a data cube, often with additional manipulation steps such as resampling the data.
- If an optional code configuration has been given, the user-supplied code is run on the created data cube, potentially modifying it.
- The generator writes the generated cube to the data store specified in the output configuration.
The configurations for the various parts of the generator are used to initialize a `GeneratorRequest`, which is then passed to `xcube.core.gen2.generator.CubeGenerator.generate_cube`. The `generate_cube` method returns a cube reference which can be used to open the cube from the output data store.
The generator can also be directly invoked with a configuration file from a Python environment, using the `xcube.core.gen2.generator.CubeGenerator.from_file` method.
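Putting these pieces together, a local generation run might look like the following sketch in Python. The store and data identifiers are hypothetical, and the exact import locations and constructor signatures of `GeneratorRequest` and `CubeGenerator` should be verified against the API documentation:

```python
from xcube.core.gen2 import CubeConfig, GeneratorRequest, InputConfig, OutputConfig
from xcube.core.gen2.generator import CubeGenerator

# Assemble the configuration objects into a request; the configuration
# sections are described in detail below (all values are hypothetical).
request = GeneratorRequest(
    input_config=InputConfig(store_id="cds", data_id="reanalysis-era5-single-levels"),
    cube_config=CubeConfig(time_range=["2021-01-01", "2021-12-31"], time_period="1D"),
    output_config=OutputConfig(store_id="file", data_id="my-cube.zarr"),
)

# Generate the cube and obtain a reference for opening it later. How a
# CubeGenerator instance is constructed is an assumption in this sketch;
# consult the API documentation for the exact factory method.
generator = CubeGenerator()
cube_ref = generator.generate_cube(request)
```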
The generator can be invoked from the command line using the `xcube gen2` subcommand. (Note: the subcommand `xcube gen` invokes an earlier, deprecated generator feature which is not compatible with the generator framework described here.)
All Python configuration classes are defined in the `xcube.core.gen2` package, except for `CodeConfig`, which is in `xcube.core.byoa`.
The types in the parameter tables are given in an ad-hoc, semi-formal notation whose corresponding Python and JSON representations should be obvious. For the formal Python type definitions, see the signatures of the `__init__` methods of the configuration classes; for the formal JSON type definitions, see the JSON schemata (in JSON Schema format) produced by the `get_schema` methods of the configuration classes.
The command-line interface allows a service configuration for the remote generator service to be provided as a YAML or JSON file. This file defines the endpoint and access credentials for an online generator service. If it is provided, the specified remote service will be used to generate the cube. If it is omitted, the cube will be generated locally. The configuration file defines three values: `endpoint_url`, `client_id`, and `client_secret`. A typical service configuration YAML file might look as follows:

```yaml
endpoint_url: "https://xcube-gen.brockmann-consult.de/api/v2/"
client_id: "93da366d7c39517865e4f141ddf1dd81"
client_secret: "d2l0aG91dCByZXN0cmljdGlvbiwgaW5jbHVkaW5nIHd"
```
In the command-line interface, an additional YAML or JSON file containing one or more store configurations may be supplied. A store configuration encapsulates a data store ID and an associated set of store parameters, which can then be referenced via a store configuration identifier. This identifier can be used in the input configuration, as described below. A typical YAML store configuration might look as follows:
```yaml
sentinelhub_eu:
  title: SENTINEL Hub (Central Europe)
  description: Datasets from the SENTINEL Hub API deployment in Central Europe
  store_id: sentinelhub
  store_params:
    api_url: https://services.sentinel-hub.com
    client_id: myid123
    client_secret: 0c5892208a0a82f1599df026b5e19017

cds:
  title: C3S Climate Data Store (CDS)
  description: Selected datasets from the Copernicus CDS API
  store_id: cds
  store_params:
    normalize_names: true
    num_retries: 3

my_data_bucket:
  title: S3 output bucket
  description: An S3 bucket for output data sets
  store_id: s3
  store_params:
    root: cube-outputs
    storage_options:
      key: qwerty12345
      secret: 7ff889c0aea254d5e00440858289b85c
      client_kwargs:
        endpoint_url: https://my-endpoint.some-domain.org/
```
The input configuration defines the data store from which data for the cube are to be read, and any additional parameters which this data store requires.
The Python configuration object is `InputConfig`; the corresponding YAML configuration section is `input_configs`.
| Parameter | Required? | Type | Description |
|---|---|---|---|
| `store_id` | N | str | Identifier for the data store |
| `opener_id` | N | str | Identifier for the data opener |
| `data_id` | Y | str | Identifier for the dataset |
| `store_params` | N | `map(str→*)` | Parameters for the data store |
| `open_params` | N | `map(str→*)` | Parameters for the data opener |
`store_id` is a string identifier for a particular xcube data store, defined by the data store itself. If a store configuration file has been supplied (see above), a store configuration identifier can also be supplied here in place of a ‘plain’ store identifier. Store configuration identifiers must be prefixed by an `@` symbol. If a store configuration identifier is supplied in place of a store identifier, `store_params` values will be supplied from the predefined store configuration and can be omitted from the input configuration.
`data_id` is a string identifier for the dataset within a particular store.
The format and content of the `store_params` and `open_params` dictionaries are defined by the individual store or opener.
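For illustration, an input configuration referencing the `sentinelhub_eu` store configuration defined earlier might be constructed as follows via the Python API; the `data_id` value and the opener parameters are hypothetical and store-dependent:

```python
from xcube.core.gen2 import InputConfig

# Reference the predefined store configuration via the "@" prefix, so
# store_params can be omitted; "S2L2A" is a hypothetical dataset identifier.
input_config = InputConfig(
    store_id="@sentinelhub_eu",
    data_id="S2L2A",
    open_params={"tile_size": [1024, 1024]},  # opener-dependent, hypothetical
)
```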
The generator service does not yet provide a remote interface to list available data stores, datasets, and store parameters (i.e. allowed values for the parameters in the table above). In a local xcube Python environment, you can list the currently available store identifiers with the expression `list(map(lambda e: e.name, xcube.core.store.find_data_store_extensions()))`. You can create a local store object for an identifier `store_id` with `xcube.core.store.get_data_store_instance(store_id).store`. The store object provides the methods `list_data_ids`, `get_data_store_params_schema`, and `get_open_data_params_schema` to describe the allowed values for the corresponding parameters. Note that the available stores and datasets on a remote xcube generator server may not be the same as those available in your local xcube environment.
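The calls named above can be combined into a short exploration script. This sketch uses only the functions mentioned in the preceding paragraph; the store identifier is a hypothetical example:

```python
from xcube.core.store import find_data_store_extensions, get_data_store_instance

# List the identifiers of all data stores available in the local
# xcube environment.
store_ids = [ext.name for ext in find_data_store_extensions()]
print(store_ids)

# Instantiate one store (some stores also require store parameters,
# e.g. credentials) and query its datasets and parameter schemata.
store = get_data_store_instance("cds").store  # hypothetical store ID
print(list(store.list_data_ids()))
print(store.get_open_data_params_schema())
```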
This configuration element defines the characteristics of the cube that should be generated. The Python configuration class is called `CubeConfig`, and the YAML section `cube_config`. All parameters are optional and will be filled in with defaults if omitted; the default values are dependent on the data store and dataset.
| Parameter | Type | Units/Description |
|---|---|---|
| `variable_names` | [str, …] | Available variables are data store dependent. |
| `crs` | str | PROJ string, JSON string with PROJ parameters, CRS WKT string, or authority string |
| `bbox` | [float, float, float, float] | Bounding box (`min_x`, `min_y`, `max_x`, `max_y`); CRS-dependent, usually degrees |
| `spatial_res` | float or [float, float] | CRS-dependent, usually degrees |
| `tile_size` | int or [int, int] | pixels |
| `time_range` | str or [str, str] | ISO 8601 subset |
| `time_period` | str | integer + unit |
| `chunks` | map(str→null/int) | maps variable names to chunk sizes |
The `crs` parameter string is interpreted using [`CRS.from_string` in the pyproj package](https://pyproj4.github.io/pyproj/dev/api/crs/crs.html#pyproj.crs.CRS.from_string) and therefore accepts the same specifiers.
`time_range` specifies the start and end of the requested time range. Each of these can be specified either as a date in the format `YYYY-MM-DD` or as a date and time in the format `YYYY-MM-DD HH:MM:SS`. If the time is omitted, it is taken to be `00:00:00` (the start of the day) for the start specifier and `24:00:00` (the end of the day) for the end specifier. The end specifier may be omitted; in this case the current time is used.
`time_period` specifies the duration of a single time step in the requested cube, which determines the temporal resolution. It consists of an integer denoting the number of time units, followed by a single upper-case letter denoting the time unit. Valid time unit specifiers are D (day), W (week), M (month), and Y (year). Examples of `time_period` values: `1Y` (one year), `2M` (two months), `10D` (ten days).
The value of the `chunks` mapping determines how the generated data is chunked for storage. The chunking has no effect on the data itself, but can have a dramatic impact on data access speeds in different scenarios. The value of `chunks` is structured as a map from variable names (corresponding to those specified by the `variable_names` parameter) to chunk sizes.
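As an illustration, a cube configuration might be constructed as follows via the Python API; all values, including the variable names, are hypothetical and store-dependent:

```python
from xcube.core.gen2 import CubeConfig

cube_config = CubeConfig(
    variable_names=["analysed_sst", "sea_ice_fraction"],
    crs="EPSG:4326",                          # authority string
    bbox=[-10.0, 40.0, 10.0, 60.0],           # min_x, min_y, max_x, max_y (degrees)
    spatial_res=0.05,                         # degrees, since the CRS is geographic
    time_range=["2021-01-01", "2021-12-31"],  # start and end dates
    time_period="1D",                         # one-day time steps
    # Per the table above, chunks maps variable names to chunk sizes;
    # null/None leaves the chunking of that variable to the defaults.
    chunks={"analysed_sst": 2048, "sea_ice_fraction": None},
)
```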
The code configuration supports multiple ways to define a dataset processor – fundamentally, a Python function which takes a dataset and returns a processed version of the input dataset. Since the code configuration can work directly with instantiated Python objects (which can't be stored in a YAML file), there are some differences in code configuration between the Python API and the YAML format.
| Parameter | Type | Units/description |
|---|---|---|
| `_callable` † | Callable | Function to be called to process the datacube. Only available via Python API |
| `callable_ref` | str (non-empty) | A reference to a Python class or function, in the format `<module>:<function_or_class>` |
| `callable_params` | `map(str→*)` | Parameters to be passed to the specified callable |
| `inline_code` † | str (non-empty) | An inline snippet of Python code |
| `file_set` † | FileSet (Python) / map (YAML) | A bundle of Python modules or packages (see details below) |
| `install_required` | boolean | If set, indicates that `file_set` contains modules or packages to be installed. |
All parameters are optional (as is the entire code configuration itself). The three parameters marked † are mutually exclusive: at most one of them may be given.
`_callable` provides the dataset processor directly and is only available in the Python API. It must be either a function or a class (see the sketch after the list below).
- If a function, it takes a `Dataset` and optional additional named parameters, and returns a `Dataset`. Any additional parameters are supplied in the `callable_params` parameter of the code configuration.
- If a class, it must implement a method `process_dataset`, which is treated like the function described above, and may optionally implement a class method `get_process_params_schema`, which returns a `JsonObjectSchema` describing the additional parameters. For convenience and clarity, the class may extend the abstract base class `DatasetProcessor`, which declares both these methods.
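A minimal sketch of both forms follows. The dataset manipulation (scaling by a hypothetical `factor` parameter) is purely illustrative, and the import location of `JsonObjectSchema` is an assumption:

```python
import xarray as xr


def process_dataset(dataset: xr.Dataset, factor: float = 1.0) -> xr.Dataset:
    """Function form: takes a Dataset plus named parameters supplied via
    callable_params, and returns a new Dataset."""
    return dataset * factor


class MyProcessor:
    """Class form: implements process_dataset and, optionally,
    get_process_params_schema. It could instead extend the abstract
    base class DatasetProcessor mentioned above."""

    def process_dataset(self, dataset: xr.Dataset,
                        factor: float = 1.0) -> xr.Dataset:
        return dataset * factor

    @classmethod
    def get_process_params_schema(cls):
        # Describes the named parameters accepted by process_dataset;
        # the import path here is an assumption.
        from xcube.util.jsonschema import JsonNumberSchema, JsonObjectSchema
        return JsonObjectSchema(properties=dict(factor=JsonNumberSchema()))
```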
`callable_ref` is a string with the structure `<module>:<function_or_class>`, and specifies the function or class to call when `inline_code` or `file_set` is provided. The specified function or class is handled like the `_callable` parameter described above.
`callable_params` specifies a dictionary of named parameters which are passed to the processor function or method.
`inline_code` is a string containing Python source code. If supplied, it should contain the definition of a function or class as described for the `_callable` parameter. The module and class identifiers for the callable in the inline code snippet should be specified in the `callable_ref` parameter.
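A code configuration using `inline_code` might be sketched as follows in Python. The module name used in `callable_ref` is a hypothetical placeholder; how the generator names the module containing the inline snippet should be checked against the API documentation:

```python
from xcube.core.byoa import CodeConfig

code_config = CodeConfig(
    inline_code=(
        "def process_dataset(dataset, factor=1.0):\n"
        "    return dataset * factor\n"
    ),
    # "processor" is a hypothetical module name for the inline snippet.
    callable_ref="processor:process_dataset",
    callable_params={"factor": 2.0},
)
```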
`file_set` specifies a set of files which should be read from an fsspec file system and which contain a definition of a dataset processor. As with `inline_code`, the parameter `callable_ref` should also be supplied to tell the generator which class or function in the file set is the actual processor. The parameters of `file_set` are identical with those of the constructor of the corresponding Python `FileSet` class, and are as follows:
| Parameter | Type | Description |
|---|---|---|
| `path` | str | fsspec-compatible root path specifier |
| `sub_path` | str | optional sub-path to append to main path |
| `includes` | [str] | include files matching any of these patterns |
| `excludes` | [str] | exclude files matching any of these patterns |
| `storage_params` | `map(str→*)` | FS-specific parameters (passed to fsspec) |
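A code configuration built from a file set might be sketched as follows; the path, storage parameters, and module and function names are hypothetical:

```python
from xcube.core.byoa import CodeConfig, FileSet

code_config = CodeConfig(
    file_set=FileSet(
        path="s3://my-code-bucket/processors",  # hypothetical location
        includes=["*.py"],                      # only bundle Python sources
        storage_params={"anon": False},         # passed through to fsspec
    ),
    # Hypothetical <module>:<function> reference into the file set.
    callable_ref="my_processors:process_dataset",
)
```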
This configuration element determines where the generated cube should be written to. The Python configuration class is called `OutputConfig`, and the YAML section `output_config`.
| Parameter | Type | Units/description |
|---|---|---|
| `store_id` | str | Identifier of output store |
| `writer_id` | str | Identifier of data writer |
| `data_id` | str | Identifier under which to write the cube |
| `store_params` | `map(str→*)` | Store-dependent parameters for output store |
| `write_params` | `map(str→*)` | Writer-dependent parameters for output writer |
| `replace` | bool | If true, replace any existing data with the same identifier. |
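For example, an output configuration writing a Zarr cube to an S3 store might be sketched as follows via the Python API; the identifiers and store parameters are hypothetical:

```python
from xcube.core.gen2 import OutputConfig

output_config = OutputConfig(
    store_id="s3",                          # write to an S3 data store
    data_id="sst-cube.zarr",                # hypothetical output identifier
    store_params={"root": "cube-outputs"},  # store-dependent, hypothetical
    replace=True,                           # overwrite existing data of that name
)
```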