This document makes a number of recommendations for creating real-world ML Model Extensions. None of them are required to meet the core specification, but following these practices will improve the documentation of your model and make life easier for client tooling and users. They come from the practical experience of implementors and introduce a bit more 'constraint' for those creating STAC objects representing their models or creating tools to work with STAC.
- Using STAC Common Metadata Fields for the ML Model Extension
- Recommended Extensions to Compose with the ML Model Extension
It is recommended to use the `start_datetime` and `end_datetime`, `geometry`, and `bbox` fields in a STAC Item, and the corresponding Extent Object in a Collection, to represent the recommended context of the data the model was trained with and for which the model should have appropriate domain knowledge for inference.
For example, if a model was trained using the EuroSAT dataset and represented using MLM, it would be reasonable to describe it with a time range of 2015-2018 and an area corresponding to the European Urban Atlas, as described by the EuroSAT paper. However, it could also be considered adequate to define a wider extent, since it would not be unexpected to find reasonably similar classes and domain distributions in later years and in other locations. Given that the exact extent applicable to a model is difficult to define reliably, it is left to the good judgment of users to provide adequate values. Note that users employing the model can also choose to apply it to contexts outside the recommended extent for the same reason.
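For instance, such an extent could be encoded in the STAC Item as follows (a minimal sketch: the bbox is a rough approximation of the European coverage and the datetimes loosely bracket the 2015-2018 period; none of these values are authoritative for EuroSAT):

```json
{
  "bbox": [-10.0, 34.0, 45.0, 72.0],
  "properties": {
    "start_datetime": "2015-01-01T00:00:00Z",
    "end_datetime": "2018-12-31T23:59:59Z"
  }
}
```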
As another example, let us consider a model which is trained on imagery from all over the world and is robust enough to be applied to any time period. In this case, the common metadata to use with the model could include the bbox of "the world" `[-180, -90, 180, 90]`, and the `start_datetime` and `end_datetime` range would ideally be open-ended generic values like `["1900-01-01T00:00:00Z", null]`. However, due to limitations of the STAC 1.0 specification, such an open-ended time extent cannot be expressed directly (see the warning below).
> **Warning**
> The `null` value is not allowed for datetime specification. As a workaround, `end_datetime` can be set to a "very large value" (similarly, `start_datetime` can be set to a very small one), such as `"9999-12-31T23:59:59Z"`. Alternatively, the model can instead be described with only a `datetime` corresponding to its publication date. For more details, see the following discussion.
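Applying this workaround to the worldwide model described above, the relevant fields could look as follows (a sketch; the sentinel dates are the arbitrary placeholders suggested in the warning):

```json
{
  "bbox": [-180, -90, 180, 90],
  "properties": {
    "start_datetime": "1900-01-01T00:00:00Z",
    "end_datetime": "9999-12-31T23:59:59Z"
  }
}
```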
Note that generic and very broad spatiotemporal extents like the above rarely reflect the model's actual capability to produce reliable predictions. If a more constrained area and time of interest can be identified, such as the ranges for which the training dataset applies, or a test split dataset that validates the applicability of the model on other domains, those should be provided instead. Nevertheless, users of the model remain free to apply it outside the specified extents.
If specific datasets with training/validation/test splits are known to support the claims of the suggested extent for the model, it is recommended to include them as references in the STAC Item/Collection using MLM. For more information regarding these references, see the ML-AOI and Label Extensions details.
It is recommended to use at least the `processing:lineage` and `processing:level` fields from the Processing Extension to make it clear how Model Input Objects are processed by the data provider prior to an inference preprocessing pipeline. This can help users locate the correct version of the dataset used during model inference or help them reproduce the data processing pipeline.
For example:
```json
{
  "processing:lineage": "GRD Post Processing",
  "processing:level": "L1C",
  "processing:facility": "Copernicus S1 Core Ground Segment - DPA",
  "processing:software": {
    "Sentinel-1 IPF": "002.71"
  }
}
```
STAC Items or STAC Assets resulting from the model inference should be annotated with `processing:level = L4` (as described below) to indicate that they correspond to the output of an ML model.
| `processing:level` | Description |
|--------------------|-------------|
| `L4` | Model output or results from analyses of lower level data (i.e.: variables that are not directly measured by the instruments, but are derived from these measurements). |
Furthermore, the `processing:expression` should be specified with a reference to the STAC Item employing the MLM extension to provide full context of the source of the derived product. A potential representation of a STAC Asset could be as follows:
```json
{
  "model-output": {
    "mlm:name": "<name-in-MLM-STAC-Item>",
    "processing:level": "L4",
    "processing:expression": {
      "format": "stac-mlm",
      "expression": "<URI-to-MLM-STAC-Item>"
    }
  }
}
```
Furthermore, the STAC Item representing the derived product could also include a Link Object referring back to the MLM definition using `rel: derived_from`, as described in MLM Relation Types. Such a link would look something like the following:
```json
{
  "links": [
    {
      "rel": "derived_from",
      "type": "application/geo+json",
      "href": "<URI-to-MLM-STAC-Item>",
      "mlm:name": "<name-in-MLM-STAC-Item>",
      "processing:level": "L4"
    }
  ]
}
```
Supervised machine learning models will typically employ a dataset of training, validation, and test samples. If those samples happen to be represented by STAC Collections and Items annotated with the ML-AOI Extension, notably with the corresponding `ml-aoi:split` and all their annotations with Label Extension references, the STAC Item that contains the MLM Extension should include those STAC Collections in its `links` listing in order to provide direct references to the training dataset that was employed for creating the model. Providing dataset references would, in combination with the training pipeline contained under an MLM Asset Object annotated with the `mlm:training-runtime` role, allow users to retrain the model for validation, or with adaptations to improve it, eventually leading to a new MLM STAC Item definition (see also the STAC Version Extension). For example:
```json
{
  "id": "stac-item-model",
  "stac_extensions": [
    "https://crim-ca.github.io/mlm-extension/v1.3.0/schema.json",
    "https://stac-extensions.github.io/ml-aoi/v0.2.0/schema.json"
  ],
  "assets": {
    "mlm:training": {
      "title": "Model Training Pipeline",
      "href": "docker.io/training/image:latest",
      "type": "application/vnd.oci.image.index.v1+json",
      "roles": ["mlm:training-runtime"]
    }
  },
  "links": [
    {
      "rel": "derived_from",
      "type": "application/json",
      "href": "<URI-to-STAC-Collection-Split-Train>",
      "ml-aoi:split": "train"
    },
    {
      "rel": "derived_from",
      "type": "application/json",
      "href": "<URI-to-STAC-Collection-Split-Valid>",
      "ml-aoi:split": "validate"
    },
    {
      "rel": "derived_from",
      "type": "application/json",
      "href": "<URI-to-STAC-Collection-Split-Test>",
      "ml-aoi:split": "test"
    }
  ]
}
```
Since it is expected that a model will provide some kind of classification values as output, the Classification Extension can be leveraged inside the MLM definition to indicate which class values can be contained in the resulting output from the model prediction. For more details, see the Model Output Object definition.
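As an illustrative sketch only (the class values and names are hypothetical, and the exact placement of `classification:classes` should be validated against the Model Output Object definition), a classification output could be described as follows:

```json
{
  "mlm:output": [
    {
      "name": "classification",
      "tasks": ["classification"],
      "classification:classes": [
        {"value": 1, "name": "water", "description": "Water bodies"},
        {"value": 2, "name": "forest", "description": "Forested areas"},
        {"value": 3, "name": "urban", "description": "Built-up areas"}
      ]
    }
  ]
}
```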
> **Note**
> Update according to stac-extensions/classification#48.
Provided that most models derive from previous scientific work, it is strongly recommended to employ the Scientific Extension to provide references corresponding to the original source of the model (`sci:doi`, `sci:citation`). This can help users find more information about the model, its underlying architecture, or ways to improve it by piecing together the related work (`sci:publications`) that led to its creation. This extension can also be used for the purpose of publishing new models, by providing users the necessary details regarding how they should cite its use (i.e.: the `sci:citation` field and the `cite-as` relation type).
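A sketch of such references follows (the DOI, citation, and URI are placeholders rather than real publications):

```json
{
  "properties": {
    "sci:doi": "10.0000/example-model-doi",
    "sci:citation": "Doe, J. et al. (2024). An Example Model Publication."
  },
  "links": [
    {
      "rel": "cite-as",
      "href": "https://doi.org/10.0000/example-model-doi"
    }
  ]
}
```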
In order to provide a reliable and reproducible machine learning pipeline, external references to data required by the model should employ the File Extension to validate that they are properly retrieved for inference. One of the most typical cases is the definition of an external file reference to model weights, often stored on Git LFS or an S3 bucket due to their size. Providing `file:checksum` and `file:size` for this file can help ensure that the model is properly instantiated from the expected weights, or that sufficient storage is allocated to run it.
```json
{
  "stac_extensions": [
    "https://crim-ca.github.io/mlm-extension/v1.3.0/schema.json",
    "https://stac-extensions.github.io/file/v2.1.0/schema.json"
  ],
  "assets": {
    "model": {
      "type": "application/x-pytorch",
      "href": "<URI-to-model-weights>",
      "roles": [
        "mlm:model",
        "mlm:weights",
        "data"
      ],
      "file:size": 123456789,
      "file:checksum": "12209f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
      "mlm:artifact_type": "torch.save"
    }
  }
}
```
In order to help users understand how to apply and run the described machine learning model, the Example Extension can be used to provide code examples demonstrating how it can be applied. For example, a Model Card on Hugging Face is often provided (see Hugging Face Model examples) to describe the model, which can embed sample code and references to more details about the model. This kind of reference should be added under the `links` of the STAC Item using MLM. Typically, a STAC Item using the MLM extension to describe the training or inference strategies to apply a model should define the Source Code Asset. This code is in itself ideal for guiding users on how to run the model, and should therefore be replicated as an `example` link reference to offer more code samples to execute it.
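A sketch of such a link reference (the `href` is a placeholder, and only generic Link Object fields are shown; the exact fields offered by the Example Extension should be confirmed against its specification):

```json
{
  "links": [
    {
      "rel": "example",
      "href": "<URI-to-Hugging-Face-model-card>",
      "type": "text/html",
      "title": "Hugging Face Model Card with sample inference code"
    }
  ]
}
```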
> **Note**
> Update according to stac-extensions/example-links#4.
In the event that a model is retrained with gradually added annotations or improved training strategies leading to better performance, the existing model and the newer models represented by STAC Items with MLM should also make use of the Version Extension. Using the fields and link relation types defined by this extension, the retraining cycle of the model can be better described, with a full history of the newer versions developed.
Additionally, the `version:experimental` field should be considered for models still being trained and under evaluation before widespread deployment. This can be particularly useful for annotating model experiments during a cross-validation training process to find the "best model". This field could also be used to indicate that a model is provided for educational purposes only.
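A sketch of how these fields and relation types could be combined (version numbers and URIs are placeholders):

```json
{
  "properties": {
    "version": "2.0.0",
    "version:experimental": true
  },
  "links": [
    {
      "rel": "predecessor-version",
      "href": "<URI-to-previous-MLM-STAC-Item>"
    },
    {
      "rel": "latest-version",
      "href": "<URI-to-latest-MLM-STAC-Item>"
    }
  ]
}
```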