# The ESGF Search STAC service

## Overview

This document provides examples of how the proposed ESGF-STAC-SEARCH service can be used via:
 1. A Python client library
 2. GET Requests made directly to the service API
 
In order to compare to the existing ESGF Search service/client we have structured this page to match the existing ESGF Search API instructions which are located at:

https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#the-esgf-search-restful-api

## Which data is being searched?

Initially, we have indexed all the CMIP6 data held at the UK Data Node (CEDA) and the example searches work with that.

The data volumes in this subset are:
- Dataset (STAC Item) count: 677813
- File (STAC Asset) count: 6137991

## Syntax

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#syntax

### New API

The general syntax is:

```
https://api.stac.ceda.ac.uk/search?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]
```

All parameters (keyword and facet) are optional. Also, the value of all parameters must be URL-encoded, so that the complete search URL is well formed.
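
As a minimal sketch of the URL-encoding requirement, the standard-library `urllib.parse.urlencode` can assemble a well-formed search URL from (name, value) pairs. The helper name `build_search_url` is illustrative only and is not part of the service or the client; `q` and `limit` are keyword parameters described later on this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stac.ceda.ac.uk/search"


def build_search_url(**params):
    """Return a search URL with every parameter value URL-encoded.

    Hypothetical helper for illustration; not part of any ESGF library.
    """
    # All parameters are optional, so drop any that were left unset.
    query = {name: value for name, value in params.items() if value is not None}
    if not query:
        return BASE_URL
    # urlencode percent-encodes reserved characters such as '*' (-> %2A).
    return f"{BASE_URL}?{urlencode(query)}"


url = build_search_url(q="humid*", limit=10)
print(url)
```

Calling the helper with no arguments returns the bare endpoint, which corresponds to a query that uses only default values.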

### Python client

First the client must be initialised as follows:

In [3]:
from esgf_stac_client.client import ESGFStacClient
client = ESGFStacClient.open("https://api.stac.ceda.ac.uk")

Example of the basic Python client usage:
 - get collections

In [None]:
client.get_collections()

 - item search

In [None]:
client.search()

 - asset search

In [None]:
client.asset_search()
client.search(doctype="files")

Get STAC data from the search result object:

In [None]:
search_result = client.search()

# Call the items function to get the item generator object.
for item in search_result.items():
    item.to_dict()

## Keywords

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#keywords

### New API

The following keywords are currently used by the system - see later for usage examples:

- limit=, page= to paginate through available results. (default: limit=10, page=0)
- filter=, filter_lang= to include a search query in the filter parameter. (default GET search: filter_lang=cql-text, filter=None)
- fields= to return only the specified metadata fields for each matching result. (default: fields=*)
  - (NOT YET IMPLEMENTED) The current system does not have the fields extension, use source= instead.
- Doctype search, for STAC use the respective endpoints for datasets and files:
  - Dataset search endpoint: https://api.stac.ceda.ac.uk/search
  - File search endpoint: https://api.stac.ceda.ac.uk/asset/search
- bbox=`[West,South,East,North]` to filter within a geo-spatial box. (default: bbox=None)
- datetime=`start_datetime/end_datetime OR datetime` to filter within a specified temporal range or point. (default: datetime=None)
- ids= to list one or more current STAC objects to filter on.
- collections= to list one or more STAC collections to filter on.
- intersects= to filter by any geo-json type. (default: intersects=None)
- q= to filter by a string against values in the properties. (default: q=None)


### Python client

The following keywords are currently used by the Python Client - see later for usage examples:

- method= to specify whether to run a POST or GET search request.
- limit=, max_items= to specify the number of items to return in a response; limit is the number of items per page and max_items is the total number returned. (default: limit=100, max_items=100)
- doctype=`"file"/"datasets"` to specify what type of document to return. (default: doctype="datasets")
- ids= to list one or more current STAC objects to filter on.
- collections= to list one or more STAC collections to filter on. (Only for item search)
- items= to list one or more STAC items to filter on. (Only for asset search)
- bbox=`[West,South,East,North]` to filter within a geo-spatial box. (default: bbox=None)
- intersects= to filter by any geo-json type. (default: intersects=None)
- datetime=`start_datetime/end_datetime OR datetime` to filter within a specified temporal range or point. (default: datetime=None)
- filter=, filter_lang= to include a search query in the filter parameter. (default GET search: filter_lang=cql-text, filter=None)
- fields= to return only the specified metadata fields for each matching result. (default: fields=*)
  - (NOT YET IMPLEMENTED) The current system does not have the fields extension, use source= instead.
- q= to filter by a string against values in the properties. (default: q=None)

## Default Query

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#default-query

If no parameters at all are specified, the search service will execute a query using all the default values, specifically:

q=* (query all records)
type=Dataset (return results of type “Dataset”)

### New API

If no parameters at all are specified, the `/search` or `/assets/search` endpoints will execute a query using all the default values.

### Python client

If no parameters at all are specified, the `ESGFStacClient.search()` function will execute using all the default values, specifically:

method="GET" (perform a GET /search)
fields=* (source=*, return all fields)
doctype=datasets (return results of type "Item")

## Free Text Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#free-text-queries

Free-text queries are enabled in STAC using the free-text extension: https://api.stacspec.org/v1.0.0-beta.2/item-search/#free-text-search.
This uses the keyword parameter `q=` to match a string against **all** fields in the properties. The
match is case-insensitive and supports partial matching with the wildcard character, `*`.

### New API

Search for any text, anywhere: https://api.stac.ceda.ac.uk/search?q=%2A (q=*, URL encoded)

Search for "humidity" in all properties fields: https://api.stac.ceda.ac.uk/search?q=humidity

Partial match for "humid\*" in all properties fields: https://api.stac.ceda.ac.uk/search?q=humid%2A (q=humid\*, URL encoded)

### Python client

Search for any text anywhere:

In [None]:
client.search(q="*")

Search for "humidity" in all properties fields:

In [None]:
client.search(q="humidity")

Partial match for "humid\*" in all properties fields:

In [None]:
client.search(q="humid*")

## Facet Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#facet-queries

Facet search is enabled using the `filter=` keyword parameter alongside `filter_lang=` to
specify the common query language used. Default is `filter_lang='cql-text'` for a string
based filter query using a GET search.

filter= does not support temporal or spatial queries; however, it can work alongside the datetime= and bbox= keyword
parameters for additional temporal/spatial queries.

### New API

Single facet query: https://api.stac.ceda.ac.uk/search?filter=cf_standard_name+%3D+%27air_temperature%27&filter-lang=cql2-text

Query with two different facet constraints: https://api.stac.ceda.ac.uk/search?filter=cf_standard_name+%3D+%27air_temperature%27+and+activity_id+%3D+%27CMIP%27&filter-lang=cql2-text

Combining two values of the same facet with a logical OR: https://api.stac.ceda.ac.uk/search?filter=cf_standard_name+%3D+%27air_temperature%27+and+%28activity_id+%3D+%27CMIP%27+or+activity_id+%3D+%27AerChemMIP%27%29&filter-lang=cql2-text

#### Using a negative facet:

Search for all datasets that have variable ta OR hus, excluding those with an activity_id of AerChemMIP: https://api.stac.ceda.ac.uk/search?filter=activity_id%3C%3E%27AerChemMIP%27+AND+%28variable%3D%27hus%27+or+variable%3D%27ta%27%29&filter-lang=cql2-text

Search for all datasets that have neither the variable ta nor hus: https://api.stac.ceda.ac.uk/search?filter=variable%3C%3E%27hus%27+and+variable%3C%3E%27ta%27&filter-lang=cql2-text

Issue a query for all supported facets and their values at one site, while returning no results (note that only facets with one or more values are returned): https://api.stac.ceda.ac.uk/collections/cmip6/queryables
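
The facet URLs above can be reproduced with a short sketch that URL-encodes a CQL2-text expression. The function name `facet_filter_url` is an assumption made for illustration; the parameter names `filter` and `filter-lang` and the facet names are taken from the example URLs in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stac.ceda.ac.uk/search"


def facet_filter_url(cql2_text):
    """Wrap a CQL2-text expression in a URL-encoded GET search URL.

    Hypothetical helper for illustration only.
    """
    query = {"filter": cql2_text, "filter-lang": "cql2-text"}
    # urlencode turns spaces into '+', '=' into %3D and quotes into %27.
    return f"{BASE_URL}?{urlencode(query)}"


single = facet_filter_url("cf_standard_name = 'air_temperature'")
combined = facet_filter_url(
    "cf_standard_name = 'air_temperature' and "
    "(activity_id = 'CMIP' or activity_id = 'AerChemMIP')"
)
print(single)
print(combined)
```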

### Python client

Single facet query:

In [None]:
client.search(cf_standard_name="air_temperature")

Query with two different facet constraints:

In [None]:
client.search(cf_standard_name="air_temperature", activity_id="CMIP")

Combining two values of the same facet with a logical OR:

In [None]:
client.search(cf_standard_name="air_temperature", activity_id=["CMIP", "AerChemMIP"])
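
To relate the facet keyword arguments above to the filter expressions used by the API, the sketch below builds an equivalent CQL2-text string. It is an assumption about how such a mapping could work, not a description of the client's internals: `facets_to_cql2_text` is a hypothetical helper, and it does no escaping of quotes inside values.

```python
def facets_to_cql2_text(**facets):
    """Build a CQL2-text expression from facet keyword arguments.

    Hypothetical mapping: a single value becomes an equality test, a list
    of values becomes an OR group, and separate facets are joined with AND.
    """
    clauses = []
    for name, value in facets.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        tests = [f"{name} = '{v}'" for v in values]
        clause = " or ".join(tests)
        # Parenthesise OR groups so they stay grouped inside the AND.
        clauses.append(f"({clause})" if len(tests) > 1 else clause)
    return " and ".join(clauses)


expression = facets_to_cql2_text(
    cf_standard_name="air_temperature",
    activity_id=["CMIP", "AerChemMIP"],
)
print(expression)
```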

#### Using a negative facet:

For more complex queries using comparators such as `<>`, `<=`, `>=` (not equal, lte, gte), the filter keyword
parameter must be used directly. The filter parameter does **not** work alongside facet keyword arguments.
The filter parameter can be a "cql2-text" string or a "cql2-json" dict, with the appropriate method=.

Search for all datasets that have variable ta OR hus, excluding those with an activity_id of AerChemMIP:

In [None]:
client.search(filter="activity_id<>'AerChemMIP' AND (variable='hus' or variable='ta')")

Search for all datasets that have neither the variable ta nor hus:

In [None]:
client.search(filter="variable<>'hus' and variable<>'ta'")
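
For the "cql2-json" form mentioned above, the first negative-facet filter might be written as the dictionary below. This is a sketch that follows the operator/argument layout of the CQL2 JSON encoding; whether the service accepts exactly this structure, and the shape of the request body, are assumptions that have not been checked against the deployed API.

```python
import json

# CQL2-JSON form of:
#   activity_id<>'AerChemMIP' AND (variable='hus' or variable='ta')
cql2_json_filter = {
    "op": "and",
    "args": [
        {"op": "<>", "args": [{"property": "activity_id"}, "AerChemMIP"]},
        {
            "op": "or",
            "args": [
                {"op": "=", "args": [{"property": "variable"}, "hus"]},
                {"op": "=", "args": [{"property": "variable"}, "ta"]},
            ],
        },
    ],
}

# Assumed body of a POST search request carrying the filter and its language.
body = {"filter-lang": "cql2-json", "filter": cql2_json_filter}
print(json.dumps(body, indent=2))
```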

Issue a query for all supported facets and their values at one site: the queryables extension is not supported by
the Python client. Instead, the common facets of a collection and their values can be found in the
summaries of that collection.

In [None]:
client.get_collection(collection_id='cmip6').summaries


## Facet Listings

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#facet-listings

The STAC /queryables endpoint is only supported on a per-collection basis. It returns a JSON document listing **all** the facets and
their values.

### New API

List all the CMIP6 facet names and values: https://api.stac.ceda.ac.uk/collections/cmip6/queryables

### Python client

Queryables endpoint not supported by the Python Client.

## Temporal Coverage Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#temporal-coverage-queries

Temporal search is applied with the "datetime" keyword parameter and uses the ISO 8601 format.

### New API

Single date search (equivalent to a range search from the start to the end of the day): https://api.stac.ceda.ac.uk/search?datetime=2300-01-01

Single datetime search: https://api.stac.ceda.ac.uk/search?datetime=2300-01-01T00%3A00%3A00Z

Open ended datetime search (GTE a datetime point): https://api.stac.ceda.ac.uk/search?datetime=2300-01-01T00%3A00%3A00Z%2F..

Open beginning datetime search (LTE a datetime point): https://api.stac.ceda.ac.uk/search?datetime=..%2F2800-12-01T00%3A00%3A00.000Z

Complete range datetime search (GTE and LTE two datetime points): https://api.stac.ceda.ac.uk/search?datetime=2300-01-01T00%3A00%3A00Z%2F2800-12-01T00%3A00%3A00.000Z
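
The datetime values above follow a `start/end` pattern in which `..` marks an open end. A minimal sketch of building and encoding such values is shown below; the helper names `datetime_interval` and `datetime_search_url` are illustrative assumptions rather than part of the API or client.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stac.ceda.ac.uk/search"


def datetime_interval(start=None, end=None):
    """Return a 'start/end' interval string, using '..' for an open end."""
    if start is None and end is None:
        raise ValueError("at least one of start or end is required")
    return f"{start or '..'}/{end or '..'}"


def datetime_search_url(value):
    """URL-encode a datetime value (':' -> %3A, '/' -> %2F)."""
    return f"{BASE_URL}?{urlencode({'datetime': value})}"


open_ended = datetime_search_url(datetime_interval(start="2300-01-01T00:00:00Z"))
open_start = datetime_search_url(datetime_interval(end="2800-12-01T00:00:00.000Z"))
closed = datetime_search_url(
    datetime_interval("2300-01-01T00:00:00Z", "2800-12-01T00:00:00.000Z")
)
print(open_ended)
print(open_start)
print(closed)
```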

### Python client

Single date search (equivalent to a range search from the start to the end of the day):

In [None]:
client.search(datetime="2300-01-01")

Single datetime search:

In [None]:
client.search(datetime="2300-01-01T00:00:00Z")

Open ended datetime search (GTE a datetime point):

In [None]:
client.search(datetime="2300-01-01T00:00:00Z/..")

Open beginning datetime search (LTE a datetime point):

In [None]:
client.search(datetime="../2800-12-01T00:00:00.000Z")

Complete range datetime search (GTE and LTE two datetime points):

In [None]:
client.search(datetime="2300-01-01T00:00:00Z/2800-12-01T00:00:00.000Z")

## Spatial Coverage Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#spatial-coverage-queries

Example: 

http://esgf-node.llnl.gov/esg-search/search?bbox=%5B-10,-10,+10,+10%5D (translates to: east_degrees:[-10 TO *] AND north_degrees:[-10 TO *] AND west_degrees:[* TO 10] AND south_degrees:[* TO 10])

### New API

https://api.stac.ceda.ac.uk/search?bbox=-180.0%2C-90%2C180.0%2C90.0 (translates to: west_degrees/min_longitude:[-180.0 TO *] AND south_degrees/min_latitude:[-90.0 TO *] AND east_degrees/max_longitude:[* TO 180.0] AND north_degrees/max_latitude:[* TO 90.0])
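
As a sketch of how the bbox value above could be assembled, the helper below orders the corners as West, South, East, North and applies basic range checks before encoding. The name `bbox_param` and the validation rules are assumptions added for illustration; the service may apply different checks.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stac.ceda.ac.uk/search"


def bbox_param(west, south, east, north):
    """Format a bounding box as the comma-separated bbox= value."""
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must lie within [-180, 180]")
    # west > east is left unchecked so boxes crossing the antimeridian pass.
    return ",".join(str(v) for v in (west, south, east, north))


bbox = bbox_param(-180.0, -90, 180.0, 90.0)
url = f"{BASE_URL}?{urlencode({'bbox': bbox})}"
print(url)
```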

### Python client

In [None]:
client.search(bbox="-180.0,-90,180.0,90.0")

## Distributed Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#distributed-queries

**NOT RELEVANT TO STAC (YET)**

## Shard Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#shard-queries

**NOT RELEVANT TO STAC**

## Replica Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#replica-queries

**NOT RELEVANT TO STAC**

## Latest and Version Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#latest-and-version-queries

By default, a query to the ESGF search services will return all versions of the matching records (Datasets or Files). To only return the very last, up-to-date version include latest=true . To return a specific version, use version=… . Using latest=false will return only datasets that were superseded by newer versions.

Examples:

Search for all latest CMIP5 datasets: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&latest=true

Search for all versions of a given dataset: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&master_id=cmip5.output1.MOHC.HadCM3.decadal1972.day.atmos.day.r10i2p1&facets=version

Search for a specific version of a given dataset: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&master_id=cmip5.output1.NSF-DOE-NCAR.CESM1-CAM5-1-FV2.historical.mon.atmos.Amon.r1i1p1&version=20120712

### New API

The `latest` flag is recorded at the file (asset) level; an item is an aggregate of assets that share common file-path metadata. As a result:

1. **latest is not aggregated up to items**

2. **filtering on a boolean value within a filter query is not implemented; filter evaluation
defaults to "properties__{filter}__keyword", so only keyword filtering is supported**

### Python client

## Retracted Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#retracted-queries

Example:

Search for all retracted datasets in the CMIP5 project, across all nodes: https://esgf-node.llnl.gov/esg-search/search?project=CMIP5&retracted=true

### New API

**Same points as above**

### Python client


## Minimum and Maximum Version Queries

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#minimum-and-maximum-version-queries

**NOT RELEASED IN SOLR - IGNORING IN STAC**

## Results Pagination

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#results-pagination

By default, a query to the search service will return the first 10 records matching the given constraints. The offset into the returned results, and the total number of returned results, can be changed through the keyword parameters limit= and offset= . The system imposes a maximum value of limit <= 10,000.

Examples:

Query for 100 CMIP5 datasets in the system: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&limit=100

Query for the next 100 CMIP5 datasets in the system: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&limit=100&offset=100


### New API

Query for 100 CMIP6 datasets in the system: https://api.stac.ceda.ac.uk/search?limit=100

Query for the next 100 CMIP6 datasets in the system: https://api.stac.ceda.ac.uk/search?limit=100&page=2

### Python client

The Python client hides pagination via generators. A page is equivalent to an ItemCollection.

The client paginates automatically when iterating through the generator. The limit= parameter
dictates the number of items per ItemCollection.

To iterate through pages:

In [None]:
result = client.search()
for page in result.item_collections():
    for item in page.items:
        ...

However, if only iterating through items, there is no need for iterating through the pages.
Directly iterate through all the items:

In [None]:
result = client.search()
for item in result.items():
    ...
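
The idea of hiding pagination behind a generator can be illustrated without the client or the network. In the sketch below, `iter_items`, `fake_fetch` and the in-memory `catalogue` are all stand-ins invented for this example; the zero-based page numbering is an assumption of the stand-in, and `limit` and `max_items` are used with the meaning given in the keyword list (items per page and total items returned).

```python
def iter_items(fetch_page, limit=100, max_items=100):
    """Yield items one by one, requesting another page only when needed.

    fetch_page(page, limit) stands in for a single search request and must
    return a list of at most `limit` items (an empty list when exhausted).
    """
    yielded = 0
    page = 0
    while yielded < max_items:
        items = fetch_page(page, limit)
        if not items:
            return  # no more results available
        for item in items:
            if yielded >= max_items:
                return  # total cap reached part-way through a page
            yield item
            yielded += 1
        page += 1


# In-memory stand-in for the service: 250 fake item ids.
catalogue = [f"item-{i}" for i in range(250)]


def fake_fetch(page, limit):
    start = page * limit
    return catalogue[start:start + limit]


first_150 = list(iter_items(fake_fetch, limit=100, max_items=150))
print(len(first_150), first_150[0], first_150[-1])
```

The caller only iterates over items; page boundaries and the `max_items` cap are handled inside the generator.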

## Output Format

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#output-format

**ONLY JSON FORMAT AVAILABLE IN STAC**

### Python Client

The Python client can return items as [PySTAC](https://github.com/stac-utils/pystac) objects or
as a dictionary representation of the JSON.


## Returned Metadata Fields

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#returned-metadata-fields

By default, all available metadata fields are returned for each result. The keyword parameter fields= can be used to limit the number of fields returned in the response document, for each matching result. The list must be comma-separated, and white spaces are ignored. Use fields=* to return all fields (same as not specifying it, since it is the default). Note that the pseudo field “score” is always appended to any fields list.

Examples:

Return all available metadata fields for CMIP5 datasets: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&fields=*

Return only the “model” and “experiment” fields for CMIP5 datasets: http://esgf-node.llnl.gov/esg-search/search?project=CMIP5&fields=model,experiment


### New API

Default: return all available metadata fields for CMIP6 datasets.

Return only the model and experiment properties' fields for CMIP6 datasets using the fields= keyword: https://api.stac.ceda.ac.uk/search?fields=%2Bproperties.model%2C%2Bproperties.experiment
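
The encoded fields= value in the URL above can be produced as in the following sketch, where each requested field carries a `+` prefix and the list is comma-separated. The helper name `fields_param` is an assumption for illustration only.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stac.ceda.ac.uk/search"


def fields_param(include):
    """Build a fields= value in which every requested field is '+'-prefixed."""
    return ",".join(f"+{name}" for name in include)


fields = fields_param(["properties.model", "properties.experiment"])
# urlencode turns '+' into %2B and ',' into %2C.
url = f"{BASE_URL}?{urlencode({'fields': fields})}"
print(url)
```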

### Python client

Return only the model and experiment properties' fields for CMIP6 datasets using the fields= keyword:

*Note that this may return invalid STAC objects; use `items_as_dicts()` to bypass object unmarshalling errors.
This does not return Item objects, only their dictionary representation, so some functionality is unavailable.
In this instance, it calls a generator that yields dictionary objects.*

In [None]:
results = client.search(fields="properties.model,properties.experiment")
for item in results.items_as_dicts():
    ...

## Identifiers

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#identifiers

Each search record in the system is assigned the following identifiers (all of type string):

id : universally unique for each record across the federation, i.e. specific to each Dataset or File, version and replica (and the data node storing the data). It is intended to be “opaque”, i.e. it should not be parsed by clients to extract any information.

Dataset example: id=obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608|esgf-data.llnl.gov

File example: id=obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608.tro3Stderr_TES_L3_tbd_200507-200912.nc|esgf-data.llnl.gov

master_id : same for all replicas and versions across the federation. When parsing THREDDS catalogs, it is extracted from the properties “dataset_id” or “file_id”.

Dataset example: obs4MIPs.NASA-JPL.TES.tro3.mon (for a Dataset)

File example: obs4MIPs.NASA-JPL.TES.tro3.mon.tro3Stderr_TES_L3_tbd_200507-200912.nc

instance_id : same for all replicas across federation, but specific to each version. When parsing THREDDS catalogs, it is extracted from the ID attribute of the corresponding THREDDS catalog element (for both Datasets and Files).

Dataset example: obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608

File example: obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608.tro3Stderr_TES_L3_tbd_200507-200912.nc

Note also that the record version is the same for all replicas of that record, but different across versions. Examples:

Dataset example: version=20110608

File example: version=1

### New API

id: A universally unique identifier for each collection, item and asset (project, dataset, file). It can be used to
browse the API for a specific document: `https://api.stac.ceda.ac.uk/collections/<collection id>/[items/<item id>/[assets/<asset_id>]]`
(the ids of STAC documents are hashes).

All other identifiers, such as master_id and instance_id, can be searched using filter=.
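
A small sketch of the nested browse pattern described above is given below. The function name `browse_url` is hypothetical, the host is the one used by the other examples on this page, and the example ids are the ones that appear in the Access URLs section.

```python
BASE_URL = "https://api.stac.ceda.ac.uk"


def browse_url(collection_id, item_id=None, asset_id=None):
    """Build the browse URL for a collection, item or asset by id."""
    if asset_id is not None and item_id is None:
        raise ValueError("an asset id requires an item id")
    url = f"{BASE_URL}/collections/{collection_id}"
    if item_id is not None:
        url += f"/items/{item_id}"
    if asset_id is not None:
        url += f"/assets/{asset_id}"
    return url


asset_url = browse_url(
    "cmip6",
    item_id="c2d94dd296525cc105cfa657b6b559c0",
    asset_id="03cd0a42604287aa08bde5aabeb6c167",
)
print(asset_url)
```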

### Python client

id: Get a STAC document by id using the client (Each get method will return the respective object):

In [None]:
client.get_collection(collection_id).get_item(item_id).get_asset(asset_id)

## Access URLs

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#access-urls

In the Solr output document returned by a search, URLs that are access points for Datasets and Files are encoded as a 3-tuple of the form “url|mime type|service name”, where the fields are separated by the “pipe” (“|”) character, and the “mime type” and “service name” are chosen from the ESGF controlled vocabulary.

Example of Dataset access URLs:

THREDDS catalog: http://esgf-data.llnl.gov/thredds/catalog/esgcet/1/obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608.xml#obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608|application/xml+thredds|THREDDS

LAS server: http://esgf-node.llnl.gov/las/getUI.do?catid=0C5410C250379F2D139F978F7BF48BB9_ns_obs4MIPs.NASA-JPL.TES.tro3.mon.v20110608|application/las|LAS

Example of File access URLs:

HTTP download: http://esgf-data.llnl.gov/thredds/fileServer/esg_dataroot/obs4MIPs/observations/atmos/tro3Stderr/mon/grid/NASA-JPL/TES/v20110608/tro3Stderr_TES_L3_tbd_200507-200912.nc|application/netcdf|HTTPServer

GridFTP download: gsiftp://esgf-data.llnl.gov:2811//esg_dataroot/obs4MIPs/observations/atmos/tro3Stderr/mon/grid/NASA-JPL/TES/v20110608/tro3Stderr_TES_L3_tbd_200507-200912.nc|application/gridftp|GridFTP

OpenDAP download: http://esgf-data.llnl.gov/thredds/dodsC/esg_dataroot/obs4MIPs/observations/atmos/tro3Stderr/mon/grid/NASA-JPL/TES/v20110608/tro3Stderr_TES_L3_tbd_200507-200912.nc.html|application/opendap-html|OPENDAP

Globus As-A-Service download: globus:e3f6216e-063e-11e6-a732-22000bf2d559/esg_dataroot/obs4MIPs/observations/atmos/tro3Stderr/mon/grid/NASA-JPL/TES/v20110608/tro3Stderr_TES_L3_tbd_200507-200912.nc|Globus|Globus
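
The pipe-separated encoding used by the existing service can be unpacked with a few lines of Python. This is a sketch only: `AccessUrl` and `parse_access_url` are names invented for this example, and the input string is the HTTP download example listed above.

```python
from typing import NamedTuple


class AccessUrl(NamedTuple):
    url: str
    mime_type: str
    service_name: str


def parse_access_url(encoded):
    """Split a 'url|mime type|service name' string into its three fields."""
    # Split from the right so a '|' inside the URL itself would be preserved.
    url, mime_type, service_name = encoded.rsplit("|", 2)
    return AccessUrl(url, mime_type, service_name)


access = parse_access_url(
    "http://esgf-data.llnl.gov/thredds/fileServer/esg_dataroot/obs4MIPs/"
    "observations/atmos/tro3Stderr/mon/grid/NASA-JPL/TES/v20110608/"
    "tro3Stderr_TES_L3_tbd_200507-200912.nc|application/netcdf|HTTPServer"
)
print(access.service_name, access.mime_type)
```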

### New API

For STAC, all URLs are at the Asset level. An Item can access its assets with the assets endpoint: https://api.stac.ceda.ac.uk/collections/<collection_id>/items/<item_id>/assets/[<asset_id>]

This will return a JSON representation and the default in the "href" field is the access URL for HTTP download.

### Python client

The object representation of an Asset is <{id} {href}> where href is the access URL for HTTP download. Alternatively,
the access URL is an attribute of the Asset class: `href`

In [6]:
collection_id = "cmip6"
item_id = "c2d94dd296525cc105cfa657b6b559c0"
asset_id = "03cd0a42604287aa08bde5aabeb6c167"

asset = client.get_collection(collection_id).get_item(item_id).get_asset(asset_id)
asset.href

'http://esgf-data3.ceda.ac.uk/thredds/fileServer/esg_cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3-Veg-LR/piControl/r1i1p1f1/Amon/rsds/gr/v20200213/rsds_Amon_EC-Earth3-Veg-LR_piControl_r1i1p1f1_gr_238901-238912.nc'

## Wget scripting

Existing version: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html#wget-scripting

**NOT IMPLEMENTED FOR STAC - IS IT NEEDED?**