# Heading for the CEDA Singularity

Where Kerchunk meets Cloud meets NetCDF meets Zarr meets STAC meets EO Data Hub...

In [None]:
# Let's search the catalogue with pystac client, identify some records and load
# them directly as `xarray.Datasets` (via Kerchunk files)

import pystac_client
import kc_stac

CEDA_CAT_URL = "https://api.stac.ceda.ac.uk/eo"
catalog = pystac_client.Client.open(CEDA_CAT_URL)

# Create a filter to match similar data in UKCP and CMIP6 datasets:
# - collection IN ["ukcp", "cmip6"]
# - experiment IN ["rcp85", "ssp585"]
# - variable IN ["tasmin", "tasmax"]
# - frequency:  "day"
# - institute:  "MOHC"
fylter = {
    "op": "and",
    "args": [
        {"op": "in",
         "args": [{"property": "collection"}, ["ukcp", "cmip6"]]},
        {"op": "in",
         "args": [{"property": "experiment"}, ["rcp85", "ssp585"]]},
        {"op": "in",
         "args": [{"property": "variable"}, ["tasmin", "tasmax"]]},
        {"op": "=",
         "args": [{"property": "frequency"}, "day"]},
        {"op": "=",
         "args": [{"property": "institute"}, "MOHC"]}
    ]
}

# Define a bounding box (UK) and a datetime period of interest
search_terms = {
    "filter_lang": "cql2-json",
    "filter": fylter,
    "bbox": [-5, 50, 5, 65],  # [west, south, east, north]
    "datetime": "2020-01-01/2099-12-31"
}

# Run the query
q = catalog.search(method="GET", **search_terms)

# How many STAC Item records?
print(q.matched())
# >>> 4

# Okay let's load those items from Kerchunk into xarray
dsets = kc_stac.load(q)

# There is a one-to-one mapping between STAC Items and Kerchunk files here
print(dsets)
# >>> 4 Xarray Datasets: [tasmin@ssp585, tasmax@ssp585, tasmin@rcp85, tasmax@rcp85]

### kc_stac is doing 2 things here...

`kc_stac` does two things:
1. Hides the access control layer.
2. Hides the layer of code that works out the recipe for finding and opening a `kerchunk` (_reference_) file.

`kc_stac` should also:
- support loading individual data files (e.g. `*.nc`) - whilst still handling the access control

Proposed syntax:

```
import pystac_client

def load(
    q: pystac_client.ItemSearch,                      # the search returned by `catalog.search(...)`
    use_reference_file: bool = True,                  # IDEAS: might have multiple kerchunks (possibly)
    data_asset_ids: list[str] | bool | None = None,   # DEFAULT: try to aggregate/load all data (e.g. "*.nc") assets
    max_items: int = -1,                              # Limit how many items to load from (returned by STAC)
    compression=None,
):
    ...  # returns the loaded datasets (`dsets` in the example above)
```
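
A minimal sketch of how the `max_items` limit could behave, assuming that exceeding the limit raises an exception rather than silently truncating the results. The names `check_item_limit` and `TooManyItemsError` are hypothetical, and treating the default of `-1` as "no limit" is an assumption.

```python
class TooManyItemsError(Exception):
    """Raised when a search matches more items than `max_items` allows (hypothetical name)."""


def check_item_limit(n_matched: int, max_items: int = -1) -> None:
    """Raise if `n_matched` exceeds `max_items`.

    A negative `max_items` is treated as "no limit" (an assumption).
    """
    if max_items >= 0 and n_matched > max_items:
        raise TooManyItemsError(
            f"Search matched {n_matched} items but max_items={max_items}; "
            "narrow the search or raise max_items to load them all."
        )


# The example search reported 4 matching items.
check_item_limit(4, max_items=10)  # within the limit: no exception
check_item_limit(4)                # default of -1: no limit applied
```

A `load` function could call this with `q.matched()` before opening anything, so that a very broad search fails fast instead of opening hundreds of reference files.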

# Questions and issues

1. Might want to _lazy load_ the `xarray.Datasets` to be quick:
- maybe set a `max_items=10`:
  - would raise an exception if more items are requested, unless you change the setting.
- ideally, we are using `parquet` when Kerchunk files are large.

2. How does `kc_stac` know how to load the Kerchunk files?
- Format: `json`, `parquet`, `zst` - store in the Asset properties
- Service endpoint: `POSIX`, `http(s)`, `S3` - store in the Asset properties
- `xr.open_zarr` keyword arguments: some of these might need to be stored in the STAC
  - within `open_zarr_kwargs`: E.g. `{"decode_timedeltas": False}`
  - NOTE: when Kerchunk files are created, we could update the STAC Item records
- Access Control: open or restricted - _see below_.

3. Access Control - there will be different recipes for different collections:
- `kc_stac` would need to check the user has a token and/or certificate file:
  - then raise appropriate warning/exception to instruct the user how to get them
- cert file and/or token should have default locations (e.g. `env variables` or `file paths`)
  - then `kc_stac` will try to find them before raising an exception
  - allow user-configuration to change location of these defaults

4. Should the access restrictions be a searchable property?
- Yes, great idea, see _below_.

5. What about people that only want to use STAC (without `kc_stac`) and work it out from there?
- Later, we need to provide recipes and enough information to help them:
  - how to get a token/certificate
  - where to put it
  - how to download a file
  - how to translate the STAC content to some Python to open Kerchunk file:
    - `open`
    - `restricted`
    - `S3`, `POSIX`, `HTTP(S)`

6. Create examples that use Items from different collections:
- UKCP and CMIP6 together - through a single search

7. Don't call it `kc_stac`, or `stackerchunk`, or `stackyo`
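
The token/certificate lookup described in question 3 could be sketched roughly as follows. The environment variable names (`CEDA_TOKEN_FILE`, `CEDA_CERT_FILE`) and the default paths under `~/.ceda` are placeholders, not agreed conventions, and `find_credential_file` and `CredentialNotFoundError` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Optional


class CredentialNotFoundError(Exception):
    """Raised when no token/certificate file can be located (hypothetical name)."""


# Placeholder defaults; the real locations are still to be decided.
DEFAULT_LOCATIONS = {
    "token": {"env_var": "CEDA_TOKEN_FILE", "path": Path.home() / ".ceda" / "token"},
    "certificate": {"env_var": "CEDA_CERT_FILE", "path": Path.home() / ".ceda" / "cert.pem"},
}


def find_credential_file(
    kind: str,
    env_var: Optional[str] = None,
    default_path: Optional[Path] = None,
) -> Path:
    """Look for a credential file: environment variable first, then the default path.

    `env_var` and `default_path` let the user override the default locations.
    """
    defaults = DEFAULT_LOCATIONS[kind]
    env_var = env_var or defaults["env_var"]
    default_path = Path(default_path or defaults["path"])

    # Build the ordered list of places to look.
    candidates = []
    if os.environ.get(env_var):
        candidates.append(Path(os.environ[env_var]))
    candidates.append(default_path)

    for candidate in candidates:
        if candidate.is_file():
            return candidate

    # Nothing found: tell the user where the file is expected.
    raise CredentialNotFoundError(
        f"No {kind} file found. Set ${env_var} to its location "
        f"or save it at {default_path}."
    )
```

Calling `find_credential_file("token")` returns the first existing file, and the exception message is the natural place to add instructions on how to obtain a token or certificate.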

Example STAC record, for access control:

```
{
  'properties': {
    'access_control': {   # set here because it is the same for all Assets in this Item
      'rule': 'restricted',
      'roles': ['reguser']
    }
  },
  'assets': {
    'clt_2060s': {
      'checksum': 'HJDSHFKSHFUHFDKLFHDIOSFHSDFHOSDFIHSFODHO',
      'checksum_type': 'SHA256',
      'href': 'https://dap.ceda.ac.uk/badc/ukcp18/metadata/kerchunks/land-cpm/clt_5km_01_day.zst',
      'roles': ['reference'],
      'size': 63621,
      'type': 'application/zstd',
      'open_zarr_kwargs': {'decode_times': True}
    }
  }
}
```
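
As a rough illustration of how a client might translate a record like this into code, the sketch below turns an Asset plus the Item-level `access_control` property into keyword arguments for `xarray.open_dataset` using the `fsspec` `reference://` filesystem. It assumes the field names from the example record (`access_control`, `open_zarr_kwargs`), which are proposals rather than an agreed schema. `build_open_recipe` is a hypothetical helper, treating a missing `access_control` as open is an assumption, and passing `compression` through `target_options` for `.zst` files should be checked against the installed `fsspec` version.

```python
# Assumed mapping from the Asset media type to the reference file's compression.
COMPRESSION_BY_TYPE = {"application/zstd": "zstd", "application/json": None}


def build_open_recipe(item: dict, asset_id: str) -> dict:
    """Translate a STAC Item (as a dict) into a recipe for opening one Kerchunk Asset."""
    asset = item["assets"][asset_id]
    if "reference" not in asset.get("roles", []):
        raise ValueError(f"Asset {asset_id!r} is not a Kerchunk reference file")

    # Access control is read from the Item properties; a missing entry is
    # treated as open (an assumption, not a decided default).
    access = item.get("properties", {}).get("access_control", {"rule": "open"})

    # `fo` is the location of the reference file for fsspec's reference filesystem.
    storage_options = {"fo": asset["href"]}
    compression = COMPRESSION_BY_TYPE.get(asset.get("type"))
    if compression:
        # Assumption: fsspec decompresses the reference file when given this option.
        storage_options["target_options"] = {"compression": compression}

    return {
        "requires_auth": access.get("rule") == "restricted",
        "roles": access.get("roles", []),
        "open_dataset_kwargs": {
            "filename_or_obj": "reference://",
            "engine": "zarr",
            "backend_kwargs": {
                "consolidated": False,
                "storage_options": storage_options,
            },
            # Extra keyword arguments stored in the STAC record.
            **asset.get("open_zarr_kwargs", {}),
        },
    }


item = {
    "properties": {"access_control": {"rule": "restricted", "roles": ["reguser"]}},
    "assets": {
        "clt_2060s": {
            "href": "https://dap.ceda.ac.uk/badc/ukcp18/metadata/kerchunks/land-cpm/clt_5km_01_day.zst",
            "roles": ["reference"],
            "type": "application/zstd",
            "open_zarr_kwargs": {"decode_times": True},
        }
    },
}
recipe = build_open_recipe(item, "clt_2060s")

# With credentials in place, the dataset could then be opened lazily with:
#   import xarray as xr
#   ds = xr.open_dataset(**recipe["open_dataset_kwargs"])
```

This sketch does not show how a token or certificate would be attached to the request for restricted data; that would need extra entries in `storage_options`.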