# Data Analysis Workshop

## Tutorial I: data queries, dataset catalog creation


<div style="border-left: 4px solid #0366d6; padding: 0.5em; background-color: #deecfc;">
  ℹ️ If you want to know more about these topics please refer to:
  <ul>
  <li><code>freva-client</code> library <a href='https://example.com' target="_blank">installation</a></li>
  <li>Databrowser <a href="https://freva-org.github.io/freva-nextgen/databrowser/cli.html">command line interface </a></li>
</ul>  
</div>



In this tutorial, we'll learn how to use the `freva-client` library to explore and access available datasets and, at the end, customize the datasets on Freva to our liking.

First, let's see how to install the `freva-client` library:

## Installation

#### The Client Library

| Environment | Installation Command |
|-------------|---------------------|
| DKRZ/Levante (Recommended) | `$ module load clint gems` |
| Conda (Local) | `$ conda create -n freva-client-env -c conda-forge freva-client -y` |
| Python (Local) | `$ pip install freva-client` |


<div style="
  border-left: 6px solid rgb(236, 114, 0);
  background-color:rgb(253, 231, 157);
  color:rgb(19, 19, 18);
  padding: 1em;
  font-size: 110%;
  border-radius: 4px;
  margin: 1em 0;
">
⚠️ <strong>ATTENTION</strong>: For the Freva Databrowser workshop, please open a Terminal tab in Jupyter and run the following:
<pre><code>$ module load clint gems
$ da-workshop-setup
</code></pre>

<br>
Then choose <code>DA Workshop (shell)</code> from the kernel environment list.
Now your environment is ready to start!

</div>




Let's quickly check whether `freva-client` is available in our current kernel environment:

In [1]:
freva-client --version

freva-client: [1;36m2508.0[0m.[1;36m0[0m


## Querying Data

To get started, we'll run a simple analysis on the [MPI Grand Ensemble data](https://mpimet.mpg.de/en/research/modeling/grand-ensemble), a large collection of climate simulations. Our goal is to create an ensemble of globally averaged time series of 2 m air temperature.

The data browser organizes metadata in a **tree-like hierarchy**. At the top of this structure is the **`project`** facet (equivalent to **`mip_era`** in the CMIP6 Data Reference Syntax), and the hierarchy descends from there as follows:
```
.
├── project
│   ├── product
│   │   ├── institute
│   │   │   ├── model
│   │   │   │   ├── experiment
...
```
These facets are organised as `{key: value}` pairs. First and foremost, let's find out which search keys are available:

In [2]:
freva-client databrowser data-overview

Available search flavours:
- freva
- cmip6
- cmip5
- cordex
- nextgems
- user
Search attributes by flavour:
  cmip5:
  - experiment
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - model_id
  - project
  - product
  - realm
  - variable
  - time
  - bbox
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - format
  - grid_id
  - level_type
  cmip6:
  - experiment_id
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - source_id
  - mip_era
  - activity_id
  - realm
  - variable_id
  - time
  - bbox
  - time_aggregation
  - frequency
  - table_id
  - dataset
  - format
  - grid_id
  - level_type
  cordex:
  - experiment
  - ensemble
  - fs_type
  - grid_label
  - institution
  - model
  - project
  - domain
  - realm
  - variable
  - time
  - bbox
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - driving_model
  - format
  - grid_id
  - level_type
  - rcm_name
  - rcm_version
  freva:
  - project
  - product
  - institute


Let's assume we know that the Grand Ensemble data is stored under `mpi-ge`, but we don't know whether that value belongs to `project`, `product`, or another key. The databrowser is here to help: the `--facet` option searches all keys for entries containing a given value, such as `mpi-ge`.

Let's find the project(s) among all search keys (or facets) that contain `mpi-ge`. We can use the `metadata-search` sub-command for that:

In [5]:
freva-client databrowser metadata-search --facet mpi-ge --json | jq -rc '.project|join(", ")'

mpi-ge


Since we want to create a time series of 2 m air temperature we will need to check whether the `tas` (near-surface air temperature) variable is available:

In [6]:
freva-client databrowser metadata-search project=mpi-ge --json | jq -rc '.variable | index("tas") != null'

true


In the same vein, let's query the available output time frequencies:

In [7]:
freva-client databrowser metadata-search project=mpi-ge variable=tas --json | jq -rc ".time_frequency"

["mon"]


Similarly, let's query the available output experiments:

In [4]:
freva-client databrowser metadata-search project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12" --json | jq -c ".experiment"

["picontrol","rcp26","rcp45","rcp85"]


Now we have a rough overview of the available data. We want to cover future scenarios, that is, time steps from today until 2100. To check how many files match this search, we can use the `data-count` sub-command:

In [9]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12"

602
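
The same search keys recur in nearly every command of this tutorial, so it can help to store them once. The following is a minimal sketch and not part of `freva-client`: the array `search`, the variable `time_window`, and the helper `fq` are hypothetical names, and only the sub-commands and the `--time` option are taken from the cells in this tutorial.

```shell
# Search constraints shared by the queries in this tutorial
search=(project=mpi-ge variable=tas time_frequency=mon)
time_window="2025-01 to 2100-12"

# fq SUBCOMMAND [EXTRA ARGS...]  (hypothetical helper, not part of freva-client)
# Runs a databrowser sub-command with the shared constraints plus any extra arguments.
fq() {
    local subcommand=$1
    shift
    freva-client databrowser "$subcommand" "${search[@]}" --time "$time_window" "$@"
}

# Usage, assuming freva-client is on the PATH:
#   fq data-count
#   fq data-count experiment=picontrol
#   fq metadata-search --json
```

Keeping the constraints in a bash array rather than a plain string preserves each `key=value` pair as a separate argument, and quoting `"$time_window"` keeps the time range together as one argument.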


Going back a little bit, we see that the `picontrol` experiment _was_ unexpected! Let's check what is going on. We create a new search and count the files that belong to it:

In [11]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

2


We can get the list of files via `data-search`:

In [5]:
freva-client databrowser data-search project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc
/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_200001-209912.nc
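
The file names hint at why the control run shows up: their time stamps (`200001-209912` and `210001-219912`) overlap the requested window. The sketch below reproduces that check in plain shell. It assumes that the databrowser returns every file whose time range overlaps the query, which is inferred from the result above rather than confirmed here; `overlaps_window` is a hypothetical helper, not a `freva-client` command.

```shell
# overlaps_window FILE START END  (hypothetical helper)
# Succeeds when the YYYYMM-YYYYMM stamp at the end of FILE overlaps [START, END],
# with START and END given as YYYYMM.
overlaps_window() {
    local stamp first last
    stamp=${1##*_}        # e.g. 210001-219912.nc
    stamp=${stamp%.nc}    # e.g. 210001-219912
    first=${stamp%-*}     # 210001
    last=${stamp#*-}      # 219912
    # Two closed intervals overlap when each one starts before the other ends
    [ "$first" -le "$3" ] && [ "$last" -ge "$2" ]
}

for name in tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc \
            tas_Amon_MPI-ESM_piControl_r001i1850p3_200001-209912.nc; do
    if overlaps_window "$name" 202501 210012; then
        echo "$name overlaps 2025-01 to 2100-12"
    fi
done
```

Both file names pass the check, which is consistent with the count of 2 reported above.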


Let's do a reverse search, that is, check which metadata is associated with a given file. For that we use the `file=` parameter:

In [6]:
freva-client databrowser metadata-search file=/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc

ensemble: r001i1850p3
experiment: picontrol
institute: mpi-m
model: mpi-esm
product: output1
project: mpi-ge
realm: atmos
time_aggregation: mean
time_frequency: mon
variable: tas


Since we don't want this pre-industrial control run among our selected datasets we will tell the databrowser to ignore it.

We can prefix a value with `!` to *exclude* it from the search:

In [7]:
experiments=$(freva-client databrowser metadata-search \
  project=mpi-ge \
  variable=tas \
  time_frequency=mon \
  --time="2025-01 to 2100-12" \
  experiment='!picontrol' \
  --json \
  | jq -rc '.experiment | join(" ")')
echo "$experiments"

rcp26 rcp45 rcp85


Now let's try to create a global time series for each of the experiments. We can use the search results of the databrowser to pipe the output directly into `cdo`:

In [8]:
temp_dir=$(mktemp -d --suffix cdo)
for exp in $experiments; do
    outlist=()
    # Let's get only the first 5 ensemble members for brevity
    members=$(freva-client databrowser metadata-search \
        project=mpi-ge variable=tas time_frequency=mon --time="2025-01 to 2100-12" experiment="$exp" --json |
        jq -r '.ensemble | unique | .[:5] | join(" ")')
    for ens in $members; do
        echo -ne "Reading data and calculating TS for experiment $exp in ens: $ens\r"
        # One file path per line; $files is left unquoted below so that cdo receives each path as a separate argument
        files=$(freva-client databrowser data-search project=mpi-ge variable=tas time_frequency=mon --time="2025-01 to 2100-12" experiment="$exp" ensemble="$ens" realm=atmos)
        outfile="$temp_dir/tas_mean_${exp}_${ens}.nc"
        # Concatenate the files in time, then compute the field (global) mean
        cdo -s fldmean -mergetime $files "$outfile"
        outlist+=("$outfile")
    done
    # Collect the time series of all members of this experiment in one file
    cdo mergetime "${outlist[@]}" "$temp_dir/tas_ensemble_${exp}.nc"
done
# Collect all experiments in a single file
cdo mergetime "$temp_dir"/tas_ensemble_*.nc tas_all_experiments.nc


cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.05s 25MB]
cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.03s 25MB]
cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.03s 25MB]
cdo    mergetime: Processed 16920 values from 3 variables over 16920 timesteps [0.08s 37MB]
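
The loop above relies on word splitting of `$files`. That works for these paths, but it would break on paths containing spaces and would hand `cdo` an empty file list whenever a search returns nothing. A hedged alternative is to read the paths into an array first. In the sketch below, `list_files` is a hypothetical stand-in for the `data-search` call and the paths are made up, so the block runs without Freva access; `mapfile` requires bash 4 or newer.

```shell
# Stand-in for `freva-client databrowser data-search ...` (hypothetical helper):
# prints its arguments one per line, like the real search prints one path per line.
list_files() {
    printf '%s\n' "$@"
}

# mapfile stores each output line as one array element, so paths are never split on spaces
mapfile -t files < <(list_files "/tmp/demo dir/tas_a.nc" "/tmp/demo dir/tas_b.nc")

if (( ${#files[@]} == 0 )); then
    echo "No files found, skipping cdo" >&2
else
    # In the loop above this would become:
    #   cdo -s fldmean -mergetime "${files[@]}" "$outfile"
    echo "Would merge ${#files[@]} files"
fi
```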


We can take a quick look at the file:

In [9]:
cdo sinfo tas_all_experiments.nc

   File format : NetCDF
    -1 : Institut Source   T Steptype Levels Num    Points Num Dtype : Parameter ID
     1 : unknown  MPI-ESM  v instant       1   1         1   1  F32  : -1
   Grid coordinates :
     1 : lonlat                   : points=1 (1x1)
                              lon : 0 degrees_east
                              lat : 0 degrees_north
   Vertical coordinates :
     1 : height                   : levels=1  scalar
                           height : 2 m
   Time coordinate :
                             time : 16920 steps
     RefTime =  2005-01-01 00:00:00  Units = days  Calendar = proleptic_gregorian  Bounds = true
  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss
  2006-01-16 12:00:00  2006-01-16 12:00:00  2006-01-16 12:00:00  2006-01-16 12:00:00
  2006-01-16 12:00:00  2006-01-16 12:00

To analyse this data further, for example to plot the ensemble spread and mean for each experiment, we would need another program or programming language.

Please refer to the Python notebook (`Tutorial-py-search-cataloging.ipynb`) for a complete workflow.

## Creating dataset catalogs

Now that we've found our target dataset on Freva, we may want to export its full metadata, either for project partners who do not have direct Freva access or for ourselves, to download and use the data on another HPC system.

In this section we introduce two different types of catalogs:
1. The [**intake-esm**](https://intake-esm.readthedocs.io/en/stable/) catalog provides a lightweight, Python-friendly interface to the metadata of large Earth System Model archives. By pointing to a central JSON index, it lets you discover, filter, and load climate model outputs (such as temperature, precipitation, or ocean variables) without downloading entire datasets. The catalog structure follows the CMIP/ESM conventions, enabling easy subsetting by attributes like project name, variable, experiment, and time period. Once exported as a standalone JSON file, your subsetted catalog can be shared with collaborators who can query and load data locally, with no direct access to the original archive required.


2. The [**STAC (SpatioTemporal Asset Catalog)**](https://stacspec.org/en) static catalog defines a simple, filesystem-based layout for geospatial metadata. A static catalog bundles Catalog, Collection, and Item JSON files into a set of directories that mirror your data hierarchy, with no dynamic search API. Bundling the entire catalog into a ZIP archive makes it trivial to distribute or archive a snapshot of your dataset inventory—satellite imagery, climate projections, or any spatiotemporal assets—for offline use, disaster recovery, or reproducible analyses. Once unzipped, the folder structure and JSON files provide the same discovery semantics as a live STAC endpoint.  


First, we’ll use **intake-esm** to subset by our chosen search key-value pairs:
- project: `mpi-ge`
- variable: `tas`
- time_frequency: `mon`
- time: `'2025-01 to 2100-12'`
- experiment: `picontrol`

In [10]:
freva-client databrowser intake-catalogue project=mpi-ge time_frequency=mon variable=tas --time "2025-01 to 2100-12" experiment=picontrol

{
   "esmcat_version": "0.1.0",
   "attributes": [
      {
         "column_name": "project",
         "vocabulary": ""
      },
      {
         "column_name": "product",
         "vocabulary": ""
      },
      {
         "column_name": "institute",
         "vocabulary": ""
      },
      {
         "column_name": "model",
         "vocabulary": ""
      },
      {
         "column_name": "experiment",
         "vocabulary": ""
      },
      {
         "column_name": "time_frequency",
         "vocabulary": ""
      },
      {
         "column_name": "realm",
         "vocabulary": ""
      },
      {
         "column_name": "variable",
         "vocabulary": ""
      },
      {
         "column_name": "ensemble",
         "vocabulary": ""
      },
      {
         "column_name": "cmor_table",
         "vocabulary": ""
      },
      {
         "column_name": "fs_type",
         "vocabulary": ""
      },
      {
         "column_name": "grid_label",
         "vocabulary": ""
      

We can then save the catalogue to a file, for example as JSON:

In [12]:
freva-client databrowser intake-catalogue project=mpi-ge time_frequency=mon variable=tas --time "2025-01 to 2100-12" experiment=picontrol > intake_catalog.json
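
Before sharing the file, it can be useful to check which metadata columns it describes. The sketch below uses `jq`, as in the earlier cells. To stay runnable without Freva access it works on a trimmed, hand-written stand-in (`$catalog`, a temporary file) that contains only a few of the `attributes` entries shown above; with the real export you would point `jq` at `intake_catalog.json` instead.

```shell
# Trimmed stand-in for intake_catalog.json (only part of the "attributes" list shown above)
catalog=$(mktemp)
cat > "$catalog" <<'EOF'
{
   "esmcat_version": "0.1.0",
   "attributes": [
      {"column_name": "project", "vocabulary": ""},
      {"column_name": "variable", "vocabulary": ""},
      {"column_name": "ensemble", "vocabulary": ""}
   ]
}
EOF

# List the metadata columns described by the catalogue, one per line
columns=$(jq -r '.attributes[].column_name' "$catalog")
echo "$columns"
rm -f "$catalog"
```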

<br>

We’ll now perform the same operation on a **STAC static catalog**: download the entire catalog as a ZIP archive so you can share or inspect it offline.


In [None]:
freva-client databrowser stac-catalogue project=mpi-ge time_frequency=mon variable=tas --time "2025-01 to 2100-12" experiment=picontrol

<br>

To round off our explanation of STAC: a **STAC static catalog** is implemented as a set of flat files on a web server or object store (e.g., S3). It exposes the same Item, Catalog, and Collection JSON structure as a dynamic STAC API, but without a `/search` endpoint, which makes it easy to bundle and distribute as a ZIP for disaster recovery or offline use.