# Data Analysis Workshop

## Tutorial I: Searching, Accessing, and Cataloging the Dataset

In this tutorial, we’ll learn how to use the `freva-client` library to explore and access the available datasets and, at the end, how to catalog a selection of the data to suit our own needs.
To get started, we’ll run a simple analysis on the [MPI Grand Ensemble data](https://mpimet.mpg.de/en/research/modeling/grand-ensemble), a large collection of climate simulations.
The data browser organizes metadata in a **tree-like hierarchy**. At the top of this structure is the **`project`** level.

First, let's set up the environment and learn about authentication; then we'll continue with the rest.

## Installation

#### The Client Library

| Environment | Installation Command |
|-------------|---------------------|
| DKRZ/Levante (Recommended) | `$ module load clint gems` (Terminal) <br> `In [1]: !module load clint gems` (JupyterHub) |
| Conda (Local) | `$ conda create -n freva-client-env -c conda-forge freva-client -y` |
| Python (Local) | `$ pip install freva-client` |


**ATTENTION**: For the Freva Databrowser workshop, please open a new Terminal in Jupyter and run the following:
```bash
$ module load clint gems
$ da-workshop-setup
```
Then choose `DA Workshop (shell)` from the kernel list.
Your environment is now ready!


In [None]:
export PATH=/sw/spack-levante/cdo-2.2.2-4z4icb/bin:$PATH

Now let's quickly check whether `freva-client` is available in our current kernel environment:

In [None]:
freva-client auth --version

Next, let's authenticate to Freva with the `freva-client auth` subcommand.

## Querying Data

First and foremost, let's find out which search keys are available on the system:

In [None]:
freva-client databrowser data-overview

Let's assume we know that the Grand Ensemble data is stored under `mpi-ge`, but we don't know whether that value belongs to `project`, `product`, or another search key. The databrowser is here to help: the `--facet` option searches all search keys (or facets) for entries containing a certain value, such as `mpi-ge`. Let's list the project(s) of all entries that match `mpi-ge`:

In [None]:
freva-client databrowser metadata-search --facet mpi-ge --json | jq -rc '.project|join(", ")'

Let's create a time series of 2 m air temperature. To do so, we first have to check whether the `tas` variable is available. We can use the `metadata-search` subcommand:

In [None]:
freva-client databrowser metadata-search project=mpi-ge --json | jq -rc '.variable | index("tas") != null'
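
The `jq` filters in the two cells above are easier to follow on a small example. The sketch below runs them on a hand-written JSON document. It assumes that `metadata-search --json` returns an object mapping each search key to a list of values, which is what the filters above imply. The values in `sample` are made up for illustration and are not real query output.

```bash
# Mock payload, NOT real freva-client output: one list of values per search key.
sample='{"project": ["mpi-ge"], "variable": ["pr", "tas"], "time_frequency": ["mon"]}'

if command -v jq >/dev/null 2>&1; then
    # Join all values of one key into a single comma-separated line.
    projects=$(jq -rc '.project | join(", ")' <<<"$sample")
    # index("tas") yields the position of "tas" in the list, or null if absent,
    # so comparing against null turns it into a yes/no answer.
    has_tas=$(jq -rc '.variable | index("tas") != null' <<<"$sample")
    has_psl=$(jq -rc '.variable | index("psl") != null' <<<"$sample")
    echo "projects: $projects"
    echo "tas available: $has_tas"
    echo "psl available: $has_psl"
else
    echo "jq is not installed; skipping the example" >&2
fi
```

With this mock input, `tas` is reported as available and `psl` is not, because only `tas` appears in the `variable` list.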

Let's query the available output time frequencies:

In [None]:
freva-client databrowser metadata-search project=mpi-ge variable=tas --json | jq -rc ".time_frequency"

Now that we have a rough overview of the available data, let's narrow the search down. We want to cover future scenarios, that is, time steps from today until 2100, so we add the `--time` option:

In [None]:
freva-client databrowser metadata-search project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12"

To check how many files were found, we can use the `data-count` subcommand with the same search parameters:

In [None]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12"

As before, we can use `metadata-search` to inspect the metadata of this narrowed search, for example the experiments it covers:

In [None]:
freva-client databrowser metadata-search project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12" --json | jq -c ".experiment"

The `picontrol` experiment is unexpected. Let's check what is going on. We create a new search restricted to that experiment and count the matching files:

In [None]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

To list the files themselves, we use the `data-search` subcommand:

In [None]:
freva-client databrowser data-search project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

Let's do a reverse search, that is, check which metadata is associated with a file:

In [None]:
freva-client databrowser metadata-search file=/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc
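
File names in this archive appear to follow the CMIP5 pattern `<variable>_<table>_<model>_<experiment>_<ensemble>_<start>-<end>.nc`. Assuming that pattern, the sketch below splits the file name from the cell above with plain bash and tests whether its time span overlaps our search window. The overlap rule is an assumption about how the `--time` filter selects files, not behaviour documented in this tutorial.

```bash
file=/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc

# Strip directory and extension, then split the name on underscores.
name=$(basename "$file" .nc)
IFS=_ read -r variable table model experiment ensemble timerange <<<"$name"

start=${timerange%-*}   # part before the dash
end=${timerange#*-}     # part after the dash

# Search window used in this tutorial, written as YYYYMM.
query_start=202501
query_end=210012

# Two intervals overlap if each one starts before the other one ends.
# The 10# prefix forces base 10 so that leading zeros are not read as octal.
if (( 10#$start <= 10#$query_end && 10#$end >= 10#$query_start )); then
    overlap=yes
else
    overlap=no
fi
echo "$experiment $ensemble covers $start-$end, overlaps search window: $overlap"
```

Under that assumption, the file matches because it starts in 2100-01, which still lies inside the requested window ending in 2100-12.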

Since we don't want this pre-industrial control run in our search, we tell the databrowser to exclude it. A leading `!` negates a value, so `experiment='!picontrol'` matches every experiment except `picontrol`:

In [None]:
experiments=$(freva-client databrowser metadata-search \
  project=mpi-ge \
  variable=tas \
  time_frequency=mon \
  --time="2025-01 to 2100-12" \
  experiment='!picontrol' \
  --json \
  | jq -rc '.experiment | join(" ")')
echo "$experiments"
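
The same constraints are repeated in almost every cell. One way to avoid retyping them, sketched here with a hypothetical `search_args` array, is to store them once and expand the array wherever they are needed. The single quotes around the negated value keep an interactive bash from treating `!` as a history expansion. The sketch only prints the command it would run and does not contact the server.

```bash
# Hypothetical helper array; each element is passed on as exactly one argument,
# so the time range with its spaces stays together.
search_args=(
    project=mpi-ge
    variable=tas
    time_frequency=mon
    --time "2025-01 to 2100-12"
    'experiment=!picontrol'
)

# Print the full command, shell-quoted, instead of executing it.
printf -v preview '%q ' freva-client databrowser data-count "${search_args[@]}"
echo "$preview"
```

To run the search for real, replace the `printf` and `echo` lines with `freva-client databrowser data-count "${search_args[@]}"`.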

Now let's create a global time series for each of the experiments. We can pass the search results of the databrowser directly to `cdo`:

In [None]:
temp_dir=$(mktemp -d --suffix cdo)
for exp in $experiments; do
    outlist=()
    # Use only the first 5 ensemble members for brevity
    members=$(freva-client databrowser metadata-search \
        project=mpi-ge variable=tas time_frequency=mon \
        --time="2025-01 to 2100-12" experiment="$exp" --json |
        jq -r '.ensemble | unique | .[:5] | join(" ")')
    for ens in $members; do
        echo -ne "Reading data and calculating TS for experiment $exp in ens: $ens\r"
        files=$(freva-client databrowser data-search project=mpi-ge variable=tas \
            time_frequency=mon --time="2025-01 to 2100-12" \
            experiment="$exp" ensemble="$ens" realm=atmos)
        outfile="$temp_dir/tas_mean_${exp}_${ens}.nc"
        # Merge the member's files along time, then compute the field mean.
        # $files stays unquoted so that each path becomes its own argument.
        cdo -s fldmean -mergetime $files "$outfile"
        outlist+=("$outfile")
    done
    # Combine the per-member time series of this experiment
    cdo mergetime "${outlist[@]}" "$temp_dir/tas_ensemble_${exp}.nc"
done
# Combine the per-experiment files into one output file
cdo mergetime "$temp_dir"/tas_ensemble_*.nc tas_all_experiments.nc
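
The loop above leaves its intermediate files in `$temp_dir`. A minimal sketch of the same bookkeeping pattern with automatic cleanup follows. It uses empty placeholder files instead of real `cdo` output, and the names `work_dir` and `cleanup` as well as the member labels are hypothetical.

```bash
# Create a private scratch directory and remove it when the shell exits.
work_dir=$(mktemp -d)
cleanup() { rm -rf "$work_dir"; }
trap cleanup EXIT

outlist=()
for ens in r001 r002 r003; do          # placeholder member names
    outfile="$work_dir/tas_mean_${ens}.nc"
    : > "$outfile"                      # stand-in for the cdo call
    outlist+=("$outfile")               # remember every file that was written
done

# Guard against an empty list before handing the files to the next tool.
if (( ${#outlist[@]} == 0 )); then
    echo "no files to merge" >&2
else
    echo "would merge ${#outlist[@]} files from $work_dir"
fi
```

In a notebook, the `EXIT` trap fires when the kernel shuts down, so the intermediate files remain available for the rest of the session.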


We can inspect the merged output file with `cdo sinfo`:

In [None]:
cdo sinfo tas_all_experiments.nc

## Cataloging Datasets

Now that you’ve located your target dataset on Freva, you may want to export its full metadata, either for project partners who don’t have direct Freva access or to download it and use it on another HPC system.

In this section we introduce two different types of catalogs:
1. The **intake-esm** catalog provides a lightweight, Python-friendly interface to the metadata of large Earth System Model archives. By pointing to a central JSON index, it lets you discover, filter, and load climate model outputs (such as temperature, precipitation, or ocean variables) without downloading entire datasets. The catalog structure follows the CMIP/ESM conventions, enabling easy subsetting by attributes like project name, variable, experiment, and time period. Once exported as a standalone JSON file, your subsetted catalog can be shared with collaborators, who can then query it locally without access to the Freva databrowser.


2. The **STAC (SpatioTemporal Asset Catalog)** static catalog defines a simple, filesystem-based layout for geospatial metadata. A static catalog bundles Catalog, Collection, and Item JSON files into a set of directories that mirror your data hierarchy, with no dynamic search API. Bundling the entire catalog into a ZIP archive makes it trivial to distribute or archive a snapshot of your dataset inventory—satellite imagery, climate projections, or any spatiotemporal assets—for offline use, disaster recovery, or reproducible analyses. Once unzipped, the folder structure and JSON files provide the same discovery semantics as a live STAC endpoint.  


First, we’ll create an [intake-esm](https://intake-esm.readthedocs.io/en/stable/) catalog, subset by our chosen search keys:
- project: `mpi-ge`
- time_frequency: `mon`
- variable: `tas`
- time: `'2025-01 to 2100-12'`
- experiment: `picontrol`


In [None]:
freva-client databrowser intake-catalogue project=mpi-ge time_frequency=mon variable=tas --time "2025-01 to 2100-12" experiment=picontrol

We’ll now perform the same operation on a STAC static catalog: download the entire catalog as a ZIP archive so you can share or inspect it offline.


In [None]:
freva-client databrowser stac-catalogue project=mpi-ge time_frequency=mon variable=tas --time "2025-01 to 2100-12" experiment=picontrol

To complete our explanation of STAC catalogs: a **static catalog** is implemented as a set of flat files on a web server or object store (e.g., S3). It exposes the same Item, Catalog, and Collection JSON structure as a dynamic STAC API, but without a `/search` endpoint, which makes it easy to bundle and distribute as a ZIP archive for disaster recovery or offline use.
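
To make the idea of a catalog as a set of flat files concrete, the sketch below writes a tiny static catalog by hand. It is heavily simplified and is not the output of `stac-catalogue`: the directory layout, ids, descriptions, license, and extent values are placeholders chosen for illustration, and Item files are left out.

```bash
# Placeholder layout: one root catalog that links to one collection.
catalog_dir=$(mktemp -d)
mkdir -p "$catalog_dir/mpi-ge"

cat > "$catalog_dir/catalog.json" <<'EOF'
{
  "type": "Catalog",
  "stac_version": "1.0.0",
  "id": "example-catalog",
  "description": "Hand-written example of a static STAC catalog",
  "links": [
    {"rel": "root", "href": "./catalog.json"},
    {"rel": "child", "href": "./mpi-ge/collection.json"}
  ]
}
EOF

cat > "$catalog_dir/mpi-ge/collection.json" <<'EOF'
{
  "type": "Collection",
  "stac_version": "1.0.0",
  "id": "mpi-ge",
  "description": "Placeholder collection",
  "license": "proprietary",
  "extent": {
    "spatial": {"bbox": [[-180, -90, 180, 90]]},
    "temporal": {"interval": [["2025-01-01T00:00:00Z", "2100-12-31T23:59:59Z"]]}
  },
  "links": [
    {"rel": "root", "href": "../catalog.json"},
    {"rel": "parent", "href": "../catalog.json"}
  ]
}
EOF

# A client discovers content by following the "child" links from the root.
find "$catalog_dir" -name '*.json' | sort
```

Because every link is a relative path, the directory can be moved or archived as a whole without breaking the catalog.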