# Data Analysis Workshop
## Tutorial I: Querying and Accessing Data

In this tutorial, we’ll learn how to use the `freva-client` library to explore available datasets.

To get started, we’ll run a simple analysis on the [MPI Grand Ensemble data](https://mpimet.mpg.de/en/research/modeling/grand-ensemble), a large collection of climate simulations.

The data browser organizes metadata in a **tree-like hierarchy**. At the top of this structure is the **`project`** level.

Let’s begin by finding out which projects are currently available in the system.

In [69]:
export PATH=/sw/spack-levante/cdo-2.2.2-4z4icb/bin:$PATH
freva-client databrowser --help


 Usage: freva-client databrowser [OPTIONS] COMMAND [ARGS]...

 Data search related commands

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ data-overview      Get an overview over what is available in the

Let's get an overview:

In [70]:
freva-client databrowser data-overview

Available search flavours:
- freva
- cmip6
- cmip5
- cordex
- nextgems
- user
Search attributes by flavour:
  cmip5:
  - experiment
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - model_id
  - project
  - product
  - realm
  - variable
  - time
  - bbox
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - format
  - grid_id
  - level_type
  cmip6:
  - experiment_id
  - member_id
  - fs_type
  - grid_label
  - institution_id
  - source_id
  - mip_era
  - activity_id
  - realm
  - variable_id
  - time
  - bbox
  - time_aggregation
  - frequency
  - table_id
  - dataset
  - format
  - grid_id
  - level_type
  cordex:
  - experiment
  - ensemble
  - fs_type
  - grid_label
  - institution
  - model
  - project
  - domain
  - realm
  - variable
  - time
  - bbox
  - time_aggregation
  - time_frequency
  - cmor_table
  - dataset
  - driving_model
  - format
  - grid_id
  - level_type
  - rcm_name
  - rcm_version
  freva:
  - project
  - product
  - institute


Suppose we know that the Grand Ensemble data is stored under `mpi-ge`, but not whether `mpi-ge` is a `project`, a `product`, or something else. The databrowser can help: the `--facet` option searches all entries for a given value, such as `mpi-ge`. To fine-tune the output we add the `--json` flag and process the result with `jq`, a command-line JSON processor. Let's list the project(s) among the search keys (or facets) that contain `mpi-ge`:

In [71]:
freva-client databrowser metadata-search --host https://www.freva.dkrz.de --facet mpi-ge --json | jq -rc '.project|join(", ")'

mpi-ge


Let's create a time series of 2 m air temperature. To do so, we first have to check whether the `tas` variable is available. We can use the `metadata-search` subcommand and let `jq` test whether `tas` is among the variables:

In [72]:
freva-client databrowser metadata-search project=mpi-ge --json | jq -rc '.variable | index("tas") != null'

true


Let's query the available output time frequencies. Again, we pipe the `--json` output to `jq`:

In [73]:
freva-client databrowser metadata-search project=mpi-ge variable=tas --json | jq -rc ".time_frequency"

["mon"]


We now have a rough overview of the available data. We want to cover future scenarios, that is, time steps from today until 2100. Let's check how many data files we have:

In [74]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon time="2025-01 to 2100-12"

602


Let's check the experiments that match this query:

In [75]:
freva-client databrowser metadata-search project=mpi-ge variable=tas time_frequency=mon --time "2025-01 to 2100-12" --json | jq -c ".experiment"

["picontrol","rcp26","rcp45","rcp85"]


The `picontrol` experiment is unexpected. Let's find out what is going on, starting with how many files carry the `picontrol` experiment key:

In [76]:
freva-client databrowser data-count project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

2


And those are the files:

In [77]:
freva-client databrowser data-search project=mpi-ge variable=tas time_frequency=mon experiment=picontrol --time "2025-01 to 2100-12"

/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc
/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_200001-209912.nc
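
The file names hint at why the control run shows up: its model years run past 2000, so they intersect the requested window. The sketch below extracts the time span from one of the file names and tests it for overlap with the query. It assumes the `_YYYYMM-YYYYMM.nc` suffix seen in the listing, and it assumes that the databrowser matches files whose time span overlaps the request, which the output above suggests but does not confirm.

```shell
# File name copied from the search result above.
file="tas_Amon_MPI-ESM_piControl_r001i1850p3_200001-209912.nc"

# Assumption: the name ends in _YYYYMM-YYYYMM.nc, as in the listing above.
span="${file##*_}"     # strip everything up to the last "_": 200001-209912.nc
span="${span%.nc}"     # strip the extension: 200001-209912
start="${span%-*}"     # part before the dash: 200001
end="${span#*-}"       # part after the dash: 209912

# Requested window from the query, written as YYYYMM integers.
want_start=202501
want_end=210012

# Two closed intervals overlap when each one starts before the other ends.
if (( start <= want_end && end >= want_start )); then
    overlaps=yes
else
    overlaps=no
fi
echo "$file covers $start-$end, overlaps request: $overlaps"
```

The second file, covering `210001-219912`, passes the same test because its first month falls inside the requested window.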


Let's do a reverse search, that is, check what metadata is associated with a file:

In [78]:
freva-client databrowser metadata-search file=/work/mh1007/CMOR/MPI-GE/output1/MPI-M/MPI-ESM/piControl/mon/atmos/tas/r001i1850p3/v20190123/tas_Amon_MPI-ESM_piControl_r001i1850p3_210001-219912.nc

ensemble: r001i1850p3
experiment: picontrol
institute: mpi-m
model: mpi-esm
product: output1
project: mpi-ge
realm: atmos
time_aggregation: mean
time_frequency: mon
variable: tas


We don't want this pre-industrial control run in our search results, so we tell the databrowser to exclude it. Prefixing a value with `!` means *not* this value:

In [79]:
experiments=$(freva-client databrowser metadata-search project=mpi-ge variable=tas time_frequency=mon --time="2025-01 to 2100-12"  experiment='!picontrol' --json| jq -rc '.experiment| join(" ")')
echo $experiments

rcp26 rcp45 rcp85
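
Because the same constraints recur in every command, it can help to keep them in a bash array. The following is a minimal sketch that uses only the subcommands, search keys, and `--time` option shown in this tutorial; the variable names `query`, `time_range`, and `cmd` are illustrative. It prints the assembled command lines instead of executing them, so it runs without `freva-client` installed.

```shell
# Search constraints taken from the commands above. Single quotes keep an
# interactive bash from treating "!" as history expansion.
query=(project=mpi-ge variable=tas time_frequency=mon 'experiment=!picontrol')
time_range="2025-01 to 2100-12"

# Assemble the full command line for each subcommand used in this tutorial.
for sub in data-count metadata-search data-search; do
    cmd=(freva-client databrowser "$sub" "${query[@]}" --time "$time_range")
    # Print a shell-quoted version of the command; to actually run the
    # search, replace the two printf lines with: "${cmd[@]}"
    printf '%q ' "${cmd[@]}"
    printf '\n'
done
```

Storing the constraints as array elements keeps the time range, which contains spaces, as a single argument when the array is expanded with `"${query[@]}"` and `"$time_range"`.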


Now let's create a global time series for each experiment. We can pass the databrowser search results directly to `cdo`:

In [None]:
# Scratch directory for intermediate files
temp_dir=$(mktemp -d --suffix cdo)
for exp in $experiments; do
    outlist=()
    # Use only the first 5 ensemble members for brevity
    members=$(freva-client databrowser metadata-search \
        project=mpi-ge variable=tas time_frequency=mon --time="2025-01 to 2100-12" experiment="$exp" --json |
        jq -r '.ensemble | unique | .[:5] | join(" ")')
    for ens in $members; do
        echo -ne "Reading data and calculating TS for experiment $exp in ens: $ens\r"
        files=$(freva-client databrowser data-search project=mpi-ge variable=tas time_frequency=mon --time="2025-01 to 2100-12" experiment="$exp" ensemble="$ens" realm=atmos)
        outfile="$temp_dir/tas_mean_${exp}_${ens}.nc"
        # $files stays unquoted on purpose: each path must become its own argument.
        # Concatenate the member's files in time, then take the area-weighted field mean.
        cdo -s fldmean -mergetime $files "$outfile"
        outlist+=("$outfile")
    done
    # Combine the members of this experiment into one file
    cdo mergetime "${outlist[@]}" "$temp_dir/tas_ensemble_${exp}.nc"
done
# Combine all experiments into the final file
cdo mergetime "$temp_dir"/tas_ensemble_*.nc tas_all_experiments.nc

cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.03s 25MB]
cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.03s 25MB]
cdo    mergetime: Processed 5640 values from 5 variables over 5640 timesteps [0.03s 25MB]
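
The step counts can be checked with a little arithmetic. `mergetime` joins its inputs along the time axis, so the five members of an experiment end up in one long series. The sketch below assumes that each scenario member holds monthly data from 2006-01 through 2099-12; that span is inferred from the 5640 steps reported above and is not stated by `cdo`.

```shell
# Assumption: each member covers 2006-01 through 2099-12 at monthly resolution.
years=$(( 2099 - 2006 + 1 ))          # 94 years
steps_per_member=$(( years * 12 ))    # 1128 monthly steps
members=5                             # first five ensemble members
n_experiments=3                       # rcp26, rcp45, rcp85

# Members are joined along time, then experiments are joined again.
steps_per_experiment=$(( steps_per_member * members ))
steps_total=$(( steps_per_experiment * n_experiments ))
echo "$steps_per_experiment $steps_total"   # prints "5640 16920"
```

The first number matches the `mergetime` messages above. The second is the length of the time axis to expect in `tas_all_experiments.nc`, where each date appears once per member and experiment.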


In [99]:
cdo sinfo tas_all_experiments.nc

   File format : NetCDF
    -1 : Institut Source   T Steptype Levels Num    Points Num Dtype : Parameter ID
     1 : unknown  MPI-ESM  v instant       1   1         1   1  F32  : -1
   Grid coordinates :
     1 : lonlat                   : points=1 (1x1)
                              lon : 0 degrees_east
                              lat : 0 degrees_north
   Vertical coordinates :
     1 : height                   : levels=1  scalar
                           height : 2 m
   Time coordinate :
                             time : 16920 steps
     RefTime =  2005-01-01 00:00:00  Units = days  Calendar = proleptic_gregorian  Bounds = true
  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss
  2006-01-16 12:00:00  2006-01-16 12:00:00  2006-01-16 12:00:00  2006-01-16 12:00:00
  2006-01-16 12:00:00  2006-01-16 12:00