# EERIE Phase 1 production simulation data available at DKRZ Levante

This notebook gives an overview of the EERIE Phase 1 production simulation data available on Levante, DKRZ's high-performance computer. We browse the [EERIE intake catalog](https://github.com/eerie-project/intake_catalogues/tree/main), search for phase 1 models, experiments and versions, and collect information from the datasets.

## Phase 1 simulation definition

- **Source**

    The source refers to the EERIE Earth System Model that was used to generate the phase 1 production experiments.

- **Experiments**

    For each source, we define a set of phase 1 production experiments for which output is available. Each experiment entry contains
    - the latest version. Versions in the catalogue distinguish between preliminary and production simulations. If several versions are listed, the last one is reported as the latest.
    - an example dataset that is used to derive statistical information such as the simulated years

In [1]:
# "example" must be defined for every experiment
# "version" is optional; if given, it must be a list
phase1_simulations={
    "ifs-fesom2-sr":[dict(
        experiment="eerie-spinup-1950",
        version=["v20240304"],
        example="ocean.native.daily"
    )],
    "icon-esm-er":[dict(
        experiment="eerie-spinup-1950",
        version=["v20240618"],
        example="ocean.native.2d_daily_mean"
    )],
    "ifs-amip":[],
    "ifs-nemo":[],
    "hadgem3-gc5-n640-orca12":[dict(
        experiment="eerie-picontrol",
        example="atmos.native.atmos_monthly_emon"
    )],
    "hadgem3-gc5-n216-orca025":[dict(
        experiment="eerie-picontrol",
        example="atmos.native.atmos_monthly_emon"
    )]
}
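
The two requirements stated in the comments of this cell can be checked before the catalogue is touched. The following is a minimal sketch, not part of the original notebook: the helper `validate_simulations` and the `broken-model` entry are hypothetical and only illustrate the expected structure.

```python
def validate_simulations(simulations):
    """Return a list of problems; an empty list means the mapping is well-formed."""
    problems = []
    for source_id, experiments in simulations.items():
        for idx, entry in enumerate(experiments):
            label = f"{source_id}[{idx}]"
            # Every experiment entry needs an example dataset name.
            if not entry.get("example"):
                problems.append(f"{label}: 'example' must be defined")
            # 'version' is optional, but if present it has to be a list.
            version = entry.get("version")
            if version is not None and not isinstance(version, list):
                problems.append(f"{label}: 'version' must be a list")
    return problems


sample = {
    "ifs-fesom2-sr": [dict(
        experiment="eerie-spinup-1950",
        version=["v20240304"],
        example="ocean.native.daily",
    )],
    "ifs-amip": [],
    # Hypothetical malformed entry: no example, version given as a string.
    "broken-model": [dict(experiment="some-experiment", version="v1")],
}

problems = validate_simulations(sample)
for problem in problems:
    print(problem)
```

Running such a check first would surface a malformed entry before any dataset is opened.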

## Statistics

- No of xarray datasets

    The number of xarray datasets per phase 1 simulation equals the number of entries in the intake catalogue for that simulation.
    
- No of variables 

    The number of variables per phase 1 simulation is computed by summing up `len(ds.data_vars)` over all xarray datasets `ds` of a simulation. "*Variables*" are therefore combinations of *aggregation* and variable name, similar to the definition of a *CMOR variable*. For example, 2m temperature is counted multiple times if it is written to multiple datasets, i.e. for multiple aggregations.
    
- Size in memory [TB]

    The size in memory per phase 1 simulation is computed by summing up `ds.nbytes` over all xarray datasets `ds` of a simulation and dividing by 1024^4. This is the uncompressed size and does not reflect the actual volume on disk because the datasets can be stored in compressed form.
    
- Start simulation year

    The start simulation year is the first year of the *example* dataset.
    
- End simulation year

    The end simulation year is the last year of the *example* dataset.
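
As an illustration of these definitions, the following sketch computes the per-dataset quantities for a small synthetic dataset. The dataset, its variable names (`tas`, `pr`) and the `cell` dimension are made up for this example, and the years are taken from `ds["time"].dt.year` rather than the `groupby` call used later in the notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for one catalogue entry: two variables, daily values for 1950-1951.
time = pd.date_range("1950-01-01", "1951-12-31", freq="D")
ds = xr.Dataset(
    {
        "tas": (("time", "cell"), np.zeros((time.size, 4), dtype="float32")),
        "pr": (("time", "cell"), np.zeros((time.size, 4), dtype="float32")),
    },
    coords={"time": time},
)

# Contribution of this dataset to "No of variables".
n_variables = len(ds.data_vars)

# Contribution to "Size in memory [TB]": uncompressed bytes divided by 1024**4.
size_tib = ds.nbytes / 1024**4

# Start and end simulation year from the time axis.
years = ds["time"].dt.year
start_year = int(years.min())
end_year = int(years.max())

print(n_variables, start_year, end_year)
```

For a whole simulation, the first two quantities would be summed over all of its datasets, while the years are read from the example dataset only.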

In [2]:
import intake
eerie_cat=intake.open_catalog(
    "https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/dkrz/disk/model-output/main.yaml"
)

In [3]:
def find_data_sources(catalog,name=None):
    """Recursively collect the dotted names of all data sources below `catalog`."""
    newname='.'.join(
        [ a 
         for a in [name, catalog.name]
         if a
        ]
    )
    data_sources = []

    for key, entry in catalog.items():
        if isinstance(entry, intake.catalog.Catalog):
            if newname == "main":
                newname = None
            # If the entry is a subcatalog, recursively search it
            data_sources.extend(find_data_sources(entry, newname))
        elif isinstance(entry, intake.source.base.DataSource):
            if newname:
                data_sources.append(newname+"."+key)
            else:
                data_sources.append(key)

    return data_sources
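
The recursion in `find_data_sources` can be illustrated without access to the catalogue. The sketch below uses plain nested dictionaries as a stand-in for sub-catalogues and strings as a stand-in for data sources; `find_leaf_names` and `toy_catalog` are hypothetical names, and unlike the function above the prefix is built from the dictionary keys rather than from `catalog.name`.

```python
def find_leaf_names(tree, prefix=None):
    """Return the dotted names of all leaves in a nested dict."""
    names = []
    for key, entry in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(entry, dict):
            # A nested dict plays the role of a sub-catalogue: descend into it.
            names.extend(find_leaf_names(entry, name))
        else:
            # Anything else plays the role of a data source.
            names.append(name)
    return names


toy_catalog = {
    "ocean": {"native": {"daily": "source", "monthly": "source"}},
    "atmos": {"native": {"monthly": "source"}},
}

leaf_names = find_leaf_names(toy_catalog)
print(leaf_names)
```

The resulting dotted names have the same shape as the `example` entries defined above, such as `ocean.native.daily`.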

In [4]:
from copy import deepcopy as copy
dflist=[]
for source_id,experiments in phase1_simulations.items():
    print(source_id)
    for idx,experiment in enumerate(experiments):
        sdict=dict(source=source_id)
        datasets={}        
        cat_source=eerie_cat[source_id]
        exp_id=experiment["experiment"]        
        sdict["experiment"]=exp_id
        version=experiment.get("version",None)
        sdict["version"]="latest"
        cat_sourceexp=cat_source[exp_id]
        if version:
            sdict["version"]=version[-1]
            for vid in version:
                # Look up each version in the experiment catalogue itself,
                # so that more than one version can be listed
                cat_version=cat_sourceexp[vid]
                for ds in find_data_sources(cat_version):
                    datasets['.'.join([
                        source_id,
                        exp_id,
                        ds
                    ])]=cat_version['.'.join(ds.split('.')[1:])].to_dask()
        else:
            for ds in find_data_sources(cat_sourceexp):
                datasets['.'.join([
                    source_id,
                    ds
                ])]=cat_sourceexp['.'.join(ds.split('.')[1:])].to_dask()        
        phase1_simulations[source_id][idx]["datasets"]=copy(datasets)
        #
        #assume datasets is the latest version
        #
        no_of_variables=0
        size=0
        sdict["No of xarray datasets"]=len(datasets)
        exds=None
        for name,ds in datasets.items():
            if experiment["example"] in name:
                exds=ds
            if version:
                if sdict["version"] in name:
                    size+=ds.nbytes
                    no_of_variables+=len(ds.data_vars)
            else:
                size+=ds.nbytes
                no_of_variables+=len(ds.data_vars)
        sdict["No of variables"]=no_of_variables
        sdict["Size in memory [TB]"]=size/1024**4
        
        years=exds["time"].groupby("time.year").groups        
        sdict["Start simulation year"]=list(years.keys())[0]
        sdict["End simulation year"]=list(years.keys())[-1]

        dflist.append(copy(sdict))

ifs-fesom2-sr
icon-esm-er
ifs-amip
ifs-nemo
hadgem3-gc5-n640-orca12


hadgem3-gc5-n216-orca025



In [5]:
import pandas as pd
sourcedf=pd.DataFrame(dflist)#.transpose()

In [6]:
sourcedf

| | source | experiment | version | No of xarray datasets | No of variables | Size in memory [TB] | Start simulation year | End simulation year |
|---|---|---|---|---|---|---|---|---|
| 0 | ifs-fesom2-sr | eerie-spinup-1950 | v20240304 | 12 | 86 | 31.30358 | 1950 | 1980 |
| 1 | icon-esm-er | eerie-spinup-1950 | v20240618 | 27 | 334 | 846.091654 | 1950 | 2000 |
| 2 | hadgem3-gc5-n640-orca12 | eerie-picontrol | latest | 9 | 117 | 3.41377 | 1851 | 1900 |
| 3 | hadgem3-gc5-n216-orca025 | eerie-picontrol | latest | 9 | 115 | 1.67103 | 1851 | 1980 |
