This notebook is an example of using catalog builder from a notebook to generate data catalogs, a.k.a. intake-esm catalogs.
It was tested on a GFDL workstation.

How to get here? 

Log in to your workstation at GFDL.
module load miniforge
conda activate catalogbuilder 
(Note: you can either install your own environment using the commands that follow, or activate an existing environment such as this one: conda activate /nbhome/Aparna.Radhakrishnan/conda/envs/catalogbuilder)

To create your own environment instead (this is the primary method):

conda create -n catalogbuilder 
conda install catalogbuilder -c noaa-gfdl -n catalogbuilder

Next, register your environment as a kernel so that it is available in jupyter-lab.

pip install ipykernel 
python -m ipykernel install --user --name=catalogbuilder

Now, start a jupyter-lab session from your GFDL workstation:

jupyter-lab 

This prints the URL of the jupyter-lab session running on your localhost. Open the URL in your web browser (or via TigerVNC). Paste in the cells from this notebook, or open the notebook from the path where you downloaded or cloned it via git. Go to Kernel -> Change Kernel and choose catalogbuilder, the kernel name registered with ipykernel in the previous step.
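Once a kernel is selected, a quick cell can confirm that the notebook is really running inside the catalogbuilder environment. The following is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, not part of catalogbuilder.

```python
import importlib.util
import sys


def describe_environment(package_name):
    """Report which interpreter is running and whether a package is importable."""
    spec = importlib.util.find_spec(package_name)
    return {
        "python": sys.executable,      # path of the interpreter behind this kernel
        "package": package_name,
        "importable": spec is not None,
    }


# "catalogbuilder" is the package installed in the steps above.
info = describe_environment("catalogbuilder")
print(info)
```

If `importable` is False, the kernel is most likely not the environment created earlier, and the kernel selection is worth rechecking.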

Run the notebook and see the results! Extend it and share it with us via a GitHub issue.


In [1]:
import sys, os 
git_package_dir = '/home/a1r/git/forkCatalogBuilder-/'
sys.path.append(git_package_dir)

import catalogbuilder
from catalogbuilder.scripts import gen_intake_gfdl
print(gen_intake_gfdl.__file__)

######USER input begins########

#User provides the input directory for which a data catalog needs to be generated.

input_path = "/archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/"
#/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp/"

#User provides the output path. With the setting below, expect /home/a1r/tests/static-catalog.csv and /home/a1r/tests/static-catalog.json to be generated as output.

output_path = "/home/a1r/tests/static-catalog"
#NOTE: If your input_path does not follow the general layout shown above, you will need to pass a custom config (--config)
#written for your directory structure. See examples below.
####END OF user input ##########



/home/a1r/git/forkCatalogBuilder-/catalogbuilder/scripts/gen_intake_gfdl.py
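Judging by the log later in this notebook, the catalog files are named by appending `.csv` and `.json` to `output_path`, and an existing file leads to an overwrite prompt. The sketch below derives those names and checks for existing files before the builder runs. Both helper names are hypothetical, and the naming rule is an assumption inferred from the log rather than a documented guarantee.

```python
import os


def expected_catalog_files(output_path):
    """Return the (csv, json) file names assumed to be derived from output_path."""
    return output_path + ".csv", output_path + ".json"


def existing_outputs(output_path):
    """List expected catalog files that already exist on disk."""
    return [p for p in expected_catalog_files(output_path) if os.path.exists(p)]


csv_file, json_file = expected_catalog_files("/home/a1r/tests/static-catalog")
print(csv_file)
print(json_file)
print(existing_outputs("/home/a1r/tests/static-catalog"))
```

A non-empty list from `existing_outputs` suggests the builder may ask whether to overwrite.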


In [2]:
#This is an example call to run catalog builder with the default settings.
#To use a yaml config file instead, uncomment the config-related arguments below.
configyaml = os.path.join(git_package_dir, 'configs/config-template.yaml')
#input_path = "/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp"
#output_path = "sample-mdtf-catalog"

def create_catalog_from_config(input_path=input_path, output_path=output_path): #, configyaml=configyaml):
    #csv and json hold what create_catalog returns; json is passed to intake later in this notebook.
    csv, json = gen_intake_gfdl.create_catalog(input_path=input_path, output_path=output_path) #, verbose=True, config=configyaml)
    return csv, json

if __name__ == '__main__':
    csv, json = create_catalog_from_config(input_path, output_path) #, configyaml)
    

INFO:local:[Mostly] silent log activated
INFO:local:Default schema: catalogbuilder/cats/gfdl_template.json
INFO:local:input path: /archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/
INFO:local: output path: /home/a1r/tests/static-catalog
{'activity_id': 'dev', 'path': '/archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/ocean_monthly/ocean_monthly.static.nc', 'variable_id': 'fixed', 'frequency': 'fx', 'table_id': 'fx', 'realm': 'static'}
{'activity_id': 'dev', 'path': '/archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/ocean_monthly/ts/monthly/5yr/ocean_monthly.000101-000512.zos.nc', 'variable_id': 'zos', 'time_range': '000101-000512', 'realm': 'ocean_monthly'}
time-series data
{'activity_id': 'dev', 'path': '/archive/a1r/fre/FMS2024.02_OM5_20240724/CM4.5v01_om5b06_piC_noBLING/gfdl.ncrc5-intel23-prod-openmp/pp/ocean_monthly/ts/monthl

Found existing file! Overwrite? (y/n) y


JSON generated at: /home/a1r/tests/static-catalog.json
CSV generated at: /home/a1r/tests/static-catalog.csv
INFO:local:CSV generated at/home/a1r/tests/static-catalog.csv
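The dictionaries in the log above show fields such as `variable_id` and `time_range` that also appear in the file names. As an illustration of how a file name maps to those fields, here is a minimal sketch. It assumes names of the form `<component>.<time_range>.<variable>.nc` for time series and `<component>.static.nc` for static files. It is not catalog builder's own parser, and `parse_pp_filename` is a hypothetical name.

```python
import os


def parse_pp_filename(path):
    """Split a post-processed file name into the pieces seen in the log."""
    parts = os.path.basename(path).split(".")
    if len(parts) == 3 and parts[1] == "static":
        return {"component": parts[0], "kind": "static"}
    if len(parts) == 4 and parts[-1] == "nc":
        # time_range looks like "000101-000512": start and end separated by "-"
        start, _, end = parts[1].partition("-")
        return {
            "component": parts[0],
            "kind": "time-series",
            "time_range": parts[1],
            "start": start,
            "end": end,
            "variable_id": parts[2],
        }
    raise ValueError(f"unrecognized file name: {path}")


ts = parse_pp_filename("pp/ocean_monthly/ts/monthly/5yr/ocean_monthly.000101-000512.zos.nc")
print(ts["variable_id"], ts["time_range"])
print(parse_pp_filename("pp/ocean_monthly/ocean_monthly.static.nc"))
```

Directory layouts that do not follow this pattern are the case where a custom config is needed.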


Let's begin our analysis

In [None]:
import intake, intake_esm
import matplotlib #pip install any missing tools into your env, or install them from the notebook
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


In [None]:
col_url = json
col = intake.open_esm_datastore(col_url)

Explore the catalog

In [None]:
col.df
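The generated CSV can also be inspected without intake, which is handy for a quick sanity check. The sketch below counts rows per value of one column using only the standard library. The sample rows and their column order are hypothetical, loosely modeled on the log output above, and `count_by_column` is a hypothetical helper. The `csv` module is imported under an alias because this notebook already binds the name `csv` to the builder's return value.

```python
import csv as csv_module  # aliased: the notebook uses the name `csv` for the builder output
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count catalog rows per distinct value of the given column."""
    reader = csv_module.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Hypothetical two-row sample standing in for the generated catalog CSV.
sample = io.StringIO(
    "activity_id,variable_id,realm,path\n"
    "dev,fixed,static,ocean_monthly.static.nc\n"
    "dev,zos,ocean_monthly,ocean_monthly.000101-000512.zos.nc\n"
)
counts = count_by_column(sample, "realm")
print(counts)
```

For the real catalog, open the generated CSV file with `open(path, newline="")` and pass the file object to `count_by_column`.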

Let's narrow down the search

In [None]:
expname_filter = ['CM4.5v01_om5b06_piC_noBLING']
modeling_realm = "ocean_monthly"
frequency = "mon"

In [None]:
cat = col.search(experiment_id=expname_filter,frequency=frequency,realm=modeling_realm)
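In the search above, every keyword must match, and a list value matches if any of its entries does. The plain-Python sketch below mimics that behavior on a list of dictionaries to make the semantics concrete. It is an illustration only, not intake-esm's implementation, it handles exact matches only, and the two rows are hypothetical.

```python
def matches(row, **criteria):
    """True if the row satisfies every criterion; a list means "any of these"."""
    for column, wanted in criteria.items():
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row.get(column) not in allowed:
            return False
    return True


# Hypothetical catalog rows for illustration.
rows = [
    {"experiment_id": "CM4.5v01_om5b06_piC_noBLING", "frequency": "mon",
     "realm": "ocean_monthly", "variable_id": "zos"},
    {"experiment_id": "CM4.5v01_om5b06_piC_noBLING", "frequency": "fx",
     "realm": "static", "variable_id": "fixed"},
]
selected = [
    r for r in rows
    if matches(r, experiment_id=["CM4.5v01_om5b06_piC_noBLING"],
               frequency="mon", realm="ocean_monthly")
]
print([r["variable_id"] for r in selected])  # prints ['zos']
```

An empty search result in the notebook usually means one of the keyword values does not occur in the corresponding catalog column, so checking the distinct values of that column in `col.df` is a good first step.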

In [None]:
set(cat.df["variable_id"])

In [None]:
cat = cat.search(variable_id="sos") #Sea Surface Salinity

Retrieve the files with dmget

In [None]:
cat

In [None]:
#For simple dmget usage in a notebook, just use: !dmget {file}
#Use the following to wrap the dmget call for each path in the catalog.
import shlex

def dmgetmagic(x):
    #Quote the path so that spaces or shell characters in it are passed to dmget safely.
    cmd = 'dmget %s' % shlex.quote(str(x))
    return os.system(cmd)

#OR import the dmget helper from https://github.com/aradhakrishnanGFDL/canopy-cats/tree/main/notebooks/dmget.py

In [None]:
dmstatus = cat.df["path"].apply(dmgetmagic)
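`dmstatus` holds one exit status per path, where zero conventionally means success. A small helper can list the paths whose call did not succeed so they can be retried. This is a minimal sketch, with `failed_paths` as a hypothetical name and made-up sample values.

```python
def failed_paths(paths, statuses):
    """Return the paths whose dmget call reported a non-zero status."""
    return [p for p, s in zip(paths, statuses) if s != 0]


# Hypothetical paths and statuses for illustration.
example_paths = ["a.nc", "b.nc", "c.nc"]
example_statuses = [0, 256, 0]
print(failed_paths(example_paths, example_statuses))  # prints ['b.nc']
```

In the notebook this could be called as `failed_paths(cat.df["path"], dmstatus)`.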

In [None]:
dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': True})

In [None]:
for k in dset_dict.keys(): 
    print(k)

In [None]:
ds = dset_dict[k] #k still holds the last key printed by the loop above
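The cell above reuses the loop variable `k`, so it silently picks whichever key was printed last. When the dictionary holds more than one dataset, selecting the key explicitly is less error-prone. Below is a minimal sketch with a hypothetical helper `pick_key`. The keys in the example are made up, and the real keys depend on how the catalog groups its datasets.

```python
def pick_key(datasets, contains=None):
    """Return the first key in sorted order, optionally requiring a substring."""
    for key in sorted(datasets):
        if contains is None or contains in key:
            return key
    raise KeyError(f"no dataset key contains {contains!r}")


# Hypothetical keys; the values would be xarray datasets in the notebook.
example = {"dev.ocean_monthly": "dataset A", "dev.static": "dataset B"}
print(pick_key(example, contains="ocean_monthly"))  # prints dev.ocean_monthly
```

In the notebook, `ds = dset_dict[pick_key(dset_dict, contains="ocean_monthly")]` would replace the dependence on `k`.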

In [None]:
ds

In [None]:
ds["sos"]

In [None]:
sos = ds.sos.isel(time=1).plot() #plot the second time step (index 1)

In [None]:
ds.sos.mean(dim='time').plot()