# Catalog Preparation

The [NetCDF Java library](https://www.unidata.ucar.edu/software/netcdf-java/) implements the Common Data Model (CDM) to interface netCDF files to a variety of data formats (e.g., netCDF, HDF, GRIB). Layered above the basic data access, the CDM uses the metadata contained in datasets to provide a higher-level interface to geoscience-specific features of datasets, in particular geolocation and data subsetting in coordinate space.

[**climate4R**](https://github.com/SantanderMetGroup/climate4R) leverages the CDM for flexible and efficient data access and retrieval through the "wrapper" package `loadeR`, which acts as an interface between R and the netCDF Java API and exposes a simple set of user-friendly functions.

In this context, NcML is an XML representation of netCDF metadata, (approximately) the header information one gets from a netCDF file with the `ncdump -h` command. More advanced uses include modifying existing netCDF files and creating "virtual" netCDF datasets, for example through aggregation. [(Link to Unidata's NcML overview)](https://docs.unidata.ucar.edu/netcdf-java/5.6/userguide/ncml_overview.html)
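
To make the idea of a "virtual" dataset more concrete, the following minimal sketch writes a small NcML file by hand from R. The file names (`tas_1980.nc`, `tas_1981.nc`) and the output location are hypothetical and serve only as an illustration; the element and attribute names follow the `joinExisting` aggregation type described in the Unidata NcML documentation. The catalogues built later in this notebook are generated automatically and are more elaborate than this.

```r
# Hypothetical file names, used for illustration only
nc_files <- c("tas_1980.nc", "tas_1981.nc")

# Assemble the NcML document line by line:
# a joinExisting aggregation concatenates the files along the "time" dimension
ncml <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
  '\t<aggregation dimName="time" type="joinExisting">',
  sprintf('\t\t<netcdf location="%s"/>', nc_files),
  '\t</aggregation>',
  '</netcdf>'
)

# Write the text file to a temporary location and display it
ncml_path <- file.path(tempdir(), "example.ncml")
writeLines(ncml, ncml_path)
cat(readLines(ncml_path), sep = "\n")
```

The result is a plain text file that only points to the underlying netCDF files, which is why a catalogue stays small regardless of the data volume it describes.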

For FWI evaluation, we will create specific NcML datasets (a.k.a. "catalogues"), in order to efficiently retrieve the input variables, without worrying about the different paths and underlying files forming the dataset.


In [2]:
## Climate4R
library(loadeR)


# Example with the CCLM6-0-1-URB RCM data

This is the directory containing the hourly data files of the evaluation run of this model:

In [4]:
dir <- "/mnt//CORDEX_CMIP6_tmp//sim_data//CORDEX-CMIP6//DD//EUR-12//CLMcom-CMCC/ERA5//evaluation//r1i1p1f1//CCLM6-0-1-URB//v1-r1//1hr"


NcML files can be created automatically with the `loadeR` function `makeAggregatedDataset`. The function traverses the directory structure and scans all netCDF files, extracting the metadata needed to build the NcML (optionally restricted to files matching a given character pattern, so that unwanted data are discarded).

Next, in a single call, we create a 'virtual' dataset that contains only the hourly input variables needed for the FWI calculation, and store it in a target directory (note that the NcML is only an XML representation of the data, that is, a plain text file and not the data itself):

In [None]:
makeAggregatedDataset(dir, recursive = TRUE,
                      pattern = "hurs|tas|sfcWind",
                      ncml.file = "../data_catalogs/CCLM6-0-1-URB_fwi_vars.ncml")
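
The effect of the `pattern` argument can be previewed with base R before building the catalogue. The sketch below assumes that the pattern is applied as a regular expression to file names; the file names themselves are hypothetical. It shows that an unanchored alternative such as `tas` also matches names like `tasmax`, and how an anchored pattern avoids this.

```r
# Hypothetical file names, used for illustration only
files <- c("hurs_1hr_1980.nc", "tas_1hr_1980.nc", "tasmax_day_1980.nc",
           "sfcWind_1hr_1980.nc", "pr_1hr_1980.nc", "psl_1hr_1980.nc")

# Same regular expression as in the call above: matches anywhere in the name
selected <- files[grepl("hurs|tas|sfcWind", files)]
print(selected)

# Stricter alternative: the variable name must open the file name
# and be followed by an underscore
strict <- files[grepl("^(hurs|tas|sfcWind)_", files)]
print(strict)
```

Whether the anchored form behaves the same way inside `makeAggregatedDataset` depends on whether the pattern is matched against file names or full paths, which this sketch does not check.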

Next, we display the first 10 lines of this file as a sample: 

In [7]:
sample <- readLines("../data_catalogs/CCLM6-0-1-URB_fwi_vars.ncml", n = 10)
print(sample)

 [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
 [2] "<netcdf xmlns=\"http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2\">"
 [3] "\t<aggregation type=\"union\">"
 [4] "\t\t<netcdf>"

Now we are ready to load the data by pointing to the catalogue. Before opening the data, it is usually helpful to get an overview of its contents and structure. This is possible with `dataInventory`, which provides information on the available variables, their spatio-temporal extent, their size in Mb, and so on.

In [8]:
di <- dataInventory(dataset = "../data_catalogs/CCLM6-0-1-URB_fwi_vars.ncml")

[2025-04-17 19:45:49.513671] Doing inventory ...

[2025-04-17 19:45:53.269976] Retrieving info for 'hurs' (2 vars remaining)

[2025-04-17 19:45:53.426528] Retrieving info for 'sfcWind' (1 vars remaining)

[2025-04-17 19:45:53.462445] Retrieving info for 'tas' (0 vars remaining)

[2025-04-17 19:45:53.492653] Done.



In [9]:
str(di)

List of 3
 $ hurs   :List of 7
  ..$ Description: chr "Near-Surface Relative Humidity"
  ..$ DataType   : chr "float"
  ..$ Shape      : int [1:3] 368184 406 418
  ..$ Units      : chr "%"
  ..$ DataSizeMb : num 249935
  ..$ Version    : logi NA
  ..$ Dimensions :List of 3
  .. ..$ time:List of 4
  .. .. ..$ Type      : chr "Time"
  .. .. ..$ TimeStep  : chr ".041666 days"
  .. .. ..$ Units     : chr "days since 1949-12-01 00:00:00"
  .. .. ..$ Date_range: chr "1980-01-01T00:00:00Z - 2021-12-31T23:00:00Z"
  .. ..$ rlat:List of 5
  .. .. ..$ Type       : chr "GeoY"
  .. .. ..$ Units      : chr "degrees"
  .. .. ..$ Values     : num [1:406] -23 -22.9 -22.8 -22.7 -22.6 ...
  .. .. ..$ Shape      : int 406
  .. .. ..$ Coordinates: chr "rlat"
  .. ..$ rlon:List of 5
  .. .. ..$ Type       : chr "GeoX"
  .. .. ..$ Units      : chr "degrees"
  .. .. ..$ Values     : num [1:418] -28 -27.9 -27.8 -27.7 -27.6 ...
  .. .. ..$ Shape      : int 418
  .. .. ..$ Coordinates: chr "rlon"
 $ sfcWind:List
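
The figures reported for `hurs` can be cross-checked with a few lines of base R. This is only a consistency check on the printed values, under two assumptions that are not stated in the output: that a `float` occupies 4 bytes and that 1 Mb is counted as 10^6 bytes.

```r
# Values copied from the inventory output above
shape <- c(time = 368184, rlat = 406, rlon = 418)

# Number of hourly steps spanned by the reported Date_range
t0 <- as.POSIXct("1980-01-01 00:00:00", tz = "UTC")
t1 <- as.POSIXct("2021-12-31 23:00:00", tz = "UTC")
n_hours <- length(seq(t0, t1, by = "hour"))
print(n_hours == shape[["time"]])  # TRUE: the time axis is hourly and complete

# Assumption: 4 bytes per float value, 1 Mb = 1e6 bytes
size_mb <- prod(shape) * 4 / 1e6
print(round(size_mb))              # 249935, as reported in DataSizeMb
```

The reported `TimeStep` of ".041666 days" is consistent with this, since 1/24 of a day is one hour.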

## Accumulated precipitation catalogue

We create a separate catalogue for precipitation, since we have pre-processed this variable to obtain 12-12 accumulated values:

In [3]:
dir.pr <- "../data_tmp/CCLM6-0-1-URB/"
makeAggregatedDataset(dir.pr, pattern = "nc$",
                      ncml.file = "../data_catalogs/CCLM6-0-1-URB_pr12.ncml")

[2025-04-15 10:04:30.036542] Creating dataset from 42 files

[2025-04-15 10:04:31.68554] Scanning file 1 out of 1

[2025-04-15 10:04:31.693069] Defining aggregating dimension length
This process may be slow but will significantly speed-up data retrieval...

[2025-04-15 10:04:31.957102] Dimension length defined

[2025-04-15 10:04:31.958521] NcML file "../data_catalogs/CCLM6-0-1-URB_pr12.ncml" created from 42 files corresponding to 1 variables

Use 'dataInventory' to obtain a description of the dataset
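
The pre-processing that produced the 12-12 accumulated values is not part of this notebook. As a rough illustration only, the sketch below assumes that "12-12" means totals over 24-hour windows running from 12:00 to 12:00 UTC, and applies that to a synthetic hourly series. The data, the window convention (start inclusive, end exclusive) and the labelling of each window by its start date are all assumptions, not a description of the actual procedure.

```r
# Synthetic hourly precipitation (hypothetical values), three days in UTC
time <- seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
            as.POSIXct("2000-01-03 23:00:00", tz = "UTC"), by = "hour")
pr <- rep(1, length(time))  # one unit of precipitation every hour

# Shifting the time stamps back by 12 h makes each 12:00-12:00 window
# coincide with a calendar day, which can then be used as grouping key
window_start <- as.Date(time - 12 * 3600, tz = "UTC")

# Sum the hourly values within each window
pr12 <- tapply(pr, format(window_start), sum)
print(pr12)
```

In this toy series the first and last windows are incomplete (12 hours each), so in practice such edge windows would need to be dropped or flagged.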



## More information

The notebook `FWI_example.ipynb` illustrates data loading from the catalogues.
