# 3. Using `yadg` to process multiple files into one data archive
To process multiple files, whether of the same or different filetypes, into a single data archive, use the `yadg process` command together with a *DataSchema*. The usage of the `yadg process` command can be displayed with the `--help` argument:

In [None]:
!yadg process --help

As shown above, the `infile` argument is a *DataSchema* file (JSON or YAML) specifying which files and folders are to be processed, and how. The *DataSchema* is documented in the [`dgbowl_schemas` package](https://dgbowl.github.io/dgbowl-schemas/master/apidoc/dgbowl_schemas.yadg.dataschema.html#module-dgbowl_schemas.yadg.dataschema); we will highlight the main functionalities below.

The `outfile` argument specifies the path of the output NetCDF file. Unlike in the [`yadg extract` usage](02_extract.ipynb), here the NetCDF file contains a `DataTree` rather than an `xarray.Dataset`.

## 3.1 Main features of a *DataSchema*
The default *DataSchema* for `yadg-5.0` is *DataSchema-5.0*, which is composed of three elements:

  - `metadata`, containing versioning and provenance information,
  - `step_defaults`, providing defaults for timezones, locales and file encoding,
  - `steps`, enumerating instructions for parsing of sets of files.

Note that previous versions of the *DataSchema* can still be used with `yadg-5.0`, as they are automatically updated to the latest version, using default settings where necessary.

### 3.1.1 *DataSchema->`metadata`*

    metadata:
        provenance:
            type: manual
        version: "5.0"

The above example shows a sample `metadata` section of a valid *DataSchema-5.0*. The entries are:

  - `metadata->provenance->type` entry: a required string annotating the provenance of the *DataSchema*, with `"manual"` denoting a hand-written schema (alternatives include `yadg extract`, used when an internal *DataSchema* has been generated in the course of a `yadg extract` run, etc.),
  - `metadata->version` entry: a required string literal (not a float!) equal to `"5.0"`.
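
The string-versus-float distinction is easy to get wrong when the quotes are dropped. The sketch below illustrates it with the JSON form of a *DataSchema* and Python's standard `json` module; the helper `version_is_string_literal` is a hypothetical name used for illustration only and is not part of `yadg` or `dgbowl_schemas`.

```python
import json

# Two hypothetical JSON DataSchema fragments; only the quoting of "version" differs.
quoted = json.loads(
    '{"metadata": {"provenance": {"type": "manual"}, "version": "5.0"}}'
)
unquoted = json.loads(
    '{"metadata": {"provenance": {"type": "manual"}, "version": 5.0}}'
)


def version_is_string_literal(schema: dict) -> bool:
    """Illustrative check only: is metadata->version the string "5.0"?"""
    version = schema["metadata"]["version"]
    return isinstance(version, str) and version == "5.0"


print(version_is_string_literal(quoted))    # the quoted value is parsed as a str
print(version_is_string_literal(unquoted))  # the unquoted value is parsed as a float
```

Quoting the version, as in the sample above, avoids this ambiguity in a hand-written schema.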

### 3.1.2 *DataSchema->`step_defaults`*

    step_defaults:
        timezone: Europe/Berlin

The above example shows a sample `step_defaults` section of a valid *DataSchema-5.0*. The whole section is optional; however, specifying at least `step_defaults->timezone` is good practice, as the `timezone` is used to make all `DateTime` objects within `yadg` timezone-aware. The specified `timezone` should be that of the data, not of the computer where the data is being processed. A wrong `timezone` may lead to wrong Unix timestamps and, more crucially, to relative off-by-one-hour errors (e.g. when measurements span a daylight saving time adjustment), or to even worse errors when combining data containing timezone-aware timestamps with other, unannotated data.
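
The off-by-one-hour error can be reproduced with the Python standard library alone. The following is a minimal sketch, not `yadg`'s own timestamp handling: it takes two hypothetical naive timestamps recorded on either side of the spring daylight saving change in `Europe/Berlin` and converts them to Unix timestamps under two different timezone assumptions. The function name `elapsed_seconds` is made up for this illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Two hypothetical naive timestamps, as they might appear in a data file,
# on either side of the daylight saving change of 26 March 2023.
naive_start = datetime(2023, 3, 26, 1, 30)
naive_end = datetime(2023, 3, 26, 3, 30)


def elapsed_seconds(start: datetime, end: datetime, tz) -> float:
    """Interpret both naive datetimes in `tz`; return the elapsed Unix time in seconds."""
    return end.replace(tzinfo=tz).timestamp() - start.replace(tzinfo=tz).timestamp()


# Correct timezone of the data: the wall clock skipped one hour in between.
berlin = elapsed_seconds(naive_start, naive_end, ZoneInfo("Europe/Berlin"))
# Wrong assumption (UTC): the skipped hour is counted as measurement time.
utc = elapsed_seconds(naive_start, naive_end, timezone.utc)

print(berlin, utc)
```

With the correct timezone the two points are one hour apart; under the wrong assumption they appear two hours apart, which is the relative error described above.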

Other optional entries are `locale`, which switches on locale-aware processing of certain text file types (decimal and thousands separators), and `encoding`, which adjusts the default encoding (`UTF-8`) for all files.

### 3.1.3 *DataSchema->`steps`*
    
    steps:
      - parser: electrochem
        input:
            folders: ["process/data"]
            suffix: "mpr"
        extractor:
          filetype: "eclab.mpr"
        tag: electro

The above example shows one *step*, i.e. a block in the `steps` enumeration. The following entries are required:

  - `parser`, denoting the `yadg` parser used to process the matched files,
  - `input`, specifying files to be processed directly, or matched within the specified folders using filename components,
  - `extractor->filetype`, specifying the filetype of the matched files.

Note that some filetypes may be processed by more than one `parser`, which is why both have to be specified. All files matched by the `input` section will be collated into one node in the resulting `DataTree`.
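
Since the `infile` may be JSON as well as YAML, the three sections shown above can also be assembled programmatically. The sketch below builds them as a plain Python dictionary and writes them to a JSON file. The output filename `ds.example.json` is hypothetical, and the result is not validated against `dgbowl_schemas` here; passing the file to `yadg process` would be the actual check.

```python
import json
import tempfile
from pathlib import Path

# The three sections from the samples above, combined into one dictionary.
schema = {
    "metadata": {
        "provenance": {"type": "manual"},
        "version": "5.0",  # a string literal, not a float
    },
    "step_defaults": {"timezone": "Europe/Berlin"},
    "steps": [
        {
            "parser": "electrochem",
            "input": {"folders": ["process/data"], "suffix": "mpr"},
            "extractor": {"filetype": "eclab.mpr"},
            "tag": "electro",
        }
    ],
}

# Hypothetical output location, placed in a temporary folder for this sketch.
outpath = Path(tempfile.mkdtemp()) / "ds.example.json"
outpath.write_text(json.dumps(schema, indent=2))

# Reading the file back confirms that nothing was altered by serialisation.
roundtrip = json.loads(outpath.read_text())
print(roundtrip["steps"][0]["tag"])
```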

## 3.2 Usage example: electrochemistry and gas chromatography
A sample *DataSchema* is located in the [`process/ds.5.0.yml`](process/ds.5.0.yml) file. It contains the same `metadata` and `step_defaults` sections as above, but it has two steps:

- the first processing any files present in the `process/data/` folder which have an `mpr` suffix and contain `LSV` in their filename, treating them as an `eclab.mpr` filetype,
- the second processing a single specified `process/data/CuDura05_03.zip` archive, treating it as a `fusion.zip` filetype.
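
The filename matching in the first step can be pictured as a simple filter. The sketch below is an illustration of the idea only, not `yadg`'s matching code: the function `match_files`, its parameter names, and the folder listing are all hypothetical.

```python
def match_files(filenames, suffix=None, contains=None):
    """Keep names that end with `suffix` and include `contains` (illustrative only)."""
    matched = []
    for name in filenames:
        if suffix is not None and not name.endswith(suffix):
            continue
        if contains is not None and contains not in name:
            continue
        matched.append(name)
    return matched


# Hypothetical folder listing; these are not the actual contents of process/data/.
listing = [
    "sample_01_LSV.mpr",
    "sample_02_CA.mpr",
    "sample_01_LSV.mpt",
    "CuDura05_03.zip",
]

print(match_files(listing, suffix="mpr", contains="LSV"))
```

Only the name satisfying both conditions is kept; the second step, by contrast, names its single file explicitly and needs no matching.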

We can process this *DataSchema*, generating a NetCDF file in `process/output.nc`, as follows:


In [None]:
!yadg process process/ds.5.0.yml process/output.nc

Remember that `yadg process` stores `DataTree` objects in the NetCDF file instead of a simple `xarray.Dataset`. We can have a look at what is in the file using the `xarray-datatree` package, as follows:

In [None]:
from datatree import open_datatree
dt = open_datatree("process/output.nc")
dt

We can access the individual leaves of the `DataTree` containing our data using dictionary notation. For this we can use the `tag` entries in the respective steps (i.e. `LSV` and `GC`):

In [None]:
from IPython.display import display
display(dt["LSV"])
display(dt["GC"])

As you can see, the first leaf, `LSV`, corresponds to the first *step* and contains 235 points along the `uts` dimension; the second leaf, `GC`, corresponds to the second *step*, and contains 120 points along the `uts` dimension with an additional `species` dimension.

Because the `uts` timestamps in both leaves are referenced to the Unix epoch, the data is guaranteed to be properly aligned in time, provided the `timezone` of the data files has been set correctly.
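
A shared Unix time axis means that points from one leaf can be paired with points from another by timestamp alone. The following is a minimal sketch of nearest-in-time pairing using only the standard library; the `uts` values are made up, and `nearest_index` is a hypothetical helper rather than a `yadg` or `xarray` function. It assumes a non-empty, sorted list of timestamps.

```python
from bisect import bisect_left


def nearest_index(sorted_uts, t):
    """Return the index of the value in the non-empty, sorted list closest to t."""
    i = bisect_left(sorted_uts, t)
    if i == 0:
        return 0
    if i == len(sorted_uts):
        return len(sorted_uts) - 1
    # Choose whichever neighbour is closer to t; ties go to the earlier point.
    return i if sorted_uts[i] - t < t - sorted_uts[i - 1] else i - 1


# Made-up Unix timestamps standing in for the `uts` coordinates of two leaves.
lsv_uts = [1000.0, 1001.0, 1002.0, 1003.0]
gc_uts = [1000.4, 1002.6]

# For each slower measurement, find the closest point of the faster one.
pairs = [(t, lsv_uts[nearest_index(lsv_uts, t)]) for t in gc_uts]
print(pairs)
```

For real data, the interpolation and selection tools of `xarray` are likely a better fit than a hand-written search.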

[Back to index](index.ipynb)