<div style="text-align:center;">
    <img src="./pics/datacat_logo.svg" width="80%"></img>
    <br>
    <br>
    <h1>DataLad MetaLad - A primer</h1>
</div>


### Quick Links:

- [DataLad Catalog demo site](https://datalad.github.io/datalad-catalog/)
- [3-minute explainer video](https://www.youtube.com/watch?v=4GERwj49KFc)
- [DataLad Catalog code repository](https://github.com/datalad/datalad-catalog)
- [DataLad Catalog documentation](http://docs.datalad.org/projects/catalog/en/latest/?badge=latest)
- [DataLad website](https://www.datalad.org/)
- [DataLad code repository](https://github.com/datalad/datalad)
- [DataLad MetaLad code repository](https://github.com/datalad/datalad-metalad)

## What is DataLad MetaLad?

DataLad MetaLad is a free and open-source command-line tool with a Python API, available for installation via [PyPI](https://pypi.org/project/datalad-metalad/). It is an extension that equips DataLad with a command suite for metadata handling, including metadata extraction, aggregation, filtering, and reporting.

## But what is metadata?

Metadata, generally, means "data about data". Let's consider data as being a set of files structured in some hierarchical file tree that, together, constitute a dataset. *Metadata* describe the individual files in that dataset as well as the dataset's overall content. It can be useful to distinguish between two levels: low-level and high-level metadata. Low-level metadata cover the basic properties of the files themselves that are implicit to the dataset, such as file names, types, sizes, relative paths, and file access locations. These can be derived from the dataset and its files *as is*. High-level metadata, on the other hand, are often added to the dataset explicitly, or otherwise aggregated from lower-level metadata into a top-level summary, and might include things like a description of your dataset, a list of dataset maintainers/contributors, and other domain-specific descriptive variables.
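
To make the idea of low-level metadata concrete, here is a minimal sketch that derives such properties from a directory tree *as is*. It uses only the Python standard library and is not part of MetaLad; the function name `low_level_metadata` and the record keys are assumptions chosen for illustration, and the sketch ignores DataLad specifics such as annexed file content.

```python
import mimetypes
from pathlib import Path


def low_level_metadata(dataset_root):
    """Collect basic, implicit properties for every file under dataset_root."""
    root = Path(dataset_root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories and anything that is not a regular file
        # Guess the file type from the file name; this is None if unknown.
        mime_type, _ = mimetypes.guess_type(path.name)
        records.append(
            {
                "name": path.name,
                "relative_path": path.relative_to(root).as_posix(),
                "size_bytes": path.stat().st_size,
                "type": mime_type,
            }
        )
    return records
```

Calling `low_level_metadata(".")` in a directory would return one dictionary per file. High-level metadata, by contrast, would typically have to be supplied by a person or read from a dedicated description file.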

Some metadata might conform to an industry standard, such as the [DataCite Metadata Schema](https://schema.datacite.org/). Standardized file formats may also contain format-specific information (such as bit rate and duration for audio files, or resolution and color mode for image files), while domain-standard files (such as [Digital Imaging and Communications in Medicine](https://www.dicomstandard.org/), i.e. DICOM) also supply embedded or sidecar metadata. Other types of metadata might be more of an accepted convention used in a specific field or institute. Metadata could even just be structured based on your own approach that you find logical, for example a text-based `README` file with content under standard headings.

## Why do we need metadata?

The usefulness of having metadata in addition to the data itself becomes clear when considering the FAIR principles of data management: [Findable, Accessible, Interoperable, and Reusable](https://www.go-fair.org/fair-principles/).

Structured and linked metadata provide a powerful opportunity by allowing us to create an abstract representation of a full dataset that is separate from the actual file content. By representing the dataset using its associated metadata, we can easily make this metadata available online, while not having to consider the often challenging practicalities of sharing data openly. This means that data content can be stored securely while metadata can be shared and operated on widely, thus improving decentralization and FAIRness.

<br>
<div style="text-align:center;">
    <img src="./pics/datacat2_the_opportunity.svg" width="60%"></img>
    <h5> The split between data content and metadata presents a powerful opportunity to improve FAIR data management</h5>
</div>

Additionally, machine-readable and standardized metadata open up access to a wide range of community-contributed tools and pipelines that understand the metadata and can perform standard operations on them. Such features also make the metadata, and hence the dataset, interoperable with wider standards.

## Where does DataLad MetaLad fit into the meta-picture?

MetaLad provides the functionality needed to handle metadata of DataLad datasets effectively. With MetaLad you can associate any number of metadata items, of arbitrary size and format, with a dataset and its files, and you can extract, view, filter, and aggregate these metadata items individually or as part of a pipeline. DataLad MetaLad has a set of standard commands that facilitate this process:

- `meta-add`: add a metadata record or a list of metadata records to a metadata store, usually the Git repository of the dataset
- `meta-extract`: run an extractor on a file or dataset and emit the resulting metadata
- `meta-filter`: run a filter over existing metadata and return the resulting metadata
- `meta-aggregate`: aggregate metadata from multiple local or remote metadata stores into a local metadata store
- `meta-dump`: report metadata from a local or remote metadata store
- `meta-conduct`: execute processing pipelines of chained commands (e.g. `meta-extract` and `meta-add`) on a list of objects such as files or datasets

## The high-level MetaLad Concepts

### The metadata store

MetaLad recognizes metadata associated with a DataLad dataset and its files _once the metadata have been added to a **metadata store**_. A metadata store can be any Git repository; it does not have to be the DataLad dataset that the metadata describe. In this way, MetaLad can transport metadata independently of the data in the dataset.

### Metadata extraction

Before metadata can be added to a metadata store, it has to be extracted from its source and made available in a portable format, such as a [JSON](https://www.json.org/json-en.html) object. The metadata source can be implicit to the dataset (such as the file tree structure and file formats) or it could be an explicit file with standardized content about the data. Metadata sources do not have to be part of the actual dataset for which metadata is being generated.

From the source, metadata is extracted by a **metadata extractor**. MetaLad distinguishes between two extractor types: dataset-level extractors and file-level extractors. The former are executed with a view on a whole dataset; the latter are executed with specific information about a single file path in a dataset. The job of an extractor is to understand the structure of the metadata source, grab the bits and pieces that it needs, and put all of this information into a format that MetaLad understands, which in the Python API is a standard dictionary of key-value pairs.
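
The distinction between the two extractor types can be sketched with two plain Python functions. This is a conceptual illustration only: it does not use MetaLad's extractor interface, and the function names and dictionary keys are hypothetical. It shows the shape of the task, namely that each extractor returns a dictionary of key-value pairs that can be serialized to JSON.

```python
import json
from pathlib import Path


def extract_dataset_level(dataset_root):
    """Hypothetical dataset-level extractor: looks at the dataset as a whole."""
    root = Path(dataset_root)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "type": "dataset",
        "file_count": len(files),
        "total_size_bytes": sum(p.stat().st_size for p in files),
    }


def extract_file_level(dataset_root, relative_path):
    """Hypothetical file-level extractor: looks at one file path in the dataset."""
    target = Path(dataset_root) / relative_path
    return {
        "type": "file",
        "path": Path(relative_path).as_posix(),
        "size_bytes": target.stat().st_size,
        "suffix": target.suffix,
    }


def to_portable(record):
    """Serialize an extracted record into a portable JSON string."""
    return json.dumps(record, sort_keys=True)
```

A real extractor would usually read richer sources, such as embedded file headers or a structured description file, but the output remains a dictionary that can be handed on to the next step.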

DataLad MetaLad ships with a number of dataset- and file-level metadata extractors, such as `metalad_core` and `metalad_studyminimeta`. Users can also develop their own extractors to be used by DataLad MetaLad, using the [DataLad Extension Template](https://github.com/datalad/datalad-extension-template).

### MetaLad operations

With extracted metadata available after running the `datalad meta-extract` command, metadata can be added to a metadata store by running `datalad meta-add`. It is then often desirable to export or filter all metadata currently available in a given metadata store. These operations can be done with `datalad meta-dump` and `datalad meta-filter`, respectively. The ability also exists to report metadata from a remote metadata store without downloading the complete remote metadata: only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, including filtering.
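
The add, dump, and filter operations can be pictured with a toy in-memory store. This is a sketch under stated assumptions, not how MetaLad stores metadata (which lives in a Git repository): the class `ToyMetadataStore`, its method names, and the record keys are all hypothetical.

```python
class ToyMetadataStore:
    """In-memory stand-in for a metadata store (illustrative only)."""

    def __init__(self):
        self._records = []

    def add(self, record):
        """Add one metadata record (a dictionary) to the store."""
        self._records.append(dict(record))

    def dump(self):
        """Report all records currently in the store."""
        return [dict(r) for r in self._records]

    def select(self, predicate):
        """Report only the records for which predicate(record) is true."""
        return [dict(r) for r in self._records if predicate(r)]


store = ToyMetadataStore()
store.add({"dataset_id": "ds-1", "type": "dataset", "metadata": {"title": "Demo"}})
store.add({"dataset_id": "ds-1", "type": "file", "path": "data/a.csv", "metadata": {"rows": 3}})

all_records = store.dump()
file_records = store.select(lambda r: r["type"] == "file")
```

Here `all_records` holds both records, while `file_records` holds only the file-level one.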

Another useful operation is metadata aggregation, using `datalad meta-aggregate`. This will aggregate the metadata available in identified metadata stores into a single specified metadata store, i.e. transporting metadata across the 'borders' of datasets. This is often useful in the context of super- and subdatasets, i.e. dataset linkage, in DataLad. One might want to have all metadata of a linked set of datasets available in a single metadata store, and this is possible with the use of MetaLad's aggregation functionality.
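
Conceptually, aggregation copies records from several stores into one. The sketch below models each store as a plain list of dictionaries and skips records that are already present. It is an illustration of the idea rather than MetaLad's implementation, and the function name `aggregate` and the record keys are assumptions.

```python
def aggregate(target, *sources):
    """Copy records from each source list into target, skipping duplicates."""

    def key(record):
        # Identify a record by dataset, path (None for dataset-level), and extractor.
        return (record["dataset_id"], record.get("path"), record["extractor"])

    seen = {key(r) for r in target}
    for source in sources:
        for record in source:
            k = key(record)
            if k not in seen:
                seen.add(k)
                target.append(dict(record))
    return target


sub_a = [{"dataset_id": "sub-a", "extractor": "example", "metadata": {"title": "A"}}]
sub_b = [{"dataset_id": "sub-b", "extractor": "example", "metadata": {"title": "B"}}]

# Passing sub_a twice shows that duplicates are not added again.
super_store = aggregate([], sub_a, sub_b, sub_a)
```

After this call, `super_store` contains one record for each of the two subdatasets, which mirrors the superdataset use case described above.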

Lastly, it is likely that metadata operations will not be executed individually and manually, but rather automatically as part of some continuous process. In addition, metadata extraction, aggregation, and adding operations might become time-consuming when run on large datasets with many files. It is therefore useful to have both a mechanism that allows metadata operations to be constructed into an execution pipeline, and one that allows parallel execution. This is made possible with `datalad meta-conduct`.
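
The combination of a pipeline and parallel execution can be sketched with Python's standard `concurrent.futures` module. This is not how `meta-conduct` is implemented or configured; the functions `extract` and `conduct` are hypothetical stand-ins that show an extract step running in parallel, followed by a sequential add step.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract(path):
    """Hypothetical extraction step: derive a small record from one file."""
    path = Path(path)
    return {"path": path.name, "size_bytes": path.stat().st_size}


def conduct(paths, store, max_workers=4):
    """Run extract on every path in parallel, then add the results to store."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even though work runs concurrently.
        for record in pool.map(extract, paths):
            store.append(record)  # the "add" step stays sequential
    return store
```

Keeping the add step sequential avoids concurrent writes to the store, while the potentially slow extraction step benefits from running on several files at once.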


## So what now?

You should now have a good overview of what DataLad MetaLad is and what it can do. So let's get some hands-on experience!

### [Tutorial - Metadata handling](datalad_metalad_getting_started.ipynb)

This tutorial gives an overview of and hands-on experience with the metadata handling capabilities of DataLad MetaLad, including:

- adding metadata to a DataLad dataset
- extracting metadata from a DataLad dataset
- aggregating and dumping metadata
- handling dataset- and file-level metadata