Map dataset properties to schema.org terms #51

mih · 2019-01-25T13:00:09Z

I want to improve the usability of the output of our internal metadata extractors for applications like datalad/datalad-revolution#76. By internal extractors I mean those that look at common properties of all datasets, and can therefore always run (I consider anything else out-of-scope here).

For this, it would be helpful to discuss and agree on a mapping of such properties onto schema.org terms. The following is a list of terms that (I think) are applicable, and their proposed mapping.
Please contribute by extending the list, and arguing for/against my proposal. Thx!

After a bit of thinking, I am of the opinion that we should avoid shoehorning meaning onto terms that exist, but are not a 100% match. So the following is a list of only those properties where I see such a match.

Standard, with a definable source

The following aspects have a (potentially definitive) source within the scope of the core metadata extractors.

`identifier` [recommended by Google]

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

This is a dataset's UUID. This ID is also used to identify relationships between datasets.

See the hasPart section for a potential reason to also consider the latest commit SHA as an additional identifier.

`contributor`

A secondary contributor to the CreativeWork or Event.
https://schema.org/contributor (Organization | Person)

We have no way to infer author, but we can surely state that any author of a commit in the history is a contributor. An extractor should consult the mailmap to give a sensible report. The list of contributors reported by the extractor need not be exhaustive.

Something like this: `git log --use-mailmap --no-merges --format=format:'%aN%x00%aE%n%cN%x00%cE' | sort | uniq`
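To make the idea a bit more concrete, here is a minimal Python sketch of how an extractor might turn that `git log` output into `contributor` records. The function names and the choice to report `name` and `email` on a `Person` are my assumptions for illustration, not an existing extractor API.

```python
import subprocess


def parse_contributors(log_output):
    """Turn 'name<NUL>email' lines into sorted, de-duplicated Person records."""
    seen = set()
    for line in log_output.split("\n"):
        if not line:
            continue
        # %x00 in the format string puts a NUL byte between name and email
        name, _, email = line.partition("\x00")
        seen.add((name, email))
    return [
        {"@type": "Person", "name": name, "email": email}
        for name, email in sorted(seen)
    ]


def git_contributors(repo_path):
    """Run the git log call from above (hypothetical helper, needs git)."""
    out = subprocess.run(
        ["git", "log", "--use-mailmap", "--no-merges",
         "--format=format:%aN%x00%aE%n%cN%x00%cE"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_contributors(out)
```

Doing the de-duplication in Python replaces the `sort | uniq` part of the pipeline.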

`hasPart`

Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense)
https://schema.org/hasPart (CreativeWork)

These are primarily subdatasets (referenced by their dataset ID), but we could provide a list of files (by annex-key or shasum too).

This one is tricky to assemble, as a given dataset may not have all information about its subdatasets (e.g. their IDs aggregated), and we cannot rely on or require all subdatasets to be installed. What we do know, however, is the state of the subdataset that is referenced (commit) -- this is as precise an ID as the UUID, but much more volatile.
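As a rough sketch of the "what we do know" part: the referenced commit of each subdataset is recorded in the superdataset's tree as a gitlink entry (mode `160000`), so it can be read without installing anything. The record layout below (commit SHA as `identifier`, path as `name`) and the function names are assumptions for illustration only.

```python
import subprocess


def parse_gitlinks(ls_tree_output):
    """Extract subdataset references from `git ls-tree -r` output.

    Each line looks like '<mode> <type> <sha>\t<path>'; gitlinks
    (submodules/subdatasets) have mode 160000 and type 'commit'.
    """
    parts = []
    for line in ls_tree_output.split("\n"):
        if not line:
            continue
        meta, _, path = line.partition("\t")
        mode, objtype, sha = meta.split()
        if mode == "160000" and objtype == "commit":
            # Assumption: the referenced commit stands in for the dataset ID
            parts.append({"@type": "Dataset", "name": path, "identifier": sha})
    return parts


def subdataset_parts(repo_path):
    """Hypothetical helper: list gitlinks recorded in HEAD (needs git)."""
    out = subprocess.run(
        ["git", "ls-tree", "-r", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_gitlinks(out)
```

Note that without `-z`, git quotes unusual path names, which this sketch does not undo.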

`isPartOf`

Indicates an item or CreativeWork that this item, or CreativeWork (in some sense), is part of.
https://schema.org/isPartOf (CreativeWork)

This is the UUID of the superdataset. BUT see includedInDataCatalog. For any given dataset, this is easier to determine than the hasPart side of things: we just need to look for a single superdataset once, as opposed to all the subdatasets.

`distribution`

A downloadable form of this dataset, at a specific location, in a specific format.
https://schema.org/distribution (DataDownload)

This is any remote of a dataset, described by a compound object (see below for applicable properties).

`distribution.contentUrl`

Actual bytes of the media object
https://schema.org/contentUrl (URL)

A URL that datalad install can act on. The media object here is the dataset itself.

`dateCreated`

The date on which the CreativeWork was created or the item was added to a DataFeed.
https://schema.org/dateCreated (Date | DateTime)

Timestamp of the initial commit.

`(distribution.)dateModified`

The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed.
https://schema.org/dateModified (Date | DateTime)

The timestamp of the last commit on record for the dataset, or, in case of a DataDownload, the respective remote.
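A minimal sketch of how both dates could be derived for the dataset itself, assuming we feed it Unix timestamps from `git log --max-parents=0 --format=%ct` (root commits) and `git log -1 --format=%ct` (latest commit). Using committer rather than author dates, and picking the earliest root when history has several, are my assumptions; the function names are made up.

```python
from datetime import datetime, timezone


def to_iso(unix_seconds):
    """Format a Unix timestamp as an ISO 8601 DateTime in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


def date_fields(root_stamps, head_stamp):
    """Build dateCreated/dateModified from git timestamp output.

    root_stamps: output of `git log --max-parents=0 --format=%ct`
                 (one line per root commit; there may be several)
    head_stamp:  output of `git log -1 --format=%ct`
    """
    created = min(int(s) for s in root_stamps.split())
    modified = int(head_stamp.strip())
    return {"dateCreated": to_iso(created), "dateModified": to_iso(modified)}
```

For a DataDownload record the same function could be fed the timestamp of the remote's branch tip instead of the local one.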

`distribution.uploadDate`

Date when this media object was uploaded to this site.
https://schema.org/uploadDate (Date)

We cannot easily say this, unless we make "publish" leave a trace.

`distribution.name`

The name of the item
https://schema.org/name (Text)

Name of a remote. Not sure if special remotes qualify, as we need to identify distributions of the dataset, not (parts of) its content.

`distribution.url`

URL of the item
https://schema.org/url (URL)

The (fetch) URL of a remote.

`distribution.description`

A description of the item.
https://schema.org/description (Text)

A short description of the nature of the remote (Git, or git-annex special-remote of type ...)

`distribution.identifier`

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

The UUID of a git-annex key-store for a remote (if any exists).
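To illustrate the compound `distribution` object, here is a small sketch that builds one DataDownload record per remote from `git remote -v` output. The function names and the constant `description` are placeholders I made up; telling plain Git remotes from git-annex special remotes, and filling in `identifier`, would need extra queries that are left out here.

```python
import re
import subprocess

# A `git remote -v` line looks like: 'origin<TAB>https://host/ds.git (fetch)'
REMOTE_LINE = re.compile(r"^(?P<name>\S+)\t(?P<url>.+?) \((?P<kind>fetch|push)\)")


def parse_remotes(remote_output):
    """Build one DataDownload record per remote, using its fetch URL."""
    records = []
    for line in remote_output.split("\n"):
        match = REMOTE_LINE.match(line)
        if match is None or match.group("kind") != "fetch":
            continue
        records.append({
            "@type": "DataDownload",
            "name": match.group("name"),
            "url": match.group("url"),
            # Placeholder: the real nature of the remote is not known here
            "description": "Git remote",
        })
    return records


def git_distributions(repo_path):
    """Hypothetical helper: describe all remotes of a repository (needs git)."""
    out = subprocess.run(
        ["git", "remote", "-v"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_remotes(out)
```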

`provider`

The service provider, service operator, or service performer; the goods producer. Another party (a seller) may offer those services or goods on behalf of the provider. A provider may also serve as the seller.
https://schema.org/provider (Organization | Person)

This could be a record for DataLad itself, identifying it as the service provider (in the scope of the Dataset record; for DataDownload it would be a data portal or something else, but this is impossible to infer in general).

There is also publisher, but I don't think that matches the role of DataLad.

`version` [recommended by Google]

The version of the CreativeWork embodied by a specified resource.
https://schema.org/version (Number | Text)

0-<ncommits>-<refcommit-shasum>

Poor man's alternative to git describe -- which we cannot use unconditionally, as it needs (annotated) tags to function. Instead, we count every commit and use the initial commit as a universal reference. The format above mimics git describe output, but uses 0 as a constant prefix (not a tag).

ncommits = git log --no-merges --format=oneline |wc -l
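A minimal sketch of assembling that version string. `git rev-list --count --no-merges HEAD` yields the same number as the `git log ... | wc -l` pipeline; which commit exactly serves as `refcommit` is left to the caller here, and both function names are hypothetical.

```python
import subprocess


def make_version(ncommits, refcommit_sha):
    """Format the proposed version string: 0-<ncommits>-<refcommit-shasum>."""
    return f"0-{int(ncommits)}-{refcommit_sha}"


def count_commits(repo_path):
    """Count non-merge commits reachable from HEAD (needs git)."""
    out = subprocess.run(
        ["git", "rev-list", "--count", "--no-merges", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())
```

Usage would be along the lines of `make_version(count_commits("."), sha)`, with `sha` being whatever the extractor considers the reference commit.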

`includedInDataCatalog` [recommended by Google]

A data catalog which contains this dataset.
https://schema.org/includedInDataCatalog (DataCatalog)

List of remotes (i.e. distributions, see above), plus the superdataset (topmost only, to not be redundant with isPartOf).

Not sure how to reference the superdataset as a DataCatalog-type. For DataLad any Dataset is also a DataCatalog. Maybe we should market any dataset as a catalog, but we would lose the distribution field when switching the type (and I am not sure if things like Google dataset search are happy with this).

One approach to deal with this in a context like datalad/datalad-revolution#76 (where we know that a superdataset is serving the purpose of a data catalog, as opposed to just tracking dependencies) would be to generate a single page with a DataCatalog-type metadata record, and for all subdataset pages refer to this page as the containing catalog. In DataLad's actual metadata records, however, we do not use this (as the "is-in" relationship is only reliably determined from a (distant) superdataset). Instead, we limit the record to immediate child relationships, i.e. hasPart, and only inject isPartOf and includedInDataCatalog at the time of exporting metadata in a specific context, for a specific purpose.
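The "inject at export time" idea could look roughly like the following sketch. The stored record only carries `hasPart`; the export step adds `isPartOf` and `includedInDataCatalog` when the context provides them. The function name, its parameters, and the record layout are assumptions for illustration.

```python
import copy


def export_record(record, superdataset_id=None, catalog_url=None):
    """Return a copy of a stored record, amended for one export context.

    The stored record is left untouched, so context-specific relations
    never leak back into the extracted (purely factual) metadata.
    """
    out = copy.deepcopy(record)
    if superdataset_id is not None:
        out["isPartOf"] = {"@type": "Dataset", "identifier": superdataset_id}
    if catalog_url is not None:
        out["includedInDataCatalog"] = {"@type": "DataCatalog", "url": catalog_url}
    return out
```

A catalog page generator would call this once per subdataset page, passing the URL of the single DataCatalog page.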

Should be standard, but have no standard source

For the following aspects we could implement heuristics. Extracted metadata should only contain facts. Any such heuristics should be employed/executed at a late stage in an application context (where it is known how error-tolerant one can be). Hence, we only have to think about what kind of factual information we want to extract to enable such heuristics.

`license`

A license document that applies to this content, typically indicated by URL.
https://schema.org/license (CreativeWork | URL)

We could extract the content of LICENSE or COPYING, if such file exists.
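A minimal sketch of the fact-only extraction step: report the verbatim content of the first matching file and leave any mapping to a license URL to a later heuristic. The function name and the shape of the return value are assumptions.

```python
from pathlib import Path


def find_license_text(dataset_root, candidates=("LICENSE", "COPYING")):
    """Return name and content of the first existing candidate file, else None."""
    for name in candidates:
        path = Path(dataset_root) / name
        if path.is_file():
            # Report the raw text only; identifying the license is a heuristic
            text = path.read_text(encoding="utf-8", errors="replace")
            return {"file": name, "text": text}
    return None
```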

`keywords`

Keywords or tags used to describe this content. Multiple entries in a keywords list are typically delimited by commas.
https://schema.org/keywords (Text)

Amend with the names of metadata extractors.

mih transferred this issue from datalad/datalad May 30, 2020

mih mentioned this issue Nov 20, 2021

Metadata extractor duties wishy-whashy #177

Open

mih mentioned this issue Jul 4, 2023

Relevant/related issues from metalad psychoinformatics-de/datalad-tabby#32

Closed
