Map dataset properties to schema.org terms #51

Open
mih opened this issue Jan 25, 2019 · 0 comments

I want to improve the usability of the output of our internal metadata extractors -- those that look at common properties of all datasets (I consider anything else out-of-scope here), and can therefore always run -- with respect to applications like datalad/datalad-revolution#76.

For this, it would be helpful to discuss and agree on a mapping of such properties on schema.org terms. The following is a list of terms that (I think) are applicable, and their proposed mapping.
Please contribute by extending the list, and arguing for/against my proposal. Thx!

After a bit of thinking, I am of the opinion that we should avoid shoehorning meaning onto terms that exist but are not a 100% match. So the following is a list of only those properties where I see such a match.

Standard, with a definable source

The following aspects have a (potentially definitive) source within the scope of the core metadata extractors.

identifier [recommended by Google]

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

This is a dataset's UUID. This ID is also used to identify relationships between datasets.

See the hasPart section for a potential reason to also consider the latest commit SHA as an additional identifier.

contributor

A secondary contributor to the CreativeWork or Event.
https://schema.org/contributor (Organization | Person)

We have no way to infer author, but we can surely state that any author of a commit in the history is a contributor. An extractor should consult the mailmap to give a sensible report. The list of contributors reported by the extractor need not be exhaustive.

Something like this: git log --use-mailmap --no-merges --format=format:'%aN%x00%aE%n%cN%x00%cE' |sort |uniq
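The command above emits one NUL-separated name/email pair per line, for authors and committers alike. A minimal sketch of how an extractor might turn that output into Person records follows. The function names and the use of `name`/`email` as the reported Person properties are assumptions for illustration, not existing extractor code.

```python
import subprocess


def git_contributor_log(repo_path):
    """Run the git log command from the proposal and return its raw output."""
    result = subprocess.run(
        ["git", "log", "--use-mailmap", "--no-merges",
         "--format=format:%aN%x00%aE%n%cN%x00%cE"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_contributors(log_output):
    """Turn NUL-separated name/email lines into unique Person records."""
    seen = set()
    for line in log_output.splitlines():
        if not line.strip():
            continue
        # Each line is '<name>' NUL '<email>'
        name, _, email = line.partition("\x00")
        seen.add((name.strip(), email.strip()))
    # Sorting replaces the `sort | uniq` part of the shell pipeline
    return [
        {"@type": "Person", "name": name, "email": email}
        for name, email in sorted(seen)
    ]
```

Keeping the parsing separate from the git call means the deduplication logic can be checked without a repository at hand.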

hasPart

Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense)
https://schema.org/hasPart (CreativeWork)

These are primarily subdatasets (referenced by their dataset ID), but we could provide a list of files (by annex-key or shasum too).

This one is tricky to assemble, as a given dataset may not have all information about its subdatasets (e.g. their IDs aggregated), and we cannot rely on or require all subdatasets to be installed. What we do know, however, is the state of the subdataset that is referenced (commit) -- this is as precise an ID as the UUID, but much more volatile.
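One way to cope with partially known subdatasets is to always report the recorded commit and to add the dataset ID only when it is available. The following is a sketch under that assumption. The record layout and the `propertyID` labels (`datalad-uuid`, `git-commit`) are hypothetical and not an agreed convention.

```python
def has_part(subdatasets):
    """Build a hasPart list from (path, commit, dataset_id) tuples.

    dataset_id may be None when a subdataset's ID has not been aggregated.
    """
    parts = []
    for path, commit, dataset_id in subdatasets:
        # The referenced commit is always known from the superdataset
        ids = [{"@type": "PropertyValue",
                "propertyID": "git-commit",  # hypothetical label
                "value": commit}]
        if dataset_id is not None:
            # Prefer the stable UUID as the first identifier when we have it
            ids.insert(0, {"@type": "PropertyValue",
                           "propertyID": "datalad-uuid",  # hypothetical label
                           "value": dataset_id})
        parts.append({"@type": "Dataset", "name": path, "identifier": ids})
    return parts
```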

isPartOf

Indicates an item or CreativeWork that this item, or CreativeWork (in some sense), is part of.
https://schema.org/isPartOf (CreativeWork)

This is the UUID of the superdataset. BUT see includedInDataCatalog. For any given dataset, this is easier to determine than the hasPart side of things: we just need to look for a single superdataset once vs. all the subdatasets.

distribution

A downloadable form of this dataset, at a specific location, in a specific format.
https://schema.org/distribution (DataDownload)

This is any remote of a dataset, described by a compound object (see below for applicable properties).

distribution.contentUrl

Actual bytes of the media object
https://schema.org/contentUrl (URL)

A URL that datalad install can act on. The media object here is the dataset itself.

dateCreated

The date on which the CreativeWork was created or the item was added to a DataFeed.
https://schema.org/dateCreated (Date | DateTime)

Timestamp of the initial commit.

(distribution.)dateModified

The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed.
https://schema.org/dateModified (Date | DateTime)

The timestamp of the last commit on record for the dataset, or, in case of a DataDownload, the respective remote.

distribution.uploadDate

Date when this media object was uploaded to this site.
https://schema.org/uploadDate (Date)

We cannot easily say this, unless we make "publish" leave a trace.

distribution.name

The name of the item
https://schema.org/name (Text)

Name of a remote. Not sure if special remotes qualify, as we need to identify distributions of the dataset, not (parts of) its content.

distribution.url

URL of the item
https://schema.org/url (URL)

The (fetch) URL of a remote.

distribution.description

A description of the item.
https://schema.org/description (Text)

A short description of the nature of the remote (Git, or git-annex special-remote of type ...)

distribution.identifier

The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc.
https://schema.org/identifier (PropertyValue | Text | URL)

The UUID of the git-annex key-store of a remote (if any exists).
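Taken together, the distribution.* properties above describe one compound object per remote. A rough sketch of assembling such an object is shown here. The function and parameter names are made up for illustration, and uploadDate is left out because there is currently no source for it.

```python
def distribution_record(name, url, content_url=None, description=None,
                        annex_uuid=None, date_modified=None):
    """Describe one remote as a schema.org DataDownload record."""
    record = {"@type": "DataDownload", "name": name, "url": url}
    optional = {
        "contentUrl": content_url,      # URL that `datalad install` can act on
        "description": description,     # nature of the remote
        "identifier": annex_uuid,       # git-annex UUID, if any exists
        "dateModified": date_modified,  # last commit known for this remote
    }
    # Only report what is actually known; omit everything else
    record.update({k: v for k, v in optional.items() if v is not None})
    return record
```

Omitting unknown properties rather than emitting empty values keeps the record limited to facts.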

provider

The service provider, service operator, or service performer; the goods producer. Another party (a seller) may offer those services or goods on behalf of the provider. A provider may also serve as the seller.
https://schema.org/provider (Organization | Person)

This could be a record for DataLad itself, identifying it as the service provider (in the scope of the Dataset record; for DataDownload it would be a data portal or something else, but this is impossible to infer in general).

There is also publisher, but I don't think that matches the role of DataLad.

version [recommended by Google]

The version of the CreativeWork embodied by a specified resource.
https://schema.org/version (Number | Text)

0-<ncommits>-<refcommit-shasum>

Poor man's alternative to git describe -- which we cannot use unconditionally, as it needs (annotated) tags to function. Instead, we count all non-merge commits and use the initial commit as a universal reference. The above format mimics git describe output, but uses 0 as a constant prefix (not a tag).

ncommits = git log --no-merges --format=oneline |wc -l
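To make the format concrete, here is a small sketch of composing the version string from the output of the command above. The helper names are hypothetical, and the checksum is inserted as given, without the `g` prefix or abbreviation that git describe would apply.

```python
def count_commits(oneline_log):
    """Count commits in `git log --no-merges --format=oneline` output."""
    return sum(1 for line in oneline_log.splitlines() if line.strip())


def make_version(ncommits, refcommit_shasum):
    """Compose '0-<ncommits>-<refcommit-shasum>' with 0 as a constant prefix."""
    return "0-{}-{}".format(ncommits, refcommit_shasum)
```

For example, `make_version(count_commits(log_text), shasum)` yields a string of the proposed shape for any dataset, tagged or not.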

includedInDataCatalog [recommended by Google]

A data catalog which contains this dataset.
https://schema.org/includedInDataCatalog (DataCatalog)

List of remotes (i.e. distributions, see above), plus the super dataset (topmost only, to not be redundant with isPartOf).

Not sure how to reference the superdataset as a DataCatalog-type. For DataLad any Dataset is also a DataCatalog. Maybe we should market any dataset as a catalog, but we would lose the distribution field when switching the type (and not sure if things like Google dataset search are happy with this).

One approach to deal with this in a context like datalad/datalad-revolution#76 (where we know that a superdataset is serving the purpose of a data catalog, as opposed to just tracking dependencies) would be to generate a single page with a DataCatalog-type metadata record, and for all subdataset pages refer to this page as the containing catalog. In DataLad's actual metadata records, however, we do not use this (as the "is-in" relationship is only reliably determined from a (distant) superdataset). Instead, we limit the record to immediate child relationships, i.e. hasPart, and only inject isPartOf and includedInDataCatalog at the time of exporting metadata in a specific context, for a specific purpose.
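The export-time injection described above could look roughly like the following sketch. The function name, its parameters, and the shape of the injected values are assumptions. The stored record is copied so that it stays limited to hasPart-style facts.

```python
import copy


def export_record(core_record, superdataset_id=None, catalog_url=None):
    """Return a copy of a stored record, enriched for one export context."""
    record = copy.deepcopy(core_record)  # never modify the stored record
    if superdataset_id is not None:
        record["isPartOf"] = {"@type": "Dataset",
                              "identifier": superdataset_id}
    if catalog_url is not None:
        # Points at the single page carrying the DataCatalog-type record
        record["includedInDataCatalog"] = {"@type": "DataCatalog",
                                           "url": catalog_url}
    return record
```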

Should be standard, but have no standard source

For the following aspects we could implement heuristics. Extracted metadata should only contain facts. Any such heuristics should be employed/executed at a late stage in an application context (where it is known how error-tolerant one can be). Hence, we only have to think about what kind of factual information we want to extract to enable such heuristics.

license

A license document that applies to this content, typically indicated by URL.
https://schema.org/license (CreativeWork | URL)

We could extract the content of LICENSE or COPYING, if such a file exists.
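A sketch of the purely factual part of this: report which of the two files exists and what it contains, and leave any guessing of the actual license to a later stage. The function name and the return shape are illustrative assumptions.

```python
from pathlib import Path


def find_license_text(dataset_path, candidates=("LICENSE", "COPYING")):
    """Return (filename, content) of the first license file found, else None."""
    for name in candidates:
        candidate = Path(dataset_path) / name
        if candidate.is_file():
            # Report the raw text only; no attempt to identify the license
            return name, candidate.read_text(encoding="utf-8", errors="replace")
    return None
```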

keywords

Keywords or tags used to describe this content. Multiple entries in a keywords list are typically delimited by commas.
https://schema.org/keywords (Text)

Amend with the names of metadata extractors.
