I want to improve the usability of the output of our internal metadata extractors -- those that look at common properties of all datasets (I consider anything else out of scope here) and can therefore always run -- with respect to applications like datalad/datalad-revolution#76.
For this, it would be helpful to discuss and agree on a mapping of such properties onto schema.org terms. The following is a list of terms that (I think) are applicable, and their proposed mapping.
Please contribute by extending the list, and arguing for/against my proposal. Thx!
After a bit of thinking, I am of the opinion that we should avoid shoehorning meaning onto terms that exist but are not a 100% match. So the following is a list of only those properties where I see such a match.
Standard, with a definable source
The following aspects have a (potentially definitive) source within the scope of the core metadata extractors.
identifier [recommended by Google]
The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc. https://schema.org/identifier (PropertyValue | Text | URL)
This is a dataset's UUID. This ID is also used to identify relationships between datasets.
See the hasPart section for a potential reason to also consider the latest commit SHA as an additional identifier.
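As a rough illustration of the identifier mapping, here is a minimal sketch that assumes the dataset UUID (and, optionally, the latest commit SHA) has already been obtained elsewhere. The function name `identifier_record` and the `propertyID` labels are hypothetical and not an agreed format.

```python
from __future__ import annotations


def identifier_record(dataset_id: str, commit_sha: str | None = None) -> dict:
    """Build a schema.org Dataset stub carrying the proposed identifier(s)."""
    record = {"@context": "https://schema.org", "@type": "Dataset"}
    if commit_sha is None:
        # Only the stable dataset UUID is known.
        record["identifier"] = dataset_id
    else:
        # Report both; the propertyID labels are assumptions for illustration.
        record["identifier"] = [
            {"@type": "PropertyValue", "propertyID": "datalad-dataset-id",
             "value": dataset_id},
            {"@type": "PropertyValue", "propertyID": "git-commit",
             "value": commit_sha},
        ]
    return record
```

Keeping the plain-text form when only the UUID is known keeps the simplest records simple; whether a consumer prefers a uniform list is an open question.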
contributor
We have no way to infer author, but we can state with confidence that any author of a commit in the history is a contributor. An extractor should consult the mailmap to give a sensible report. The list of contributors reported by the extractor need not be exhaustive.
Something like this: git log --use-mailmap --no-merges --format=format:'%aN%x00%aE%n%cN%x00%cE' | sort | uniq
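The same command could be wrapped for use in an extractor roughly as follows. This is a sketch, not the extractor's actual code: the function names are made up, and the parsing assumes the NUL-separated name/email format used in the command above.

```python
from __future__ import annotations

import subprocess


def parse_contributors(log_output: str) -> list[tuple[str, str]]:
    """Turn 'name<NUL>email' lines into a sorted list of unique pairs."""
    seen = set()
    for line in log_output.splitlines():
        if not line:
            continue
        name, _, email = line.partition("\x00")
        seen.add((name, email))
    return sorted(seen)


def list_contributors(repo_path: str) -> list[tuple[str, str]]:
    """Run the git log call from above in repo_path (honors .mailmap)."""
    out = subprocess.run(
        ["git", "log", "--use-mailmap", "--no-merges",
         "--format=format:%aN%x00%aE%n%cN%x00%cE"],
        cwd=repo_path, check=True, capture_output=True, text=True,
    ).stdout
    return parse_contributors(out)
```

The set-based deduplication stands in for the `sort | uniq` part of the shell pipeline.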
hasPart
Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense) https://schema.org/hasPart (CreativeWork)
These are primarily subdatasets (referenced by their dataset ID), but we could also provide a list of files (by annex key or shasum).
This one is tricky to assemble, as a given dataset may not have all information about its subdatasets (e.g. their IDs aggregated), and we cannot rely on or require all subdatasets to be installed. What we do know, however, is the state of the subdataset that is referenced (commit) -- this is as precise an ID as the UUID, but much more volatile.
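One way to handle the missing-ID case could look like the sketch below. The input layout (dicts with `path`, `commit`, and an optional `id` key) and the `propertyID` label are assumptions for illustration; they do not describe an existing DataLad data structure.

```python
from __future__ import annotations


def has_part(subdatasets: list[dict]) -> list[dict]:
    """Build hasPart entries from (hypothetical) subdataset records.

    The referenced commit is always known from the superdataset; the
    dataset ID is only added when it happens to be available.
    """
    parts = []
    for sub in subdatasets:
        identifiers = [
            {"@type": "PropertyValue", "propertyID": "git-commit",
             "value": sub["commit"]},
        ]
        if sub.get("id"):
            # Stable UUID first, volatile commit second.
            identifiers.insert(0, sub["id"])
        parts.append({
            "@type": "Dataset",
            "name": sub["path"],
            "identifier": identifiers,
        })
    return parts
```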
isPartOf
Indicates an item or CreativeWork that this item, or CreativeWork (in some sense), is part of. https://schema.org/isPartOf (CreativeWork)
This is the UUID of the superdataset. BUT see includedInDataCatalog. For any given dataset, this is easier to determine than the hasPart side of things: we just need to look for a single superdataset once, rather than for all the subdatasets.
distribution
A downloadable form of this dataset, at a specific location, in a specific format. https://schema.org/distribution (DataDownload)
This is any remote of a dataset, described by a compound object (see below for applicable properties).
distribution.contentUrl
A URL that datalad install can act on. The media object here is the dataset itself.
dateCreated
The date on which the CreativeWork was created or the item was added to a DataFeed. https://schema.org/dateCreated (Date | DateTime)
Timestamp of the initial commit.
(distribution.)dateModified
The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed. https://schema.org/dateModified (Date | DateTime)
The timestamp of the last commit on record for the dataset, or, in case of a DataDownload, the respective remote.
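Both timestamps can be read from the commit history. The sketch below is one possible approach, with made-up function names: it asks git for committer dates as Unix epochs (`%ct`) and converts them to ISO 8601 in UTC. Using the committer date rather than the author date, and taking the earliest root commit when a history has several, are assumptions.

```python
from __future__ import annotations

import subprocess
from datetime import datetime, timezone


def iso_from_epoch_lines(lines: str, pick=min) -> str | None:
    """Pick one epoch timestamp from whitespace-separated input, as ISO 8601 UTC."""
    stamps = [int(tok) for tok in lines.split()]
    if not stamps:
        return None
    return datetime.fromtimestamp(pick(stamps), tz=timezone.utc).isoformat()


def _git(repo_path: str, *args: str) -> str:
    return subprocess.run(
        ["git", *args], cwd=repo_path, check=True,
        capture_output=True, text=True,
    ).stdout


def date_created(repo_path: str) -> str | None:
    # Root commits have no parents; a merged history can have more than one.
    out = _git(repo_path, "log", "--max-parents=0", "--format=%ct")
    return iso_from_epoch_lines(out, pick=min)


def date_modified(repo_path: str, ref: str = "HEAD") -> str | None:
    # For a DataDownload, ref could be a remote-tracking branch instead.
    out = _git(repo_path, "log", "-1", "--format=%ct", ref)
    return iso_from_epoch_lines(out, pick=max)
```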
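distribution.uploadDate
We cannot easily say this, unless we make "publish" leave a trace.
distribution.name
Name of a remote. Not sure if special remotes qualify, as we need to identify distributions of the dataset, not (parts) of its content.
distribution.url
The (fetch) URL of a remote.
distribution.description
A short description of the nature of the remote (Git, or git-annex special remote of type ...).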
distribution.identifier
The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc. https://schema.org/identifier (PropertyValue | Text | URL)
The UUID of a git-annex key store for a remote (if any exists).
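To make the compound object more concrete, here is a sketch that turns the output of `git remote -v` into DataDownload records. The function name and the `annex_uuids` argument (a mapping from remote name to git-annex UUID, however it was obtained) are hypothetical; only fetch URLs are reported, following the distribution.url proposal.

```python
from __future__ import annotations

import re

# Matches lines like "origin<TAB>https://example.com/ds.git (fetch)".
_REMOTE_LINE = re.compile(r"^(\S+)\s+(.+?) \((fetch|push)\)")


def distributions(remote_v_output: str,
                  annex_uuids: dict[str, str] | None = None) -> list[dict]:
    """Build one DataDownload record per remote that has a fetch URL."""
    annex_uuids = annex_uuids or {}
    records = []
    for line in remote_v_output.splitlines():
        match = _REMOTE_LINE.match(line)
        if not match or match.group(3) != "fetch":
            continue
        name, url = match.group(1), match.group(2)
        record = {"@type": "DataDownload", "name": name, "url": url}
        if name in annex_uuids:
            record["identifier"] = annex_uuids[name]
        records.append(record)
    return records
```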
provider
The service provider, service operator, or service performer; the goods producer. Another party (a seller) may offer those services or goods on behalf of the provider. A provider may also serve as the seller. https://schema.org/provider (Organization | Person)
This could be a record for DataLad itself, identifying it as the service provider (in the scope of the Dataset record; for DataDownload it would be a data portal or something else, but this is impossible to infer in general).
There is also publisher, but I don't think that matches the role of DataLad.
version [recommended by Google]
The version of the CreativeWork embodied by a specified resource. https://schema.org/version (Number | Text)
0-<ncommits>-<refcommit-shasum>
Poor man's alternative to git describe -- which we cannot use unconditionally, as it needs (annotated) tags to function. Instead, we count commits and use the initial commit as a universal reference. The format above mimics git describe output, but uses 0 as a constant prefix (not a tag).
ncommits = git log --no-merges --format=oneline | wc -l
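A sketch of how such a version string could be assembled is shown below. The function names are made up, and using the abbreviated SHA of HEAD as the reference commit is an assumption, since the proposal does not pin down which commit the shasum refers to. Unlike git describe, no "g" prefix is put in front of the hash, because the format above does not show one.

```python
import subprocess


def format_version(ncommits: int, refcommit: str) -> str:
    """Format 0-<ncommits>-<refcommit-shasum> with the constant 0 prefix."""
    return f"0-{ncommits}-{refcommit}"


def _git(repo_path: str, *args: str) -> str:
    return subprocess.run(
        ["git", *args], cwd=repo_path, check=True,
        capture_output=True, text=True,
    ).stdout.strip()


def dataset_version(repo_path: str) -> str:
    # Counts non-merge commits reachable from HEAD, like the
    # `git log --no-merges ... | wc -l` pipeline.
    ncommits = int(_git(repo_path, "rev-list", "--count", "--no-merges", "HEAD"))
    # Assumption: HEAD serves as the reference commit.
    sha = _git(repo_path, "rev-parse", "--short", "HEAD")
    return format_version(ncommits, sha)
```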
includedInDataCatalog [recommended by Google]
List of remotes (i.e. distributions, see above), plus the superdataset (topmost only, to not be redundant with isPartOf).
Not sure how to reference the superdataset as a DataCatalog type. For DataLad, any Dataset is also a DataCatalog. Maybe we should market any dataset as a catalog, but we would lose the distribution field when switching the type (and I am not sure if things like Google Dataset Search are happy with this).
One approach to deal with this in a context like datalad/datalad-revolution#76 (where we know that a superdataset is serving the purpose of a data catalog, as opposed to just tracking dependencies) would be to generate a single page with a DataCatalog-type metadata record, and for all subdataset pages refer to this page as the containing catalog. In DataLad's actual metadata records, however, we do not use this (as the "is-in" relationship is only reliably determined from a (distant) superdataset). Instead, we limit the record to immediate child relationships, i.e. hasPart, and only inject isPartOf and includedInDataCatalog at the time of exporting metadata in a specific context, for a specific purpose.
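The late injection described here could be sketched as follows. The function name, its arguments, and the idea of pointing at the catalog page by URL are assumptions for illustration; the stored record is left untouched and only the exported copy gains the context-dependent properties.

```python
from __future__ import annotations

import copy


def export_record(record: dict,
                  superdataset_id: str | None = None,
                  catalog_url: str | None = None) -> dict:
    """Return a copy of a stored record, enriched for one export context."""
    out = copy.deepcopy(record)
    if superdataset_id is not None:
        out["isPartOf"] = {"@type": "Dataset", "identifier": superdataset_id}
    if catalog_url is not None:
        # Hypothetical: the single page carrying the DataCatalog record.
        out["includedInDataCatalog"] = {"@type": "DataCatalog",
                                        "url": catalog_url}
    return out
```

Copying before modification keeps the extracted record limited to what the dataset itself knows (hasPart), in line with the split proposed above.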
Should be standard, but have no standard source
For the following aspects we could implement heuristics. Extracted metadata should only contain facts. Any such heuristics should be applied at a late stage, in an application context (where it is known how error-tolerant one can be). Hence, we only have to think about what kind of factual information we want to extract to enable such heuristics.
license
A license document that applies to this content, typically indicated by URL. https://schema.org/license (CreativeWork | URL)
We could extract the content of LICENSE or COPYING, if such a file exists.
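A minimal sketch of that fact-only extraction, with a made-up function name: it returns the raw text of the first candidate file found and leaves any license detection to a later heuristic. Files whose content is not locally available (for example an annexed file that appears as a dangling symlink) are skipped.

```python
from __future__ import annotations

from pathlib import Path


def read_license_text(dataset_path: str,
                      candidates: tuple[str, ...] = ("LICENSE", "COPYING"),
                      ) -> tuple[str, str] | None:
    """Return (filename, content) of the first existing candidate, else None."""
    for name in candidates:
        path = Path(dataset_path) / name
        # is_file() is False for missing files and for dangling symlinks.
        if path.is_file():
            return name, path.read_text(encoding="utf-8", errors="replace")
    return None
```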
keywords
Keywords or tags used to describe this content. Multiple entries in a keywords list are typically delimited by commas. https://schema.org/keywords (Text)
Amend with the names of metadata extractors.
mih transferred this issue from datalad/datalad on May 30, 2020