[FEATURE] Integration of FAIR data point (FDP) functionality for Vault data #341

Open
erikvdbergh opened this issue Jan 15, 2024 · 8 comments

Comments

@erikvdbergh

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

FAIR Data Point (FDP) is a standard describing a REST API for creating, storing, and serving FAIR metadata. It is used in many research projects in the Netherlands, and the goal is to create a network of FAIR Data Points to enable full interoperability across domains. See an example implementation here: https://github.com/FAIRDataTeam/FAIRDataPoint. It would be fantastic if Vault data could be delivered to the outside world through an FDP-compatible API, so that each Yoda instance adds to this network. This also requires an RDF expression of the metadata, which Yoda does not yet support.

Additional context

See https://www.go-fair.org/how-to-go-fair/fair-data-point/ for more information about FDP
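To make "RDF expression of metadata" concrete, here is a minimal sketch of how a dataset record might be written out as DCAT in Turtle. The `record` dictionary and its field names are hypothetical and much simpler than the real Yoda metadata schema, and the choice of `dct:title`, `dct:description`, and `dcat:keyword` is only one plausible mapping, not something Yoda or the FDP reference implementation prescribes.

```python
import json

PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)

# Hypothetical, simplified stand-in for a published Yoda metadata record.
record = {
    "doi": "10.1234/example",
    "title": "Example dataset",
    "description": "A short description.",
    "keywords": ["cohort", "development"],
}


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal.

    JSON string escaping (quotes, backslashes, newlines) is also valid
    inside a double-quoted Turtle string, so json.dumps is reused here.
    """
    return json.dumps(text, ensure_ascii=False)


def to_dcat_turtle(rec: dict) -> str:
    """Render one record as a dcat:Dataset description in Turtle."""
    props = [
        "a dcat:Dataset",
        f"dct:title {turtle_literal(rec['title'])}",
        f"dct:description {turtle_literal(rec['description'])}",
    ]
    # Only emit dcat:keyword when there is at least one keyword.
    if rec.get("keywords"):
        keywords = ", ".join(turtle_literal(k) for k in rec["keywords"])
        props.append(f"dcat:keyword {keywords}")
    body = " ;\n    ".join(props)
    # Assumption: the DOI resolver URL serves as the dataset IRI.
    return f"{PREFIXES}\n<https://doi.org/{rec['doi']}>\n    {body} .\n"


print(to_dcat_turtle(record))
```

A real implementation would more likely use an RDF library and cover the mandatory DCAT-AP properties; the string building above is only meant to show the shape of the output.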

@erikvdbergh
Author

Tagging @rwwh to add / comment on this issue.

@Danny-dK

All vault data or only published vault data? The Yoda metadata or file metadata? Exposing all vault data would be less sensible, especially as that data could be sensitive and/or incomplete. I could imagine some sort of flag in the metadata that indicates whether a dataset may be used in FDP. I could also imagine that a user who wants this simply publishes with restricted or closed access, which makes the metadata openly accessible. In that sense, the metadata is already available and searchable. When I look at https://specs.fairdatapoint.org/fdp-specs-v1.2.html#usagescenarios I don't think I understand what the added benefit is over the existing implementations in Yoda for metadata discovery (which is searchable online). I could imagine that the Yoda metadata could be offered in some other format that conforms to FDP, which it does not do now, or that file metadata could become discoverable (is that what you mean with the feature request)?

@rwwh

rwwh commented Feb 21, 2024

The EC is standardizing the exchange of dataset-level metadata as DCAT, more specifically DCAT-AP. data.overheid.nl and data.europe.eu are working through exchange of DCAT metadata. Multiple organisations are publishing their DCAT metadata using the FAIR Data Point protocol, such as healthdata.nl and healthdataportal.eu. That compatibility really helps: a single Yoda instance is searchable, but using a machine-actionable DCAT API like FAIR Data Point it becomes possible to build a hierarchy of catalogues indexing different systems, Yoda and Dataverse alike. For instance, there is already a plugin for CKAN that can harvest FAIR data points. To increase interoperability, extensions to DCAT-AP for different research domains are being developed internationally.

I think it makes sense to be able to choose whether metadata is exposed for a data set; however, I don't think it makes sense if datasets are listed openly in the human interface of a repository and not in the FAIR Data Point API. That is probably not necessary as a separate choice.

Note that opening up dataset-level metadata is not saying anything about open access to the data. The data can be restricted or closed access, and it is still valuable to have a public exposure of the metadata. This is why "FAIR" contains the "A" for "Accessible under clear conditions" rather than "Open". In health-ri, for example, we are working on a unified "data access request" workflow. At the European level, such processes are also envisioned for the European Health Data Space and the One Million Human Genomes initiative. These are all based on restricted datasets but open dataset-level metadata.
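As an illustration of "open metadata, restricted data": the access level can itself be part of the public metadata, for example as a `dct:accessRights` statement. The sketch below assumes three simplified access levels (`open`, `restricted`, `closed`) and maps them to the EU Publications Office access-right vocabulary. Both the level names and the mapping, in particular `closed` to `NON_PUBLIC`, are assumptions for illustration and not an agreed Yoda or DCAT-AP profile.

```python
ACCESS_RIGHT_BASE = "http://publications.europa.eu/resource/authority/access-right/"

# Assumed mapping from simplified access levels to vocabulary terms.
ACCESS_RIGHTS = {
    "open": ACCESS_RIGHT_BASE + "PUBLIC",
    "restricted": ACCESS_RIGHT_BASE + "RESTRICTED",
    "closed": ACCESS_RIGHT_BASE + "NON_PUBLIC",
}


def access_rights_triple(dataset_iri: str, access_level: str) -> str:
    """Return one N-Triples line stating the access rights of a dataset."""
    if access_level not in ACCESS_RIGHTS:
        raise ValueError(f"unknown access level: {access_level!r}")
    return (
        f"<{dataset_iri}> "
        "<http://purl.org/dc/terms/accessRights> "
        f"<{ACCESS_RIGHTS[access_level]}> ."
    )


# The metadata stays public even though the data itself is restricted.
print(access_rights_triple("https://doi.org/10.1234/example", "restricted"))
```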

@Danny-dK

Danny-dK commented Feb 21, 2024

@rwwh Apologies if I'm too dumb, I'm not sure I get it. Are we talking about the Yoda metadata that is entered by a user (example of the Yoda metadata form here: https://utrechtuniversity.github.io/yoda-portal/) or the file metadata? If we mean the Yoda metadata: a user who wants it to be openly available publishes the metadata. A user who does not want that metadata openly available does not publish it, and then I wouldn't see the added benefit of letting that user choose to contribute their metadata to an FDP (because apparently they don't want it to be available). Once published, that metadata is searchable and findable through, for example, commons.datacite.org. I don't know if DataCite entries can be incorporated in FDP.
I'm not sure I understand this sentence:

> I don't think it makes sense if datasets are listed openly in the human interface of a repository and not in the FAIR Data Point API. That is probably not necessary as a separate choice.

So if the metadata is open, it is not necessary to offer a separate choice to add it to FDP?

> Note that opening up dataset-level metadata is not saying anything about open access to the data. The data can be restricted or closed access, and it is still valuable to have a public exposure of the metadata.

Yes indeed. Within Yoda, when a dataset is published with open access, the metadata form becomes publicly available with a DOI and a button to gain direct access to the data files. When restricted or closed access is chosen, only the metadata is publicly available, while the underlying data is not (but it can be requested when 'restricted access' was chosen).

> In health-ri, for example, we are working on a unified "data access request" workflow.

The YOUth project at UU also has a data request workflow, built on a specially designed Yoda data request module. As a first step, the requesting party fills in a form explaining the hypothesis, aim and goals, intended analysis methods, which data to request, etc. That request is then evaluated by a data manager, a project manager, and a committee of professors involved in the YOUth project. The committee can make suggestions and remarks, and approve or reject. The project manager has the final say in approval or rejection. Once the request is approved, the data manager makes a sharing agreement available that the requesting party has to sign and send back, which the data manager then archives. The data manager then starts gathering the data from the various storage locations, checks for any inconsistencies, and finally makes the data available through Yoda. Almost all of this workflow takes place within Yoda.

@rwwh

rwwh commented Feb 24, 2024

DCAT defines metadata of the entire repository, of catalogues within that repository, datasets within those catalogues, distributions of those datasets, and versioning.
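The nesting described above can be sketched as a small data model. The class and field names are made up for illustration; `dcat:dataset` and `dcat:distribution` are the DCAT properties linking the lower levels, while `fdp-o:metadataCatalog` as the link from the repository to its catalogues reflects my reading of the FAIR Data Point specification and should be checked against it. Versioning is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    iri: str
    media_type: str


@dataclass
class Dataset:
    iri: str
    title: str
    distributions: list[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    iri: str
    title: str
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class Repository:
    iri: str
    catalogs: list[Catalog] = field(default_factory=list)


def link_triples(repo: Repository):
    """Yield (subject, predicate, object) links between the levels."""
    for cat in repo.catalogs:
        # Assumption: predicate name taken from the FDP ontology.
        yield (repo.iri, "fdp-o:metadataCatalog", cat.iri)
        for ds in cat.datasets:
            yield (cat.iri, "dcat:dataset", ds.iri)
            for dist in ds.distributions:
                yield (ds.iri, "dcat:distribution", dist.iri)


# Hypothetical IRIs for a single-catalogue instance.
base = "https://yoda.example.org/fdp"
repo = Repository(base, [
    Catalog(f"{base}/catalog/1", "Vault", [
        Dataset(f"{base}/dataset/1", "Example dataset", [
            Distribution(f"{base}/distribution/1", "text/csv"),
        ]),
    ]),
])

for triple in link_triples(repo):
    print(triple)
```

A harvester building a hierarchy of catalogues would follow exactly these links, starting from the repository IRI.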

I think I agree with you that it only makes sense to show the public metadata of datasets in a Yoda instance in FAIR Data Point form; what I was trying to argue is that the human interface of the instance should probably list the same data sets as the FAIR Data Point interface. Hence, if a user "publishes" the metadata, the dataset appears in both the human-readable and the machine-readable interface.
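One way to guarantee that both interfaces list the same data sets is to feed them from a single filter and let only the output format depend on the request. The sketch below assumes a hypothetical record layout with a `status` field whose value `PUBLISHED` marks published packages, and a deliberately simplified reading of the HTTP `Accept` header (no wildcards, no specificity rules).

```python
def published(records):
    """Single source of truth: only published records are ever listed."""
    return [r for r in records if r["status"] == "PUBLISHED"]


def preferred_type(accept_header, supported=("text/html", "text/turtle")):
    """Pick the supported media type with the highest q-value.

    Falls back to the first supported type when nothing matches.
    """
    best, best_q = supported[0], 0.0
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip()
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media in supported and q > best_q:
            best, best_q = media, q
    return best


def listing(records, accept_header):
    """Return the chosen media type and the DOIs to list in it."""
    dois = [r["doi"] for r in published(records)]
    return preferred_type(accept_header), dois


# Hypothetical records: only the first one has been published.
records = [
    {"doi": "10.1234/a", "status": "PUBLISHED"},
    {"doi": "10.1234/b", "status": "UNPUBLISHED"},
]

print(listing(records, "text/html"))
print(listing(records, "text/turtle;q=0.9, text/html;q=0.5"))
```

Because `published` is the only place where visibility is decided, the human-readable and machine-readable listings cannot drift apart.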

Regarding data access requests: this is an independent issue. I would be careful about building this from scratch specifically for Yoda; it may be better to check which systems already exist for this (e.g. https://www.csc.fi/en/rems-kayttovaltuuksien-hallintajarjestelma). Researchers may need access to more than one data set for their work, and this is the reason that several of the initiatives I am seeing around me are trying to harmonize the request workflows, and sometimes also centralize them. Note that "the project manager has a final say" may be what many people think, but in fact European laws exist that forbid discrimination between requests. If a similar data request comes from a friend in Amsterdam or an unknown researcher in Helsinki or Athens, those requests must be treated identically. Legally, institutes could be better served by data access committees checking objectively than by project managers.

@Danny-dK

Ah, I think I'm starting to get it, thanks! This would mean that the Yoda metadata should be more machine readable than the current JSON form. @lwesterhof to what degree is this possible in the future of Yoda?

Regarding the data requests (which is indeed an independent issue): it's not necessarily from scratch, it is already in production in Yoda. I do agree it would be nice if requests were evaluated by independent people, but the crux is that independent people are generally less familiar with the data and the research (in this case cohort data of many young children). Hence the committee consists of professors from different departments (reducing the likelihood of a requester being a friend of the entire committee), and the evaluation and remarks are all stored in Yoda and known to the entire committee. If the project manager (or whatever function it may be) decides against the majority opinion of the committee, this would then be discussed accordingly (formal objections and such; these data requests can sometimes take weeks to months). To some extent an independent person will be able to judge a data request, but in my opinion the in-depth knowledge of the data, and of where it may cause discomfort, really lies with the people involved in a project, and they need it to assess whether a hypothesis and intended analysis method are valid. I'm not sure whether the data requests (the required form to fill in) are publicly available (I think they were, but it has been some time since I heard about or tested the module); at least preregistration of the intended research is required. In my opinion, but who am I, the risk of non-identical treatment of a data request is more of an issue when everything is behind closed doors than with open procedures that committees and requesting researchers can find and object to. But indeed, as you say, it is an independent issue.

@peer35

peer35 commented Mar 6, 2024

Ok, so an FDP would actually mean something sort of similar to the OAI-PMH endpoint https://utrechtuniversity.github.io/yoda/design/overview/yoda-moai.html. That does make sense and will increase findability of the datasets. Of course only the metadata of published datasets should be made available.

@rwwh

rwwh commented Mar 6, 2024

Yes, OAI-PMH is also a way to exchange metadata; it is an older, non-semantic standard.

DCAT is a semantic standard, and DCAT-AP is the way dataset metadata is exchanged in Europe now. See e.g. https://joinup.ec.europa.eu/collection/semic-support-centre/dcat-ap

5 participants