-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RF: Provide an "interface" to perform ID mapping to an external DB #2
Conversation
See comments inside.
Thanks @mih! This looks great! About the validity of the Datalad IDs, I think that to be valid as identifiers in an RDF graph (such as a JSON-LD doc) they need to be IRIs. If I am correct, all we should need is to add a prefix to the Datalad IDs but we need to think about what would make sense as prefix in this case. For NIDM objects we usually use the NIIRI prefix that maps to http://iri.nidash.org/. |
Ah, ok. That makes sense. For datalad metadata with have @base set to http://dx.datalad.org in the context, so that any non-IRI identifiers get this as a resolver. Can this check be turned off? |
We can't turn this off because it's deep down within the libraries we are using. But if your @base is set to 'http://dx.datalad.org' then the ids could be updated to add this as a prefix, e.g |
After a bit of deliberation, I think we shouldn't set the @base in DataLad either. Instead, we should declare a |
I now found that
It will only accept URIs with a registered namespace. Will try to adjust the patch to do that. |
A cannot get the |
I did not have in mind that we had to register all namespaces for the IRIs to be valid. Could it be a typing issue instead? I had a lot of those in the past... |
FTR: Here is the syntax to amend the NIDM context record with a prefix for datalad IDs:
|
Here is the companion PR for DataLad: datalad/datalad-metalad#15 |
And here is a playground link showing a joint representation of metadata from two sources (NIDM results, and DataLad core): https://json-ld.org/playground/#startTab=tab-expanded&json-ld=http%3A%2F%2Fkumo.ovgu.de%2F~mih%2Fjsob.json ping @jbpoline |
@cmaumet from my perspective this is ready to go. Everything else can be done, or worked around, in DataLad. Later we should figure out how to properly inject an ID namespace in the NIDM context. But for now search&replace is good enough. |
Thanks for all the work @mih!
++1. I've created incf-nidash#56 as a reminder. Let's merge :) |
@mih - the output you point to above looks weird - there seems to be three embedded graphs. is there a specific reason for that? |
@satra "weird" is OK for me, "wrong" would be an issue. The origin of this structure is that there are three different metadata sources represented in the entire data structure. DataLad makes no attempt to compact/expand/ or otherwise modify them. This would take ages on any sizable dataset hierarchy. I consider using the named graph approach to add information on the origin of the information (ala describedBy), but for now that is not a priority. |
@mih - the part that looks weird and (may be unnecessarily complex) is when i make this graph compact, i see:
even if three sources are being added, at the end of the day it's a single graph (can be disjoint), but a graph nonetheless. so you can have: option 1:
or option 2: { "@context": [ ... where you simply insert new records into the records array. |
more generally though on this issue:
a nice property of graphs (especially rdf graphs) is that they can be merged without care. the queries and use cases may care, but from a database perspective they can be merged. in this case it is really a question of augmenting a graph in two ways:
for 1. this is simply a matter of adding a new record:
where for 2, it's a matter of adding
|
These graphs can be merged, but only once we have expanded the identifiers. I do not want to do that, as this is the step that requires availability of all context information. But some folks use
and then the machine is down twice a day ;-) So what I do is the closest approximation that I can think of and only needs the literal content of the metadata: I merge any graphs that have exactly the same context specification. That explains two levels of nesting, but not the third. That seems to be a bug in my NIDMresults extractor draft. Line 10 in http://kumo.ovgu.de/~mih/jsob.json shows a one-item list that contains the actual graph nodes. Will fix that. |
Third level is gone. |
With this change in place, DataLad can replace the mapping function with something like this:
where
idmap
is a DataLad-provided dictionary that maps absolute file paths to DataLad's file IDs. Theid
of both the file, and the parent objects (e.g. a Contrast) are adjusted. This happens on everyself.file=
set-call, that typically happens in the constructor ofNIDMObject
sTechnically this is all nice, but it doesn't work, because DataLad's IDs are rejected in
site-packages/prov/model.py
line 1608, byself.valid_qualified_name(identifier)
, whereidentifier
would look likeMD5E-s612--5fa06e4a914134058849a67e671e0057.png
, for example.valid_qualified_name()
returnsNone
for this. I do not yet understand why, or what could be done about it. Feedback or suggestions on this would be great!FTR, here is the full code of DataLad calling nidmresults-fsl: