
RF: Provide an "interface" to perform ID mapping to an external DB #2

Merged (1 commit) on May 29, 2019

Conversation

@mih commented May 23, 2019

With this change in place, DataLad can replace the mapping function with something like this:

def _map_ids(parentobj, fileobj):
    id_ = idmap.get(fileobj.path, fileobj.id)
    fileobj.id = id_
    parentobj.id = id_

where idmap is a DataLad-provided dictionary that maps absolute file paths to DataLad's file IDs. The IDs of both the file and the parent object (e.g. a Contrast) are adjusted. This happens on every self.file = ... set-call, which typically occurs in the constructor of a NIDMObject.
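The hook this relies on can be sketched in isolation. This is a minimal illustration, assuming a property setter on NIDMObject; the real nidmresults class is more involved, and the FileObj class and attribute names here are stand-ins:

```python
class FileObj:
    """Stand-in for a nidmresults file object (attribute names are assumptions)."""
    def __init__(self, path, id_):
        self.path = path
        self.id = id_


class NIDMObject:
    """Minimal sketch: every `self.file = ...` set-call runs the mapping hook."""

    def _map_fileid(self, fileobj):
        # default hook: leave IDs untouched; DataLad patches this method
        pass

    @property
    def file(self):
        return self._file

    @file.setter
    def file(self, fileobj):
        self._file = fileobj
        self._map_fileid(fileobj)


# DataLad-provided mapping: absolute file path -> DataLad file ID
idmap = {'/abs/path/example.png':
         'MD5E-s612--5fa06e4a914134058849a67e671e0057.png'}


def _map_ids(parentobj, fileobj):
    id_ = idmap.get(fileobj.path, fileobj.id)
    fileobj.id = id_
    parentobj.id = id_


# roughly what mock.patch does for the duration of the export
NIDMObject._map_fileid = _map_ids

obj = NIDMObject()
obj.file = FileObj('/abs/path/example.png', 'niiri:some-uuid')
```

Note that because mock.patch replaces the class attribute with a plain function, the bound-method call self._map_fileid(fileobj) passes the instance as parentobj, which matches the _map_ids signature.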

Technically this is all nice, but it doesn't work, because DataLad's IDs are rejected in site-packages/prov/model.py line 1608, by self.valid_qualified_name(identifier), where identifier would look like MD5E-s612--5fa06e4a914134058849a67e671e0057.png, for example. valid_qualified_name() returns None for this. I do not yet understand why, or what could be done about it. Feedback or suggestions on this would be great!

FTR, here is the full code of DataLad calling nidmresults-fsl:

    def _map_ids(parentobj, fileobj):
        id_ = idmap.get(fileobj.path, fileobj.id)
        fileobj.id = id_
        parentobj.id = id_

    # make_tempfile, text_type, and jsonload are DataLad utility helpers;
    # Path is pathlib.Path
    from mock import patch
    from nidmfsl.fsl_exporter.fsl_exporter import FSLtoNIDMExporter
    with make_tempfile(mkdir=True) as tmpdir, \
            patch(
                'nidmresults.objects.generic.NIDMObject._map_fileid',
                _map_ids):
        exporter = FSLtoNIDMExporter(
            out_dirname=tmpdir,
            zipped=False,
            feat_dir=text_type(feat_dir),
            # this is all fake, we cannot know it, but NIDM FSL wants it
            # TODO try fishing it out from the result again
            groups=[['control', 1]])
        exporter.parse()
        outdir = exporter.export()
        md = jsonload(text_type(Path(outdir) / 'nidm.json'))

@cmaumet (owner) commented May 24, 2019

Thanks @mih! This looks great!

About the validity of the DataLad IDs: I think that, to be valid as identifiers in an RDF graph (such as a JSON-LD doc), they need to be IRIs. If I am correct, all we should need to do is add a prefix to the DataLad IDs, but we need to think about what would make sense as a prefix in this case. For NIDM objects we usually use the NIIRI prefix, which maps to http://iri.nidash.org/.

@mih (author) commented May 24, 2019

Ah, ok. That makes sense. For DataLad metadata we have @base set to http://dx.datalad.org in the context, so that any non-IRI identifiers resolve against it. Can this check be turned off?

@cmaumet (owner) commented May 24, 2019

We can't turn this off because it's deep down within the libraries we are using. But if your @base is set to 'http://dx.datalad.org' then the IDs could be updated to add it as a prefix, e.g. http://dx.datalad.org:MD5E-s612--5fa06e4a914134058849a67e671e0057.png (which would be consistent with your DataLad IDs, while just MD5E-s612--5fa06e4a914134058849a67e671e0057.png would not, because we have not set the same @base in the NIDM document).

@mih (author) commented May 27, 2019

After a bit of deliberation, I think we shouldn't set the @base in DataLad either. Instead, we should declare a dl: prefix pointing to http://dx.datalad.org. This will prevent any accidental misidentification of IDs.

@mih (author) commented May 27, 2019

I now found that valid_qualified_name(identifier) will even reject an identifier like

'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'

It will only accept URIs with a registered namespace. Will try to adjust the patch to do that.
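For illustration, the observed behaviour can be approximated in plain Python. This is a toy sketch of what the check appears to do, not prov's actual implementation: an identifier is accepted only if it uses a registered prefix or falls under a registered namespace URI.

```python
class Bundle:
    """Toy approximation of the namespace check in prov's ProvBundle
    (illustrative only, not prov's actual code)."""

    def __init__(self):
        self._namespaces = {}  # prefix -> namespace URI

    def add_namespace(self, prefix, uri):
        self._namespaces[prefix] = uri

    def valid_qualified_name(self, identifier):
        # accept 'prefix:localpart' when the prefix is registered ...
        prefix, sep, local = identifier.partition(':')
        if sep and prefix in self._namespaces:
            return (prefix, local)
        # ... or a full URI that falls under a registered namespace URI
        for prefix, uri in self._namespaces.items():
            if identifier.startswith(uri):
                return (prefix, identifier[len(uri):])
        return None  # rejected, as observed with the bare DataLad IDs


bundle = Bundle()
datalad_iri = 'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'
rejected = bundle.valid_qualified_name(datalad_iri)  # no namespace registered yet
bundle.add_namespace('dl', 'http://dx.datalad.org/')
accepted = bundle.valid_qualified_name(datalad_iri)  # now resolves under 'dl'
```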

@mih (author) commented May 27, 2019

I cannot get the ProvBundle that is populated with these graph nodes to register a new prefix. I must be missing something. A few levels up the stack, Exporter.doc shows a properly registered dl namespace for the DataLad IDs.

@cmaumet (owner) commented May 27, 2019

I now found that valid_qualified_name(identifier) will even reject an identifier like
'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'
It will only accept URIs with a registered namespace. Will try to adjust the patch to do that.

I did not realize that we had to register all namespaces for the IRIs to be valid. Could it be a typing issue instead? I had a lot of those in the past...

@mih (author) commented May 27, 2019

FTR:

Here is the syntax to amend the NIDM context record with a prefix for datalad IDs:

      "@context": [
        "http://purl.org/nidash/context",
        {
          "datalad": "http://dx.datalad.org/"
        }
      ],
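With that context in place, an ID using the datalad prefix expands to a full IRI against the context's prefix map. A minimal hand-rolled sketch of that expansion (a real consumer would use a JSON-LD processor such as pyld):

```python
# amended context: the datalad prefix added alongside the NIDM context
context = {"datalad": "http://dx.datalad.org/"}


def expand(compact_iri, ctx):
    """Expand 'prefix:local' into a full IRI using the context's prefix map."""
    prefix, sep, local = compact_iri.partition(':')
    if sep and prefix in ctx:
        return ctx[prefix] + local
    return compact_iri  # not a known compact IRI; leave unchanged


expanded = expand(
    'datalad:MD5E-s612--5fa06e4a914134058849a67e671e0057.png', context)
```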

@mih (author) commented May 27, 2019

Here is the companion PR for DataLad: datalad/datalad-metalad#15

@mih (author) commented May 27, 2019

And here is a playground link showing a joint representation of metadata from two sources (NIDM results, and DataLad core): https://json-ld.org/playground/#startTab=tab-expanded&json-ld=http%3A%2F%2Fkumo.ovgu.de%2F~mih%2Fjsob.json

ping @jbpoline

@mih (author) commented May 27, 2019

@cmaumet from my perspective this is ready to go. Everything else can be done, or worked around, in DataLad.

Later we should figure out how to properly inject an ID namespace in the NIDM context. But for now search&replace is good enough.

@cmaumet (owner) commented May 29, 2019

@cmaumet from my perspective this is ready to go. Everything else can be done, or worked around, in DataLad.

Thanks for all the work @mih!

Later we should figure out how to properly inject an ID namespace in the NIDM context. But for now search&replace is good enough.

++1. I've created incf-nidash#56 as a reminder.

Let's merge :)

@cmaumet cmaumet merged commit 0507edd into cmaumet:nice_jsonld May 29, 2019
@satra commented May 29, 2019

@mih - the output you point to above looks weird - there seem to be three embedded graphs. is there a specific reason for that?

also see: https://www.w3.org/TR/json-ld11/#named-graphs

@mih (author) commented May 29, 2019

@satra "weird" is OK for me, "wrong" would be an issue. The origin of this structure is that there are three different metadata sources represented in the entire data structure. DataLad makes no attempt to compact/expand or otherwise modify them. This would take ages on any sizable dataset hierarchy. I am considering using the named-graph approach to add information on the origin of the information (a la describedBy), but for now that is not a priority.

@mih mih deleted the idmap branch May 29, 2019 12:31
@satra commented May 29, 2019

@mih - the part that looks weird (and may be unnecessarily complex) is that when i make this graph compact, i see:

  "@graph": [
    {
      "@graph": [
        {
          "@graph": [
            {

even if three sources are being added, at the end of the day it's a single graph (can be disjoint), but a graph nonetheless.

so you can have:

option 1:

[
  { "@context": ... },
  { "@context": ... },
  { "@context": ... }
]

or

option 2:

{
  "@context": [ ... ],
  "records": []
}

where you simply insert new records into the records array.
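Option 2 can be sketched as follows. The "records" key follows the snippet above; it is an illustration of the pattern, not an established NIDM term, and the record contents are hypothetical:

```python
doc = {
    "@context": [
        "http://purl.org/nidash/context",
        {"datalad": "http://dx.datalad.org/"},
    ],
    "records": [],
}


def add_records(document, new_records):
    """Merge metadata from another source by appending to the shared array."""
    document["records"].extend(new_records)
    return document


# two hypothetical sources contributing records to the same document
add_records(doc, [{"@id": "datalad:example-file", "@type": "Entity"}])
add_records(doc, [{"@id": "niiri:example-contrast", "@type": "Entity"}])
```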

@satra commented May 29, 2019

more generally though on this issue:

DataLad makes no attempt to compact/expand/ or otherwise modify them. This would take ages on any sizable dataset hierarchy.

a nice property of graphs (especially rdf graphs) is that they can be merged without special care. the queries and use cases may care, but from a database perspective they can simply be merged.

in this case it is really a question of augmenting a graph in two ways:

  1. connecting entities that were used/generated by an activity
  2. remapping entities to datalad IRIs

for 1. this is simply a matter of adding a new record:

{
  "@id": "id of activity",
  "used": [ iri1, iri2, ... ],
  "generated": [ iri3, iri4, ... ]
}

where used would be a @set type container.

for 2, it's a matter of adding isSameAs

{
  "@id": "niiri:...",
  "isSameAs": "datalad:..."
}

or

{
  "@id": "datalad:...",
  "isSameAs": "niiri:..."
}

@mih (author) commented May 29, 2019

These graphs can be merged, but only once we have expanded the identifiers. I do not want to do that, as this is the step that requires availability of all context information. But some folks use

"@context": "http://purl.org/nidash/context"

and then the machine is down twice a day ;-)

So what I do is the closest approximation that I can think of and only needs the literal content of the metadata: I merge any graphs that have exactly the same context specification.
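That merge step can be sketched like this (an illustration of the approach described, not DataLad's actual code): documents are grouped by the literal JSON serialization of their @context, and the node lists of matching graphs are concatenated.

```python
import json


def merge_graphs(docs):
    """Merge the @graph node lists of documents whose @context
    serializes to exactly the same JSON (no expansion required)."""
    merged = {}
    for doc in docs:
        key = json.dumps(doc.get('@context'), sort_keys=True)
        if key not in merged:
            merged[key] = {'@context': doc.get('@context'), '@graph': []}
        merged[key]['@graph'].extend(doc.get('@graph', []))
    return list(merged.values())


docs = [
    {'@context': 'http://purl.org/nidash/context', '@graph': [{'@id': 'a'}]},
    {'@context': 'http://purl.org/nidash/context', '@graph': [{'@id': 'b'}]},
    {'@context': {'datalad': 'http://dx.datalad.org/'}, '@graph': [{'@id': 'c'}]},
]
result = merge_graphs(docs)  # two documents remain: the contexts differ
```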

That explains two levels of nesting, but not the third. That seems to be a bug in my NIDMresults extractor draft. Line 10 in http://kumo.ovgu.de/~mih/jsob.json shows a one-item list that contains the actual graph nodes. Will fix that.

@mih (author) commented May 29, 2019

Third level is gone.
