
RF: Provide an "interface" to perform ID mapping to an external DB #2

Merged (1 commit) on May 29, 2019

Conversation

@mih commented May 23, 2019

With this change in place, DataLad can replace the mapping function with something like this:

def _map_ids(parentobj, fileobj):
    id_ = idmap.get(fileobj.path, fileobj.id)
    fileobj.id = id_
    parentobj.id = id_

where idmap is a DataLad-provided dictionary that maps absolute file paths to DataLad's file IDs. The IDs of both the file and the parent object (e.g. a Contrast) are adjusted. This happens on every self.file = ... set-call, which typically occurs in the constructor of a NIDMObject.
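The hook this relies on can be sketched in isolation. This is a minimal illustration, assuming a property setter on NIDMObject; the real nidmresults class is more involved, and the FileObj class and attribute names here are stand-ins:

```python
class FileObj:
    """Stand-in for a nidmresults file object (attribute names are assumptions)."""
    def __init__(self, path, id_):
        self.path = path
        self.id = id_


class NIDMObject:
    """Minimal sketch: every `self.file = ...` set-call runs the mapping hook."""

    def _map_fileid(self, fileobj):
        # default hook: leave IDs untouched; DataLad patches this method
        pass

    @property
    def file(self):
        return self._file

    @file.setter
    def file(self, fileobj):
        self._file = fileobj
        self._map_fileid(fileobj)


# DataLad-provided mapping: absolute file path -> DataLad file ID
idmap = {'/abs/path/example.png':
         'MD5E-s612--5fa06e4a914134058849a67e671e0057.png'}


def _map_ids(parentobj, fileobj):
    id_ = idmap.get(fileobj.path, fileobj.id)
    fileobj.id = id_
    parentobj.id = id_


# roughly what mock.patch does for the duration of the export
NIDMObject._map_fileid = _map_ids

obj = NIDMObject()
obj.file = FileObj('/abs/path/example.png', 'niiri:some-uuid')
```

Note that because mock.patch replaces the class attribute with a plain function, the bound-method call self._map_fileid(fileobj) passes the instance as parentobj, which matches the _map_ids signature.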

Technically this is all nice, but it doesn't work, because DataLad's IDs are rejected in site-packages/prov/model.py line 1608, by self.valid_qualified_name(identifier), where identifier would look like MD5E-s612--5fa06e4a914134058849a67e671e0057.png, for example. valid_qualified_name() returns None for this. I do not yet understand why, or what could be done about it. Feedback or suggestions on this would be great!

FTR, here is the full code of DataLad calling nidmresults-fsl:

    def _map_ids(parentobj, fileobj):
        id_ = idmap.get(fileobj.path, fileobj.id)
        fileobj.id = id_
        parentobj.id = id_

    # make_tempfile, text_type, and jsonload are DataLad utility helpers;
    # Path is pathlib.Path
    from mock import patch
    from nidmfsl.fsl_exporter.fsl_exporter import FSLtoNIDMExporter
    with make_tempfile(mkdir=True) as tmpdir, \
            patch(
                'nidmresults.objects.generic.NIDMObject._map_fileid',
                _map_ids):
        exporter = FSLtoNIDMExporter(
            out_dirname=tmpdir,
            zipped=False,
            feat_dir=text_type(feat_dir),
            # this is all fake, we cannot know it, but NIDM FSL wants it
            # TODO try fishing it out from the result again
            groups=[['control', 1]])
        exporter.parse()
        outdir = exporter.export()
        md = jsonload(text_type(Path(outdir) / 'nidm.json'))

@cmaumet (owner) commented May 24, 2019

Thanks @mih! This looks great!

About the validity of the DataLad IDs: I think that, to be valid as identifiers in an RDF graph (such as a JSON-LD doc), they need to be IRIs. If I am correct, all we should need to do is add a prefix to the DataLad IDs, but we need to think about what would make sense as a prefix in this case. For NIDM objects we usually use the NIIRI prefix, which maps to http://iri.nidash.org/.

@mih (author) commented May 24, 2019

Ah, ok. That makes sense. For DataLad metadata we have @base set to http://dx.datalad.org in the context, so that any non-IRI identifiers resolve against it. Can this check be turned off?

@cmaumet (owner) commented May 24, 2019

We can't turn this off because it's deep down within the libraries we are using. But if your @base is set to 'http://dx.datalad.org' then the IDs could be updated to add it as a prefix, e.g. http://dx.datalad.org:MD5E-s612--5fa06e4a914134058849a67e671e0057.png (which would be consistent with your DataLad IDs, while just MD5E-s612--5fa06e4a914134058849a67e671e0057.png would not, because we have not set the same @base in the NIDM document).

@mih (author) commented May 27, 2019

After a bit of deliberation, I think we shouldn't set the @base in DataLad either. Instead, we should declare a dl: prefix pointing to http://dx.datalad.org. This will prevent any accidental misidentification of IDs.

@mih (author) commented May 27, 2019

I now found that valid_qualified_name(identifier) will even reject an identifier like

'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'

It will only accept URIs with a registered namespace. Will try to adjust the patch to do that.
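For illustration, the observed behaviour can be approximated in plain Python. This is a toy sketch of what the check appears to do, not prov's actual implementation: an identifier is accepted only if it uses a registered prefix or falls under a registered namespace URI.

```python
class Bundle:
    """Toy approximation of the namespace check in prov's ProvBundle
    (illustrative only, not prov's actual code)."""

    def __init__(self):
        self._namespaces = {}  # prefix -> namespace URI

    def add_namespace(self, prefix, uri):
        self._namespaces[prefix] = uri

    def valid_qualified_name(self, identifier):
        # accept 'prefix:localpart' when the prefix is registered ...
        prefix, sep, local = identifier.partition(':')
        if sep and prefix in self._namespaces:
            return (prefix, local)
        # ... or a full URI that falls under a registered namespace URI
        for prefix, uri in self._namespaces.items():
            if identifier.startswith(uri):
                return (prefix, identifier[len(uri):])
        return None  # rejected, as observed with the bare DataLad IDs


bundle = Bundle()
datalad_iri = 'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'
rejected = bundle.valid_qualified_name(datalad_iri)  # no namespace registered yet
bundle.add_namespace('dl', 'http://dx.datalad.org/')
accepted = bundle.valid_qualified_name(datalad_iri)  # now resolves under 'dl'
```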

@mih (author) commented May 27, 2019

I cannot get the ProvBundle that is populated with these graph nodes to register a new prefix. I must be missing something. A few levels up the stack, Exporter.doc shows a properly registered dl namespace for the DataLad IDs.

@cmaumet (owner) commented May 27, 2019

I now found that valid_qualified_name(identifier) will even reject an identifier like
'http://dx.datalad.org/MD5E-s612--5fa06e4a914134058849a67e671e0057.png'
It will only accept URIs with a registered namespace. Will try to adjust the patch to do that.

I did not realize that we had to register all namespaces for the IRIs to be valid. Could it be a typing issue instead? I had a lot of those in the past...

@mih (author) commented May 27, 2019

FTR:

Here is the syntax to amend the NIDM context record with a prefix for datalad IDs:

      "@context": [
        "http://purl.org/nidash/context",
        {
          "datalad": "http://dx.datalad.org/"
        }
      ],
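With that context in place, an ID using the datalad prefix expands to a full IRI against the context's prefix map. A minimal hand-rolled sketch of that expansion (a real consumer would use a JSON-LD processor such as pyld):

```python
# amended context: the datalad prefix added alongside the NIDM context
context = {"datalad": "http://dx.datalad.org/"}


def expand(compact_iri, ctx):
    """Expand 'prefix:local' into a full IRI using the context's prefix map."""
    prefix, sep, local = compact_iri.partition(':')
    if sep and prefix in ctx:
        return ctx[prefix] + local
    return compact_iri  # not a known compact IRI; leave unchanged


expanded = expand(
    'datalad:MD5E-s612--5fa06e4a914134058849a67e671e0057.png', context)
```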

@mih (author) commented May 27, 2019

Here is the companion PR for DataLad: datalad/datalad-metalad#15

@mih (author) commented May 27, 2019

And here is a playground link showing a joint representation of metadata from two sources (NIDM results, and DataLad core): https://json-ld.org/playground/#startTab=tab-expanded&json-ld=http%3A%2F%2Fkumo.ovgu.de%2F~mih%2Fjsob.json

ping @jbpoline

@mih (author) commented May 27, 2019

@cmaumet from my perspective this is ready to go. Everything else can be done, or worked around, in DataLad.

Later we should figure out how to properly inject an ID namespace in the NIDM context. But for now search&replace is good enough.

@cmaumet (owner) commented May 29, 2019

@cmaumet from my perspective this is ready to go. Everything else can be done, or worked around, in DataLad.

Thanks for all the work @mih!

Later we should figure out how to properly inject an ID namespace in the NIDM context. But for now search&replace is good enough.

++1. I've created incf-nidash#56 as a reminder.

Let's merge :)

@cmaumet cmaumet merged commit 0507edd into cmaumet:nice_jsonld May 29, 2019
@satra commented May 29, 2019

@mih - the output you point to above looks weird - there seem to be three embedded graphs. is there a specific reason for that?

also see: https://www.w3.org/TR/json-ld11/#named-graphs

@mih (author) commented May 29, 2019

@satra "weird" is OK for me, "wrong" would be an issue. The origin of this structure is that there are three different metadata sources represented in the entire data structure. DataLad makes no attempt to compact/expand or otherwise modify them. This would take ages on any sizable dataset hierarchy. I am considering using the named-graph approach to add information on the origin of the information (a la describedBy), but for now that is not a priority.

@mih mih deleted the idmap branch May 29, 2019 12:31
@satra commented May 29, 2019

@mih - the part that looks weird (and may be unnecessarily complex) is that when i make this graph compact, i see:

  "@graph": [
    {
      "@graph": [
        {
          "@graph": [
            {

even if three sources are being added, at the end of the day it's a single graph (can be disjoint), but a graph nonetheless.

so you can have:

option 1:

[
  { "@context": ... },
  { "@context": ... },
  { "@context": ... }
]

or

option 2:

{
  "@context": [ ... ],
  "records": []
}

where you simply insert new records into the records array.
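Option 2 can be sketched as follows. The "records" key follows the snippet above; it is an illustration of the pattern, not an established NIDM term, and the record contents are hypothetical:

```python
doc = {
    "@context": [
        "http://purl.org/nidash/context",
        {"datalad": "http://dx.datalad.org/"},
    ],
    "records": [],
}


def add_records(document, new_records):
    """Merge metadata from another source by appending to the shared array."""
    document["records"].extend(new_records)
    return document


# two hypothetical sources contributing records to the same document
add_records(doc, [{"@id": "datalad:example-file", "@type": "Entity"}])
add_records(doc, [{"@id": "niiri:example-contrast", "@type": "Entity"}])
```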

@satra commented May 29, 2019

more generally though on this issue:

DataLad makes no attempt to compact/expand/ or otherwise modify them. This would take ages on any sizable dataset hierarchy.

a nice property of graphs (especially rdf graphs) is that they can be merged without special care. the queries and use cases may care, but from a database perspective they can simply be merged.

in this case it is really a question of augmenting a graph in two ways:

  1. connecting entities that were used/generated by an activity
  2. remapping entities to datalad IRIs

for 1. this is simply a matter of adding a new record:

{
  "@id": "id of activity",
  "used": [ iri1, iri2, ... ],
  "generated": [ iri3, iri4, ... ]
}

where used would be a @set type container.

for 2, it's a matter of adding isSameAs

{
  "@id": "niiri:...",
  "isSameAs": "datalad:..."
}

or

{
  "@id": "datalad:...",
  "isSameAs": "niiri:..."
}

@mih (author) commented May 29, 2019

These graphs can be merged, but only once we have expanded the identifiers. I do not want to do that, as this is the step that requires availability of all context information. But some folks use

"@context": "http://purl.org/nidash/context"

and then the machine is down twice a day ;-)

So what I do is the closest approximation that I can think of and only needs the literal content of the metadata: I merge any graphs that have exactly the same context specification.
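That merge step can be sketched like this (an illustration of the approach described, not DataLad's actual code): documents are grouped by the literal JSON serialization of their @context, and the node lists of matching graphs are concatenated.

```python
import json


def merge_graphs(docs):
    """Merge the @graph node lists of documents whose @context
    serializes to exactly the same JSON (no expansion required)."""
    merged = {}
    for doc in docs:
        key = json.dumps(doc.get('@context'), sort_keys=True)
        if key not in merged:
            merged[key] = {'@context': doc.get('@context'), '@graph': []}
        merged[key]['@graph'].extend(doc.get('@graph', []))
    return list(merged.values())


docs = [
    {'@context': 'http://purl.org/nidash/context', '@graph': [{'@id': 'a'}]},
    {'@context': 'http://purl.org/nidash/context', '@graph': [{'@id': 'b'}]},
    {'@context': {'datalad': 'http://dx.datalad.org/'}, '@graph': [{'@id': 'c'}]},
]
result = merge_graphs(docs)  # two documents remain: the contexts differ
```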

That explains two levels of nesting, but not the third. That seems to be a bug in my NIDMresults extractor draft. Line 10 in http://kumo.ovgu.de/~mih/jsob.json shows a one-item list that contains the actual graph nodes. Will fix that.

@mih (author) commented May 29, 2019

Third level is gone.
