
NF: JSON-LD based metadata #682

Merged: 84 commits, Sep 5, 2016

Conversation

@mih (Member) commented Aug 2, 2016

First, a picture of the current state:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "2146817a-d7ab-46e5-9133-4a725306dbc5",
      "type": "Dataset",
      "dc:hasPart": {
        "id": "5a798601-8d66-42a9-ada9-4dceab591cbb"
      },
      "dc:hasVersion": {
        "id": "974d1090-32e4-4d0c-86aa-b345c5c2ba41"
      },
      "name": "the mother"
    },
    {
      "id": "5a798601-8d66-42a9-ada9-4dceab591cbb",
      "type": "Dataset",
      "location": "sub",
      "name": "child"
    }
  ]
}

This is the metadata of a dataset with one subdataset, which also has a clone somewhere else (see datalad/metadata/tests/test_base.py:test_basic_metadata). The name properties come from a BIDS structure (the presence of BIDS metadata was auto-detected); the rest comes from Git/git-annex itself.

The goal here is to get a list (one item per dataset) of dicts with a more-or-less plain key-value mapping (no nested structures) -- @id being the only and necessary exception to get valid JSON-LD. Such a metadata structure should be easily convertible into a queryable form (SQL, NoSQL, etc.).

Confession:

The above is not quite true, as this is a compacted and flattened graph, for which a call to pyld's jsonld.flatten() is not yet pushed. This is a rather expensive call (it needs the network (or a local cache) for term resolution), but it allows us to condense pretty much arbitrary metadata sets into a minimal form. It only needs to be done when metadata is aggregated and cached, not for each query. pyld is in Debian and available through pip.
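For reference, a minimal sketch of that flattening step with pyld (data abbreviated from the example above; resolving the schema.org context hits the network unless a caching document loader is configured):

from pyld import jsonld

# abbreviated version of the metadata shown above
meta = {
    "@context": "http://schema.org/",
    "@graph": [
        {"id": "2146817a-d7ab-46e5-9133-4a725306dbc5",
         "type": "Dataset",
         "name": "the mother"},
        {"id": "5a798601-8d66-42a9-ada9-4dceab591cbb",
         "type": "Dataset",
         "location": "sub",
         "name": "child"},
    ],
}

# the expensive call: expand, flatten, then compact against the context
flat = jsonld.flatten(meta, "http://schema.org/")

# the result is one plain dict per node -- easy to push into SQL/NoSQL
for node in flat.get("@graph", []):
    print(node)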

In contrast to what I said before, I now think we should store metadata as JSON-LD and not as triples. JSON-LD can get us triples if we need them. But JSON-LD is plain JSON, hence we can limit the requirements for reading metadata to just json, and consumers do not need to worry about JSON-LD at all.
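Under that assumption, consuming the stored metadata needs nothing beyond the stdlib, while triples stay one pyld call away (a sketch; 'meta.json' is a hypothetical file name):

import json

from pyld import jsonld

# plain JSON is enough to read the metadata
with open('meta.json') as f:  # hypothetical path to cached metadata
    meta = json.load(f)

# tools that do want triples can derive them on demand
# (context resolution again requires network or a local cache)
print(jsonld.to_rdf(meta, options={'format': 'application/n-quads'}))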

If you want to play with the formats, look here: http://tinyurl.com/jb62uhu
The top shows the actual metadata structure generated by the current code; various other flavors are available at the bottom. For expansion to triples/quads the UUIDs would get prefixed with something like http://db.datalad.org/ds/, where we could provide information on the datasets that we know about. The fact that the UUIDs are clone-specific should not hurt: the current PR already stores the origin UUID for newly created datasets, and the graph info would allow us to trace clones.

More info: http://json-ld.org

TODO:

  • cache metadata
  • retrieve cached metadata for query, instead of reparse

@mih added the label "conference agenda item" (Scheduled to be discussed in a developer meeting) Aug 2, 2016
@mih mih mentioned this pull request Aug 2, 2016
Metadata type label or `None` if no type setting is found and an optional
auto-detection yielded no results
"""
cfg = GitConfigParser(opj(ds.path, '.datalad', 'config'),
Member:

Even though we had #437 about harmonizing config files to possibly follow the git format, I am not yet sure that this is the way to go... discussion will continue there.

@mih mih changed the title WiP: Baby steps towards metadata support WiP: JSON-LD based metadata Aug 7, 2016
@mih mih force-pushed the nf-metaparser branch 2 times, most recently from e220b25 to 82add2e Compare August 8, 2016 18:46
@yarikoptic (Member):

Away from my laptop ATM, but the OSX failures suggest that this branch is a bit behind master; could you please merge master or rebase?

@@ -177,7 +177,7 @@ def get_metadata(ds, guess_type=False, ignore_subdatasets=False,
     for subds_path in ds.get_subdatasets(recursive=False):
         subds_meta_fname = opj(meta_path, subds_path, metadata_filename)
         if exists(subds_meta_fname):
-            subds_meta = json.load(open(subds_meta_fname, 'rb'))
+            subds_meta = json.loads(open(subds_meta_fname, 'rb').read().decode('utf-8'))
Member:

FWIW I did just use

with open(filename) as f:
    j = json.load(f)

just fine in code elsewhere, but I probably didn't have any encoded content.
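For loading potentially non-ASCII JSON the same way on Python 2 and 3, something like this sketch (not the code in this PR) avoids juggling bytes:

import io
import json

def load_json(fname):
    # text mode with an explicit encoding behaves identically on py2/py3
    with io.open(fname, encoding='utf-8') as f:
        return json.load(f)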

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from 2bfda5a to b4d579c Compare August 11, 2016 05:01
@mih (Member, Author) commented Aug 11, 2016

Travis failure is due to #699

@coveralls commented Aug 11, 2016

Coverage Status

Coverage increased (+0.06%) to 86.358% when pulling 3fa3095 on hanke:nf-metaparser into 8f7b5a4 on datalad:master.

@codecov-io commented Aug 11, 2016

Codecov Report

Merging #682 into master will decrease coverage by 2.03%.
The diff coverage is 90.18%.


@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   88.85%   86.81%   -2.04%     
==========================================
  Files         200      212      +12     
  Lines       18199    18774     +575     
==========================================
+ Hits        16170    16299     +129     
- Misses       2029     2475     +446
Impacted Files Coverage Δ
datalad/tests/test_cmdline_main.py 97.22% <ø> (ø) ⬆️
datalad/cmdline/main.py 84.61% <ø> (ø) ⬆️
datalad/support/vcr_.py 59.57% <0%> (-7.82%) ⬇️
datalad/interface/__init__.py 100% <100%> (ø) ⬆️
datalad/metadata/parsers/__init__.py 100% <100%> (ø)
...ata/parsers/tests/test_frictionless_datapackage.py 100% <100%> (ø)
datalad/support/gitrepo.py 88.35% <100%> (ø) ⬆️
datalad/metadata/parsers/tests/__init__.py 100% <100%> (ø)
datalad/metadata/parsers/tests/test_bids.py 100% <100%> (ø)
datalad/interface/add_archive_content.py 90.27% <100%> (+0.05%) ⬆️
... and 42 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4673ff...590d7a7.

@mih (Member, Author) commented Aug 11, 2016

Only the OSX failure is left -- the reason is unclear to me.

@yarikoptic (Member) commented Aug 11, 2016

FWIW, on OSX the tests failed because somewhere realpath() is applied, dereferencing the tempdir symlink, so:

ValueError: path /var/folders/90/vkz4dwlx0ss72djd8hywgppc0000gp/T/datalad_temp_tree_test_basic_metadatadOTrcu/origin/sub outside dataset <Dataset path=/private/var/folders/90/vkz4dwlx0ss72djd8hywgppc0000gp/T/datalad_temp_tree_test_basic_metadatadOTrcu/origin>

note the /private prefix.
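A minimal illustration of the failure mode (runnable anywhere, though the mismatch only materializes where the tempdir sits behind a symlink, as /var -> /private/var does on OSX):

import os.path as op
import tempfile

tmp = tempfile.mkdtemp()     # e.g. /var/folders/.../T/xyz on OSX
resolved = op.realpath(tmp)  # e.g. /private/var/folders/.../T/xyz there

# a naive containment check fails on OSX because of the /private prefix
print(resolved.startswith(tmp))

# resolving both sides before comparing keeps the check robust
print(op.realpath(resolved).startswith(op.realpath(tmp)))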

@@ -129,7 +129,7 @@ def _get_default_file_candidates(self):
         cfg_file_candidates.append(opj(home_cfg_base_path, 'datalad.cfg'))

         # current dir config
-        cfg_file_candidates.append(opj('.datalad', 'config'))
+        cfg_file_candidates.append('datalad.cfg')
@yarikoptic (Member) Aug 11, 2016:

Why not keep all datalad-specific stuff under .datalad? Its content is under git control.

It would also be "in line" with .git/config (just that .git/'s content is not under git), especially if we finally migrate to a git-based config format.

Member (Author):

None of the other files is under Git. Why should this one be?

Member:

  1. Files which shouldn't go under git control should live under .git/, so they do not pollute git status and do not require a custom .gitignore etc.
  2. So far we also have a number of config settings which should be stored under git:
    • how we crawl (for now in a custom .datalad/crawl/crawl.cfg)
    • how metadata gets aggregated (somewhere under .datalad/?)

And IMHO it is not worth placing any datalad files under the root dir of a dataset.

@yarikoptic (Member) commented Aug 11, 2016

FWIW -- I will look into that symlink issue, since it is also the case on my local system.

Unfortunately, GitRepo.get_toppath resolves symlinks in the path (well -- `git rev-parse --show-toplevel` does it). Sending a PR with a tentative fix (and other tune-ups from going through the code): mih#3
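The resolution is easy to observe directly (a demonstration, not datalad code; run it inside any repository reached through a symlinked path):

import subprocess

# on OSX, inside a repo under /var/..., this prints /private/var/...
top = subprocess.check_output(
    ['git', 'rev-parse', '--show-toplevel']).decode().strip()
print(top)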

# flatten list
item for versionlist in
# extract uuids of all listed repos
[[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
Member:

Why is it not r['@id'] instead of implicit_meta[i]['@id']?

Member (Author):

r is only the key, not the value, AFAIK

Member:

Ah, so it is checking if key is @id or @identity or @idiocracy?

On August 12, 2016 3:07:55 PM EDT, Michael Hanke notifications@github.com wrote:

> -        meta.append(get_implicit_metadata(ds, ds_identifier))
> +        implicit_meta = get_implicit_metadata(ds, ds_identifier)
> +        # create a lookup dict to find parts by subdataset mountpoint
> +        has_part = implicit_meta.get('dcterms:hasPart', [])
> +        if not isinstance(has_part, list):
> +            has_part = [has_part]
> +        has_part = {hp['location']: hp for hp in has_part}
> +        # figure out all other versions of this dataset: origin or siblings
> +        # build a flat list of UUIDs
> +        candidates = ('dcterms:isVersionOf', 'dcterms:hasVersion')
> +        ds_versions = [
> +            # flatten list
> +            item for versionlist in
> +            # extract uuids of all listed repos
> +            [[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
>
> r is only the key, not the value, AFAIK


Member (Author):

Hmm, likely you are seeing this more clearly than I do. Just want to point out that this flattens a nested list. I am in an intense conversation with an amazing bottle of Gin Tonic. Will stop converting it into code for now ;-)
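For the record, the two Python behaviors this thread hinges on, in a generic sketch (data made up):

# iterating over a dict yields its keys, so `'@id' in r` tests key names
d = {'@id': 'some-uuid', 'name': 'child'}
print([r for r in d])  # the keys, not the values

# and the nested-comprehension idiom flattens a list of lists
nested = [['a', 'b'], ['c']]
print([item for versionlist in nested for item in versionlist])
# -> ['a', 'b', 'c']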

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from 3f55338 to 25aa1ec Compare August 15, 2016 09:08
@mih (Member, Author) commented Aug 15, 2016

@yarikoptic @bpoldrack Now it is slowly getting more interesting: some extraction, some aggregation, and some querying are implemented. I can't say that I believe this is all here to stay in this exact form, but it feels like something already.

Will likely approach an actual query command next.

@mih (Member, Author) commented Sep 5, 2016

FTR: "yaml" dependency introduced by @yarikoptic does not exist -- seems PyYAML.

@coveralls

Coverage Status

Coverage decreased (-2.3%) to 86.511% when pulling 596ad25 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

datasets.datalad.org (git)-[master] % datalad search-datasets '.*Hanke.*' --report location
Match: openfmri/ds000113c
Match: openfmri/ds000113d
datalad search-datasets '.*Hanke.*' --report location  6,16s user 0,12s system 96% cpu 6,499 total

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.2%) to 86.667% when pulling 5f670c2 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

TODO: cache flattened graph for repeated access; test for search_dataset

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 86.462% when pulling 00a2340 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

FTR: Almost the entire time spent flattening the metadata graph goes into compacting the expanded graph.

To me it seems best to cache the output.
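Something along these lines could work (a sketch with hypothetical names, not this PR's implementation): key the cache on the aggregated metadata, and pay for the expand/compact/flatten cycle only when it changed.

import hashlib
import json
import os

from pyld import jsonld

def flatten_cached(meta, context, cache_dir):
    # hypothetical helper: content-addressed cache of flattened graphs
    key = hashlib.md5(
        json.dumps(meta, sort_keys=True).encode('utf-8')).hexdigest()
    cache_file = os.path.join(cache_dir, key + '.json')
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    flat = jsonld.flatten(meta, context)  # the expensive call
    with open(cache_file, 'w') as f:
        json.dump(flat, f)
    return flat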

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 86.459% when pulling 62b4f7c on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

Now with caching all over the place:

datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  6,14s user 0,06s system 99% cpu 6,233 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,46s user 0,05s system 97% cpu 0,524 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,42s user 0,07s system 97% cpu 0,505 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,44s user 0,06s system 96% cpu 0,512 total

This is all tested on the full set of datalad datasets. The initial 6s runtime comes to a large extent from expanding, compacting, and flattening the metadata graph. The resulting graph is then cached, and subsequent searches take <500ms total runtime (incl. datalad startup).

IMHO: Good enough (TM) -- for now.

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from cd282d5 to e06b22c Compare September 5, 2016 14:27
@mih mih force-pushed the nf-metaparser branch 2 times, most recently from f817e87 to 3596a4b Compare September 5, 2016 16:03
@mih (Member, Author) commented Sep 5, 2016

Python 3 support isn't quite there yet; I filed issue #756 so we don't forget about it. But I have to detach myself from this now, and will merge as soon as the tests can be made to pass.

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.06%) to 86.794% when pulling 3596a4b on mih:nf-metaparser into e4673ff on datalad:master.

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.03%) to 86.817% when pulling 590d7a7 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

FTR: The missing coverage is primarily from the untested result renderer of search-datasets.

@mih mih merged commit 5ff2165 into datalad:master Sep 5, 2016
@mih mih deleted the nf-metaparser branch June 24, 2017 10:34