
NF: JSON-LD based metadata #682

Merged: 84 commits, Sep 5, 2016

Conversation

@mih (Member) commented Aug 2, 2016

First, a picture of the current state:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "2146817a-d7ab-46e5-9133-4a725306dbc5",
      "type": "Dataset",
      "dc:hasPart": {
        "id": "5a798601-8d66-42a9-ada9-4dceab591cbb"
      },
      "dc:hasVersion": {
        "id": "974d1090-32e4-4d0c-86aa-b345c5c2ba41"
      },
      "name": "the mother"
    },
    {
      "id": "5a798601-8d66-42a9-ada9-4dceab591cbb",
      "type": "Dataset",
      "location": "sub",
      "name": "child"
    }
  ]
}

This is the metadata of a dataset with one subdataset, which also has a clone somewhere else (see datalad/metadata/tests/test_base.py:test_basic_metadata). The name properties come from a BIDS structure (the presence of BIDS metadata was auto-detected); the rest comes from Git/git-annex itself.

The goal here is to get a list (one item per dataset) of dicts with a more-or-less plain key-value mapping (no nested structures) -- @id being the only and necessary exception to get valid JSON-LD. Such a metadata structure should be easily convertible into a queryable form (SQL, NoSQL, etc.).

Confession:

The above is not quite true, as this is a compacted and flattened graph, for which a call to pyld's jsonld.flatten() is not yet pushed. This is a rather expensive call (it needs the network (or a local cache) for term resolution), but it allows us to condense pretty much arbitrary metadata sets into a minimal form. It only needs to be done when metadata is aggregated and cached, not for each query. pyld is in Debian and available through pip.
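For reference, a minimal sketch of that flattening step with pyld (data abbreviated from the example above; resolving the schema.org context hits the network unless a caching document loader is configured):

from pyld import jsonld

# abbreviated version of the metadata shown above
meta = {
    "@context": "http://schema.org/",
    "@graph": [
        {"id": "2146817a-d7ab-46e5-9133-4a725306dbc5",
         "type": "Dataset",
         "name": "the mother"},
        {"id": "5a798601-8d66-42a9-ada9-4dceab591cbb",
         "type": "Dataset",
         "location": "sub",
         "name": "child"},
    ],
}

# the expensive call: expand, flatten, then compact against the context
flat = jsonld.flatten(meta, "http://schema.org/")

# the result is one plain dict per node -- easy to push into SQL/NoSQL
for node in flat.get("@graph", []):
    print(node)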

In contrast to what I said before, I now think we should store metadata as JSON-LD and not as triples. JSON-LD can get us triples if we need them. But JSON-LD is plain JSON, hence we can limit the requirements for reading metadata to just json, and consumers do not need to worry about JSON-LD at all.
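Under that assumption, consuming the stored metadata needs nothing beyond the stdlib, while triples stay one pyld call away (a sketch; 'meta.json' is a hypothetical file name):

import json

from pyld import jsonld

# plain JSON is enough to read the metadata
with open('meta.json') as f:  # hypothetical path to cached metadata
    meta = json.load(f)

# tools that do want triples can derive them on demand
# (context resolution again requires network or a local cache)
print(jsonld.to_rdf(meta, options={'format': 'application/n-quads'}))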

If you want to play with the formats, look here: http://tinyurl.com/jb62uhu
The top shows the actual metadata structure generated by the current code; various other flavors are available at the bottom. For expansion to triples/quads the UUIDs would get prefixed with something like http://db.datalad.org/ds/, where we could provide information on the datasets that we know about. The fact that the UUIDs are clone-specific should not hurt: the current PR already stores the origin UUID for newly created datasets, and the graph info would allow us to trace clones.

More info: http://json-ld.org

TODO:

  • cache metadata
  • retrieve cached metadata for query, instead of reparse

@mih added the label "conference agenda item" (Scheduled to be discussed in a developer meeting) Aug 2, 2016
@mih mih mentioned this pull request Aug 2, 2016
Metadata type label or `None` if no type setting is found and an optional
auto-detection yielded no results
"""
cfg = GitConfigParser(opj(ds.path, '.datalad', 'config'),
Member:

Even though we had #437 about harmonizing config files to possibly follow the git format, I am not yet sure that this is the way to go... discussion will continue there.

@mih mih changed the title WiP: Baby steps towards metadata support WiP: JSON-LD based metadata Aug 7, 2016
@mih mih force-pushed the nf-metaparser branch 2 times, most recently from e220b25 to 82add2e Compare August 8, 2016 18:46
@yarikoptic (Member):

Away from my laptop ATM, but the OSX failures suggest that this branch is a bit behind master; could you please merge master or rebase?

@@ -177,7 +177,7 @@ def get_metadata(ds, guess_type=False, ignore_subdatasets=False,
     for subds_path in ds.get_subdatasets(recursive=False):
         subds_meta_fname = opj(meta_path, subds_path, metadata_filename)
         if exists(subds_meta_fname):
-            subds_meta = json.load(open(subds_meta_fname, 'rb'))
+            subds_meta = json.loads(open(subds_meta_fname, 'rb').read().decode('utf-8'))
Member:

FWIW I did just use

with open(filename) as f:
    j = json.load(f)

just fine in code elsewhere, but I probably didn't have any encoded content.
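For loading potentially non-ASCII JSON the same way on Python 2 and 3, something like this sketch (not the code in this PR) avoids juggling bytes:

import io
import json

def load_json(fname):
    # text mode with an explicit encoding behaves identically on py2/py3
    with io.open(fname, encoding='utf-8') as f:
        return json.load(f)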

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from 2bfda5a to b4d579c Compare August 11, 2016 05:01
@mih (Member, Author) commented Aug 11, 2016

Travis failure is due to #699

@coveralls commented Aug 11, 2016

Coverage Status

Coverage increased (+0.06%) to 86.358% when pulling 3fa3095 on hanke:nf-metaparser into 8f7b5a4 on datalad:master.

@codecov-io commented Aug 11, 2016

Codecov Report

Merging #682 into master will decrease coverage by 2.03%.
The diff coverage is 90.18%.


@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   88.85%   86.81%   -2.04%     
==========================================
  Files         200      212      +12     
  Lines       18199    18774     +575     
==========================================
+ Hits        16170    16299     +129     
- Misses       2029     2475     +446
Impacted Files Coverage Δ
datalad/tests/test_cmdline_main.py 97.22% <ø> (ø) ⬆️
datalad/cmdline/main.py 84.61% <ø> (ø) ⬆️
datalad/support/vcr_.py 59.57% <0%> (-7.82%) ⬇️
datalad/interface/__init__.py 100% <100%> (ø) ⬆️
datalad/metadata/parsers/__init__.py 100% <100%> (ø)
...ata/parsers/tests/test_frictionless_datapackage.py 100% <100%> (ø)
datalad/support/gitrepo.py 88.35% <100%> (ø) ⬆️
datalad/metadata/parsers/tests/__init__.py 100% <100%> (ø)
datalad/metadata/parsers/tests/test_bids.py 100% <100%> (ø)
datalad/interface/add_archive_content.py 90.27% <100%> (+0.05%) ⬆️
... and 42 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4673ff...590d7a7.

@mih (Member, Author) commented Aug 11, 2016

Only the OSX failure is left -- the reason is unclear to me.

@yarikoptic (Member) commented Aug 11, 2016

FWIW, on OSX the tests failed because somewhere realpath() is applied, dereferencing the tempdir symlink, so:

ValueError: path /var/folders/90/vkz4dwlx0ss72djd8hywgppc0000gp/T/datalad_temp_tree_test_basic_metadatadOTrcu/origin/sub outside dataset <Dataset path=/private/var/folders/90/vkz4dwlx0ss72djd8hywgppc0000gp/T/datalad_temp_tree_test_basic_metadatadOTrcu/origin>

note the /private prefix.
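A minimal illustration of the failure mode (runnable anywhere, though the mismatch only materializes where the tempdir sits behind a symlink, as /var -> /private/var does on OSX):

import os.path as op
import tempfile

tmp = tempfile.mkdtemp()     # e.g. /var/folders/.../T/xyz on OSX
resolved = op.realpath(tmp)  # e.g. /private/var/folders/.../T/xyz there

# a naive containment check fails on OSX because of the /private prefix
print(resolved.startswith(tmp))

# resolving both sides before comparing keeps the check robust
print(op.realpath(resolved).startswith(op.realpath(tmp)))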

@@ -129,7 +129,7 @@ def _get_default_file_candidates(self):
         cfg_file_candidates.append(opj(home_cfg_base_path, 'datalad.cfg'))

         # current dir config
-        cfg_file_candidates.append(opj('.datalad', 'config'))
+        cfg_file_candidates.append('datalad.cfg')
@yarikoptic (Member) Aug 11, 2016:

Why not keep all datalad-specific stuff under .datalad? Its content is under git control.

It would also be "in line" with .git/config (just that .git/'s content is not under git), especially if we finally migrate to a git-based config format.

Member (Author):

None of the other files is under Git. Why should this one be?

Member:

  1. Files which shouldn't go under git control should live under .git/, so they do not pollute git status and do not require a custom .gitignore etc.
  2. So far we also have a number of config settings which should be stored under git:
    • how we crawl (for now in a custom .datalad/crawl/crawl.cfg)
    • how metadata gets aggregated (somewhere under .datalad/?)

And IMHO it is not worth placing any datalad files under the root dir of a dataset.

@yarikoptic (Member) commented Aug 11, 2016

FWIW -- I will look into that symlink issue, since it is also the case on my local system.

Unfortunately, GitRepo.get_toppath resolves symlinks in the path (well -- `git rev-parse --show-toplevel` does it). Sending a PR with a tentative fix (and other tune-ups from going through the code): mih#3
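The resolution is easy to observe directly (a demonstration, not datalad code; run it inside any repository reached through a symlinked path):

import subprocess

# on OSX, inside a repo under /var/..., this prints /private/var/...
top = subprocess.check_output(
    ['git', 'rev-parse', '--show-toplevel']).decode().strip()
print(top)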

# flatten list
item for versionlist in
# extract uuids of all listed repos
[[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
Member:

Why is it not r['@id'] instead of implicit_meta[i]['@id']?

Member (Author):

r is only the key, not the value, AFAIK

Member:

Ah, so it is checking if key is @id or @identity or @idiocracy?

On August 12, 2016 3:07:55 PM EDT, Michael Hanke notifications@github.com wrote:

> -        meta.append(get_implicit_metadata(ds, ds_identifier))
> +        implicit_meta = get_implicit_metadata(ds, ds_identifier)
> +        # create a lookup dict to find parts by subdataset mountpoint
> +        has_part = implicit_meta.get('dcterms:hasPart', [])
> +        if not isinstance(has_part, list):
> +            has_part = [has_part]
> +        has_part = {hp['location']: hp for hp in has_part}
> +        # figure out all other versions of this dataset: origin or siblings
> +        # build a flat list of UUIDs
> +        candidates = ('dcterms:isVersionOf', 'dcterms:hasVersion')
> +        ds_versions = [
> +            # flatten list
> +            item for versionlist in
> +            # extract uuids of all listed repos
> +            [[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
>
> r is only the key, not the value, AFAIK


Member (Author):

Hmm, likely you are seeing this more clearly than I do. Just want to point out that this flattens a nested list. I am in an intense conversation with an amazing bottle of Gin Tonic. Will stop converting it into code for now ;-)
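For the record, the two Python behaviors this thread hinges on, in a generic sketch (data made up):

# iterating over a dict yields its keys, so `'@id' in r` tests key names
d = {'@id': 'some-uuid', 'name': 'child'}
print([r for r in d])  # the keys, not the values

# and the nested-comprehension idiom flattens a list of lists
nested = [['a', 'b'], ['c']]
print([item for versionlist in nested for item in versionlist])
# -> ['a', 'b', 'c']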

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from 3f55338 to 25aa1ec Compare August 15, 2016 09:08
@mih (Member, Author) commented Aug 15, 2016

@yarikoptic @bpoldrack Now it is slowly getting more interesting: some extraction, some aggregation, and some querying are implemented. I can't say that I believe this is all here to stay in this exact form, but it feels like something already.

Will likely approach an actual query command next.

@mih (Member, Author) commented Sep 5, 2016

FTR: "yaml" dependency introduced by @yarikoptic does not exist -- seems PyYAML.

@coveralls

Coverage Status

Coverage decreased (-2.3%) to 86.511% when pulling 596ad25 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

datasets.datalad.org (git)-[master] % datalad search-datasets '.*Hanke.*' --report location
Match: openfmri/ds000113c
Match: openfmri/ds000113d
datalad search-datasets '.*Hanke.*' --report location  6,16s user 0,12s system 96% cpu 6,499 total

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.2%) to 86.667% when pulling 5f670c2 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

TODO: cache flattened graph for repeated access; test for search_dataset

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 86.462% when pulling 00a2340 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

FTR: Almost the entire time spent flattening the metadata graph goes into compacting the expanded graph.

To me it seems best to cache the output.
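Something along these lines could work (a sketch with hypothetical names, not this PR's implementation): key the cache on the aggregated metadata, and pay for the expand/compact/flatten cycle only when it changed.

import hashlib
import json
import os

from pyld import jsonld

def flatten_cached(meta, context, cache_dir):
    # hypothetical helper: content-addressed cache of flattened graphs
    key = hashlib.md5(
        json.dumps(meta, sort_keys=True).encode('utf-8')).hexdigest()
    cache_file = os.path.join(cache_dir, key + '.json')
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    flat = jsonld.flatten(meta, context)  # the expensive call
    with open(cache_file, 'w') as f:
        json.dump(flat, f)
    return flat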

@coveralls

Coverage Status

Coverage decreased (-2.4%) to 86.459% when pulling 62b4f7c on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

Now with caching all over the place:

datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  6,14s user 0,06s system 99% cpu 6,233 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,46s user 0,05s system 97% cpu 0,524 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,42s user 0,07s system 97% cpu 0,505 total
mih@meiner ...datalad/DONTLOOK/datasets.datalad.org (git)-[master] % time datalad search-datasets '.*Hanke.*' --report name
Match: studyforrest_multires7t
Match: studyforrest_phase2
datalad search-datasets '.*Hanke.*' --report name  0,44s user 0,06s system 96% cpu 0,512 total

This is all tested on the full set of datalad datasets. The initial 6s runtime comes to a large extent from expanding, compacting, and flattening the metadata graph. The resulting graph is then cached, and subsequent searches take <500ms total runtime (incl. datalad startup).

IMHO: Good enough (TM) -- for now.

@mih mih force-pushed the nf-metaparser branch 2 times, most recently from cd282d5 to e06b22c Compare September 5, 2016 14:27
@mih mih force-pushed the nf-metaparser branch 2 times, most recently from f817e87 to 3596a4b Compare September 5, 2016 16:03
@mih (Member, Author) commented Sep 5, 2016

Python 3 support isn't quite there yet; I filed issue #756 so we don't forget about it. But I have to detach myself from this now, and will merge as soon as the tests can be made to pass.

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.06%) to 86.794% when pulling 3596a4b on mih:nf-metaparser into e4673ff on datalad:master.

@coveralls commented Sep 5, 2016

Coverage Status

Coverage decreased (-2.03%) to 86.817% when pulling 590d7a7 on mih:nf-metaparser into e4673ff on datalad:master.

@mih (Member, Author) commented Sep 5, 2016

FTR: The missing coverage is primarily from the untested result renderer of search-datasets.

@mih mih merged commit 5ff2165 into datalad:master Sep 5, 2016
@mih mih deleted the nf-metaparser branch June 24, 2017 10:34