NF: JSON-LD based metadata #682
Conversation
    Metadata type label or `None` if no type setting is found and optional
    auto-detection yielded no results
    """
    cfg = GitConfigParser(opj(ds.path, '.datalad', 'config'),
Even though we had #437 about harmonization of config files to possibly follow the git format, I am not yet sure if that should be the way to go... the discussion will continue there...
Away from laptop atm, but the OSX failures suggest that this branch is a bit behind master. Could you please merge master or rebase?
@@ -177,7 +177,7 @@ def get_metadata(ds, guess_type=False, ignore_subdatasets=False,
     for subds_path in ds.get_subdatasets(recursive=False):
         subds_meta_fname = opj(meta_path, subds_path, metadata_filename)
         if exists(subds_meta_fname):
-            subds_meta = json.load(open(subds_meta_fname, 'rb'))
+            subds_meta = json.loads(open(subds_meta_fname, 'rb').read().decode('utf-8'))
fwiw I did just use

    with open(filename) as f:
        j = json.load(f)

just fine in code elsewhere, but I probably didn't have any encoded content.
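For reference, a minimal sketch of why the explicit decode in this diff can matter (file name and content here are invented): reading raw bytes and decoding as UTF-8 behaves the same regardless of locale, whereas a plain text-mode `open()` may pick up a surprising default encoding on some systems.

```python
import json
import os
import tempfile

# Create a JSON file with non-ASCII content (hypothetical example data)
path = os.path.join(tempfile.mkdtemp(), 'meta.json')
with open(path, 'wb') as f:
    f.write(json.dumps({'name': 'Grün'}, ensure_ascii=False).encode('utf-8'))

# Reading the raw bytes and decoding explicitly, as in the change above,
# does not depend on the platform's default text encoding
with open(path, 'rb') as f:
    meta = json.loads(f.read().decode('utf-8'))

print(meta['name'])
```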
Travis failure is due to #699
Codecov Report
@@ Coverage Diff @@
## master #682 +/- ##
==========================================
- Coverage 88.85% 86.81% -2.04%
==========================================
Files 200 212 +12
Lines 18199 18774 +575
==========================================
+ Hits 16170 16299 +129
- Misses 2029 2475 +446
Continue to review full report at Codecov.
Only the OSX failure is left -- reason unclear to me.
FWIW, on OSX the tests failed because somewhere 'realpath' is applied, dereferencing the tempdir symlink; note the …
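A minimal illustration of the underlying behavior (the directory layout is invented, mimicking macOS where /tmp is itself a symlink into /private): `os.path.realpath` dereferences symlinks, so a path obtained through a symlinked tempdir will not compare equal to its resolved form.

```python
import os
import tempfile

# Build a directory and a symlink pointing at it (hypothetical layout)
base = tempfile.mkdtemp()
target = os.path.join(base, 'real')
os.mkdir(target)
link = os.path.join(base, 'link')
os.symlink(target, link)

# abspath keeps the symlinked component; realpath dereferences it
print(os.path.abspath(link))   # path still ends in .../link
print(os.path.realpath(link))  # path ends in .../real
```

This is why a string comparison of a tempdir path against `git rev-parse --show-toplevel` output (which resolves symlinks) can fail on OSX while passing on Linux.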
@@ -129,7 +129,7 @@ def _get_default_file_candidates(self):
         cfg_file_candidates.append(opj(home_cfg_base_path, 'datalad.cfg'))

         # current dir config
         cfg_file_candidates.append(opj('.datalad', 'config'))
         cfg_file_candidates.append('datalad.cfg')
Why not keep all datalad-specific stuff under .datalad? Its content is under git control. It would also be "in line" with .git/config (just that .git/ content is not under git), especially if we finally migrate to a git-based config format.
None of the other files is under Git. Why should this one be?
Files which shouldn't go under git control should be under .git/, so they do not pollute git status and would not require a custom .gitignore etc. So far we also have a number of config settings which should be stored under git:
- how we crawl (for now in a custom .datalad/crawl/crawl.cfg)
- how metadata gets aggregated (somewhere under .datalad/?)

And IMHO it is not worth placing any datalad files under the root dir of a dataset.
FWIW -- I will look into that symlink issue since it is also the case on my local system, unfortunately: GitRepo.get_toppath resolves symlinks in the path (well -- "git", "rev-parse", "--show-toplevel" does it). Sending a PR with a tentative fix (and other tune-ups from going through the code).
# flatten list
item for versionlist in
# extract uuids of all listed repos
[[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
Why is it not `r['@id']` instead of `implicit_meta[i]['@id']`?
r is only the key, not the value, AFAIK
Ah, so it is checking if key is @id or @identity or @idiocracy?
On August 12, 2016 3:07:55 PM EDT, Michael Hanke notifications@github.com wrote:
- meta.append(get_implicit_metadata(ds, ds_identifier))
- implicit_meta = get_implicit_metadata(ds, ds_identifier)
create a lookup dict to find parts by subdataset mountpoint
- has_part = implicit_meta.get('dcterms:hasPart', [])
- if not isinstance(has_part, list):
has_part = [has_part]
- has_part = {hp['location']: hp for hp in has_part}
figure out all other version of this dataset: origin or
siblings
build a flat list of UUIDs
- candidates = ('dcterms:isVersionOf', 'dcterms:hasVersion')
- ds_versions = [
# flatten list
item for versionlist in
# extract uuids of all listed repos
[[implicit_meta[i]['@id'] for r in implicit_meta[i] if '@id' in r]
r is only the key, not the value, AFAIK
Hmm, likely you are seeing this more clearly than I do. Just want to point out that this flattens a nested list. I am in an intense conversation with an amazing bottle of Gin Tonic. Will stop converting it into code for now ;-)
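To unpack the exchange above with a toy example (all data here is made up): iterating over a dict yields its keys, not its values, which is why `r` alone cannot be indexed with `'@id'`; and the double comprehension is the standard idiom for flattening one level of list nesting.

```python
# Hypothetical stand-in for one implicit_meta entry:
# iterating a dict yields keys only
record = {'@id': 'uuid-1', 'name': 'demo'}
keys = [r for r in record]   # ['@id', 'name'] -- keys, not values

# Flattening a nested list, the pattern used in the diff above
nested = [['a', 'b'], ['c']]
flat = [item for versionlist in nested for item in versionlist]
print(flat)  # ['a', 'b', 'c']
```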
@yarikoptic @bpoldrack Now it is slowly getting more interesting: some extraction, some aggregation, some query is implemented. I can't say that I believe this is all here to stay in this exact form, but it feels like it is something already. Will likely approach an actual query command next.
FTR: the "yaml" dependency introduced by @yarikoptic does not exist -- it seems it should be PyYAML.
Plus quite a few fixes.
TODO: cache flattened graph for repeated access; test for …
FTR: Almost the entire time spent flattening the metadata graph goes into compacting the expanded graph. To me it seems best to cache the output.
Now with caching all over the place: this is all tested on the full set of datalad datasets. The initial 6s runtime comes to a large percentage from expanding, compacting and flattening the metadata graph. The resulting graph is then cached, and subsequent searches are <500ms total runtime (incl. datalad startup). IMHO: Good enough (TM) -- for now.
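The caching idea described above can be sketched roughly like this (function names, the cache location, and the "expensive" step are all hypothetical; the placeholder stands in for pyld's expand/compact/flatten pipeline): key the cache on a digest of the canonicalized input graph, so an unchanged graph is never re-flattened.

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # hypothetical cache location

def expensive_flatten(graph):
    # Placeholder for the costly pyld expand/compact/flatten calls
    return {'flattened': sorted(graph)}

def cached_flatten(graph):
    # Key the cache on a digest of the canonicalized input
    digest = hashlib.sha1(
        json.dumps(graph, sort_keys=True).encode('utf-8')).hexdigest()
    cache_file = os.path.join(CACHE_DIR, digest + '.json')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return json.loads(f.read().decode('utf-8'))
    result = expensive_flatten(graph)
    with open(cache_file, 'wb') as f:
        f.write(json.dumps(result).encode('utf-8'))
    return result

print(cached_flatten(['b', 'a']))  # computed, then written to the cache
print(cached_flatten(['b', 'a']))  # served from the cache
```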
Python 3 support isn't quite there yet. Filed issue #756 so as not to forget about it. But I have to detach myself from this now and will merge as soon as the tests can be made to pass.
FTR: Missing coverage is primarily the untested result renderer of search-datasets.
First, a picture of the current state:

This is the metadata of a dataset with one subdataset, which also has a clone somewhere else (see datalad/metadata/tests/test_base.py:test_basic_metadata). The `name` properties come from a BIDS structure (presence of BIDS metadata was auto-detected); the rest comes from Git/git-annex itself.

The goal here is to get a list (one item per dataset) of dicts with a more-or-less plain key-value mapping (no nested structures), `@id` being the only and necessary exception to get valid JSON-LD. Such a metadata structure should be easily convertible into a queryable form (SQL, NoSQL, etc).

Confession: the above is not quite true, as this is a compacted and flattened graph, for which a call to pyld's `jsonld.flatten()` is not yet pushed. This is a rather expensive call (it needs the network (or a local cache) for term resolution), but it allows us to condense pretty much arbitrary metadata sets into a minimal form. It only needs to be done when metadata is aggregated and cached, not for query. pyld is in Debian and available through pip.

In contrast to what I said before, I now think we should store metadata as JSON-LD and not as triples. JSON-LD can get us triples if we need them. But JSON-LD is plain JSON, hence we can limit metadata read requirements to just `json` and do not need to worry about JSON-LD.

If you want to play with the formats, look here: http://tinyurl.com/jb62uhu

The top shows the actual metadata structure generated by the current code. Various other flavors are available at the bottom. For expansion to triples/quads the UUIDs would get prefixed with something like `http://db.datalad.org/ds/`, where we could provide information on the datasets that we know. The fact that the UUIDs are clone-specific should not hurt. The current PR already stores the origin UUID for newly created datasets, and the graph info would allow us to trace clones.

More info: http://json-ld.org
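To illustrate the "plain JSON" point (the record content below is invented, not taken from the PR): a compacted, flattened JSON-LD record in the described shape is just a flat dict with an `@id`, so a consumer needs nothing beyond the stdlib `json` module for simple key-value queries.

```python
import json

# A made-up record in the flat key-value shape described above
record_text = '''
{
  "@id": "8e2f3c4a-0000-0000-0000-000000000000",
  "name": "example-dataset",
  "dcterms:hasVersion": "uuid-of-a-hypothetical-sibling"
}
'''

record = json.loads(record_text)
# No JSON-LD processor required to read it back
print(record['@id'])
print(record['name'])
```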
TODO: