
BF: metadata aggregation - check also the "content" of metadata files #3007

Merged

yarikoptic merged 21 commits into datalad:master from bf-metadata-check on Nov 27, 2018

Conversation

@yarikoptic (Member) commented Nov 22, 2018

This pull request hopefully addresses #2841.

It proposes to not rely solely on the "state" of the repo encoded in the ID of the metadata file, but to also verify that the actual content is the same across datasets.

_the_same_across_datasets might be a bit over-engineered, but I made it work for >= 2 datasets since the logic should be generic.
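
A minimal sketch of that idea (illustrative only -- the helper name, signature, and use of MD5 here are assumptions, not the actual implementation): given one relative path and several dataset roots, consider the file "the same" only if its content digest is identical in every dataset.

    import hashlib
    import os.path as op

    def same_across_datasets(relpath, *dataset_paths):
        """Return True if `relpath` has identical content in all datasets."""
        digests = set()
        for ds_path in dataset_paths:
            fpath = op.join(ds_path, relpath)
            if not op.exists(fpath):
                return False  # absent in one dataset -- cannot be "the same"
            with open(fpath, 'rb') as f:
                digests.add(hashlib.md5(f.read()).hexdigest())
        return len(digests) == 1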

  • Sits on top of #3002
  • Also refactors: moves the annex-style split_ext from addurls to the generic support.path (a sketch of such a splitter follows this list)
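
What an annex-style split_ext boils down to (purely illustrative -- the length/count thresholds below are assumptions, not the actual rules in datalad or git-annex): unlike os.path.splitext, it should keep compound suffixes such as ".nii.gz" together.

    import os.path as op

    def split_ext(filename, max_ext_len=4, max_ext_count=2):
        """Split 'sub-01_T1w.nii.gz' into ('sub-01_T1w', '.nii.gz')."""
        root, ext = filename, ''
        for _ in range(max_ext_count):
            base, e = op.splitext(root)
            # stop once there is no further short, plausible extension
            if not base or not e or len(e) - 1 > max_ext_len:
                break
            root, ext = base, e + ext
        return root, ext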

TODOs

  • Finish/Merge #3009 first (since this one merges it in)
  • Check if actually addresses usecase of #2841
  • Unit-test

Please have a look @datalad/developers

@codecov codecov bot commented Nov 23, 2018

Codecov Report

Merging #3007 into master will decrease coverage by 4.39%.
The diff coverage is 71.42%.


@@            Coverage Diff            @@
##           master    #3007     +/-   ##
=========================================
- Coverage   90.36%   85.96%   -4.4%     
=========================================
  Files         245      245             
  Lines       32276    32411    +135     
=========================================
- Hits        29167    27863   -1304     
- Misses       3109     4548   +1439
Impacted Files Coverage Δ
datalad/plugin/tests/test_addurls.py 59.92% <ø> (-40.08%) ⬇️
datalad/support/json_py.py 98.64% <100%> (+1.38%) ⬆️
datalad/support/path.py 92% <100%> (+2%) ⬆️
datalad/support/annexrepo.py 87.5% <100%> (-0.6%) ⬇️
datalad/utils.py 86.29% <100%> (-0.05%) ⬇️
datalad/plugin/addurls.py 72.98% <100%> (-26.72%) ⬇️
datalad/support/tests/test_path.py 88% <100%> (+6.75%) ⬆️
datalad/tests/utils.py 86.59% <100%> (-2.8%) ⬇️
datalad/metadata/tests/test_aggregation.py 99.06% <100%> (ø) ⬆️
datalad/metadata/aggregate.py 77.08% <40.47%> (-14.89%) ⬇️
... and 78 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d59932e...cf62c09.

yarikoptic added 9 commits Nov 23, 2018
.xz was not added while considering lexists for aggregated metadata...

Not sure how it all worked -- probably was extracting more frequently than needed
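
In code terms, the fix described above amounts to something like this (variable names are illustrative, not datalad's actual code): the existence check for an aggregated metadata object must account for the on-disk `.xz` compression suffix, otherwise the object appears missing and extraction is redone.

    from os.path import lexists

    # objpath: path of the aggregated metadata object (hypothetical name);
    # objects may be stored xz-compressed, so test both variants
    have_metadata_object = lexists(objpath) or lexists(objpath + '.xz')
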
We must not cause re-extraction if content differs - we must only
cause re-copying
So we could create over existing annexed content
BF: possibly make use of temporary index every time (not only direct mode) committing a list of files

This should avoid the issue of a plain "commit -- files" failing if annex migration happens in the background with

    Cannot make a partial commit with unlocked annexed files
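
The temporary-index trick the commit message refers to could look roughly like this (a hedged sketch under stated assumptions, not datalad's actual implementation; the helper name is hypothetical): instead of `git commit -- <files>`, which performs a partial commit and fails for unlocked annexed files, stage the files into a scratch index and create the commit from that.

    import os
    import subprocess
    import tempfile

    def commit_files_via_tmp_index(repo_path, files, message):
        """Commit only `files` without touching the repository's real index."""
        with tempfile.TemporaryDirectory() as tmpdir:
            env = dict(os.environ,
                       GIT_INDEX_FILE=os.path.join(tmpdir, 'index'))
            # populate the scratch index from HEAD, then stage only `files`
            subprocess.check_call(['git', 'read-tree', 'HEAD'],
                                  cwd=repo_path, env=env)
            subprocess.check_call(['git', 'add', '--'] + list(files),
                                  cwd=repo_path, env=env)
            tree = subprocess.check_output(
                ['git', 'write-tree'],
                cwd=repo_path, env=env).decode().strip()
            commit = subprocess.check_output(
                ['git', 'commit-tree', tree, '-p', 'HEAD', '-m', message],
                cwd=repo_path, env=env).decode().strip()
            # advance HEAD's branch to the new commit
            subprocess.check_call(['git', 'update-ref', 'HEAD', commit],
                                  cwd=repo_path)
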
ENH(TST): test with and without explicit unlocking before changing the annexed file(s)

A small price to pay to make sure that it behaves consistently
one thing at a time, so postponed digging into this for now
@yarikoptic (Member, Author) commented Nov 23, 2018

I was wrong

Either an effect of this one or #3007, or it was there before -- aggregation with forced extraction within a subdataset doesn't bring that metadata into the superdataset and requires an additional run:

$> datalad aggregate-metadata --incremental -d . --force-extraction ds000002
....
$> ls -Lld {.,ds000002}/.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz
-r--r--r-- 1 yoh datalad 75648 Nov 19 18:31 ./.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz
-r-------- 1 yoh datalad 67680 Jun  7 22:55 ds000002/.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz

$> datalad aggregate-metadata --incremental -d . ds000002                                
[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000002 
[INFO   ] Update aggregate metadata in dataset at: /mnt/btrfs/datasets/datalad/crawl/openfmri 
aggregate_metadata(ok): /mnt/btrfs/datasets/datalad/crawl/openfmri (dataset)
[INFO   ] Attempting to save 200 files/datasets 
save(ok): /mnt/btrfs/datasets/datalad/crawl/openfmri (dataset)
action summary:
  aggregate_metadata (ok: 1)
  save (ok: 1)

$> ls -Lld {.,ds000002}/.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz
-r-------- 2 yoh datalad 67680 Jun  7 22:57 ./.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz
-r-------- 1 yoh datalad 67680 Jun  7 22:55 ds000002/.datalad/metadata/objects/f8/cn-9fc41a9de2b35ab590770977a99eac.xz
edit 1: Nah -- it was an error of the user operating heavy machinery ;-) It was the effect of running without `--update-mode all`, with "target" being the default, so the super-dataset was the only one getting metadata updates.
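
For completeness, the invocation that should update aggregated metadata in all involved datasets, not only the "target" super-dataset (using the `--update-mode` option named above), would be along the lines of:

    $> datalad aggregate-metadata --incremental --update-mode all -d . ds000002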

yarikoptic added 5 commits Nov 23, 2018
I had disabled careless mode while testing and forgot to re-enable it.
That resulted in a failing attempt to commit nothing
… metadata

Eventually the "open" PR will be finished and provide the needed convenience, I guess
* bf-commit-changed-annex: (22 commits)
  BF: do not attempt to get list of non-staged changed files in direct mode
  DOC: comment
  BF: return back "careless" operation of the commit
  ENH: TODO to do test with also unannexed files
  ENH(TST): test with and without explicit unlocking before changing the annexed file(s)
  BF: possibly make use of temporary index every time (not only direct mode) committing a list of files
  ENH(TST,BK): a test to demonstrate inability to commit changes to annexed files
  ENH: bind os.path.basename in our .support.path
  BF: sort modified files before comparison
  ENH: remove_existing option for create_tree
  RF: Go to 50% margin
  BF: Assume 20% safety margin for max cmdline length
  BF: use temporary filename instead of .git/logs to store build log
  RF: disable freezing by default
  Added sample bisection scripts for annex
  Messing with neurodebian freeze to get some prior build env
  ENH: silence cleanup
  DOC: tools/bisect-git-annex: Revise script's commentary
  CLN: tools/bisect-git-annex: Fix a typo
  RF: it is silly to provide git annex repo as argument since needs to be ran within one already
  ...
@yarikoptic yarikoptic added this to the Release 0.11.1 milestone Nov 24, 2018
@yarikoptic yarikoptic mentioned this pull request Nov 24, 2018
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Nov 27, 2018
@yarikoptic yarikoptic merged commit cf62c09 into datalad:master Nov 27, 2018
6 of 7 checks passed
yarikoptic added a commit that referenced this pull request Nov 27, 2018
	## 0.11.1 (Nov 25, 2018) -- v7-better-than-v6

	Rushed out bugfix release to stay fully compatible with recent
	[git-annex] which introduced v7 to replace v6.

	### Fixes

	- [install]: be able to install recursively into a dataset ([#2982])
	- [save]: be able to commit/save changes whenever files potentially
	  could have swapped their storage between git and annex
	  ([#1651]) ([#2752]) ([#3009])
	- [aggregate-metadata]:
	  - the dataset itself is no longer "aggregated" if specific paths are
		provided for aggregation ([#3002]). That resolves the issue of a
		`-r` invocation aggregating all subdatasets of the specified dataset
		as well
	  - also compare/verify the actual content checksum of aggregated metadata
		while considering subdataset metadata for re-aggregation ([#3007])
	- `annex` commands are now chunked assuming a 50% "safety margin" on the
	  maximal command-line length. Should resolve crashes while operating
	  on too many files at once ([#3001])
	- `run` sidecar config processing ([#2991])
	- no double trailing period in docs ([#2984])
	- correct identification of the repository with symlinks in the paths
	  in the tests ([#2972])
	- re-evaluation of dataset properties in case of dataset changes ([#2946])
	- [text2git] procedure to use `ds.repo.set_gitattributes`
	  ([#2974]) ([#2954])
	- Switch to use plain `os.getcwd()` if inconsistency with env var
	  `$PWD` is detected ([#2914])
	- Make sure that credential defined in env var takes precedence
	  ([#2960]) ([#2950])

	### Enhancements and new features

	- [shub://datalad/datalad:git-annex-dev](https://singularity-hub.org/containers/5663/view)
	  provides a Debian buster Singularity image with build environment for
	  [git-annex]. [tools/bisect-git-annex]() provides a helper for running
	  `git bisect` on git-annex using that Singularity container ([#2995])
	- Added [.zenodo.json]() for better integration with Zenodo for citation
	- [run-procedure] now provides names and help messages with a custom
	  renderer ([#2993])
	- Documentation: point to [datalad-revolution] extension (prototype of
	  the greater DataLad future)
	- [run]
	  - support injecting of a detached command ([#2937])
	- `annex` metadata extractor now extracts the `annex.key` metadata record.
	  Should now allow identifying uses of specific files, etc. ([#2952])
	- Test that we can install from http://datasets.datalad.org
	- Proper rendering of `CommandError` (e.g. in case of "out of space"
	  error) ([#2958])

* tag '0.11.1':
  Adjust the date -- 25th fell through due to __version__ fiasco
  BF+ENH(TST): boost hardcoded version + provide a test to guarantee consistency in the future
  This (expensive) approach is not needed in v6+
  small tuneup to changelog
@yarikoptic yarikoptic deleted the bf-metadata-check branch Feb 17, 2019