
BF: Reevaluation of Dataset properties #2946

Merged
merged 17 commits into datalad:master from bf-ds-property-reevaluation on Nov 5, 2018

Conversation

@bpoldrack
Member

@bpoldrack bpoldrack commented Oct 24, 2018

Reevaluate properties when the physically underlying repo changes, ceases to exist, or comes into existence.
Note that `Dataset` is supposed to represent a path reference to a repository that may not even exist yet, whereas the *Repo classes are supposed to represent the actual repository. This leads to the Repos being flyweights wrt the realpath and `Dataset` being a flyweight wrt the unresolved path!

There's a side effect to be mentioned: Previously it was possible to read/write .datalad/config via `Dataset.config` even if there was no repository (yet) at all. This is not the case anymore, and I think this makes sense. Ultimately, the previous behavior suggests that you can store config at "dataset level" (parameter where='dataset') while there is no dataset. This is misleading and I don't see a reason for it. However, a `Dataset` instance without an underlying physical repository still provides access to user- and system-level config, of course.
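The flyweight distinction described above can be sketched roughly like this. This is a simplified illustration, not the actual datalad implementation; the caching mechanics and attribute names here are assumptions:

```python
import os


class Repo:
    """Sketch: *Repo flyweights are keyed on the *resolved* path, so two
    paths pointing at the same physical repository share one instance."""
    _instances = {}

    def __new__(cls, path):
        key = os.path.realpath(path)  # resolve symlinks for the cache key
        if key not in cls._instances:
            inst = super().__new__(cls)
            inst.path = key
            cls._instances[key] = inst
        return cls._instances[key]


class Dataset:
    """Sketch: Dataset flyweights are keyed on the *unresolved* path,
    since a Dataset is only a path reference that may not (yet) point
    at an existing repository."""
    _instances = {}

    def __new__(cls, path):
        if path not in cls._instances:  # no realpath resolution here
            inst = super().__new__(cls)
            inst.path = path
            cls._instances[path] = inst
        return cls._instances[path]
```

With this split, a `Dataset` can exist for a path before any repository does, while `Repo` identity tracks the physical repository on disk.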

Also:
Closes #2942

Changes

  • reevaluate Dataset.repo
  • reevaluate Dataset.config
  • reevaluate Dataset.id
  • This change is complete

Please have a look @datalad/developers

bpoldrack added 2 commits Oct 24, 2018
Reevaluate properties when physically underlying repo changes, ceases to exist or comes into existence.
@codecov

@codecov codecov bot commented Oct 25, 2018

Codecov Report

Merging #2946 into master will increase coverage by 0.02%.
The diff coverage is 96.66%.


@@            Coverage Diff             @@
##           master    #2946      +/-   ##
==========================================
+ Coverage   90.35%   90.38%   +0.02%     
==========================================
  Files         246      245       -1     
  Lines       32136    32211      +75     
==========================================
+ Hits        29038    29113      +75     
  Misses       3098     3098
Impacted Files Coverage Δ
datalad/interface/tests/test_run.py 99.83% <100%> (ø) ⬆️
datalad/support/exceptions.py 80.13% <100%> (+0.13%) ⬆️
datalad/distribution/tests/test_dataset.py 100% <100%> (ø) ⬆️
datalad/metadata/tests/test_base.py 99.26% <100%> (ø) ⬆️
datalad/distribution/create.py 90.14% <100%> (+0.06%) ⬆️
datalad/support/gitrepo.py 88.9% <83.33%> (+0.06%) ⬆️
datalad/support/annexrepo.py 88.07% <83.33%> (+0.01%) ⬆️
datalad/distribution/dataset.py 94.55% <95.45%> (+0.61%) ⬆️
datalad/distribution/siblings.py 68.8% <0%> (-0.3%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdda117...7fda144. Read the comment docs.

bpoldrack added 3 commits Oct 25, 2018
A Dataset's ConfigManager (Dataset.config property) no longer allows
reading/writing config from repository-level config files.
Don't pretend to have a notion of dataset-/repo-level config if there isn't
yet (or isn't anymore) an actual repository on the file system.
Since the underlying repo and config may change, the property 'id' also needs
to be reevaluated and cannot be persistent wrt the Dataset instance, but
only wrt the actual repository.
This also requires a change in the logic of datalad-create, which relied
on this persistency.
@bpoldrack bpoldrack force-pushed the bf-ds-property-reevaluation branch from 7aebf69 to 3d356e2 Oct 25, 2018
bpoldrack added 2 commits Oct 25, 2018
We are currently unable to resolve them, since we rely on os.path.realpath for that and it doesn't resolve Windows symlinks.
@bpoldrack bpoldrack force-pushed the bf-ds-property-reevaluation branch from 3d356e2 to d10895c Oct 25, 2018
@bpoldrack bpoldrack changed the title [WIP] BF: Reevaluation of Dataset properties BF: Reevaluation of Dataset properties Oct 25, 2018
Member

@yarikoptic yarikoptic left a comment

Left minor comments. Also, although the following concern might sound unrelated (and I guess the optimization is worth a separate issue/PR), the code changes within .config (which becomes less lazy and now queries .repo on each call) hint that we are just missing some design pattern to avoid all the repeated dances here and elsewhere (config). I am afraid that with this PR we might notice even further slowdowns (see my comments within def repo). Anything you could notice running those few benchmarks we have?

@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 26, 2018

Re the loss of laziness:
Generally you're right, and as of now there's an impact on the benchmarks. But note that this comes "only" from config and id now referring to repo. And repo wasn't lazy before (see my comment above). What changes here is that we no longer detach config and id from what is going on underneath. Both are still lazy in that it doesn't cost more than a pointer assignment if the underlying ConfigManager instance didn't change. Note also that self.config.get(...) in id doesn't mean querying the config files all the time, since ConfigManager keeps things in memory (except when reload() is called).
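The "pointer assignment" laziness described above can be illustrated with a toy sketch. This is not the actual datalad code; `ConfigManager` here is a stand-in, and the injectable `repo_lookup` exists only to make the sketch testable:

```python
class ConfigManager:
    """Stand-in for datalad's ConfigManager: bound to one repo object."""
    def __init__(self, repo):
        self.repo = repo


class Dataset:
    """Sketch: config is re-derived from the (re-evaluated) repo, but only
    a new ConfigManager is built when the underlying repo object changed."""
    def __init__(self, path, repo_lookup):
        self.path = path
        self._repo_lookup = repo_lookup  # injectable repo detection, for the sketch
        self._cfg = None
        self._cfg_source = None          # the repo the cached manager came from

    @property
    def repo(self):
        # re-evaluated on every access; may return a new object or None
        return self._repo_lookup(self.path)

    @property
    def config(self):
        repo = self.repo
        if self._cfg is not None and repo is self._cfg_source:
            return self._cfg  # cheap path: identity check + return
        # repo changed (or first access): rebuild the ConfigManager
        self._cfg = ConfigManager(repo)
        self._cfg_source = repo
        return self._cfg
```

As long as the repo object is unchanged, repeated `ds.config` accesses cost only an identity comparison; only a genuine change underneath triggers a rebuild.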

@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 26, 2018

However, I'm looking into making it cheaper. We actually check the validity of the repo several times where once would be sufficient: we do it explicitly in Dataset.repo, then the flyweight engine does it, and in case the actual constructors of *Repo are called, it is done again, making it way costlier than it needs to be. I'll try to address at least the redundancy between Dataset.repo and the flyweight mechanics within this PR.

@bpoldrack bpoldrack changed the title BF: Reevaluation of Dataset properties [WIP] BF: Reevaluation of Dataset properties Oct 26, 2018
@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 26, 2018

FTR: I'm close to a significant performance improvement. I still have an issue with catching the case that there was a valid repo before and there is one now, but there wasn't in between (recreation). Dataset fails to reevaluate repo in this case and points to the wrong object (which has implications => wrong ConfigManager used). Problem is: I haven't yet figured out why this even worked before! ;-)

- If there already is a repo instance, reduce the cost of evaluating
this property to the actually required testing. This comes at an
additional cost in case the instance has to change, however. See the code
comment for reference on what can and should be done about this in addition.
- BF: is_installed
  Former implementation isn't valid anymore, due to reevaluation of
'repo'.
@bpoldrack bpoldrack force-pushed the bf-ds-property-reevaluation branch from ff0d74f to 8844925 Oct 27, 2018
@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 27, 2018

Okay. That's it for now, I guess. Comparison to current master:

% asv compare 84f44bc882f01f4e3a4c28bff92a6cb7e6f1fc69 884492538644c43b24cbe2792ccd7c028ab18033 --machine tree     

All benchmarks:

    before     after       ratio
  [84f44bc8] [88449253]
      1.11s      1.09s      0.99  api.supers.time_createadd
      1.08s      1.07s      0.99  api.supers.time_createadd_to_dataset
      3.16s      3.17s      1.00  api.supers.time_installr
   127.22ms   126.12ms      0.99  api.supers.time_ls
      1.12s      1.12s      1.00  api.supers.time_ls_recursive
      1.27s      1.27s      1.00  api.supers.time_ls_recursive_long_all
      1.80s      1.81s      1.01  api.supers.time_remove
   351.43ms   360.70ms      1.03  api.testds.time_create_test_dataset1
      2.36s      2.35s      1.00  api.testds.time_create_test_dataset2x2
     2.04ms     2.05ms      1.01  core.runner.time_echo
     2.36ms     2.36ms      1.00  core.runner.time_echo_gitrunner
   508.62ms   506.12ms      1.00  core.startup.time_help_np
   101.02ms   100.66ms      1.00  core.startup.time_import
   456.64ms   446.75ms      0.98  core.startup.time_import_api

Please note two things:

  • This is all about the speed of Dataset.repo evaluation. If you consider that we now do it significantly more often, due to the implicit access to repo via config, not getting slower overall is quite something.
  • There's more that can be done about it, but it should involve the entire call path from Dataset.repo through Flyweight to AnnexRepo.__init__ and GitRepo.__init__ (plus probably some RF'ing of is_valid_repo). So that's a topic for a different PR, I think.

@bpoldrack bpoldrack changed the title [WIP] BF: Reevaluation of Dataset properties BF: Reevaluation of Dataset properties Oct 27, 2018
@yarikoptic
Member

@yarikoptic yarikoptic commented Oct 27, 2018

Cool, I will have a look.
I guess the idea to bind .repo in functions to a local variable might still be quite helpful. Or are you planning to work through the stack to see if anything could be done (in a separate PR)?

@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 27, 2018

I guess the idea to bind .repo in functions to a local variable might still be quite helpful. Or are you planning to work through the stack to see if anything could be done (in a separate PR)?

I do plan to do so, since I started looking into it anyway.
However, this can only lead to a further speed-up of the validation. Using local variables in functions will remain a good idea. Maybe I should add that to the docstring of repo.

@bpoldrack bpoldrack force-pushed the bf-ds-property-reevaluation branch from f492083 to 0f6cc99 Oct 28, 2018
@bpoldrack
Member Author

@bpoldrack bpoldrack commented Oct 29, 2018

FTR: the nd14_04 failure seems to be the usual one, but the nd16_04 failure is:

ERROR: datalad.support.tests.test_annexrepo.test_AnnexRepo_addurl_to_file_batched
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/tests/utils.py", line 418, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/tests/utils.py", line 530, in newfunc
    return tfunc(*(args + (path, url)), **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/tests/utils.py", line 591, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/tests/test_annexrepo.py", line 1036, in test_AnnexRepo_addurl_to_file_batched
    ar.add_url_to_file(testfile, testurl, batch=True)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/gitrepo.py", line 228, in newfunc
    return func(self, file_new, *args, **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/annexrepo.py", line 2147, in add_url_to_file
    % (url, str(out_json)))
AnnexBatchCommandError: AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='http://127.0.0.1:40389/about.txt'): {'command': 'addurl', 'file': 'about.txt', 'success': False}

ATM I have no idea how this could possibly be related.
It passed for all other builds, and it used to pass for this one as well as of 8844925. The only changes since then were docstrings and changed assertions in test_dataset.py.

@yarikoptic
Member

@yarikoptic yarikoptic commented Oct 29, 2018

I will restart 16.04 (when I get to work) just to make sure that it was a fluke

@yarikoptic
Member

@yarikoptic yarikoptic commented Oct 29, 2018

ok, the rerun showed that the test failure was a fluke (it now fails with the keyring issue, which is fixed in master; buildbot doesn't merge master before running)

Member

@yarikoptic yarikoptic left a comment

More questions than recommendations, since the logic for all of the change detection is already quite convoluted

@bpoldrack
Member Author

@bpoldrack bpoldrack commented Nov 2, 2018

FTR: appveyor fails in direct mode in test_datasets_datalad_org. This appears to be unrelated.

@yarikoptic
Member

@yarikoptic yarikoptic commented Nov 2, 2018

Pushed 6f256ea, which introduces use of that updates_tree in other relevant command calls.

Here is the unfortunate impact from that:

$> asv compare -m hopa aa052ccd57334ad3ae6cdd202baf920caa7de411 6f256eae8e0ffeb4f65ef50e1242c4871bd33aea

All benchmarks:

       before           after         ratio
     [aa052ccd]       [6f256eae]
+           1.68s            1.86s     1.11  api.supers.time_createadd
+           1.65s            1.97s     1.20  api.supers.time_createadd_to_dataset
            4.36s            4.59s     1.05  api.supers.time_installr
          155±9ms          153±5ms     0.98  api.supers.time_ls
            1.56s            1.54s     0.98  api.supers.time_ls_recursive
            1.77s            1.77s     1.00  api.supers.time_ls_recursive_long_all
            2.49s            2.46s     0.99  api.supers.time_remove
+         499±5ms          565±8ms     1.13  api.testds.time_create_test_dataset1
+           3.76s            4.23s     1.12  api.testds.time_create_test_dataset2x2
      1.83±0.06ms      1.87±0.03ms     1.02  core.runner.time_echo
      2.03±0.03ms      2.05±0.04ms     1.01  core.runner.time_echo_gitrunner
          534±1ms          530±0ms     0.99  core.startup.time_help_np
          106±3ms          105±3ms     0.99  core.startup.time_import
          470±2ms        462±0.8ms     0.98  core.startup.time_import_api

@yarikoptic
Member

@yarikoptic yarikoptic commented Nov 2, 2018

here is the overall impact I see in comparison with the master

$> asv compare -m hopa 34d4360e4b984e5f04fed183338e7a2bdf5f9163 6f256eae8e0ffeb4f65ef50e1242c4871bd33aea
                                                               
All benchmarks:

       before           after         ratio
     [34d4360e]       [6f256eae]
            1.71s            1.86s     1.09  api.supers.time_createadd
+           1.70s            1.97s     1.16  api.supers.time_createadd_to_dataset
            4.89s            4.59s     0.94  api.supers.time_installr
          145±4ms          153±5ms     1.05  api.supers.time_ls
            1.52s            1.54s     1.01  api.supers.time_ls_recursive
            1.83s            1.77s     0.97  api.supers.time_ls_recursive_long_all
            2.44s            2.46s     1.01  api.supers.time_remove
+         498±3ms          565±8ms     1.13  api.testds.time_create_test_dataset1
+           3.22s            4.23s     1.32  api.testds.time_create_test_dataset2x2
       2.41±0.2ms      1.87±0.03ms    ~0.78  core.runner.time_echo
-      2.92±0.3ms      2.05±0.04ms     0.70  core.runner.time_echo_gitrunner
+         461±8ms          530±0ms     1.15  core.startup.time_help_np
+        82.6±2ms          105±3ms     1.27  core.startup.time_import
+         410±2ms        462±0.8ms     1.13  core.startup.time_import_api

Not sure yet why import/help show a notable impact; should analyze. Maybe some rogue import needs to be delayed, or some structure is not created?

@mih
Member

@mih mih commented Nov 3, 2018

I am concerned by the update-trees thing and how it is used. I fear that a config reload is much slower than one would hope in some cases (NAS drive, etc.). And now we are doing it unconditionally all over the place. Even to support esoteric scenarios like .datalad being a submodule!?

Those config reloads are only concerned with .datalad/config. Why not test for this specifically, instead of brute-forcing config reloads that parse much more than this file?

@bpoldrack
Member Author

@bpoldrack bpoldrack commented Nov 4, 2018

here is the overall impact I see in comparison with the master

Well, that's why I didn't want it to be the default and was stating that we need to carefully consider which methods actually need it. This is just another topic that should be addressed on its own, from my point of view. There are additional reloads that should be reconsidered if we do it that way.
I really don't know why this PR has to grow to resolve everything it shines a light on.

I'd just go back to 7fda144 and get that merged. Speeding up Dataset.repo and ironing out config reloads are separate issues, which should be addressed separately.

If we can't agree on that, I need to leave it and bring this into revolution instead. I need the fix this PR is actually about.

@yarikoptic
Member

@yarikoptic yarikoptic commented Nov 4, 2018

Sure, bring it back or remove the triggering for "esoteric" situations. I felt that many of the others not mentioned originally were quite warranted, but that is ok; we can just wait for bug reports to add them back.

@bpoldrack bpoldrack force-pushed the bf-ds-property-reevaluation branch from 6f256ea to 7fda144 Nov 5, 2018
@bpoldrack
Member Author

@bpoldrack bpoldrack commented Nov 5, 2018

Ok. Went back by one commit.
Failures are unrelated to this PR. Thus: Merging.

@bpoldrack bpoldrack merged commit e4795ee into datalad:master Nov 5, 2018
4 of 7 checks passed
mih added a commit to datalad/datalad-revolution that referenced this issue Nov 5, 2018
@yarikoptic yarikoptic added this to the Release 0.11.1 milestone Nov 24, 2018
yarikoptic added a commit that referenced this issue Nov 27, 2018
	## 0.11.1 (Nov 25, 2018) -- v7-better-than-v6

	Rushed out bugfix release to stay fully compatible with recent
	[git-annex] which introduced v7 to replace v6.

	### Fixes

	- [install]: be able to install recursively into a dataset ([#2982])
	- [save]: be able to commit/save changes whenever files potentially
	  could have swapped their storage between git and annex
	  ([#1651]) ([#2752]) ([#3009])
	- [aggregate-metadata]:
	  - the dataset itself is now not "aggregated" if specific paths are
		provided for aggregation ([#3002]). That resolves the issue of
		`-r` invocation aggregating all subdatasets of the specified dataset
		as well
	  - also compare/verify the actual content checksum of aggregated metadata
		while considering subdataset metadata for re-aggregation ([#3007])
	- `annex` commands are now chunked assuming a 50% "safety margin" on the
	  maximal command line length. Should resolve crashes while operating
	  on too many files at once ([#3001])
	- `run` sidecar config processing ([#2991])
	- no double trailing period in docs ([#2984])
	- correct identification of the repository with symlinks in the paths
	  in the tests ([#2972])
	- re-evaluation of dataset properties in case of dataset changes ([#2946])
	- [text2git] procedure to use `ds.repo.set_gitattributes`
	  ([#2974]) ([#2954])
	- Switch to use plain `os.getcwd()` if inconsistency with env var
	  `$PWD` is detected ([#2914])
	- Make sure that credential defined in env var takes precedence
	  ([#2960]) ([#2950])

	### Enhancements and new features

	- [shub://datalad/datalad:git-annex-dev](https://singularity-hub.org/containers/5663/view)
	  provides a Debian buster Singularity image with build environment for
	  [git-annex]. [tools/bisect-git-annex]() provides a helper for running
	  `git bisect` on git-annex using that Singularity container ([#2995])
	- Added [.zenodo.json]() for better integration with Zenodo for citation
	- [run-procedure] now provides names and help messages with a custom
	  renderer ([#2993])
	- Documentation: point to [datalad-revolution] extension (prototype of
	  the greater DataLad future)
	- [run]
	  - support injecting of a detached command ([#2937])
	- `annex` metadata extractor now extracts the `annex.key` metadata record.
	  Should now allow identifying uses of specific files etc ([#2952])
	- Test that we can install from http://datasets.datalad.org
	- Proper rendering of `CommandError` (e.g. in case of "out of space"
	  error) ([#2958])

* tag '0.11.1':
  Adjust the date -- 25th fell through due to __version__ fiasco
  BF+ENH(TST): boost hardcoded version + provide a test to guarantee consistency in the future
  This (expensive) approach is not needed in v6+
  small tuneup to changelog
@bpoldrack bpoldrack deleted the bf-ds-property-reevaluation branch Sep 2, 2020