Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: maintain cache of discovered subdatasets in the loop of annotate_paths #3570

Merged

Conversation

@yarikoptic
Copy link
Member

commented Jul 30, 2019

Otherwise there is a rediscovery of subdatasets for EVERY path provided to annotate_paths even when they all belong to the same parent dataset.

Closes: #3567

TODOs:

  • verify that nothing is broken at least according to the tests. As is mentioned in the code comment, because of its "generator" nature, if annotate_path is called from within some loop altering the list of submodules, cache could get invalidated. If there is such a usage case we should probably add some kind of sensing of mtime for .gitmodules of the parent or add an explicit argument to not retain this cache, but then things will get slow again whenever not needed
  • see and report if any notable effect on benchmarks
  • verify that it helps in a real use case which "inspired" this fix
  • (not here) we really should add a benchmark suite operating on a set of paths within a bigger dataset across a variety of commands
OPT: maintain cache of discovered subdatasets in the loop of annotate…
…_paths

Otherwise there is a rediscovery of subdatasets for EVERY path provided to annotate_paths
even when they all belong to the same parent dataset.
@yarikoptic

This comment has been minimized.

Copy link
Member Author

commented Jul 30, 2019

benchmarks on travis show no significant effect of any kind and all suggest a bit of slowness... kheh kheh ;-)
       before           after         ratio
     [c411d6a2]       [58c23cd8]
     <bm/merge-target>       <bm/pr>   
       1.25±0.01s       1.26±0.02s     1.00  api.supers.time_createadd
       1.22±0.01s       1.24±0.01s     1.02  api.supers.time_createadd_to_dataset
       3.32±0.04s       3.51±0.06s     1.06  api.supers.time_installr
          136±1ms          148±8ms     1.09  api.supers.time_ls
       1.23±0.01s       1.28±0.06s     1.04  api.supers.time_ls_recursive
       1.34±0.02s       1.40±0.02s     1.04  api.supers.time_ls_recursive_long_all
       2.01±0.02s       2.06±0.01s     1.03  api.supers.time_remove
          398±4ms          412±3ms     1.03  api.testds.time_create_test_dataset1
       2.78±0.03s       2.95±0.03s     1.06  api.testds.time_create_test_dataset2x2
      2.04±0.03ms      2.13±0.03ms     1.05  core.runner.time_echo
      2.84±0.05ms      2.94±0.02ms     1.04  core.runner.time_echo_gitrunner
         655±10ms          688±3ms     1.05  core.startup.time_help_np
        156±0.8ms          160±1ms     1.02  core.startup.time_import
          595±4ms        621±0.9ms     1.04  core.startup.time_import_api
@codecov

This comment has been minimized.

Copy link

commented Jul 31, 2019

Codecov Report

Merging #3570 into 0.11.x will increase coverage by 1.09%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3570      +/-   ##
==========================================
+ Coverage   80.25%   81.35%   +1.09%     
==========================================
  Files         255      255              
  Lines       33639    33648       +9     
==========================================
+ Hits        26997    27373     +376     
+ Misses       6642     6275     -367
Impacted Files Coverage Δ
datalad/interface/annotate_paths.py 84.9% <100%> (+0.23%) ⬆️
datalad/interface/tests/test_annotate_paths.py 100% <100%> (ø) ⬆️
datalad/utils.py 64.51% <0%> (+0.64%) ⬆️
datalad/distribution/dataset.py 91.89% <0%> (+0.9%) ⬆️
datalad/support/annexrepo.py 58.59% <0%> (+0.95%) ⬆️
datalad/log.py 66.82% <0%> (+0.96%) ⬆️
datalad/distribution/utils.py 85.56% <0%> (+1.03%) ⬆️
datalad/support/gitrepo.py 69.17% <0%> (+1.04%) ⬆️
datalad/interface/run_procedure.py 87.97% <0%> (+1.26%) ⬆️
datalad/distribution/remove.py 73.22% <0%> (+1.57%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c411d6a...af9c5a9. Read the comment docs.

@yarikoptic

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2019

real life use case,

  • 382,940 files (via git ls-tree -r) in the repo with 249,691 local annex keys
  • 1,830 *feat sub directories pointing to 368,292 symlinks (annexed) and 12,810 regular files (in git)
  • .git/objects is quite sizeable -- 59,322 MB (may be I should git gc after the last datalad save run)
  • no sub-modules (but this repo is a 3rd level submodule)
  • all data is already here (and didn't go anywhere ;-))

here is timings with eyeball averaging 3 runs:

  • pure git annex get -- 3m30s
  • pure 'git' '-c' 'annex.merge-annex-branches=false' 'annex' 'find' '--json' '--not' '--in' 'here' '--' which we use in datalad get to see what we actually need to get (to get idea on total size before running annex get) -- 11m43s (I guess we should check with @joeyh on how to get a more performant querying to get closer to get performance)
  • patched (this PR) datalad get -- 11m48s ( so we are bound by annex find above)
  • master datalad get -- 13m13s (so not that bad, but >10% time wasted)
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Aug 1, 2019
OPT: use find --in ., not --in here to gain 3x speed up on some cases
Originally found while timing effects of a PR:
  datalad#3570 (comment)
Similar change is now done also on git-annex level in
  d1a0c7b16fcd41f74def394a55c32f67340670b7
but to not wait for its new release, adjusting our
find invocations
@yarikoptic

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2019

so, checked with @joeyh, and with using . instead of here we would get boost

  • pure 'git' '-c' 'annex.merge-annex-branches=false' 'annex' 'find' '--json' '--not' '--in' '.' '--' 4m

PR #3574 will introduce that change.

@yarikoptic

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2019

ok, since there is a boost, and no side-effects were detected so far, merging

@yarikoptic yarikoptic merged commit bf57d0b into datalad:0.11.x Aug 1, 2019

5 checks passed

WIP Ready for review
Details
codecov/patch 100% of diff hit (target 80.25%)
Details
codecov/project 81.35% (+1.09%) compared to c411d6a
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@yarikoptic yarikoptic added this to the Release 0.11.7 milestone Aug 1, 2019

yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7': (87 commits)
  DOC: Added an entry to changelogn on merged 3631
  ENH: finalizing changelog for 0.11.7
  TST: Update tests for a git-annex without direct mode
  TST: utils: Add decorator that skips when direct mode is unsupported
  ENH: annexrepo: Refuse to initialize in direct mode if unsupported
  ENH: annexrepo: Add check_direct_mode_support method
  BF+TST: Avoid leaking patched git-annex version
  TST+RF: test_annexrepo: Split up a test
  CHANGELOG.md: Second batch for 0.11.7
  TST: run_procedure: Mark test_spaces() as known Windows failure
  TST: run_procedure: Mark test_quoting as known windows failure
  TST: run_procedure: Test more arguments that need quoting
  BF(py2): run_procedure: Avoid encoding error in log message
  TST: add run_procedure test with spaces in file name
  TST/RF: non-hardcoded Python executable
  RF: newline at end of file
  RF: helper instead of conditional
  RF: remove superfluous imports
  BF/TST: remove quoting
  ENH: replace conditionals with helper function
  ...
yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

- [download-url][] now will create leading directories of the output path
  if they do not exist ([#3646][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7':
  Changelog entry for download-url paths handling
  ENH: downloaders: Ensure directories for target exist
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.