New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPT: maintain cache of discovered subdatasets in the loop of annotate_paths #3570
OPT: maintain cache of discovered subdatasets in the loop of annotate_paths #3570
Conversation
…_paths Otherwise there is a rediscovery of subdatasets for EVERY path provided to annotate_paths even when they all belong to the same parent dataset.
benchmarks on travis show no significant effect of any kind and all suggest a bit of slowness... kheh kheh ;-) before after ratio
[c411d6a2] [58c23cd8]
<bm/merge-target> <bm/pr>
1.25±0.01s 1.26±0.02s 1.00 api.supers.time_createadd
1.22±0.01s 1.24±0.01s 1.02 api.supers.time_createadd_to_dataset
3.32±0.04s 3.51±0.06s 1.06 api.supers.time_installr
136±1ms 148±8ms 1.09 api.supers.time_ls
1.23±0.01s 1.28±0.06s 1.04 api.supers.time_ls_recursive
1.34±0.02s 1.40±0.02s 1.04 api.supers.time_ls_recursive_long_all
2.01±0.02s 2.06±0.01s 1.03 api.supers.time_remove
398±4ms 412±3ms 1.03 api.testds.time_create_test_dataset1
2.78±0.03s 2.95±0.03s 1.06 api.testds.time_create_test_dataset2x2
2.04±0.03ms 2.13±0.03ms 1.05 core.runner.time_echo
2.84±0.05ms 2.94±0.02ms 1.04 core.runner.time_echo_gitrunner
655±10ms 688±3ms 1.05 core.startup.time_help_np
156±0.8ms 160±1ms 1.02 core.startup.time_import
595±4ms 621±0.9ms 1.04 core.startup.time_import_api |
Codecov Report
@@ Coverage Diff @@
## 0.11.x #3570 +/- ##
==========================================
+ Coverage 80.25% 81.35% +1.09%
==========================================
Files 255 255
Lines 33639 33648 +9
==========================================
+ Hits 26997 27373 +376
+ Misses 6642 6275 -367
Continue to review full report at Codecov.
|
real life use case,
here is timings with eyeball averaging 3 runs:
|
Originally found while timing effects of a PR: datalad#3570 (comment) Similar change is now done also on git-annex level in d1a0c7b16fcd41f74def394a55c32f67340670b7 but to not wait for its new release, adjusting our find invocations
ok, since there is a boost, and no side-effects were detected so far, merging |
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-... Primarily bugfixes with some optimizations and refactorings. Fixes - [addurls][] - now provides better handling when the URL file isn't in the expected format. ([#3579][]) - always considered a relative file for the URL file argument as relative to the current working directory, which goes against the convention used by other commands of taking relative paths as relative to the dataset argument. ([#3582][]) - [run-procedure][] - hard coded "python" when formatting the command for non-executable procedures ending with ".py". `sys.executable` is now used. ([#3624][]) - failed if arguments needed more complicated quoting than simply surrounding the value with double quotes. This has been resolved for systems that support `shlex.quote`, but note that on Windows values are left unquoted. ([#3626][]) - [siblings][] now displays an informative error message if a local path is given to `--url` but `--name` isn't specified. ([#3555][]) - [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't support all the parameters that Git expects it to. ([#3616][]) - Fixed a number of Unicode py2-compatibility issues. ([#3597][]) Enhancements and new features - The [annotate-paths][] helper now caches subdatasets it has seen to avoid unnecessary calls. ([#3570][]) - A repeated configuration query has been dropped from the handling of `--proc-pre` and `--proc-post`. ([#3576][]) - Calls to `git annex find` now use `--in=.` instead of the alias `--in=here` to take advantage of an optimization that git-annex (as of the current release, 7.20190730) applies only to the former. ([#3574][]) - [addurls][] now suggests close matches when the URL or file format contains an unknown field. ([#3594][]) - Shared logic used in the setup.py files of Datalad and its extensions has been moved to modules in the _datalad_build_support/ directory. ([#3600][]) - Get ready for upcoming git-annex dropping support for direct mode ([#3631][]) * tag '0.11.7': (87 commits) DOC: Added an entry to changelogn on merged 3631 ENH: finalizing changelog for 0.11.7 TST: Update tests for a git-annex without direct mode TST: utils: Add decorator that skips when direct mode is unsupported ENH: annexrepo: Refuse to initialize in direct mode if unsupported ENH: annexrepo: Add check_direct_mode_support method BF+TST: Avoid leaking patched git-annex version TST+RF: test_annexrepo: Split up a test CHANGELOG.md: Second batch for 0.11.7 TST: run_procedure: Mark test_spaces() as known Windows failure TST: run_procedure: Mark test_quoting as known windows failure TST: run_procedure: Test more arguments that need quoting BF(py2): run_procedure: Avoid encoding error in log message TST: add run_procedure test with spaces in file name TST/RF: non-hardcoded Python executable RF: newline at end of file RF: helper instead of conditional RF: remove superfluous imports BF/TST: remove quoting ENH: replace conditionals with helper function ...
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-... Primarily bugfixes with some optimizations and refactorings. Fixes - [addurls][] - now provides better handling when the URL file isn't in the expected format. ([#3579][]) - always considered a relative file for the URL file argument as relative to the current working directory, which goes against the convention used by other commands of taking relative paths as relative to the dataset argument. ([#3582][]) - [run-procedure][] - hard coded "python" when formatting the command for non-executable procedures ending with ".py". `sys.executable` is now used. ([#3624][]) - failed if arguments needed more complicated quoting than simply surrounding the value with double quotes. This has been resolved for systems that support `shlex.quote`, but note that on Windows values are left unquoted. ([#3626][]) - [siblings][] now displays an informative error message if a local path is given to `--url` but `--name` isn't specified. ([#3555][]) - [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't support all the parameters that Git expects it to. ([#3616][]) - Fixed a number of Unicode py2-compatibility issues. ([#3597][]) - [download-url][] now will create leading directories of the output path if they do not exist ([#3646][]) Enhancements and new features - The [annotate-paths][] helper now caches subdatasets it has seen to avoid unnecessary calls. ([#3570][]) - A repeated configuration query has been dropped from the handling of `--proc-pre` and `--proc-post`. ([#3576][]) - Calls to `git annex find` now use `--in=.` instead of the alias `--in=here` to take advantage of an optimization that git-annex (as of the current release, 7.20190730) applies only to the former. ([#3574][]) - [addurls][] now suggests close matches when the URL or file format contains an unknown field. ([#3594][]) - Shared logic used in the setup.py files of Datalad and its extensions has been moved to modules in the _datalad_build_support/ directory. ([#3600][]) - Get ready for upcoming git-annex dropping support for direct mode ([#3631][]) * tag '0.11.7': Changelog entry for download-url paths handling ENH: downloaders: Ensure directories for target exist
Otherwise there is a rediscovery of subdatasets for EVERY path provided to annotate_paths even when they all belong to the same parent dataset.
Closes: #3567
TODOs: