Reimplement get_submodules_() without get_content_info() (Reincarnated 6942) #7189
The previous implementation resulted in a circular dependency after a check for a Git-version dependent reporting behavior was introduced with 8ff8613. Moreover, the previous implementation also queried a repository for all content, merely to discard any record that is not a subdataset. This could slow a response dramatically for datasets with few subdatasets but many files.

This change reimplements `get_submodules_()` with a plain call to `git ls-files` parameterized with the paths of recorded submodules taken from `.gitmodules`. This changes the behavior to be more in line with `git submodule` by refusing to report on submodules that are not recorded in `.gitmodules`.

Fixes datalad#6940
Fixes datalad#6941
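For readers unfamiliar with the approach, here is a minimal sketch of the idea, not the actual DataLad code. It assumes the text of `.gitmodules` and the output of `git ls-files --stage` are already available as strings; the helper names `parse_gitmodules_paths` and `submodules_from_ls_files` are hypothetical. Submodules show up in the index as entries with mode `160000`, and only those whose path is recorded in `.gitmodules` are reported.

```python
import re


def parse_gitmodules_paths(text):
    """Return {submodule name: path} from the text of a .gitmodules file.

    Hypothetical helper for illustration; handles only simple
    `[submodule "name"]` sections with `key = value` lines.
    """
    paths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r'\[submodule "(.*)"\]$', line)
        if header:
            name = header.group(1)
            continue
        if name is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "path":
                paths[name] = value
    return paths


def submodules_from_ls_files(stage_output, recorded_paths):
    """Yield (path, sha) for gitlinks that are recorded in .gitmodules.

    `stage_output` is the text printed by `git ls-files --stage`, where
    each line reads "<mode> <object> <stage>\t<path>".
    """
    wanted = set(recorded_paths.values())
    for line in stage_output.splitlines():
        meta, tab, path = line.partition("\t")
        if not tab:
            continue  # skip blank or malformed lines
        mode, sha, _stage = meta.split()
        # mode 160000 marks a gitlink, i.e. a submodule entry
        if mode == "160000" and path in wanted:
            yield path, sha
```

A gitlink that is present in the index but absent from `.gitmodules` is skipped in this sketch, which mirrors the behavior change described above.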
Previously we ignored them in values as mandatory reasons for quoting. This change expands the `quote_config()` helper to perform this additional test. Fixes datalad#6943
Thanks to @yarikoptic for pointing this out.
Both tests pass on master and fail here ATM. The 2nd test, matching for absent s_unknown result returns all submodules but even without gitmodule_name.
Codecov Report
Base: 88.63% // Head: 90.73% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## master #7189 +/- ##
==========================================
+ Coverage 88.63% 90.73% +2.09%
==========================================
Files 325 325
Lines 44109 44125 +16
Branches 5863 5870 +7
==========================================
+ Hits 39096 40036 +940
+ Misses 4998 4074 -924
Partials 15 15
if you just reincarnated it, odd that those tests I have added do not fail -- but that is a good sign ;-) |
I don't mind this PR since the detrimental effect for one usage pattern seems to be less than the performance boost for another, but I want to avoid claiming something to be standard whenever there is no "standard". Using globs on the command line is IMHO no less common than pointing to a subpath. So this PR will make |
- Fix a typo
- Clean up an unused import statement
- Check if paths from .gitmodules and paths passed to the command are relative to each other from both directions. Because we might recurse deeper into a dataset hierarchy with a subdatasets call, the test test_get_subdatasets failed when we solely checked from one direction.
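The "both directions" check in the last item can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `is_related` is a hypothetical name, and both arguments are assumed to be POSIX-style paths relative to the repository root.

```python
from pathlib import PurePosixPath


def is_related(submodule_path, query_path):
    """Return True if one path equals or contains the other.

    Hypothetical helper: a query path selects a submodule when the
    submodule lies underneath the query path, or when the query path
    points to something inside the submodule.
    """
    sm = PurePosixPath(submodule_path)
    query = PurePosixPath(query_path)
    return sm == query or query in sm.parents or sm in query.parents
```

Checking only `query in sm.parents` would miss a query such as `sub/deep/file` for the submodule `sub`, which is the kind of case that comes up when recursing deeper into a dataset hierarchy.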
f81e6d7
to
c478280
Compare
True, sorry - that was poorly phrased. I squashed the few fixes I did on top of the old branch into a single commit, and tuned the changelog. |
Thanks, @adswa!
I have not yet looked into the next failure really. Have you?
I haven't as they looked completely unrelated to submodules:
But I can check if |
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Code Climate has analyzed commit 6d40969 and detected 1 issue on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
Ok, datalad-next failure unrelated, @yarikoptic "doesn't mind" and it doesn't prevent any further work on #6974 - let's merge then. Thanks for digging this up, @adswa. |
…nt_paths Redone datalad#6974 . Now that datalad#7189 has finished the RFing to avoid some slow operations at the cost of making some use cases slower, this should resolve it back to avoid any slow operation. get_parent_paths function was originally created for get_content_info for that specific reason (performance). datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis says it was not possible, see datalad#6941 (comment)) and performance issues by introducing similar logic directly in the code of get_modules_. This commit RFs that back to use get_parent_paths but also extends it to return not parent paths but actual paths. So we reuse the same logic but manipulate what is returned by the function. That allows us to subselect paths which are under "parents". It also adds rudimentary tests for invocation with paths limiting etc.
…ted_paths

Redone datalad#6974 and datalad#7211, which attempted to reuse get_parent_paths with tune-ups. That was not sufficient, since we need a bit of ad-hoc limiting by the path WITHIN a submodule to select that submodule (which would not work for a plain directory).

Overall: now that datalad#7189 has finished the RFing to avoid some slow operations at the cost of making some use cases slower, this should resolve it back to avoid any slow operation. datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis says it was not possible, see datalad#6941 (comment)) and performance issues by introducing similar logic directly in the code of get_modules_.

get_limited_paths relies on sorting of the paths, so it should be O(N*log(N)), but is likely faster in general since the paths are likely sorted to start with.

Timing this branch:

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+185.gd236b5a6d
    0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k
    824inputs+0outputs (1major+8149minor)pagefaults 0swaps
    1113

and master

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+183.g34a28a622
    4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k
    0inputs+0outputs (0major+8226minor)pagefaults 0swaps
    1113

and maint

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+82.g405ece550
    0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k
    0inputs+0outputs (0major+8633minor)pagefaults 0swaps
    1113

So we are no worse (if not better) than maint and definitely gain over master for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but also fixing some bug in maint:

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
    datalad 0.17.9+185.g8b21bfceb
    0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k
    104inputs+0outputs (0major+7652minor)pagefaults 0swaps
    1
    ❯ git co master
    Switched to branch 'master'
    Your branch is up to date with 'origin/master'.
    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
    datalad 0.17.9+183.g34a28a622
    0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k
    0inputs+0outputs (0major+8203minor)pagefaults 0swaps
    1113
    ❯ git co maint
    Switched to branch 'maint'
    Your branch is up to date with 'origin/maint'.
    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
    datalad 0.17.9+82.g405ece550
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_
        for path, props in self.get_content_info(
      File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info
        posix_paths = get_parent_paths(posix_paths, submodules)
      File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths
        _get_parent_paths_check(path, sep, asep)
      File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check
        raise ValueError("Expected relative within directory paths, got %r" % path)
    ValueError: Expected relative within directory paths, got '/'
    Command exited with non-zero status 1
    0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k
    0inputs+0outputs (0major+8107minor)pagefaults 0swaps
    0
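To illustrate why sorting helps here, below is a rough sketch of sort-based path limiting. It is not the get_limited_paths implementation from this commit; `select_under` is a hypothetical name, and paths are assumed to be unique POSIX-style strings relative to the repository root. After sorting, everything underneath a given parent forms one contiguous run, so each parent costs two binary searches instead of a scan over all paths.

```python
from bisect import bisect_left


def select_under(paths, parents):
    """Return the sorted subset of `paths` equal to or underneath any parent.

    Hypothetical sketch: sort once (O(N log N)), then locate each
    parent's run of descendants via binary search.
    """
    ordered = sorted(set(paths))
    picked = set()
    for parent in parents:
        parent = parent.rstrip("/")
        # exact match of the parent itself
        i = bisect_left(ordered, parent)
        if i < len(ordered) and ordered[i] == parent:
            picked.add(i)
        # descendants all start with parent + "/"; "0" is the character
        # that sorts directly after "/", so it bounds that run from above
        lo = bisect_left(ordered, parent + "/")
        hi = bisect_left(ordered, parent + "0")
        picked.update(range(lo, hi))
    return [ordered[i] for i in sorted(picked)]
```

Note that a sibling such as `a/b-c` sorts between `a/b` and `a/b/c`, which is why the exact match and the descendant range are looked up separately rather than as one slice.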
PR released in |
A while ago, @mih proposed #6942. This effort was supposed to be superseded by #6974, which aimed to fix a shortcoming of #6942, namely an O(N²) performance hit in an edge case (when one provides paths to all subdatasets to a subdatasets call). That fix is currently in draft mode, as it required more work than anticipated. @yarikoptic therefore stated in #6942 before it was closed
As I can still see the performance boost #6942 brought, I wanted to reincarnate the PR in its last state.
This is with master and on this branch with the same dataset @mih tested in #6942:

After merging master, I ran the previously failing tests on my Debian system and on a Windows system, where they passed. Based on that, I'm slightly hopeful that our CI might turn green now as well 🤞

I do still see the performance hit in the edge case @yarikoptic discovered, though. Here are my timings in the human-connectome-project openaccess dataset:
Nevertheless, I feel like the change so far does speed up the standard case considerably. If I don't explicitly provide paths to subdatasets, the timing difference is negligible:
Fixes #6940