-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RF: get_modules_ - avoid O(N^2) if paths provided via use of get_limited_paths #7220
Conversation
Codecov ReportBase: 88.73% // Head: 90.74% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #7220 +/- ##
==========================================
+ Coverage 88.73% 90.74% +2.00%
==========================================
Files 325 325
Lines 44124 44181 +57
Branches 5867 5879 +12
==========================================
+ Hits 39154 40090 +936
+ Misses 4955 4076 -879
Partials 15 15
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
typing check fail seems to no longer have anything to do with type checking and rather with installation in the env
since we have that typo fixed in 41d88cb in maint... @bpoldrack beat me with #7222 for the PR to merge in maint -- thanks! |
742adbe
to
bfe6637
Compare
green -- nice! I will rebase now that master has maint thanks to @bpoldrack |
…ted_paths Redone datalad#6974 and datalad#7211 which attempted to reuse get_parent_paths with tune ups -- was not sufficient since need to have a bit ad-hoc limiting by the path WITHIN submodule to select that submodule (which would not work for plain directory). Overall -- now that datalad#7189 has finished the RFing to avoid some slow operations at the cost of making some use cases slower, this should resolve it back to avoid any slow operation. datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis says it was not possible, see datalad#6941 (comment)) and performance issues by introducing similar logic directly in the code of get_modules_. get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but generally likely faster since likely to get them sorted to start with. Timing this branch: ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; ) datalad 0.17.9+185.gd236b5a6d 0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k 824inputs+0outputs (1major+8149minor)pagefaults 0swaps 1113 and master ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; ) datalad 0.17.9+183.g34a28a622 4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k 0inputs+0outputs (0major+8226minor)pagefaults 0swaps 1113 and maint ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; ) datalad 0.17.9+82.g405ece550 0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k 0inputs+0outputs (0major+8633minor)pagefaults 0swaps 1113 so we are no worse (if not better) than maint and definetely gain over master for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint: ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; ) datalad 0.17.9+185.g8b21bfceb 0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k 104inputs+0outputs (0major+7652minor)pagefaults 0swaps 1 ❯ git co master Switched to branch 'master' Your branch is up to date with 'origin/master'. ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; ) datalad 0.17.9+183.g34a28a622 0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k 0inputs+0outputs (0major+8203minor)pagefaults 0swaps 1113 ❯ git co maint Switched to branch 'maint' Your branch is up to date with 'origin/maint'. ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; ) datalad 0.17.9+82.g405ece550 Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_ for path, props in self.get_content_info( File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info posix_paths = get_parent_paths(posix_paths, submodules) File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths _get_parent_paths_check(path, sep, asep) File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check raise ValueError("Expected relative within directory paths, got %r" % path) ValueError: Expected relative within directory paths, got '/' Command exited with non-zero status 1 0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k 0inputs+0outputs (0major+8107minor)pagefaults 0swaps 0
The CI threw def get_limited_paths(paths: list[str|Path], limits: list[str|Path], *, include_within_path: bool = False) -> list[Path]: TypeError: unsupported operand type(s) for |: 'type' and 'type' running build_cfginfo
bfe6637
to
e1fd384
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK to me, although I do find the get_limited_paths
function quite confusing (internal naming of variables as well as docstring).
Re function name:
How about filter_paths
? After all, limits
is a filter for paths
to go through, no?
Co-authored-by: Benjamin Poldrack <bpoldrack@users.noreply.github.com>
@yarikoptic: Not sure what's going in with this breaking all builds (not happening in other PRs against master as far as I can tell), but the kind of failure ( |
…th submodule paths
not sure if it would be effective benchmark in this not so big superdataset, but something at least
e796724
to
46eb0e3
Compare
yeap yeap, thanks @bpoldrack -- force pushed the fix to apply |
appveyor on mac seems to have difficulty installing git-annex==> Fetching git-annex
==> Downloading https://hackage.haskell.org/package/git-annex-10.20221212/git-an
######################################################################## 100.0%
Warning: Your Xcode (12.3) is outdated.
Please update to Xcode 12.4 (or delete it).
Xcode can be updated from the App Store.
==> Installing dependencies for git-annex: docutils, pygments, python@3.10, sphinx-doc, ghc, cabal-install, libmagic and ghc@8.10
==> Installing git-annex dependency: docutils
==> Pouring docutils--0.19.catalina.bottle.1.tar.gz
🍺 /usr/local/Cellar/docutils/0.19: 494 files, 3.9MB
==> Installing git-annex dependency: pygments
==> Pouring pygments--2.13.0_1.all.bottle.tar.gz
🍺 /usr/local/Cellar/pygments/2.13.0_1: 905 files, 11.9MB
==> Installing git-annex dependency: python@3.10
==> Pouring python@3.10--3.10.8.catalina.bottle.1.tar.gz
==> /usr/local/Cellar/python@3.10/3.10.8/bin/python3.10 -m ensurepip
==> /usr/local/Cellar/python@3.10/3.10.8/bin/python3.10 -m pip install -v --no-deps --no-index --upgrade --isolated --target=/usr/local/lib/python3.10/site-packages /usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3
🍺 /usr/local/Cellar/python@3.10/3.10.8: 3,114 files, 56.8MB
==> Installing git-annex dependency: sphinx-doc
==> Pouring sphinx-doc--5.3.0.catalina.bottle.2.tar.gz
🍺 /usr/local/Cellar/sphinx-doc/5.3.0: 3,664 files, 55.8MB
==> Installing git-annex dependency: ghc
Warning: Your Xcode (12.3) is outdated.
Please update to Xcode 12.4 (or delete it).
Xcode can be updated from the App Store.
==> ./configure --prefix=/private/tmp/ghc-20221215-3586-s0g334/ghc-9.4.3/binary
I will restart those |
@jwodder could you please look into how/why installation of git-annex fails on appveyor OSXes via datalad-installer -- e.g. https://ci.appveyor.com/project/mih/datalad/builds/45696242/job/h7a2890c5yyt5ekw -- if anything which could be done on our end? |
@bpoldrack PR should be ready for re-review. appveyor gotchas are unrelated AFAIK |
@yarikoptic The logs look like they might be truncated to me, as they end in the middle of a Regardless, note that the output from Homebrew includes:
EDIT: Actually, the "Messages" tab on that page contains:
which is likely what directly caused the build to fail. |
|
* origin/master: Switch macOS appveyor builds to Monterey Automated deployment to update contributors 2022-12-14 BF: Don't change ui.backend of test process ENH(DX,TST): log at DEBUG about encountered error, unit-test that exception handling block RF: Move basic functionality to SpecialRemote BF: Don't overwrite AVAILABILITY Send ERROR to annex when special remote process fails Add Matthias Riße to .zenodo.json Update changelog.d/pr-7213.md [release-action] Autogenerate changelog snippet for PR 7213 NF: make extra_remote_settings configurable in git config Don't be scary when failing to annex-enable
Code Climate has analyzed commit 1634a62 and detected 1 issue on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
Ok, thank you @yarikoptic! |
Redone #6974 and #7211 which attempted to reuse get_parent_paths with tune ups
-- was not sufficient since need to have a bit ad-hoc limiting by the path
WITHIN submodule to select that submodule (which would not work for plain directory).
Overall -- now that #7189 has finished the RFing to avoid some slow
operations at the cost of making some use cases slower, this should resolve it
back to avoid any slow operation.
#6974 commits avoided a hypothetical infinite recursion (my code analysis
says it was not possible, see #6941 (comment)) and performance issues by
introducing similar logic directly in the code of get_modules_.
get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but
generally likely faster since likely to get them sorted to start with.
Timing this branch:
and master
and maint
so we are no worse (if not better) than maint and definetely gain over master
for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint: