Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF: get_modules_ - avoid O(N^2) if paths provided via use of get_limited_paths #7220

Merged
merged 10 commits into from Dec 23, 2022

Conversation

yarikoptic
Copy link
Member

Redone #6974 and #7211 which attempted to reuse get_parent_paths with tune ups
-- was not sufficient since need to have a bit ad-hoc limiting by the path
WITHIN submodule to select that submodule (which would not work for plain directory).

Overall -- now that #7189 has finished the RFing to avoid some slow
operations at the cost of making some use cases slower, this should resolve it
back to avoid any slow operation.

#6974 commits avoided a hypothetical infinite recursion (my code analysis
says it was not possible, see #6941 (comment)) and performance issues by
introducing similar logic directly in the code of get_modules_.

get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but
generally likely faster since likely to get them sorted to start with.

Timing this branch:

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.9+185.gd236b5a6d
0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k
824inputs+0outputs (1major+8149minor)pagefaults 0swaps
1113

and master

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.9+183.g34a28a622
4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k
0inputs+0outputs (0major+8226minor)pagefaults 0swaps
1113

and maint

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.9+82.g405ece550
0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k
0inputs+0outputs (0major+8633minor)pagefaults 0swaps
1113

so we are no worse (if not better) than maint and definetely gain over master
for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint:

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
datalad 0.17.9+185.g8b21bfceb
0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k
104inputs+0outputs (0major+7652minor)pagefaults 0swaps
1

❯ git co master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
datalad 0.17.9+183.g34a28a622
0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k
0inputs+0outputs (0major+8203minor)pagefaults 0swaps
1113

❯ git co maint
Switched to branch 'maint'
Your branch is up to date with 'origin/maint'.

❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
datalad 0.17.9+82.g405ece550
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_
    for path, props in self.get_content_info(
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info
    posix_paths = get_parent_paths(posix_paths, submodules)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths
    _get_parent_paths_check(path, sep, asep)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check
    raise ValueError("Expected relative within directory paths, got %r" % path)
ValueError: Expected relative within directory paths, got '/'
Command exited with non-zero status 1
0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k
0inputs+0outputs (0major+8107minor)pagefaults 0swaps
0

datalad/support/path.py Outdated Show resolved Hide resolved
datalad/support/path.py Outdated Show resolved Hide resolved
@yarikoptic yarikoptic added semver-performance Changes only improve performance, no impact on version CHANGELOG-missing When a PR's description does not contain a changelog item, yet. labels Dec 9, 2022
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Dec 9, 2022
@codecov
Copy link

codecov bot commented Dec 9, 2022

Codecov Report

Base: 88.73% // Head: 90.74% // Increases project coverage by +2.00% 🎉

Coverage data is based on head (1634a62) compared to base (39eaebf).
Patch coverage: 95.58% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7220      +/-   ##
==========================================
+ Coverage   88.73%   90.74%   +2.00%     
==========================================
  Files         325      325              
  Lines       44124    44181      +57     
  Branches     5867     5879      +12     
==========================================
+ Hits        39154    40090     +936     
+ Misses       4955     4076     -879     
  Partials       15       15              
Impacted Files Coverage Δ
datalad/support/gitrepo.py 92.57% <81.25%> (-0.21%) ⬇️
datalad/support/path.py 96.58% <100.00%> (+1.17%) ⬆️
datalad/support/tests/test_gitrepo.py 99.77% <100.00%> (+<0.01%) ⬆️
datalad/support/tests/test_path.py 98.70% <100.00%> (+0.45%) ⬆️
datalad/_version.py 45.68% <0.00%> (+0.27%) ⬆️
datalad/__init__.py 98.00% <0.00%> (+16.00%) ⬆️
datalad/tests/utils.py 52.75% <0.00%> (+52.75%) ⬆️
datalad/tests/test_tests_utils.py 92.89% <0.00%> (+92.89%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

changelog.d/pr-7220.md Outdated Show resolved Hide resolved
@yarikoptic
Copy link
Member Author

typing check fail seems to no longer have anything to do with type checking and rather with installation in the env

  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/packaging/_tokenizer.py", line 164, in raise_syntax_error
    span=span,
packaging._tokenizer.ParserSyntaxError: Expected closing RIGHT_PARENTHESIS
    sphinx-rtd-theme (>=0.5.1") ; extra == 'devel'
                     ~~~~~~~~^

since we have that typo fixed in 41d88cb in maint... @bpoldrack beat me with #7222 for the PR to merge in maint -- thanks!

@yarikoptic
Copy link
Member Author

green -- nice! I will rebase now that master has maint thanks to @bpoldrack

yarikoptic and others added 5 commits December 9, 2022 21:40
…ted_paths

Redone datalad#6974 and datalad#7211 which attempted to reuse get_parent_paths with tune ups
-- was not sufficient since need to have a bit ad-hoc limiting by the path
WITHIN submodule to select that submodule (which would not work for plain directory).

Overall -- now that datalad#7189 has finished the RFing to avoid some slow
operations at the cost of making some use cases slower, this should resolve it
back to avoid any slow operation.

datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis
says it was not possible, see datalad#6941 (comment)) and performance issues by
introducing similar logic directly in the code of get_modules_.

get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but
generally likely faster since likely to get them sorted to start with.

Timing this branch:

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+185.gd236b5a6d
    0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k
    824inputs+0outputs (1major+8149minor)pagefaults 0swaps
    1113

and master

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+183.g34a28a622
    4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k
    0inputs+0outputs (0major+8226minor)pagefaults 0swaps
    1113

and maint

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+82.g405ece550
    0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k
    0inputs+0outputs (0major+8633minor)pagefaults 0swaps
    1113

so we are no worse (if not better) than maint and definetely gain over master
for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint:

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+185.g8b21bfceb
	0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k
	104inputs+0outputs (0major+7652minor)pagefaults 0swaps
	1

	❯ git co master
	Switched to branch 'master'
	Your branch is up to date with 'origin/master'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+183.g34a28a622
	0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k
	0inputs+0outputs (0major+8203minor)pagefaults 0swaps
	1113

	❯ git co maint
	Switched to branch 'maint'
	Your branch is up to date with 'origin/maint'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+82.g405ece550
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_
		for path, props in self.get_content_info(
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info
		posix_paths = get_parent_paths(posix_paths, submodules)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths
		_get_parent_paths_check(path, sep, asep)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check
		raise ValueError("Expected relative within directory paths, got %r" % path)
	ValueError: Expected relative within directory paths, got '/'
	Command exited with non-zero status 1
	0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k
	0inputs+0outputs (0major+8107minor)pagefaults 0swaps
	0
The CI threw
def get_limited_paths(paths: list[str|Path], limits: list[str|Path], *, include_within_path: bool = False) -> list[Path]:
TypeError: unsupported operand type(s) for |: 'type' and 'type'
running build_cfginfo
datalad/support/path.py Outdated Show resolved Hide resolved
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK to me, although I do find the get_limited_paths function quite confusing (internal naming of variables as well as docstring).

Re function name:

How about filter_paths? After all, limits is a filter for paths to go through, no?

datalad/support/gitrepo.py Outdated Show resolved Hide resolved
datalad/support/gitrepo.py Outdated Show resolved Hide resolved
Co-authored-by: Benjamin Poldrack <bpoldrack@users.noreply.github.com>
@bpoldrack
Copy link
Member

@yarikoptic: Not sure what's going in with this breaking all builds (not happening in other PRs against master as far as I can tell), but the kind of failure (Value Error embedded null byte from Popen) seems to suggest, that we somehow manage to smuggle the \0 from here https://github.com/datalad/datalad/pull/7220/files#diff-bfc1c47d032fc72b15b6a79db3d49e4c4334cf96318c5efe48a3403744b32fb1L2417 into the paths returned and pass them on to another git call? Something like that?

not sure if it would be effective benchmark in this not so big
superdataset, but something at least
@yarikoptic
Copy link
Member Author

yeap yeap, thanks @bpoldrack -- force pushed the fix to apply [:-1] to rpath before reusing it (I was reusing once with that stripping and once without)

@yarikoptic
Copy link
Member Author

appveyor on mac seems to have difficulty installing git-annex
==> Fetching git-annex
==> Downloading https://hackage.haskell.org/package/git-annex-10.20221212/git-an
######################################################################## 100.0%
Warning: Your Xcode (12.3) is outdated.
Please update to Xcode 12.4 (or delete it).
Xcode can be updated from the App Store.
==> Installing dependencies for git-annex: docutils, pygments, python@3.10, sphinx-doc, ghc, cabal-install, libmagic and ghc@8.10
==> Installing git-annex dependency: docutils
==> Pouring docutils--0.19.catalina.bottle.1.tar.gz
🍺  /usr/local/Cellar/docutils/0.19: 494 files, 3.9MB
==> Installing git-annex dependency: pygments
==> Pouring pygments--2.13.0_1.all.bottle.tar.gz
🍺  /usr/local/Cellar/pygments/2.13.0_1: 905 files, 11.9MB
==> Installing git-annex dependency: python@3.10
==> Pouring python@3.10--3.10.8.catalina.bottle.1.tar.gz
==> /usr/local/Cellar/python@3.10/3.10.8/bin/python3.10 -m ensurepip
==> /usr/local/Cellar/python@3.10/3.10.8/bin/python3.10 -m pip install -v --no-deps --no-index --upgrade --isolated --target=/usr/local/lib/python3.10/site-packages /usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3
🍺  /usr/local/Cellar/python@3.10/3.10.8: 3,114 files, 56.8MB
==> Installing git-annex dependency: sphinx-doc
==> Pouring sphinx-doc--5.3.0.catalina.bottle.2.tar.gz
🍺  /usr/local/Cellar/sphinx-doc/5.3.0: 3,664 files, 55.8MB
==> Installing git-annex dependency: ghc
Warning: Your Xcode (12.3) is outdated.
Please update to Xcode 12.4 (or delete it).
Xcode can be updated from the App Store.
==> ./configure --prefix=/private/tmp/ghc-20221215-3586-s0g334/ghc-9.4.3/binary

I will restart those

@yarikoptic
Copy link
Member Author

@jwodder could you please look into how/why installation of git-annex fails on appveyor OSXes via datalad-installer -- e.g. https://ci.appveyor.com/project/mih/datalad/builds/45696242/job/h7a2890c5yyt5ekw -- if anything which could be done on our end?

@yarikoptic
Copy link
Member Author

@bpoldrack PR should be ready for re-review. appveyor gotchas are unrelated AFAIK

@jwodder
Copy link
Member

jwodder commented Dec 15, 2022

@yarikoptic The logs look like they might be truncated to me, as they end in the middle of a brew install. Do you see anything past line 191?

Regardless, note that the output from Homebrew includes:

Warning: You are using macOS 10.15.
We (and Apple) do not provide support for this old version.
It is expected behaviour that some formulae will fail to build in this old version.

EDIT: Actually, the "Messages" tab on that page contains:

Build execution time has reached the maximum allowed time for your plan (60 minutes).

which is likely what directly caused the build to fail.

@yarikoptic
Copy link
Member Author

Do you see anything past line 191?

I don't. So, overall, it seems that it just needs newer OSX. May be appveyor script should just install from the datalad/git-annex built .dmg's and not bother with brew (filed #7233 attn @mih).

@adswa
Copy link
Member

adswa commented Dec 22, 2022

So, overall, it seems that it just needs newer OSX.

maint already has them #7227

* origin/master:
  Switch macOS appveyor builds to Monterey
  Automated deployment to update contributors 2022-12-14
  BF: Don't change ui.backend of test process
  ENH(DX,TST): log at DEBUG about encountered error, unit-test that exception handling block
  RF: Move basic functionality to SpecialRemote
  BF: Don't overwrite AVAILABILITY
  Send ERROR to annex when special remote process fails
  Add Matthias Riße to .zenodo.json
  Update changelog.d/pr-7213.md
  [release-action] Autogenerate changelog snippet for PR 7213
  NF: make extra_remote_settings configurable in git config
  Don't be scary when failing to annex-enable
@codeclimate
Copy link

codeclimate bot commented Dec 22, 2022

Code Climate has analyzed commit 1634a62 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

View more on Code Climate.

@yarikoptic yarikoptic added this to the 0.18.0 milestone Dec 22, 2022
@bpoldrack
Copy link
Member

Ok, thank you @yarikoptic!

@bpoldrack bpoldrack merged commit 70f617e into datalad:master Dec 23, 2022
@yarikoptic yarikoptic deleted the bf-6940-enh-3 branch January 20, 2023 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-performance Changes only improve performance, no impact on version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants