Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement get_submodules_() without get_content_info() (Reincarnated 6942) #7189

Merged
merged 8 commits into from
Nov 28, 2022

Conversation

adswa
Copy link
Member

@adswa adswa commented Nov 18, 2022

A while ago, @mih proposed #6942. This effort was supposed to be super-seeded by #6974, which aimed to fix a short coming of #6942, namely an O(N²) performance hit in an edgecase (when one provides paths to all subdatasets to a subdataset call). That fix is currently in draft mode, as it unexpectedly required more work than anticipated. @yarikoptic therefore stated in #6942 before it was closed

@mih - I ran into need for more work on get_parent_paths since its goal is a bit different for matching paths within submodules not limiting some paths by other paths... needs a bit more work.
so I am ok with you solution, even if later to be improved later. But the tests I added in #6974 fails here and passes on master, so we would need to at least fix it up to not introduce regression.

As I can still see the performance boost #6942 brought, I wanted to reincarnate the PR in its last state.
This is with master and on this branch with the same dataset @mih tested in #6942:

# on master
(gooey) adina@muninn in /tmp/ukb_cat on git:master
❱ time datalad subdatasets                                     
subdataset(ok): code/pipeline (dataset)
datalad subdatasets  177.08s user 16.30s system 99% cpu 3:13.39 total
# on this branch
(gooey) adina@muninn in /tmp/ukb_cat on git:master
❱ rehash
(gooey) adina@muninn in /tmp/ukb_cat on git:master
❱ datalad --version
datalad 0.17.9+158.g087ac2612
(gooey) adina@muninn in /tmp/ukb_cat on git:master
❱ time datalad subdatasets
subdataset(ok): code/pipeline (dataset)
datalad subdatasets  0.28s user 0.06s system 100% cpu 0.336 total

I tested the tests that were failing before after merging master on my Debian system and on a Windows system, were they passed. Based on that, I'm slightly hopeful that our CI might turn green now as well 🤞

I do see the performance hit in the edge case @yarikoptic discovered, though, still. Here are my timings in the human-connectome-project openaccess dataset:

# on this branch
(gooey) adina@muninn in ~/repos/data/human-connectome-project-openaccess on git:master
❱ datalad --version; (builtin cd /home/adina/repos/data/human-connectome-project-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )

datalad 0.17.9+159.ge642f4159
1113
python -c   3.42s user 0.02s system 100% cpu 3.425 total
wc -l  0.00s user 0.00s system 0% cpu 3.425 total
# on master
(gooey) adina@muninn in ~/repos/data/human-connectome-project-openaccess on git:master
❱ datalad --version; (builtin cd /home/adina/repos/data/human-connectome-project-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )

datalad 0.17.9+153.g90cf7f374
1113
python -c   0.24s user 0.02s system 104% cpu 0.253 total
wc -l  0.00s user 0.00s system 0% cpu 0.252 total

Nevertheless, I feel like the change so far does speed up the standard case considerably. If I don't explicitly provide paths to subdatasets, the timing difference is negligible:

# on master
(gooey) adina@muninn in ~/repos/data/human-connectome-project-openaccess on git:master
❱ datalad --version; (builtin cd /home/adina/repos/data/human-connectome-project-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_())))" | wc -l; ) 

datalad 0.17.9+153.g90cf7f374
1113
python -c   0.17s user 0.03s system 102% cpu 0.187 total
wc -l  0.00s user 0.00s system 0% cpu 0.187 total
# on this branch
(gooey) adina@muninn in ~/repos/data/human-connectome-project-openaccess on git:master
❱ rehash
(gooey) adina@muninn in ~/repos/data/human-connectome-project-openaccess on git:master
❱ datalad --version; (builtin cd /home/adina/repos/data/human-connectome-project-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_())))" | wc -l; ) 

datalad 0.17.9+159.ge642f4159
1113
python -c   0.18s user 0.03s system 107% cpu 0.197 total
wc -l  0.00s user 0.00s system 0% cpu 0.196 total

Fixes #6940

mih and others added 5 commits August 15, 2022 17:01
The previous implementation resulted in a circular dependency after
a check for a Git-version dependent reporting behavior was introduced
with 8ff8613

Moreover, the previous implementation also queried a repository for
all content, merely to discard any record that is not a subdataset.
This could slow a responds dramatically for dataset with few subdatasets
but many files.

This change reimplements `get_submodules_()` with a plain call to
`git ls-files` parameterized with the paths of recorded submodules
taken from `.gitmodules`.

This changes the behavior to be more in-line with `git submodule`
by refusing to report on submodules that are not recorded in
.gitmodules.

Fixes datalad#6940
Fixes datalad#6941
Previously we ignored them in values as mandatory reasons for quoting.
This change expands the `quote_config()` helper to perform this
additional test.

Fixes datalad#6943
Both tests pass on master and fail here ATM.  The 2nd test, matching for absent s_unknown
result returns all submodules but even without gitmodule_name.
@adswa adswa added the semver-performance Changes only improve performance, no impact on version label Nov 18, 2022
@codecov
Copy link

codecov bot commented Nov 18, 2022

Codecov Report

Base: 88.63% // Head: 90.73% // Increases project coverage by +2.09% 🎉

Coverage data is based on head (6d40969) compared to base (90cf7f3).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7189      +/-   ##
==========================================
+ Coverage   88.63%   90.73%   +2.09%     
==========================================
  Files         325      325              
  Lines       44109    44125      +16     
  Branches     5863     5870       +7     
==========================================
+ Hits        39096    40036     +940     
+ Misses       4998     4074     -924     
  Partials       15       15              
Impacted Files Coverage Δ
datalad/support/gitrepo.py 92.70% <100.00%> (+<0.01%) ⬆️
datalad/support/tests/test_gitrepo.py 99.77% <100.00%> (+<0.01%) ⬆️
datalad/distribution/utils.py 94.00% <0.00%> (-6.00%) ⬇️
datalad/support/locking.py 94.87% <0.00%> (-3.85%) ⬇️
datalad/support/tests/test_locking.py 93.81% <0.00%> (-2.07%) ⬇️
datalad/local/wtf.py 75.83% <0.00%> (-1.68%) ⬇️
datalad/interface/utils.py 95.28% <0.00%> (-1.05%) ⬇️
datalad/tests/utils_pytest.py 89.42% <0.00%> (-0.50%) ⬇️
datalad/cli/parser.py 86.03% <0.00%> (-0.46%) ⬇️
datalad/support/annexrepo.py 90.33% <0.00%> (-0.09%) ⬇️
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@yarikoptic
Copy link
Member

if you just reincarnated it, odd that those tests I have added do not fail -- but that is a good sign ;-)

@yarikoptic yarikoptic added the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Nov 18, 2022
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Nov 18, 2022
@yarikoptic
Copy link
Member

Nevertheless, I feel like the change so far does speed up the standard case considerably.

I don't mind this PR since detrimental effect for one usage pattern seems to be less that the performance boost for another, but I want to avoid claiming something to be standard whenever there is none "standard". Using globs in command line is IMHO no less common than pointing to a subpath. So this PR will make datalad faster for some people and slow for other. And there is no "standard" way to use datalad (e.g. "not using globs is the standard way").

adswa and others added 2 commits November 21, 2022 08:00
- Fix a typo
- Clean up an unused import statement
- check if paths from .gitmodules and paths passed to the command are
  relative to eachother from both directions - because we might recurse
  deeper into a dataset hierarchy with a subdatasets call, the test
  test_get_subdatasets failed when we soley checked from one direction.
@adswa adswa force-pushed the reincarnate-bf-6940 branch from f81e6d7 to c478280 Compare November 21, 2022 07:10
@adswa
Copy link
Member Author

adswa commented Nov 21, 2022

I want to avoid claiming something to be standard whenever there is none "standard"

True, sorry - that was poorly phrased.

I squashed the few fixes I did on top of the old branch into a single commit, and tuned the changelog.

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @adswa!

I have not yet looked into the next failure really. Have you?

@adswa
Copy link
Member Author

adswa commented Nov 21, 2022

I have not yet looked into the next failure really. Have you?

I haven't as they looked completely unrelated to submodules:

FAILED constraints/tests/test_cmdarg_validation.py::test_cmd_with_validation - AssertionError: assert [{'http://exa.../ '/dev/null'}] == [{'http://exa.../'/dev/null')}]
  At index 0 diff: {'http://example.com/': 'some/dir/file'} != {'http://example.com/': PosixPath('some/dir/file')}
  Full diff:
    [
  -  {'http://example.com/': PosixPath('some/dir/file')},
  ?                         ----------               -
  +  {'http://example.com/': 'some/dir/file'},
  -  {'file:///dev/null': PosixPath('/dev/null')},
  ?                       ----------           -
  +  {'file:///dev/null': '/dev/null'},
    ]
FAILED tests/test_download.py::test_download - datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully. 1 failed:
[{'action': 'download',
  'error_message': '404 Client Error: File not found for url: '
                   'http://127.0.0.1:35849/testfile.txt%20testfile2.txt',
  'exception': 404 Client Error: File not found for url: http://127.0.0.1:35849/testfile.txt%20testfile2.txt [download.py:__call__:315,http_url_operations.py:download:74,http_url_operations.py:download:69,models.py:raise_for_status:1021],
  'exception_traceback': '[download.py:__call__:315,http_url_operations.py:download:74,http_url_operations.py:download:69,models.py:raise_for_status:1021]',
  'message': 'download failure',
  'path': PosixPath('/tmp/datalad_temp_test_download81zvewl0/testfile.txt testfile2.txt'),
  'status': 'error',
  'status_code': 404,
  'url': 'http://127.0.0.1:35849/testfile.txt testfile2.txt'}]

But I can check if download is somehow impacted. Edit: Upon a first quick glance at download I don't see how any code from this PR is executed - the command barely even touches datasets...
Edit 2: The failure is unrelated, and also surfaced in other PRs (such as #7176 just now)

@yarikoptic yarikoptic changed the title Reincarnate 6942 Reimplement get_submodules_() without get_content_info() (Reincarnated 6942) Nov 21, 2022
changelog.d/pr-7189.md Outdated Show resolved Hide resolved
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@codeclimate
Copy link

codeclimate bot commented Nov 21, 2022

Code Climate has analyzed commit 6d40969 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

View more on Code Climate.

@bpoldrack
Copy link
Member

Ok, datalad-next failure unrelated, @yarikoptic "doesn't mind" and it doesn't prevent any further work on #6974 - let's merge then.

Thanks for digging this up, @adswa.

@bpoldrack bpoldrack merged commit 8ca0eab into datalad:master Nov 28, 2022
@adswa adswa deleted the reincarnate-bf-6940 branch November 28, 2022 09:54
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Dec 1, 2022
…nt_paths

Redone datalad#6974 .  Now that
datalad#7189 has finished the RFing to avoid
some slow operations at the cost of making some use cases slower, this
should resolve it back to avoid any slow operation.

get_parent_paths function was originally created for get_content_info for
that specific reason (performance). datalad#6974 commits avoided a hypothetical
infinite recursion (my code analysis says it was not possible, see
datalad#6941 (comment)) and
performance issues by introducing similar logic directly in the code of
get_modules_.

This commit RFs that back to use get_parent_paths but also extends it to return
not parent paths but actual paths. So we reuse the same logic but manipulate
what is returned by the function. That allows us to subselect paths which are
under "parents".

It also adds rudimentary tests for invocation with paths limiting etc.
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Dec 1, 2022
…nt_paths

Redone datalad#6974 .  Now that
datalad#7189 has finished the RFing to avoid
some slow operations at the cost of making some use cases slower, this
should resolve it back to avoid any slow operation.

get_parent_paths function was originally created for get_content_info for
that specific reason (performance). datalad#6974 commits avoided a hypothetical
infinite recursion (my code analysis says it was not possible, see
datalad#6941 (comment)) and
performance issues by introducing similar logic directly in the code of
get_modules_.

This commit RFs that back to use get_parent_paths but also extends it to return
not parent paths but actual paths. So we reuse the same logic but manipulate
what is returned by the function. That allows us to subselect paths which are
under "parents".

It also adds rudimentary tests for invocation with paths limiting etc.
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Dec 8, 2022
…ted_paths

Redone datalad#6974 and datalad#7211 which attempted to reuse get_parent_paths with tune ups
-- was not sufficient since need to have a bit ad-hoc limiting by the path
WITHIN submodule to select that submodule (which would not work for plain directory).

Overall -- now that datalad#7189 has finished the RFing to avoid some slow
operations at the cost of making some use cases slower, this should resolve it
back to avoid any slow operation.

datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis
says it was not possible, see datalad#6941 (comment)) and performance issues by
introducing similar logic directly in the code of get_modules_.

get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but
generally likely faster since likely to get them sorted to start with.

Timing this branch:

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+185.gd236b5a6d
    0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k
    824inputs+0outputs (1major+8149minor)pagefaults 0swaps
    1113

and master

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+183.g34a28a622
    4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k
    0inputs+0outputs (0major+8226minor)pagefaults 0swaps
    1113

and maint

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+82.g405ece550
    0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k
    0inputs+0outputs (0major+8633minor)pagefaults 0swaps
    1113

so we are no worse (if not better) than maint and definetely gain over master
for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint:

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+185.g8b21bfceb
	0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k
	104inputs+0outputs (0major+7652minor)pagefaults 0swaps
	1

	❯ git co master
	Switched to branch 'master'
	Your branch is up to date with 'origin/master'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+183.g34a28a622
	0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k
	0inputs+0outputs (0major+8203minor)pagefaults 0swaps
	1113

	❯ git co maint
	Switched to branch 'maint'
	Your branch is up to date with 'origin/maint'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+82.g405ece550
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_
		for path, props in self.get_content_info(
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info
		posix_paths = get_parent_paths(posix_paths, submodules)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths
		_get_parent_paths_check(path, sep, asep)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check
		raise ValueError("Expected relative within directory paths, got %r" % path)
	ValueError: Expected relative within directory paths, got '/'
	Command exited with non-zero status 1
	0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k
	0inputs+0outputs (0major+8107minor)pagefaults 0swaps
	0
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Dec 10, 2022
…ted_paths

Redone datalad#6974 and datalad#7211 which attempted to reuse get_parent_paths with tune ups
-- was not sufficient since need to have a bit ad-hoc limiting by the path
WITHIN submodule to select that submodule (which would not work for plain directory).

Overall -- now that datalad#7189 has finished the RFing to avoid some slow
operations at the cost of making some use cases slower, this should resolve it
back to avoid any slow operation.

datalad#6974 commits avoided a hypothetical infinite recursion (my code analysis
says it was not possible, see datalad#6941 (comment)) and performance issues by
introducing similar logic directly in the code of get_modules_.

get_limited_paths relies on sorting of the paths, should be O(N*log(N)) but
generally likely faster since likely to get them sorted to start with.

Timing this branch:

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+185.gd236b5a6d
    0.79user 0.14system 0:00.89elapsed 105%CPU (0avgtext+0avgdata 28896maxresident)k
    824inputs+0outputs (1major+8149minor)pagefaults 0swaps
    1113

and master

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+183.g34a28a622
    4.29user 0.11system 0:04.39elapsed 100%CPU (0avgtext+0avgdata 29092maxresident)k
    0inputs+0outputs (0major+8226minor)pagefaults 0swaps
    1113

and maint

    ❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
    datalad 0.17.9+82.g405ece550
    0.97user 0.12system 0:01.05elapsed 104%CPU (0avgtext+0avgdata 30248maxresident)k
    0inputs+0outputs (0major+8633minor)pagefaults 0swaps
    1113

so we are no worse (if not better) than maint and definetely gain over master
for such use cases of N submodules/N paths given. And for a single path we might even be not only faster but fixing some bug in maint:

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+185.g8b21bfceb
	0.65user 0.11system 0:00.75elapsed 101%CPU (0avgtext+0avgdata 28040maxresident)k
	104inputs+0outputs (0major+7652minor)pagefaults 0swaps
	1

	❯ git co master
	Switched to branch 'master'
	Your branch is up to date with 'origin/master'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+183.g34a28a622
	0.75user 0.10system 0:00.83elapsed 102%CPU (0avgtext+0avgdata 28816maxresident)k
	0inputs+0outputs (0major+8203minor)pagefaults 0swaps
	1113

	❯ git co maint
	Switched to branch 'maint'
	Your branch is up to date with 'origin/maint'.

	❯ datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; /bin/time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')[1]))))" | wc -l; )
	datalad 0.17.9+82.g405ece550
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2388, in get_submodules_
		for path, props in self.get_content_info(
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2769, in get_content_info
		posix_paths = get_parent_paths(posix_paths, submodules)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 191, in get_parent_paths
		_get_parent_paths_check(path, sep, asep)
	  File "/home/yoh/proj/datalad/datalad-master/datalad/support/path.py", line 213, in _get_parent_paths_check
		raise ValueError("Expected relative within directory paths, got %r" % path)
	ValueError: Expected relative within directory paths, got '/'
	Command exited with non-zero status 1
	0.39user 0.04system 0:00.43elapsed 101%CPU (0avgtext+0avgdata 29832maxresident)k
	0inputs+0outputs (0major+8107minor)pagefaults 0swaps
	0
@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.18.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-performance Changes only improve performance, no impact on version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants