-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPT: create-sibling do not try to OPT by limiting paths to installed submodules #6528
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…submodules Prior approach did not work well together well with --since specification, where subdatasets recursively would traverse entire tree for no need. And it was added in 8c673a4 RFing as an OPT to avoid expensive traversal of subdatasets which are not installed. But for those there should be no extra cost since we cannot descend into not installed subdatasets. But may be there is cost since we will be analyzing entire trees (I guess) and not just subdatasets, which is what matters for create-sibling -- so there could be more OPT I think
yarikoptic
added
the
semver-performance
Changes only improve performance, no impact on version
label
Mar 8, 2022
eh -- something odd with snapshot.debian.org causing fails on appveyor -- 503 error[202](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L202)2-03-08T22:02:33+0000 [INFO ] datalad_installer URL: http://snapshot.debian.org/archive/debian/20210906T[204](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L204)127Z/pool/main/g/git-annex/git-annex_8.20[210](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L210)903-1_amd64.deb
2022-03-08T22:02:33+0000 [INFO ] datalad_installer Extra args: None
2022-03-08T22:02:33+0000 [INFO ] datalad_installer Downloading http://snapshot.debian.org/archive/debian/20210906T204127Z/pool/main/g/git-annex/git-annex_8.20210903-1_amd64.deb
Traceback (most recent call last):
File "/home/appveyor/venv3.7.12/bin/datalad-installer", line 8, in <module>
sys.exit(main())
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1788, in main
return manager.main(argv)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 657, in main
self.addcomponent(name=cr.name, **cr.kwargs)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 690, in addcomponent
component(self).provide(**kwargs)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1032, in provide
bins = self.get_installer(method).install(self.NAME, **kwargs)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1118, in install
bindir = self.install_package(package, **kwargs)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1386, in install_package
download_file(url, debpath)
File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1722, in download_file
with urlopen(req) as r:
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line [222](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L222), in urlopen
return opener.open(url, data, timeout)
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Backend fetch failed
Command exited with code 1
Running "on_finish" scripts
while [ -f ~/BLOCK ]; do sleep 5; done |
yarikoptic
commented
Mar 9, 2022
yarikoptic
commented
Mar 9, 2022
yarikoptic
commented
Mar 9, 2022
yarikoptic
commented
Mar 9, 2022
to make it possible to efficiently report only diffs pertinent to datasets hierarchy while also working efficiently with "to" kwarg. We should not list all subdatasets recursively in "above" logic but rather instruct this diff_dataset logic to act only on subdatasets It seems to bring it back to the scale of "with subdatasets in logic above" but not quite exactly matching it (git)lena:~datalad/datalad-master[bf-rf-create-sibling]git $> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))' Counter({'dataset': 79}) python -c 3.77s user 1.27s system 110% cpu 4.547 total $> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False, datasets_only=True)]; print(Counter(r))' Counter({'dataset': 79}) python -c 4.82s user 1.69s system 111% cpu 5.827 total but I think it is ok since gains for other usecase (--since) are of larger impact with this PR, actually boosts it even further: (git)lena:~datalad/datalad-master[bf-rf-create-sibling]git $> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))' Counter({'dataset': 47, None: 1}) $> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False, datasets_only=True)]; print(Counter(r))' Counter({'dataset': 47, None: 1}) python -c 1.46s user 0.40s system 109% cpu 1.697 total
yarikoptic
force-pushed
the
bf-rf-create-sibling
branch
from
March 10, 2022 17:16
85c602b
to
11aa8d2
Compare
Code Climate has analyzed commit 11aa8d2 and detected 1 issue on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
This PR is ready for your review @datalad/developers . In case if there is no objections, I will merge it next monday |
yarikoptic
added a commit
to yarikoptic/datalad
that referenced
this pull request
Mar 30, 2022
…on committed Initial motivation for this change was fixing datalad#6596 where while specifying --since, create-sibling did not detect needing to create a sibling for a subdataset. A stale (missed during datalad#6528 OPT/RF) comment said that value of constant_refs=True should not matter since it was not recursive originally. But now here we do recurse and it does matter since we migth completely skip subtree where such refs are not yet known. Changing to False resulted in the test failing which made me realize that sibling was created even for not yet committed changes to submodules. Since for push it would really not work to push something which is not yet saved, I think the change in behavior for create-sibling is also desired. And it fixes datalad#6596. So I have "fixed" and expanded test with the problematic case I had discovered.
yarikoptic
added a commit
to yarikoptic/datalad
that referenced
this pull request
Mar 30, 2022
…on committed Initial motivation for this change was fixing datalad#6596 where while specifying --since, create-sibling did not detect needing to create a sibling for a subdataset. A stale (missed during datalad#6528 OPT/RF) comment said that value of constant_refs=True should not matter since it was not recursive originally. But now here we do recurse and it does matter since we migth completely skip subtree where such refs are not yet known. Changing to False resulted in the test failing which made me realize that sibling was created even for not yet committed changes to submodules. Since for push it would really not work to push something which is not yet saved, I think the change in behavior for create-sibling is also desired. And it fixes datalad#6596. So I have "fixed" and expanded test with the problematic case I had discovered.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Prior approach did not work well together well with --since specification, where
subdatasets recursively would traverse entire tree for no need. And it was added in
8c673a4 RFing as an OPT to avoid expensive traversal
of subdatasets which are not installed. But for those there should be no extra cost
since we cannot descend into not installed subdatasets. But may be there is cost since
we will be analyzing entire trees (I guess) and not just subdatasets, which is
what matters for create-sibling -- so there could be more OPT I think
Fixes #6527
edit1: nope -- as is it would not be good enough since
diff_dataset
would return full diff, not only for subdataset statesthat seems to be ok/OPT in case whenever there is `--since`, and we just need filter out for not clean datasets
but if there is no --since -- without limiting for subdatasets paths, we are spending too much to do full diff
so it would need more to actually become an OPT ...
Changelog
💫 Enhancements and new features
--since
specification since would consider only submodules related to the changes since that point. Fixes create-sibling --since: still traverses too much (full?) tree for no good reason #6527